2022-09-11
Cilium VXLAN Cross-Node Communication
Test prerequisites
Because the monitor output of cilium 1.10.6 is missing the HOST NS encapsulation stage, we use cilium 1.11.4 for the packet-capture analysis.
For the deployment steps, refer to "Cilium v1.10.6 installation and deployment"; just pick 1.11.4 when running helm pull, everything else stays the same.
cilium 1.10.6 packet format
cilium 1.11.4 packet format
Characteristics of cross-node communication
When analyzing communication between pods on different nodes, the CNIs we are already familiar with (Calico, Flannel) can be understood very clearly with ordinary networking knowledge: routing tables, FDB tables, ARP tables, and so on. With Cilium, however, that line of analysis "breaks down". The reason is that Cilium's CNI implementation uses eBPF to give the datapath "jump-style" forwarding.
We therefore need Cilium's own tools to assist the analysis:
cilium monitor -vv / pwru / iptables TRACE / tcpdump
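For reference, a minimal hedged sketch of the iptables TRACE approach mentioned above (not used in the rest of this article; assumes you run it on the node and clean the rule up afterwards, and that the ICMP filter below matches pod1's address):

# add a TRACE rule for ICMP from pod1 in the raw table on node-1
iptables -t raw -I PREROUTING -p icmp -s 10.0.0.222 -j TRACE
# with the legacy backend the trace shows up in the kernel log;
# with the nftables backend use: xtables-monitor --trace
# remove the rule when done
iptables -t raw -D PREROUTING -p icmp -s 10.0.0.222 -j TRACE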
As shown in the packet path diagram below:
on the pod1 side we capture on its lxc interface, on cilium_vxlan, and on ens33;
on the pod2 side we capture on its lxc interface and on cilium_vxlan.
tcpdump
Determine where the pods are scheduled
node-1: 10.0.0.222 (referred to below as pod1)
node-2: 10.0.1.208 (referred to below as pod2)
root@master:~# kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE               NOMINATED NODE   READINESS GATES
cni-test-76d79dfb85-28bpq   1/1     Running   0          19m   10.0.0.222   node-1.whale.com
pod1's corresponding host interface is lxc91ffd83cbb3e
root@master:~# kubectl exec -it cni-test-76d79dfb85-28bpq -- ethtool -S eth0
NIC statistics:
     peer_ifindex: 30

# on node-1
root@node-1:~# ip link show | grep ^30
30: lxc91ffd83cbb3e@if29:
pod2's corresponding host interface is lxc978830fe1a23
root@master:~# kubectl exec -it cni-test-76d79dfb85-tjhdp -- ethtool -S eth0
NIC statistics:
     peer_ifindex: 22

# on node-2
root@node-2:~# ip link show | grep ^22
22: lxc978830fe1a23@if21:
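The two steps above can be glued together with a small helper. This is a hedged sketch, not part of the original article; it assumes the pod image ships ethtool, and the last command has to be run on the node that hosts the pod:

POD=cni-test-76d79dfb85-28bpq
# peer_ifindex of the pod's eth0 equals the ifindex of the host-side lxc interface
IDX=$(kubectl exec "$POD" -- ethtool -S eth0 | awk '/peer_ifindex/{print $2}')
# run on the node hosting the pod: resolve that ifindex to the lxc device name
ip -o link show | awk -F': ' -v idx="$IDX" '$1 == idx {print $2}'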
Capture on node-1
tcpdump -pne -i lxc91ffd83cbb3e -w lxc_pod1.cap
tcpdump -pne -i cilium_vxlan -w cilium_vxlan_pod1.cap
tcpdump -pne -i ens33 -w ens33_pod1.cap
Capture on node-2
tcpdump -pne -i lxc978830fe1a23 -w lxc_pod2.cap
tcpdump -pne -i cilium_vxlan -w cilium_vxlan_pod2.cap
ping test
kubectl exec -it cni-test-76d79dfb85-28bpq -- ping -c 1 10.0.1.208
lxc_pod1.cap
cilium_vxlan_pod1.cap
ens33_pod1.cap
lxc_pod2.cap
cilium_vxlan_pod2.cap
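Besides opening the .cap files in Wireshark, a hedged way to inspect them offline from the shell (file names as created above; 8472 is the VXLAN UDP port seen in the captures):

tcpdump -nne -r lxc_pod1.cap icmp             # plain ICMP as it leaves pod1's veth
tcpdump -nne -r ens33_pod1.cap udp port 8472  # the same ICMP, wrapped in the outer VXLAN/UDP header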
The captures above verify the following scenario:
For cross-node communication, the packet built inside the pod is encapsulated once more by VXLAN in the HOST NS (on the host), sent out of the host's physical NIC to the peer host, decapsulated by VXLAN there, and then redirected straight into the destination pod without passing through that pod's lxc interface. This confirms the communication flow shown in the figure below.
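As a quick cross-check of this claim (a hedged sketch, not from the original article): if the echo request really bypasses pod2's lxc interface, it should not appear in lxc_pod2.cap, while the reply leaving the pod through the veth pair still should:

tcpdump -nne -r lxc_pod2.cap icmp           # expect: echo reply only, no echo request
tcpdump -nne -r cilium_vxlan_pod2.cap icmp  # expect: both request and reply, already decapsulated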
cilium monitor
Determine where the pods are scheduled
node-1: 10.0.0.222 (referred to below as pod1)
node-2: 10.0.1.208 (referred to below as pod2)
root@master:~# kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE               NOMINATED NODE   READINESS GATES
cni-test-76d79dfb85-28bpq   1/1     Running   0          19m   10.0.0.222   node-1.whale.com
In the cilium pod running on each node, look at the endpoint and interface information.
The circled entries are the pod's eth0 interface (mac=) and its corresponding host-side lxc interface (nodemac=).
# node-1
root@master:~# kubectl -n kube-system exec -it cilium-xnmfw -- cilium bpf endpoint list
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), clean-cilium-state (init)
IP ADDRESS        LOCAL ENDPOINT INFO
10.0.0.112:0      id=415  flags=0x0000 ifindex=24 mac=EA:CF:FE:E8:E7:26 nodemac=BE:12:EB:4E:E9:30
192.168.0.120:0   (localhost)
10.0.0.215:0      (localhost)
10.0.0.222:0      id=2164 flags=0x0000 ifindex=30 mac=32:30:9C:CA:09:8E nodemac=2E:3C:E3:61:26:45

# node-2
root@master:~# kubectl -n kube-system exec -it cilium-jztvj -- cilium bpf endpoint list
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), clean-cilium-state (init)
IP ADDRESS        LOCAL ENDPOINT INFO
10.0.1.208:0      id=969  flags=0x0000 ifindex=22 mac=DA:97:53:7E:9A:CA nodemac=62:57:5C:C9:D6:0C
192.168.0.130:0   (localhost)
10.0.1.249:0      id=2940 flags=0x0000 ifindex=16 mac=02:55:31:EC:28:60 nodemac=32:FD:46:2F:CB:8A
10.0.1.10:0       (localhost)
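A hedged cross-check (my own assumption, not stated in the article): the nodemac shown by cilium bpf endpoint list should be the MAC of the host-side lxc interface, while mac is the pod's eth0 MAC. On node-1:

root@node-1:~# ip -br link show lxc91ffd83cbb3e
# expect the MAC to match nodemac=2E:3C:E3:61:26:45 (case-insensitive)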
Capture on node-1
Use pod1 to ping pod2, and at the same time run cilium monitor in the cilium pod on pod1's node.
root@master:~# kubectl exec -it cni-test-76d79dfb85-28bpq -- ping -c 1 10.0.1.208
PING 10.0.1.208 (10.0.1.208): 56 data bytes
64 bytes from 10.0.1.208: seq=0 ttl=63 time=0.790 ms

--- 10.0.1.208 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.790/0.790/0.790 ms
cilium monitor capture analysis
root@master:~# kubectl -n kube-system exec -it cilium-xnmfw -- cilium monitor -vv >monitor.yaml
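If redirecting the full -vv stream to a file is too noisy, a hedged alternative (assuming the --related-to flag is available in this cilium version) is to filter by endpoint id:

# only show events related to pod1's endpoint (id 2164 from `cilium bpf endpoint list`)
kubectl -n kube-system exec -it cilium-xnmfw -- cilium monitor -vv --related-to 2164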
Key excerpts
The first Conntrack lookup is the conntrack ("iptables-like") processing of the traffic as it leaves the pod.
node 3232235650 (0xc0a80082)
0x indicates hexadecimal: c0-a8-00-82 --> 192.168.0.130. This is the IP address written in hex, i.e. the address of the peer node.
------------------------------------------------------------------------------
CPU 02: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 1/2: src=10.0.0.222:4864 dst=10.0.1.208:0
CPU 02: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
CPU 02: MARK 0x0 FROM 2164 DEBUG: CT verdict: New, revnat=0
CPU 02: MARK 0x0 FROM 2164 DEBUG: Successfully mapped addr=10.0.1.208 to identity=3352
CPU 02: MARK 0x0 FROM 2164 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=3352 lb=0.0.0.0
CPU 02: MARK 0x0 FROM 2164 DEBUG: Encapsulating to node 3232235650 (0xc0a80082) from seclabel 3352
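A quick way to double-check the hex-to-IP conversion from the "Encapsulating to node 3232235650 (0xc0a80082)" line above (a small sketch, not from the article):

printf '%d.%d.%d.%d\n' 0xc0 0xa8 0x00 0x82   # -> 192.168.0.130, node-2's IP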
The second Conntrack lookup is the processing logic in the HOST NS ("iptables" on the host); note that both the source and destination addresses are now the hosts' IP addresses.
------------------------------------------------------------------------------
CPU 02: MARK 0x0 FROM 3 DEBUG: Conntrack lookup 1/2: src=192.168.0.120:56435 dst=192.168.0.130:8472
CPU 02: MARK 0x0 FROM 3 DEBUG: Conntrack lookup 2/2: nexthdr=17 flags=1
CPU 02: MARK 0x0 FROM 3 DEBUG: CT entry found lifetime=16823678, revnat=0
CPU 02: MARK 0x0 FROM 3 DEBUG: CT verdict: Established, revnat=0
The third Conntrack lookup is for the returning reply packet, and here there is no HOST NS processing stage.
Refer to the figure below: we can see that the NodePort / Remote Endpoint path has the ability to redirect directly.
CPU 03: MARK 0x0 FROM 0 DEBUG: Tunnel decap: id=3352 flowlabel=0
CPU 03: MARK 0x0 FROM 0 DEBUG: Attempting local delivery for container id 2164 from seclabel 3352
CPU 03: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 1/2: src=10.0.1.208:0 dst=10.0.0.222:4864
CPU 03: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
CPU 03: MARK 0x0 FROM 2164 DEBUG: CT entry found lifetime=16823776, revnat=0
CPU 03: MARK 0x0 FROM 2164 DEBUG: CT verdict: Reply, revnat=0
Capture on node-2
Run cilium monitor in the cilium pod on node-2
root@master:~# kubectl -n kube-system exec -it cilium-jztvj -- cilium monitor -vv >monitor2.yaml
Key excerpts from the output
Here we can see the packet being decapped as it arrives on the cilium_vxlan interface; the first Conntrack lookup that follows is the conntrack ("iptables-like") processing performed directly in the destination pod's context.
CPU 03: MARK 0x0 FROM 0 DEBUG: Tunnel decap: id=3352 flowlabel=0
CPU 03: MARK 0x0 FROM 0 DEBUG: Attempting local delivery for container id 969 from seclabel 3352
The packet is then delivered from cilium_vxlan to the eth0 interface of the pod on node-2 (this can also be confirmed with the tcpdump captures).
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 1/2: src=10.0.0.222:7936 dst=10.0.1.208:0
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
CPU 03: MARK 0x0 FROM 969 DEBUG: CT verdict: New, revnat=0
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=3352 lb=0.0.0.0
Here the packet goes through conntrack processing again, but now src and dst are swapped: this is the ICMP reply, which then has to be sent to the vxlan interface.
CPU 03: MARK 0x0 FROM 969 from-endpoint: 98 bytes (98 captured), state new, , identity 3352->unknown, orig-ip 0.0.0.0
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 1/2: src=10.0.1.208:0 dst=10.0.0.222:7936
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
CPU 03: MARK 0x0 FROM 969 DEBUG: CT entry found lifetime=16826421, revnat=0
CPU 03: MARK 0x0 FROM 969 DEBUG: CT verdict: Reply, revnat=0
CPU 03: MARK 0x0 FROM 969 DEBUG: Successfully mapped addr=10.0.0.222 to identity=3352
CPU 03: MARK 0x0 FROM 969 DEBUG: Encapsulating to node 3232235640 (0xc0a80078) from seclabel 3352
------------------------------------------------------------------------------
Here we can see that the returning ICMP reply also goes through the HOST NS encapsulation.
Ethernet {Contents=[..14..] Payload=[..86..] SrcMAC=da:97:53:7e:9a:ca DstMAC=62:57:5c:c9:d6:0c EthernetType=IPv4 Length=0}
IPv4 {Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=29557 Flags= FragOffset=0 TTL=64 Protocol=ICMPv4 Checksum=61574 SrcIP=10.0.1.208 DstIP=10.0.0.222 Options=[] Padding=[]}
ICMPv4 {Contents=[..8..] Payload=[..56..] TypeCode=EchoReply Checksum=19037 Id=7936 Seq=0}
  Failed to decode layer: No decoder for layer type Payload
CPU 03: MARK 0x0 FROM 969 to-overlay: 98 bytes (98 captured), state new, interface cilium_vxlan, , identity 3352->unknown, orig-ip 0.0.0.0
CPU 03: MARK 0x0 FROM 385 DEBUG: Conntrack lookup 1/2: src=192.168.0.130:50730 dst=192.168.0.120:8472
CPU 03: MARK 0x0 FROM 385 DEBUG: Conntrack lookup 2/2: nexthdr=17 flags=1
CPU 03: MARK 0x0 FROM 385 DEBUG: CT verdict: New, revnat=0
CPU 03: MARK 0x0 FROM 385 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=0 lb=0.0.0.0
cilium map
Cilium maintains several BPF maps of its own to work out the peer address.
If the cilium container does not have the jq command, just apt install jq.
root@node-1:/home/cilium# cilium map get cilium_tunnel_map -o json | jq
{
  "cache": [
    { "desired-action": "sync", "key": "10.0.1.0:0", "value": "192.168.0.130:0" },
    { "desired-action": "sync", "key": "10.0.2.0:0", "value": "192.168.0.110:0" }
  ],
  "path": "/sys/fs/bpf/tc/globals/cilium_tunnel_map"
}
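The same data can also be read straight from the pinned BPF map. This is a hedged alternative, assuming bpftool is available in the cilium container or on the node:

# keys/values are the raw bytes of the pod CIDR and the node IP
bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_tunnel_map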
Pod IP addresses are maintained in cilium_ipcache:
root@node-1:/home/cilium# cilium map get cilium_ipcache -o json | jq
{
  "cache": [
    { "desired-action": "sync", "key": "10.0.2.166/32",    "value": "53142 0 192.168.0.110" },
    { "desired-action": "sync", "key": "10.0.1.249/32",    "value": "4 0 192.168.0.130" },
    { "desired-action": "sync", "key": "10.0.1.208/32",    "value": "3352 0 192.168.0.130" },
    { "desired-action": "sync", "key": "10.0.0.215/32",    "value": "1 0 0.0.0.0" },
    { "desired-action": "sync", "key": "10.0.0.112/32",    "value": "4 0 0.0.0.0" },
    { "desired-action": "sync", "key": "192.168.0.110/32", "value": "7 0 0.0.0.0" },
    { "desired-action": "sync", "key": "10.0.2.185/32",    "value": "4 0 192.168.0.110" },
    { "desired-action": "sync", "key": "192.168.0.130/32", "value": "6 0 0.0.0.0" },
    { "desired-action": "sync", "key": "10.0.1.10/32",     "value": "6 0 192.168.0.130" },
    { "desired-action": "sync", "key": "10.0.2.233/32",    "value": "6 0 192.168.0.110" },
    { "desired-action": "sync", "key": "0.0.0.0/0",        "value": "2 0 0.0.0.0" },
    { "desired-action": "sync", "key": "192.168.0.120/32", "value": "1 0 0.0.0.0" },
    { "desired-action": "sync", "key": "10.0.2.45/32",     "value": "53142 0 192.168.0.110" },
    { "desired-action": "sync", "key": "10.0.0.222/32",    "value": "3352 0 0.0.0.0" }
  ],
  "path": "/sys/fs/bpf/tc/globals/cilium_ipcache"
}
Take 10.0.1.208 as an example.
The 3352 in the value is the IDENTITY, which can be confirmed with cilium endpoint list.
{ "desired-action": "sync", "key": "10.0.1.208/32", "value": "3352 0 192.168.0.130" }
IDENTITY is 3352
root@node-1:/home/cilium# cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                               IPv6   IPv4         STATUS
           ENFORCEMENT        ENFORCEMENT
3          Disabled           Disabled          1          reserved:host                                                                                 ready
415        Disabled           Disabled          4          reserved:health                                                                  10.0.0.112   ready
2164       Disabled           Disabled          3352       k8s:app=cni-test                                                                 10.0.0.222   ready
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
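As a hedged follow-up (assuming the cilium CLI in the agent pod), the identity can also be resolved back to its labels:

root@node-1:/home/cilium# cilium identity get 3352
# expect the k8s:app=cni-test label set shown in the endpoint list above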
Conclusion
By maintaining these maps, cilium can determine the mapping between pod addresses and node addresses, together with the endpoint table it maintains itself; on the peer node the packet is then redirected straight into the pod for processing.
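Putting it together, a hedged recap of the lookups for a packet from pod1 (10.0.0.222) to pod2 (10.0.1.208), using only the commands already shown above:

cilium map get cilium_ipcache    | grep 10.0.1.208   # -> identity 3352, tunnel endpoint 192.168.0.130
cilium map get cilium_tunnel_map | grep 10.0.1.0     # -> pod CIDR 10.0.1.0 is reachable via 192.168.0.130
# on node-2, after VXLAN decap:
cilium bpf endpoint list                             # 10.0.1.208 -> id=969 ifindex=22, used for the final redirect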