Cilium VXLAN Cross-Node Communication Process



Test prerequisites

Because the monitor output in cilium 1.10.6 is missing the HOST NS encapsulation stage, we use cilium 1.11.4 for the packet-capture analysis.

For the detailed deployment steps, refer to "Cilium v1.10.6 安装部署"; just choose 1.11.4 when running helm pull, the rest of the changes are the same.

cilium 1.10.6 packet format (figure)

cilium 1.11.4 packet format (figure)

Characteristics of cross-node communication

When analyzing Pod-to-Pod communication across nodes with the CNIs we already know (Calico, Flannel), routing tables, FDB tables, ARP tables and other standard networking knowledge are enough to explain everything clearly. With Cilium, however, this way of analyzing "breaks down". The reason is that Cilium's CNI implementation, combined with eBPF, gives the datapath a "hop-over" style of forwarding.

We need to combine the tools provided by Cilium with some standard ones to assist the analysis:

cilium monitor -vv
pwru
iptables TRACE
tcpdump

The packet path is shown in the figure below.

On the pod1 side we need to capture on its lxc interface, on cilium_vxlan, and on ens33.

On the pod2 side we need to capture on its lxc interface and on cilium_vxlan.

tcpdump

Determine the pod distribution

node-1: 10.0.0.222 (referred to below as pod1)

node-2: 10.0.1.208 (referred to below as pod2)

root@master:~# kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE               NOMINATED NODE   READINESS GATES
cni-test-76d79dfb85-28bpq   1/1     Running   0          19m   10.0.0.222   node-1.whale.com
cni-test-76d79dfb85-tjhdp   1/1     Running   0          19m   10.0.1.208   node-2.whale.com

pod1's host-side interface is lxc91ffd83cbb3e

root@master:# kubectl exec -it cni-test-76d79dfb85-28bpq -- ethtool -S eth0
NIC statistics:
     peer_ifindex: 30

# on node-1
root@node-1:# ip link show | grep ^30
30: lxc91ffd83cbb3e@if29: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

pod2's host-side interface is lxc978830fe1a23

root@master:# kubectl exec -it cni-test-76d79dfb85-tjhdp -- ethtool -S eth0
NIC statistics:
     peer_ifindex: 22

# on node-2
root@node-2:# ip link show | grep ^22
22: lxc978830fe1a23@if21: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
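The peer_ifindex lookup can also be scripted. The sketch below is illustrative only: it assumes kubectl access from the master, ethtool available inside the pod, and that the printed index is carried over to the node where the pod is scheduled.

# Step 1 (on master): read the pod's eth0 peer index (POD is a placeholder).
POD=cni-test-76d79dfb85-28bpq
kubectl exec "$POD" -- ethtool -S eth0 | awk '/peer_ifindex/ {print $2}'

# Step 2 (on the pod's node): resolve that index (here 30) to the host-side interface name.
ip -o link show | awk -F': ' '$1 == 30 {print $2}'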

Packet capture on node-1

tcpdump -pne -i lxc91ffd83cbb3e -w lxc_pod1.cap
tcpdump -pne -i cilium_vxlan -w cilium_vxlan_pod1.cap
tcpdump -pne -i ens33 -w ens33_pod1.cap
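Optionally, the capture on ens33 can be narrowed to the VXLAN traffic only. This is a small sketch assuming the tunnel runs on UDP port 8472 (the port seen later in the monitor output); the output file name is illustrative:

tcpdump -pne -i ens33 -w ens33_vxlan_only.cap udp port 8472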

Packet capture on node-2

tcpdump -pne -i lxc978830fe1a23 -w lxc_pod2.cap
tcpdump -pne -i cilium_vxlan -w cilium_vxlan_pod2.cap

Ping test

kubectl exec -it cni-test-76d79dfb85-28bpq -- ping -c 1 10.0.1.208

lxc_pod1.cap

cilium_vxlan_pod1.cap

ens33_pod1.cap

lxc_pod2.cap

cilium_vxlan_pod2.cap

The captures above verify the following scenario:

During cross-node communication, the packet is first assembled inside the pod, then encapsulated by the VXLAN device in the HOST NS (on the host) and sent out through the host's physical NIC to the peer host. After VXLAN decapsulation on the peer side, the packet is redirected straight into the destination pod without passing through that pod's lxc interface. This confirms the communication flow shown in the diagram below.
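Both layers of the encapsulation can also be read straight from the ens33 capture with tshark. A sketch, assuming tshark is installed; with the VXLAN decode forced onto UDP port 8472, the ip.src/ip.dst fields print the outer (node) and inner (pod) addresses side by side:

tshark -r ens33_pod1.cap -d udp.port==8472,vxlan -Y icmp -T fields -e ip.src -e ip.dst -e icmp.type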

cilium monitor

Determine the pod distribution

node-1: 10.0.0.222 (referred to below as pod1)

node-2: 10.0.1.208 (referred to below as pod2)

root@master:~# kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE               NOMINATED NODE   READINESS GATES
cni-test-76d79dfb85-28bpq   1/1     Running   0          19m   10.0.0.222   node-1.whale.com
cni-test-76d79dfb85-tjhdp   1/1     Running   0          19m   10.0.1.208   node-2.whale.com

In the cilium pod running on each node, check the endpoint information.

The circled entries (in the screenshot) are each pod's eth0 interface (mac=) and its corresponding host-side lxc interface (nodemac=).

# node-1
root@master:# kubectl -n kube-system exec -it cilium-xnmfw -- cilium bpf endpoint list
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), clean-cilium-state (init)
IP ADDRESS        LOCAL ENDPOINT INFO
10.0.0.112:0      id=415  flags=0x0000 ifindex=24 mac=EA:CF:FE:E8:E7:26 nodemac=BE:12:EB:4E:E9:30
192.168.0.120:0   (localhost)
10.0.0.215:0      (localhost)
10.0.0.222:0      id=2164 flags=0x0000 ifindex=30 mac=32:30:9C:CA:09:8E nodemac=2E:3C:E3:61:26:45

# node-2
root@master:# kubectl -n kube-system exec -it cilium-jztvj -- cilium bpf endpoint list
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), clean-cilium-state (init)
IP ADDRESS        LOCAL ENDPOINT INFO
10.0.1.208:0      id=969  flags=0x0000 ifindex=22 mac=DA:97:53:7E:9A:CA nodemac=62:57:5C:C9:D6:0C
192.168.0.130:0   (localhost)
10.0.1.249:0      id=2940 flags=0x0000 ifindex=16 mac=02:55:31:EC:28:60 nodemac=32:FD:46:2F:CB:8A
10.0.1.10:0       (localhost)
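The cilium agent pod names used above (cilium-xnmfw, cilium-jztvj) can be listed per node with a label selector; k8s-app=cilium is the label normally carried by the cilium DaemonSet:

kubectl -n kube-system get pod -l k8s-app=cilium -o wide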

Capture on node1

Ping pod2 from pod1, while capturing with cilium monitor in the cilium pod on pod1's node.

root@master:~# kubectl exec -it cni-test-76d79dfb85-28bpq -- ping -c 1 10.0.1.208
PING 10.0.1.208 (10.0.1.208): 56 data bytes
64 bytes from 10.0.1.208: seq=0 ttl=63 time=0.790 ms

--- 10.0.1.208 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.790/0.790/0.790 ms

cilium monitor capture analysis

root@master:~# kubectl -n kube-system exec -it cilium-xnmfw -- cilium monitor -vv >monitor.yaml

Key parts

The first Conntrack lookup refers to the traffic coming out of the pod going through the iptables (conntrack) stage.

node 3232235650 (0xc0a80082)

0x denotes hexadecimal: c0-a8-00-82 --> 192.168.0.130. This is the IP address written in hex, i.e. the IP address of the peer node.

------------------------------------------------------------------------------
CPU 02: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 1/2: src=10.0.0.222:4864 dst=10.0.1.208:0
CPU 02: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
CPU 02: MARK 0x0 FROM 2164 DEBUG: CT verdict: New, revnat=0
CPU 02: MARK 0x0 FROM 2164 DEBUG: Successfully mapped addr=10.0.1.208 to identity=3352
CPU 02: MARK 0x0 FROM 2164 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=3352 lb=0.0.0.0
CPU 02: MARK 0x0 FROM 2164 DEBUG: Encapsulating to node 3232235650 (0xc0a80082) from seclabel 3352
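As a quick sanity check, the hex node id in the last line can be decoded back to a dotted-quad address in the shell. A minimal sketch using printf's hex support:

# 0xc0a80082 split into bytes: c0 a8 00 82
printf '%d.%d.%d.%d\n' 0xc0 0xa8 0x00 0x82    # prints 192.168.0.130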

The second Conntrack lookup is the iptables processing logic in the HOST NS; we can see that both the source and destination addresses are the hosts' IP addresses.

------------------------------------------------------------------------------
CPU 02: MARK 0x0 FROM 3 DEBUG: Conntrack lookup 1/2: src=192.168.0.120:56435 dst=192.168.0.130:8472
CPU 02: MARK 0x0 FROM 3 DEBUG: Conntrack lookup 2/2: nexthdr=17 flags=1
CPU 02: MARK 0x0 FROM 3 DEBUG: CT entry found lifetime=16823678, revnat=0
CPU 02: MARK 0x0 FROM 3 DEBUG: CT verdict: Established, revnat=0
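The corresponding flow entries can also be inspected in Cilium's BPF conntrack table. A hedged sketch (run inside the cilium-agent pod; whether the tunnel flow appears there depends on the datapath configuration):

cilium bpf ct list global | grep 8472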

The third Conntrack lookup is the returning reply packet, but there is no HOST NS processing for it.

Refer to the figure below: we can see that in the NodePort / Remote Endpoint case the datapath has the ability to redirect.

CPU 03: MARK 0x0 FROM 0 DEBUG: Tunnel decap: id=3352 flowlabel=0
CPU 03: MARK 0x0 FROM 0 DEBUG: Attempting local delivery for container id 2164 from seclabel 3352
CPU 03: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 1/2: src=10.0.1.208:0 dst=10.0.0.222:4864
CPU 03: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
CPU 03: MARK 0x0 FROM 2164 DEBUG: CT entry found lifetime=16823776, revnat=0
CPU 03: MARK 0x0 FROM 2164 DEBUG: CT verdict: Reply, revnat=0

Capture on node2

Exec into the cilium pod on node2 and capture with cilium monitor.

root@master:~# kubectl -n kube-system exec -it cilium-jztvj -- cilium monitor -vv >monitor2.yaml

Key parts of the output

Here we can see the data being decapsulated as it arrives on the cilium_vxlan interface; the first Conntrack lookup is the iptables-style processing done directly for the destination pod.

CPU 03: MARK 0x0 FROM 0 DEBUG: Tunnel decap: id=3352 flowlabel=0
CPU 03: MARK 0x0 FROM 0 DEBUG: Attempting local delivery for container id 969 from seclabel 3352

At this point the data is delivered from cilium_vxlan straight to the eth0 interface of the pod on node2 (this can also be confirmed with tcpdump; see the sketch after the log below).

CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 1/2: src=10.0.0.222:7936 dst=10.0.1.208:0
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
CPU 03: MARK 0x0 FROM 969 DEBUG: CT verdict: New, revnat=0
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=3352 lb=0.0.0.0
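A quick way to confirm the direct delivery is to capture inside pod2 itself. This is a sketch under the assumption that tcpdump is available in the test image:

kubectl exec -it cni-test-76d79dfb85-tjhdp -- tcpdump -pne -i eth0 icmp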

Here we again see the packet being processed by conntrack/iptables, but now src and dst have swapped places: this is the ICMP reply packet, which then needs to be sent to the vxlan interface.

CPU 03: MARK 0x0 FROM 969 from-endpoint: 98 bytes (98 captured), state new, , identity 3352->unknown, orig-ip 0.0.0.0
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 1/2: src=10.0.1.208:0 dst=10.0.0.222:7936
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
CPU 03: MARK 0x0 FROM 969 DEBUG: CT entry found lifetime=16826421, revnat=0
CPU 03: MARK 0x0 FROM 969 DEBUG: CT verdict: Reply, revnat=0
CPU 03: MARK 0x0 FROM 969 DEBUG: Successfully mapped addr=10.0.0.222 to identity=3352
CPU 03: MARK 0x0 FROM 969 DEBUG: Encapsulating to node 3232235640 (0xc0a80078) from seclabel 3352
------------------------------------------------------------------------------

Here we can see that the returning ICMP reply also goes through the HOST NS encapsulation.

Ethernet {Contents=[..14..] Payload=[..86..] SrcMAC=da:97:53:7e:9a:ca DstMAC=62:57:5c:c9:d6:0c EthernetType=IPv4 Length=0}
IPv4 {Contents=[..20..] Payload=[..64..] Version=4 IHL=5 TOS=0 Length=84 Id=29557 Flags= FragOffset=0 TTL=64 Protocol=ICMPv4 Checksum=61574 SrcIP=10.0.1.208 DstIP=10.0.0.222 Options=[] Padding=[]}
ICMPv4 {Contents=[..8..] Payload=[..56..] TypeCode=EchoReply Checksum=19037 Id=7936 Seq=0}
  Failed to decode layer: No decoder for layer type Payload
CPU 03: MARK 0x0 FROM 969 to-overlay: 98 bytes (98 captured), state new, interface cilium_vxlan, , identity 3352->unknown, orig-ip 0.0.0.0
CPU 03: MARK 0x0 FROM 385 DEBUG: Conntrack lookup 1/2: src=192.168.0.130:50730 dst=192.168.0.120:8472
CPU 03: MARK 0x0 FROM 385 DEBUG: Conntrack lookup 2/2: nexthdr=17 flags=1
CPU 03: MARK 0x0 FROM 385 DEBUG: CT verdict: New, revnat=0
CPU 03: MARK 0x0 FROM 385 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=0 lb=0.0.0.0

cilium map

Cilium maintains a set of its own maps to determine the peer addresses.

If the cilium container does not have the jq command, simply run apt install jq.

root@node-1:/home/cilium# cilium map get cilium_tunnel_map -o json | jq
{
  "cache": [
    {
      "desired-action": "sync",
      "key": "10.0.1.0:0",
      "value": "192.168.0.130:0"
    },
    {
      "desired-action": "sync",
      "key": "10.0.2.0:0",
      "value": "192.168.0.110:0"
    }
  ],
  "path": "/sys/fs/bpf/tc/globals/cilium_tunnel_map"
}
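The key of each entry is a remote PodCIDR and the value is the node that owns it (10.0.1.0 maps to 192.168.0.130, i.e. node-2). A small jq sketch to print the mapping one line per node:

cilium map get cilium_tunnel_map -o json | jq -r '.cache[] | "\(.key) -> \(.value)"'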

The map that maintains pod IP addresses:

root@node-1:/home/cilium# cilium map get cilium_ipcache -o json | jq
{
  "cache": [
    { "desired-action": "sync", "key": "10.0.2.166/32",    "value": "53142 0 192.168.0.110" },
    { "desired-action": "sync", "key": "10.0.1.249/32",    "value": "4 0 192.168.0.130" },
    { "desired-action": "sync", "key": "10.0.1.208/32",    "value": "3352 0 192.168.0.130" },
    { "desired-action": "sync", "key": "10.0.0.215/32",    "value": "1 0 0.0.0.0" },
    { "desired-action": "sync", "key": "10.0.0.112/32",    "value": "4 0 0.0.0.0" },
    { "desired-action": "sync", "key": "192.168.0.110/32", "value": "7 0 0.0.0.0" },
    { "desired-action": "sync", "key": "10.0.2.185/32",    "value": "4 0 192.168.0.110" },
    { "desired-action": "sync", "key": "192.168.0.130/32", "value": "6 0 0.0.0.0" },
    { "desired-action": "sync", "key": "10.0.1.10/32",     "value": "6 0 192.168.0.130" },
    { "desired-action": "sync", "key": "10.0.2.233/32",    "value": "6 0 192.168.0.110" },
    { "desired-action": "sync", "key": "0.0.0.0/0",        "value": "2 0 0.0.0.0" },
    { "desired-action": "sync", "key": "192.168.0.120/32", "value": "1 0 0.0.0.0" },
    { "desired-action": "sync", "key": "10.0.2.45/32",     "value": "53142 0 192.168.0.110" },
    { "desired-action": "sync", "key": "10.0.0.222/32",    "value": "3352 0 0.0.0.0" }
  ],
  "path": "/sys/fs/bpf/tc/globals/cilium_ipcache"
}

We can take 10.0.1.208 as an example.

The 3352 in the value is the IDENTITY, which is determined via cilium endpoint list.

{ "desired-action": "sync", "key": "10.0.1.208/32", "value": "3352 0 192.168.0.130" }

The IDENTITY is 3352:

root@node-1:/home/cilium# cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                               IPv6   IPv4         STATUS
           ENFORCEMENT        ENFORCEMENT
3          Disabled           Disabled          1          reserved:host                                                                                 ready
415        Disabled           Disabled          4          reserved:health                                                                  10.0.0.112   ready
2164       Disabled           Disabled          3352       k8s:app=cni-test                                                                 10.0.0.222   ready
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default

Conclusion

By maintaining these maps, cilium can work out the correspondence between pod addresses and node addresses, together with the endpoint table it keeps; on the receiving node the packet is then redirected straight into the pod for processing.
