2022-09-11
Cilium: Same-Node Pod Communication
Prerequisites
ebpf-host-routing
kernel >= 5.10
Kernels from 5.10 onward provide two important helper functions, bpf_redirect_peer() and bpf_redirect_neigh(). Cilium's eBPF Host-Routing feature is built on this capability.
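As a quick sanity check, the kernel requirement can be tested against a release string such as the one printed by `uname -r`. The helpers below are illustrative, not part of any Cilium tooling:

```python
# Sketch: check whether a kernel release string (as printed by `uname -r`)
# meets the 5.10 minimum needed for bpf_redirect_peer()/bpf_redirect_neigh().
def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a release string like '5.15.0-91-generic'."""
    major, minor = release.split(".")[:2]
    # Strip any non-digit suffix from the minor component (e.g. '10-rc1').
    minor = "".join(ch for ch in minor if ch.isdigit())
    return int(major), int(minor)

def supports_ebpf_host_routing(release: str) -> bool:
    return parse_kernel_release(release) >= (5, 10)

print(supports_ebpf_host_routing("5.15.0-91-generic"))  # True
print(supports_ebpf_host_routing("4.19.0-25-amd64"))    # False
```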
eBPF Host-Routing
Quoting the official documentation:
We introduced eBPF-based host-routing in Cilium 1.9 to fully bypass iptables and the upper host stack, and to achieve a faster network namespace switch compared to regular veth device operation. This option is automatically enabled if your kernel supports it. To validate whether your installation is running with eBPF host-routing, run cilium status in any of the Cilium pods and look for the line reporting the status for “Host Routing” which should state “BPF”.
Verify that the environment's Host Routing mode is BPF:

```
root@master:~# kubectl -n kube-system exec -it cilium-fxdkr -- cilium status | grep -i "Host Routing"
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), clean-cilium-state (init)
Host Routing: BPF
```
bpf_redirect_peer()
As shown in the figure below, bpf_redirect_peer() shortens the path for pods communicating on the same node: once a packet leaves the source pod, it is redirected straight into the destination pod's in-namespace interface, instead of passing through the corresponding LXC (host-side veth) interface, removing one forwarding step from the path.
For the implementation details, see the blog post from the Cilium community; this redirect capability is how Cilium bypasses the iptables overhead on the node.
cni-benchmark
Node network VS container network
As shown in the figure, the packet encapsulation process inside the pod and on the node (left side) is similar, but when a pod's packet leaves the current node, it is processed again on the host, i.e. it pays the iptables overhead a second time.
Container Network VS Cilium eBPF Container network
Compared with the standard container network on the left, the Host-Routing feature of Cilium's eBPF container network skips everything circled in red in the left-hand figure and delivers packets straight to the interface inside the container. This is exactly the capability provided by bpf_redirect_peer().
eBPF host-routing bypasses all of the iptables and upper-stack overhead in the host namespace, as well as the context-switching cost of traversing the veth pair. Inbound packets are picked up as early as possible on the network-facing device and delivered directly into the Kubernetes pod's network namespace. On egress, packets still traverse the veth pair, where they are picked up by eBPF and handed straight to the external-facing interface (the host's eth0). The routing table is queried directly from eBPF, so this optimization is fully transparent and compatible with any other routing services running on the system.
As shown in the figure, let's reason through where an ICMP packet should be visible when capturing on: the veth interface inside each pod, the host-side veth peer of each pod, and the host NIC.
POD1:
- pod1's in-pod veth: sees both the request and the reply
- the corresponding host-side veth (lxc): sees the request, but not the reply

POD2:
- pod2's in-pod veth: sees both the request and the reply
- the corresponding host-side veth (lxc): sees the reply, but not the request
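The expectation above can be written down as a small table; the sketch below only encodes the reasoning in this section (the interface labels are made up for illustration):

```python
# Expected visibility of the ICMP request/reply at each capture point for a
# pod1 -> pod2 ping on the same node, per the reasoning above.
expected = {
    "pod1_eth0": {"request", "reply"},  # in-pod interfaces see both directions
    "pod1_lxc":  {"request"},           # host-side veth: reply is redirected past it
    "pod2_eth0": {"request", "reply"},
    "pod2_lxc":  {"reply"},             # host-side veth: request is redirected past it
}

# Each host-side lxc interface only sees the direction that enters it from its
# own pod; the opposite direction is delivered by bpf_redirect_peer() and skips it.
for iface, seen in expected.items():
    missing = {"request", "reply"} - seen
    print(f"{iface}: sees {sorted(seen)}, misses {sorted(missing)}")
```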
```
root@master:~# kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE    IP           NODE               NOMINATED NODE   READINESS GATES
cni-test-76d79dfb85-5l4w9   1/1     Running   0          3d6h   10.0.0.133   node-1.whale.com
```

Source: 10.0.0.133 -> Destination: 10.0.0.67
Mapping pod1's interface to its host-side interface:

```
root@master:~# kubectl exec -it cni-test-76d79dfb85-5l4w9 -- ethtool -S eth0
NIC statistics:
     peer_ifindex: 20
     ...
# name of the interface with ifindex 20 on the host
root@node-1:~# ip link show | grep ^20
20: lxca2865f0fc4cc@if19:
```
Mapping pod2's interface to its host-side interface:

```
root@master:~# kubectl exec -it cni-test-76d79dfb85-7thf8 -- ethtool -S eth0
NIC statistics:
     peer_ifindex: 22
     ...
# name of the interface with ifindex 22 on the host
root@node-1:~# ip link show | grep ^22
22: lxcec93fc837a06@if21:
```
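The two-step lookup above (peer_ifindex from `ethtool -S`, then the name from `ip link`) can be scripted. A sketch that parses output in the shape shown above; the parsing helpers are hypothetical, not Cilium tooling:

```python
import re

# Sample output in the shape produced by `ethtool -S eth0` inside the pod and
# `ip link show` on the node (copied from the session above).
ethtool_out = """NIC statistics:
     peer_ifindex: 22
"""
ip_link_out = "22: lxcec93fc837a06@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500"

def peer_ifindex(ethtool_text: str) -> int:
    """Pull the host-side ifindex of the pod's veth peer from ethtool -S output."""
    m = re.search(r"peer_ifindex:\s*(\d+)", ethtool_text)
    if m is None:
        raise ValueError("no peer_ifindex in ethtool output")
    return int(m.group(1))

def host_ifname(ip_link_text: str, ifindex: int) -> str:
    """Resolve an ifindex to its interface name from `ip link show` output."""
    m = re.search(rf"^{ifindex}: ([^:@]+)", ip_link_text, re.MULTILINE)
    if m is None:
        raise ValueError(f"ifindex {ifindex} not found")
    return m.group(1)

idx = peer_ifindex(ethtool_out)
print(idx, host_ifname(ip_link_out, idx))  # 22 lxcec93fc837a06
```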
pod1 packet capture
Capture on pod1's in-pod interface:

```
kubectl exec -it cni-test-76d79dfb85-5l4w9 -- tcpdump -pne -i eth0 -w pod1_lxc.cap
```
Capture on the host-side lxc interface paired with pod1:

```
tcpdump -pne -i lxca2865f0fc4cc -w pod1_lxc_pair.cap
```
pod2 packet capture
Capture on pod2's in-pod interface:

```
kubectl exec -it cni-test-76d79dfb85-7thf8 -- tcpdump -pne -i eth0 -w pod2_lxc.cap
```
Capture on the host-side lxc interface paired with pod2:

```
tcpdump -pne -i lxcec93fc837a06 -w pod2_lxc_pair.cap
```
Ping pod2 from pod1

```
root@master:~# kubectl exec -it cni-test-76d79dfb85-5l4w9 -- ping -c 1 10.0.0.67
PING 10.0.0.67 (10.0.0.67): 56 data bytes
64 bytes from 10.0.0.67: seq=0 ttl=63 time=0.299 ms
--- 10.0.0.67 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
```
Inspect the captured packets:
This confirms the expectation: pod1's host-side lxc interface sees the request, but never the reply.
Going further with pwru
Tracing packets with the pwru tool: pwru
Same flow as above: source 10.0.0.133 -> destination 10.0.0.67.
pwru capture
On the corresponding node, trace by source address:
```
pwru --filter-src-ip 10.0.0.67 --output-tuple
2022/05/03 11:44:07 Attaching kprobes...
1421 / 1421 [--------------------] 100.00% 28 p/s

pwru --filter-src-ip 10.0.0.133 --output-tuple
2022/05/03 11:43:46 Attaching kprobes...
1421 / 1421 [--------------------] 100.00% 27 p/s
```
```
# ping test
root@master:~# kubectl exec -it cni-test-76d79dfb85-5l4w9 -- ping -c 1 10.0.0.67
PING 10.0.0.67 (10.0.0.67): 56 data bytes
64 bytes from 10.0.0.67: seq=0 ttl=63 time=0.507 ms
--- 10.0.0.67 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.507/0.507/0.507 ms
```

Trace output:
```
root@node-1:~# pwru --filter-src-ip 10.0.0.67 --output-tuple
2022/05/03 11:44:07 Attaching kprobes...
1421 / 1421 [--------------------] 100.00% 28 p/s
2022/05/03 11:44:57 Attached (ignored 52)
2022/05/03 11:44:57 Listening for events..
               SKB    PROCESS                        FUNC         TIMESTAMP
0xffff992e59159d00 [ping] ip_send_skb                 32578980753945 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] ip_local_out                32578980762591 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] __ip_local_out              32578980765265 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] ip_output                   32578980767478 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] nf_hook_slow                32578980770495 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] apparmor_ipv4_postroute     32578980773557 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] ip_finish_output            32578980775740 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] __cgroup_bpf_run_filter_skb 32578980778080 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] __ip_finish_output          32578980780955 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [ping] ip_finish_output2           32578980783494 10.0.0.67:0->10.0.0.133:0(icmp)
0xffff992e59159d00 [
```

```
root@node-1:~# pwru --filter-src-ip 10.0.0.133 --output-tuple
2022/05/03 11:43:46 Attaching kprobes...
1421 / 1421 [--------------------] 100.00% 27 p/s
2022/05/03 11:44:39 Attached (ignored 52)
2022/05/03 11:44:39 Listening for events..
               SKB    PROCESS                        FUNC         TIMESTAMP
0xffff992e59159900 [ping] ip_send_skb                 32578980483886 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] ip_local_out                32578980504988 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __ip_local_out              32578980513027 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] ip_output                   32578980516168 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] nf_hook_slow                32578980520228 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] apparmor_ipv4_postroute     32578980524494 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] ip_finish_output            32578980528041 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __cgroup_bpf_run_filter_skb 32578980532607 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __ip_finish_output          32578980535884 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] ip_finish_output2           32578980540962 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] neigh_resolve_output        32578980544866 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __neigh_event_send          32578980549401 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] eth_header                  32578980552201 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] skb_push                    32578980554265 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] dev_queue_xmit              32578980558756 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __dev_queue_xmit            32578980562637 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] netdev_core_pick_tx         32578980564736 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] validate_xmit_skb           32578980569790 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] netif_skb_features          32578980572386 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] passthru_features_check     32578980575187 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] skb_network_protocol        32578980577416 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] validate_xmit_xfrm          32578980580645 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] dev_hard_start_xmit         32578980583435 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] skb_clone_tx_timestamp      32578980586755 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __dev_forward_skb           32578980589508 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] __dev_forward_skb2          32578980592276 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] skb_scrub_packet            32578980596446 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] eth_type_trans              32578980600345 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] netif_rx                    32578980602431 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] netif_rx_internal           32578980604783 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] enqueue_to_backlog          32578980607385 10.0.0.133:0->10.0.0.67:0(icmp)
2022/05/03 11:46:05 Perf event ring buffer full, dropped 3 samples
0xffff992e59159900 [ping] skb_ensure_writable         32578980650033 10.0.0.133:0->10.0.0.67:0(icmp)
```
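Each pwru event line printed with --output-tuple has the shape `skb [process] func timestamp src->dst(proto)`. A hypothetical parser for lines in that shape, handy for comparing the two traces above:

```python
import re

# One event per line, in the shape pwru prints with --output-tuple
# (sample lines taken from the trace above).
LINE = re.compile(
    r"(?P<skb>0x[0-9a-f]+)\s+\[(?P<proc>[^\]]+)\]\s+(?P<func>\S+)\s+"
    r"(?P<ts>\d+)\s+(?P<src>\S+?)->(?P<dst>\S+?)\((?P<proto>\w+)\)"
)

def parse_events(text: str):
    """Yield (func, src, dst) for every pwru event line in `text`."""
    for m in LINE.finditer(text):
        yield m.group("func"), m.group("src"), m.group("dst")

trace = """\
0xffff992e59159900 [ping] ip_send_skb 32578980483886 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] ip_local_out 32578980504988 10.0.0.133:0->10.0.0.67:0(icmp)
0xffff992e59159900 [ping] dev_queue_xmit 32578980558756 10.0.0.133:0->10.0.0.67:0(icmp)
"""

funcs = [f for f, _, _ in parse_events(trace)]
print(funcs)  # ['ip_send_skb', 'ip_local_out', 'dev_queue_xmit']
```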
Tracing with cilium monitor
Run:

```
kubectl -n kube-system exec -it cilium-qqf8s -- cilium monitor --debug -vv
kubectl exec -it cni-test-76d79dfb85-5l4w9 -- ping -c 1 10.0.0.67
```
```
# 1. Processing starts here: src 10.0.0.133 --> 10.0.0.67 goes through Cilium's BPF conntrack
CPU 02: MARK 0x0 FROM 183 DEBUG: Conntrack lookup 1/2: src=10.0.0.133:26368 dst=10.0.0.67:0
CPU 02: MARK 0x0 FROM 183 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
# 2. In Cilium the destination address is mapped to identity 3352. This is a Cilium feature:
#    if the pod restarts and its IP changes, the policy stays the same
CPU 02: MARK 0x0 FROM 183 DEBUG: Successfully mapped addr=10.0.0.67 to identity=3352
CPU 02: MARK 0x0 FROM 183 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=3352 lb=0.0.0.0
# 3. Note that the endpoint ID has changed to 296, i.e. the other pod
CPU 02: MARK 0x0 FROM 183 DEBUG: Attempting local delivery for container id 296 from seclabel 3352
# 4. The packet has reached the destination pod: it received the ICMP request from the
#    source pod, again passing through conntrack
CPU 02: MARK 0x0 FROM 296 DEBUG: Conntrack lookup 1/2: src=10.0.0.133:26368 dst=10.0.0.67:0
CPU 02: MARK 0x0 FROM 296 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
-----------------------------------------------------------------------------------------
# 5. The reply from the destination pod also passes through conntrack
CPU 02: MARK 0x0 FROM 296 DEBUG: Conntrack lookup 1/2: src=10.0.0.67:0 dst=10.0.0.133:26368
CPU 02: MARK 0x0 FROM 296 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
CPU 02: MARK 0x0 FROM 296 DEBUG: CT entry found lifetime=16809827, revnat=0
CPU 02: MARK 0x0 FROM 296 DEBUG: CT verdict: Reply, revnat=0
# 6. Here the packet is redirected by bpf_redirect_peer() straight to the eth0 interface
#    of the peer pod on the same node
CPU 02: MARK 0x0 FROM 296 DEBUG: Successfully mapped addr=10.0.0.133 to identity=3352
CPU 02: MARK 0x0 FROM 296 DEBUG: Attempting local delivery for container id 183 from seclabel 3352
# 7. The packet is back in the source pod, again passing through conntrack
CPU 02: MARK 0x0 FROM 183 DEBUG: Conntrack lookup 1/2: src=10.0.0.67:0 dst=10.0.0.133:26368
CPU 02: MARK 0x0 FROM 183 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
```
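The identity mapping seen in step 2 can also be extracted from the monitor output programmatically; a small sketch over sample lines copied from above (the extraction logic is an assumption, not a Cilium API):

```python
import re

# Two "Successfully mapped" debug lines copied from the cilium monitor output above.
monitor = """\
CPU 02: MARK 0x0 FROM 183 DEBUG: Successfully mapped addr=10.0.0.67 to identity=3352
CPU 02: MARK 0x0 FROM 296 DEBUG: Successfully mapped addr=10.0.0.133 to identity=3352
"""

# Collect addr -> identity pairs from the debug lines.
mapping = dict(re.findall(r"mapped addr=(\S+) to identity=(\d+)", monitor))
print(mapping)  # {'10.0.0.67': '3352', '10.0.0.133': '3352'}

# Both pods carry the same security identity, so the policy verdict does not
# depend on their (ephemeral) IP addresses.
assert len(set(mapping.values())) == 1
```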
Summary
With eBPF Host-Routing enabled, the kernel's bpf_redirect_peer() helper lets packets between pods on the same node be redirected directly into the peer pod, skipping the iptables overhead. Avoiding these redundant processing steps improves network latency.