How kube-proxy implements networking

大番茄, November 8, 2019

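I only set out to install metrics server, but found that the apiserver could not reach service IPs: my master nodes were running neither kube-proxy nor flannel.
I had assumed kube-proxy merely adds iptables or ipvs rules; on closer study it turns out to be considerably more involved.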

II. Brief overview of the environment

service cidr: 10.0.0.0/16
pod cidr: 10.1.0.0/16

flannel: vxlan
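
Since the rest of the article keeps switching between 10.0.x.x and 10.1.x.x addresses, here is a minimal sketch for telling them apart. It only uses the two CIDRs listed above; the helper name `classify` is made up for illustration.

```python
import ipaddress

# CIDRs taken from the environment description above.
SERVICE_CIDR = ipaddress.ip_network("10.0.0.0/16")
POD_CIDR = ipaddress.ip_network("10.1.0.0/16")


def classify(ip: str) -> str:
    """Return 'service', 'pod', or 'other' for a dotted-quad address."""
    addr = ipaddress.ip_address(ip)
    if addr in SERVICE_CIDR:
        return "service"
    if addr in POD_CIDR:
        return "pod"
    return "other"


if __name__ == "__main__":
    # Addresses that appear later in the article.
    for ip in ("10.0.0.3", "10.1.65.5", "172.100.101.192"):
        print(ip, classify(ip))
```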

III. Conclusions first

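1. Every worker node owns the service IPs.

2. Accessing a service IP means accessing the local host.

3. The source address is also the service IP (loopback access).

4. During the access, the local ipvs rules match and the traffic is redirected to the addresses behind the service IP, yet the packet's source is still the service IP.

5. iptables masquerades the service IP used as the source. This is not a conclusion as such, just the way kube-proxy solves the problem.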

IV. What kube-proxy does after it learns of a new service address

This covers ipvs mode; iptables mode was not tested.

1. Conclusion 1: the local host owns the service IPs.

First, the ordinary routing table:

[root@k8s-master2 qfpay]# ip route list
default via 172.100.101.1 dev eth0 proto static metric 100
10.1.15.0/24 dev docker0 proto kernel scope link src 10.1.15.1
10.1.37.0/24 via 10.1.37.0 dev flannel.1 onlink
10.1.51.0/24 via 10.1.51.0 dev flannel.1 onlink
10.1.65.0/24 via 10.1.65.0 dev flannel.1 onlink
10.1.90.0/24 via 10.1.90.0 dev flannel.1 onlink
172.100.101.0/24 dev eth0 proto kernel scope link src 172.100.101.192 metric 100

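These routes are produced by flannel and have nothing to do with kube-proxy; they are just the routes to the containers on each node and can be ignored.
Now look at the local routing table: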

[root@k8s-master2 qfpay]# ip route list table local
local 10.0.0.1 dev kube-ipvs0 proto kernel scope host src 10.0.0.1
local 10.0.0.2 dev kube-ipvs0 proto kernel scope host src 10.0.0.2
local 10.0.0.3 dev kube-ipvs0 proto kernel scope host src 10.0.0.3
local 10.0.23.176 dev kube-ipvs0 proto kernel scope host src 10.0.23.176
local 10.0.94.107 dev kube-ipvs0 proto kernel scope host src 10.0.94.107
local 10.0.97.75 dev kube-ipvs0 proto kernel scope host src 10.0.97.75
local 10.0.224.12 dev kube-ipvs0 proto kernel scope host src 10.0.224.12
local 10.0.225.18 dev kube-ipvs0 proto kernel scope host src 10.0.225.18
local 10.0.243.185 dev kube-ipvs0 proto kernel scope host src 10.0.243.185
local 10.1.15.0 dev flannel.1 proto kernel scope host src 10.1.15.0
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1

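What is the local table? It holds the routes the host uses to reach the addresses on its own interfaces. For example, once the 127.0.0.1 entry is deleted, 127.0.0.1 can no longer be pinged. In short, these are the routes for connecting to local addresses.
So what do all those 10.0.x.x addresses above mean? They are addresses to be reached locally. The output also shows that they sit on the interface kube-ipvs0. Here is that interface: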

8: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether b6:e6:9a:c4:3e:de brd ff:ff:ff:ff:ff:ff
    inet 10.0.224.12/32 brd 10.0.224.12 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.0.2/32 brd 10.0.0.2 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.243.185/32 brd 10.0.243.185 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.23.176/32 brd 10.0.23.176 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.225.18/32 brd 10.0.225.18 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.94.107/32 brd 10.0.94.107 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.0.1/32 brd 10.0.0.1 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.0.3/32 brd 10.0.0.3 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.97.75/32 brd 10.0.97.75 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
[root@k8s-master2 qfpay]#

This shows that the local host owns the service IPs.
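
To make the "which interface owns which address" reading less of an eyeball exercise, here is a small sketch that groups the `local` entries of the table by device. It works on pasted text, and `SAMPLE` is just a few lines copied from the output above. The function name `local_addresses_by_device` is my own, and the parser assumes the exact `local <addr> dev <name> ...` layout shown in this article.

```python
from collections import defaultdict

# A few lines copied from the "ip route list table local" output above.
SAMPLE = """\
local 10.0.0.1 dev kube-ipvs0 proto kernel scope host src 10.0.0.1
local 10.0.0.3 dev kube-ipvs0 proto kernel scope host src 10.0.0.3
local 10.1.15.0 dev flannel.1 proto kernel scope host src 10.1.15.0
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
"""


def local_addresses_by_device(text):
    """Group the addresses of 'local' routes by the device they belong to."""
    result = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Expected layout: local <addr> dev <device> ...
        if len(fields) < 4 or fields[0] != "local" or fields[2] != "dev":
            continue
        result[fields[3]].append(fields[1])
    return dict(result)


if __name__ == "__main__":
    for device, addrs in local_addresses_by_device(SAMPLE).items():
        print(device, addrs)
```

Every address listed under kube-ipvs0 is a service IP that the host treats as its own.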

2. Conclusion 2: accessing a service IP means accessing the local host.

A quick test:

[root@k8s-master2 ~]# hostname
k8s-master2
[root@k8s-master2 ~]# ssh root@10.0.0.3
root@10.0.0.3's password:
Last login: Fri Nov  8 15:27:30 2019 from 10.0.0.3
[root@k8s-master2 ~]# hostname
k8s-master2

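Once ipvs and iptables are out of the picture (the SSH test above uses port 22, while the ipvs rules listed later for 10.0.0.3 only cover port 443), accessing a service IP is simply accessing the local host.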

3. Conclusion 3: the source address is also the service IP (loopback access).

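The loopback behaviour is probably due to the routes in the local table, where the source address (src) equals the destination address. For example:
(This part was added later; the addresses in it are unrelated to the rest of the article.)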

[root@k8s-node1 ~]# ip route list table local
local 10.0.0.1 dev kube-ipvs0 proto kernel scope host src 10.0.0.1
local 10.0.85.80 dev kube-ipvs0 proto kernel scope host src 10.0.85.80
local 10.5.2.0 dev flannel.1 proto kernel scope host src 10.5.2.0
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
......

This has nothing to do with kube-proxy; it is just how the system itself behaves.


Capture with tcpdump while accessing a service IP, and check the source address.
Ping the service IP:

[root@k8s-master2 ~]# ping 10.0.0.3

Then open another terminal and capture:

[root@k8s-master2 ~]# tcpdump -vv -nn -i lo icmp
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
15:41:40.322156 IP (tos 0x0, ttl 64, id 47578, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.3: ICMP echo request, id 9159, seq 268, length 64
15:41:40.322185 IP (tos 0x0, ttl 64, id 47579, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.3: ICMP echo reply, id 9159, seq 268, length 64
15:41:41.321951 IP (tos 0x0, ttl 64, id 48440, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.3: ICMP echo request, id 9159, seq 269, length 64
15:41:41.321979 IP (tos 0x0, ttl 64, id 48441, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.3: ICMP echo reply, id 9159, seq 269, length 64
15:41:42.321991 IP (tos 0x0, ttl 64, id 48838, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.3: ICMP echo request, id 9159, seq 270, length 64
15:41:42.322019 IP (tos 0x0, ttl 64, id 48839, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.3: ICMP echo reply, id 9159, seq 270, length 64
15:41:43.321986 IP (tos 0x0, ttl 64, id 49494, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.3: ICMP echo request, id 9159, seq 271, length 64
15:41:43.322013 IP (tos 0x0, ttl 64, id 49495, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.3: ICMP echo reply, id 9159, seq 271, length 64
15:41:44.322010 IP (tos 0x0, ttl 64, id 49575, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.3: ICMP echo request, id 9159, seq 272, length 64
15:41:44.322040 IP (tos 0x0, ttl 64, id 49576, offset 0, flags [none], proto ICMP (1), length 84)

I also tested the IP of the local physical NIC and 127.0.0.1. In each case the source is the very IP being accessed.

[root@k8s-master2 ~]# tcpdump -vv -nn -i lo icmp
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
15:51:00.088472 IP (tos 0x0, ttl 64, id 32725, offset 0, flags [DF], proto ICMP (1), length 84)
    127.0.0.1 > 127.0.0.1: ICMP echo request, id 9612, seq 1, length 64
15:51:00.088503 IP (tos 0x0, ttl 64, id 32726, offset 0, flags [none], proto ICMP (1), length 84)
    127.0.0.1 > 127.0.0.1: ICMP echo reply, id 9612, seq 1, length 64
15:51:01.087958 IP (tos 0x0, ttl 64, id 33708, offset 0, flags [DF], proto ICMP (1), length 84)
    127.0.0.1 > 127.0.0.1: ICMP echo request, id 9612, seq 2, length 64
15:51:01.087990 IP (tos 0x0, ttl 64, id 33709, offset 0, flags [none], proto ICMP (1), length 84)
    127.0.0.1 > 127.0.0.1: ICMP echo reply, id 9612, seq 2, length 64
[root@k8s-master2 ~]# tcpdump -vv -nn -i lo icmp
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
15:49:54.743021 IP (tos 0x0, ttl 64, id 29403, offset 0, flags [DF], proto ICMP (1), length 84)
    172.100.101.192 > 172.100.101.192: ICMP echo request, id 9245, seq 13, length 64
15:49:54.743056 IP (tos 0x0, ttl 64, id 29404, offset 0, flags [none], proto ICMP (1), length 84)
    172.100.101.192 > 172.100.101.192: ICMP echo reply, id 9245, seq 13, length 64
15:49:55.742985 IP (tos 0x0, ttl 64, id 29658, offset 0, flags [DF], proto ICMP (1), length 84)
    172.100.101.192 > 172.100.101.192: ICMP echo request, id 9245, seq 14, length 64
15:49:55.743018 IP (tos 0x0, ttl 64, id 29659, offset 0, flags [none], proto ICMP (1), length 84)
    172.100.101.192 > 172.100.101.192: ICMP echo reply, id 9245, seq 14, length 64
15:49:56.742976 IP (tos 0x0, ttl 64, id 29698, offset 0, flags [DF], proto ICMP (1), length 8

This gives the third conclusion: the source address is also the service IP. For local loopback access, the source is the destination.
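
Here is a simplified model of why the source ends up equal to the destination. It assumes, as guessed above, that for a locally delivered destination the source is taken from the `src` field of the matching route in the local table. It is a toy lookup over pasted text, not the kernel's actual algorithm, and `preferred_source` is a made-up name.

```python
import ipaddress

# A few local-table lines copied from the outputs above.
LOCAL_TABLE = """\
local 10.0.0.3 dev kube-ipvs0 proto kernel scope host src 10.0.0.3
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
"""


def preferred_source(dst, table_text):
    """Longest-prefix match over 'local' routes; return the route's src or None."""
    target = ipaddress.ip_address(dst)
    best = None  # (prefix length, src address) of the best match so far
    for line in table_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "local" or "src" not in fields:
            continue
        # A bare address such as 10.0.0.3 is treated as a /32 network.
        net = ipaddress.ip_network(fields[1], strict=False)
        if target in net:
            src = fields[fields.index("src") + 1]
            if best is None or net.prefixlen > best[0]:
                best = (net.prefixlen, src)
    return best[1] if best else None


if __name__ == "__main__":
    for dst in ("10.0.0.3", "127.0.0.1", "8.8.8.8"):
        print(dst, "->", preferred_source(dst, LOCAL_TABLE))
```

In this model 10.0.0.3 resolves to the source 10.0.0.3, which matches what tcpdump shows on lo, while an address with no local route returns None because it would be handled by the ordinary routing table instead.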

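4. Conclusion 4: during the access, the local ipvs rules match and the traffic is redirected to the addresses behind the service IP, yet the packet's source is still the service IP.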

The ipvs rules added automatically once kube-proxy starts:

TCP  10.0.0.2:9153 rr
  -> 10.1.65.2:9153               Masq    1      0          0
  -> 10.1.90.2:9153               Masq    1      0          0
  -> 10.1.90.8:9153               Masq    1      0          0
TCP  10.0.0.3:443 rr
  -> 10.1.65.5:443                Masq    1      1          0
TCP  10.0.23.176:443 rr
  -> 10.1.65.7:8443               Masq    1      0          0
TCP  10.0.94.107:3000 rr
  -> 10.1.90.6:3000               Masq    1      0          0
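
For readers less used to this listing: each `TCP ... rr` line is a virtual service using the round-robin scheduler, and each `->` line is a real server behind it, forwarded in Masq (NAT) mode. The sketch below parses text in this layout and imitates round-robin selection with a plain cycle. It is an illustration only: the real scheduler lives in the kernel, and the names `parse_ipvs` and `round_robin` are hypothetical.

```python
import itertools

# Two of the services from the listing above.
SAMPLE = """\
TCP  10.0.0.2:9153 rr
  -> 10.1.65.2:9153               Masq    1      0          0
  -> 10.1.90.2:9153               Masq    1      0          0
  -> 10.1.90.8:9153               Masq    1      0          0
TCP  10.0.0.3:443 rr
  -> 10.1.65.5:443                Masq    1      1          0
"""


def parse_ipvs(text):
    """Map each virtual service (ip:port) to its list of real servers."""
    services = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("TCP", "UDP"):
            current = fields[1]
            services[current] = []
        elif fields[0] == "->" and current is not None:
            services[current].append(fields[1])
    return services


def round_robin(backends):
    """Yield backends in turn, forever (equal weights assumed)."""
    return itertools.cycle(backends)


if __name__ == "__main__":
    services = parse_ipvs(SAMPLE)
    picker = round_robin(services["10.0.0.2:9153"])
    for _ in range(4):
        print("10.0.0.2:9153 ->", next(picker))
```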

Stick with 10.0.0.3 for testing. First check that it is reachable:

[root@k8s-master2 ssl]# curl https://10.0.0.3
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}[root@k8s-master2 ssl]#

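Now capture packets. Only outgoing packets were captured; the destination has already become 10.1.65.5, and the source has become 10.1.15.0 because that is the IP of the flannel.1 interface.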

[root@k8s-master2 ssl]# tcpdump -vv -nn -i flannel.1 tcp and dst port 443
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:34:05.316553 IP (tos 0x0, ttl 64, id 61978, offset 0, flags [DF], proto TCP (6), length 60)
    10.1.15.0.49036 > 10.1.65.5.443: Flags [S], cksum 0x6435 (incorrect -> 0x9e7d), seq 3345013407, win 43690, options [mss 65495,sackOK,TS val 350234757 ecr 0,nop,wscale 7], length 0
16:34:05.317753 IP (tos 0x0, ttl 64, id 61979, offset 0, flags [DF], proto TCP (6), length 52)
    10.1.15.0.49036 > 10.1.65.5.443: Flags [.], cksum 0x642d (incorrect -> 0x9adf), seq 3345013408, ack 2083764933, win 342, options [nop,nop,TS val 350234758 ecr 257068934], length 0
16:34:05.504440 IP (tos 0x0, ttl 64, id 61980, offset 0, flags [DF], proto TCP (6), length 225)
    10.1.15.0.49036 > 10.1.65.5.443: Flags [P.], cksum 0x64da (incorrect -> 0x84ec), seq 0:173, ack 1, win 342, options [nop,nop,TS val 350234945 ecr 257068934], length 173
16:34:05.515993 IP (tos 0x0, ttl 64, id 61981, offset 0, flags [DF], proto TCP (6), length 52)
    10.1.15.0.49036 > 10.1.65.5.443: Flags [.], cksum 0x642d (incorrect -> 0x91a7), seq 173, ack 1763, win 369, options [nop,nop,TS val 350234957 ecr 257069133], length 0
16:34:05.537198 IP (tos 0x0, ttl 64, id 61982, offset 0, flags [DF], proto TCP (6), length 144)
    10.1.15.0.49036 > 10.1.65.5.443: Flags [P.], cksum 0x6489 (incorrect -> 0x1be7), seq 173:265, ack 1763, win 369, options [nop,nop,TS val 350234978 ecr 257069133], length 92
16:34:05.544203 IP (tos 0x0, ttl 64, id 61983, offset 0, flags [DF], proto TCP (6), length 145)
    10.1.15.0.49036 > 10.1.65.5.443: Flags [P.], cksum 0x648a (incorrect -> 0xced1), seq 265:358, ack 1806, win 369, options [nop,nop,TS val 350234985 ecr 257069160], length 93
16:34:05.558114 IP (tos 0x0, ttl 64, id 61984, offset 0, flags [DF], proto TCP (6), length 75)
    10.1.15.0.49036 > 10.1.65.5.443: Flags [P.], cksum 0x6444 (incorrect -> 0xfdf8), seq 358:381, ack 2209, win 391, options [nop,nop,TS val 350234999 ecr 257069174], length 23
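
Reading source and destination out of long tcpdump lines gets tiring, so here is a small sketch that pulls the distinct (source, destination, destination port) tuples out of a pasted capture. `SAMPLE` holds two lines from the capture above with their tails shortened. The regular expression assumes the IPv4 `a.b.c.d.port > a.b.c.d.port:` layout seen here, so ICMP lines without ports are skipped. The name `flows` is my own.

```python
import re

# Matches lines such as: "    10.1.15.0.49036 > 10.1.65.5.443: Flags [S], ..."
FLOW_RE = re.compile(
    r"^\s*(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)

# Two packets from the capture above, with the tails of the lines shortened.
SAMPLE = """\
16:34:05.316553 IP (tos 0x0, ttl 64, id 61978, offset 0, flags [DF], proto TCP (6), length 60)
    10.1.15.0.49036 > 10.1.65.5.443: Flags [S], length 0
16:34:05.317753 IP (tos 0x0, ttl 64, id 61979, offset 0, flags [DF], proto TCP (6), length 52)
    10.1.15.0.49036 > 10.1.65.5.443: Flags [.], length 0
"""


def flows(capture_text):
    """Return the set of (src ip, dst ip, dst port) seen in the capture."""
    seen = set()
    for line in capture_text.splitlines():
        match = FLOW_RE.match(line)
        if match:
            src, _sport, dst, dport = match.groups()
            seen.add((src, dst, int(dport)))
    return seen


if __name__ == "__main__":
    for src, dst, dport in sorted(flows(SAMPLE)):
        print(f"{src} -> {dst}:{dport}")
```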

No packets from 10.0.0.3 were captured above, so capture on the destination host:

[root@k8s-node2 netns]# tcpdump -vv -nn -i flannel.1 tcp and dst port 443
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:52:14.607210 IP (tos 0x0, ttl 64, id 16524, offset 0, flags [DF], proto TCP (6), length 60)
    10.1.15.0.59650 > 10.1.65.5.443: Flags [S], cksum 0x5650 (correct), seq 2579204082, win 43690, options [mss 65495,sackOK,TS val 351324030 ecr 0,nop,wscale 7], length 0
16:52:14.608233 IP (tos 0x0, ttl 64, id 16525, offset 0, flags [DF], proto TCP (6), length 52)
    10.1.15.0.59650 > 10.1.65.5.443: Flags [.], cksum 0x4be6 (correct), seq 2579204083, ack 3177177434, win 342, options [nop,nop,TS val 351324031 ecr 258158208], length 0
16:52:14.793654 IP (tos 0x0, ttl 64, id 16526, offset 0, flags [DF], proto TCP (6), length 225)
    10.1.15.0.59650 > 10.1.65.5.443: Flags [P.], cksum 0x4895 (correct), seq 0:173, ack 1, win 342, options [nop,nop,TS val 351324216 ecr 258158208], length 173
16:52:14.804625 IP (tos 0x0, ttl 64, id 16527, offset 0, flags [DF], proto TCP (6), length 52)
    10.1.15.0.59650 > 10.1.65.5.443: Flags [.], cksum 0x42b3 (correct), seq 173, ack 1763, win 369, options [nop,nop,TS val 351324228 ecr 258158404], length 0
16:52:14.825610 IP (tos 0x0, ttl 64, id 16528, offset 0, flags [DF], proto TCP (6), length 144)

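Now delete the masquerade rule from iptables and try again.
The iptables rules kube-proxy generates are fairly complex, so for clarity all rules were flushed here and a single masquerade rule for the IP 10.1.65.5 was added.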

[root@k8s-master2 ~]# iptables -t nat -L -v -n --line
Chain PREROUTING (policy ACCEPT 45 packets, 2792 bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain INPUT (policy ACCEPT 45 packets, 2792 bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 138 packets, 8360 bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain POSTROUTING (policy ACCEPT 137 packets, 8300 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1        1    60 MASQUERADE  all  --  *      *       0.0.0.0/0            10.1.65.5

Chain DOCKER (0 references)
num   pkts bytes target     prot opt in     out     source               destination
[root@k8s-master2 ~]# !curl
curl https://10.0.0.3
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}[root@k8s-master2 ~]#

After deleting it:

[root@k8s-master2 ~]# iptables -t nat -D POSTROUTING 1
[root@k8s-master2 ~]# iptables -t nat -L -v -n --line
Chain PREROUTING (policy ACCEPT 1 packets, 60 bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain INPUT (policy ACCEPT 1 packets, 60 bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 41 packets, 2460 bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain POSTROUTING (policy ACCEPT 41 packets, 2460 bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain DOCKER (0 references)
num   pkts bytes target     prot opt in     out     source               destination
[root@k8s-master2 ~]# !curl
curl https://10.0.0.3
^C

It no longer works. Capture on the destination host:

[root@k8s-node2 netns]# tcpdump -vv -nn -i flannel.1 tcp and dst port 443
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:49:56.048292 IP (tos 0x0, ttl 64, id 9316, offset 0, flags [DF], proto TCP (6), length 60)
    10.0.0.3.54808 > 10.1.65.5.443: Flags [S], cksum 0x83ff (correct), seq 2764869211, win 43690, options [mss 65495,sackOK,TS val 351185471 ecr 0,nop,wscale 7], length 0
16:49:57.048707 IP (tos 0x0, ttl 64, id 9317, offset 0, flags [DF], proto TCP (6), length 60)
    10.0.0.3.54808 > 10.1.65.5.443: Flags [S], cksum 0x8016 (correct), seq 2764869211, win 43690, options [mss 65495,sackOK,TS val 351186472 ecr 0,nop,wscale 7], length 0
16:49:59.052813 IP (tos 0x0, ttl 64, id 9318, offset 0, flags [DF], proto TCP (6), length 60)
    10.0.0.3.54808 > 10.1.65.5.443: Flags [S], cksum 0x7842 (correct), seq 2764869211, win 43690, options [mss 65495,sackOK,TS val 351188476 ecr 0,nop,wscale 7], length 0

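The source is always 10.0.0.3.
This demonstrates conclusion 4: during the access, the local ipvs rules match and the traffic is redirected to the addresses behind the service IP, yet the packet's source is still the service IP.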
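
To tie the two experiments together, here is a toy model of the path a packet takes on the client node, with and without the masquerade rule. It only rewrites address fields the way the captures suggest and does not talk to ipvs or iptables. `Packet`, `ipvs_dnat`, `masquerade` and `send` are illustrative names; ports and multiple backends are left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


# Service IP -> one backend, taken from the ipvs listing (ports omitted).
IPVS_RULES = {"10.0.0.3": "10.1.65.5"}
# Address of flannel.1 on the client node, as seen in the captures.
FLANNEL_ADDR = "10.1.15.0"


def ipvs_dnat(pkt):
    """Rewrite the destination if it matches a virtual service."""
    backend = IPVS_RULES.get(pkt.dst)
    return replace(pkt, dst=backend) if backend else pkt


def masquerade(pkt, out_addr):
    """Rewrite the source to the address of the outgoing interface."""
    return replace(pkt, src=out_addr)


def send(dst, masq):
    """Model a local client accessing dst; return the packet that leaves the node."""
    pkt = Packet(src=dst, dst=dst)  # local access: source equals destination
    pkt = ipvs_dnat(pkt)
    if masq:
        pkt = masquerade(pkt, FLANNEL_ADDR)
    return pkt


if __name__ == "__main__":
    print("with masquerade:   ", send("10.0.0.3", masq=True))
    print("without masquerade:", send("10.0.0.3", masq=False))
```

With the rule in place, the packet leaves as 10.1.15.0 > 10.1.65.5, matching the first capture on the destination host. Without it, the packet leaves as 10.0.0.3 > 10.1.65.5, matching the repeated SYNs in the last capture.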