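This article uses hands-on experiments to explain the overlay and docker_gwbridge networks in Docker Swarm.
Setting up the test environment
First, build a Docker Swarm cluster from two physical machines (for the procedure, see "Docker Swarm (1): Getting Started, Building a Simple Swarm Cluster"):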
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
43k0p9fnwu9dhsyr0n6utfynn * ubuntu Ready Active Leader 19.03.5
gorkh8cb5ylb7szzbbrp2sheu ubuntu-2 Ready Active 19.03.5
Create an overlay network:
$ docker network create -d overlay --attachable --subnet 10.200.0.0/16 overlay_test
The Docker networks that now exist are:
$ docker network ls
NETWORK ID NAME DRIVER SCOPE
a473a52d686d bridge bridge local
5e1880193fbf docker_gwbridge bridge local
62ba25167374 host host local
jjyg85t5ta3k ingress overlay swarm
d056684646b3 none null local
hxyiridl2b9r overlay_test overlay swarm
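We focus on two of these networks:
- overlay_test: an overlay network that carries east-west traffic between containers.
- docker_gwbridge: the network through which containers send and receive north-south traffic.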
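To pick these networks out of a longer listing, the `docker network ls` table can be filtered on its DRIVER column. The following is only a minimal sketch: the function name `networks_by_driver` is hypothetical, and it assumes the column layout shown in the listing above.

```shell
#!/bin/bash
# networks_by_driver (hypothetical helper): read `docker network ls` output
# on stdin and print the NAME of every network whose DRIVER column equals $1.
# NR > 1 skips the header row; $2 is NAME and $3 is DRIVER in data rows.
networks_by_driver() {
    awk -v drv="$1" 'NR > 1 && $3 == drv { print $2 }'
}

# Example, on a host where Docker is available:
#   docker network ls | networks_by_driver overlay
```

On a real host, `docker network ls --filter driver=overlay` should give the same selection without any parsing.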
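Preparing the tools
As we know, Docker uses namespaces to isolate network stacks. We first prepare a script (saved as docker_netns.sh, the name used in the commands below) that runs a network command inside a given namespace.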
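#!/bin/bash
# Run a command inside a Docker network namespace.
# With no arguments, list the namespaces under /var/run/docker/netns/.
NAMESPACE=$1
if [[ -z $NAMESPACE ]]; then
    ls -1 /var/run/docker/netns/
    exit 0
fi
# First treat the argument as a namespace name...
NAMESPACE_FILE=/var/run/docker/netns/${NAMESPACE}
# ...then fall back to treating it as a container name or ID.
if [[ ! -f $NAMESPACE_FILE ]]; then
    NAMESPACE_FILE=$(docker inspect -f "{{.NetworkSettings.SandboxKey}}" "$NAMESPACE" 2>/dev/null)
fi
if [[ ! -f $NAMESPACE_FILE ]]; then
    echo "Cannot open network namespace '$NAMESPACE': No such file or directory"
    exit 1
fi
# Drop the namespace argument; what remains is the command to run.
shift
if [[ $# -lt 1 ]]; then
    echo "No command specified"
    exit 1
fi
# Quote "$@" so arguments containing spaces are passed through intact.
nsenter --net="${NAMESPACE_FILE}" "$@"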
It can list the existing namespaces:
$ sudo ./docker_netns.sh
1-k2rx924tgr
eab3f856fe9a
ingress_sbox
It can also run a command inside a given namespace:
$ sudo ./docker_netns.sh eab3f856fe9a ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
170: eth0@if171: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:0a:00:00:54 brd ff:ff:ff:ff:ff:ff link-netnsid 0
172: eth1@if173: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 1
The second tool is find_links.sh:
#!/bin/bash
# Find which Docker network namespace holds the interface with a given ifindex.
# With no argument, dump the links of every namespace.
DOCKER_NETNS_SCRIPT=./docker_netns.sh
IFINDEX=$1
if [[ -z $IFINDEX ]]; then
    for namespace in $($DOCKER_NETNS_SCRIPT); do
        printf "\e[1;31m%s: \e[0m\n" "$namespace"
        $DOCKER_NETNS_SCRIPT "$namespace" ip -c -o link
        printf "\n"
    done
else
    for namespace in $($DOCKER_NETNS_SCRIPT); do
        # `ip -o link` prints one link per line, starting with "<ifindex>: ".
        if $DOCKER_NETNS_SCRIPT "$namespace" ip -c -o link | grep -Pq "^$IFINDEX: "; then
            printf "\e[1;31m%s: \e[0m\n" "$namespace"
            $DOCKER_NETNS_SCRIPT "$namespace" ip -c -o link | grep -P "^$IFINDEX: "
            printf "\n"
        fi
    done
fi
Given an ifindex, this script finds the namespace that holds the interface:
$ sudo ./find_links.sh 60
1-hxyiridl2b:
60: veth1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default \ link/ether 4a:0a:52:98:84:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 2
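Network structure analysis
Below, we use experiments to understand the overlay network and the docker_gwbridge network.
First, create a container on each of the two nodes: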
$ docker run -d --name busybox --net overlay_test busybox sleep 36000
From inside the container, look at its network interfaces:
$ docker exec busybox ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
59: eth0@if60: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue
link/ether 02:42:0a:c8:00:02 brd ff:ff:ff:ff:ff:ff
inet 10.200.0.2/16 brd 10.200.255.255 scope global eth0
valid_lft forever preferred_lft forever
61: eth1@if62: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
valid_lft forever preferred_lft forever
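Besides the loopback interface, there are two more interfaces. 10.200.0.2/16 is the IP address of busybox's interface on the overlay_test network, and 172.18.0.3/16 is the IP address of its interface on the docker_gwbridge network.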
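In the output above, each MAC address looks like it was derived from the interface's IPv4 address: the prefix 02:42 followed by the four address bytes in hex (0a:c8:00:02 for 10.200.0.2, ac:12:00:03 for 172.18.0.3). Assuming that pattern, which holds for the addresses shown in this article but is not guaranteed for user-specified MACs or for every Docker version, a small helper can recover the IP from a MAC. The name `mac_to_ip` is hypothetical.

```shell
#!/bin/bash
# mac_to_ip (hypothetical helper): convert a MAC of the form
# 02:42:aa:bb:cc:dd into the dotted IPv4 address aa.bb.cc.dd (hex -> decimal).
# Assumption: the last four bytes of the MAC encode the IPv4 address.
mac_to_ip() {
    local old_ifs=$IFS
    IFS=:
    set -- $1            # split the MAC on ':' into $1..$6
    IFS=$old_ifs
    printf '%d.%d.%d.%d\n' "0x$3" "0x$4" "0x$5" "0x$6"
}

mac_to_ip 02:42:0a:c8:00:02   # prints 10.200.0.2
mac_to_ip 02:42:ac:12:00:03   # prints 172.18.0.3
```

This makes it easier to match the lladdr entries that appear later in the neighbor and forwarding tables to container IPs at a glance.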
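So far, the container network looks like this. We have only seen the network addresses and do not yet know how packets travel between them. (192.168.154.2 is the host's gateway.)
North-south traffic
Let's trace the route from inside the container to an external IP: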
$ docker exec busybox traceroute baidu.com
traceroute to baidu.com (220.181.38.148), 30 hops max, 46 byte packets
1 bogon (172.18.0.1) 0.003 ms 0.004 ms 0.006 ms
2 bogon (192.168.154.2) 0.148 ms 0.330 ms 0.175 ms
...
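As shown, the traffic passes through 172.18.0.1 and then reaches the host's gateway.
Next, let's work out the internal connections. As seen above, from the container's point of view the interface holding 172.18.0.3 is 61: eth1@if62. This means the interface has ifindex 61 and is connected through a veth pair to the interface with ifindex 62.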
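The `name@ifN` notation can be pulled apart mechanically. As a minimal sketch (the function name `veth_peer` is hypothetical), the following filter reads lines in the format printed by `ip link` or `ip addr` and prints the interface's own ifindex, its name, and the peer's ifindex:

```shell
#!/bin/bash
# veth_peer (hypothetical helper): for each input line of the form
# "<ifindex>: <name>@if<peer>: ...", print "<ifindex> <name> <peer>".
# Lines without an @ifN suffix (lo, bridges, continuation lines) are skipped.
veth_peer() {
    sed -nE 's/^([0-9]+): ([^@:]+)@if([0-9]+):.*/\1 \2 \3/p'
}

echo '61: eth1@if62: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue' | veth_peer
# prints: 61 eth1 62
```

On Linux, reading /sys/class/net/eth1/iflink inside the container should report the same peer index for a veth interface, which can serve as a cross-check.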
Let's look up which namespace holds interface 62:
$ sudo ./find_links.sh 62
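Surprisingly, nothing is printed. The script only searches the namespaces under /var/run/docker/netns/, so this means interface 62 lives in the host's root network namespace. Let's check on the host: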
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:0c:29:e5:66:45 brd ff:ff:ff:ff:ff:ff
inet 192.168.154.135/24 brd 192.168.154.255 scope global dynamic noprefixroute ens33
valid_lft 1502sec preferred_lft 1502sec
inet6 fe80::f378:1d3:6cde:69bb/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:50:e9:2d:e1 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
valid_lft forever preferred_lft forever
inet6 fe80::42:50ff:fee9:2de1/64 scope link
valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:5d:cd:c3:16 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
23: veth6ee82c3@if22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default
link/ether 4a:71:4d:f7:0e:4e brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::4871:4dff:fef7:e4e/64 scope link
valid_lft forever preferred_lft forever
62: veth0204500@if61: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default
link/ether 9e:d6:10:49:8e:42 brd ff:ff:ff:ff:ff:ff link-netnsid 4
inet6 fe80::9cd6:10ff:fe49:8e42/64 scope link
valid_lft forever preferred_lft forever
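As shown, the master of interface 62 is docker_gwbridge. In other words, interface 62 is attached to the docker_gwbridge bridge.
North-south traffic is also NATed as it leaves the host: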
$ sudo iptables-save -t nat | grep -- '-A POSTROUTING'
-A POSTROUTING -o docker_gwbridge -m addrtype --src-type LOCAL -j MASQUERADE
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE
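The rule for 172.18.0.0/16 masquerades packets whose source address falls inside that subnet and that leave through any interface other than docker_gwbridge. To illustrate just the source-match part, here is a minimal sketch of a CIDR membership test in shell arithmetic. The function names `ip_to_int` and `in_cidr` are hypothetical, and this only mimics the address comparison; it is not how iptables is implemented.

```shell
#!/bin/bash
# ip_to_int (hypothetical helper): dotted IPv4 address -> 32-bit integer.
ip_to_int() {
    local old_ifs=$IFS
    IFS=.
    set -- $1            # split on '.' into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr IP NET/BITS (hypothetical helper): succeed if IP is inside the subnet.
in_cidr() {
    local ip net bits mask
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")     # part before the slash
    bits=${2#*/}                   # prefix length after the slash
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# The container's eth1 address matches the source subnet of the rule:
if in_cidr 172.18.0.3 172.18.0.0/16; then
    echo "172.18.0.3 matches -s 172.18.0.0/16"
fi
```

The container's overlay address 10.200.0.2 does not match this subnet, which is consistent with external traffic leaving through eth1 (172.18.0.3) as the traceroute showed.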
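With that, the path of north-south traffic is clear. Our network topology can be updated to:
East-west traffic
East-west traffic is the traffic between containers. Let's first test connectivity between the containers.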
$ docker exec busybox ping 10.200.0.2
PING 10.200.0.2 (10.200.0.2): 56 data bytes
64 bytes from 10.200.0.2: seq=0 ttl=64 time=41.177 ms
64 bytes from 10.200.0.2: seq=1 ttl=64 time=1.181 ms
64 bytes from 10.200.0.2: seq=2 ttl=64 time=1.110 ms
Next, let's explore the path this traffic takes. Look again at the network configuration inside the container.
$ docker exec busybox ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
59: eth0@if60: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue
link/ether 02:42:0a:c8:00:02 brd ff:ff:ff:ff:ff:ff
inet 10.200.0.2/16 brd 10.200.255.255 scope global eth0
valid_lft forever preferred_lft forever
61: eth1@if62: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
valid_lft forever preferred_lft forever
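The interface holding 10.200.0.2 is 59: eth0@if60, i.e. this interface has ifindex 59 and is connected to the interface with ifindex 60. Let's look up the namespace that holds interface 60.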
$ sudo ./find_links.sh 60
1-hxyiridl2b:
60: veth1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default \ link/ether 4a:0a:52:98:84:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 2
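So interface 60 is in the namespace 1-hxyiridl2b, whose name matches the beginning of overlay_test's network ID (hxyiridl2b9r). Let's list the interfaces in that namespace: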
$ sudo ./docker_netns.sh 1-hxyiridl2b ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 0e:2d:34:e6:eb:b7 brd ff:ff:ff:ff:ff:ff
inet 10.200.0.1/16 brd 10.200.255.255 scope global br0
valid_lft forever preferred_lft forever
56: vxlan0@if56: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN group default
link/ether 0e:2d:34:e6:eb:b7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
58: veth0@if57: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP group default
link/ether ea:c1:db:d4:b1:83 brd ff:ff:ff:ff:ff:ff link-netnsid 1
60: veth1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP group default
link/ether 4a:0a:52:98:84:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 2
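This namespace contains a vxlan egress interface, vxlan0, attached to the same bridge (br0) as the veth interfaces. A Docker overlay network uses this VXLAN tunnel to reach containers on other nodes.
Although the two containers communicate through a VXLAN tunnel, they are unaware of it: all they see is that both containers sit on the same layer-2 network. The vxlan interface encapsulates each layer-2 frame in the payload of a UDP packet and sends it to the peer, where the peer's vxlan interface decapsulates it.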
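This encapsulation also accounts for the MTU of 1450 seen on eth0, br0 and vxlan0. A minimal sketch of the arithmetic follows, assuming an underlay MTU of 1500 (as on ens33 and docker_gwbridge above), an IPv4 outer header without options, and no VLAN tag on the inner frame. The function name `vxlan_overlay_mtu` is hypothetical.

```shell
#!/bin/bash
# vxlan_overlay_mtu UNDERLAY_MTU (hypothetical helper): MTU left for the inner
# IP packet after VXLAN-over-IPv4 encapsulation.
vxlan_overlay_mtu() {
    local underlay_mtu=$1
    local inner_eth=14     # inner Ethernet header carried inside the tunnel
    local vxlan_hdr=8      # VXLAN header
    local udp_hdr=8        # outer UDP header
    local outer_ipv4=20    # outer IPv4 header without options
    echo $(( underlay_mtu - inner_eth - vxlan_hdr - udp_hdr - outer_ipv4 ))
}

vxlan_overlay_mtu 1500   # prints 1450
```

The 50 bytes of overhead are why the overlay-side interfaces advertise a smaller MTU than the host's 1500.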
Let's look at the ARP (neighbor) table in the namespace 1-hxyiridl2b:
$ sudo ./docker_netns.sh 1-hxyiridl2b ip neigh
10.200.0.5 dev vxlan0 lladdr 02:42:0a:c8:00:05 PERMANENT
10.200.0.4 dev vxlan0 lladdr 02:42:0a:c8:00:04 PERMANENT
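We can see that 10.200.0.4, the IP of the container on the remote node, appears in the local ARP table. The peer's layer-2 address is obtained by looking it up here.
Next, let's see where the VXLAN packets are sent: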
$ sudo ./docker_netns.sh 1-hxyiridl2b bridge fdb
...
02:42:0a:c8:00:05 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent
02:42:0a:c8:00:04 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent
...
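This can be read as the VXLAN VTEP table: given a destination MAC address, it yields the outer IP address used to encapsulate the VXLAN packet, here 192.168.154.136.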
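Putting the two tables together, resolving a remote container IP takes two lookups: IP to MAC in the neighbor table, then MAC to outer IP in the forwarding database. The sketch below replays that chain on the sample lines from this article; the function names `mac_for_ip` and `vtep_for_mac` are hypothetical, and the parsing assumes the `ip neigh` and `bridge fdb` output formats shown above.

```shell
#!/bin/bash
# mac_for_ip IP (hypothetical helper): read `ip neigh` output on stdin and
# print the lladdr recorded for IP.
mac_for_ip() {
    awk -v ip="$1" '$1 == ip { for (i = 2; i < NF; i++) if ($i == "lladdr") print $(i + 1) }'
}

# vtep_for_mac MAC (hypothetical helper): read `bridge fdb` output on stdin
# and print the dst (outer IP) recorded for MAC.
vtep_for_mac() {
    awk -v mac="$1" '$1 == mac { for (i = 2; i < NF; i++) if ($i == "dst") print $(i + 1) }'
}

# Sample data copied from the outputs above.
neigh='10.200.0.5 dev vxlan0 lladdr 02:42:0a:c8:00:05 PERMANENT
10.200.0.4 dev vxlan0 lladdr 02:42:0a:c8:00:04 PERMANENT'
fdb='02:42:0a:c8:00:05 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent
02:42:0a:c8:00:04 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent'

mac=$(printf '%s\n' "$neigh" | mac_for_ip 10.200.0.4)
vtep=$(printf '%s\n' "$fdb" | vtep_for_mac "$mac")
echo "10.200.0.4 -> $mac -> $vtep"
# prints: 10.200.0.4 -> 02:42:0a:c8:00:04 -> 192.168.154.136
```

On a live node the same functions could be fed from `sudo ./docker_netns.sh 1-hxyiridl2b ip neigh` and `sudo ./docker_netns.sh 1-hxyiridl2b bridge fdb` instead of the embedded samples.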
We can now draw the complete topology for east-west traffic: