Overview
Kubernetes Services are implemented in kube-proxy with either iptables or ipvs. The iptables implementation is largely being superseded: ipvs, with its flexible load-balancing policies and higher efficiency, can take over Service proxying entirely. netfilter is the kernel's packet-processing framework; iptables and ipvs are both built on top of netfilter but implement different functionality. iptables covers packet filtering, packet mangling, NAT and even load balancing, whereas ipvs is dedicated to load balancing. Notably, iptables can even match packets on IPVS state (the ipvs match extension), which shows that iptables is the more general-purpose and feature-rich of the two.
The sections below first sketch how netfilter works, then cover the iptables tables and chains and the ipvs mechanism, and finally use examples from a Kubernetes cluster to show how the iptables and ipvs modes differ in practice.
Netfilter
The official description of netfilter can be found on the Netfilter project site.
Here is my understanding of netfilter. It runs in kernel space, with hooks placed at key points along the protocol stack's packet-processing path. Every packet entering the host passes the PREROUTING hook, every packet sent by a local application passes the OUTPUT hook, and packets not destined for this host pass the FORWARD hook (provided ip_forward is enabled).
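For IPv4 there are five such hook points, defined in the kernel's uapi header include/uapi/linux/netfilter.h (reproduced here, slightly abridged, with comments added):
enum nf_inet_hooks {
	NF_INET_PRE_ROUTING,	// incoming packets, before routing
	NF_INET_LOCAL_IN,	// packets addressed to this host
	NF_INET_FORWARD,	// packets routed through this host
	NF_INET_LOCAL_OUT,	// locally generated packets
	NF_INET_POST_ROUTING,	// all outgoing packets, after routing
	NF_INET_NUMHOOKS
};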
The code below shows where the IP layer hands packets to netfilter.
// Entry point where the IP layer receives packets from the network
NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
net, NULL, skb, dev, NULL,
ip_rcv_finish);
// Delivery from the IP layer up to the local transport layer
NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN,
net, NULL, skb, skb->dev, NULL,
ip_local_deliver_finish);
// Exit point where packets are passed down to the lower layer
NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip_finish_output,
!(IPCB(skb)->flags & IPSKB_REROUTED));
// Definition of NF_HOOK (a static inline function); okfn is the callback invoked once the hooks accept the packet
static inline int
NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct sk_buff *skb,
struct net_device *in, struct net_device *out,
int (*okfn)(struct net *, struct sock *, struct sk_buff *))
{
int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
if (ret == 1)
ret = okfn(net, sk, skb);
return ret;
}
// The per-namespace netfilter state shows which hook families exist.
struct netns_nf {
const struct nf_logger __rcu *nf_loggers[NFPROTO_NUMPROTO];
struct nf_hook_entries __rcu *hooks_ipv4[NF_INET_NUMHOOKS];
struct nf_hook_entries __rcu *hooks_ipv6[NF_INET_NUMHOOKS];
struct nf_hook_entries __rcu *hooks_arp[NF_ARP_NUMHOOKS];
struct nf_hook_entries __rcu *hooks_bridge[NF_INET_NUMHOOKS];
struct nf_hook_entries __rcu *hooks_decnet[NF_DN_NUMHOOKS];
};
The excerpts above show that the NF_HOOK family of functions (there are several variants) is the interface netfilter exposes to the protocol stack, and it is reasonable to guess that inside these functions the rules of every table attached to the current chain are evaluated. For example, when the chain is LOCAL_IN, the rules of the mangle, filter and nat tables registered there are traversed, and only afterwards is the okfn callback invoked.
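The traversal itself happens in nf_hook_slow, which NF_HOOK reaches through nf_hook. The sketch below is simplified and not verbatim kernel code (the function is renamed nf_hook_slow_sketch and error handling is trimmed, details vary across kernel versions), but it captures the idea: the hook entries registered on the chain are called in priority order, NF_ACCEPT moves on to the next entry, NF_DROP ends processing, and only if every entry accepts does NF_HOOK go on to call okfn.
// Simplified sketch of the loop in nf_hook_slow (not verbatim kernel code).
static int nf_hook_slow_sketch(struct sk_buff *skb, struct nf_hook_state *state,
			       const struct nf_hook_entries *e)
{
	unsigned int i;
	for (i = 0; i < e->num_hook_entries; i++) {
		unsigned int verdict = nf_hook_entry_hookfn(&e->hooks[i], skb, state);
		switch (verdict & NF_VERDICT_MASK) {
		case NF_ACCEPT:		// this hook accepts the packet, try the next one
			break;
		case NF_DROP:		// drop the packet and stop traversal
			kfree_skb(skb);
			return -EPERM;
		default:		// NF_STOLEN, NF_QUEUE, ...: the hook took over the packet
			return 0;
		}
	}
	return 1;			// 1 tells NF_HOOK to call okfn
}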
iptables tables
Take the nat table as an example; the other tables work much the same way. nf_nat_ipv4_ops contains several hook functions, registered at several of netfilter's hook points. When a packet reaches one of those hook points, the corresponding function is executed.
For example, when an incoming packet reaches the PRE_ROUTING hook and it is the nat table's turn, the hook function iptable_nat_do_chain is called, which then matches the packet against the rules in the table.
// iptable_nat_do_chain calls ipt_do_table to do the actual work
static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
// Only the two most common entries are shown; nat also has hooks on LOCAL_IN and LOCAL_OUT.
{
.hook = iptable_nat_do_chain,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_PRE_ROUTING,
.priority = NF_IP_PRI_NAT_DST,
},
{
.hook = iptable_nat_do_chain,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_POST_ROUTING,
.priority = NF_IP_PRI_NAT_SRC,
},
};
static int ipt_nat_register_lookups(struct net *net)
{
// Register each hook function into this net namespace's netfilter state (the loop over nf_nat_ipv4_ops is elided here).
ret = nf_nat_l3proto_ipv4_register_fn(net, &nf_nat_ipv4_ops[i]);
}
// Invokes a registered hook function, e.g. iptable_filter_hook or iptable_nat_do_chain mentioned below.
static inline int
nf_hook_entry_hookfn(const struct nf_hook_entry *entry, struct sk_buff *skb,
struct nf_hook_state *state)
{
return entry->hook(entry->priv, skb, state);
}
....
/*
The other tables follow the same pattern:
filter : iptable_filter_hook -> ipt_do_table
nat : iptable_nat_do_chain -> ipt_do_table
raw : iptable_raw_hook -> ipt_do_table
*/
iptables rules
An iptables rule describes how a packet should be handled: how to match it, and what to do with it once matched. For example: drop packets whose source address is 192.168.0.1; masquerade all packets leaving the host; set mark 0x4000 on packets from the 192.168.3.0/24 network; and so on.
The code shows that a packet traverses these rules sequentially, so a large number of iptables rules inevitably hurts the kernel's packet-processing performance.
unsigned int
ipt_do_table(struct sk_buff *skb,
const struct nf_hook_state *state,
struct xt_table *table) {
struct ipt_entry *e;
e = get_entry(table_base, private->hook_entry[hook]);
acpar.match->match(skb, &acpar);
t = ipt_get_target_c(e);
// The target function is the packet-handling action: REDIRECT, DNAT, SNAT, MARK and the like.
verdict = t->u.kernel.target->target(skb, &acpar);
}
// The NAT targets
static struct xt_target xt_nat_target_reg[] __read_mostly = {
{
.name = "SNAT",
.revision = 0,
.checkentry = xt_nat_checkentry_v0,
.destroy = xt_nat_destroy,
.target = xt_snat_target_v0,
.targetsize = sizeof(struct nf_nat_ipv4_multi_range_compat),
.family = NFPROTO_IPV4,
.table = "nat",
.hooks = (1 << NF_INET_POST_ROUTING) |
(1 << NF_INET_LOCAL_IN),
.me = THIS_MODULE,
},
{
.name = "DNAT",
.revision = 0,
.checkentry = xt_nat_checkentry_v0,
.destroy = xt_nat_destroy,
.target = xt_dnat_target_v0,
.targetsize = sizeof(struct nf_nat_ipv4_multi_range_compat),
.family = NFPROTO_IPV4,
.table = "nat",
.hooks = (1 << NF_INET_PRE_ROUTING) |
(1 << NF_INET_LOCAL_OUT),
.me = THIS_MODULE,
},
{
.name = "SNAT",
.revision = 1,
.checkentry = xt_nat_checkentry,
.destroy = xt_nat_destroy,
.target = xt_snat_target_v1,
.targetsize = sizeof(struct nf_nat_range),
.table = "nat",
.hooks = (1 << NF_INET_POST_ROUTING) |
(1 << NF_INET_LOCAL_IN),
.me = THIS_MODULE,
},
{
.name = "DNAT",
.revision = 1,
.checkentry = xt_nat_checkentry,
.destroy = xt_nat_destroy,
.target = xt_dnat_target_v1,
.targetsize = sizeof(struct nf_nat_range),
.table = "nat",
.hooks = (1 << NF_INET_PRE_ROUTING) |
(1 << NF_INET_LOCAL_OUT),
.me = THIS_MODULE,
},
// .....
};
How MARK tagging works in iptables
On a node running Kubernetes, listing the local rules with iptables frequently shows entries containing MARK; this is iptables' packet-marking feature. The usual pattern is: early in packet processing, match the packets of interest and put a mark on them; later in processing, apply further handling to the packets carrying that mark.
So how is this implemented, and where is the mark stored?
// xt_register_target registers target handlers such as REJECT and MARK.
// skb->mark is the "generic packet mark" field, i.e. the mark value is stored directly in the skb.
static struct xt_target mark_tg_reg __read_mostly = {
.name = "MARK",
.revision = 2,
.family = NFPROTO_UNSPEC,
.target = mark_tg,
.targetsize = sizeof(struct xt_mark_tginfo2),
.me = THIS_MODULE,
};
static unsigned int mark_tg(struct sk_buff *skb, const struct xt_action_param *par)
{
const struct xt_mark_tginfo2 *info = par->targinfo;
skb->mark = (skb->mark & ~info->mask) ^ info->mark;
return XT_CONTINUE;
}
/* Registration hooks for targets. */
int xt_register_target(struct xt_target *target)
{
u_int8_t af = target->family;
mutex_lock(&xt[af].mutex);
list_add(&target->list, &xt[af].target);
mutex_unlock(&xt[af].mutex);
return 0;
}
ipt_do_table ->
t->u.kernel.target->target(skb, &acpar)
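That is the marking side. The mark is read back later by the companion xt_mark match module, which simply compares skb->mark under a mask. A minimal sketch of the matching side follows (based on xt_mark; the struct field names are as I recall them and may differ slightly between kernel versions):
// Sketch of the xt_mark match: a rule like "-m mark --mark 0x4000/0x4000"
// compiles down to this masked comparison against skb->mark.
static bool mark_mt(const struct sk_buff *skb, struct xt_action_param *par)
{
	const struct xt_mark_mtinfo1 *info = par->matchinfo;
	return ((skb->mark & info->mask) == info->mark) ^ info->invert;
}
kube-proxy relies on exactly this pair: KUBE-MARK-MASQ sets mark 0x4000 on packets that need SNAT, and a later rule on the POSTROUTING path matches that mark and masquerades them.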
ipvs
ipvs is also built on netfilter, and it registers hook functions at the LOCAL_IN and LOCAL_OUT hook points. So why does the iptables command not show any of this? Because iptables only displays rules that were configured through iptables itself; those rules are evaluated by the hook functions that the iptables tables register with netfilter, whereas ipvs registers its own hook functions with netfilter directly.
Since ipvs can take over Service load balancing from iptables, the rest of this section looks at the ipvs implementation in more detail.
When kube-proxy runs in ipvs mode, it creates a dummy interface named kube-ipvs0 and binds every Service address in the cluster to it. When a pod sends a request to a Service address, the packet is delivered locally because of those addresses and hits the ipvs hook at LOCAL_IN; ipvs picks a backend according to the configured load-balancing policy, rewrites the Service address to the chosen pod IP, and re-injects the packet at LOCAL_OUT. From there the normal cross-host pod-to-pod forwarding path takes over.
The code below shows how the ipvs hooks are ordered relative to the iptables hooks on the same chain by their priorities: on LOCAL_IN they sit just before NF_IP_PRI_NAT_SRC, i.e. the mangle and filter rules are evaluated first and ipvs runs afterwards, and on LOCAL_OUT they sit just after NF_IP_PRI_NAT_DST, i.e. after mangle and the nat DNAT rules.
static const struct nf_hook_ops ip_vs_ops[] = {
/* After packet filtering, change source only for VS/NAT */
{
.hook = ip_vs_reply4,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_IN,
.priority = NF_IP_PRI_NAT_SRC - 2,
},
/* After packet filtering, forward packet through VS/DR, VS/TUN,
* or VS/NAT(change destination), so that filtering rules can be
* applied to IPVS. */
{
.hook = ip_vs_remote_request4,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_IN,
.priority = NF_IP_PRI_NAT_SRC - 1,
},
/* Before ip_vs_in, change source only for VS/NAT */
{
.hook = ip_vs_local_reply4,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_OUT,
.priority = NF_IP_PRI_NAT_DST + 1,
},
/* After mangle, schedule and forward local requests */
{
.hook = ip_vs_local_request4,
.pf = NFPROTO_IPV4,
.hooknum = NF_INET_LOCAL_OUT,
.priority = NF_IP_PRI_NAT_DST + 2,
},
// ....
};
nf_register_net_hooks(net, ip_vs_ops, ARRAY_SIZE(ip_vs_ops));
// Each forwarding mode is bound to its own transmit function.
static inline void ip_vs_bind_xmit(struct ip_vs_conn *cp)
{
switch (IP_VS_FWD_METHOD(cp)) {
case IP_VS_CONN_F_MASQ:
cp->packet_xmit = ip_vs_nat_xmit;
break;
case IP_VS_CONN_F_TUNNEL:
#ifdef CONFIG_IP_VS_IPV6
if (cp->daf == AF_INET6)
cp->packet_xmit = ip_vs_tunnel_xmit_v6;
else
#endif
cp->packet_xmit = ip_vs_tunnel_xmit;
break;
case IP_VS_CONN_F_DROUTE:
cp->packet_xmit = ip_vs_dr_xmit;
break;
case IP_VS_CONN_F_LOCALNODE:
cp->packet_xmit = ip_vs_null_xmit;
break;
case IP_VS_CONN_F_BYPASS:
cp->packet_xmit = ip_vs_bypass_xmit;
break;
}
}
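kube-proxy's ipvs mode uses the masquerading (NAT) forwarding method, which is why the ipvsadm output later in this article shows Masq for every destination; in the switch above that corresponds to IP_VS_CONN_F_MASQ, so packet_xmit ends up being ip_vs_nat_xmit.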
// When a packet then arrives at LOCAL_IN, the call chain is:
ip_vs_remote_request4
=> ip_vs_in
=> cp->packet_xmit
// ...which hands the packet to LOCAL_OUT for further processing:
static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
struct ip_vs_conn *cp, int local) {
NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output);
}
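The "pick a backend" step inside ip_vs_in is delegated to a scheduler module (rr, wrr, lc, sh, ...; kube-proxy defaults to rr). The snippet below is only a rough sketch of the round-robin idea, with made-up names (struct backend, rr_schedule); the real implementation is ip_vs_rr_schedule in net/netfilter/ipvs/ip_vs_rr.c and additionally deals with weights, overload flags and locking.
// Rough round-robin sketch (illustrative only, not kernel code).
struct backend {
	const char *addr;	// real server address, e.g. a pod IP
	int weight;		// weight 0 means "do not schedule"
};

static int rr_next;		// position remembered between calls (svc->sched_data in IPVS)

static const struct backend *rr_schedule(const struct backend *dests, int n)
{
	int i;
	for (i = 0; i < n; i++) {
		const struct backend *d = &dests[(rr_next + i) % n];
		if (d->weight > 0) {
			rr_next = (rr_next + i + 1) % n;
			return d;	// this destination becomes the connection's real server
		}
	}
	return NULL;			// no usable real server
}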
kube-proxy in iptables mode
[root@master-9 net]# iptables -L KUBE-SERVICES -t nat -n
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- 0.0.0.0/0 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-P4Q3KNUAWJVP4ILH tcp -- 0.0.0.0/0 10.96.0.131 /* default/nginx:http cluster IP */ tcp dpt:80
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-SVC-I24EZXP75AX5E7TU tcp -- 0.0.0.0/0 10.96.0.199 /* calico-apiserver/calico-api:apiserver cluster IP */ tcp dpt:443
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-SVC-JD5MR3NA4I4DYORP tcp -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-KQVGIOWQAVNMB2ZL tcp -- 0.0.0.0/0 10.96.0.220 /* calico-system/calico-kube-controllers-metrics:metrics-port cluster IP */ tcp dpt:9094
KUBE-SVC-RK657RLKDNVNU64O tcp -- 0.0.0.0/0 10.96.0.246 /* calico-system/calico-typha:calico-typha cluster IP */ tcp dpt:5473
KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
[root@master-9 net]# iptables -L KUBE-SVC-P4Q3KNUAWJVP4ILH -t nat -n
Chain KUBE-SVC-P4Q3KNUAWJVP4ILH (1 references)
target prot opt source destination
KUBE-MARK-MASQ tcp -- !10.244.0.0/24 10.96.0.131 /* default/nginx:http cluster IP */ tcp dpt:80
KUBE-SEP-5IN3N7CMZK6ATMGU all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.10000000009
KUBE-SEP-HLNRLNS5YZR3HUCE all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.11111111101
KUBE-SEP-ATAKOMWYNQ36NI3T all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.12500000000
KUBE-SEP-BHAOEVLY2MXTCNVF all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.14285714272
KUBE-SEP-PJXLHWLF6ASQ35HU all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.16666666651
KUBE-SEP-G7DLGXRAERZMKSWC all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.20000000019
KUBE-SEP-MUV3XIL573AOQ3RO all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.25000000000
KUBE-SEP-24LCKPV3WIWIN6LO all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.33333333349
KUBE-SEP-CXJ2YZHIBRQ4BYKV all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */ statistic mode random probability 0.50000000000
KUBE-SEP-AG44B2ZFINL2G42M all -- 0.0.0.0/0 0.0.0.0/0 /* default/nginx:http */
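Note how the load balancing is encoded: with ten endpoints, the first KUBE-SEP rule matches with probability 1/10, the second with 1/9 of the remaining traffic, the third with 1/8, and so on (hence the 0.1000, 0.1111, 0.1250, ... values above), while the last rule has no probability clause and takes whatever is left, so each endpoint ends up with roughly a 1/10 share overall.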
kube-proxy in ipvs mode
[root@10 vs]# iptables -L KUBE-SERVICES -t nat -n
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ all -- !10.244.0.0/16 0.0.0.0/0 /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
KUBE-NODE-PORT all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-CLUSTER-IP dst,dst
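In ipvs mode the iptables side stays small and essentially constant in size: the cluster IPs are collected in an ipset (the KUBE-CLUSTER-IP match-set referenced above), so there is no per-Service chain; the actual Service-to-endpoint mapping lives in the ipvs tables shown by ipvsadm below.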
[root@10 yaml]# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
TCP 10.10.101.91-slave:http rr
-> 10.244.114.29:http Masq 1 0 0
-> 10.244.186.20:http Masq 1 0 0
-> 10.244.186.21:http Masq 1 0 0
-> 10.244.186.22:http Masq 1 0 0
-> 10.244.186.23:http Masq 1 0 0
-> 10.244.186.24:http Masq 1 0 0
-> 10.244.188.15:http Masq 1 0 0
-> 10.244.188.17:http Masq 1 0 0
-> 10.244.188.18:http Masq 1 0 0
-> 10.244.188.19:http Masq 1 0 0
-> 10.244.188.20:http Masq 1 0 0
-> 10.244.218.17:http Masq 1 0 0
-> 10.244.218.18:http Masq 1 0 0
-> 10.244.218.19:http Masq 1 0 0
-> 10.244.218.20:http Masq 1 0 0
NodePort Services in ipvs mode
Although kube-proxy listens on the corresponding port on the host, stopping kube-proxy does not break access to the NodePort: the traffic is handled entirely by the ipvs rules in the kernel (the NodePort appears as an ipvs virtual service, as the ipvsadm output below shows), and the userspace listener merely reserves the port.
[root@10 yaml]# ss -lpn |grep 30000
tcp LISTEN 0 32768 *:30000 *:* users:(("kube-proxy",pid=1842,fd=10))
tcp LISTEN 0 32768 :::30000 :::* users:(("kube-proxy",pid=1842,fd=14))
TCP 10.10.101.91:31001 rr
-> 10.244.11.78:80 Masq 1 0 0
-> 10.244.11.81:80 Masq 1 0 0
-> 10.244.12.211:80 Masq 1 0 0
-> 10.244.13.17:80 Masq 1 0 0
-> 10.244.13.81:80 Masq 1 0 0
// The kube-ipvs0 device on the host is a dummy interface; its link state is even DOWN.
[root@10 yaml]# ip a s kube-ipvs0
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN
link/ether 6a:fa:a8:c2:62:8c brd ff:ff:ff:ff:ff:ff
inet 10.96.37.82/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.0.1/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.0.10/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.241.158/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.164.59/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet6 2001:db8:42:1::ab46/128 scope global
valid_lft forever preferred_lft forever
inet6 2001:db8:42:1::2021/128 scope global
valid_lft forever preferred_lft forever