kubelet Certificate Expiry Breaks the Cluster

Environment: my own VM-based test environment (the kind that only gets powered on when needed)

Cluster version: 1.18.3

Today I started my locally deployed k8s cluster to run some tests, but noticed that some pods were in an abnormal state. The root cause turned out to be an expired kubelet client certificate.

In fact, if your cluster runs continuously and the Kubernetes version is 1.8 or later, this problem never occurs: the kubelet proactively renews its certificate shortly before it expires. Reference: Configure Certificate Rotation for the Kubelet.
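Rotation is controlled by the `rotateCertificates` field of the kubelet configuration; on a kubeadm-provisioned node that file is normally /var/lib/kubelet/config.yaml. A minimal sketch of how to confirm it is enabled (an inline sample config stands in for the real file so the snippet runs anywhere):

```shell
#!/bin/sh
# Sketch: verify that kubelet client-certificate rotation is enabled.
# On a real node, point CFG at /var/lib/kubelet/config.yaml instead of
# the sample written below.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
rotateCertificates: true
EOF

if grep -q '^rotateCertificates: true' "$CFG"; then
  echo "client certificate rotation is enabled"
else
  echo "rotation is NOT enabled; the client certificate will eventually expire"
fi
rm -f "$CFG"
```

Note that rotation only helps while the kubelet is actually running; a cluster that sits powered off past the expiry date (as here) still ends up with a dead certificate.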

Without further ado, here are the diagnosis and the fix. If you only want the fix, skip straight to the Solution section.

$ kubectl get nodes -o wide          # check the node status
NAME          STATUS    ROLES    AGE    VERSION  INTERNAL-IP    EXTERNAL-IP  OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
centos-20-2  Ready      master  392d  v1.18.3  192.168.20.2  <none>        CentOS Linux 7 (Core)  5.4.111-1.el7.elrepo.x86_64  docker://19.3.8
centos-20-3  NotReady  <none>  392d  v1.18.3  192.168.20.3  <none>        CentOS Linux 7 (Core)  5.4.111-1.el7.elrepo.x86_64  docker://19.3.8
centos-20-4  NotReady  <none>  392d  v1.18.3  192.168.20.4  <none>        CentOS Linux 7 (Core)  5.4.111-1.el7.elrepo.x86_64  docker://19.3.8

# So I checked the key pods and found that the master node's calico-node was abnormal
$ kubectl get pod -n kube-system -o wide | egrep "calico|etcd|kube-"
calico-kube-controllers-5b8b769fcd-gbd2z    1/1    Running  5          392d  10.100.78.153    centos-20-2  <none>          <none>
calico-node-c7xr9                            1/1    Running  4          392d  192.168.20.3    centos-20-3  <none>          <none>
calico-node-g2j88                            0/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
calico-node-nvtck                            1/1    Running  2          392d  192.168.20.4    centos-20-4  <none>          <none>
etcd-centos-20-2                            1/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
kube-apiserver-centos-20-2                  1/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
kube-controller-manager-centos-20-2          1/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
kube-proxy-dmdkh                            1/1    Running  4          392d  192.168.20.3    centos-20-3  <none>          <none>
kube-proxy-qmqq4                            1/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
kube-proxy-vqkpw                            1/1    Running  2          392d  192.168.20.4    centos-20-4  <none>          <none>
kube-scheduler-centos-20-2                  1/1    Running  6          392d  192.168.20.2    centos-20-2  <none>          <none>
monitor-kube-state-metrics-b7b7ccf8c-dzjl4  2/2    Running  0          151d  10.100.238.207  centos-20-3  <none>          <none>

# describe the pod; its events show the following errors:
$ kubectl describe pod/calico-node-g2j88 -n kube-system
  Normal  Created        6m56s                  kubelet, centos-20-2  Created container calico-node
  Warning  Unhealthy      6m47s                  kubelet, centos-20-2  Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.42022-05-11 09:24:58.449 [INFO][181] health.go 156: Number of node(s) with BGP peering established = 0
  Warning  Unhealthy      6m37s                  kubelet, centos-20-2  Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.42022-05-11 09:25:08.435 [INFO][258] health.go 156: Number of node(s) with BGP peering established = 0
  ...                                            (the same Unhealthy event repeated roughly every 10s)
  Warning  Unhealthy      117s (x21 over 5m17s)  kubelet, centos-20-2  (combined from similar events): Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.42022-05-11 09:29:48.412 [INFO][1099] health.go 156: Number of node(s) with BGP peering established = 0

# From these errors, the likely culprits are nodes 192.168.20.3 and 192.168.20.4,
# so check the kubelet service on those two nodes:
# the service is failing, and restarting it does not help

$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
  Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
          └─10-kubeadm.conf
  Active: activating (auto-restart) (Result: exit-code) since 三 2022-05-11 17:35:35 CST; 8s ago
    Docs: https://kubernetes.io/docs/
  Process: 8341 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 8341 (code=exited, status=255)

5月 11 17:35:35 centos-20-3 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
5月 11 17:35:35 centos-20-3 systemd[1]: Unit kubelet.service entered failed state.
5月 11 17:35:35 centos-20-3 systemd[1]: kubelet.service failed.

# Inspect the kubelet errors with the command below (/var/log/messages works too)
$ journalctl -r -u kubelet | less

-- Logs begin at 三 2022-05-11 17:24:18 CST, end at 三 2022-05-11 17:40:33 CST. --
5月 11 17:40:33 centos-20-3 systemd[1]: kubelet.service failed.
5月 11 17:40:33 centos-20-3 systemd[1]: Unit kubelet.service entered failed state.
5月 11 17:40:33 centos-20-3 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
5月 11 17:40:33 centos-20-3 kubelet[8752]: F0511 17:40:33.096373    8752 server.go:274] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
5月 11 17:40:33 centos-20-3 kubelet[8752]: E0511 17:40:33.096297    8752 bootstrap.go:265] part of the existing bootstrap client certificate is expired: 2022-04-14 08:15:03 +0000 UTC
5月 11 17:40:33 centos-20-3 kubelet[8752]: I0511 17:40:33.082709    8752 server.go:837] Client rotation is on, will bootstrap in background
5月 11 17:40:33 centos-20-3 kubelet[8752]: I0511 17:40:33.082679    8752 plugins.go:100] No cloud provider specified.
5月 11 17:40:33 centos-20-3 kubelet[8752]: I0511 17:40:33.082351    8752 server.go:417] Version: v1.18.3

# The errors above show that the kubelet client certificate has expired; it was valid until 2022-04-14 08:15:03 UTC

# Check the kubelet client certificate's validity period
$ openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
notBefore=Apr 14 08:15:03 2021 GMT
notAfter=Apr 14 08:15:03 2022 GMT
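Beyond eyeballing the dates, `openssl x509 -checkend` gives a scriptable yes/no answer: it exits 0 if the certificate will still be valid N seconds from now, and 1 otherwise. A sketch; it generates a throwaway one-day self-signed certificate so it runs anywhere, but on a node you would point CERT at kubelet-client-current.pem:

```shell
#!/bin/sh
# Sketch: scriptable expiry check. On a real node, set
#   CERT=/var/lib/kubelet/pki/kubelet-client-current.pem
# Here we generate a throwaway 1-day self-signed cert for demonstration.
CERT=$(mktemp)
openssl req -x509 -newkey rsa:2048 -keyout /dev/null -nodes \
  -subj "/CN=demo" -days 1 -out "$CERT" 2>/dev/null

# -checkend N: exit 0 if still valid N seconds from now, 1 if it will have expired
if openssl x509 -in "$CERT" -noout -checkend 0 >/dev/null; then
  echo "certificate is currently valid"
fi
if ! openssl x509 -in "$CERT" -noout -checkend $((30*24*3600)) >/dev/null; then
  echo "certificate expires within 30 days"
fi
rm -f "$CERT"
```

Run periodically (e.g. from cron or a monitoring check), this would have flagged the expiry well before the cluster broke.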

The investigation above makes the root cause clear, so the fix is fairly simple.

Solution

# Set the system time back to one day before the certificate expired
$ date -s 2022-04-13    # run on all master nodes and on every node whose kubelet certificate has expired, as close to simultaneously as possible so the cluster's clocks stay consistent
$ systemctl restart kubelet    # restart the kubelet service on the affected nodes
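Instead of hard-coding the date, you can derive "one day before notAfter" from the certificate itself. A sketch assuming GNU date (which CentOS 7 ships); the sample enddate string below stands in for reading the real PEM and matches the expired certificate from this article:

```shell
#!/bin/sh
# Sketch: compute the rollback date from the certificate's notAfter.
# On a node you would obtain the string with:
#   end=$(openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate | cut -d= -f2)
end="Apr 14 08:15:03 2022 GMT"

# GNU date parses openssl's enddate format; "1 day ago" steps back 24 hours.
rollback=$(TZ=UTC date -d "$end 1 day ago" +%Y-%m-%d)
echo "run: date -s $rollback"   # → run: date -s 2022-04-13
```

This avoids fat-fingering a date that, if set after notAfter, would leave the kubelet just as broken as before.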

# After the restart, watch /var/log/messages and confirm that this node's kubelet certificate has been rotated automatically

$ openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
notBefore=Apr 12 15:55:35 2022 GMT
notAfter=Apr 12 15:55:35 2023 GMT

# Restore the correct system time (run on all nodes)
$ ntpdate -u ntp.aliyun.com

# Finally, on the master, confirm that every node is back to Ready
$ kubectl get nodes
NAME          STATUS  ROLES    AGE    VERSION
centos-20-2  Ready    master  363d  v1.18.3
centos-20-3  Ready    <none>  363d  v1.18.3
centos-20-4  Ready    <none>  363d  v1.18.3
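For a scripted version of this final check, a little awk over `kubectl get nodes --no-headers` fails loudly if any node is not Ready. A sketch; the node list is inlined here (mirroring the output above) so the snippet runs without a cluster:

```shell
#!/bin/sh
# Sketch: fail if any node is not Ready.
# On a real cluster, replace the here-doc with:
#   kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1}'
not_ready=$(awk '$2 != "Ready" {print $1}' <<'EOF'
centos-20-2  Ready    master  363d  v1.18.3
centos-20-3  Ready    <none>  363d  v1.18.3
centos-20-4  Ready    <none>  363d  v1.18.3
EOF
)
if [ -z "$not_ready" ]; then
  echo "all nodes Ready"
else
  echo "not Ready: $not_ready"
  exit 1
fi
```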

