kubelet Certificate Expiry Breaks the Cluster

Environment: my own VM-based test environment (the kind that only gets powered on when needed)

Cluster version: 1.18.3

Today I started my locally deployed k8s cluster to run some tests, but noticed that some pods were in an abnormal state. The root cause turned out to be an expired kubelet client certificate.

In fact, if your cluster runs continuously and the Kubernetes version is 1.8 or later, this problem never occurs: the kubelet proactively renews its certificate shortly before it expires. Reference: Configure Certificate Rotation for the Kubelet.
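Rotation is controlled by the `rotateCertificates` field of the kubelet configuration; on a kubeadm-provisioned node that file is normally /var/lib/kubelet/config.yaml. A minimal sketch of how to confirm it is enabled (an inline sample config stands in for the real file so the snippet runs anywhere):

```shell
#!/bin/sh
# Sketch: verify that kubelet client-certificate rotation is enabled.
# On a real node, point CFG at /var/lib/kubelet/config.yaml instead of
# the sample written below.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
rotateCertificates: true
EOF

if grep -q '^rotateCertificates: true' "$CFG"; then
  echo "client certificate rotation is enabled"
else
  echo "rotation is NOT enabled; the client certificate will eventually expire"
fi
rm -f "$CFG"
```

Note that rotation only helps while the kubelet is actually running; a cluster that sits powered off past the expiry date (as here) still ends up with a dead certificate.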

Without further ado, here are the diagnosis and the fix. If you only want the fix, skip straight to the Solution section.

$ kubectl get nodes -o wide          # check the node status
NAME          STATUS    ROLES    AGE    VERSION  INTERNAL-IP    EXTERNAL-IP  OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
centos-20-2  Ready      master  392d  v1.18.3  192.168.20.2  <none>        CentOS Linux 7 (Core)  5.4.111-1.el7.elrepo.x86_64  docker://19.3.8
centos-20-3  NotReady  <none>  392d  v1.18.3  192.168.20.3  <none>        CentOS Linux 7 (Core)  5.4.111-1.el7.elrepo.x86_64  docker://19.3.8
centos-20-4  NotReady  <none>  392d  v1.18.3  192.168.20.4  <none>        CentOS Linux 7 (Core)  5.4.111-1.el7.elrepo.x86_64  docker://19.3.8

# So I checked the key pods and found that the master node's calico-node was abnormal
$ kubectl get pod -n kube-system -o wide | egrep "calico|etcd|kube-"
calico-kube-controllers-5b8b769fcd-gbd2z    1/1    Running  5          392d  10.100.78.153    centos-20-2  <none>          <none>
calico-node-c7xr9                            1/1    Running  4          392d  192.168.20.3    centos-20-3  <none>          <none>
calico-node-g2j88                            0/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
calico-node-nvtck                            1/1    Running  2          392d  192.168.20.4    centos-20-4  <none>          <none>
etcd-centos-20-2                            1/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
kube-apiserver-centos-20-2                  1/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
kube-controller-manager-centos-20-2          1/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
kube-proxy-dmdkh                            1/1    Running  4          392d  192.168.20.3    centos-20-3  <none>          <none>
kube-proxy-qmqq4                            1/1    Running  5          392d  192.168.20.2    centos-20-2  <none>          <none>
kube-proxy-vqkpw                            1/1    Running  2          392d  192.168.20.4    centos-20-4  <none>          <none>
kube-scheduler-centos-20-2                  1/1    Running  6          392d  192.168.20.2    centos-20-2  <none>          <none>
monitor-kube-state-metrics-b7b7ccf8c-dzjl4  2/2    Running  0          151d  10.100.238.207  centos-20-3  <none>          <none>

# describe the pod; its events show the following errors:
$ kubectl describe pod/calico-node-g2j88 -n kube-system
  Normal  Created        6m56s                  kubelet, centos-20-2  Created container calico-node
  Warning  Unhealthy      6m47s                  kubelet, centos-20-2  Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.42022-05-11 09:24:58.449 [INFO][181] health.go 156: Number of node(s) with BGP peering established = 0
  Warning  Unhealthy      6m37s                  kubelet, centos-20-2  Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.42022-05-11 09:25:08.435 [INFO][258] health.go 156: Number of node(s) with BGP peering established = 0
  ...                                            (the same Unhealthy event repeated roughly every 10s)
  Warning  Unhealthy      117s (x21 over 5m17s)  kubelet, centos-20-2  (combined from similar events): Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.42022-05-11 09:29:48.412 [INFO][1099] health.go 156: Number of node(s) with BGP peering established = 0

# From these errors, the likely culprits are nodes 192.168.20.3 and 192.168.20.4,
# so check the kubelet service on those two nodes:
# the service is failing, and restarting it does not help

$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
  Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
          └─10-kubeadm.conf
  Active: activating (auto-restart) (Result: exit-code) since 三 2022-05-11 17:35:35 CST; 8s ago
    Docs: https://kubernetes.io/docs/
  Process: 8341 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 8341 (code=exited, status=255)

5月 11 17:35:35 centos-20-3 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
5月 11 17:35:35 centos-20-3 systemd[1]: Unit kubelet.service entered failed state.
5月 11 17:35:35 centos-20-3 systemd[1]: kubelet.service failed.

# Inspect the kubelet errors with the command below (/var/log/messages works too)
$ journalctl -r -u kubelet | less

-- Logs begin at 三 2022-05-11 17:24:18 CST, end at 三 2022-05-11 17:40:33 CST. --
5月 11 17:40:33 centos-20-3 systemd[1]: kubelet.service failed.
5月 11 17:40:33 centos-20-3 systemd[1]: Unit kubelet.service entered failed state.
5月 11 17:40:33 centos-20-3 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
5月 11 17:40:33 centos-20-3 kubelet[8752]: F0511 17:40:33.096373    8752 server.go:274] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
5月 11 17:40:33 centos-20-3 kubelet[8752]: E0511 17:40:33.096297    8752 bootstrap.go:265] part of the existing bootstrap client certificate is expired: 2022-04-14 08:15:03 +0000 UTC
5月 11 17:40:33 centos-20-3 kubelet[8752]: I0511 17:40:33.082709    8752 server.go:837] Client rotation is on, will bootstrap in background
5月 11 17:40:33 centos-20-3 kubelet[8752]: I0511 17:40:33.082679    8752 plugins.go:100] No cloud provider specified.
5月 11 17:40:33 centos-20-3 kubelet[8752]: I0511 17:40:33.082351    8752 server.go:417] Version: v1.18.3

# The errors above show that the kubelet client certificate has expired; it was valid until 2022-04-14 08:15:03 UTC

# Check the kubelet client certificate's validity period
$ openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
notBefore=Apr 14 08:15:03 2021 GMT
notAfter=Apr 14 08:15:03 2022 GMT
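Beyond eyeballing the dates, `openssl x509 -checkend` gives a scriptable yes/no answer: it exits 0 if the certificate will still be valid N seconds from now, and 1 otherwise. A sketch; it generates a throwaway one-day self-signed certificate so it runs anywhere, but on a node you would point CERT at kubelet-client-current.pem:

```shell
#!/bin/sh
# Sketch: scriptable expiry check. On a real node, set
#   CERT=/var/lib/kubelet/pki/kubelet-client-current.pem
# Here we generate a throwaway 1-day self-signed cert for demonstration.
CERT=$(mktemp)
openssl req -x509 -newkey rsa:2048 -keyout /dev/null -nodes \
  -subj "/CN=demo" -days 1 -out "$CERT" 2>/dev/null

# -checkend N: exit 0 if still valid N seconds from now, 1 if it will have expired
if openssl x509 -in "$CERT" -noout -checkend 0 >/dev/null; then
  echo "certificate is currently valid"
fi
if ! openssl x509 -in "$CERT" -noout -checkend $((30*24*3600)) >/dev/null; then
  echo "certificate expires within 30 days"
fi
rm -f "$CERT"
```

Run periodically (e.g. from cron or a monitoring check), this would have flagged the expiry well before the cluster broke.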

The investigation above makes the root cause clear, so the fix is fairly simple.

Solution

# Set the system time back to one day before the certificate expired
$ date -s 2022-04-13    # run on all master nodes and on every node whose kubelet certificate has expired, as close to simultaneously as possible so the cluster's clocks stay consistent
$ systemctl restart kubelet    # restart the kubelet service on the affected nodes
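Instead of hard-coding the date, you can derive "one day before notAfter" from the certificate itself. A sketch assuming GNU date (which CentOS 7 ships); the sample enddate string below stands in for reading the real PEM and matches the expired certificate from this article:

```shell
#!/bin/sh
# Sketch: compute the rollback date from the certificate's notAfter.
# On a node you would obtain the string with:
#   end=$(openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate | cut -d= -f2)
end="Apr 14 08:15:03 2022 GMT"

# GNU date parses openssl's enddate format; "1 day ago" steps back 24 hours.
rollback=$(TZ=UTC date -d "$end 1 day ago" +%Y-%m-%d)
echo "run: date -s $rollback"   # → run: date -s 2022-04-13
```

This avoids fat-fingering a date that, if set after notAfter, would leave the kubelet just as broken as before.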

# After the restart, watch /var/log/messages and confirm that this node's kubelet certificate has been rotated automatically

$ openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
notBefore=Apr 12 15:55:35 2022 GMT
notAfter=Apr 12 15:55:35 2023 GMT

# Restore the correct system time (run on all nodes)
$ ntpdate -u ntp.aliyun.com

# Finally, on the master, confirm that every node is back to Ready
$ kubectl get nodes
NAME          STATUS  ROLES    AGE    VERSION
centos-20-2  Ready    master  363d  v1.18.3
centos-20-3  Ready    <none>  363d  v1.18.3
centos-20-4  Ready    <none>  363d  v1.18.3
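For a scripted version of this final check, a little awk over `kubectl get nodes --no-headers` fails loudly if any node is not Ready. A sketch; the node list is inlined here (mirroring the output above) so the snippet runs without a cluster:

```shell
#!/bin/sh
# Sketch: fail if any node is not Ready.
# On a real cluster, replace the here-doc with:
#   kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1}'
not_ready=$(awk '$2 != "Ready" {print $1}' <<'EOF'
centos-20-2  Ready    master  363d  v1.18.3
centos-20-3  Ready    <none>  363d  v1.18.3
centos-20-4  Ready    <none>  363d  v1.18.3
EOF
)
if [ -z "$not_ready" ]; then
  echo "all nodes Ready"
else
  echo "not Ready: $not_ready"
  exit 1
fi
```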

