Preface
Recently our company's k8s cluster ran into a problem: every kubectl command failed with the error below. This post records how the problem was traced back to its cause and how it was fixed, in the hope that it helps others:
The connection to the server 192.168.100.170:6443 was refused - did you specify the right host or port?
Tracing the Problem
Many of you have probably seen this error before. Port 6443 is the default port of the k8s APIServer, so a "connection refused" means either the APIServer is no longer serving or a firewall is blocking the port. Let's first check whether anything is still listening on that port:
netstat -pnlt | grep 6443
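If netstat is not installed on the host, the same check can be done with ss, or by probing the APIServer's health endpoint directly (a sketch; substitute your own APIServer address):
# Equivalent listening-port check with ss (part of iproute2)
ss -tlnp | grep 6443
# Or probe the health endpoint directly; -k skips TLS verification
curl -k https://192.168.100.170:6443/healthz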
Either way, the check returns nothing, which means the APIServer is not serving at all, so the next step is to look at its logs. In a k8s cluster built with kubeadm, the APIServer runs as a container in Docker, so we first need to find the corresponding container. Remember to add -a, because the container may already have exited:
docker ps -a | grep apiserver
# Output
f40d97ee2be6 40a63db91ef8 "kube-apiserver --au…" 2 minutes ago Exited (255) 2 minutes ago k8s_kube-apiserver_kube-apiserver-master1_kube-system_7beef975d93d634ecee05282d3d3a9ac_718
4b866fe71e33 registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_kube-apiserver-master1_kube-system_7beef975d93d634ecee05282d3d3a9ac_0
Two containers show up here, and the APIServer container is already in the Exited state. Note the pause container below it: it only holds the pod sandbox for the APIServer and is not the container that actually runs the service, so it has no useful logs; make sure you don't pass the wrong container id when checking logs. Next, look at the APIServer's logs:
docker logs -f f40d97ee2be6
# Output
I1230 01:39:42.942786 1 server.go:557] external host was not specified, using 192.168.100.171
I1230 01:39:42.942924 1 server.go:146] Version: v1.13.1
I1230 01:39:43.325424 1 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I1230 01:39:43.325451 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I1230 01:39:43.326327 1 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I1230 01:39:43.326340 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
F1230 01:40:03.328865 1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry [https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt true 0xc0004bd440 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:2379: connect: connection refused)
The last line shows that the APIServer failed while trying to create its storage backend, which is why it could not start. Since k8s uses etcd as its storage, the next step is to look at the etcd logs.
Note that in my setup etcd also runs in Docker. If yours runs directly as a systemd service, use systemctl status etcd (or journalctl -u etcd) to look at its logs. Here is how to check the etcd logs under Docker:
# Find the etcd container; note that etcd also has a corresponding pause container
docker ps -a | grep etcd
# Output
1b8b522ee4e8 3cab8e1b9802 "etcd --advertise-cl…" 7 minutes ago Exited (2) 6 minutes ago k8s_etcd_etcd-master1_kube-system_1051dec0649f2b816946cb1fea184325_942
c9440543462e registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_etcd-master1_kube-system_1051dec0649f2b816946cb1fea184325_0
# View the etcd logs
docker logs -f 1b8b522ee4e8
# Output
2019-12-30 01:43:44.075758 I | raft: 92b79bbe6bd2706a is starting a new election at term 165711
2019-12-30 01:43:44.075806 I | raft: 92b79bbe6bd2706a became candidate at term 165712
2019-12-30 01:43:44.075819 I | raft: 92b79bbe6bd2706a received MsgVoteResp from 92b79bbe6bd2706a at term 165712
2019-12-30 01:43:44.075832 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to a25634eca298ea33 at term 165712
2019-12-30 01:43:44.075844 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to 645060e8e879847c at term 165712
2019-12-30 01:43:45.075783 I | raft: 92b79bbe6bd2706a is starting a new election at term 165712
2019-12-30 01:43:45.075818 I | raft: 92b79bbe6bd2706a became candidate at term 165713
2019-12-30 01:43:45.075830 I | raft: 92b79bbe6bd2706a received MsgVoteResp from 92b79bbe6bd2706a at term 165713
2019-12-30 01:43:45.075840 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to 645060e8e879847c at term 165713
2019-12-30 01:43:45.075849 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to a25634eca298ea33 at term 165713
2019-12-30 01:43:45.928418 E | etcdserver: publish error: etcdserver: request timed out
2019-12-30 01:43:46.363974 I | etcdmain: rejected connection from "192.168.100.181:35914" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.364006 I | etcdmain: rejected connection from "192.168.100.181:35912" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.477058 I | etcdmain: rejected connection from "192.168.100.181:35946" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.483326 I | etcdmain: rejected connection from "192.168.100.181:35944" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.575790 I | raft: 92b79bbe6bd2706a is starting a new election at term 165713
2019-12-30 01:43:46.575818 I | raft: 92b79bbe6bd2706a became candidate at term 165714
2019-12-30 01:43:46.575829 I | raft: 92b79bbe6bd2706a received MsgVoteResp from 92b79bbe6bd2706a at term 165714
2019-12-30 01:43:46.575839 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to 645060e8e879847c at term 165714
2019-12-30 01:43:46.575848 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to a25634eca298ea33 at term 165714
2019-12-30 01:43:46.595828 I | etcdmain: rejected connection from "192.168.100.181:35962" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.597536 I | etcdmain: rejected connection from "192.168.100.181:35964" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.709028 I | etcdmain: rejected connection from "192.168.100.181:35970" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.714243 I | etcdmain: rejected connection from "192.168.100.181:35972" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.928411 W | rafthttp: health check for peer a25634eca298ea33 could not connect: dial tcp 192.168.100.191:2380: getsockopt: connection refused
...
You can see etcd looping over these errors until it eventually times out and exits. The key error buried in there is "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid". Anyone who maintains k8s clusters regularly will recognise this one: a certificate has expired again.
This cluster has three masters, 171, 181 and 191. The error messages show that the connections coming from 181 are the ones failing certificate verification, so let's log in to the 181 machine to confirm:
# Go to the k8s certificate directory
cd /etc/kubernetes/pki
# Check the certificate's validity period
openssl x509 -in etcd/server.crt -noout -text |grep ' Not '
# Output
Not Before: Dec 26 08:12:11 2018 GMT
Not After : Dec 26 08:12:11 2019 GMT
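To rule out the other certificates, the same check can be swept over every certificate under /etc/kubernetes/pki; a minimal sketch, assuming the default kubeadm layout:
# Print the expiry date of every certificate kubeadm manages
for crt in $(find /etc/kubernetes/pki -name '*.crt'); do
    echo "== $crt"
    openssl x509 -in "$crt" -noout -enddate
done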
After going through them, it turned out that the k8s certificates themselves were fine, but all of etcd's certificates had expired. (There is a separate write-up on exactly which certificates k8s needs.) Now let's fix the problem:
Fixing the Problem
Note: depending on your k8s version, this part may differ from what you see. The versions I am using are:
root@master1:~# kubelet --version
Kubernetes v1.13.1
root@master1:~# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:36:44Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
If your version differs a lot, search online for a solution that matches it; there are plenty of write-ups out there. Before running any of the commands below, check them against -h first. Warning: the following operations will stop the service, so proceed with care:
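For example, the built-in help shows exactly which sub-phases and flags your kubeadm supports before you commit to anything:
kubeadm init phase certs --help
kubeadm init phase kubeconfig --help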
Back up the original files
cd /etc
cp -r kubernetes kubernetes.bak
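If you want an extra safety net, the etcd data directory can be backed up as well; a sketch, assuming the default kubeadm data path:
cp -r /var/lib/etcd /var/lib/etcd.bak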
Regenerate the certificates
Regenerating the certificates requires the configuration file that was used to initialize the cluster. My kubeadm.yaml looks like this:
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta1
controlPlaneEndpoint: "192.168.100.170:6443"
apiServer:
  certSANs:
  - master1
  - master2
  - master3
  - 192.168.100.170
  - 192.168.100.171
  - 192.168.100.181
  - 192.168.100.191
Here 192.168.100.170 is the VIP, and 171, 181 and 191 correspond to the master1, master2 and master3 hosts respectively. Next, re-issue the certificates using this configuration file; this has to be run on every control-plane node:
kubeadm init phase certs all --config=kubeadm.yaml
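Afterwards it is worth confirming that the freshly issued certificates really have a new validity window, for example:
# Should now show a "Not After" date roughly one year in the future
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -dates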
Regenerate the kubeconfig files
kubeadm init phase kubeconfig all --config kubeadm.yaml
This command also needs to be run once on every control-plane node. The regenerated kubeconfig files are the following (the note after this list covers refreshing your local kubeconfig):
- admin.conf
- controller-manager.conf
- kubelet.conf
- scheduler.conf
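If kubectl on the master was originally set up by copying admin.conf, that local copy still holds the old client certificate and needs refreshing too; a minimal sketch assuming the default locations:
# kubectl reads ~/.kube/config by default, so replace it with the regenerated admin.conf
cp /etc/kubernetes/admin.conf $HOME/.kube/config
# Or point kubectl at it directly for the current shell
export KUBECONFIG=/etc/kubernetes/admin.conf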
Restart k8s on the control-plane nodes
Restart the etcd, apiserver, controller-manager and scheduler containers. In most cases kubectl will work normally again after this; remember to run kubectl get nodes to check the node status.
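One way to bounce those containers (a sketch: kubelet recreates anything defined under /etc/kubernetes/manifests, so restarting the Docker containers directly is usually enough):
# Restart the control-plane containers, excluding the pause (k8s_POD_*) containers
docker ps | grep -E 'k8s_(etcd|kube-apiserver|kube-controller-manager|kube-scheduler)' \
    | awk '{print $1}' | xargs -r docker restart
# Then verify the control plane is back
kubectl get nodes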
Regenerate the worker nodes' configuration files
If the worker nodes still show NotReady in the previous step, their credentials need regenerating as well. This happens if the root CA was also replaced, which invalidates the worker nodes' certificates too. Simply back up and remove the certificates below, then restart kubelet:
mv /var/lib/kubelet/pki /var/lib/kubelet/pki.bak
systemctl daemon-reload && systemctl restart kubelet
If that does not work, copy /etc/kubernetes/pki/ca.crt from a control-plane node into the same directory on the affected worker node and start kubelet again. After roughly three minutes the worker's status on the control-plane node should change to Ready.
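A sketch of that copy step, run from a control-plane node ("node1" here is a hypothetical worker hostname):
# Copy the cluster CA to the worker, then restart kubelet on it
scp /etc/kubernetes/pki/ca.crt root@node1:/etc/kubernetes/pki/ca.crt
ssh root@node1 "systemctl restart kubelet"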
Summary
The fact that k8s certificates are only valid for one year is definitely a bit of a trap, even if the intention of nudging users to upgrade to newer versions is a good one. If your k8s cluster is currently healthy but you have never renewed its certificates, check their expiry dates now; once they expire it is already too late.
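For example, a check like the one below can be dropped into a cron job to warn well before the deadline (a sketch; adjust the path and threshold to taste):
# Exit non-zero and print a warning if the cert expires within 30 days (2592000 seconds)
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -checkend 2592000 \
    || echo "apiserver certificate expires within 30 days"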