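If the first node you replace is the first control plane node, be sure to follow the procedure below.
In a kubespray deployment, the node listed first is assumed to perform cluster initialization, for both etcd and kubeadm. So if you are replacing the first node, you must first move a different, healthy node into the first position.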
### 1) Change control plane nodes order in inventory
from

```ini
[kube_control_plane]
node-1
node-2
node-3
```

to

```ini
[kube_control_plane]
node-2
node-3
node-1
```
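Step 1 is easy by hand, but the reordering can also be scripted. This is a minimal sketch assuming an INI inventory with one host per line and the control plane hosts listed contiguously. The function name `rotate_first_control_plane` is hypothetical. Like the step above, it only touches `[kube_control_plane]`, and it writes the result to stdout rather than editing the file in place:

```sh
# Hypothetical helper: move the first host under [kube_control_plane] to the
# end of that section (node-1,node-2,node-3 -> node-2,node-3,node-1).
# Assumes one host per line and a contiguous host list; prints to stdout.
rotate_first_control_plane() {
  awk '
    # Print the held-back first host, if any.
    function flush() { if (held != "") { print held; held = "" } }
    # A new [section] header ends the previous section.
    /^\[/ {
      if (in_cp) flush()
      in_cp = ($0 == "[kube_control_plane]")
      pending = in_cp
      print
      next
    }
    # A blank line also ends the contiguous host list.
    in_cp && NF == 0 { flush(); in_cp = 0; print; next }
    # Hold back the first host line (comment lines are passed through).
    in_cp && pending && $0 !~ /^[#;]/ { held = $0; pending = 0; next }
    { print }
    END { if (in_cp) flush() }
  ' "$1"
}
```

For example, `rotate_first_control_plane inventory/mycluster/inventory.ini > /tmp/inventory.ini` (hypothetical paths). Then diff the two files before replacing the original.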
### 2) Remove old first control plane node from cluster
With the old node still in the inventory, run `remove-node.yml`. You need to pass `-e node=node-1` to the playbook to limit the execution to the node being removed. If the node you want to remove is not online, you should add `reset_nodes=false` and `allow_ungraceful_removal=true` to your extra-vars.
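The invocation described above can be assembled as shown below. This is a sketch only. The helper name `remove_node_cmd` and the default inventory path `inventory/mycluster/hosts.yaml` are hypothetical, and `-b` (become) is assumed because kubespray playbooks run as root. The function only prints the command so you can review it before running it:

```sh
# Hypothetical wrapper: prints (does not run) the remove-node.yml command.
# Usage: remove_node_cmd NODE [online|offline]
# INVENTORY defaults to an assumed path; set it to your own inventory.
remove_node_cmd() {
  node=$1
  state=${2:-online}
  inventory=${INVENTORY:-inventory/mycluster/hosts.yaml}
  cmd="ansible-playbook -i $inventory -b remove-node.yml -e node=$node"
  if [ "$state" = offline ]; then
    # The node is unreachable: skip its reset and allow ungraceful removal.
    cmd="$cmd -e reset_nodes=false -e allow_ungraceful_removal=true"
  fi
  printf '%s\n' "$cmd"
}
```

For example, run `eval "$(remove_node_cmd node-1 offline)"` once the printed command looks right.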
### 3) Edit cluster-info configmap in kube-public namespace

```sh
kubectl edit cm -n kube-public cluster-info
```

Replace the IP of the old kube_control_plane node with the IP of a live kube_control_plane node (`server` field). Also update the `certificate-authority-data` field if you changed certs.
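If you prefer not to edit interactively, the `server` change can be scripted. This is a minimal sketch assuming the API server listens on port 6443, as in this cluster. The function name `rewrite_cluster_info_server` is hypothetical. It only rewrites `server` and leaves `certificate-authority-data` untouched:

```sh
# Hypothetical filter: rewrite the apiserver address in cluster-info YAML on
# stdin. Assumes port 6443; only the "server:" URL is changed.
rewrite_cluster_info_server() {
  old_ip=$1
  new_ip=$2
  # Escape dots so sed matches the old IP literally.
  old_re=$(printf '%s' "$old_ip" | sed 's/\./\\./g')
  sed "s#server: https://$old_re:6443#server: https://$new_ip:6443#"
}

# Usage (IPs from this cluster):
#   kubectl -n kube-public get cm cluster-info -o yaml \
#     | rewrite_cluster_info_server 10.120.33.146 10.120.34.53 \
#     | kubectl replace -f -
```

`kubectl replace` is used instead of `apply` because the piped object comes from `get` and still carries its `resourceVersion`.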
### 4) Add new control plane node
- Update inventory (if needed).
- Run `cluster.yml` with `--limit=kube_control_plane`.
Pay particular attention to step 3: switch `cluster-info` from pointing at master-1 to pointing at master-2. The change looked like this (`<` is the old value, `>` the new one):

```
< certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeU1Ea3lOREV6TURVMU1sb1hEVE15TURreU1URXpNRFUxTWxvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBSjk1Cmw4NXhaUU56akI3VE9sY3FmUSs5OHU3cWZWYkFCMmx1c2NXVGR6NEF4N3hBME1id2U4REJTd1BXUnpEbURUczkKYWJha056VGpEeDUralJlUldaU29EdmZHNDhTaTVUeFdybDRROVdYNGpjMXhUQjJCTDNWTklqUFFBTUxuK0hOaAozVkQ3VjJaYkJLaDRySUpIaEZlVERDV3U1S3kweUtGYnFqS2gvUXZDbUJ1QlJkZlVaQkdha1pFbVZYOWlKd3YvClZlazRkb2pyb3Q4emNRajhGazVQd0RUeE0zREc5My8zS3MySnd3RTBJOWhkZTlBdDlPZTRzdmtuUmgyOTdlb0QKMXltZjRmc1YzU2IxbGFSbG82MnpTblRkWjJXWmhXVHFGK0ZsQ1pNTGs2d3M4dkE4VWdwTFl4U2w5N0tEV0srVgpteVEzeHB3ZTR2TjUxdWpYZU84Q0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZNNzExekJHNy9MNGFoVk5hZjBqNVVUYVRpdlBNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFDV3VsMUo5aHVLUldTSmJoVWU3STdiNVBzMXJRWnNucFZmalRrYkdUdXJ4bTZScWZ2QQpuc0x5NTNiM0swSnRYeFJTK0pTeWFtR000Zzcxck9MRGx1SkJJcFVoVzR4VU9SU0duNDM1cmI1TjNRekZ5RnJsCkwxVCt1YytEY0pFUFg0T092SUlvSzhMbTlNaW1FNXJBcW9JWFpLcVZDZ25UWGN0QlpTOUFZOW1NWUVoaHphOCsKWXJjWTZjMjJZSWIxb0oxMVlDbndiVUZkNm9VVW1YYXUxV3Y0MnFJNHRxYlNMQ0VxRTlZNERlREdBdkpGNFZqawp1cG9sYjQzVzVIWE1wb0gyOGxhejVadE5aMzRqbS83RTM0ZkhJZHAzOVZoWE1LVk9pTlhXKysvaWN3ckVEUit2CjRMelZGTzYyR1dnUDNiRWVreW5hN1F1Rmh5OWs5b0dRVlJWSQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
< server: https://10.120.33.146:6443
---
> certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeE1EZ3hNakEyTVRreU0xb1hEVE14TURneE1EQTJNVGt5TTFvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTUVOCmxrNll5ZWtwMSs0RHQzS1BtajVhSHZZZ2J4Q0krSWpVSmV0TVp2UmNzRXR3TmU1YU9yRWZFZXdOK3NyeFRJUVkKSEJHSkxIbUMzZ0VzQjNjdmdIWDZZaGlKWVZMYVZBVi9aN1puK0J3cFllbVVKaWFkNXBxMTZvUDl4dXlZZHpaZgpybCtYajdMai9HdGFYQXNrNmZSS2hzTXVyMmlBMmpBTkwzRG8yZEdHRUtleVNIQVFBaEZqTEErSk9SdElKZEYzCjZyWndsdDNOM2MweFRHU0Y5OGJqMFl5MDR4cG1qVXp1cHVQUWovOEVTT1JaUkhxS0FoRHJLeU5vWDBHbzhSY1AKKzNoa3dvTnVZbit0dTg3Mzk5dG1lUU5DMXpKQkpZemFVQTAxMkRiSWk5bzltcnNTZklpSEtQTjlqeGlFU1BhTAphWTJUV0FWeGt2UzQxY1V1M2hNQ0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZFaGRIc3BySThhUVJhV3J0NWhaZFBocTJtazRNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFBbVAyaStUdGQ3czVKdEIvWkFoMVBFWkV2V0NLTXpWcHdYaW1NUzlhOWxldjdlRFp0VApzSEdYMzhSeUhDNGdKb2N3S1VXamZ5YlBFUnJkTTY1cUN2SXVONW9nQmphZU1iYjRNTUpnM0d4cE45a3RvaU9PCktsa1hKblVHZm83MkpCNTBTSnpJdGthbHFPelhENkgzbzUxQTNYbHp6MUZENTdhRERFZEkxMUZJY2ozTk4vVkoKaVRzSHZyaVd4MGtDK0V1eXhYWE9ma1p3VEkrSjFnMWx2NkZPYW9ZcWZhYVpVQ3cyTmFLc1dMTG9FT2FiNG15TgptV25pQ1M2Q2h6K2xBa2Q5N0w5ck12WmRKZWxlMEJNWmZXSGZTbzJRSlRvc0dMdDdWY2YrVlRmSE9vQlRBNGlXCmpwLzVINVVZdmJrQUV1SmpVV1hCYTZLNTR5N3JJdEhBeUVidwotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
> server: https://10.120.34.53:6443
```

The new values can be copied directly from `/etc/kubernetes/admin.conf` (e.g. `cat /etc/kubernetes/admin.conf`). Only the management IP needs changing.
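Selecting a base64 value this long by hand is error-prone. The sketch below prints it instead. It assumes a kubeconfig with a single cluster entry, and the helper name `kubeconfig_ca_data` is hypothetical:

```sh
# Hypothetical helper: print the certificate-authority-data value from a
# kubeconfig file such as /etc/kubernetes/admin.conf.
# Assumes a single cluster entry; only the first match is printed.
kubeconfig_ca_data() {
  awk '$1 == "certificate-authority-data:" { print $2; exit }' "$1"
}
```

For example, `kubeconfig_ca_data /etc/kubernetes/admin.conf | base64 -d | openssl x509 -noout -subject -enddate` decodes the value. This lets you confirm which CA it is before pasting it into cluster-info.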
If you skip this change, master-1 cannot join the cluster after you re-run `cluster.yml`. The error is:

```console
[root@pro-k8s-master-1 ~]# cat /etc/kubernetes/kubeadm-controlplane.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: 10.120.34.53:6443
    token: m49jmj.fv3zqgm57tnwgtor
    unsafeSkipCAVerification: true
  timeout: 5m0s
  tlsBootstrapToken: m49jmj.fv3zqgm57tnwgtor
controlPlane:
  localAPIEndpoint:
    advertiseAddress: 10.120.33.146
    bindPort: 6443
  certificateKey: c8d4ef0b01aa6e54caed3d6fd2a1da2a7ada69b3833aeadaf3bac8a81cd01cfa
nodeRegistration:
  name: pro-k8s-master-1
  criSocket: /var/run/dockershim.sock
[root@pro-k8s-master-1 ~]# /usr/local/bin/kubeadm join --config /etc/kubernetes/kubeadm-controlplane.yaml --ignore-preflight-errors=all
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get "https://10.120.33.146:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": dial tcp 10.120.33.146:6443: connect: connection refused
To see the stack trace of this error execute with --v=5 or higher
```
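This happens because a joining node reads the cluster CA verification data and the kube-apiserver endpoint from the `cluster-info` ConfigMap in `kube-public`. Although `apiServerEndpoint` above is 10.120.34.53, kubeadm then used the `server` stored in cluster-info, which still pointed at the removed node 10.120.33.146.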
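To see which endpoint a joining node will be handed, read the `server` field from the kubeconfig embedded in `cluster-info`. This is a small sketch; the helper name `cluster_info_server` is hypothetical. It assumes a single cluster entry, which is how kubeadm writes cluster-info:

```sh
# Hypothetical helper: print the first "server:" URL from kubeconfig text on
# stdin (cluster-info normally holds exactly one cluster entry).
cluster_info_server() {
  awk '$1 == "server:" { print $2; exit }'
}

# Usage:
#   kubectl -n kube-public get cm cluster-info -o jsonpath='{.data.kubeconfig}' \
#     | cluster_info_server
```

If this still prints the removed node's address (10.120.33.146 here), the join fails as in the log above.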
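There are two node-join scenarios:
- Joining a regular worker node
- Joining a control plane node

Two settings deserve attention here:
- The `cluster-info` ConfigMap in `kube-public`. It stores the cluster CA verification data and the cluster's kube-apiserver endpoint, and both scenarios read it. If the endpoint is a load-balancer VIP, nothing needs to change. If it is a node IP, it must be updated here, otherwise new nodes cannot join.
- The first control plane node must run `kubeadm init` with `--upload-certs`. `kubeadm init` generates the CA certificate, key and the other certificate files. With this flag set, they are synced automatically into `/etc/kubernetes/ssl` when a new control plane node joins (with `--certificate-key`).

A control plane join command looks like this:

```sh
# Control plane nodes must pass --control-plane
kubeadm join 10.120.34.53:6443 --token 5r6s2m.g7wimq9aoist154w \
  --discovery-token-ca-cert-hash sha256:eadb3051b0ea751f058de8805e3c2569769ae8346a889acb835b542b22840d58 \
  --control-plane \
  --certificate-key 6ed53f56e64f12c0cb7a3024203e2a99e95aaaacc52b58b8100a025ff577257e
```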
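The `--discovery-token-ca-cert-hash` value used for joining is `sha256:` followed by the SHA-256 of the CA certificate's DER-encoded public key. The sketch below recomputes it on a live control plane node, following the `openssl` pipeline from the kubeadm documentation. The function name `ca_cert_hash` is hypothetical, and it assumes an RSA CA key (kubeadm's default):

```sh
# Hypothetical helper: print the --discovery-token-ca-cert-hash value for a
# CA certificate file. Assumes an RSA CA key, as kubeadm generates by default.
ca_cert_hash() {
  hash=$(openssl x509 -pubkey -noout -in "$1" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex \
    | sed 's/^.* //')
  printf 'sha256:%s\n' "$hash"
}
```

Run it as `ca_cert_hash /etc/kubernetes/ssl/ca.crt`; that CA path is an assumption based on the kubespray certificate directory mentioned above. Alternatively, `kubeadm token create --print-join-command` prints a complete join command with a fresh token. `kubeadm init phase upload-certs --upload-certs` re-uploads the control plane certificates and prints a new key for `--certificate-key`.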
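Replacing the third node
After the third node is shut down, etcd temporarily loses its leader. In this case the election appeared to take more than 15 seconds.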
```console
[root@pro-k8s-master-2 ~]# hostname=`hostname`
[root@pro-k8s-master-2 ~]# export ETCDCTL_API=3
[root@pro-k8s-master-2 ~]# export ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-$hostname.pem
[root@pro-k8s-master-2 ~]# export ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-$hostname-key.pem
[root@pro-k8s-master-2 ~]# export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
[root@pro-k8s-master-2 ~]# export ETCDCTL_ENDPOINTS="https://10.120.33.146:2379,https://10.120.34.53:2379,https://10.120.35.101:2379"
[root@pro-k8s-master-2 ~]# # verify
[root@pro-k8s-master-2 ~]# etcdctl member list
89b51fbdfb2a9906, started, etcd2, https://10.120.34.53:2380, https://10.120.34.53:2379, false
8fc45151ffa61a8e, started, etcd1, https://10.120.33.146:2380, https://10.120.33.146:2379, false
d2fccd85a33c58f9, started, etcd3, https://10.120.35.101:2380, https://10.120.35.101:2379, false
[root@pro-k8s-master-2 ~]# etcdctl endpoint status -w table
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.120.33.146:2379 | 8fc45151ffa61a8e | 3.4.13  | 100 MB  | false     | false      |      1561 |  284432261 |          284432261 |        |
| https://10.120.34.53:2379  | 89b51fbdfb2a9906 | 3.4.13  | 100 MB  | true      | false      |      1561 |  284432262 |          284432261 |        |
| https://10.120.35.101:2379 | d2fccd85a33c58f9 | 3.4.13  | 100 MB  | false     | false      |      1561 |  284432262 |          284432262 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```
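The session that follows logs out and back in. Right afterwards, `etcdctl` targets `127.0.0.1:2379`, its default endpoint, which suggests the exported `ETCDCTL_*` variables were gone. One option is to keep them in a file you can re-source. This is a minimal sketch using the kubespray certificate paths and endpoints from this cluster; the helper name `write_etcdctl_env` and the file `~/etcdctl.env` are hypothetical:

```sh
# Hypothetical helper: save the etcdctl settings used in this session to a
# file so they survive a logout; restore them later with ". FILE".
# uname -n normally prints the same node name as the hostname command.
write_etcdctl_env() {
  out=${1:-$HOME/etcdctl.env}
  host=$(uname -n)
  cat > "$out" <<EOF
export ETCDCTL_API=3
export ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-$host.pem
export ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-$host-key.pem
export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
export ETCDCTL_ENDPOINTS="https://10.120.33.146:2379,https://10.120.34.53:2379,https://10.120.35.101:2379"
EOF
}
```

Run `write_etcdctl_env` once. After each re-login, run `. ~/etcdctl.env` before calling `etcdctl`.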
```console
[root@pro-k8s-master-2 ~]# logout
Connection to pro-k8s-master-2 closed.
[root@deployer ~]# ssh pro-k8s-master-2
Last login: Wed Sep 28 10:47:26 2022 from 10.120.33.122
[root@pro-k8s-master-2 ~]# etcdctl endpoint status -w table
{"level":"warn","ts":"2022-09-28T10:49:44.482Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection closed"}
Failed to get the status of endpoint 127.0.0.1:2379 (context deadline exceeded)
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pro-k8s-master-2 ~]# etcdctl endpoint status -w table
{"level":"warn","ts":"2022-09-28T10:50:09.362Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection closed"}
Failed to get the status of endpoint 127.0.0.1:2379 (context deadline exceeded)
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@pro-k8s-master-2 ~]# hostname=`hostname`
[root@pro-k8s-master-2 ~]# export ETCDCTL_API=3
[root@pro-k8s-master-2 ~]# export ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-$hostname.pem
[root@pro-k8s-master-2 ~]# export ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-$hostname-key.pem
[root@pro-k8s-master-2 ~]# export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
[root@pro-k8s-master-2 ~]# export ETCDCTL_ENDPOINTS="https://10.120.33.146:2379,https://10.120.34.53:2379,https://10.120.35.101:2379"
[root@pro-k8s-master-2 ~]# # verify
[root@pro-k8s-master-2 ~]# etcdctl member list
89b51fbdfb2a9906, started, etcd2, https://10.120.34.53:2380, https://10.120.34.53:2379, false
8fc45151ffa61a8e, started, etcd1, https://10.120.33.146:2380, https://10.120.33.146:2379, false
d2fccd85a33c58f9, started, etcd3, https://10.120.35.101:2380, https://10.120.35.101:2379, false
[root@pro-k8s-master-2 ~]# etcdctl endpoint status -w table
{"level":"warn","ts":"2022-09-28T10:50:27.214Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://10.120.35.101:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint https://10.120.35.101:2379 (context deadline exceeded)
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.120.33.146:2379 | 8fc45151ffa61a8e | 3.4.13  | 100 MB  | false     | false      |      1561 |  284434365 |          284434365 |        |
| https://10.120.34.53:2379  | 89b51fbdfb2a9906 | 3.4.13  | 100 MB  | true      | false      |      1561 |  284434365 |          284434365 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```
First confirm that a leader has been re-elected. If not, wait, and do not start replacing the node right away.
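Here is a minimal sketch of that check. It assumes the `ETCDCTL_*` variables from the session above are exported. It also assumes that `etcdctl endpoint status -w table` keeps the layout shown, with IS LEADER as the fifth column (the sixth `|`-separated field). The function names, the 5-second interval and the `MAX_TRIES` default of 60 are hypothetical choices:

```sh
# Succeeds if any row of "etcdctl endpoint status -w table" output on stdin
# has IS LEADER = true (field 6 when splitting on "|").
etcd_has_leader() {
  awk -F'|' '
    NF > 6 { gsub(/ /, "", $6); if ($6 == "true") found = 1 }
    END { exit !found }
  '
}

# Polls every 5 seconds until a leader is reported; gives up (returns 1)
# after MAX_TRIES attempts. Assumes ETCDCTL_* variables are exported.
wait_for_etcd_leader() {
  tries=0
  max=${MAX_TRIES:-60}
  until etcdctl endpoint status -w table 2>/dev/null | etcd_has_leader; do
    tries=$((tries + 1))
    [ "$tries" -ge "$max" ] && return 1
    sleep 5
  done
}
```

For example, `wait_for_etcd_leader && echo "leader present"`. Unreachable endpoints, such as the shut-down third node, are simply missing from the table, so they do not block the check.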