Re-joining a master node after it has been removed
In the previous section, while troubleshooting the master nodes, one master node was removed from the cluster (kubeadm reset was run on it); it was removed at the time to determine whether the cluster itself was the cause of the problem.
I. Recovery steps
- Regenerate the bootstrap token and compute the CA certificate hash with the following commands:
[root@k8s-test-master03 ~]# kubeadm token create
in7k87.9iwm83473t1vehpk
[root@k8s-test-master03 ~]# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
dc40145602096b410d8b60c26fb4767115cb9ee6cb097f887c728993bcaf6dee
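- (Optional) The kubeadm join command used below also needs a --certificate-key so the new control-plane node can download the shared certificates. That value is not regenerated in this walkthrough; if the kubeadm-certs Secret has expired (it is only kept for about two hours), a fresh key can presumably be produced on a healthy master with:
[root@k8s-test-master01 ~]# kubeadm init phase upload-certs --upload-certs
The printed key would then replace the --certificate-key value in the join command. For plain worker nodes, kubeadm token create --print-join-command prints a ready-made join command (without the control-plane flags).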
- Run kubeadm reset on the removed node (k8s-test-master03):
[root@k8s-test-master03 ~]# kubeadm reset
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
W0630 15:47:30.957297 21465 reset.go:96] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get node registration: failed to get node name from kubelet config: open /etc/kubernetes/kubelet.conf: no such file or directory
[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W0630 15:47:34.091929 21465 removeetcdmember.go:79] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] No etcd config found. Assuming external etcd
[reset] Please, manually reset etcd to prevent further issues
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/cni]
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
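As the output above notes, kubeadm reset leaves iptables/IPVS rules and the local kubeconfig untouched. If a full cleanup of the node is wanted, something along these lines can be run manually (a sketch only; flush the rules only if nothing else on the host depends on them):
[root@k8s-test-master03 ~]# iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
[root@k8s-test-master03 ~]# ipvsadm --clear
[root@k8s-test-master03 ~]# rm -rf $HOME/.kube/config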
- Run kubeadm join to re-add the node as a control-plane member, using the token and CA certificate hash generated above:
[root@k8s-test-master03 ~]# kubeadm join 10.2.135.250:8443 --token in7k87.9iwm83473t1vehpk --discovery-token-ca-cert-hash sha256:dc40145602096b410d8b60c26fb4767115cb9ee6cb097f887c728993bcaf6dee --control-plane --certificate-key df65c6f2cc3a0846a3c5f7459a92050e3093ee55a03fd06dcec94e2eca0129eb
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [k8s-test-master03 localhost] and IPs [10.2.135.105 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [k8s-test-master03 localhost] and IPs [10.2.135.105 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [k8s-test-master03 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.k8s.cluster.local] and IPs [10.4.128.1 10.2.135.105 10.2.135.250]
[certs] Generating "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://10.2.135.105:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
The key message "error execution phase check-etcd" indicates that the failure happened during the etcd health check of the join phase, which is why the master could not rejoin the existing Kubernetes cluster.
II. Analyzing the problem
- Check the cluster node list
First, inspect the existing nodes with kubectl:
[root@k8s-test-master01 ~]# kubectl get nodes
NAME                STATUS   ROLES    AGE    VERSION
k8s-test-master01   Ready    master   255d   v1.16.2
k8s-test-master02   Ready    master   255d   v1.16.2
k8s-test-node01     Ready    <none>   95d    v1.16.2
k8s-test-node02     Ready    <none>   255d   v1.16.2
k8s-test-node03     Ready    <none>   255d   v1.16.2
k8s-test-node04     Ready    <none>   73d    v1.16.2
As shown, k8s-test-master03 is indeed no longer in the node list.
- Check the kubeadm configuration
Next, look at the kubeadm configuration stored in the cluster:
[root@k8s-test-master01 ~]# kubectl describe configmaps kubeadm-config -n kube-system
The output is as follows:
Name: kubeadm-config
Namespace: kube-system
Labels: <none>
Annotations: <none>
Data
====
ClusterConfiguration:
----
apiServer:
  extraArgs:
    authorization-mode: Node,RBAC
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta2
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controlPlaneEndpoint: 10.2.135.250:8443
controllerManager: {}
dns:
  type: CoreDNS
etcd:
  local:
    dataDir: /var/lib/etcd
imageRepository: registry.cn-hangzhou.aliyuncs.com/google_containers
kind: ClusterConfiguration
kubernetesVersion: v1.16.2
networking:
  dnsDomain: k8s.cluster.local
  podSubnet: 10.4.192.0/18
  serviceSubnet: 10.4.128.0/18
scheduler: {}
ClusterStatus:
----
apiEndpoints:
  k8s-test-master01:
    advertiseAddress: 10.2.135.103
    bindPort: 6443
  k8s-test-master02:
    advertiseAddress: 10.2.135.104
    bindPort: 6443
  k8s-test-master03:
    advertiseAddress: 10.2.135.105
    bindPort: 6443
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterStatus
Events: <none>
You can see that the k8s-test-master03 entry still exists in the kubeadm configuration (under ClusterStatus), which means etcd is still storing membership information for k8s-test-master03.
- Root cause and solution
Because the cluster was built with kubeadm using the stacked etcd topology (etcd runs as a container alongside each master), every master node hosts an etcd member. When the master node was removed, its etcd member was never removed from the etcd cluster, so the stale member is still present in the etcd member list.
Therefore, we need to go into etcd and remove the stale member information manually.
III. Fixing the problem
1. Get the etcd pod list
First, list the etcd pods in the cluster:
[root@k8s-test-master01 ~]# kubectl get pods -n kube-system|grep etcd
etcd-k8s-test-master01 1/1 Running 0 255d
etcd-k8s-test-master02 1/1 Running 0 255d
2. Exec into an etcd container and remove the stale member
Pick either of the two etcd pods above and exec into it with kubectl:
[root@k8s-test-master01 ~]# kubectl exec -it etcd-k8s-test-master01 sh -n kube-system
Once inside the container, run the following:
# export ETCDCTL_API=3
# alias etcdctl='etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key'
# etcdctl member list
501e067f89adf844, started, k8s-test-master02, https://10.2.135.104:2380, https://10.2.135.104:2379
5a2eb770e1acc51c, started, k8s-test-master01, https://10.2.135.103:2380, https://10.2.135.103:2379
609d044994869d03, started, k8s-test-master03, https://10.2.135.105:2380, https://10.2.135.105:2379
# etcdctl member remove 609d044994869d03
Member 609d044994869d03 removed from cluster b40c215cd54426ac
# etcdctl member list
501e067f89adf844, started, k8s-test-master02, https://10.2.135.104:2380, https://10.2.135.104:2379
5a2eb770e1acc51c, started, k8s-test-master01, https://10.2.135.103:2380, https://10.2.135.103:2379
# exit
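The same inspection and removal can also be done without an interactive shell, by passing the command straight to kubectl exec (a sketch, assuming the same certificate paths inside the etcd container as above):
[root@k8s-test-master01 ~]# kubectl -n kube-system exec etcd-k8s-test-master01 -- sh -c 'ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member list'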
3. Run kubeadm reset on k8s-test-master03 again
[root@k8s-test-master03 ~]# ip addr|grep 10.
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
inet 10.2.135.105/24 brd 10.2.135.255 scope global ens160
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN qlen 1000
4: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN qlen 1000
[root@k8s-test-master03 ~]# kubeadm reset
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
W0630 16:02:10.154744 5030 reset.go:96] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get node registration: failed to get corresponding node: nodes "k8s-test-master03" not found
[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W0630 16:02:12.130860 5030 removeetcdmember.go:79] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] No etcd config found. Assuming external etcd
[reset] Please, manually reset etcd to prevent further issues
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/cni]
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
[root@k8s-test-master03 ~]# cd /etc/kubernetes/
[root@k8s-test-master03 kubernetes]# ls
manifests pki
[root@k8s-test-master03 kubernetes]# mv manifests manifests_bak
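Before rejoining, a quick check (optional; assuming Docker is the container runtime, as the cgroup-driver warning earlier suggests) that no stale control-plane containers are still running on this node can save another failed attempt:
[root@k8s-test-master03 ~]# docker ps | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'
Nothing should be listed at this point.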
4. Join the cluster again
[root@k8s-test-master03 ~]# kubeadm join 10.2.135.250:8443 --token in7k87.9iwm83473t1vehpk --discovery-token-ca-cert-hash sha256:dc40145602096b410d8b60c26fb4767115cb9ee6cb097f887c728993bcaf6dee --control-plane --certificate-key df65c6f2cc3a0846a3c5f7459a92050e3093ee55a03fd06dcec94e2eca0129eb
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [k8s-test-master03 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.k8s.cluster.local] and IPs [10.4.128.1 10.2.135.105 10.2.135.250]
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [k8s-test-master03 localhost] and IPs [10.2.135.105 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [k8s-test-master03 localhost] and IPs [10.2.135.105 127.0.0.1 ::1]
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.16" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Activating the kubelet service
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[mark-control-plane] Marking the node k8s-test-master03 as control-plane by adding the label "node-role.kubernetes.io/master=''"
[mark-control-plane] Marking the node k8s-test-master03 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
This node has joined the cluster and a new control plane instance was created:
* Certificate signing request was sent to apiserver and approval was received.
* The Kubelet was informed of the new secure connection details.
* Control plane (master) label and taint were applied to the new node.
* The Kubernetes control plane instances scaled up.
* A new etcd member was added to the local/stacked etcd cluster.
To start administering your cluster from this node, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Run 'kubectl get nodes' to see this node join the cluster.
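To verify the recovery, check the node list and the etcd pods again from any master; k8s-test-master03 should reappear as a master (and report Ready once the kubelet and CNI settle), and a third etcd pod, etcd-k8s-test-master03, should be running:
[root@k8s-test-master01 ~]# kubectl get nodes
[root@k8s-test-master01 ~]# kubectl get pods -n kube-system | grep etcd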
- Notes:
1. After the earlier reset, /etc/kubernetes/manifests was not removed (rm -rf /etc/kubernetes/manifests), and the node still would not come up: etcd did not start, so the apiserver could not start either. After deleting the stale manifests, running kubeadm reset once more, and rejoining, it succeeded.
2. A kubeadm/kubelet/kubectl version mismatch with the cluster also prevented the node from joining; the packages were later reinstalled at the same version as the cluster. Otherwise, the following error is reported:
[root@k8s-test-master03 ~]# kubeadm join 10.2.135.250:8443 --token in7k87.9iwm83473t1vehpk --discovery-token-ca-cert-hash sha256:dc40145602096b410d8b60c26fb4767115cb9ee6cb097f887c728993bcaf6dee --control-plane --certificate-key afd2a7d1ef6316fcad8adccb9caf6a93741909660140f5700664fd9d063cbd4f
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: this version of kubeadm only supports deploying clusters with the control plane version >= 1.17.0. Current version: v1.16.2
k8s-test-master03 had v1.18.2 installed, while the original cluster runs v1.16.2; the packages had to be downgraded to match (see the sketch after these notes).
3. There was also a yum problem, fixed by rebuilding the RPM database: rpm --rebuilddb
4. Clean up unused images: docker system prune -a
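A sketch of re-aligning the packages with the cluster version on CentOS, as referenced in note 2 (package names and the -0 release suffix are assumed to match the configured Kubernetes yum repository):
[root@k8s-test-master03 ~]# yum remove -y kubeadm kubelet kubectl
[root@k8s-test-master03 ~]# yum install -y kubeadm-1.16.2-0 kubelet-1.16.2-0 kubectl-1.16.2-0
[root@k8s-test-master03 ~]# systemctl enable kubelet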