Recovering a production etcd node after a power failure

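A customer's on-site cluster lost power unexpectedly, and we restored it remotely around noon. Starting the etcd service failed with the following error: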

member c77b7b06d2075637 has already been bootstrapped

The explanation we found for this error:
One of the member was bootstrapped via discovery service. You must remove the previous data-dir to clean up the member information. Or the member will ignore the new configuration and start with the old configuration. That is why you see the mismatch.
In other words, member information left in an old data directory takes precedence over the new configuration.
That makes the cause clear: startup fails because the information recorded in the data-dir (/var/lib/etcd/default.etcd) does not match the options etcd is started with.
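Before changing anything, it can help to confirm that the data directory really holds state from an earlier bootstrap. A minimal sketch, assuming the usual etcd on-disk layout in which the write-ahead log lives under member/wal inside the data directory; the helper name `has_bootstrap_state` and the `DATA_DIR` variable are hypothetical, and the path to check should be whatever --data-dir points at on the node:

```shell
#!/bin/sh
# Returns 0 if DIR looks like an etcd data dir that has already been bootstrapped.
has_bootstrap_state() {
  # etcd creates <data-dir>/member/wal and writes WAL files there on first start.
  [ -d "$1/member/wal" ] && [ -n "$(ls -A "$1/member/wal" 2>/dev/null)" ]
}

# Defaults to the data dir named in this article; override with DATA_DIR=...
data_dir="${DATA_DIR:-/var/lib/etcd/default.etcd}"
if has_bootstrap_state "$data_dir"; then
  echo "$data_dir contains existing member state"
else
  echo "$data_dir has no member state"
fi
```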

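Fix: remove this node's etcd member from the cluster, delete its local data (it is resynchronized from the other members afterwards), and then add the node back to the cluster.
1. List the current etcd members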

export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem  member list
c666144c29031acd, started, etcd-host0, https://20.140.249.65:2380, https://20.140.249.65:2379
c77b7b06d2075637, started, etcd-host1, https://20.140.249.66:2380, https://20.140.249.66:2379
f11a3a48abfa96dd, started, etcd-host2, https://20.140.249.67:2380, https://20.140.249.67:2379
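The member ID needed for the removal step can be picked out of this listing by name. A minimal sketch, assuming the comma-separated output format shown above; `member_id_by_name` is a hypothetical helper, not part of etcdctl:

```shell
#!/bin/sh
# Print the member ID for NAME, reading `etcdctl member list` output on stdin.
# Columns are: ID, status, name, peer URLs, client URLs.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Sample input copied from the listing above; in practice, pipe etcdctl into the helper.
member_list='c666144c29031acd, started, etcd-host0, https://20.140.249.65:2380, https://20.140.249.65:2379
c77b7b06d2075637, started, etcd-host1, https://20.140.249.66:2380, https://20.140.249.66:2379
f11a3a48abfa96dd, started, etcd-host2, https://20.140.249.67:2380, https://20.140.249.67:2379'

printf '%s\n' "$member_list" | member_id_by_name etcd-host1   # prints c77b7b06d2075637
```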

2. Remove the failing member (run this on a node whose etcd is still healthy)

export ETCDCTL_API=3
[root@ga-k8s1 data]# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem  member remove c77b7b06d2075637
Member c77b7b06d2075637 removed from cluster 7ab1847bce8f7723
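Removing a member shrinks the cluster from three voting members to two, and a two-member cluster needs both members up to keep quorum, so it is worth re-adding the node promptly. The arithmetic as a small sketch (the function names `quorum` and `fault_tolerance` are made up for illustration):

```shell
#!/bin/sh
# Quorum for a cluster of N voting members: floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 2; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
# prints:
# members=3 quorum=2 tolerated_failures=1
# members=2 quorum=2 tolerated_failures=0
```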

3. Edit /usr/lib/systemd/system/etcd.service on the failed node

[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/app/etcd/
ExecStart=/usr/local/bin/etcd \
  --name=etcd-host1  \
  --data-dir=/app/etcd \
  --cert-file=/etc/etcd/ssl/etcd.pem \
  --key-file=/etc/etcd/ssl/etcd-key.pem \
  --trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-cert-file=/etc/etcd/ssl/etcd.pem \
  --peer-key-file=/etc/etcd/ssl/etcd-key.pem \
  --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-client-cert-auth \
  --client-cert-auth \
  --initial-advertise-peer-urls=https://20.140.249.66:2380 \
  --listen-peer-urls=https://20.140.249.66:2380 \
  --listen-client-urls=https://20.140.249.66:2379,https://127.0.0.1:2379 \
  --advertise-client-urls=https://20.140.249.66:2379 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=etcd-host0=https://20.140.249.65:2380,etcd-host1=https://20.140.249.66:2380,etcd-host2=https://20.140.249.67:2380 \
  --initial-cluster-state=existing
# --initial-cluster-state above was changed from "new" to "existing"
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
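A wrong --name or peer URL is an easy mistake when a unit file is copied between nodes, so a quick consistency check before starting can save a failed start. The sketch below assumes each flag is written as --flag=value followed by whitespace, as in the unit above; `flag_value` and `check_unit` are hypothetical helper names:

```shell
#!/bin/sh
# Print the value of --FLAG=value from an etcd unit file ($1 = file, $2 = flag name).
flag_value() {
  grep -o -- "--$2=[^[:space:]]*" "$1" | head -n 1 | cut -d= -f2-
}

# Verify that "<name>=<initial-advertise-peer-urls>" is one of the entries
# in --initial-cluster; etcd refuses to start when they disagree.
check_unit() {
  unit="$1"
  name=$(flag_value "$unit" name)
  peer=$(flag_value "$unit" initial-advertise-peer-urls)
  cluster=$(flag_value "$unit" initial-cluster)
  # Split the comma-separated name=url pairs and look for an exact match.
  if printf '%s\n' "$cluster" | tr ',' '\n' | grep -Fxq -- "$name=$peer"; then
    echo "OK: $name=$peer is listed in --initial-cluster"
  else
    echo "MISMATCH: $name=$peer is not listed in --initial-cluster"
    return 1
  fi
}

unit_file=/usr/lib/systemd/system/etcd.service
if [ -f "$unit_file" ]; then
  check_unit "$unit_file"
fi
```

The check only compares the three flags against each other; it does not contact the cluster.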

4. Delete the old data

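rm -rf /var/lib/etcd/
rm -rf /app/etcd/  # the WorkingDirectory and --data-dir set in the unit file
mkdir -p /app/etcd  # recreate it empty; systemd cannot start the unit if WorkingDirectory is missing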
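Because rm -rf cannot be undone, a more cautious variant is to move the directory aside and keep it until the node has rejoined. A minimal sketch; `reset_data_dir` is a hypothetical helper, and the timestamped `.bak.` suffix is just a convention chosen here:

```shell
#!/bin/sh
# Move DIR aside to a timestamped backup, then recreate it empty.
reset_data_dir() {
  dir="${1%/}"                                   # tolerate a trailing slash
  backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
  if [ -d "$dir" ]; then
    mv "$dir" "$backup" || return 1
    echo "moved $dir to $backup"
  fi
  # The directory must exist again because the unit uses it as WorkingDirectory.
  # 0700 is used because recent etcd releases check the data dir permissions.
  mkdir -p "$dir" && chmod 700 "$dir"
}

# Usage (not executed here, since it changes the filesystem):
#   reset_data_dir /app/etcd
```

Once the member is back and healthy, the backup directory can be deleted.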

5. Add the etcd node back to the cluster (again from a healthy node)

export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem  member add etcd-host1 --peer-urls=https://20.140.249.66:2380

6. Start etcd; the re-added node resynchronizes its data from the other two nodes

systemctl daemon-reload && systemctl start etcd
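Whether the node has rejoined can be confirmed from a healthy member's listing. A rough sketch that polls until the node's peer URL is reported as started; it assumes the v3 member list format shown in step 1 and that https://20.140.249.65:2379 is a healthy endpoint, and the names `list_members`, `peer_is_started`, `wait_for_peer` and `POLL_INTERVAL` are invented here:

```shell
#!/bin/sh
# Print the v3 member list as seen by a healthy member (endpoint is an assumption).
list_members() {
  ETCDCTL_API=3 etcdctl --endpoints=https://20.140.249.65:2379 \
    --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem \
    --key=/etc/etcd/ssl/etcd-key.pem member list
}

# Succeeds if the member with peer URL $1 has status "started" (listing on stdin).
peer_is_started() {
  awk -F', ' -v url="$1" '$4 == url && $2 == "started" { found = 1 } END { exit !found }'
}

# Poll up to $2 times (default 30), sleeping POLL_INTERVAL seconds between attempts.
wait_for_peer() {
  url="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if list_members | peer_is_started "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  return 1
}

# Usage:
#   wait_for_peer https://20.140.249.66:2380 && echo "etcd-host1 has rejoined"
```

Matching on the peer URL rather than the name is deliberate: a member that has been added but has not started yet is listed without a name.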