Recovering a production etcd server after a power failure

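The cluster at a customer site lost power unexpectedly, and at noon we began recovering it remotely. When we started the etcd service, the following error appeared: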

member c77b7b06d2075637 has already been bootstrapped

The references I found explain it as follows:
One of the member was bootstrapped via discovery service. You must remove the previous data-dir to clean up the member information. Or the member will ignore the new configuration and start with the old configuration. That is why you see the mismatch.
With that, the cause is clear: startup fails because the member information recorded in the data-dir (/var/lib/etcd/default.etcd) does not match the options etcd is being started with.

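Solution: remove this node's etcd member from the cluster, delete its local data (it is re-synced from the other members later), and then add the node back to the etcd cluster.
1. List the current etcd members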

export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem  member list
c666144c29031acd, started, etcd-host0, https://20.140.249.65:2380, https://20.140.249.65:2379
c77b7b06d2075637, started, etcd-host1, https://20.140.249.66:2380, https://20.140.249.66:2379
f11a3a48abfa96dd, started, etcd-host2, https://20.140.249.67:2380, https://20.140.249.67:2379
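
The member ID needed for the removal can be pulled out of the member list output by name instead of being copied by hand. A minimal sketch, assuming the comma-separated output format shown above; member_id_by_name is a hypothetical helper, not part of etcdctl:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ID of the member whose name equals $1.
# Reads `etcdctl member list` output (ID, status, name, peer URLs, client URLs) on stdin.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Sample input copied from the member list output in this article.
sample='c666144c29031acd, started, etcd-host0, https://20.140.249.65:2380, https://20.140.249.65:2379
c77b7b06d2075637, started, etcd-host1, https://20.140.249.66:2380, https://20.140.249.66:2379
f11a3a48abfa96dd, started, etcd-host2, https://20.140.249.67:2380, https://20.140.249.67:2379'

printf '%s\n' "$sample" | member_id_by_name etcd-host1   # prints c77b7b06d2075637
```

In practice the real etcdctl member list command would be piped into the helper in place of the sample text.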

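2. Remove the failing member. The endpoint has to be a healthy member, because etcd on the failing node is not running.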

export ETCDCTL_API=3
[root@ga-k8s1 data]# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem  member remove c77b7b06d2075637
Member c77b7b06d2075637 removed from cluster 7ab1847bce8f7723

3. Edit /usr/lib/systemd/system/etcd.service on the failing node

[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/app/etcd/
ExecStart=/usr/local/bin/etcd \
  --name=etcd-host1 \
  --data-dir=/app/etcd \
  --cert-file=/etc/etcd/ssl/etcd.pem \
  --key-file=/etc/etcd/ssl/etcd-key.pem \
  --trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-cert-file=/etc/etcd/ssl/etcd.pem \
  --peer-key-file=/etc/etcd/ssl/etcd-key.pem \
  --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-client-cert-auth \
  --client-cert-auth \
  --initial-advertise-peer-urls=https://20.140.249.66:2380 \
  --listen-peer-urls=https://20.140.249.66:2380 \
  --listen-client-urls=https://20.140.249.66:2379,https://127.0.0.1:2379 \
  --advertise-client-urls=https://20.140.249.66:2379 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=etcd-host0=https://20.140.249.65:2380,etcd-host1=https://20.140.249.66:2380,etcd-host2=https://20.140.249.67:2380 \
  --initial-cluster-state=existing
# initial-cluster-state was changed from "new" to "existing" so that this node joins the running cluster.
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
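
When a member joins with a static configuration like this one, etcd expects the value of --name, paired with this node's peer URL, to appear as an entry in --initial-cluster. A minimal sketch of a check that can be run before starting the service; name_matches_initial_cluster is a hypothetical helper, and the values are copied from the unit file and member list in this article:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only if "<name>=<peer_url>" is one of
# the comma-separated entries of an --initial-cluster value.
name_matches_initial_cluster() {
  local name="$1" peer_url="$2" initial_cluster="$3"
  case ",${initial_cluster}," in
    *",${name}=${peer_url},"*) return 0 ;;
    *) return 1 ;;
  esac
}

initial_cluster='etcd-host0=https://20.140.249.65:2380,etcd-host1=https://20.140.249.66:2380,etcd-host2=https://20.140.249.67:2380'

if name_matches_initial_cluster etcd-host1 https://20.140.249.66:2380 "$initial_cluster"; then
  echo "ok: name and peer URL agree with --initial-cluster"
else
  echo "mismatch: check --name and --initial-advertise-peer-urls" >&2
fi
```

Wrapping the list in commas before matching keeps a partial entry from being accepted as a match.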

4. Delete the old data on the failing node

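rm -rf /var/lib/etcd/
rm -rf /app/etcd/  # the data dir, which is also WorkingDirectory=/app/etcd/ in the unit file
mkdir -p -m 700 /app/etcd/  # recreate it empty; systemd cannot start the service if WorkingDirectory is missing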
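
Deleting the directory outright leaves no way back if the re-sync fails. A more cautious variant is to move it aside and recreate it empty. This is a sketch rather than what was done on site: backup_and_reset_dir is a hypothetical helper, and mode 700 is chosen on the assumption that the etcd release in use expects a private data directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move a data directory aside with a timestamp suffix and
# recreate it empty, instead of deleting it. Prints the backup path on success.
backup_and_reset_dir() {
  local dir="${1%/}"   # strip a trailing slash, if any
  local backup="${dir}.bak.$(date +%Y%m%d%H%M%S)"
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  mv "$dir" "$backup" && mkdir -m 700 "$dir" && echo "$backup"
}

# Example (run on the failing node only, with etcd stopped):
#   backup_and_reset_dir /app/etcd/
```

The backup can be removed once the node has rejoined and caught up.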

5. Add the etcd node back to the cluster

export ETCDCTL_API=2
etcdctl --endpoints=https://127.0.0.1:2379 --ca-file=/etc/kubernetes/ssl/ca.pem --cert-file=/etc/etcd/ssl/etcd.pem --key-file=/etc/etcd/ssl/etcd-key.pem  member add  etcd-host1 https://20.140.249.66:2380
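
This step uses the v2 API and its flag names, while the earlier steps use v3. On an etcdctl without v2 support, a possible v3 equivalent passes the peer URL through --peer-urls. This is a sketch that reuses the paths and URLs from this article; add_member_v3 and the ETCDCTL_BIN override are hypothetical names added so the command can be dry-run:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the v3 form of "member add".
# ETCDCTL_BIN is an illustrative override: set it to "echo" to print the
# arguments instead of contacting the cluster.
add_member_v3() {
  local name="$1" peer_url="$2"
  ETCDCTL_API=3 "${ETCDCTL_BIN:-etcdctl}" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/ssl/ca.pem \
    --cert=/etc/etcd/ssl/etcd.pem \
    --key=/etc/etcd/ssl/etcd-key.pem \
    member add "$name" --peer-urls="$peer_url"
}

# Example, run against a healthy member:
#   add_member_v3 etcd-host1 https://20.140.249.66:2380
```

As with the removal, this has to be run against a healthy member, since etcd on the re-added node is not up yet.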

6. Start etcd; the re-added node re-syncs its data from the other two members

systemctl daemon-reload && systemctl start etcd
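
After the service starts, the member list from step 1 can be checked again to confirm the node has rejoined. A minimal sketch that prints the peer URL of any member not reported as started, assuming the same comma-separated output format as in step 1; members_not_started is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the peer URL of every member whose status is not "started".
# Reads `etcdctl member list` output on stdin; prints nothing when all members are started.
members_not_started() {
  awk -F', ' '$2 != "started" { print $4 }'
}

# Example, with the v3 command and certificate paths used earlier in this article:
#   etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/ssl/ca.pem \
#     --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem member list | members_not_started
```

Empty output means every member reports started.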