A node in the k8s cluster failed. After the node was taken offline, the pods running on it were migrated to other nodes, but a large number of pods then started reporting errors. Investigation showed the root cause was a Redis cluster failure. Yet the Redis pods were all in Running state, as shown below:
Redis pod status
These pods form a Redis Cluster. Since the pods themselves were healthy but the applications were still reporting Redis connection errors, the problem had to be in the Redis Cluster itself. Check the Redis Cluster status:
Redis Cluster status
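For reference, the kind of check shown above can be run from inside any Redis pod; <redis-pod> below is a placeholder, while port 8382 matches the masters used later in this post:
#<redis-pod> is a placeholder for an actual pod name
kubectl exec -it <redis-pod> -- redis-cli -p 8382 cluster info
kubectl exec -it <redis-pod> -- redis-cli -p 8382 cluster nodes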
Since the Redis cluster was broken, the first attempt was simply to delete Redis and reinstall it with helm. The nodes still failed to form a cluster. A closer analysis of the failure led to the following conclusion:
The k8s architecture is illustrated below:
Architecture diagram
The diagram only shows three nodes, which is enough to make the point. Each node runs one master and one slave. node3 failed and was removed from the cluster; the stateless pods that used to run on it were migrated to other nodes and kept working normally. However, master2 and slave2 had persisted data on node3. They were recreated on node4, but with that data missing the original cluster state was broken, so redeploying alone could not restore it. Because both master2 and slave2 lost their data, the cluster could not be rebuilt automatically. The developers confirmed that Redis only held cache data, so losing it had little impact. The fix was therefore to delete the local persisted data, redeploy the Redis nodes, and then recreate the cluster by hand.
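A minimal sketch of the cleanup step, assuming the persistence files live under /data inside each Redis pod (the path and file names are the Redis defaults, not values confirmed from this cluster):
#Run inside each Redis pod; paths assume the default /data working directory
rm -f /data/nodes.conf /data/appendonly.aof /data/dump.rdb
#Then redeploy the Redis pods with helm before recreating the cluster manually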
#To be able to assign the master nodes explicitly, step one creates the cluster with masters only; the slave nodes are then added manually
./redis-trib.rb create --replicas 0 172.29.11.9:8382 172.29.11.15:8382 172.29.11.20:8382
>>> Creating cluster
>>> Performing hash slots allocation on 3 nodes...
Using 3 masters:
172.29.11.9:8382
172.29.11.15:8382
172.29.11.20:8382
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join..
>>> Performing Cluster Check (using node 172.29.11.9:8382)
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
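Before adding the slaves, the master IDs required by add-node can be read back at any time with cluster nodes (a quick sanity check; output omitted here):
redis-cli -h 172.29.11.9 -p 8382 cluster nodes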
#To avoid losing an entire shard when a single node fails, each slave must be placed on a different node from its own master
#Add slave1 node
./redis-trib.rb add-node --master-id 7f8e4fbd362fd003b1890aa24dd673d06d401500 --slave 172.29.11.15:8383 172.29.11.9:8382
>>> Adding node 172.29.11.15:8383 to cluster 172.29.11.9:8382
>>> Performing Cluster Check (using node 172.29.11.9:8382)
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
0 additional replica(s)
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
0 additional replica(s)
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
0 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 172.29.11.15:8383 to make it join the cluster.
Waiting for the cluster to join.
>>> Configure node as replica of 172.29.11.9:8382.
[OK] New node added correctly.
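#Add slave2 node (replica of master2, placed on a different node from its master)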
./redis-trib.rb add-node --master-id 89b3c9925dd167bb7292dcacc715c949506cb022 --slave 172.29.11.20:8383 172.29.11.15:8382
>>> Adding node 172.29.11.20:8383 to cluster 172.29.11.15:8382
>>> Performing Cluster Check (using node 172.29.11.15:8382)
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
0 additional replica(s)
S: 31866adbf9034931f63bb69b62b5169f0d01ab71 172.29.11.15:8383
slots: (0 slots) slave
replicates 7f8e4fbd362fd003b1890aa24dd673d06d401500
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
0 additional replica(s)
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 172.29.11.20:8383 to make it join the cluster.
Waiting for the cluster to join.
>>> Configure node as replica of 172.29.11.15:8382.
[OK] New node added correctly.
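#Add slave3 node (replica of master3, placed on a different node from its master)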
./redis-trib.rb add-node --master-id 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 --slave 172.29.11.9:8383 172.29.11.20:8382
>>> Adding node 172.29.11.9:8383 to cluster 172.29.11.20:8382
>>> Performing Cluster Check (using node 172.29.11.20:8382)
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
0 additional replica(s)
S: 1546356362eac839986133a43f571fc59cdc4503 172.29.11.20:8383
slots: (0 slots) slave
replicates 89b3c9925dd167bb7292dcacc715c949506cb022
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
1 additional replica(s)
S: 31866adbf9034931f63bb69b62b5169f0d01ab71 172.29.11.15:8383
slots: (0 slots) slave
replicates 7f8e4fbd362fd003b1890aa24dd673d06d401500
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 172.29.11.9:8383 to make it join the cluster.
Waiting for the cluster to join.
>>> Configure node as replica of 172.29.11.20:8382.
[OK] New node added correctly.
All three slave nodes were added without errors. Enter one of the master nodes and check the cluster status:
/data# redis-cli -p 8382
127.0.0.1:8382> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:3
cluster_my_epoch:1
cluster_stats_messages_sent:12915
cluster_stats_messages_received:12915
127.0.0.1:8382>
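As an extra check (not shown in the original output), cluster nodes on the same connection lists every master together with its replica, which makes it easy to verify the intended pairing:
127.0.0.1:8382> cluster nodes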
The cluster state is finally back to normal. The rebuilt Redis Cluster architecture is shown below:
Rebuilt cluster architecture
Summary: for stateful applications such as Redis and MySQL, containerization has to be planned carefully; in particular, never let a master and its own slave run on the same node. For Redis specifically, if read/write I/O is not especially high, a plain master-slave replication setup is still recommended, since failure recovery is simpler and faster. A sketch of how to enforce the placement rule in Kubernetes follows.
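A minimal sketch of enforcing the "master and slave never on the same node" rule with pod anti-affinity, assuming each shard's pods carry app and shard labels (the label keys and values are assumptions, not taken from the actual chart):
#Hypothetical fragment of a Redis pod template spec; labels are assumptions
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: redis
          shard: shard-1      # pods of the same shard repel each other
      topologyKey: kubernetes.io/hostname
With this in place, the scheduler refuses to put a shard's master and its slave on the same node, so a single node failure can take out at most one member of each shard.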