A node in the k8s cluster failed. After the node was taken offline, the pods running on it were migrated to other nodes, but a large number of pods then started reporting errors. Investigation showed the root cause was a Redis cluster failure. Yet the Redis pods were all in Running state, as shown below:
Redis pod status
These pods form a Redis Cluster. Since the pods themselves were healthy but the applications were still reporting Redis connection errors, the problem had to be in the Redis Cluster itself. Check the Redis Cluster status:
Redis Cluster status
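For reference, the kind of check shown above can be run from inside any Redis pod; <redis-pod> below is a placeholder, while port 8382 matches the masters used later in this post:
#<redis-pod> is a placeholder for an actual pod name
kubectl exec -it <redis-pod> -- redis-cli -p 8382 cluster info
kubectl exec -it <redis-pod> -- redis-cli -p 8382 cluster nodes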
Since the Redis cluster was broken, the first attempt was simply to delete Redis and reinstall it with helm. The nodes still failed to form a cluster. A closer analysis of the failure led to the following conclusion:
The k8s architecture is illustrated below:
Architecture diagram
The diagram only shows three nodes, which is enough to make the point. Each node runs one master and one slave. node3 failed and was removed from the cluster; the stateless pods that used to run on it were migrated to other nodes and kept working normally. However, master2 and slave2 had persisted data on node3. They were recreated on node4, but with that data missing the original cluster state was broken, so redeploying alone could not restore it. Because both master2 and slave2 lost their data, the cluster could not be rebuilt automatically. The developers confirmed that Redis only held cache data, so losing it had little impact. The fix was therefore to delete the local persisted data, redeploy the Redis nodes, and then recreate the cluster by hand.
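A minimal sketch of the cleanup step, assuming the persistence files live under /data inside each Redis pod (the path and file names are the Redis defaults, not values confirmed from this cluster):
#Run inside each Redis pod; paths assume the default /data working directory
rm -f /data/nodes.conf /data/appendonly.aof /data/dump.rdb
#Then redeploy the Redis pods with helm before recreating the cluster manually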
#To be able to assign the master nodes explicitly, step one creates the cluster with masters only; the slave nodes are then added manually
./redis-trib.rb create --replicas 0 172.29.11.9:8382 172.29.11.15:8382 172.29.11.20:8382
>>> Creating cluster
>>> Performing hash slots allocation on 3 nodes...
Using 3 masters:
172.29.11.9:8382
172.29.11.15:8382
172.29.11.20:8382
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join..
>>> Performing Cluster Check (using node 172.29.11.9:8382)
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
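Before adding the slaves, the master IDs required by add-node can be read back at any time with cluster nodes (a quick sanity check; output omitted here):
redis-cli -h 172.29.11.9 -p 8382 cluster nodes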
#To avoid losing an entire shard when a single node fails, each slave must be placed on a different node from its own master
#Add slave1 node
./redis-trib.rb add-node --master-id 7f8e4fbd362fd003b1890aa24dd673d06d401500 --slave 172.29.11.15:8383 172.29.11.9:8382
>>> Adding node 172.29.11.15:8383 to cluster 172.29.11.9:8382
>>> Performing Cluster Check (using node 172.29.11.9:8382)
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
0 additional replica(s)
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
0 additional replica(s)
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
0 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 172.29.11.15:8383 to make it join the cluster.
Waiting for the cluster to join.
>>> Configure node as replica of 172.29.11.9:8382.
[OK] New node added correctly.
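#Add slave2 node (replica of master2, placed on a different node from its master)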
./redis-trib.rb add-node --master-id 89b3c9925dd167bb7292dcacc715c949506cb022 --slave 172.29.11.20:8383 172.29.11.15:8382
>>> Adding node 172.29.11.20:8383 to cluster 172.29.11.15:8382
>>> Performing Cluster Check (using node 172.29.11.15:8382)
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
0 additional replica(s)
S: 31866adbf9034931f63bb69b62b5169f0d01ab71 172.29.11.15:8383
slots: (0 slots) slave
replicates 7f8e4fbd362fd003b1890aa24dd673d06d401500
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
0 additional replica(s)
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 172.29.11.20:8383 to make it join the cluster.
Waiting for the cluster to join.
>>> Configure node as replica of 172.29.11.15:8382.
[OK] New node added correctly.
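#Add slave3 node (replica of master3, placed on a different node from its master)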
./redis-trib.rb add-node --master-id 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 --slave 172.29.11.9:8383 172.29.11.20:8382
>>> Adding node 172.29.11.9:8383 to cluster 172.29.11.20:8382
>>> Performing Cluster Check (using node 172.29.11.20:8382)
M: 5160f483410cf0b2fd3fc55d6844f5336f1e1c47 172.29.11.20:8382
slots:10923-16383 (5461 slots) master
0 additional replica(s)
S: 1546356362eac839986133a43f571fc59cdc4503 172.29.11.20:8383
slots: (0 slots) slave
replicates 89b3c9925dd167bb7292dcacc715c949506cb022
M: 89b3c9925dd167bb7292dcacc715c949506cb022 172.29.11.15:8382
slots:5461-10922 (5462 slots) master
1 additional replica(s)
S: 31866adbf9034931f63bb69b62b5169f0d01ab71 172.29.11.15:8383
slots: (0 slots) slave
replicates 7f8e4fbd362fd003b1890aa24dd673d06d401500
M: 7f8e4fbd362fd003b1890aa24dd673d06d401500 172.29.11.9:8382
slots:0-5460 (5461 slots) master
1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 172.29.11.9:8383 to make it join the cluster.
Waiting for the cluster to join.
>>> Configure node as replica of 172.29.11.20:8382.
[OK] New node added correctly.
All three slave nodes were added without errors. Enter one of the master nodes and check the cluster status:
/data# redis-cli -p 8382
127.0.0.1:8382> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:3
cluster_my_epoch:1
cluster_stats_messages_sent:12915
cluster_stats_messages_received:12915
127.0.0.1:8382>
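As an extra check (not shown in the original output), cluster nodes on the same connection lists every master together with its replica, which makes it easy to verify the intended pairing:
127.0.0.1:8382> cluster nodes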
The cluster state is finally back to normal. The rebuilt Redis Cluster architecture is shown below:
Rebuilt cluster architecture
Summary: for stateful applications such as Redis and MySQL, containerization has to be planned carefully; in particular, never let a master and its own slave run on the same node. For Redis specifically, if read/write I/O is not especially high, a plain master-slave replication setup is still recommended, since failure recovery is simpler and faster. A sketch of how to enforce the placement rule in Kubernetes follows.
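A minimal sketch of enforcing the "master and slave never on the same node" rule with pod anti-affinity, assuming each shard's pods carry app and shard labels (the label keys and values are assumptions, not taken from the actual chart):
#Hypothetical fragment of a Redis pod template spec; labels are assumptions
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: redis
          shard: shard-1      # pods of the same shard repel each other
      topologyKey: kubernetes.io/hostname
With this in place, the scheduler refuses to put a shard's master and its slave on the same node, so a single node failure can take out at most one member of each shard.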