How to recover a Redis cluster stuck in the fail state

Bounty (园豆): 30 [Resolved] Resolved on 2022-06-24 10:56

A Redis cluster with 3 master nodes is in the fail state. The output of cluster info is as follows:

cluster info
cluster_state:fail
cluster_slots_assigned:14060
cluster_slots_ok:14060
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:3
cluster_size:3
cluster_current_epoch:9
cluster_my_epoch:9
cluster_stats_messages_ping_sent:1405
cluster_stats_messages_pong_sent:1480
cluster_stats_messages_fail_sent:2
cluster_stats_messages_sent:2887
cluster_stats_messages_ping_received:1480
cluster_stats_messages_pong_received:1403
cluster_stats_messages_received:2883
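
One detail in the output above is worth checking: cluster_slots_assigned is 14060, while a Redis Cluster has 16384 hash slots in total. With the default setting cluster-require-full-coverage yes, a cluster that has uncovered slots reports cluster_state:fail. The following is a minimal Python sketch, not part of the original post, that parses cluster info text and computes how many slots are unassigned. The function names parse_cluster_info and unassigned_slots are hypothetical helpers introduced for illustration.

```python
TOTAL_SLOTS = 16384  # fixed number of hash slots in a Redis Cluster


def parse_cluster_info(text):
    """Parse `cluster info` output into a dict; numeric values become ints."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # skip blank lines and lines that are not key:value
        info[key] = int(value) if value.isdigit() else value
    return info


def unassigned_slots(info):
    """Number of hash slots that no node currently serves."""
    return TOTAL_SLOTS - info["cluster_slots_assigned"]


# A subset of the output quoted in the question.
sample = """\
cluster_state:fail
cluster_slots_assigned:14060
cluster_slots_ok:14060
cluster_known_nodes:3
cluster_size:3
"""

info = parse_cluster_info(sample)
print(info["cluster_state"], unassigned_slots(info))  # fail 2324
```

For the values in the question this gives 2324 unassigned slots, which by itself would be enough to keep the cluster in the fail state under the default configuration.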

Every node reports the following status:

Ready to accept connections

How can the cluster be recovered?

redis

Additional details:

Later, the following log entries appeared on 2 of the nodes:

redis-cache-1

1:M 24 Jun 2022 00:49:49.695 * Ready to accept connections
1:M 24 Jun 2022 00:50:42.537 * FAIL message received from 0503e8066076b194a3fbe922ac1e94f454ac6b78 about 6caf3c5483eabf9ee6aed6c3cbb240183eab7cc4
1:M 24 Jun 2022 00:50:43.002 # Address updated for node 6caf3c5483eabf9ee6aed6c3cbb240183eab7cc4, now 192.168.11.227:6379
1:M 24 Jun 2022 00:51:12.903 * Clear FAIL state for node 6caf3c5483eabf9ee6aed6c3cbb240183eab7cc4: is reachable again and nobody is serving its slots after some time.

redis-cache-2

1:M 24 Jun 2022 00:45:03.228 * Ready to accept connections
1:M 24 Jun 2022 00:49:49.697 # Address updated for node 19d9caa5600555e8de83e33a5cfce0ef50c956e8, now 192.168.19.70:6379
1:M 24 Jun 2022 00:50:42.531 * Marking node 6caf3c5483eabf9ee6aed6c3cbb240183eab7cc4 as failing (quorum reached).
1:M 24 Jun 2022 00:50:43.002 # Address updated for node 6caf3c5483eabf9ee6aed6c3cbb240183eab7cc4, now 192.168.11.227:6379
1:M 24 Jun 2022 00:51:12.652 * Clear FAIL state for node 6caf3c5483eabf9ee6aed6c3cbb240183eab7cc4: is reachable again and nobody is serving its slots after some time.

dudu | Level 7 expert (高人七级) | 园豆: 23423
Asked: 2022-06-24 09:18

The node marked as failing is redis-cache-0.

– dudu 3 years ago

First run: redis-cli -a $REDIS_PASSWORD --cluster fix 192.168.11.227:6379 --cluster-fix-with-unreachable-masters (the IP of redis-cache-0). The problem persisted.

– dudu 3 years ago

Second run: redis-cli -a $REDIS_PASSWORD --cluster fix 192.168.19.70:6379 --cluster-fix-with-unreachable-masters (the IP of redis-cache-1). 2 nodes (redis-cache-0 and redis-cache-2) returned to normal: Cluster state changed: ok

– dudu 3 years ago

Third run: redis-cli -a $REDIS_PASSWORD --cluster fix 192.168.12.128:6379 --cluster-fix-with-unreachable-masters (the IP of redis-cache-2). All 3 nodes returned to normal.

– dudu 3 years ago


Accepted answer

Running fix against the cluster 3 times finally resolved it!

Fix 1, starting from redis-cache-0 (192.168.11.227)

redis-cli -a $REDIS_PASSWORD --cluster fix 192.168.11.227:6379 --cluster-fix-with-unreachable-masters

Fix 2, starting from redis-cache-1 (192.168.19.70)

redis-cli -a $REDIS_PASSWORD --cluster fix 192.168.19.70:6379 --cluster-fix-with-unreachable-masters

Fix 3, starting from redis-cache-2 (192.168.12.128)

redis-cli -a $REDIS_PASSWORD --cluster fix 192.168.12.128:6379 --cluster-fix-with-unreachable-masters

After recovery, the logs on all 3 nodes show

Cluster state changed: ok
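
The three runs above can be scripted. The following Python sketch is only an illustration under stated assumptions, not the procedure the answer used: it builds the same redis-cli command for each node address from the answer and runs them in order. The names NODES, fix_command, and run_fixes are hypothetical, and redis-cli may ask for confirmation interactively before changing slot assignments, so the sketch leaves stdin attached to the terminal.

```python
import subprocess

# Node addresses taken from the answer, in the order they were fixed.
NODES = ["192.168.11.227:6379", "192.168.19.70:6379", "192.168.12.128:6379"]


def fix_command(node, password):
    """Build the argv for one `redis-cli --cluster fix` run (same flags as the answer)."""
    return [
        "redis-cli", "-a", password,
        "--cluster", "fix", node,
        "--cluster-fix-with-unreachable-masters",
    ]


def run_fixes(nodes, password, run=subprocess.run):
    """Run the fix against each node in turn; `run` is injectable for dry runs."""
    for node in nodes:
        run(fix_command(node, password), check=False)


# Dry run: print the commands instead of executing them.
for node in NODES:
    print(" ".join(fix_command(node, "$REDIS_PASSWORD")))
```

To execute for real, one would call run_fixes(NODES, os.environ["REDIS_PASSWORD"]) after importing os, then confirm on each node that the log shows Cluster state changed: ok.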

dudu | Level 7 expert (高人七级) | 园豆: 23423 | 2022-06-24 10:54
