[Azure Redis] Simulating and resolving the 15-minute client timeout caused by a server-side Redis failover

Problem Description

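While using Azure Cache for Redis, server-side maintenance triggered a failover. The client application runs on Linux and uses the Lettuce SDK, which currently has a known issue: after a failover, the client can keep timing out for about 15 minutes.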

Answers

How can a failover be reproduced on demand?

In the Azure portal, open the Azure Cache for Redis instance and go to the Reboot page. Note that only rebooting the Primary node simulates a failover scenario.


[Image: image.png, screenshot of the Reboot page in the Azure portal]

Which measures can resolve the issue?

1. Adjust the Linux TCP settings to shorten the TCP retransmission time and mitigate the issue: net.ipv4.tcp_retries2 = 5
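
To see why lowering tcp_retries2 helps, here is a rough sketch of the arithmetic. It assumes the Linux defaults of a 200 ms minimum and a 120 s maximum retransmission timeout (RTO) with exponential backoff, and a connection whose RTO starts at the minimum. The class and method names are made up for illustration. Real timeouts depend on the measured round-trip time, so treat the results as lower-bound estimates.

```java
import java.time.Duration;

/** Illustrative helper: estimates how long Linux keeps retransmitting on a dead connection. */
final class RetransmissionTimeout {
    // Assumed kernel defaults: TCP_RTO_MIN = 200 ms, TCP_RTO_MAX = 120 s.
    static final long RTO_MIN_MS = 200;
    static final long RTO_MAX_MS = 120_000;

    /** Sums the initial timeout plus one timeout per retransmission, doubling up to the cap. */
    static Duration estimate(int tcpRetries2) {
        long totalMs = 0;
        long rtoMs = RTO_MIN_MS;
        for (int i = 0; i <= tcpRetries2; i++) {
            totalMs += rtoMs;
            rtoMs = Math.min(rtoMs * 2, RTO_MAX_MS);
        }
        return Duration.ofMillis(totalMs);
    }

    public static void main(String[] args) {
        System.out.println("tcp_retries2=15 -> " + estimate(15).toMillis() / 1000.0 + " s");
        System.out.println("tcp_retries2=5  -> " + estimate(5).toMillis() / 1000.0 + " s");
    }
}
```

Under these assumptions, the kernel default of tcp_retries2 = 15 gives about 924.6 seconds (roughly 15 minutes), which lines up with the timeout described above, while tcp_retries2 = 5 gives about 12.6 seconds.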

2. For Spring Boot applications integrated with Lettuce, override ClientResources to mitigate the issue:


[Image: image.png, screenshot of the ClientResources override]
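
As a rough illustration of item 2, a Spring Boot application might register its own ClientResources bean along the following lines. This is a sketch and not necessarily the code shown in the screenshot. The class name LettuceTcpConfig and the chosen values (30 s idle, 10 s interval, 3 probes, 30 s user timeout) are assumptions borrowed from the Lettuce example later in this post. The Epoll options are only expected to take effect when Netty's native epoll transport is on the classpath and in use on Linux.

```java
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.NettyCustomizer;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.epoll.EpollChannelOption;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical configuration class; the name and all values are illustrative assumptions.
@Configuration
public class LettuceTcpConfig {

    // Spring Boot's Lettuce auto-configuration backs off when a ClientResources bean already exists.
    @Bean(destroyMethod = "shutdown")
    public ClientResources clientResources() {
        return ClientResources.builder()
                .nettyCustomizer(new NettyCustomizer() {
                    @Override
                    public void afterBootstrapInitialized(Bootstrap bootstrap) {
                        // Enable keepalive probes and tune them (values in seconds).
                        bootstrap.option(ChannelOption.SO_KEEPALIVE, true);
                        bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 30);
                        bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 10);
                        bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);
                        // Give up on unacknowledged data after 30 s (value in milliseconds).
                        bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, 30_000);
                    }
                })
                .build();
    }
}
```

The idea is the same as in the direct Lettuce configuration: bound how long the kernel waits on a dead connection so that Lettuce reconnects well before the retransmission limit is reached.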

3. The Lettuce open-source community also describes mitigating the issue by setting TCP_USER_TIMEOUT. Applications that use the Lettuce SDK directly can refer to the following:

  • Use Lettuce >= 6.3.0
<dependencies>
    <dependency>
        <groupId>io.lettuce</groupId>
        <artifactId>lettuce-core</artifactId>
        <version>6.3.0.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-transport-native-epoll</artifactId>
        <version>4.1.100.Final</version>
        <classifier>linux-x86_64</classifier>
    </dependency>
</dependencies>
  • Configure TCP_USER_TIMEOUT
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.SocketOptions.KeepAliveOptions;
import io.lettuce.core.SocketOptions.TcpUserTimeoutOptions;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;
import java.time.Duration;

public class LettuceExample {
    /**
     * Enable TCP keepalive and configure the following three parameters:
     *  TCP_KEEPIDLE = 30
     *  TCP_KEEPINTVL = 10
     *  TCP_KEEPCNT = 3
     */
    private static final int TCP_KEEPALIVE_IDLE = 30;

    /**
     * The TCP_USER_TIMEOUT parameter can avoid situations where Lettuce remains stuck in a continuous timeout loop during a failure or crash event. 
     * refer: https://github.com/lettuce-io/lettuce-core/issues/2082
     */
    private static final int TCP_USER_TIMEOUT = 30;

    private static RedisClient client = null;
    private static StatefulRedisConnection<String, String> connection = null;

    public static void main(String[] args) {
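        // Replace host, user, password, and port with your own instance information; the masked values below are placeholders only.
        // For Azure Cache for Redis the host typically looks like <cache-name>.redis.cache.windows.net, and TLS connections use port 6380.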
        String host = "r-bp1s1bt2tlq3p1****.redis.rds.aliyuncs.com";
        String user = "r-bp1s1bt2tlq3p1****";
        String password = "Da****3";
        int port = 6379;

        // Build the RedisURI
        RedisURI uri = RedisURI.Builder
                .redis(host, port)
                .withAuthentication(user, password)
                .build();

        // Configure TCP keepalive and TCP_USER_TIMEOUT
        SocketOptions socketOptions = SocketOptions.builder()
                .keepAlive(KeepAliveOptions.builder()
                        .enable()
                        .idle(Duration.ofSeconds(TCP_KEEPALIVE_IDLE))
                        .interval(Duration.ofSeconds(TCP_KEEPALIVE_IDLE/3))
                        .count(3)
                        .build())
                .tcpUserTimeout(TcpUserTimeoutOptions.builder()
                        .enable()
                        .tcpUserTimeout(Duration.ofSeconds(TCP_USER_TIMEOUT))
                        .build())
                .build();

        client = RedisClient.create(uri);
        client.setOptions(ClientOptions.builder()
                .socketOptions(socketOptions)
                .build());
        connection = client.connect();
        RedisCommands<String, String> commands = connection.sync();

        System.out.println(commands.set("foo", "bar"));
        System.out.println(commands.get("foo"));

        // If your application exits and you want to destroy the resources, call this method. Then, the connection is closed, and the resources are released. 
        connection.close();
        client.shutdown();
    }
}

4. Use a different SDK, such as Jedis, to avoid this class of problem.

References

TCP settings for Linux-hosted client applications : https://docs.azure.cn/zh-cn/azure-cache-for-redis/cache-best-practices-connection#tcp-settings-for-linux-hosted-client-applications

Add support for disconnect on timeout to recover early from no RST packet failures : https://github.com/redis/lettuce/issues/2082#issuecomment-2290496556

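When facing problems in a complex environment, the way to investigate is this: let muddy water settle and it slowly becomes clear; set stillness in motion and it slowly comes to life. In the cloud, it is exactly so!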
