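Problem Description
While using Azure Cache for Redis, server-side maintenance triggered a failover. The client runs on Linux and uses the Lettuce SDK, which currently has a known issue: after such a failover, connections can hang for roughly 15 minutes before timing out.
Answer
How can a failover be reproduced on demand?
In the Azure portal, open the Azure Cache for Redis instance and go to Reboot. Note that only rebooting the Primary node simulates a failover scenario.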
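While the Primary node reboots, it helps to have something that measures how long the client actually stays broken. The following is a minimal sketch of such a probe, not part of the original procedure: the `FailoverProbe` class and its fields are hypothetical names, and the demo in `main` uses a simulated command so it runs without a Redis instance. Against a real cache, one would presumably pass something like `() -> commands.ping()`, where `commands` is a Lettuce `RedisCommands` object.

```java
import java.util.function.Supplier;

// Hypothetical helper: repeatedly runs a command and records how long failures last.
final class FailoverProbe {
    final int failures;          // number of attempts that threw an exception
    final long longestOutageMs;  // longest stretch from first failure to next success

    private FailoverProbe(int failures, long longestOutageMs) {
        this.failures = failures;
        this.longestOutageMs = longestOutageMs;
    }

    static FailoverProbe run(Supplier<String> command, int attempts, long pauseMs) {
        int failures = 0;
        long longestMs = 0;
        boolean inOutage = false;
        long outageStart = 0;
        for (int i = 0; i < attempts; i++) {
            long attemptStart = System.nanoTime();
            try {
                command.get();
                if (inOutage) {
                    // First success after a run of failures closes the outage window.
                    longestMs = Math.max(longestMs, (System.nanoTime() - outageStart) / 1_000_000);
                    inOutage = false;
                }
            } catch (RuntimeException e) {
                // Lettuce reports timeouts and connection errors as runtime exceptions.
                failures++;
                if (!inOutage) {
                    inOutage = true;
                    outageStart = attemptStart;
                }
            }
            try {
                Thread.sleep(pauseMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        if (inOutage) {
            // Still failing when the probe stopped: count the open window as well.
            longestMs = Math.max(longestMs, (System.nanoTime() - outageStart) / 1_000_000);
        }
        return new FailoverProbe(failures, longestMs);
    }

    public static void main(String[] args) {
        // Simulated command: attempts 3 to 5 fail, standing in for a failover window.
        int[] calls = {0};
        FailoverProbe result = run(() -> {
            int n = ++calls[0];
            if (n >= 3 && n <= 5) {
                throw new IllegalStateException("simulated failover");
            }
            return "PONG";
        }, 8, 10);
        System.out.println("failures=" + result.failures
                + ", longestOutageMs=" + result.longestOutageMs);
    }
}
```

Running the probe before and after applying a mitigation gives a direct comparison of the outage length, rather than relying on application logs alone.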
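Which measures can resolve the problem?
1. Modify the TCP settings to shorten the TCP retransmission time and mitigate the issue: net.ipv4.tcp_retries2 = 5
2. For Spring Boot integrated with Lettuce, override the ClientResources class to mitigate the issue.
3. The Lettuce open-source community also describes mitigating the issue by setting TCP_USER_TIMEOUT. For applications that use the Lettuce SDK directly, refer to the following: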
- Use Lettuce >= 6.3.0
<dependencies>
    <dependency>
        <groupId>io.lettuce</groupId>
        <artifactId>lettuce-core</artifactId>
        <version>6.3.0.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-transport-native-epoll</artifactId>
        <version>4.1.100.Final</version>
        <classifier>linux-x86_64</classifier>
    </dependency>
</dependencies>
- Configure TCP_USER_TIMEOUT
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.SocketOptions.KeepAliveOptions;
import io.lettuce.core.SocketOptions.TcpUserTimeoutOptions;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

import java.time.Duration;

public class LettuceExample {
    /**
     * Enable TCP keepalive and configure the following three parameters:
     * TCP_KEEPIDLE = 30
     * TCP_KEEPINTVL = 10
     * TCP_KEEPCNT = 3
     */
    private static final int TCP_KEEPALIVE_IDLE = 30;

    /**
     * The TCP_USER_TIMEOUT parameter can avoid situations where Lettuce remains stuck
     * in a continuous timeout loop during a failure or crash event.
     * See: https://github.com/lettuce-io/lettuce-core/issues/2082
     */
    private static final int TCP_USER_TIMEOUT = 30;

    private static RedisClient client = null;
    private static StatefulRedisConnection<String, String> connection = null;

    public static void main(String[] args) {
        // Replace the values of host, user, password, and port with the actual instance information.
        String host = "r-bp1s1bt2tlq3p1****.redis.rds.aliyuncs.com";
        String user = "r-bp1s1bt2tlq3p1****";
        String password = "Da****3";
        int port = 6379;

        // Configure the RedisURI
        RedisURI uri = RedisURI.Builder
                .redis(host, port)
                .withAuthentication(user, password)
                .build();

        // Configure TCP keepalive (idle 30 s, interval 30 / 3 = 10 s, 3 probes) and TCP_USER_TIMEOUT
        SocketOptions socketOptions = SocketOptions.builder()
                .keepAlive(KeepAliveOptions.builder()
                        .enable()
                        .idle(Duration.ofSeconds(TCP_KEEPALIVE_IDLE))
                        .interval(Duration.ofSeconds(TCP_KEEPALIVE_IDLE / 3))
                        .count(3)
                        .build())
                .tcpUserTimeout(TcpUserTimeoutOptions.builder()
                        .enable()
                        .tcpUserTimeout(Duration.ofSeconds(TCP_USER_TIMEOUT))
                        .build())
                .build();

        client = RedisClient.create(uri);
        client.setOptions(ClientOptions.builder()
                .socketOptions(socketOptions)
                .build());
        connection = client.connect();

        RedisCommands<String, String> commands = connection.sync();
        System.out.println(commands.set("foo", "bar"));
        System.out.println(commands.get("foo"));

        // When the application exits, close the connection and shut down the client to release resources.
        connection.close();
        client.shutdown();
    }
}
4. Use a different SDK, such as Jedis, to avoid this class of problem.
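To put measures 1 and 3 into numbers, the sketch below estimates how long Linux keeps retransmitting before it gives up on a connection for a given net.ipv4.tcp_retries2, and how long the keepalive settings used above take to declare an idle connection dead. It assumes the usual Linux bounds of 200 ms (minimum) and 120 s (maximum) for the retransmission timeout, pure exponential backoff, and no RTT variance, so the results are lower-bound estimates rather than exact values. The class and method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper for rough timeout arithmetic; not part of Lettuce or the kernel.
final class TcpTimeoutEstimates {
    private static final long RTO_MIN_MS = 200;      // assumed Linux minimum RTO
    private static final long RTO_MAX_MS = 120_000;  // assumed Linux maximum RTO

    /** Sums the backoff intervals: N retransmissions, then give up at the (N+1)th RTO. */
    static Duration retransmissionTimeout(int tcpRetries2) {
        long totalMs = 0;
        long rtoMs = RTO_MIN_MS;
        for (int i = 0; i <= tcpRetries2; i++) {
            totalMs += rtoMs;
            rtoMs = Math.min(rtoMs * 2, RTO_MAX_MS);
        }
        return Duration.ofMillis(totalMs);
    }

    /** Roughly idle + interval * count for a connection with no traffic in flight. */
    static Duration keepAliveDetection(Duration idle, Duration interval, int count) {
        return idle.plus(interval.multipliedBy(count));
    }

    public static void main(String[] args) {
        System.out.println("tcp_retries2=15 -> "
                + retransmissionTimeout(15).toMillis() / 1000.0 + " s");  // 924.6 s
        System.out.println("tcp_retries2=5  -> "
                + retransmissionTimeout(5).toMillis() / 1000.0 + " s");   // 12.6 s
        System.out.println("keepalive 30/10/3 -> "
                + keepAliveDetection(Duration.ofSeconds(30), Duration.ofSeconds(10), 3)
                        .getSeconds() + " s");                            // 60 s
    }
}
```

Under these assumptions, the default of 15 retries works out to about 924.6 seconds, which lines up with the roughly 15-minute hang described above, while a value of 5 brings the estimate down to about 12.6 seconds.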
References
TCP settings for Linux-hosted client applications: https://docs.azure.cn/zh-cn/azure-cache-for-redis/cache-best-practices-connection#tcp-settings-for-linux-hosted-client-applications
Add support for disconnect on timeout to recover early from no RST packet failures: https://github.com/redis/lettuce/issues/2082#issuecomment-2290496556