一次因TCP挥手导致的HTTP请求异常
线上业务需要http需要和多个上游交互,使用httpclient包装的请求类,通过日志扫描系统的异常(grep),发现到达某个上游的请求,每天都会有几次异常,导致请求失败,同时抛出 NoHTTPResponseException
!
通过Apache官方文档,目测是因为服务端过于拥挤导致连接未响应,直接被丢弃了,后经询问,服务器此时的QPS并不高
因为是服务器之前频繁的通信,所以最好使用长连接,http层加入keepalive机制。我们把httpclient的keepalive的时间设置成30秒,不过后面依然会出现,必定是另有原因。
通过Linux的tcpdump,我们抓取指定上游的tcp报文,当再次出现异常时,我们通过时间从dump的日志中有了实质性的发现
# 假定上游服务器地址: 11.22.33.44
# dump本机和指定主机交互的所有流量
nohup tcpdump -i eth0 host 11.22.33.44 -nnn &
dump下来的日志,格式如下:
11:29:43.939417 IP 11.22.33.44.8211 > 172.19.0.1.59024: Flags [P.], seq 2506:2511, ack 4122, win 78, length 5
11:29:43.939424 IP 172.19.0.1.59024 > 11.22.33.44.8211: Flags [.], ack 2511, win 312, length 0
11:29:43.941362 IP 11.22.33.44.8211 > 172.19.0.1.59046: Flags [.], ack 411, win 60, length 0
11:29:44.891319 IP 11.22.33.44.8211 > 172.19.0.1.59046: Flags [P.], seq 1:247, ack 411, win 60, length 246
11:29:44.891384 IP 172.19.0.1.59046 > 11.22.33.44.8211: Flags [.], ack 247, win 237, length 0
11:29:44.891322 IP 11.22.33.44.8211 > 172.19.0.1.59046: Flags [P.], seq 247:252, ack 411, win 60, length 5
11:29:44.891396 IP 172.19.0.1.59046 > 11.22.33.44.8211: Flags [.], ack 252, win 237, length 0
11:29:50.007828 IP 172.19.0.1.59046 > 11.22.33.44.8211: Flags [P.], seq 411:824, ack 252, win 237, length 413
11:29:50.030832 IP 11.22.33.44.8211 > 172.19.0.1.59046: Flags [.], ack 824, win 62, length 0
题外话:日志中发现,我们有很多端口和 8211 进行交互,也就是意味着有多个tcp连接
通过监控,我们发现异常出现在11:43:53.21出现了问题,在这个时间点附近的日志如下:
11:43:38.215638 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [P.], seq 749:754, ack 1240, win 64, length 5
11:43:38.215669 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [.], ack 754, win 254, length 0
11:43:52.898135 IP 11.22.33.44.8211 > 172.19.0.1.59614: Flags [F.], seq 252, ack 414, win 60, length 0
11:43:52.936947 IP 172.19.0.1.59614 > 11.22.33.44.8211: Flags [.], ack 253, win 237, length 0
11:43:53.212664 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [P.], seq 1240:1650, ack 754, win 254, length 410
11:43:53.215228 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [F.], seq 754, ack 1240, win 64, length 0
11:43:53.215294 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [F.], seq 1650, ack 755, win 254, length 0
11:43:53.235410 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [R], seq 1246386853, win 0, length 0
11:43:53.238071 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [R], seq 1246386854, win 0, length 0
11:43:54.747270 IP 172.19.0.1.59614 > 11.22.33.44.8211: Flags [F.], seq 414, ack 253, win 237, length 0
11:43:54.747378 IP 172.19.0.1.59622 > 11.22.33.44.8211: Flags [S], seq 2657401300, win 29200, options [mss 1460,sackOK,TS val 3040913527 ecr 0,nop,wscale 7], length 0
基本可以确定是本地端口 59602 和远程 8211 建立的这条tcp连接
在日志中通过端口号 59602 把所有的tcpdump信息全部取出,得到:
11:43:20.822142 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [S], seq 921702317, win 29200, options [mss 1460,sackOK,TS val 3040905046 ecr 0,nop,wscale 7], length 0
11:43:20.844940 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [S.], seq 1246386099, ack 921702318, win 29200, options [mss 1444,nop,nop,sackOK,nop,wscale 9], length 0
11:43:20.844966 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [.], ack 1, win 229, length 0
11:43:20.845028 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [P.], seq 1:414, ack 1, win 229, length 413
11:43:20.867813 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [.], ack 414, win 60, length 0
11:43:23.053412 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [P.], seq 1:247, ack 414, win 60, length 246
11:43:23.053461 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [.], ack 247, win 237, length 0
11:43:23.053415 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [P.], seq 247:252, ack 414, win 60, length 5
11:43:23.053473 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [.], ack 252, win 237, length 0
11:43:23.364308 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [P.], seq 414:827, ack 252, win 237, length 413
11:43:23.387087 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [.], ack 827, win 62, length 0
11:43:23.873071 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [P.], seq 252:498, ack 827, win 62, length 246
11:43:23.873112 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [.], ack 498, win 245, length 0
11:43:23.873074 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [P.], seq 498:503, ack 827, win 62, length 5
11:43:23.873122 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [.], ack 503, win 245, length 0
11:43:36.883015 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [P.], seq 827:1240, ack 503, win 245, length 413
11:43:36.905781 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [.], ack 1240, win 64, length 0
11:43:38.215636 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [P.], seq 503:749, ack 1240, win 64, length 246
11:43:38.215662 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [.], ack 749, win 254, length 0
11:43:38.215638 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [P.], seq 749:754, ack 1240, win 64, length 5
11:43:38.215669 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [.], ack 754, win 254, length 0
11:43:53.212664 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [P.], seq 1240:1650, ack 754, win 254, length 410
11:43:53.215228 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [F.], seq 754, ack 1240, win 64, length 0
11:43:53.215294 IP 172.19.0.1.59602 > 11.22.33.44.8211: Flags [F.], seq 1650, ack 755, win 254, length 0
11:43:53.235410 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [R], seq 1246386853, win 0, length 0
11:43:53.238071 IP 11.22.33.44.8211 > 172.19.0.1.59602: Flags [R], seq 1246386854, win 0, length 0
文字可能不太直观,同时ip和端口不太适合阅读,所以替换成client和server,截了张图:
为了不迷路,简单普及下tcpdump中flags的几个标志的意义:
- S:SYN标志位=1的报文 用于握手建立连接时
- P:PSH标志位=1的报文 用于发送业务数据时
- F:FIN标志位=1的报文 用于挥手断开连接时
- R:RST标志位=1的报文 用于重置异常连接时
- .:ACK标志位=1的报文 用于确认收到消息时
状态描述:
- 第28行,第三次数据交换结束,client向server发送ack报文,本次http交互结束
- 第31行,client第四次发起数据请求
- 第32行,服务器发回一个FIN报文,表示要中断TCP连接
问题:
- 前3次数据交换没有问题,为什么在第4次发送数据后,居然立马收到FIN报文?
- 28行和32行,时间差刚好15s,这个数字不像是随机的,应该是什么机制中设置的吧?
释疑:
- FIN包的ack是1240,这是client上一交互环节发送的最后一个报文,应该是没有收到31行报文,应该是client和server几乎同时发送了PSH和FIN报文,client以为TCP没断就放心发送报文了!
- 经核实,上游使用阿里云SLB负载均衡,长连接默认保持15秒,闲置15秒无数据交互就会主动断掉!同时,一般的服务器或反向代理也会有这个机制。同时也解释了为什么之前已经设置了30s的超时时间,依然会出现这种异常了,因为服务器说:
15秒 < 30秒,老弟,我先断了喂 拜拜~
那么,我们需要对httputil进行改造,使其能够支持失败请求重试,Apache的httpclient已经支持了,不过默认不重试,加上就好了。
httpClient = HttpClients.custom()
.setKeepAliveStrategy((response, context) -> DEFAULT_KEEP_ALIVE_TIME)
.setRetryHandler(DefaultHttpRequestRetryHandler.INSTANCE)
.setConnectionManager(cm)
.build();
如需定制化重试机制,实现HttpRequestRetryHandler即可,官网示例连接
在最后,热心的同事提到,如果网络时延15ms,15秒的keepalive timeout,则这种异常出现的几率是 15 ÷ 15000 = 1‰
,也不知道对不对,统计了下,好像差不多,哈哈