The Implementation of Linux TCP Congestion Control
If a transmitted segment is lost, TCP must retransmit it to guarantee reliability. But how does TCP know a segment was lost, and when should it send the retransmission?
The ACKs a TCP receiver returns only acknowledge up to the last contiguously received segment. That is, if segments arrive in the order 1, 2, 4, 5, 6, ..., then because segment 3 is missing, only the first two segments can be acknowledged no matter how many later ones arrive. Also note that the acknowledgment number in an ACK is the sequence number the receiver *expects* to receive next, not one it has *already* received. So if the sequence above starts at seq.num = 1000 and every segment carries len = 1000 bytes, the ACK number will be 3000 (the receiver expects the segment starting at sequence number 3000).
TCP can detect loss along two dimensions: time and events. The time dimension means that if no acknowledgment for a segment arrives within a certain interval, the segment is presumed lost. The event dimension is more precise: the receiver reports exactly which segments it has received, allowing the sender to infer which ones were probably lost.
Timeout retransmission is the time-based detection mechanism. If no ACK at all arrives for a long time, severe congestion can be assumed, so timeout retransmission comes with a very aggressive backoff policy: the congestion window is cut down to 1.
Calculating the RTO
Detecting loss via a timeout is called "retransmission timeout" (RTO-based retransmission). It relies on a retransmission timer, whose expiry interval is the RTO. The classic formula for the RTO is:
RTO = min[UBOUND, max[LBOUND, (BETA * SRTT)]]
where UBOUND is an upper bound on the timeout (e.g., 1 minute) and LBOUND a lower bound (e.g., 1 second).
SRTT is the smoothed RTT computed from each ACKed segment, and BETA is a multiplicative factor with a value between 1.3 and 2.0.
SRTT = (ALPHA * SRTT) + ((1 - ALPHA) * RTT)
SRTT is computed as above, where ALPHA takes a value of 0.8 or 0.9.
For the very first computation there is no previous SRTT to feed into the formula, so a constant initial value (e.g., 1 second) is used instead.
The timeout follows an exponential-growth rule: after each retransmission timeout, the RTO doubles.
RTO = RTO << 1
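To make the arithmetic concrete, here is a toy user-space sketch of the classic computation. It is an illustration only, not kernel code; the constants are assumptions picked from the ranges above.

#include <stdio.h>

#define LBOUND 1.0   /* assumed lower bound: 1 second    */
#define UBOUND 60.0  /* assumed upper bound: 1 minute    */
#define ALPHA  0.875 /* assumed smoothing gain (0.8~0.9) */
#define BETA   2.0   /* assumed variance factor (1.3~2.0) */

static double srtt = 1.0;	/* no sample yet: constant initial value */

/* Feed in one RTT sample (seconds), return the resulting RTO. */
static double update_rto(double rtt)
{
	double rto;

	srtt = ALPHA * srtt + (1.0 - ALPHA) * rtt;
	rto = BETA * srtt;
	if (rto < LBOUND)
		rto = LBOUND;
	if (rto > UBOUND)
		rto = UBOUND;
	return rto;
}

int main(void)
{
	double samples[] = { 0.20, 0.25, 0.22, 0.80, 0.21 };
	int i;

	for (i = 0; i < 5; i++) {
		double rto = update_rto(samples[i]);
		printf("rtt=%.2fs -> srtt=%.3fs rto=%.3fs\n",
		       samples[i], srtt, rto);
	}
	return 0;
}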
In the Linux 3.10 kernel, these quantities take the following values:
- UBOUND: the macro TCP_RTO_MAX in include/net/tcp.h, defined as ((unsigned)(120*HZ)).
- LBOUND: the macro TCP_RTO_MIN in include/net/tcp.h, defined as ((unsigned)(HZ/5)).
- Initial value: the macro TCP_TIMEOUT_INIT in include/net/tcp.h, defined as ((unsigned)(1*HZ)).
With HZ equal to 1000, UBOUND is 120 seconds, LBOUND is 200 milliseconds, and the initial value is 1 second.
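These are the corresponding definitions in include/net/tcp.h:

#define TCP_RTO_MAX	((unsigned)(120*HZ))
#define TCP_RTO_MIN	((unsigned)(HZ/5))
#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))	/* RFC6298 2.1 initial RTO value */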
The RTO value is stored in struct inet_connection_sock as the member icsk_rto; this structure is embedded in struct tcp_sock, which is defined in include/linux/tcp.h.
The SRTT is stored in struct tcp_sock as the member srtt_us. Note that what is stored is the real value shifted left by 3 bits (8 times the actual SRTT in microseconds); consumers shift it right by 3 before use, as __tcp_set_rto() below does. struct tcp_sock also keeps a smoothed value of the RTT deviation, in the member rttvar_us. It is updated alongside each RTT estimate, and it is ultimately combined with the SRTT to produce the RTO.
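For reference, the relevant struct tcp_sock fields, with the kernel's own comments (taken from the kernel generation the code excerpts below come from):

struct tcp_sock {
	/* ... */
	u32	srtt_us;	/* smoothed round trip time << 3 in usecs */
	u32	mdev_us;	/* medium deviation			*/
	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
	u32	rttvar_us;	/* smoothed mdev_max			*/
	/* ... */
};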
The RTO is then computed as RTO = SRTT + RTTVAR (rttvar_us is kept pre-scaled, so this corresponds to the classic srtt + 4*mdev):
/* Calculate rto without backoff. This is the second half of Van Jacobson's
 * routine referred to above.
 */
static void tcp_set_rto(struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	/* Old crap is replaced with new one. 8)
	 *
	 * More seriously:
	 * 1. If rtt variance happened to be less 50msec, it is hallucination.
	 *    It cannot be less due to utterly erratic ACK generation made
	 *    at least by solaris and freebsd. "Erratic ACKs" has _nothing_
	 *    to do with delayed acks, because at cwnd>2 true delack timeout
	 *    is invisible. Actually, Linux-2.4 also generates erratic
	 *    ACKs in some circumstances.
	 */
	inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);

	/* 2. Fixups made earlier cannot be right.
	 *    If we do not estimate RTO correctly without them,
	 *    all the algo is pure shit and should be replaced
	 *    with correct one. It is exactly, which we pretend to do.
	 */

	/* NOTE: clamping at TCP_RTO_MIN is not required, current algo
	 * guarantees that rto is higher.
	 */
	tcp_bound_rto(sk);
}

static inline void tcp_bound_rto(const struct sock *sk)
{
	if (inet_csk(sk)->icsk_rto > TCP_RTO_MAX)
		inet_csk(sk)->icsk_rto = TCP_RTO_MAX;
}

static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
{
	return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
}
The function that computes the SRTT is tcp_rtt_estimator():
/* Called to compute a smoothed rtt estimate. The data fed to this
* routine either comes from timestamps, or from segments that were
* known _not_ to have been retransmitted [see Karn/Partridge
* Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
* piece by Van Jacobson.
* NOTE: the next three routines used to be one big routine.
* To save cycles in the RFC 1323 implementation it was better to break
* it up into three procedures. -- erics
*/
static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
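Its body, abridged here (a few slow-path damping adjustments for negative error are trimmed), implements Jacobson's gain-1/8 SRTT filter and gain-1/4 deviation filter:

static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
{
	struct tcp_sock *tp = tcp_sk(sk);
	long m = mrtt_us;	/* RTT */
	u32 srtt = tp->srtt_us;

	if (srtt != 0) {
		m -= (srtt >> 3);	/* m is now error in rtt est */
		srtt += m;		/* rtt = 7/8 rtt + 1/8 new */
		if (m < 0)
			m = -m;		/* m is now abs(error) */
		m -= (tp->mdev_us >> 2);	/* similar update on mdev */
		tp->mdev_us += m;	/* mdev = 3/4 mdev + 1/4 new */
		if (tp->mdev_us > tp->mdev_max_us) {
			tp->mdev_max_us = tp->mdev_us;
			if (tp->mdev_max_us > tp->rttvar_us)
				tp->rttvar_us = tp->mdev_max_us;
		}
	} else {
		/* no previous measure. */
		srtt = m << 3;		/* take the measured time to be rtt */
		tp->mdev_us = m << 1;	/* make sure rto = 3*rtt */
		tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
		tp->mdev_max_us = tp->rttvar_us;
	}
	tp->srtt_us = max(1U, srtt);
}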
Both the SRTT and the RTO are updated from tcp_ack_update_rtt():
static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
				      long seq_rtt_us, long sack_rtt_us)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	/* Prefer RTT measured from ACK's timing to TS-ECR. This is because
	 * broken middle-boxes or peers may corrupt TS-ECR fields. But
	 * Karn's algorithm forbids taking RTT if some retransmitted data
	 * is acked (RFC6298).
	 */
	if (seq_rtt_us < 0)
		seq_rtt_us = sack_rtt_us;

	/* RTTM Rule: A TSecr value received in a segment is used to
	 * update the averaged RTT measurement only if the segment
	 * acknowledges some new data, i.e., only if it advances the
	 * left edge of the send window.
	 * See draft-ietf-tcplw-high-performance-00, section 3.3.
	 */
	if (seq_rtt_us < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
	    flag & FLAG_ACKED)
		seq_rtt_us = jiffies_to_usecs(tcp_time_stamp - tp->rx_opt.rcv_tsecr);

	if (seq_rtt_us < 0)
		return false;

	tcp_rtt_estimator(sk, seq_rtt_us);
	tcp_set_rto(sk);

	/* RFC6298: only reset backoff on valid RTT measurement. */
	inet_csk(sk)->icsk_backoff = 0;
	return true;
}
The full call chain is:
tcp_rcv_established=>start: tcp_rcv_established
tcp_ack=>operation: tcp_ack
tcp_clean_rtx_queue=>operation: tcp_clean_rtx_queue
tcp_ack_update_rtt=>operation: tcp_ack_update_rtt
tcp_rtt_estimator=>operation: tcp_rtt_estimator
tcp_set_rto=>operation: tcp_set_rto
tcp_rcv_established->tcp_ack->tcp_clean_rtx_queue->tcp_ack_update_rtt->tcp_rtt_estimator->tcp_set_rto
Initializing the retransmission timer
The retransmission timer's pending-event type is ICSK_TIME_RETRANS. The timer itself, icsk_retransmit_timer, is initialized in inet_csk_init_xmit_timers():
/*
 * Using different timers for retransmit, delayed acks and probes
 * We may wish use just one timer maintaining a list of expire jiffies
 * to optimize.
 */
void inet_csk_init_xmit_timers(struct sock *sk,
			       void (*retransmit_handler)(unsigned long),
			       void (*delack_handler)(unsigned long),
			       void (*keepalive_handler)(unsigned long))
{
	struct inet_connection_sock *icsk = inet_csk(sk);

	setup_timer(&icsk->icsk_retransmit_timer, retransmit_handler,
		    (unsigned long)sk);
	setup_timer(&icsk->icsk_delack_timer, delack_handler,
		    (unsigned long)sk);
	setup_timer(&sk->sk_timer, keepalive_handler, (unsigned long)sk);
	icsk->icsk_pending = icsk->icsk_ack.pending = 0;
}
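TCP supplies the concrete handlers through tcp_init_xmit_timers() in net/ipv4/tcp_timer.c:

void tcp_init_xmit_timers(struct sock *sk)
{
	inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
				  &tcp_keepalive_timer);
}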
The retransmit handler registered here, void (*retransmit_handler)(unsigned long), is tcp_write_timer. Its main processing logic lives in tcp_write_timer_handler, and the ICSK_TIME_RETRANS event is ultimately handled by tcp_retransmit_timer. The whole registration flow is:
tcp_init_sock=>start: tcp_init_sock
tcp_init_xmit_timers=>operation: tcp_init_xmit_timers
inet_csk_init_xmit_timers=>operation: inet_csk_init_xmit_timers
register=>operation: Register fn tcp_write_timer
tcp_init_sock->tcp_init_xmit_timers->inet_csk_init_xmit_timers->register
void tcp_write_timer_handler(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	int event;

	if (sk->sk_state == TCP_CLOSE || !icsk->icsk_pending)
		goto out;

	if (time_after(icsk->icsk_timeout, jiffies)) {
		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
		goto out;
	}

	event = icsk->icsk_pending;

	switch (event) {
	/* ..... */
	case ICSK_TIME_RETRANS:
		icsk->icsk_pending = 0;
		tcp_retransmit_timer(sk);
		break;
	/* ..... */
	}

out:
	sk_mem_reclaim(sk);
}
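tcp_write_timer itself is only a locking wrapper around this handler; when the socket is owned by the user, the work is deferred to tcp_release_cb():

static void tcp_write_timer(unsigned long data)
{
	struct sock *sk = (struct sock *)data;

	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk)) {
		tcp_write_timer_handler(sk);
	} else {
		/* delegate our work to tcp_release_cb() */
		if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
			sock_hold(sk);
	}
	bh_unlock_sock(sk);
	sock_put(sk);
}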
Arming the retransmission timer
The retransmission timer must be installed when the first segment of the current window is transmitted. There is no separate timer per transmitted segment; instead, a single retransmission timer covers the whole burst that makes up the current window.
struct tcp_sock has a member, packets_out, that counts the segments currently in the network (in-flight data). When packets_out is 0 at send time, the retransmission timer gets armed.
The arming call is inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX), wrapped by tcp_rearm_rto():
/* Restart timer after forward progress on connection.
 * RFC2988 recommends to restart timer to now+rto.
 */
void tcp_rearm_rto(struct sock *sk)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	/* If the retrans timer is currently being used by Fast Open
	 * for SYN-ACK retrans purpose, stay put.
	 */
	if (tp->fastopen_rsk)
		return;

	if (!tp->packets_out) {
		inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
	} else {
		u32 rto = inet_csk(sk)->icsk_rto;

		/* Offset the time elapsed after installing regular RTO */
		if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
		    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
			struct sk_buff *skb = tcp_write_queue_head(sk);
			const u32 rto_time_stamp =
				tcp_skb_timestamp(skb) + rto;
			s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
			/* delta may not be positive if the socket is locked
			 * when the retrans timer fires and is rescheduled.
			 */
			rto = max_t(int, delta, 1);
		}
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
					  TCP_RTO_MAX);
	}
}
tcp_rearm_rto() is invoked from tcp_event_new_data_sent(), under the condition that packets_out was 0 before the segment was sent.
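Abridged, the call site looks as follows (taken from the same kernel generation as the excerpts above):

static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	unsigned int prior_packets = tp->packets_out;

	tcp_advance_send_head(sk, skb);
	tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;

	tp->packets_out += tcp_skb_pcount(skb);
	if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
		tcp_rearm_rto(sk);
	/* ... */
}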
The call flow is:
st=>start: tcp_write_xmit
end=>end: end
tcp_event_new_data_sent=>operation: tcp_event_new_data_sent
check_packet_out=>condition: prior packets_out == 0 ?
tcp_rearm_rto=>operation: tcp_rearm_rto
st->tcp_event_new_data_sent->check_packet_out
check_packet_out(no)->end
check_packet_out(yes)->tcp_rearm_rto
inet_csk_reset_xmit_timer=>operation: inet_csk_reset_xmit_timer(ICSK_TIME_RETRANS)
tcp_rearm_rto->inet_csk_reset_xmit_timer->end
Each time an ACK updates the RTT, the timer must be reset, and once everything in flight has been ACKed (packets_out drops to 0) the timer is deleted. Both are handled via tcp_clean_rtx_queue(). Note that this rearming mechanism extends the time a segment may wait for its ACK, because every reset starts a fresh RTO measured from the current time.
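The relevant fragment of tcp_clean_rtx_queue(), heavily abridged:

static int tcp_clean_rtx_queue(struct sock *sk /* ... */)
{
	/* ... walk the retransmit queue, free fully-ACKed skbs,
	 * collect RTT samples ...
	 */
	if (flag & FLAG_ACKED) {
		/* ... */
		tcp_rearm_rto(sk);	/* restart at now + RTO; tcp_rearm_rto()
					 * clears the timer if packets_out == 0
					 */
		/* ... */
	}
	/* ... */
}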
When segments are transmitted during loss recovery, the retransmission timer must also be reset if the segment being retransmitted is the first one in the retransmit queue. This happens in tcp_xmit_retransmit_queue():
/* This gets called after a retransmit timeout, and the initially
 * retransmitted data is acknowledged. It tries to continue
 * resending the rest of the retransmit queue, until either
 * we've sent it all or the congestion window limit is reached.
 * If doing SACK, the first ACK which comes back for a timeout
 * based retransmit packet might feed us FACK information again.
 * If so, we use it to avoid unnecessarily retransmissions.
 */
void tcp_xmit_retransmit_queue(struct sock *sk)
{
	/* ...... */
	if (skb == tcp_write_queue_head(sk))
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
					  inet_csk(sk)->icsk_rto,
					  TCP_RTO_MAX);
}
The corresponding path from tcp_ack() is:
tcp_ack=>start: tcp_ack
tcp_fastretrans_alert=>operation: tcp_fastretrans_alert
TCP_CA_Loss=>condition: In TCP_CA_Loss State ?
tcp_process_loss=>operation: tcp_process_loss
tcp_xmit_retransmit_queue=>operation: tcp_xmit_retransmit_queue
check_header=>condition: skb == tcp_write_queue_head(sk) ?
tcp_ack->tcp_fastretrans_alert->TCP_CA_Loss
TCP_CA_Loss(yes)->tcp_process_loss->tcp_xmit_retransmit_queue->check_header
inet_csk_reset_xmit_timer=>operation: inet_csk_reset_xmit_timer(ICSK_TIME_RETRANS)
check_header(yes)->inet_csk_reset_xmit_timer
The timeout retransmission handler
The timeout retransmission handler is tcp_retransmit_timer, which mainly does the following (an abridged skeleton follows this list):
- Check the current socket state.
- Check whether the retransmission limit has been exceeded (tcp_write_timeout()).
- Enter the Loss state and begin slow start (tcp_enter_loss()).
- Retransmit the lost segment. If this retransmission itself fails, the failure stems from local congestion, so the RTO is left unchanged and the timer is simply re-armed.
- Update icsk_backoff and icsk_retransmits. Note that the RTT is not updated.
- Back the RTO off exponentially: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
- Arm the retransmission timer again.
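An abridged skeleton of tcp_retransmit_timer() (state checks and F-RTO details elided):

void tcp_retransmit_timer(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);

	/* ... socket state sanity checks ... */

	if (tcp_write_timeout(sk))	/* retried too many times: give up */
		goto out;

	/* ... */
	tcp_enter_loss(sk);		/* Loss state, cwnd back to 1 */

	if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
		/* Retransmission failed because of local congestion,
		 * do not backoff.
		 */
		if (!icsk->icsk_retransmits)
			icsk->icsk_retransmits = 1;
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
					  min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
					  TCP_RTO_MAX);
		goto out;
	}

	icsk->icsk_backoff++;
	icsk->icsk_retransmits++;

	/* ... exponential backoff, then re-arm ... */
	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
				  TCP_RTO_MAX);
out:;
}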
Entering slow start is implemented in tcp_enter_loss():
/* Enter Loss state. If we detect SACK reneging, forget all SACK information
 * and reset tags completely, otherwise preserve SACKs. If receiver
 * dropped its ofo queue, we will know this due to reneging detection.
 */
void tcp_enter_loss(struct sock *sk)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb;
	bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
	bool is_reneg;			/* is receiver reneging on SACKs? */

	/* Reduce ssthresh if it has not yet been made inside this window. */
	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
	    !after(tp->high_seq, tp->snd_una) ||
	    (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
		tp->prior_ssthresh = tcp_current_ssthresh(sk);
		tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);	/* ssthresh is supplied by the congestion control algorithm */
		tcp_ca_event(sk, CA_EVENT_LOSS);	/* notify the CC algorithm of the LOSS event */
		tcp_init_undo(tp);
	}
	tp->snd_cwnd	   = 1;		/* congestion window cut to its minimum */
	tp->snd_cwnd_cnt   = 0;
	tp->snd_cwnd_stamp = tcp_time_stamp;

	tp->retrans_out = 0;
	tp->lost_out = 0;
	/* .... */
	tcp_set_ca_state(sk, TCP_CA_Loss);	/* CA state becomes Loss; the CC algorithm will be notified */
	/* ... */
}
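As a concrete example of the ssthresh hook called above, the default Reno implementation in net/ipv4/tcp_cong.c halves the current window:

/* Slow start threshold is half the congestion window (min 2) */
u32 tcp_reno_ssthresh(struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	return max(tp->snd_cwnd >> 1U, 2U);
}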