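Redis Series: Sentinel Failover

Failover

This chapter continues from the previous one, which built the sentinel network, and analyzes sentinel failover. Sentinel is the high-availability solution for Redis's distributed storage, and failover is the core problem that solution has to solve. Before analyzing how sentinel handles failover, consider three questions:

  1. How is a failure confirmed?
  2. Once a failure has occurred, who performs the failover?
  3. How is the failover carried out?

In my view, every failover scheme has to answer these three questions. In Redis, the sentinel nodes are the ones that answer them, while masters and slaves are the objects sentinel operates on, so sentinel plays the role of supervisor. In practice a master is usually monitored jointly by a cluster of sentinels, which gives the sentinel cluster a degree of fault tolerance: when one sentinel node fails, the sentinel system as a whole can keep serving.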

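Confirming the failure

Subjective down (SDOWN)

From the network-construction code in the previous chapter we know that, through a periodic time event, sentinel sends the INFO command to masters and slaves every 10s and the PING command at least once every 1s. The PING command is what probes the master.

  • When sentinel sends PING to the master, a reply counts as valid only if it is one of +PONG, -LOADING, or -MASTERDOWN. If sentinel receives no valid reply for the whole configured down-after-milliseconds interval, it sets the SRI_S_DOWN flag in the flags field of the corresponding sentinelRedisInstance and considers the instance subjectively down.
  • flags is an int. The C standard guarantees at least 16 bits for int, and on the platforms Redis typically runs on it is 4 bytes (32 bits); the flags listed here use only the low 13 bits. The flag bits are defined as follows: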
#define SRI_MASTER  (1<<0)
#define SRI_SLAVE   (1<<1)
#define SRI_SENTINEL (1<<2)
#define SRI_S_DOWN (1<<3)   /* Subjectively down (no quorum). */
#define SRI_O_DOWN (1<<4)   /* Objectively down (confirmed by others). */
#define SRI_MASTER_DOWN (1<<5) /* A Sentinel with this flag set thinks that
                                   its master is down. */
#define SRI_FAILOVER_IN_PROGRESS (1<<6) /* Failover is in progress for
                                           this master. */
#define SRI_PROMOTED (1<<7)            /* Slave selected for promotion. */
#define SRI_RECONF_SENT (1<<8)     /* SLAVEOF <newmaster> sent. */
#define SRI_RECONF_INPROG (1<<9)   /* Slave synchronization in progress. */
#define SRI_RECONF_DONE (1<<10)     /* Slave synchronized with new master. */
#define SRI_FORCE_FAILOVER (1<<11)  /* Force failover with master up. */
#define SRI_SCRIPT_KILL_SENT (1<<12) /* SCRIPT KILL already sent on -BUSY */

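For example:

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1

Bits 0 through 12 correspond, top to bottom, to the flags in the list above. Recording state this way saves space and lets one field express several states at once: each bit is 1 if the instance is in that state and 0 if it is not. A simple bitwise OR sets a flag without disturbing the others.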
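The bit operations described above can be sketched in a few lines. This is a minimal illustration, not code from sentinel.c: the three flag values are copied from the listing, while flags_set, flags_clear, flags_has and flags_demo are hypothetical helper names introduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values copied from the listing above. */
#define SRI_MASTER  (1<<0)
#define SRI_S_DOWN  (1<<3)
#define SRI_O_DOWN  (1<<4)

/* Hypothetical helpers (not part of sentinel.c). */
static int  flags_set(int flags, int mask)   { return flags | mask; }
static int  flags_clear(int flags, int mask) { return flags & ~mask; }
static bool flags_has(int flags, int mask)   { return (flags & mask) != 0; }

/* Walk through set, test and clear; returns the final flags value. */
static int flags_demo(void) {
    int flags = SRI_MASTER;                  /* a master instance */
    flags = flags_set(flags, SRI_S_DOWN);    /* mark it subjectively down */
    assert(flags_has(flags, SRI_MASTER));    /* other bits are untouched */
    assert(flags_has(flags, SRI_S_DOWN));
    flags = flags_clear(flags, SRI_S_DOWN);  /* it answered again */
    assert(!flags_has(flags, SRI_S_DOWN));
    return flags;
}
```

The listings in this chapter use the same idioms inline, for example ri->flags |= SRI_S_DOWN to set a flag and ri->flags &= ~SRI_O_DOWN to clear one.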

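  • down-after-milliseconds can be set in the sentinel.conf configuration file or by sending a command (SENTINEL SET) from a client connected to sentinel. It is configured per master: sentinel can use a different down-after-milliseconds value for each master it monitors, and the slaves and sentinels attached to that master inherit the value, which is used to decide whether each of those nodes is down.
  • The PING probing described here is not specific to masters; slave and sentinel instances are probed the same way. The master is used as the example.
  • Because the subjective-down decision depends on down-after-milliseconds, and that value is configurable, sentinels monitoring the same master can be configured with different subjective-down timeouts.

In each timer cycle, sentinel iterates over its instances and checks whether each one is subjectively down. This periodic event was covered in the previous chapter.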
sentinelHandleRedisInstance

/* Perform scheduled operations for the specified Redis instance. */
void sentinelHandleRedisInstance(sentinelRedisInstance *ri) {
    /* ========== MONITORING HALF ============ */
    /* Every kind of instance */
    sentinelReconnectInstance(ri);
    sentinelSendPeriodicCommands(ri);

    /* ============== ACTING HALF ============= */
    /* We don't proceed with the acting half if we are in TILT mode.
     * TILT happens when we find something odd with the time, like a
     * sudden change in the clock. */
    if (sentinel.tilt) {
        if (mstime()-sentinel.tilt_start_time < SENTINEL_TILT_PERIOD) return;
        sentinel.tilt = 0;
        sentinelEvent(LL_WARNING,"-tilt",NULL,"#tilt mode exited");
    }

    /* Every kind of instance */
    sentinelCheckSubjectivelyDown(ri);

    /* Masters and slaves */
    if (ri->flags & (SRI_MASTER|SRI_SLAVE)) {
        /* Nothing so far. */
    }

    /* Only masters */
    if (ri->flags & SRI_MASTER) {
        sentinelCheckObjectivelyDown(ri);
        if (sentinelStartFailoverIfNeeded(ri))
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
        sentinelFailoverStateMachine(ri);
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }
}

sentinelCheckSubjectivelyDown

/* Is this instance down from our point of view? */
void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
    mstime_t elapsed = 0;

    if (ri->link->act_ping_time)
        elapsed = mstime() - ri->link->act_ping_time;
    else if (ri->link->disconnected)
        elapsed = mstime() - ri->link->last_avail_time;

    /* Check if we are in need for a reconnection of one of the
     * links, because we are detecting low activity.
     *
     * 1) Check if the command link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have a
     *    pending ping for more than half the timeout. */
    if (ri->link->cc &&
        (mstime() - ri->link->cc_conn_time) >
        SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        ri->link->act_ping_time != 0 && /* There is a pending ping... */
        /* The pending ping is delayed, and we did not received
         * error replies as well. */
        (mstime() - ri->link->act_ping_time) > (ri->down_after_period/2) &&
        (mstime() - ri->link->last_pong_time) > (ri->down_after_period/2))
    {
        instanceLinkCloseConnection(ri->link,ri->link->cc);
    }

    /* 2) Check if the pubsub link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have no
     *    activity in the Pub/Sub channel for more than
     *    SENTINEL_PUBLISH_PERIOD * 3.
     */
    if (ri->link->pc &&
        (mstime() - ri->link->pc_conn_time) >
         SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        (mstime() - ri->link->pc_last_activity) > (SENTINEL_PUBLISH_PERIOD*3))
    {
        instanceLinkCloseConnection(ri->link,ri->link->pc);
    }

    /* Update the SDOWN flag. We believe the instance is SDOWN if:
     *
     * 1) It is not replying.
     * 2) We believe it is a master, it reports to be a slave for enough time
     *    to meet the down_after_period, plus enough time to get two times
     *    INFO report from the instance. */
    if (elapsed > ri->down_after_period ||
        (ri->flags & SRI_MASTER &&
         ri->role_reported == SRI_SLAVE &&  mstime() - ri->role_reported_time >
          (ri->down_after_period+SENTINEL_INFO_PERIOD*2)))
    {
        /* Is subjectively down */
        if ((ri->flags & SRI_S_DOWN) == 0) {
            sentinelEvent(LL_WARNING,"+sdown",ri,"%@");
            ri->s_down_since_time = mstime();
            ri->flags |= SRI_S_DOWN;
        }
    } else {
        /* Is subjectively up */
        if (ri->flags & SRI_S_DOWN) {
            sentinelEvent(LL_WARNING,"-sdown",ri,"%@");
            ri->flags &= ~(SRI_S_DOWN|SRI_SCRIPT_KILL_SENT);
        }
    }
}
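  • Check whether the command connection needs to be closed.
  • Check whether the pubsub connection needs to be closed.
  • Update the SDOWN flag. The instance is marked subjectively down in either of two cases. First, it has given no valid reply for longer than down_after_period (30s by default). Second, sentinel believes the instance is a master, but the instance has been reporting itself as a slave for longer than down_after_period+SENTINEL_INFO_PERIOD*2 (30s+20s with the defaults), which is the timeout plus enough time for two INFO reports.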

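Objective down (ODOWN)

When a sentinel detects that the master is down and has marked it SRI_S_DOWN in its own state, it cannot act on that alone: it is part of a sentinel cluster, and each node may use a different interval for judging the master down. It must therefore ask the other sentinel nodes whether they also consider this master down. That raises three questions:

  1. How is the query broadcast to the other sentinel nodes to get their view of a given master?
  2. How are the results counted?
  3. How do all nodes end up with a consistent view of the master's state?

These three questions expose two problems common to distributed solutions:

  • Broadcasting commands.
  • Reaching consensus so that state is eventually consistent.

While reading how sentinel solves them, it is worth thinking about how you would solve the same problems yourself, and whether your solution could do better than sentinel's.

Broadcasting the master-down query

The entry point is again the sentinelHandleRedisInstance code shown above: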

...
 /* Only masters */
    if (ri->flags & SRI_MASTER) {
        sentinelCheckObjectivelyDown(ri);
        if (sentinelStartFailoverIfNeeded(ri))
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
        sentinelFailoverStateMachine(ri);
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }

In this code, sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS); is the call through which a sentinel, having judged a monitored master subjectively down, asks the other sentinels for their view.

/* If we think the master is down, we start sending
 * SENTINEL IS-MASTER-DOWN-BY-ADDR requests to other sentinels
 * in order to get the replies that allow to reach the quorum
 * needed to mark the master in ODOWN state and trigger a failover. */
#define SENTINEL_ASK_FORCED (1<<0)
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;

    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;

        /* If the master state from other sentinel is too old, we clear it. */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            ri->leader = NULL;
        }

        /* Only ask if master is down to other sentinels if:
         *
         * 1) We believe it is down, or there is a failover in progress.
         * 2) Sentinel is connected.
         * 3) We did not received the info within SENTINEL_ASK_PERIOD ms. */
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        if (ri->link->disconnected) continue;
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;

        /* Ask */
        ll2string(port,sizeof(port),master->addr->port);
        retval = redisAsyncCommand(ri->link->cc,
                    sentinelReceiveIsMasterDownReply, ri,
                    "SENTINEL is-master-down-by-addr %s %s %llu %s",
                    master->addr->ip, port,
                    sentinel.current_epoch,
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    sentinel.myid : "*");
        if (retval == C_OK) ri->link->pending_commands++;
    }
    dictReleaseIterator(di);
}

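The function iterates over the sentinels dict it maintains and sends SENTINEL is-master-down-by-addr to each of the other sentinel nodes. The command format is SENTINEL is-master-down-by-addr <master-ip> <master-port> <current_epoch> <leader_id>, where leader_id is * as long as the sentinel is only asking about objective down and has not started a failover. Next, look at sentinelReceiveIsMasterDownReply, the callback that parses the reply: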

/* Receive the SENTINEL is-master-down-by-addr reply, see the
 * sentinelAskMasterStateToOtherSentinels() function for more information. */
void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) {
    sentinelRedisInstance *ri = privdata;
    instanceLink *link = c->data;
    redisReply *r;

    if (!reply || !link) return;
    link->pending_commands--;
    r = reply;

    /* Ignore every error or unexpected reply.
     * Note that if the command returns an error for any reason we'll
     * end clearing the SRI_MASTER_DOWN flag for timeout anyway. */
    if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 &&
        r->element[0]->type == REDIS_REPLY_INTEGER &&
        r->element[1]->type == REDIS_REPLY_STRING &&
        r->element[2]->type == REDIS_REPLY_INTEGER)
    {
        ri->last_master_down_reply_time = mstime();
        if (r->element[0]->integer == 1) {
            ri->flags |= SRI_MASTER_DOWN;
        } else {
            ri->flags &= ~SRI_MASTER_DOWN;
        }
        if (strcmp(r->element[1]->str,"*")) {
            /* If the runid in the reply is not "*" the Sentinel actually
             * replied with a vote. */
            sdsfree(ri->leader);
            if ((long long)ri->leader_epoch != r->element[2]->integer)
                serverLog(LL_WARNING,
                    "%s voted for %s %llu", ri->name,
                    r->element[1]->str,
                    (unsigned long long) r->element[2]->integer);
            ri->leader = sdsnew(r->element[1]->str);
            ri->leader_epoch = r->element[2]->integer;
        }
    }
}

The command returns three values, for example:

127.0.0.1:26380> sentinel is-master-down-by-addr 127.0.0.1 6379 0 *
1) (integer) 0
2) "*"
3) (integer) 0
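  1. <down_state>: the master's down state, 0 for not down and 1 for down. When the reply says down, the SRI_MASTER_DOWN flag (1<<5) is set in the flags that the asking sentinel keeps for the replying sentinel.
  2. <leader_runid>: the runid of the leader sentinel. For a plain objective-down check it is *, because the request was sent with <leader_id> set to *.
  3. <leader_epoch>: the current voting epoch. When the runid is *, this value is always 0.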
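To make the link between the request's last argument and the reply concrete, here is a small sketch of how the request text is put together and how a receiver can tell a plain query from a vote request. The two function names are hypothetical; the format string follows the one passed to redisAsyncCommand in the listing above, with the port formatted directly as an integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the request text into buf and returns the
 * snprintf() result. Pass runid == NULL to ask only for the down state. */
static int format_is_master_down_by_addr(char *buf, size_t size,
                                         const char *ip, int port,
                                         unsigned long long epoch,
                                         const char *runid)
{
    assert(buf != NULL && ip != NULL);
    return snprintf(buf, size,
                    "SENTINEL is-master-down-by-addr %s %d %llu %s",
                    ip, port, epoch, runid ? runid : "*");
}

/* Receiver side: any last argument other than "*" asks for a vote. */
static bool is_vote_request(const char *runid_arg) {
    assert(runid_arg != NULL);
    return strcmp(runid_arg, "*") != 0;
}
```

A request built with runid == NULL matches the redis-cli example above and gets * back as <leader_runid>.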

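The same command is also used for leader election, described later. After a sentinel has asked the others about the master's state, it performs the objective-down check in the next cycle. Before that, there is one more piece of logic to analyze: how a sentinel handles an incoming sentinel is-master-down-by-addr command. Recall the command table loaded at initialization in the previous chapter; the relevant part of sentinelCommand, the function registered for the sentinel command, is: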

...
 else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
        /* SENTINEL IS-MASTER-DOWN-BY-ADDR <ip> <port> <current-epoch> <runid>
         *
         * Arguments:
         *
         * ip and port are the ip and port of the master we want to be
         * checked by Sentinel. Note that the command will not check by
         * name but just by master, in theory different Sentinels may monitor
         * different masters with the same name.
         *
         * current-epoch is needed in order to understand if we are allowed
         * to vote for a failover leader or not. Each Sentinel can vote just
         * one time per epoch.
         *
         * runid is "*" if we are not seeking for a vote from the Sentinel
         * in order to elect the failover leader. Otherwise it is set to the
         * runid we want the Sentinel to vote if it did not already voted.
         */
        sentinelRedisInstance *ri;
        long long req_epoch;
        uint64_t leader_epoch = 0;
        char *leader = NULL;
        long port;
        int isdown = 0;

        if (c->argc != 6) goto numargserr;
        if (getLongFromObjectOrReply(c,c->argv[3],&port,NULL) != C_OK ||
            getLongLongFromObjectOrReply(c,c->argv[4],&req_epoch,NULL)
                                                              != C_OK)
            return;
        ri = getSentinelRedisInstanceByAddrAndRunID(sentinel.masters,
            c->argv[2]->ptr,port,NULL);

        /* It exists? Is actually a master? Is subjectively down? It's down.
         * Note: if we are in tilt mode we always reply with "0". */
        if (!sentinel.tilt && ri && (ri->flags & SRI_S_DOWN) &&
                                    (ri->flags & SRI_MASTER))
            isdown = 1;

        /* Vote for the master (or fetch the previous vote) if the request
         * includes a runid, otherwise the sender is not seeking for a vote. */
        if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
            leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
                                            c->argv[5]->ptr,
                                            &leader_epoch);
        }
...

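As this code shows, when a sentinel receives a master-down query from another sentinel, it simply reads the flags stored in its own instance object for that master. It does not trigger a fresh probe or any other action.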

Checking the master's objective-down state

Only master nodes are checked for objective down. The code is:
sentinelCheckObjectivelyDown

/* Is this instance down according to the configured quorum?
 *
 * Note that ODOWN is a weak quorum, it only means that enough Sentinels
 * reported in a given time range that the instance was not reachable.
 * However messages can be delayed so there are no strong guarantees about
 * N instances agreeing at the same time about the down state. */
void sentinelCheckObjectivelyDown(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    unsigned int quorum = 0, odown = 0;

    if (master->flags & SRI_S_DOWN) {
        /* Is down for enough sentinels? */
        quorum = 1; /* the current sentinel. */
        /* Count all the other sentinels. */
        di = dictGetIterator(master->sentinels);
        while((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *ri = dictGetVal(de);

            if (ri->flags & SRI_MASTER_DOWN) quorum++;
        }
        dictReleaseIterator(di);
        if (quorum >= master->quorum) odown = 1;
    }

    /* Set the flag accordingly to the outcome. */
    if (odown) {
        if ((master->flags & SRI_O_DOWN) == 0) {
            sentinelEvent(LL_WARNING,"+odown",master,"%@ #quorum %d/%d",
                quorum, master->quorum);
            master->flags |= SRI_O_DOWN;
            master->o_down_since_time = mstime();
        }
    } else {
        if (master->flags & SRI_O_DOWN) {
            sentinelEvent(LL_WARNING,"-odown",master,"%@");
            master->flags &= ~SRI_O_DOWN;
        }
    }
}

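Once the replies have been processed and recorded in the corresponding sentinel instances, the sentinel counts how many sentinels (itself included) consider the master down and compares that number with the quorum configured in sentinel.conf. If the count reaches or exceeds quorum, the master is considered objectively down: its SRI_O_DOWN flag is set, and the leader election that follows can begin.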
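The counting step is small enough to restate on its own. The sketch below assumes the peers' flags are available as a plain array instead of the master->sentinels dict; is_objectively_down is a hypothetical name, and the two flag values are copied from the flag listing earlier in this chapter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SRI_S_DOWN      (1<<3)
#define SRI_MASTER_DOWN (1<<5)

/* peer_flags[i] holds the flags this sentinel keeps for peer i. */
static bool is_objectively_down(int master_flags, const int *peer_flags,
                                size_t npeers, unsigned int quorum)
{
    assert(peer_flags != NULL || npeers == 0);

    /* Without our own SDOWN judgment the peers are not even counted. */
    if (!(master_flags & SRI_S_DOWN)) return false;

    unsigned int agree = 1;              /* this sentinel itself */
    for (size_t i = 0; i < npeers; i++)
        if (peer_flags[i] & SRI_MASTER_DOWN) agree++;

    return agree >= quorum;
}
```

With quorum set to 1, a single sentinel's subjective judgment is already enough for objective down, which is why the quorum is usually chosen with the number of sentinels in mind.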

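Question: on the way from subjective down to objective down, sentinel uses no special algorithm for either the command broadcast or the agreement that the master is objectively down. Does the leader election then start only after all sentinel nodes have reached the same state?
Answer: Judging from the code, sentinel makes almost no deliberate effort to synchronize the master's objective state across the cluster. A sentinel can reach objective down only after enough sentinels (at least quorum, counting itself) have each judged the master subjectively down through their own periodic checks. Objective down is a consensus that forms gradually as commands are exchanged, and while it forms, sentinel is-master-down-by-addr commands flood the sentinel network. As soon as the objective-down condition is met on any one sentinel, that sentinel starts the election of a failover leader. The condition is configured by hand on each sentinel node and no algorithm adjusts it automatically. In my opinion this is a place where ideas from blockchain would be worth studying, because the objective-down trigger is itself an important factor in how fast the cluster reaches consensus. One last point: seen from the whole cluster, when the leader election begins there may still be sentinel nodes that have not yet judged the master down.

Electing the leader node

When a node in the sentinel cluster has recognized that the master is objectively down, it starts a vote to elect a leader. Again, this is the sentinelHandleRedisInstance code from above: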

...
 /* Only masters */
    if (ri->flags & SRI_MASTER) {
        sentinelCheckObjectivelyDown(ri);
        if (sentinelStartFailoverIfNeeded(ri))
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
        sentinelFailoverStateMachine(ri);
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }

This time the focus is on sentinelStartFailoverIfNeeded. As the name suggests, it decides whether a failover needs to start and, if so, begins the leader election. The code is:

/* This function checks if there are the conditions to start the failover,
 * that is:
 *
 * 1) Master must be in ODOWN condition.
 * 2) No failover already in progress.
 * 3) No failover already attempted recently.
 *
 * We still don't know if we'll win the election so it is possible that we
 * start the failover but that we'll not be able to act.
 *
 * Return non-zero if a failover was started. */
int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) {
    /* We can't failover if the master is not in O_DOWN state. */
    if (!(master->flags & SRI_O_DOWN)) return 0;

    /* Failover already in progress? */
    if (master->flags & SRI_FAILOVER_IN_PROGRESS) return 0;

    /* Last failover attempt started too little time ago? */
    if (mstime() - master->failover_start_time <
        master->failover_timeout*2)
    {
        if (master->failover_delay_logged != master->failover_start_time) {
            time_t clock = (master->failover_start_time +
                            master->failover_timeout*2) / 1000;
            char ctimebuf[26];

            ctime_r(&clock,ctimebuf);
            ctimebuf[24] = '\0'; /* Remove newline. */
            master->failover_delay_logged = master->failover_start_time;
            serverLog(LL_WARNING,
                "Next failover delay: I will not start a failover before %s",
                ctimebuf);
        }
        return 0;
    }

    sentinelStartFailover(master);
    return 1;
}
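  • The master must be objectively down.
  • No failover is already in progress for the master.
  • The previous failover attempt must have started at least 2x the failover timeout ago. The default timeout is 3 minutes, so by default a new attempt starts no sooner than 6 minutes after the previous one.
  • If all three conditions hold, sentinelStartFailover is called.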
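The three conditions above reduce to a short predicate. This is a sketch with a hypothetical function name, may_start_failover, taking the relevant fields as plain arguments; the flag values are copied from the flag listing earlier in this chapter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

#define SRI_O_DOWN               (1<<4)
#define SRI_FAILOVER_IN_PROGRESS (1<<6)

static bool may_start_failover(int flags, mstime_t now,
                               mstime_t last_start, mstime_t failover_timeout)
{
    assert(failover_timeout >= 0);
    if (!(flags & SRI_O_DOWN)) return false;               /* 1) not ODOWN */
    if (flags & SRI_FAILOVER_IN_PROGRESS) return false;    /* 2) already running */
    if (now - last_start < failover_timeout * 2) return false; /* 3) too soon */
    return true;
}
```

With the default timeout of 180000 ms, the third check rejects any attempt made less than 360000 ms after the previous start time.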

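Before reading sentinelStartFailover, here are two more questions for the code analysis below to answer:

  1. When is the failover-in-progress state set? Is it a state synchronized across all sentinels in the cluster, or internal to the single sentinel node that gets elected?
  2. What does the failover timeout actually time, what can cause it, and does a timeout lead to a new leader election?

With these questions in mind, here is sentinelStartFailover: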

/* Setup the master state to start a failover. */
void sentinelStartFailover(sentinelRedisInstance *master) {
    serverAssert(master->flags & SRI_MASTER);

    master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;
    master->flags |= SRI_FAILOVER_IN_PROGRESS;
    master->failover_epoch = ++sentinel.current_epoch;
    sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
        (unsigned long long) sentinel.current_epoch);
    sentinelEvent(LL_WARNING,"+try-failover",master,"%@");
    master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    master->failover_state_change_time = mstime();
}

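This function initializes the state for a failover in a new epoch:

  • Sets the state machine to SENTINEL_FAILOVER_STATE_WAIT_START.
  • Sets SRI_FAILOVER_IN_PROGRESS in flags to show that a failover is under way, so later cycles do not start it again.
  • Advances the failover epoch.
  • Sets the failover start time to now plus a random offset of less than 1000 ms (rand()%SENTINEL_MAX_DESYNC). The code shown here does not say why; the constant's name suggests it is meant to desynchronize sentinels so that they do not all start at the same moment.
  • Sets the time of the last state-machine change.

This function answers the first question: SRI_FAILOVER_IN_PROGRESS is set here, and it means that the current sentinel node is running a failover. Control then returns to sentinelAskMasterStateToOtherSentinels, the function that asks other sentinels about the master, this time called with flags=SENTINEL_ASK_FORCED. The sentinel therefore sends SENTINEL is-master-down-by-addr to the other nodes again, except that the <runid> argument is no longer * but the current sentinel's own runid, asking the other nodes to elect it leader.

Leader election rules

The command is handled just as in the objective-down case, except that sentinelVoteLeader is also called: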

...
 else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
   ...
        /* Vote for the master (or fetch the previous vote) if the request
         * includes a runid, otherwise the sender is not seeking for a vote. */
        if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
            leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
                                            c->argv[5]->ptr,
                                            &leader_epoch);
        }
...

sentinelVoteLeader

/* Vote for the sentinel with 'req_runid' or return the old vote if already
 * voted for the specified 'req_epoch' or one greater.
 *
 * If a vote is not available returns NULL, otherwise return the Sentinel
 * runid and populate the leader_epoch with the epoch of the vote. */
char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
    if (req_epoch > sentinel.current_epoch) {
        sentinel.current_epoch = req_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
            (unsigned long long) sentinel.current_epoch);
    }

    if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
    {
        sdsfree(master->leader);
        master->leader = sdsnew(req_runid);
        master->leader_epoch = sentinel.current_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+vote-for-leader",master,"%s %llu",
            master->leader, (unsigned long long) master->leader_epoch);
        /* If we did not voted for ourselves, set the master failover start
         * time to now, in order to force a delay before we can start a
         * failover for the same master. */
        if (strcasecmp(master->leader,sentinel.myid))
            master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    }

    *leader_epoch = master->leader_epoch;
    return master->leader ? sdsnew(master->leader) : NULL;
}
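  • Synchronize the voting epoch.
  • If this sentinel has not yet voted in the requested epoch (master->leader_epoch < req_epoch), record the runid from the request as leader. That runid may be the sentinel's own.
  • If the runid is not its own, set the failover start time.
  • Every such state change is saved to the configuration file.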
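The "one vote per epoch" rule in the steps above can be tried out in isolation. The sketch below keeps only the epoch and leader fields and leaves out config flushing, events and the failover start time; voter, vote_leader and RUNID_LEN are hypothetical names, and the fixed 40-character buffer is an assumption made to avoid dynamic strings.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RUNID_LEN 40   /* assumption: runids fit in 40 characters */

typedef struct {
    uint64_t current_epoch;          /* newest epoch this voter has seen */
    uint64_t leader_epoch;           /* epoch of the recorded vote */
    char     leader[RUNID_LEN + 1];  /* "" means no vote recorded */
} voter;

/* Returns the runid this voter backs after handling the request,
 * or NULL if it has never voted. */
static const char *vote_leader(voter *v, uint64_t req_epoch,
                               const char *req_runid)
{
    assert(v != NULL && req_runid != NULL);
    assert(strlen(req_runid) <= RUNID_LEN);

    /* Catch up with a newer epoch first. */
    if (req_epoch > v->current_epoch) v->current_epoch = req_epoch;

    /* Grant the vote only if none was cast for this epoch yet. */
    if (v->leader_epoch < req_epoch && v->current_epoch <= req_epoch) {
        strcpy(v->leader, req_runid);
        v->leader_epoch = v->current_epoch;
    }
    return v->leader[0] ? v->leader : NULL;
}
```

A second request in the same epoch gets the earlier vote back instead of a new one, and a request carrying an older epoch changes nothing.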

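Here we see again how the failover start time is aligned across the sentinel cluster: a non-leader sentinel treats the failover as having started when it receives the vote request. The whole voting process goes as follows:

  • A sentinel node sends SENTINEL is-master-down-by-addr asking the receiver to set it as leader. There are two cases: 1) the receiver has not set a leader in the current voting epoch, so it sets the sender as leader; 2) the receiver has already set a leader in the current voting epoch, so it returns that leader.
  • On receiving the reply, the sending sentinel stores the leader runid in the sentinelRedisInstance of the replying sentinel so that votes can be counted later.
  • After one round of queries, the asking sentinel has the votes of the other sentinels.
  • It then counts the votes in the state machine's SENTINEL_FAILOVER_STATE_WAIT_START state.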
void sentinelFailoverWaitStart(sentinelRedisInstance *ri) {
    char *leader;
    int isleader;

    /* Check if we are the leader for the failover epoch. */
    leader = sentinelGetLeader(ri, ri->failover_epoch);
    isleader = leader && strcasecmp(leader,sentinel.myid) == 0;
    sdsfree(leader);

    /* If I'm not the leader, and it is not a forced failover via
     * SENTINEL FAILOVER, then I can't continue with the failover. */
    if (!isleader && !(ri->flags & SRI_FORCE_FAILOVER)) {
        int election_timeout = SENTINEL_ELECTION_TIMEOUT;

        /* The election timeout is the MIN between SENTINEL_ELECTION_TIMEOUT
         * and the configured failover timeout. */
        if (election_timeout > ri->failover_timeout)
            election_timeout = ri->failover_timeout;
        /* Abort the failover if I'm not the leader after some time. */
        if (mstime() - ri->failover_start_time > election_timeout) {
            sentinelEvent(LL_WARNING,"-failover-abort-not-elected",ri,"%@");
            sentinelAbortFailover(ri);
        }
        return;
    }
    sentinelEvent(LL_WARNING,"+elected-leader",ri,"%@");
    if (sentinel.simfailure_flags & SENTINEL_SIMFAILURE_CRASH_AFTER_ELECTION)
        sentinelSimFailureCrash();
    ri->failover_state = SENTINEL_FAILOVER_STATE_SELECT_SLAVE;
    ri->failover_state_change_time = mstime();
    sentinelEvent(LL_WARNING,"+failover-state-select-slave",ri,"%@");
}

sentinelGetLeader

/* Scan all the Sentinels attached to this master to check if there
 * is a leader for the specified epoch.
 *
 * To be a leader for a given epoch, we should have the majority of
 * the Sentinels we know (ever seen since the last SENTINEL RESET) that
 * reported the same instance as leader for the same epoch. */
char *sentinelGetLeader(sentinelRedisInstance *master, uint64_t epoch) {
    dict *counters;
    dictIterator *di;
    dictEntry *de;
    unsigned int voters = 0, voters_quorum;
    char *myvote;
    char *winner = NULL;
    uint64_t leader_epoch;
    uint64_t max_votes = 0;

    serverAssert(master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS));
    counters = dictCreate(&leaderVotesDictType,NULL);

    voters = dictSize(master->sentinels)+1; /* All the other sentinels and me.*/

    /* Count other sentinels votes */
    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        if (ri->leader != NULL && ri->leader_epoch == sentinel.current_epoch)
            sentinelLeaderIncr(counters,ri->leader);
    }
    dictReleaseIterator(di);

    /* Check what's the winner. For the winner to win, it needs two conditions:
     * 1) Absolute majority between voters (50% + 1).
     * 2) And anyway at least master->quorum votes. */
    di = dictGetIterator(counters);
    while((de = dictNext(di)) != NULL) {
        uint64_t votes = dictGetUnsignedIntegerVal(de);

        if (votes > max_votes) {
            max_votes = votes;
            winner = dictGetKey(de);
        }
    }
    dictReleaseIterator(di);

    /* Count this Sentinel vote:
     * if this Sentinel did not voted yet, either vote for the most
     * common voted sentinel, or for itself if no vote exists at all. */
    if (winner)
        myvote = sentinelVoteLeader(master,epoch,winner,&leader_epoch);
    else
        myvote = sentinelVoteLeader(master,epoch,sentinel.myid,&leader_epoch);

    if (myvote && leader_epoch == epoch) {
        uint64_t votes = sentinelLeaderIncr(counters,myvote);

        if (votes > max_votes) {
            max_votes = votes;
            winner = myvote;
        }
    }

    voters_quorum = voters/2+1;
    if (winner && (max_votes < voters_quorum || max_votes < master->quorum))
        winner = NULL;

    winner = winner ? sdsnew(winner) : NULL;
    sdsfree(myvote);
    dictRelease(counters);
    return winner;
}
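  • First count the votes. The counter is a Redis dict of type leaderVotesDictType (in essence a key-value map) that accumulates the number of votes per runid.
  • Pick the runid with the most votes.
  • Then cast this sentinel's own vote: for the winner if there is one, otherwise for itself.
  • A winner is declared only when its votes reach both an absolute majority of all voters (voters/2+1) and the quorum configured for the monitored master. That winner is the elected leader.

NOTE: At the very start of the election, when it sends out its vote requests, a sentinel has not yet voted for itself. It can vote for itself later: a sentinel node has two opportunities to vote (when it handles another sentinel's request and when it counts votes in sentinelGetLeader), but it can vote only once per epoch. Canvassing for votes is asynchronous, so some replies may not arrive in time.

From the election process above, producing a leader requires several conditions:

  • The candidate must receive votes from at least voters/2+1 nodes (its own vote included), and from at least quorum nodes if the configured quorum is larger than voters/2+1.
  • The node with the most votes becomes leader only if it meets that threshold.
  • Votes count only within the current epoch.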
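The winning condition can be checked with a small tally over a fixed array of votes. This is a simplified sketch, not the dict-based code above: elect_leader is a hypothetical name, votes are plain strings, and a NULL entry stands for a sentinel whose vote for this epoch is not known.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* votes[i] is the runid voter i backs in this epoch, or NULL.
 * Returns the winning runid, or NULL when nobody reaches the threshold. */
static const char *elect_leader(const char *const *votes, size_t voters,
                                unsigned int quorum)
{
    assert(votes != NULL || voters == 0);
    const char *winner = NULL;
    size_t max_votes = 0;

    /* Find the most voted runid (quadratic, fine for a handful of nodes). */
    for (size_t i = 0; i < voters; i++) {
        if (votes[i] == NULL) continue;
        size_t count = 0;
        for (size_t j = 0; j < voters; j++)
            if (votes[j] != NULL && strcmp(votes[i], votes[j]) == 0) count++;
        if (count > max_votes) { max_votes = count; winner = votes[i]; }
    }

    /* Both thresholds must hold: absolute majority AND the quorum. */
    size_t majority = voters / 2 + 1;
    if (winner && (max_votes < majority || max_votes < quorum))
        winner = NULL;
    return winner;
}
```

With five sentinels the majority is 3, so a 2-2 split produces no leader in that epoch, and a quorum of 4 would raise the bar above the majority.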

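Because canvassing is asynchronous, no leader can be produced in the current epoch if nodes are offline or if the node with the most votes does not meet the requirements above. The sentinel then has to wait for the timeout and start another round in a new epoch, until a leader for this failover is elected and the state machine moves to its next state.

Summary of leader election

The code above explained the sentinel leader election. To summarize how sentinel reaches consensus:
For all nodes in a cluster to reach consensus they have to exchange cluster-level state. Sentinel nodes do this by sending SENTINEL is-master-down-by-addr to obtain what the other nodes know. Each sentinel node maintains its own per-master data structure, which in some respects resembles a routing table, with SENTINEL is-master-down-by-addr as the protocol exchanged. The sentinel network is unusual in that sentinel nodes discover each other indirectly, by subscribing to the hello channel of the master they have in common. The master acts as the medium, even though master nodes and sentinel nodes serve entirely different functions.
In short, in the abstract, high availability for a distributed cluster rests on solving these basics:

  • Building the communication network between cluster nodes
  • How protocol messages propagate between cluster nodes
  • The protocol exchanged between cluster nodes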

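Beyond these basics, the sentinel system has to solve how to reach consensus. Both the master's objective down and the election of the failover leader are consensus processes. Naturally we want the rules and criteria for consensus to be the same on every node (here, every sentinel node) so that every node is treated fairly. Sentinel does have the manually configured quorum, which I see as an important factor in how the sentinel cluster reaches consensus. Unfortunately it is configured per node rather than for the whole cluster, which gives a single node a great deal of decision power and somewhat undermines the stability of the cluster.
To return to the main thread: every sentinel node runs its own periodic probes of the master. When one node finds its monitored master unreachable and judges it subjectively down, the sentinel system's first round of consensus begins, because that node starts asking the other nodes, repeatedly, whether they also consider the master down. Each reply that says so sets the flag in the flags of the corresponding sentinelRedisInstance. As time passes the node learns of more and more nodes that have judged the master down, until a threshold is reached. Seen another way, every node in the cluster is moving toward the objective-down judgment through its own periodic probes. The cluster reaches this state without nodes notifying one another; each senses it on its own, and a node learns the other nodes' state by what amounts to polling.
When some node is the first to satisfy the objective-down condition (>= quorum), it starts the second round of consensus by calling a vote. As noted earlier, there is no sharp boundary in the cluster between these two rounds; each node gets there by itself and is not driven by the state of other nodes. It is also worth noting that every sentinel node has a vote, even nodes that have not confirmed that the master is down. One condition in this round is that the failover leader must get more than half of the votes in the cluster. The election also has a deadline: if no leader emerges in time, another round of voting follows, until a leader is obtained within the deadline and consensus is reached. Each round of voting is therefore controlled by an epoch, somewhat like a version number. From the two rounds of consensus we can abstract four things a sentinel node needs in order to reach consensus:

  • Every node can stand for election and vote, with exactly one vote per epoch, so that every node computes the same result for a given round.
  • Election results are exchanged.
  • A trigger rule for reaching consensus on the result (votes/2+1).
  • A deadline for reaching consensus on the result.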

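Other failover operations

State machine

Once the leader election is over, the elected leader node begins the failover proper. As the code above has already shown, sentinel drives the failover with a state machine.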

void sentinelFailoverStateMachine(sentinelRedisInstance *ri) {
    serverAssert(ri->flags & SRI_MASTER);

    if (!(ri->flags & SRI_FAILOVER_IN_PROGRESS)) return;

    switch(ri->failover_state) {
        case SENTINEL_FAILOVER_STATE_WAIT_START:
            sentinelFailoverWaitStart(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SELECT_SLAVE:
            sentinelFailoverSelectSlave(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
            sentinelFailoverSendSlaveOfNoOne(ri);
            break;
        case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
            sentinelFailoverWaitPromotion(ri);
            break;
        case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
            sentinelFailoverReconfNextSlave(ri);
            break;
    }
}

The state transitions are:

SENTINEL_FAILOVER_STATE_WAIT_START
                ||
                \/
SENTINEL_FAILOVER_STATE_SELECT_SLAVE
                ||
                \/
SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE
                ||
                \/
SENTINEL_FAILOVER_STATE_WAIT_PROMOTION
                ||
                \/
SENTINEL_FAILOVER_STATE_RECONF_SLAVES
                ||
                \/
SENTINEL_FAILOVER_STATE_UPDATE_CONFIG
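
The transition diagram above can also be written as a next-state function, which makes the linear order explicit. This is an illustrative sketch only: the shortened FS_ enum names and next_state are invented here, and treating the last state as having no successor is an assumption, since the diagram stops there.

```c
#include <assert.h>

/* Shortened, hypothetical names for the SENTINEL_FAILOVER_STATE_* values. */
typedef enum {
    FS_NONE,
    FS_WAIT_START,
    FS_SELECT_SLAVE,
    FS_SEND_SLAVEOF_NOONE,
    FS_WAIT_PROMOTION,
    FS_RECONF_SLAVES,
    FS_UPDATE_CONFIG
} failover_state;

/* Returns the state that follows s in the diagram. */
static failover_state next_state(failover_state s) {
    assert(s <= FS_UPDATE_CONFIG);
    switch (s) {
    case FS_WAIT_START:         return FS_SELECT_SLAVE;
    case FS_SELECT_SLAVE:       return FS_SEND_SLAVEOF_NOONE;
    case FS_SEND_SLAVEOF_NOONE: return FS_WAIT_PROMOTION;
    case FS_WAIT_PROMOTION:     return FS_RECONF_SLAVES;
    case FS_RECONF_SLAVES:      return FS_UPDATE_CONFIG;
    default:                    return s; /* FS_NONE, FS_UPDATE_CONFIG */
    }
}
```

In the real state machine a step is taken only when the handler for the current state decides its work is done, and an abort can leave the sequence early; the function above only captures the order.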

Selecting the slave

The selection rules are in the call chain sentinelFailoverSelectSlave->sentinelSelectSlave:

sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
    sentinelRedisInstance **instance =
        zmalloc(sizeof(instance[0])*dictSize(master->slaves));
    sentinelRedisInstance *selected = NULL;
    int instances = 0;
    dictIterator *di;
    dictEntry *de;
    mstime_t max_master_down_time = 0;

    if (master->flags & SRI_S_DOWN)
        max_master_down_time += mstime() - master->s_down_since_time;
    max_master_down_time += master->down_after_period * 10;

    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);
        mstime_t info_validity_time;

        if (slave->flags & (SRI_S_DOWN|SRI_O_DOWN)) continue;
        if (slave->link->disconnected) continue;
        if (mstime() - slave->link->last_avail_time > SENTINEL_PING_PERIOD*5) continue;
        if (slave->slave_priority == 0) continue;

        /* If the master is in SDOWN state we get INFO for slaves every second.
         * Otherwise we get it with the usual period so we need to account for
         * a larger delay. */
        if (master->flags & SRI_S_DOWN)
            info_validity_time = SENTINEL_PING_PERIOD*5;
        else
            info_validity_time = SENTINEL_INFO_PERIOD*3;
        if (mstime() - slave->info_refresh > info_validity_time) continue;
        if (slave->master_link_down_time > max_master_down_time) continue;
        instance[instances++] = slave;
    }
    dictReleaseIterator(di);
    if (instances) {
        qsort(instance,instances,sizeof(sentinelRedisInstance*),
            compareSlavesForPromotion);
        selected = instance[0];
    }
    zfree(instance);
    return selected;
}

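Selection strategy:

  • Exclude slaves already judged subjectively or objectively down.
  • Exclude slaves whose link is disconnected.
  • Exclude slaves that have not answered a ping for more than 5*SENTINEL_PING_PERIOD (5s).
  • Exclude slaves with priority 0.
  • If the master is in SRI_S_DOWN, sentinel sends INFO to slaves every 1s, so exclude slaves whose last INFO reply is older than SENTINEL_PING_PERIOD*5 (5s); otherwise exclude slaves whose last INFO reply is older than 3*SENTINEL_INFO_PERIOD (30s).
  • Exclude slaves whose link to the master has been down (master_link_down_time) for longer than max_master_down_time, which is master->down_after_period * 10 plus, if the master is subjectively down, the time since it entered that state. This makes it as likely as possible that the slave was still connected to the master shortly before the master went down.
  • From the remaining slaves, pick one as follows: first by priority, preferring the highest priority, which is the lowest slave_priority value; then by replication offset, preferring the largest, since it means the data is closest to the master's; finally by runid, preferring the lexicographically smallest.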
int compareSlavesForPromotion(const void *a, const void *b) {
    sentinelRedisInstance **sa = (sentinelRedisInstance **)a,
                          **sb = (sentinelRedisInstance **)b;
    char *sa_runid, *sb_runid;

    if ((*sa)->slave_priority != (*sb)->slave_priority)
        return (*sa)->slave_priority - (*sb)->slave_priority;

    /* If priority is the same, select the slave with greater replication
     * offset (processed more data from the master). */
    if ((*sa)->slave_repl_offset > (*sb)->slave_repl_offset) {
        return -1; /* a < b */
    } else if ((*sa)->slave_repl_offset < (*sb)->slave_repl_offset) {
        return 1; /* a > b */
    }

    /* If the replication offset is the same select the slave with that has
     * the lexicographically smaller runid. Note that we try to handle runid
     * == NULL as there are old Redis versions that don't publish runid in
     * INFO. A NULL runid is considered bigger than any other runid. */
    sa_runid = (*sa)->runid;
    sb_runid = (*sb)->runid;
    if (sa_runid == NULL && sb_runid == NULL) return 0;
    else if (sa_runid == NULL) return 1;  /* a > b */
    else if (sb_runid == NULL) return -1; /* a < b */
    return strcasecmp(sa_runid, sb_runid);
}
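The ordering implemented by the comparator above can be tried out on a reduced structure. This is a minimal sketch under stated assumptions: `candidate`, `compare_candidates` and `pick_best` are hypothetical names, the struct keeps only the three fields the comparator reads, and `strcmp` stands in for the case-insensitive `strcasecmp` used in the original.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for sentinelRedisInstance. */
typedef struct candidate {
    int priority;                   /* like slave_priority: lower is preferred */
    unsigned long long repl_offset; /* like slave_repl_offset */
    const char *runid;              /* may be NULL on very old servers */
} candidate;

/* Same ordering as compareSlavesForPromotion, applied to an array of
 * pointers: priority ascending, offset descending, runid ascending. */
static int compare_candidates(const void *a, const void *b) {
    const candidate *ca = *(const candidate *const *)a;
    const candidate *cb = *(const candidate *const *)b;

    if (ca->priority != cb->priority)
        return (ca->priority > cb->priority) - (ca->priority < cb->priority);
    if (ca->repl_offset != cb->repl_offset)
        return ca->repl_offset > cb->repl_offset ? -1 : 1;
    if (ca->runid == NULL && cb->runid == NULL) return 0;
    if (ca->runid == NULL) return 1;  /* NULL runid sorts last */
    if (cb->runid == NULL) return -1;
    return strcmp(ca->runid, cb->runid);
}

/* Sort the pointer array in place and return the preferred candidate,
 * or NULL when the array is empty. */
static const candidate *pick_best(const candidate **list, size_t n) {
    assert(list != NULL || n == 0);
    if (n == 0) return NULL;
    qsort(list, n, sizeof(*list), compare_candidates);
    return list[0];
}
```

Given two candidates with equal priority and equal offset, the one whose runid compares smaller ends up at index 0, which is the element the selection function returns.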
Sending the command that promotes the slave to master

Call chain: sentinelFailoverSendSlaveOfNoOne->sentinelSendSlaveOf

void sentinelFailoverSendSlaveOfNoOne(sentinelRedisInstance *ri) {
    int retval;

    /* We can't send the command to the promoted slave if it is now
     * disconnected. Retry again and again with this state until the timeout
     * is reached, then abort the failover. */
    if (ri->promoted_slave->link->disconnected) {
        if (mstime() - ri->failover_state_change_time > ri->failover_timeout) {
            sentinelEvent(LL_WARNING,"-failover-abort-slave-timeout",ri,"%@");
            sentinelAbortFailover(ri);
        }
        return;
    }

    /* Send SLAVEOF NO ONE command to turn the slave into a master.
     * We actually register a generic callback for this command as we don't
     * really care about the reply. We check if it worked indirectly observing
     * if INFO returns a different role (master instead of slave). */
    retval = sentinelSendSlaveOf(ri->promoted_slave,NULL,0);
    if (retval != C_OK) return;
    sentinelEvent(LL_NOTICE, "+failover-state-wait-promotion",
        ri->promoted_slave,"%@");
    ri->failover_state = SENTINEL_FAILOVER_STATE_WAIT_PROMOTION;
    ri->failover_state_change_time = mstime();
}

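Sentinel sends slaveof no one to the selected slave to promote it to master. The callback registered for this command is a generic one that ignores the reply; sentinel learns whether the role has changed from the info command it periodically sends to the slave. If the link to the selected slave is disconnected, nothing is sent and the step is retried on later cycles. Only once failover_timeout has elapsed in this state is the failover declared failed by timeout and aborted, after which a new round of election can start.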
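The three possible outcomes of this step can be written as a small decision function. This is a minimal sketch: `send_action` and `decide_send_action` are hypothetical names, and the real function acts on the instance directly instead of returning a value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type for the "send SLAVEOF NO ONE" step. */
typedef enum {
    SEND_NOW,       /* link is up: send the command */
    RETRY_LATER,    /* link is down, timeout not reached: try again next cycle */
    ABORT_FAILOVER  /* link is down and the timeout has expired */
} send_action;

/* All times are in milliseconds. */
static send_action decide_send_action(bool disconnected, long long now,
                                      long long state_change_time,
                                      long long failover_timeout) {
    assert(now >= state_change_time);
    if (!disconnected) return SEND_NOW;
    if (now - state_change_time > failover_timeout) return ABORT_FAILOVER;
    return RETRY_LATER;
}
```

The comparison is strict, as in the excerpt, so an elapsed time exactly equal to failover_timeout still results in a retry.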

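Waiting for the slave to be promoted

Once slaveof no one has been sent, the failover state machine enters the SENTINEL_FAILOVER_STATE_WAIT_PROMOTION state. In this state sentinel only checks whether failover_timeout has elapsed since failover_state_change_time. If it has, the failover is declared failed by timeout and aborted, and a later attempt starts with a new round of election.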

/* We actually wait for promotion indirectly checking with INFO when the
 * slave turns into a master. */
void sentinelFailoverWaitPromotion(sentinelRedisInstance *ri) {
    /* Just handle the timeout. Switching to the next state is handled
     * by the function parsing the INFO command of the promoted slave. */
    if (mstime() - ri->failover_state_change_time > ri->failover_timeout) {
        sentinelEvent(LL_WARNING,"-failover-abort-slave-timeout",ri,"%@");
        sentinelAbortFailover(ri);
    }
}

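As mentioned for slaveof no one, sentinel does not act on the reply to the command. It detects the slave's role change through the periodic info command instead. Parsing of the info reply was covered in the previous chapter; here is that code again.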

/* Process the INFO output from masters. */
void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
    sds *lines;
    int numlines, j;
    int role = 0;

    /* cache full INFO output for instance */
    sdsfree(ri->info);
    ri->info = sdsnew(info);

    /* The following fields must be reset to a given value in the case they
     * are not found at all in the INFO output. */
    ri->master_link_down_time = 0;

    ...
     
    /* Handle slave -> master role switch. */
    if ((ri->flags & SRI_SLAVE) && role == SRI_MASTER) {
        /* If this is a promoted slave we can change state to the
         * failover state machine. */
        if ((ri->flags & SRI_PROMOTED) &&
            (ri->master->flags & SRI_FAILOVER_IN_PROGRESS) &&
            (ri->master->failover_state ==
                SENTINEL_FAILOVER_STATE_WAIT_PROMOTION))
        {
            /* Now that we are sure the slave was reconfigured as a master
             * set the master configuration epoch to the epoch we won the
             * election to perform this failover. This will force the other
             * Sentinels to update their config (assuming there is not
             * a newer one already available). */
            ri->master->config_epoch = ri->master->failover_epoch;
            ri->master->failover_state = SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
            ri->master->failover_state_change_time = mstime();
            sentinelFlushConfig();
            sentinelEvent(LL_WARNING,"+promoted-slave",ri,"%@");
            if (sentinel.simfailure_flags &
                SENTINEL_SIMFAILURE_CRASH_AFTER_PROMOTION)
                sentinelSimFailureCrash();
            sentinelEvent(LL_WARNING,"+failover-state-reconf-slaves",
                ri->master,"%@");
            sentinelCallClientReconfScript(ri->master,SENTINEL_LEADER,
                "start",ri->master->addr,ri->addr);
            sentinelForceHelloUpdateForMaster(ri->master);
        } else {
            /* A slave turned into a master. We want to force our view and
             * reconfigure as slave. Wait some time after the change before
             * going forward, to receive new configs if any. */
            mstime_t wait_time = SENTINEL_PUBLISH_PERIOD*4;

            if (!(ri->flags & SRI_PROMOTED) &&
                 sentinelMasterLooksSane(ri->master) &&
                 sentinelRedisInstanceNoDownFor(ri,wait_time) &&
                 mstime() - ri->role_reported_time > wait_time)
            {
                int retval = sentinelSendSlaveOf(ri,
                        ri->master->addr->ip,
                        ri->master->addr->port);
                if (retval == C_OK)
                    sentinelEvent(LL_NOTICE,"+convert-to-slave",ri,"%@");
            }
        }
    }

    /* Handle slaves replicating to a different master address. */
    if ((ri->flags & SRI_SLAVE) &&
        role == SRI_SLAVE &&
        (ri->slave_master_port != ri->master->addr->port ||
         strcasecmp(ri->slave_master_host,ri->master->addr->ip)))
    {
        mstime_t wait_time = ri->master->failover_timeout;

        /* Make sure the master is sane before reconfiguring this instance
         * into a slave. */
        if (sentinelMasterLooksSane(ri->master) &&
            sentinelRedisInstanceNoDownFor(ri,wait_time) &&
            mstime() - ri->slave_conf_change_time > wait_time)
        {
            int retval = sentinelSendSlaveOf(ri,
                    ri->master->addr->ip,
                    ri->master->addr->port);
            if (retval == C_OK)
                sentinelEvent(LL_NOTICE,"+fix-slave-config",ri,"%@");
        }
    }

    /* Detect if the slave that is in the process of being reconfigured
     * changed state. */
    if ((ri->flags & SRI_SLAVE) && role == SRI_SLAVE &&
        (ri->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)))
    {
        /* SRI_RECONF_SENT -> SRI_RECONF_INPROG. */
        if ((ri->flags & SRI_RECONF_SENT) &&
            ri->slave_master_host &&
            strcmp(ri->slave_master_host,
                    ri->master->promoted_slave->addr->ip) == 0 &&
            ri->slave_master_port == ri->master->promoted_slave->addr->port)
        {
            ri->flags &= ~SRI_RECONF_SENT;
            ri->flags |= SRI_RECONF_INPROG;
            sentinelEvent(LL_NOTICE,"+slave-reconf-inprog",ri,"%@");
        }

        /* SRI_RECONF_INPROG -> SRI_RECONF_DONE */
        if ((ri->flags & SRI_RECONF_INPROG) &&
            ri->slave_master_link_status == SENTINEL_MASTER_LINK_STATUS_UP)
        {
            ri->flags &= ~SRI_RECONF_INPROG;
            ri->flags |= SRI_RECONF_DONE;
            sentinelEvent(LL_NOTICE,"+slave-reconf-done",ri,"%@");
        }
    }
}

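Only the code related to the slave->master switch is excerpted here.

  • First check that the slave reporting the master role is the promoted one and that its master's failover is in the WAIT_PROMOTION state.
  • If so, update the config epoch, move to the next state SENTINEL_FAILOVER_STATE_RECONF_SLAVES, record the state change time, and flush the configuration to the config file.
  • Call the client reconfiguration script.
  • Call sentinelForceHelloUpdateForMaster->sentinelForceHelloUpdateDictOfRedisInstances so that the hello msg is broadcast in the next cycle.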
/* Reset last_pub_time in all the instances in the specified dictionary
 * in order to force the delivery of an Hello update ASAP. */
void sentinelForceHelloUpdateDictOfRedisInstances(dict *instances) {
    dictIterator *di;
    dictEntry *de;

    di = dictGetSafeIterator(instances);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        if (ri->last_pub_time >= (SENTINEL_PUBLISH_PERIOD+1))
            ri->last_pub_time -= (SENTINEL_PUBLISH_PERIOD+1);
    }
    dictReleaseIterator(di);
}

/* This function forces the delivery of an "Hello" message (see
 * sentinelSendHello() top comment for further information) to all the Redis
 * and Sentinel instances related to the specified 'master'.
 *
 * It is technically not needed since we send an update to every instance
 * with a period of SENTINEL_PUBLISH_PERIOD milliseconds, however when a
 * Sentinel upgrades a configuration it is a good idea to deliever an update
 * to the other Sentinels ASAP. */
int sentinelForceHelloUpdateForMaster(sentinelRedisInstance *master) {
    if (!(master->flags & SRI_MASTER)) return C_ERR;
    if (master->last_pub_time >= (SENTINEL_PUBLISH_PERIOD+1))
        master->last_pub_time -= (SENTINEL_PUBLISH_PERIOD+1);
    sentinelForceHelloUpdateDictOfRedisInstances(master->sentinels);
    sentinelForceHelloUpdateDictOfRedisInstances(master->slaves);
    return C_OK;
}
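Why rewinding last_pub_time forces an early hello can be seen with two small helpers. This is a minimal sketch: the function names are hypothetical, PUBLISH_PERIOD_MS is assumed to be 2000 ms to mirror SENTINEL_PUBLISH_PERIOD, and the due check assumes the periodic sender publishes once more than one period has passed since last_pub_time.

```c
#include <assert.h>
#include <stdbool.h>

#define PUBLISH_PERIOD_MS 2000 /* assumption: mirrors SENTINEL_PUBLISH_PERIOD */

/* Assumed form of the periodic check that decides whether to publish. */
static bool hello_is_due(long long now, long long last_pub_time) {
    return now - last_pub_time > PUBLISH_PERIOD_MS;
}

/* Same adjustment as in the excerpt: move the timestamp back by one
 * period plus 1 ms, unless that would make it negative. */
static long long force_hello(long long last_pub_time) {
    assert(last_pub_time >= 0);
    if (last_pub_time >= PUBLISH_PERIOD_MS + 1)
        last_pub_time -= PUBLISH_PERIOD_MS + 1;
    return last_pub_time;
}
```

Even when a hello was published in the current millisecond, the rewound timestamp makes the next periodic check treat it as overdue.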
Reconfiguring the slaves

Once info shows that the selected slave has become master, the state machine enters the SENTINEL_FAILOVER_STATE_RECONF_SLAVES state and reconfigures the remaining slaves.

/* Send SLAVE OF <new master address> to all the remaining slaves that
 * still don't appear to have the configuration updated. */
void sentinelFailoverReconfNextSlave(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    int in_progress = 0;

    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        if (slave->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG))
            in_progress++;
    }
    dictReleaseIterator(di);

    di = dictGetIterator(master->slaves);
    while(in_progress < master->parallel_syncs &&
          (de = dictNext(di)) != NULL)
    {
        sentinelRedisInstance *slave = dictGetVal(de);
        int retval;

        /* Skip the promoted slave, and already configured slaves. */
        if (slave->flags & (SRI_PROMOTED|SRI_RECONF_DONE)) continue;

        /* If too much time elapsed without the slave moving forward to
         * the next state, consider it reconfigured even if it is not.
         * Sentinels will detect the slave as misconfigured and fix its
         * configuration later. */
        if ((slave->flags & SRI_RECONF_SENT) &&
            (mstime() - slave->slave_reconf_sent_time) >
            SENTINEL_SLAVE_RECONF_TIMEOUT)
        {
            sentinelEvent(LL_NOTICE,"-slave-reconf-sent-timeout",slave,"%@");
            slave->flags &= ~SRI_RECONF_SENT;
            slave->flags |= SRI_RECONF_DONE;
        }

        /* Nothing to do for instances that are disconnected or already
         * in RECONF_SENT state. */
        if (slave->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)) continue;
        if (slave->link->disconnected) continue;

        /* Send SLAVEOF <new master>. */
        retval = sentinelSendSlaveOf(slave,
                master->promoted_slave->addr->ip,
                master->promoted_slave->addr->port);
        if (retval == C_OK) {            
            slave->flags |= SRI_RECONF_SENT;
            slave->slave_reconf_sent_time = mstime();
            sentinelEvent(LL_NOTICE,"+slave-reconf-sent",slave,"%@");
            in_progress++;
        }
    }
    dictReleaseIterator(di);

    /* Check if all the slaves are reconfigured and handle timeout. */
    sentinelFailoverDetectEnd(master);
}
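The parallel_syncs throttling in the function above can be sketched over a plain array of flags. This is a simplified illustration: the flag constants and `schedule_reconf` are hypothetical, and the timeout handling and the disconnected-link check of the original are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical local flags, not the SRI_* values. */
enum {
    RECONF_SENT   = 1 << 0,
    RECONF_INPROG = 1 << 1,
    RECONF_DONE   = 1 << 2,
    PROMOTED      = 1 << 3
};

/* Mark idle slaves as RECONF_SENT until the number of slaves that are being
 * reconfigured reaches parallel_syncs. Returns how many were newly marked. */
static int schedule_reconf(int *flags, size_t n, int parallel_syncs) {
    int in_progress = 0, sent = 0;

    assert(flags != NULL || n == 0);

    /* First pass: count slaves already being reconfigured. */
    for (size_t i = 0; i < n; i++)
        if (flags[i] & (RECONF_SENT | RECONF_INPROG)) in_progress++;

    /* Second pass: start new reconfigurations while there is budget left. */
    for (size_t i = 0; i < n && in_progress < parallel_syncs; i++) {
        if (flags[i] & (PROMOTED | RECONF_DONE)) continue;
        if (flags[i] & (RECONF_SENT | RECONF_INPROG)) continue;
        flags[i] |= RECONF_SENT;
        in_progress++;
        sent++;
    }
    return sent;
}
```

With parallel_syncs set to 1, a second call marks nothing until the slave marked by the first call has left the SENT/INPROG states.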

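Sentinel sends SLAVE OF <new master address> to the other slaves. The reply to this command is not used either; whether each slave has been reconfigured to follow the new master is again learned by probing with info.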

/* Process the INFO output from masters. */
void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
    sds *lines;
    int numlines, j;
    int role = 0;

    /* cache full INFO output for instance */
    sdsfree(ri->info);
    ri->info = sdsnew(info);

    /* The following fields must be reset to a given value in the case they
     * are not found at all in the INFO output. */
    ri->master_link_down_time = 0;

    ...

    /* Detect if the slave that is in the process of being reconfigured
     * changed state. */
    if ((ri->flags & SRI_SLAVE) && role == SRI_SLAVE &&
        (ri->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)))
    {
        /* SRI_RECONF_SENT -> SRI_RECONF_INPROG. */
        if ((ri->flags & SRI_RECONF_SENT) &&
            ri->slave_master_host &&
            strcmp(ri->slave_master_host,
                    ri->master->promoted_slave->addr->ip) == 0 &&
            ri->slave_master_port == ri->master->promoted_slave->addr->port)
        {
            ri->flags &= ~SRI_RECONF_SENT;
            ri->flags |= SRI_RECONF_INPROG;
            sentinelEvent(LL_NOTICE,"+slave-reconf-inprog",ri,"%@");
        }

        /* SRI_RECONF_INPROG -> SRI_RECONF_DONE */
        if ((ri->flags & SRI_RECONF_INPROG) &&
            ri->slave_master_link_status == SENTINEL_MASTER_LINK_STATUS_UP)
        {
            ri->flags &= ~SRI_RECONF_INPROG;
            ri->flags |= SRI_RECONF_DONE;
            sentinelEvent(LL_NOTICE,"+slave-reconf-done",ri,"%@");
        }
    }
}

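As the code above shows, there is one more sequence of state changes at the end, while sentinel waits for the other slaves to switch to the promoted slave.

SRI_RECONF_SENT->SRI_RECONF_INPROG->SRI_RECONF_DONE
  • SRI_RECONF_SENT: SLAVE OF <new master address> has been sent to the slave.
  • SRI_RECONF_INPROG: the slave that received SLAVE OF <new master address> now reports the new master's host and port as its master in info.
  • SRI_RECONF_DONE: the slave has finished reconfiguring. The precondition is master_link_status:up in the slave's info output, which indicates that its link to the new master is established.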

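Finally, sentinelFailoverReconfNextSlave calls sentinelFailoverDetectEnd to check whether all slaves are now configured to follow the new master. If they are, or if failover_timeout has elapsed, the failover moves to the next state, SENTINEL_FAILOVER_STATE_UPDATE_CONFIG.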

void sentinelFailoverDetectEnd(sentinelRedisInstance *master) {
    int not_reconfigured = 0, timeout = 0;
    dictIterator *di;
    dictEntry *de;
    mstime_t elapsed = mstime() - master->failover_state_change_time;

    /* We can't consider failover finished if the promoted slave is
     * not reachable. */
    if (master->promoted_slave == NULL ||
        master->promoted_slave->flags & SRI_S_DOWN) return;

    /* The failover terminates once all the reachable slaves are properly
     * configured. */
    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        if (slave->flags & (SRI_PROMOTED|SRI_RECONF_DONE)) continue;
        if (slave->flags & SRI_S_DOWN) continue;
        not_reconfigured++;
    }
    dictReleaseIterator(di);

    /* Force end of failover on timeout. */
    if (elapsed > master->failover_timeout) {
        not_reconfigured = 0;
        timeout = 1;
        sentinelEvent(LL_WARNING,"+failover-end-for-timeout",master,"%@");
    }

    if (not_reconfigured == 0) {
        sentinelEvent(LL_WARNING,"+failover-end",master,"%@");
        master->failover_state = SENTINEL_FAILOVER_STATE_UPDATE_CONFIG;
        master->failover_state_change_time = mstime();
    }

    /* If I'm the leader it is a good idea to send a best effort SLAVEOF
     * command to all the slaves still not reconfigured to replicate with
     * the new master. */
    if (timeout) {
        dictIterator *di;
        dictEntry *de;

        di = dictGetIterator(master->slaves);
        while((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *slave = dictGetVal(de);
            int retval;

            if (slave->flags & (SRI_RECONF_DONE|SRI_RECONF_SENT)) continue;
            if (slave->link->disconnected) continue;
            retval = sentinelSendSlaveOf(slave,
                    master->promoted_slave->addr->ip,
                    master->promoted_slave->addr->port);
            if (retval == C_OK) {
                sentinelEvent(LL_NOTICE,"+slave-reconf-sent-be",slave,"%@");
                slave->flags |= SRI_RECONF_SENT;
            }
        }
        dictReleaseIterator(di);
    }
}
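The end-of-failover test in the function above reduces to a count plus a timeout override. This is a minimal sketch: the flag constants and `failover_can_end` are hypothetical, and the check that the promoted slave itself is reachable is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical local flags, not the SRI_* values. */
enum { F_S_DOWN = 1 << 0, F_PROMOTED = 1 << 1, F_RECONF_DONE = 1 << 2 };

/* Returns true when the failover may move on to the config update state:
 * either every reachable slave is reconfigured, or the timeout expired. */
static bool failover_can_end(const int *flags, size_t n,
                             long long elapsed_ms,
                             long long failover_timeout_ms) {
    size_t not_reconfigured = 0;

    assert(flags != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (flags[i] & (F_PROMOTED | F_RECONF_DONE)) continue;
        if (flags[i] & F_S_DOWN) continue; /* unreachable slaves do not block */
        not_reconfigured++;
    }
    if (elapsed_ms > failover_timeout_ms) return true; /* forced end */
    return not_reconfigured == 0;
}
```

A slave that is subjectively down never blocks completion, which matches the "reachable slaves" wording of the comment in the excerpt.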
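Updating the master address

Once every slave has been switched, the failover is in the SENTINEL_FAILOVER_STATE_UPDATE_CONFIG state. This state is handled in the periodic method sentinelHandleDictOfRedisInstances rather than in the state machine. The main reason is that a master in this state is about to be replaced by the promoted slave, which only requires changing the original master's address. The method is also recursive, so if this state were not handled at the very end, after the periodic handling of the original master and its slaves and sentinels, it could cause unnecessary problems. The reset code is shown below: it builds a new slaves dict and adds the original master to it as a slave. The address of the original master sentinelRedisInstance is replaced, and its state and connections are reset by sentinelResetMaster.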

/* Perform scheduled operations for all the instances in the dictionary.
 * Recursively call the function against dictionaries of slaves. */
void sentinelHandleDictOfRedisInstances(dict *instances) {
    dictIterator *di;
    dictEntry *de;
    sentinelRedisInstance *switch_to_promoted = NULL;

    /* There are a number of things we need to perform against every master. */
    di = dictGetIterator(instances);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);

        sentinelHandleRedisInstance(ri);
        if (ri->flags & SRI_MASTER) {
            sentinelHandleDictOfRedisInstances(ri->slaves);
            sentinelHandleDictOfRedisInstances(ri->sentinels);
            if (ri->failover_state == SENTINEL_FAILOVER_STATE_UPDATE_CONFIG) {
                switch_to_promoted = ri;
            }
        }
    }
    if (switch_to_promoted)
        sentinelFailoverSwitchToPromotedSlave(switch_to_promoted);
    dictReleaseIterator(di);
}

sentinelFailoverSwitchToPromotedSlave->sentinelResetMasterAndChangeAddress

/* Reset the specified master with sentinelResetMaster(), and also change
 * the ip:port address, but take the name of the instance unmodified.
 *
 * This is used to handle the +switch-master event.
 *
 * The function returns C_ERR if the address can't be resolved for some
 * reason. Otherwise C_OK is returned.  */
int sentinelResetMasterAndChangeAddress(sentinelRedisInstance *master, char *ip, int port) {
    sentinelAddr *oldaddr, *newaddr;
    sentinelAddr **slaves = NULL;
    int numslaves = 0, j;
    dictIterator *di;
    dictEntry *de;

    newaddr = createSentinelAddr(ip,port);
    if (newaddr == NULL) return C_ERR;

    /* Make a list of slaves to add back after the reset.
     * Don't include the one having the address we are switching to. */
    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        if (sentinelAddrIsEqual(slave->addr,newaddr)) continue;
        slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
        slaves[numslaves++] = createSentinelAddr(slave->addr->ip,
                                                 slave->addr->port);
    }
    dictReleaseIterator(di);

    /* If we are switching to a different address, include the old address
     * as a slave as well, so that we'll be able to sense / reconfigure
     * the old master. */
    if (!sentinelAddrIsEqual(newaddr,master->addr)) {
        slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
        slaves[numslaves++] = createSentinelAddr(master->addr->ip,
                                                 master->addr->port);
    }

    /* Reset and switch address. */
    sentinelResetMaster(master,SENTINEL_RESET_NO_SENTINELS);
    oldaddr = master->addr;
    master->addr = newaddr;
    master->o_down_since_time = 0;
    master->s_down_since_time = 0;

    /* Add slaves back. */
    for (j = 0; j < numslaves; j++) {
        sentinelRedisInstance *slave;
        slave = createSentinelRedisInstance(NULL,SRI_SLAVE,slaves[j]->ip,
                    slaves[j]->port, master->quorum, master);
        releaseSentinelAddr(slaves[j]);
        if (slave) sentinelEvent(LL_NOTICE,"+slave",slave,"%@");
    }
    zfree(slaves);

    /* Release the old address at the end so we are safe even if the function
     * gets the master->addr->ip and master->addr->port as arguments. */
    releaseSentinelAddr(oldaddr);
    sentinelFlushConfig();
    return C_OK;
}
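The way the slave list is rebuilt around the new address can be sketched with plain value types. This is a minimal sketch: `addr`, `addr_equal` and `replicas_after_switch` are hypothetical names, the caller supplies the output buffer, and `strcmp` is used for the ip comparison where the original may compare case-insensitively.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for sentinelAddr. */
typedef struct { const char *ip; int port; } addr;

static bool addr_equal(addr a, addr b) {
    return a.port == b.port && strcmp(a.ip, b.ip) == 0;
}

/* Fill out[] (capacity n + 1) with the slaves the master entry should have
 * after the switch: every old slave except the promoted one, plus the old
 * master address when it differs from the new one. Returns the count. */
static size_t replicas_after_switch(const addr *slaves, size_t n,
                                    addr old_master, addr new_master,
                                    addr *out) {
    size_t count = 0;

    assert(out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (addr_equal(slaves[i], new_master)) continue;
        out[count++] = slaves[i];
    }
    if (!addr_equal(new_master, old_master))
        out[count++] = old_master;
    return count;
}
```

Keeping the old master in the list is what later lets sentinel reconfigure it as a slave once it becomes reachable again.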

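The failover is finally over, but one question remains: only the elected leader performs the failover, so how do the other sentinel nodes learn about the new master and update their own structures? Recall that when the info handler sees the role change from slave to master, the code rewinds the hello msg publish time so that the hello msg is broadcast as soon as possible. So we return to part of the hello msg handler.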

/* Process an hello message received via Pub/Sub in master or slave instance,
 * or sent directly to this sentinel via the (fake) PUBLISH command of Sentinel.
 *
 * If the master name specified in the message is not known, the message is
 * discarded. */
void sentinelProcessHelloMessage(char *hello, int hello_len) {
    /* Format is composed of 8 tokens:
     * 0=ip,1=port,2=runid,3=current_epoch,4=master_name,
     * 5=master_ip,6=master_port,7=master_config_epoch. */
    int numtokens, port, removed, master_port;
    uint64_t current_epoch, master_config_epoch;
    char **token = sdssplitlen(hello, hello_len, ",", 1, &numtokens);
    sentinelRedisInstance *si, *master;

    if (numtokens == 8) {
        /* Obtain a reference to the master this hello message is about */
        master = sentinelGetMasterByName(token[4]);
        if (!master) goto cleanup; /* Unknown master, skip the message. */

        /* First, try to see if we already have this sentinel. */
        port = atoi(token[1]);
        master_port = atoi(token[6]);
        si = getSentinelRedisInstanceByAddrAndRunID(
                        master->sentinels,token[0],port,token[2]);
        current_epoch = strtoull(token[3],NULL,10);
        master_config_epoch = strtoull(token[7],NULL,10);
        ...
        /* Update master info if received configuration is newer. */
        if (si && master->config_epoch < master_config_epoch) {
            master->config_epoch = master_config_epoch;
            if (master_port != master->addr->port ||
                strcmp(master->addr->ip, token[5]))
            {
                sentinelAddr *old_addr;

                sentinelEvent(LL_WARNING,"+config-update-from",si,"%@");
                sentinelEvent(LL_WARNING,"+switch-master",
                    master,"%s %s %d %s %d",
                    master->name,
                    master->addr->ip, master->addr->port,
                    token[5], master_port);

                old_addr = dupSentinelAddr(master->addr);
                sentinelResetMasterAndChangeAddress(master, token[5], master_port);
                sentinelCallClientReconfScript(master,
                    SENTINEL_OBSERVER,"start",
                    old_addr,master->addr);
                releaseSentinelAddr(old_addr);
            }
        }

        /* Update the state of the Sentinel. */
        if (si) si->last_hello_time = mstime();
    }

cleanup:
    sdsfreesplitres(token,numtokens);
}

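When another sentinel finds that its config epoch for the master is lower than the broadcast one, and that the master's ip or port differs from the broadcast address, it resets the master using the same sentinelResetMasterAndChangeAddress analysed above. That solves the last puzzle: the monitoring state of the other sentinels is updated as well. Note from the code that the master name matters a great deal; it stays the same when a slave is promoted.
This covers most of sentinel's failover, and the basic flow is now connected end to end.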
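The epoch comparison in the hello handler can be sketched on a reduced state. This is a minimal sketch: `master_view` and `apply_hello` are hypothetical, the lookup of the sending sentinel and of the master by name is skipped, and a fixed-size buffer replaces the sds helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reduced view of what a sentinel knows about one master. */
typedef struct {
    char master_ip[64];
    int master_port;
    uint64_t config_epoch;
} master_view;

/* Parse an 8-token hello message and apply it. Returns true only when the
 * master address was switched. Token layout as in the excerpt:
 * 0=ip,1=port,2=runid,3=current_epoch,4=master_name,
 * 5=master_ip,6=master_port,7=master_config_epoch. */
static bool apply_hello(master_view *m, const char *hello) {
    char buf[256];
    char *tok[8];
    char *p = buf;
    int n = 0;

    assert(m != NULL && hello != NULL);
    if (strlen(hello) >= sizeof(buf)) return false;
    strcpy(buf, hello);

    /* Split in place on commas. */
    while (n < 8) {
        tok[n++] = p;
        p = strchr(p, ',');
        if (p == NULL) break;
        *p++ = '\0';
    }
    if (n != 8 || p != NULL) return false; /* wrong number of tokens */
    if (strlen(tok[5]) >= sizeof(m->master_ip)) return false;

    int port = atoi(tok[6]);
    uint64_t epoch = strtoull(tok[7], NULL, 10);
    if (epoch <= m->config_epoch) return false; /* not newer: ignore */

    m->config_epoch = epoch;
    if (port == m->master_port && strcmp(m->master_ip, tok[5]) == 0)
        return false; /* same address: only the epoch changes */

    strcpy(m->master_ip, tok[5]);
    m->master_port = port;
    return true;
}
```

A message carrying an epoch equal to the local one is ignored, so a sentinel never overwrites its view with a configuration that is not strictly newer.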

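Summary

  1. Detecting that a node is down is split into subjective down and objective down.
  2. Subjective down means the ping probe received no valid reply for a period of time. The check is the same for every node type; the period is configurable, per master.
  3. Objective down applies only to master nodes. A sentinel asks the other sentinels with the SENTINEL is-master-down-by-addr command whether they also see the master as down, and objective down is the state reached once they agree.
  4. A leader is elected to perform the failover. When a sentinel in the cluster detects that a master meets the objective down condition (the number of sentinels that judge the master down reaches the configured quorum), leader election is triggered.
  5. Every sentinel can stand for election and vote, but each node has exactly one vote per epoch. The election has a time limit (10s by default, or the failover timeout if that is configured below 10s). If no leader is elected within the limit, the epoch is incremented and a new round of election starts.
  6. Sentinels also canvass for votes with the SENTINEL is-master-down-by-addr command, so this command plays two roles in the failover.
  7. The first node to obtain votes from a majority of the sentinels (and from no fewer than quorum) is elected leader.
  8. sentinel uses a state machine to drive the failover. Each state is handled asynchronously, in a different timer cycle.
  9. The leader selects the most suitable slave to become the new master and sends it slaveof no one to promote it. It learns about the role change through the info command.
  10. The leader sends slaveof <new master address> to the other slaves to point them at the new master, and again uses info to detect whether each of them has finished reconfiguring.
  11. Once the slaves have been reconfigured, the leader updates the master structure: it rebuilds the slaves dict and resets the master's sentinelRedisInstance. The master name stays unchanged throughout.
  12. After the leader completes the failover, the other sentinel nodes, which subscribe to the hello channel of the same instances, receive the hello msg broadcast by the leader and update their own master structure.
  13. Finally, having worked through redis's sentinel solution makes the Raft algorithm easier to understand.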
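The winning condition from point 7 can be written as a one-line predicate. This is a minimal sketch under the assumption that a candidate needs both an absolute majority of the known sentinels and at least quorum votes; `wins_election` is a hypothetical name, and `voters` is assumed to count the local sentinel as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: does a candidate with `votes` votes win an
 * election among `voters` sentinels with the configured `quorum`? */
static bool wins_election(int votes, int voters, int quorum) {
    assert(voters > 0 && votes >= 0 && votes <= voters);
    int majority = voters / 2 + 1;
    return votes >= majority && votes >= quorum;
}
```

With 5 sentinels and a quorum of 2, two votes are enough to declare the master objectively down but not enough to win the election, which needs three.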