I have been reading the raft implementation recently and learned a lot from it. I ran into a problem, which is very likely because I misunderstood something. I would appreciate it if someone could check my reasoning.
Suppose an ordinary 4-node cluster with priority node0 > node1 > node2 > node3.
- initial state
| node | currVotedFor | currTerm | role | lastParseResult |
|-------|--------------|----------|-----------| -- |
| node0 | node0 | 1 | leader | PASS |
| node1 | node0 | 1 | follower | WAIT_TO_REVOTE |
| node2 | node0 | 1 | follower | WAIT_TO_REVOTE |
| node3 | node0 | 1 | follower | WAIT_TO_REVOTE |
- and node0 goes down

| node | currVotedFor | currTerm | role | lastParseResult |
|-------|--------------|----------|-----------| -- |
| node1 | node0 | 1 | follower | WAIT_TO_REVOTE |
| node2 | node0 | 1 | follower | WAIT_TO_REVOTE |
| node3 | node0 | 1 | follower | WAIT_TO_REVOTE |
- node1 times out first and becomes a candidate. It issues a vote request with term 1 (it hasn't increased its term yet), but the request is refused because node2 and node3 still believe there is a leader, so they return REJECT_ALREADY_HAS_LEADER and do nothing else. Upon receiving these responses, node1 resets its timer and stays in the WAIT_TO_REVOTE state. The same thing happens to node2. After node2 has received its REJECT_ALREADY_HAS_LEADER responses, the state is:

| node | currVotedFor | currTerm | role | lastParseResult |
|-------|--------------|----------|-----------| -- |
| node1 | node1 | 1 | candidate | WAIT_TO_REVOTE |
| node2 | node2 | 1 | candidate | WAIT_TO_REVOTE |
| node3 | node0 | 1 | follower | WAIT_TO_REVOTE |
- This way, only node3, which times out last, requests votes without getting any REJECT_ALREADY_HAS_LEADER response, and it moves on to the WAIT_TO_VOTE_NEXT state.
From the time it receives the 2 REJECT_ALREADY_VOTED responses (because node1 and node2 both voted for themselves), it starts a timer: lastVotedTime + a random value between 300 ms and 1000 ms (this range can be changed). Since the value is random, node3 may end up with the shortest timeout. When that timer expires, node3 increases its term and forces node1 and node2 to increase their terms too (by setting their needIncreaseTermImmediately). This way node3 becomes the final leader, which reverses the priority order.
This can be mitigated by giving node3 a much larger timeout interval, but we might still look for something better.
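To make the race concrete, here is a minimal sketch of the timer as I understand it. It is not DLedger's actual code: the class, constants, and method names are hypothetical, and it assumes for simplicity that all three nodes share the same lastVotedTime and draw from the same interval.

```java
import java.util.Random;

// Hypothetical model of the randomized vote timer; not DLedger's real code.
public class VoteTimerSketch {
    // Bounds taken from the description above (both are configurable there).
    static final int MIN_VOTE_INTERVAL_MS = 300;
    static final int MAX_VOTE_INTERVAL_MS = 1000;

    // lastVotedTime + a random value in [300, 1000) ms.
    static long nextVoteTime(long lastVotedTime, Random random) {
        int jitter = random.nextInt(MAX_VOTE_INTERVAL_MS - MIN_VOTE_INTERVAL_MS);
        return lastVotedTime + MIN_VOTE_INTERVAL_MS + jitter;
    }

    // Index of the node whose timer expires first (ties go to the lower index).
    static int firstToFire(long[] nextVoteTimes) {
        int winner = 0;
        for (int i = 1; i < nextVoteTimes.length; i++) {
            if (nextVoteTimes[i] < nextVoteTimes[winner]) {
                winner = i;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        Random random = new Random();
        String[] nodes = {"node1", "node2", "node3"};
        long[] times = new long[nodes.length];
        for (int i = 0; i < nodes.length; i++) {
            // Assumption: every node starts from the same lastVotedTime (0).
            times[i] = nextVoteTime(0L, random);
        }
        System.out.println(nodes[firstToFire(times)] + " increases its term first");
    }
}
```

Under these assumptions nothing in the draw depends on priority, so repeated runs print node3 a fair share of the time, which is the reversal I am worried about.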
Some thoughts: in raft a candidate usually increases its term first and then requests votes. Here node1 requests votes without increasing its term, and I suppose this implements the pre-vote algorithm mentioned in the paper.
IMHO, pre-vote lets a potential candidate (like node1) check whether it could win the election (that is, whether its log is more up-to-date than a majority of nodes) before increasing its term. In our current implementation, node1 first requests votes with term 1, and other followers with the same term return REJECT_ALREADY_HAS_LEADER. In this case the followers don't check whether the incoming vote request is more up-to-date than they are. If we can somehow subdivide the REJECT_ALREADY_HAS_LEADER response into something like
- REJECT_ALREADY_HAS_LEADER_PREVOTE_ACCEPT
- REJECT_ALREADY_HAS_LEADER_PREVOTE_REJECT
then once node1 receives enough prevote_accept responses, it can increase its term and start a revote immediately. Otherwise it just remains in the WAIT_TO_REVOTE state.

P.S. I actually couldn't find how pre-vote is implemented in DLedger. Is there any design documentation for it? Thanks in advance for any hints on the pre-vote implementation in DLedger.
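
A rough sketch of what I have in mind is below. Everything in it is hypothetical (the class, the `LogPosition` type, and the method names are made up for illustration, and only the two enum values come from the proposal above). It also assumes "more up-to-date" means the usual raft comparison of the last entry's term, then its index.

```java
import java.util.List;

// Hypothetical sketch of the proposed split response; not existing DLedger code.
public class PreVoteSketch {
    enum VoteResult {
        REJECT_ALREADY_HAS_LEADER_PREVOTE_ACCEPT,
        REJECT_ALREADY_HAS_LEADER_PREVOTE_REJECT
    }

    // Term and index of a node's last log entry (illustrative type).
    static final class LogPosition {
        final long term;
        final long index;

        LogPosition(long term, long index) {
            this.term = term;
            this.index = index;
        }

        // Standard raft rule: compare the last term first, then the last index.
        boolean atLeastAsUpToDateAs(LogPosition other) {
            if (term != other.term) {
                return term > other.term;
            }
            return index >= other.index;
        }
    }

    // Follower side: it still believes there is a leader, so the vote is
    // rejected either way, but the response now carries the log comparison.
    static VoteResult rejectWithPreVote(LogPosition candidateLast, LogPosition ownLast) {
        return candidateLast.atLeastAsUpToDateAs(ownLast)
                ? VoteResult.REJECT_ALREADY_HAS_LEADER_PREVOTE_ACCEPT
                : VoteResult.REJECT_ALREADY_HAS_LEADER_PREVOTE_REJECT;
    }

    // Candidate side: count its own implicit accept plus the prevote accepts,
    // and revote immediately only if that is a majority of the whole cluster.
    static boolean shouldRevoteImmediately(List<VoteResult> responses, int clusterSize) {
        int accepts = 1; // the candidate itself
        for (VoteResult r : responses) {
            if (r == VoteResult.REJECT_ALREADY_HAS_LEADER_PREVOTE_ACCEPT) {
                accepts++;
            }
        }
        return accepts > clusterSize / 2;
    }
}
```

In the 4-node example, node1 would need prevote accepts from both node2 and node3 (3 of 4 including itself) before increasing its term; with only one accept it would stay in WAIT_TO_REVOTE.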