1. The Leader's minCommittedLog and maxCommittedLog
The related fields are defined in the ZKDatabase class:
protected long minCommittedLog, maxCommittedLog;
public static final int commitLogCount = 500;
protected static int commitLogBuffer = 700;
protected LinkedList<Proposal> committedLog = new LinkedList<Proposal>();
The committedLog list is a cache used to let the leader and followers complete data synchronization quickly.
The addCommittedProposal method:
public void addCommittedProposal(Request request) {
WriteLock wl = logLock.writeLock();
try {
wl.lock();
// the list already exceeds the cache size (default 500)
if (committedLog.size() > commitLogCount) {
// if so, drop the oldest entry at the head of the list
committedLog.removeFirst();
// minCommittedLog is the zxid of the first element in committedLog
minCommittedLog = committedLog.getFirst().packet.getZxid();
}
if (committedLog.size() == 0) {
minCommittedLog = request.zxid;
maxCommittedLog = request.zxid;
}
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BinaryOutputArchive boa = BinaryOutputArchive.getArchive(baos);
try {
request.hdr.serialize(boa, "hdr");
if (request.txn != null) {
request.txn.serialize(boa, "txn");
}
baos.close();
} catch (IOException e) {
LOG.error("This really should be impossible", e);
}
QuorumPacket pp = new QuorumPacket(Leader.PROPOSAL, request.zxid,
baos.toByteArray(), null);
Proposal p = new Proposal();
p.packet = pp;
p.request = request;
committedLog.add(p);
// maxCommittedLog is the zxid of the last element in committedLog
maxCommittedLog = p.packet.getZxid();
} finally {
wl.unlock();
}
}
Whenever a new proposal is committed, addCommittedProposal is called to append it to the committedLog list.
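As a rough reference for where this call happens, the snippet below is a simplified sketch based on ZooKeeper 3.4's FinalRequestProcessor; the exact surrounding code and the Request.isQuorum check may differ between versions.
// Sketch of the call site in FinalRequestProcessor.processRequest (ZooKeeper 3.4 style, abridged)
if (request.hdr != null) {
    // apply the transaction to the in-memory database
    rc = zks.processTxn(request.hdr, request.txn);
}
// only quorum (transactional) requests are added to the committedLog cache
if (Request.isQuorum(request.type)) {
    zks.getZKDatabase().addCommittedProposal(request);
}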
2. The Leader's lead flow
In QuorumPeer, when the server state is LEADING:
setLeader(makeLeader(logFactory));
leader.lead();
setLeader(null);
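For context, the surrounding LEADING branch of QuorumPeer.run() looks roughly like the sketch below (based on ZooKeeper 3.4; once lead() returns or throws, the peer falls back to LOOKING and re-enters leader election).
case LEADING:
    LOG.info("LEADING");
    try {
        // build a Leader over the same transaction log factory and start leading
        setLeader(makeLeader(logFactory));
        leader.lead();
        setLeader(null);
    } catch (Exception e) {
        LOG.warn("Unexpected exception", e);
    } finally {
        if (leader != null) {
            leader.shutdown("Forcing shutdown");
            setLeader(null);
        }
        // lead() only returns when leadership is lost, so go back to election
        setPeerState(ServerState.LOOKING);
    }
    break;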
2.1 loadData
The key step in lead() is zk.loadData(), i.e. LeaderZooKeeperServer's loadData(), which restores the sessions and the data tree.
When a new Leader has been elected and starts executing lead(), it calls loadData(). The server's transaction database was in fact already initialized before leader election ran, so that the server could choose the zxid for its initial vote; that zxid is obtained via QuorumPeer#getLastLoggedZxid. Because of this, the database does not need to be initialized again, which avoids the cost of reloading it. Not reloading it matters especially for deployments that host large databases.
Inside loadData():
// check whether zkDb has already been initialized
if(zkDb.isInitialized()){
setZxid(zkDb.getDataTreeLastProcessedZxid());
}
else {
setZxid(zkDb.loadDataBase());
}
The loadDataBase method loads the on-disk database into memory and adds the transactions to the in-memory commit log. It does so by calling restore on snapLog, which is of type FileTxnSnapLog:
public long loadDataBase() throws IOException {
long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
initialized = true;
return zxid;
}
Note the third argument to restore, commitProposalPlaybackListener: it is this listener's callback that invokes addCommittedProposal().
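The listener itself is a small PlayBackListener defined in ZKDatabase; roughly (a sketch based on the ZooKeeper 3.4 sources), every transaction replayed from the log by snapLog.restore() is wrapped in a Request and fed into addCommittedProposal:
// Sketch of ZKDatabase.commitProposalPlaybackListener (ZooKeeper 3.4 style)
private final PlayBackListener commitProposalPlaybackListener = new PlayBackListener() {
    public void onTxnLoaded(TxnHeader hdr, Record txn) {
        // wrap the replayed transaction in a Request and cache it in committedLog
        Request r = new Request(null, 0, hdr.getCxid(), hdr.getType(), null, null);
        r.hdr = hdr;
        r.txn = txn;
        r.zxid = hdr.getZxid();
        addCommittedProposal(r);
    }
};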
2.2 Starting the LearnerCnxAcceptor thread to accept Followers' connections
LearnerCnxAcceptor is an inner class of Leader; it is through this class that the Leader establishes connections with Learners.
The run method of LearnerCnxAcceptor:
- When the Leader instance is created, a ServerSocket is opened on quorumAddress (the leader/follower communication port) for talking to Followers and Observers
- It waits for Learners to connect
- When a Learner connects successfully, the socket timeout is set to initLimit * tickTime, and the leader.nodelay setting controls TCP_NODELAY (it defaults to true, i.e. Nagle's algorithm is disabled)
- A LearnerHandler instance is created to handle the messages and interaction with that Learner (see the sketch below)
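A simplified sketch of LearnerCnxAcceptor.run(), based on the ZooKeeper 3.4 sources with error handling trimmed; field names such as ss, self, and nodelay follow the Leader class:
class LearnerCnxAcceptor extends Thread {
    private volatile boolean stop = false;

    @Override
    public void run() {
        try {
            while (!stop) {
                // ss is the ServerSocket bound to quorumAddress when the Leader was created
                Socket s = ss.accept();
                // start with the initLimit; once the NEWLEADER ack is processed,
                // LearnerHandler switches the timeout to the syncLimit
                s.setSoTimeout(self.tickTime * self.initLimit);
                s.setTcpNoDelay(nodelay);
                // one LearnerHandler thread per connected Learner
                LearnerHandler fh = new LearnerHandler(s, Leader.this);
                fh.start();
            }
        } catch (Exception e) {
            LOG.warn("Exception while accepting follower", e);
        }
    }
}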
The run method of LearnerHandler:
public void run() {
try {
leader.addLearnerHandler(this);
tickOfNextAckDeadline = leader.self.tick.get()
+ leader.self.initLimit + leader.self.syncLimit;
ia = BinaryInputArchive.getArchive(bufferedInput);
bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
oa = BinaryOutputArchive.getArchive(bufferedOutput);
QuorumPacket qp = new QuorumPacket();
// read the first packet sent by the learner
ia.readRecord(qp, "packet");
// if the packet type is neither FOLLOWERINFO nor OBSERVERINFO, log an error and return
if(qp.getType() != Leader.FOLLOWERINFO && qp.getType() != Leader.OBSERVERINFO){
LOG.error("First packet " + qp.toString()
+ " is not FOLLOWERINFO or OBSERVERINFO!");
return;
}
byte learnerInfoData[] = qp.getData();
if (learnerInfoData != null) {
if (learnerInfoData.length == 8) {
ByteBuffer bbsid = ByteBuffer.wrap(learnerInfoData);
this.sid = bbsid.getLong();
} else {
LearnerInfo li = new LearnerInfo();
ByteBufferInputStream.byteBuffer2Record(ByteBuffer.wrap(learnerInfoData), li);
this.sid = li.getServerid();
this.version = li.getProtocolVersion();
}
} else {
this.sid = leader.followerCounter.getAndDecrement();
}
LOG.info("Follower sid: " + sid + " : info : "
+ leader.self.quorumPeers.get(sid));
if (qp.getType() == Leader.OBSERVERINFO) {
learnerType = LearnerType.OBSERVER;
}
long lastAcceptedEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
long peerLastZxid;
StateSummary ss = null;
long zxid = qp.getZxid();
long newEpoch = leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch);
if (this.getVersion() < 0x10000) {
// we are going to have to extrapolate the epoch information
long epoch = ZxidUtils.getEpochFromZxid(zxid);
ss = new StateSummary(epoch, zxid);
// fake the message
leader.waitForEpochAck(this.getSid(), ss);
} else {
byte ver[] = new byte[4];
ByteBuffer.wrap(ver).putInt(0x10000);
QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, ZxidUtils.makeZxid(newEpoch, 0), ver, null);
oa.writeRecord(newEpochPacket, "packet");
bufferedOutput.flush();
QuorumPacket ackEpochPacket = new QuorumPacket();
ia.readRecord(ackEpochPacket, "packet");
if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
LOG.error(ackEpochPacket.toString()
+ " is not ACKEPOCH");
return;
}
ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
leader.waitForEpochAck(this.getSid(), ss);
}
peerLastZxid = ss.getLastZxid();
/* the default to send to the follower */
int packetToSend = Leader.SNAP;
long zxidToSend = 0;
long leaderLastZxid = 0;
/** the packets that the follower needs to get updates from **/
long updates = peerLastZxid;
/* we are sending the diff check if we have proposals in memory to be able to
* send a diff to the
*/
ReentrantReadWriteLock lock = leader.zk.getZKDatabase().getLogLock();
ReadLock rl = lock.readLock();
try {
rl.lock();
final long maxCommittedLog = leader.zk.getZKDatabase().getmaxCommittedLog();
final long minCommittedLog = leader.zk.getZKDatabase().getminCommittedLog();
LOG.info("Synchronizing with Follower sid: " + sid
+" maxCommittedLog=0x"+Long.toHexString(maxCommittedLog)
+" minCommittedLog=0x"+Long.toHexString(minCommittedLog)
+" peerLastZxid=0x"+Long.toHexString(peerLastZxid));
LinkedList<Proposal> proposals = leader.zk.getZKDatabase().getCommittedLog();
if (peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()) {
// Follower is already sync with us, send empty diff
LOG.info("leader and follower are in sync, zxid=0x{}",
Long.toHexString(peerLastZxid));
packetToSend = Leader.DIFF;
zxidToSend = peerLastZxid;
} else if (proposals.size() != 0) {
LOG.debug("proposal size is {}", proposals.size());
if ((maxCommittedLog >= peerLastZxid)
&& (minCommittedLog <= peerLastZxid)) {
LOG.debug("Sending proposals to follower");
// as we look through proposals, this variable keeps track of previous
// proposal Id.
long prevProposalZxid = minCommittedLog;
// Keep track of whether we are about to send the first packet.
// Before sending the first packet, we have to tell the learner
// whether to expect a trunc or a diff
boolean firstPacket=true;
// If we are here, we can use committedLog to sync with
// follower. Then we only need to decide whether to
// send trunc or not
packetToSend = Leader.DIFF;
zxidToSend = maxCommittedLog;
for (Proposal propose: proposals) {
// skip the proposals the peer already has
if (propose.packet.getZxid() <= peerLastZxid) {
prevProposalZxid = propose.packet.getZxid();
continue;
} else {
// If we are sending the first packet, figure out whether to trunc
// in case the follower has some proposals that the leader doesn't
if (firstPacket) {
firstPacket = false;
// Does the peer have some proposals that the leader hasn't seen yet
if (prevProposalZxid < peerLastZxid) {
// send a trunc message before sending the diff
packetToSend = Leader.TRUNC;
zxidToSend = prevProposalZxid;
updates = zxidToSend;
}
}
queuePacket(propose.packet);
QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),
null, null);
queuePacket(qcommit);
}
}
} else if (peerLastZxid > maxCommittedLog) {
LOG.debug("Sending TRUNC to follower zxidToSend=0x{} updates=0x{}",
Long.toHexString(maxCommittedLog),
Long.toHexString(updates));
packetToSend = Leader.TRUNC;
zxidToSend = maxCommittedLog;
updates = zxidToSend;
} else {
LOG.warn("Unhandled proposal scenario");
}
} else {
// just let the state transfer happen
LOG.debug("proposals is empty");
}
LOG.info("Sending " + Leader.getPacketType(packetToSend));
leaderLastZxid = leader.startForwarding(this, updates);
} finally {
rl.unlock();
}
QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
ZxidUtils.makeZxid(newEpoch, 0), null, null);
if (getVersion() < 0x10000) {
oa.writeRecord(newLeaderQP, "packet");
} else {
queuedPackets.add(newLeaderQP);
}
bufferedOutput.flush();
//Need to set the zxidToSend to the latest zxid
if (packetToSend == Leader.SNAP) {
zxidToSend = leader.zk.getZKDatabase().getDataTreeLastProcessedZxid();
}
oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet");
bufferedOutput.flush();
/* if we are not truncating or sending a diff just send a snapshot */
if (packetToSend == Leader.SNAP) {
LOG.info("Sending snapshot last zxid of peer is 0x"
+ Long.toHexString(peerLastZxid) + " "
+ " zxid of leader is 0x"
+ Long.toHexString(leaderLastZxid)
+ "sent zxid of db as 0x"
+ Long.toHexString(zxidToSend));
// Dump data to peer
leader.zk.getZKDatabase().serializeSnapshot(oa);
oa.writeString("BenWasHere", "signature");
}
bufferedOutput.flush();
// Start sending packets
new Thread() {
public void run() {
Thread.currentThread().setName(
"Sender-" + sock.getRemoteSocketAddress());
try {
sendPackets();
} catch (InterruptedException e) {
LOG.warn("Unexpected interruption",e);
}
}
}.start();
/*
* Have to wait for the first ACK, wait until
* the leader is ready, and only then we can
* start processing messages.
*/
qp = new QuorumPacket();
ia.readRecord(qp, "packet");
if(qp.getType() != Leader.ACK){
LOG.error("Next packet was supposed to be an ACK");
return;
}
LOG.info("Received NEWLEADER-ACK message from " + getSid());
leader.waitForNewLeaderAck(getSid(), qp.getZxid(), getLearnerType());
syncLimitCheck.start();
// now that the ack has been processed expect the syncLimit
sock.setSoTimeout(leader.self.tickTime * leader.self.syncLimit);
/*
* Wait until leader starts up
*/
synchronized(leader.zk){
while(!leader.zk.isRunning() && !this.isInterrupted()){
leader.zk.wait(20);
}
}
// Mutation packets will be queued during the serialize,
// so we need to mark when the peer can actually start
// using the data
//
queuedPackets.add(new QuorumPacket(Leader.UPTODATE, -1, null, null));
while (true) {
qp = new QuorumPacket();
ia.readRecord(qp, "packet");
long traceMask = ZooTrace.SERVER_PACKET_TRACE_MASK;
if (qp.getType() == Leader.PING) {
traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
}
if (LOG.isTraceEnabled()) {
ZooTrace.logQuorumPacket(LOG, traceMask, 'i', qp);
}
tickOfNextAckDeadline = leader.self.tick.get() + leader.self.syncLimit;
ByteBuffer bb;
long sessionId;
int cxid;
int type;
switch (qp.getType()) {
case Leader.ACK:
if (this.learnerType == LearnerType.OBSERVER) {
if (LOG.isDebugEnabled()) {
LOG.debug("Received ACK from Observer " + this.sid);
}
}
syncLimitCheck.updateAck(qp.getZxid());
leader.processAck(this.sid, qp.getZxid(), sock.getLocalSocketAddress());
break;
case Leader.PING:
// Process the touches
ByteArrayInputStream bis = new ByteArrayInputStream(qp
.getData());
DataInputStream dis = new DataInputStream(bis);
while (dis.available() > 0) {
long sess = dis.readLong();
int to = dis.readInt();
leader.zk.touch(sess, to);
}
break;
case Leader.REVALIDATE:
bis = new ByteArrayInputStream(qp.getData());
dis = new DataInputStream(bis);
long id = dis.readLong();
int to = dis.readInt();
ByteArrayOutputStream bos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(bos);
dos.writeLong(id);
boolean valid = leader.zk.touch(id, to);
if (valid) {
try {
//set the session owner
// as the follower that
// owns the session
leader.zk.setOwner(id, this);
} catch (SessionExpiredException e) {
LOG.error("Somehow session " + Long.toHexString(id) + " expired right after being renewed! (impossible)", e);
}
}
if (LOG.isTraceEnabled()) {
ZooTrace.logTraceMessage(LOG,
ZooTrace.SESSION_TRACE_MASK,
"Session 0x" + Long.toHexString(id)
+ " is valid: "+ valid);
}
dos.writeBoolean(valid);
qp.setData(bos.toByteArray());
queuedPackets.add(qp);
break;
case Leader.REQUEST:
bb = ByteBuffer.wrap(qp.getData());
sessionId = bb.getLong();
cxid = bb.getInt();
type = bb.getInt();
bb = bb.slice();
Request si;
if(type == OpCode.sync){
si = new LearnerSyncRequest(this, sessionId, cxid, type, bb, qp.getAuthinfo());
} else {
si = new Request(null, sessionId, cxid, type, bb, qp.getAuthinfo());
}
si.setOwner(this);
leader.zk.submitRequest(si);
break;
default:
LOG.warn("unexpected quorum packet, type: {}", packetToString(qp));
break;
}
}
} catch (IOException e) {
if (sock != null && !sock.isClosed()) {
LOG.error("Unexpected exception causing shutdown while sock "
+ "still open", e);
//close the socket to make sure the
//other side can see it being close
try {
sock.close();
} catch(IOException ie) {
// do nothing
}
}
} catch (InterruptedException e) {
LOG.error("Unexpected exception causing shutdown", e);
} finally {
LOG.warn("******* GOODBYE "
+ (sock != null ? sock.getRemoteSocketAddress() : "<null>")
+ " ********");
shutdown();
}
}
Walkthrough of the run method:
- The LearnerHandler instance is added to the Leader's list of LearnerHandlers, learners
- The next ACK deadline is computed as: current tick + initLimit + syncLimit (the Leader exchanges two PINGs with each Learner per tickTime period, after which the tick increments by 1)
- A BinaryInputArchive and a BinaryOutputArchive from the jute serialization layer are obtained for reading and writing data
- The packet sent by the Learner is read and checked: if it is neither FOLLOWERINFO nor OBSERVERINFO, run() returns immediately, because the first packet a Learner sends after connecting to the Leader must be FOLLOWERINFO or OBSERVERINFO
- If the packet's data field is present, it is parsed to obtain the serverId, the protocol version, and the quorum verifier version configVersion, which is then compared with the Leader's quorum verifier version; if the data field is absent, the serverId is assigned from a counter that starts at -1 and keeps decrementing
- The follower info is logged and the learnerType is determined; a JMX service is then registered for this LearnerHandler
- ZxidUtils.getEpochFromZxid(qp.getZxid()) extracts from the zxid carried in the packet the epoch of the most recently accepted transaction, i.e. lastAcceptedEpoch. getEpochToPropose(long sid, long lastAcceptedEpoch) is then called to obtain the cluster's new leader epoch and compute the Leader's zxid (only two kinds of callers execute getEpochToPropose, the LearnerMaster and the LearnerHandlers; each calling thread wait()s for up to initLimit * tickTime until the quorum verifier accepts the proposed leader epoch, after which the method returns and the remaining flow proceeds). getEpochToPropose is described in detail later; the zxid epoch/counter layout itself is sketched after this list
- The protocol version is checked against 0x10000 (the latest version is 0x10000); a LEADERINFO packet is sent to the Learner, the handler waits for the Learner's ACKEPOCH packet, and then calls waitForEpochAck(long sid, StateSummary ss) to wait up to initLimit * tickTime for the epoch ack
- Next comes the data synchronization phase, which distinguishes three different cases; these are covered in more detail later
- For the latest protocol version the NEWLEADER packet is put into the blocking queue queuedPackets to be sent later; older versions send the NEWLEADER packet to the Learner directly. A sender thread is then started that loops, takes the buffered packets from queuedPackets, and sends them to the corresponding Learner. The handler then waits for the Learner's ACK of the NEWLEADER message and, once received, calls waitForNewLeaderAck to wait up to initLimit * tickTime for the NEWLEADER ack
- The sync checker syncLimitCheck is started, the socket timeout is set to syncLimit * tickTime, the handler waits for the learnerMaster's (i.e. the Leader's or ObserverMaster's) ZooKeeper server to start, and an UPTODATE packet is then queued in queuedPackets to be sent later
- Finally the handler loops over packets from the Learner: it reads a packet, updates the next ACK deadline tickOfNextAckDeadline, counts the received packets, and handles the packet according to its type:
- ACK: acknowledges a proposal for a transactional request, meaning the Follower accepts the current proposal; the proposal's processing time in the sync checker syncLimitCheck is updated first, and the ack is then processed by the Leader
- PING: the Learner's reply to a PING the LearnerMaster sent earlier; the data field carries session information, which the LearnerMaster uses to extend the sessions' expiration time
- REVALIDATE: used to re-validate and re-activate a session
- REQUEST: a transactional request or sync request forwarded by the Learner; a Learner cannot process transactional requests itself, so they are handed to the Leader
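The zxid handling above relies on the epoch/counter split implemented by ZxidUtils. The minimal illustration below assumes the usual layout (high 32 bits epoch, low 32 bits counter), which matches how the code above uses getEpochFromZxid and makeZxid:
// Minimal illustration of the zxid layout assumed by ZxidUtils
long zxid    = 0x500000003L;              // example zxid: epoch 5, counter 3
long epoch   = zxid >> 32L;               // what ZxidUtils.getEpochFromZxid returns -> 5
long counter = zxid & 0xffffffffL;        // what ZxidUtils.getCounterFromZxid returns -> 3
long first   = ((epoch + 1) << 32L) | 0L; // ZxidUtils.makeZxid(epoch + 1, 0): first zxid of the new epoch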
2.3 The packet sender thread
// Start sending packets
new Thread() {
public void run() {
Thread.currentThread().setName(
"Sender-" + sock.getRemoteSocketAddress());
try {
sendPackets();
} catch (InterruptedException e) {
LOG.warn("Unexpected interruption",e);
}
}
}.start();
private void sendPackets() throws InterruptedException {
long traceMask = ZooTrace.SERVER_PACKET_TRACE_MASK;
while (true) {
try {
QuorumPacket p;
p = queuedPackets.poll();
if (p == null) {
bufferedOutput.flush();
p = queuedPackets.take();
}
if (p == proposalOfDeath) {
// Packet of death!
break;
}
if (p.getType() == Leader.PING) {
traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
}
if (p.getType() == Leader.PROPOSAL) {
syncLimitCheck.updateProposal(p.getZxid(), System.nanoTime());
}
if (LOG.isTraceEnabled()) {
ZooTrace.logQuorumPacket(LOG, traceMask, 'o', p);
}
oa.writeRecord(p, "packet");
} catch (IOException e) {
if (!sock.isClosed()) {
LOG.warn("Unexpected exception at " + this, e);
try {
// this will cause everything to shutdown on
// this learner handler and will help notify
// the learner/observer instantaneously
sock.close();
} catch(IOException ie) {
LOG.warn("Error closing socket for handler " + this, ie);
}
}
break;
}
}
}
- A dedicated thread is started just to send the packets buffered in the queuedPackets queue
- It loops taking packets from queuedPackets: it first tries a non-blocking poll(); if nothing is available it flushes the output buffer so pending data is sent, then blocks on take() until a new packet arrives
- Trace information is recorded, and for PROPOSAL packets the sync checker syncLimitCheck is updated as well
- The packet is written to the output buffer, lastZxid is reset, and the count of sent packets is updated; how packets enter the queue and how the loop is stopped is sketched below
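Packets enter the queue via queuePacket, and the sender loop is terminated with the proposalOfDeath marker when the handler shuts down. The following is an abridged sketch based on the ZooKeeper 3.4 LearnerHandler:
// LearnerHandler.queuePacket: everything sent to this Learner goes through the queue
void queuePacket(QuorumPacket p) {
    queuedPackets.add(p);
}

// LearnerHandler.shutdown (abridged): the "packet of death" unblocks
// queuedPackets.take() in sendPackets() so the sender thread exits its loop
public void shutdown() {
    try {
        queuedPackets.put(proposalOfDeath);
    } catch (InterruptedException e) {
        LOG.warn("Ignoring unexpected exception", e);
    }
    try {
        if (sock != null && !sock.isClosed()) {
            sock.close();
        }
    } catch (IOException e) {
        LOG.warn("Ignoring unexpected exception during socket close", e);
    }
    this.interrupt();
    leader.removeLearnerHandler(this);
}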
2.4 The sync checker syncLimitCheck
private class SyncLimitCheck {
private boolean started = false;
private long currentZxid = 0;
private long currentTime = 0;
private long nextZxid = 0;
private long nextTime = 0;
public synchronized void start() {
started = true;
}
public synchronized void updateProposal(long zxid, long time) {
if (!started) {
return;
}
if (currentTime == 0) {
currentTime = time;
currentZxid = zxid;
} else {
nextTime = time;
nextZxid = zxid;
}
}
public synchronized void updateAck(long zxid) {
if (currentZxid == zxid) {
currentTime = nextTime;
currentZxid = nextZxid;
nextTime = 0;
nextZxid = 0;
} else if (nextZxid == zxid) {
LOG.warn("ACK for " + zxid + " received before ACK for " + currentZxid + "!!!!");
nextTime = 0;
nextZxid = 0;
}
}
public synchronized boolean check(long time) {
if (currentTime == 0) {
return true;
} else {
long msDelay = (time - currentTime) / 1000000;
return (msDelay < learnerMaster.syncTimeout());
}
}
}
- updateProposal: called when the LearnerHandler sends a proposal; it checks whether the checker has been started and looks at currentTime, the send time of the proposal currently waiting to be ACKed. If currentTime is 0, this proposal becomes the current proposal; otherwise it is recorded as the next proposal
- updateAck: called when the LearnerHandler receives an ACK from the Learner; if the ACKed zxid matches the current proposal, the current proposal is advanced to the nextTime and nextZxid recorded by the earlier updateProposal call, and nextTime/nextZxid are reset to 0. If the ACK instead matches nextZxid, the acks arrived out of order, and nextTime/nextZxid are likewise reset to 0
- check: before the LearnerHandler sends a PING packet to the Learner it checks the proposal processing time: the check passes if currentTime is 0 or if the time elapsed since currentTime is still below the sync timeout syncLimit * tickTime. If a proposal stays un-ACKed long enough that this check fails when a PING is about to be sent, the current LearnerHandler is shut down (note: PING packets are sent twice per tickTime period). A sketch of this ping path follows
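For reference, check() is invoked from the LearnerHandler's ping path; the abridged sketch below is based on the ZooKeeper 3.4 sources, where a failed check shuts the handler down instead of sending the PING:
// LearnerHandler.ping (abridged): called by the Leader twice per tickTime
public void ping() {
    if (syncLimitCheck.check(System.nanoTime())) {
        long id;
        synchronized (leader) {
            id = leader.lastProposed;
        }
        QuorumPacket ping = new QuorumPacket(Leader.PING, id, null, null);
        queuePacket(ping);
    } else {
        // an outstanding proposal has not been ACKed within syncLimit * tickTime
        LOG.warn("Closing connection to peer due to transaction timeout.");
        shutdown();
    }
}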