ZooKeeper paper
https://www.datadoghq.com/pdf/zab.totally-ordered-broadcast-protocol.2008.pdf
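Why is it called ZooKeeper
Early in the project, since many earlier internal projects had been named after animals (the well-known Pig project, for example), Yahoo engineers wanted to give this project an animal name too. Raghu Ramakrishnan, then chief scientist of the research lab, joked: "If this goes on, this place will turn into a zoo!" At that, everyone agreed to call it the zookeeper: with all the animal-named distributed components put together, Yahoo's distributed system looked like a large zoo, and ZooKeeper was exactly the thing meant to coordinate that distributed environment. That is how ZooKeeper got its name.
-- excerpted from Zhihu
ZooKeeper is like an in-memory, distributed file system that provides coordination services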
ZooKeeper was designed to store coordination data: status information, configuration, location information, etc.
ZooKeeper: A Distributed Coordination Service for Distributed Applications
ZooKeeper is a cluster of servers that keeps data (such as configuration items) consistent, for use by distributed applications
- much like a file system
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system.
The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash ("/"). Every znode in ZooKeeper's name space is identified by a path. And every znode has a parent whose path is a prefix of the znode with one less element; the exception to this rule is root ("/") which has no parent. Also, exactly like standard file systems, a znode cannot be deleted if it has any children.
The main differences between ZooKeeper and standard file systems are that every znode can have data associated with it (every file can also be a directory and vice-versa) and znodes are limited to the amount of data that they can have.
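The naming rules above can be illustrated with a small toy model. This is only a sketch: `ZNodeTree` and its methods are hypothetical names made up for these notes, not the ZooKeeper client API, and it ignores sessions, versions, ACLs and size limits.

```python
class ZNodeTree:
    """Toy in-memory model of the znode namespace (not the real ZooKeeper API)."""

    def __init__(self):
        # The root "/" always exists and has no parent.
        self.data = {"/": b""}

    @staticmethod
    def parent(path):
        """Return the parent path (the path with one less element), or None for root."""
        if path == "/":
            return None
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, value=b""):
        if path in self.data:
            raise KeyError(f"node exists: {path}")
        if self.parent(path) not in self.data:
            raise KeyError(f"no parent for: {path}")
        self.data[path] = value

    def children(self, path):
        return sorted(p for p in self.data if p != "/" and self.parent(p) == path)

    def delete(self, path):
        if path == "/":
            raise ValueError("cannot delete root")
        if self.children(path):
            # A znode with children cannot be deleted.
            raise ValueError(f"not empty: {path}")
        del self.data[path]


tree = ZNodeTree()
tree.create("/app", b"config-v1")    # a znode holds data...
tree.create("/app/workers")          # ...and can also have children
print(tree.parent("/app/workers"))   # /app
print(tree.children("/app"))         # ['/app/workers']
```

Note how `/app` is both a "file" (it carries `b"config-v1"`) and a "directory" (it has a child), which is the main difference from a standard file system.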
- communication between client and ZooKeeper service
The servers that make up the ZooKeeper service must all know about each other. Clients only connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server. When a client first connects to the ZooKeeper service, the first ZooKeeper server will setup a session for the client. If the client needs to connect to another server, this session will get reestablished with the new server.
- read, write and sync of ZooKeeperService
- Read requests sent by a ZooKeeper client are processed locally at the ZooKeeper server to which the client is connected. Watches work the same way: both reads and watches are served by the ZooKeeper server the client is directly connected to. Read responses will be stamped with the last zxid (ZooKeeper Transaction Id) processed by the server that services the read.
- Write requests are forwarded to other ZooKeeper servers and go through consensus before a response is generated. A write returns only after a quorum of followers has acknowledged it and the leader has sent the commit, which hurts availability. Sync requests are also forwarded to another server, but do not actually go through consensus.
Thus, the throughput of read requests scales with the number of servers and the throughput of write requests decreases with the number of servers.
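Adding ZooKeeper servers helps reads scale horizontally, but writes take longer. For this reason, ZooKeeper is not a good fit for service registration and discovery.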
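A toy model may help make the read path and write path concrete. Everything here is an assumption made for illustration: `Server`, `Ensemble`, the `lagging` parameter and `catch_up` are hypothetical names, server 0 is assumed to be the leader, and the real protocol is asynchronous and far more involved.

```python
class Server:
    """One replica: a key-value store plus the last zxid it has applied."""

    def __init__(self):
        self.store = {}
        self.last_zxid = 0

    def apply(self, zxid, key, value):
        self.store[key] = value
        self.last_zxid = zxid

    def read(self, key):
        # Reads are answered locally and stamped with this server's last zxid.
        return self.store.get(key), self.last_zxid


class Ensemble:
    def __init__(self, n):
        self.servers = [Server() for _ in range(n)]  # server 0 acts as leader
        self.next_zxid = 1

    def write(self, key, value, lagging=()):
        """Commit a write once a quorum has it.

        `lagging` lists follower indexes that have not applied the write yet
        (assumed never to include the leader, index 0).
        """
        quorum = len(self.servers) // 2 + 1
        up_to_date = [i for i in range(len(self.servers)) if i not in lagging]
        if len(up_to_date) < quorum:
            raise RuntimeError("no quorum, write cannot commit")
        zxid = self.next_zxid
        self.next_zxid += 1
        for i in up_to_date:
            self.servers[i].apply(zxid, key, value)
        return zxid

    def catch_up(self, i):
        """Bring server i up to date with the leader (loosely the role of a sync)."""
        leader = self.servers[0]
        self.servers[i].store = dict(leader.store)
        self.servers[i].last_zxid = leader.last_zxid


ens = Ensemble(5)
ens.write("/config", "v1")
ens.write("/config", "v2", lagging=(4,))
print(ens.servers[1].read("/config"))  # ('v2', 2)
print(ens.servers[4].read("/config"))  # ('v1', 1), a stale read with an older zxid
ens.catch_up(4)
print(ens.servers[4].read("/config"))  # ('v2', 2)
```

Each added server is one more place to serve reads, but it also raises the quorum size (`n // 2 + 1`) that every write has to reach.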
see
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription
https://zookeeper.apache.org/doc/r3.8.0/zookeeperOver.html
for more
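One-sentence summary of ZooKeeper:
ZooKeeper is a typical, strongly CP (in CAP terms) distributed coordination data cluster; its focus is consistency and partition tolerance.
Its consistency guarantee makes writes heavy (the leader has to wait for a quorum of followers to acknowledge before the final commit), so availability drops during writes.
Of course, no distributed system can keep data synchronized everywhere at the same instant; replication across machines always happens in some order, and that is unavoidable. All we can do is guarantee that the data is eventually consistent and shorten the window during which it is inconsistent.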
Background:
Zab's atomic broadcast borrows the idea of two-phase commit.
two-phase commit:
- The commit-request phase (or voting phase), in which a coordinator process attempts to prepare all the transaction's participating processes (named participants, cohorts, or workers) to take the necessary steps for either committing or aborting the transaction and to vote, either "Yes": commit (if the transaction participant's local portion execution has ended properly), or "No": abort (if a problem has been detected with the local portion), and
- The commit phase, in which, based on voting of the participants, the coordinator decides whether to commit (only if all have voted "Yes") or abort the transaction (otherwise), and notifies the result to all the participants. The participants then follow with the needed actions (commit or abort) with their local transactional resources (also called recoverable resources; e.g., database data) and their respective portions in the transaction's other output (if applicable).
two-phase commit Message flow:
Coordinator                                         Participant
                      QUERY TO COMMIT
          -------------------------------->
                      VOTE YES/NO                   prepare*/abort*
          <-------------------------------
commit*/abort*        COMMIT/ROLLBACK
          -------------------------------->
                      ACKNOWLEDGEMENT               commit*/abort*
          <--------------------------------
end
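The message flow above can be sketched in a few lines. This is a minimal, single-process illustration under simplifying assumptions (no network, no timeouts, no crash recovery); `Participant`, `Vote` and `two_phase_commit` are hypothetical names chosen for these notes.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether the local portion of the work succeeded
        self.state = "init"

    def prepare(self):
        # Phase 1: answer QUERY TO COMMIT with a vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return Vote.YES if self.can_commit else Vote.NO

    def finish(self, decision):
        # Phase 2: apply the coordinator's decision and acknowledge it.
        self.state = decision
        return "ack"


def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Commit only if all voted YES; a single NO aborts the transaction.
    decision = "committed" if all(v is Vote.YES for v in votes) else "aborted"
    # Phase 2: notify every participant and wait for every acknowledgement.
    acks = [p.finish(decision) for p in participants]
    assert len(acks) == len(participants)
    return decision


print(two_phase_commit([Participant("a"), Participant("b")]))                    # committed
print(two_phase_commit([Participant("a"), Participant("b", can_commit=False)]))  # aborted
```

The coordinator needs a reply from every participant in both phases, which is why a single unresponsive process can stall the whole transaction.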
Disadvantages:
The greatest disadvantage of the two-phase commit protocol is that it is a blocking protocol.
If the coordinator fails permanently, some participants will never resolve their transactions: After a participant has sent an agreement message to the coordinator, it will block until a commit or rollback is received.
Questions
- Why does Zab need acknowledgements from only a majority of followers before it sends the commit?
The paper puts it this way: We are able to simplify the two-phase commit protocol because we do not have aborts; followers either acknowledge the leader’s proposal or they abandon the leader. The lack of aborts also mean that we can commit once a quorum of servers ack the proposal rather than waiting for all servers to respond.
In other words, as long as a follower is alive and healthy, it only ever acks the leader's proposal; it never rejects one. It is a follower, so it only follows. If it does not follow, it is down (or has failed in some other way) and has dropped out of the ZooKeeper service.
Every healthy follower acks unconditionally, and the contrapositive is that any follower that has not acked is not healthy. Under that premise:
If we extract the properties that we really need from our use of majorities, we have that we only need to guarantee that groups of processes used to validate an operation by voting (e.g., acknowledging a leader proposal) pairwise intersect in at least one server. Using majorities guarantees such a property. However, there are other ways of constructing quorums different from majorities. For example, we can assign weights to the votes of servers, and say that the votes of some servers are more important. To obtain a quorum, we get enough votes so that the sum of weights of all votes is larger than half of the total sum of all weights.
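The pairwise-intersection property can be checked by brute force on a small ensemble. This is a sketch under assumptions: the server names and weights are invented for illustration, and the helper functions are hypothetical, not part of ZooKeeper.

```python
from itertools import combinations


def is_majority_quorum(voters, servers):
    """True if the voters form a strict majority of the servers."""
    return len(set(voters) & set(servers)) > len(servers) / 2


def is_weighted_quorum(voters, weights):
    """True if the voters' total weight exceeds half of the total weight."""
    total = sum(weights.values())
    got = sum(weights[s] for s in set(voters))
    return got > total / 2


def quorums_pairwise_intersect(weights):
    """Enumerate every weighted quorum and check that any two share a server."""
    servers = list(weights)
    quorums = [
        set(group)
        for size in range(1, len(servers) + 1)
        for group in combinations(servers, size)
        if is_weighted_quorum(group, weights)
    ]
    return all(a & b for a, b in combinations(quorums, 2))


weights = {"s1": 3, "s2": 1, "s3": 1, "s4": 1}  # s1's vote counts three times
print(is_majority_quorum({"s1", "s2"}, list(weights)))  # False: 2 of 4 is not a majority
print(is_weighted_quorum({"s1", "s2"}, weights))        # True: weight 4 of 6
print(is_weighted_quorum({"s2", "s3", "s4"}, weights))  # False: weight 3 of 6
print(quorums_pairwise_intersect(weights))              # True
```

Two disjoint groups cannot both hold more than half of the total weight, so any two quorums must overlap in at least one server. Plain majorities are the special case where every weight is 1.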
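- Why doesn't the leader need to wait for followers to reply after it sends the commit?
Once the commit is sent there is no going back: the proposal has already been acknowledged by a quorum, so whether a follower replies or not has no effect on the outcome.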
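Both answers can be put together in a toy broadcast round. This is a synchronous, single-process sketch and not the Zab implementation: `Leader`, `Follower` and their methods are hypothetical names, and recovery, leader election and ordering guarantees are left out entirely.

```python
class Follower:
    def __init__(self, alive=True):
        self.alive = alive
        self.log = []        # proposals accepted, in order
        self.committed = []  # zxids delivered

    def on_proposal(self, zxid, value):
        # A live follower always acks; there is no "abort" vote.
        if not self.alive:
            return False     # models a crashed or partitioned follower: no reply
        self.log.append((zxid, value))
        return True

    def on_commit(self, zxid):
        if self.alive:
            self.committed.append(zxid)


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.zxid = 0
        self.committed = []

    def broadcast(self, value):
        self.zxid += 1
        acks = 1  # the leader counts its own vote
        acks += sum(f.on_proposal(self.zxid, value) for f in self.followers)
        ensemble_size = len(self.followers) + 1
        if acks <= ensemble_size // 2:
            return False  # no quorum: in this sketch the proposal stays uncommitted
        self.committed.append(self.zxid)
        for f in self.followers:
            f.on_commit(self.zxid)  # fire and forget: no reply is awaited
        return True


followers = [Follower(), Follower(), Follower(alive=False), Follower(alive=False)]
leader = Leader(followers)
print(leader.broadcast("set /config v1"))  # True: 3 of 5 acked
print(followers[0].committed)              # [1]
```

The commit decision depends only on the ack count from the proposal phase, which is why `on_commit` returns nothing that the leader inspects.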