ZooKeeper: a strongly CP system 2022-08-02

The ZooKeeper (Zab) paper

https://www.datadoghq.com/pdf/zab.totally-ordered-broadcast-protocol.2008.pdf

Why the name ZooKeeper

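Early in the project, since many internal projects had already been named after animals (the well-known Pig project, for example), Yahoo engineers wanted an animal name for this one too. Raghu Ramakrishnan, then chief scientist of the research institute, joked: "If this goes on, this place will turn into a zoo!" Everyone immediately agreed to call it the zookeeper: put together, the animal-named distributed components made Yahoo's whole distributed system look like a large zoo, and ZooKeeper was exactly the piece meant to coordinate that distributed environment. That is how the name ZooKeeper came about.
-- quoted from Zhihu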

ZooKeeper resembles an in-memory, distributed file system that provides coordination services

ZooKeeper was designed to store coordination data: status information, configuration, location information, etc.

ZooKeeper: A Distributed Coordination Service for Distributed Applications

ZooKeeper is a cluster of servers that keeps data (such as configuration items) consistent, for use by distributed applications

- much like a file system

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system.

The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash ("/"). Every znode in ZooKeeper's name space is identified by a path. And every znode has a parent whose path is a prefix of the znode with one less element; the exception to this rule is root ("/") which has no parent. Also, exactly like standard file systems, a znode cannot be deleted if it has any children.

The main differences between ZooKeeper and standard file systems are that every znode can have data associated with it (every file can also be a directory and vice-versa) and znodes are limited to the amount of data that they can have.
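
The namespace rules quoted above (parent path as a prefix, data on every node, no deleting a znode that has children) can be sketched with a toy in-memory model. This is an illustration only, not ZooKeeper's implementation; the class name `ZNodeTree` and the `MAX_DATA_BYTES` value are assumptions made for the sketch.

```python
class ZNodeTree:
    """Toy in-memory model of a znode namespace (not ZooKeeper's real code)."""

    MAX_DATA_BYTES = 1024 * 1024  # assumed per-znode data limit for this sketch

    def __init__(self):
        # Maps an absolute path to the bytes stored at that znode.
        self.nodes = {"/": b""}

    @staticmethod
    def parent(path):
        """Return the parent path: the same path with its last element removed."""
        if path == "/":
            return None  # the root is the only znode without a parent
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, data=b""):
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        if self.parent(path) not in self.nodes:
            raise KeyError(f"parent of {path} does not exist")
        if len(data) > self.MAX_DATA_BYTES:
            raise ValueError("data exceeds the per-znode limit")
        self.nodes[path] = data

    def children(self, path):
        """List the direct children of `path`."""
        return sorted(p for p in self.nodes if p != "/" and self.parent(p) == path)

    def delete(self, path):
        if path == "/":
            raise ValueError("cannot delete the root")
        if self.children(path):
            raise ValueError(f"{path} has children")
        del self.nodes[path]
```

Note that `/app` in such a tree can hold data and have children at the same time, which is the "every file can also be a directory" point made above.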

- client-ZooKeeperService communication

The servers that make up the ZooKeeper service must all know about each other. Clients only connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server. When a client first connects to the ZooKeeper service, the first ZooKeeper server will setup a session for the client. If the client needs to connect to another server, this session will get reestablished with the new server.
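
The reconnect behaviour described above can be pictured with a small simulation. Everything here is hypothetical: `ToyClient`, its fields, and the `is_up` callback are invented for illustration and are not the real ZooKeeper client API. The sketch also tries servers in list order purely for simplicity.

```python
class ToyClient:
    """Simulated client that keeps one session across server changes."""

    def __init__(self, servers):
        self.servers = list(servers)   # addresses of the ensemble members
        self.session_id = None         # set up by the first server we reach
        self.current = None            # server we are currently connected to

    def connect(self, is_up):
        """Connect to the first reachable server; `is_up(server)` returns a bool."""
        for server in self.servers:
            if is_up(server):
                if self.session_id is None:
                    # First connection: the server sets up a new session.
                    self.session_id = f"session-created-by-{server}"
                # Later connections re-establish the same session elsewhere.
                self.current = server
                return server
        raise ConnectionError("no ZooKeeper server reachable")
```

For example, a client that first connects to `zk1` and later finds `zk1` down ends up on `zk2` with the same `session_id`.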

- read, write and sync of ZooKeeperService

Read requests sent by a ZooKeeper client are processed locally at the ZooKeeper server to which the client is connected. The same goes for watches: both reads and watches are served by the ZooKeeper server the client is directly connected to. Read responses will be stamped with the last zxid (ZooKeeper Transaction Id) processed by the server that services the read.
Write requests are forwarded to other ZooKeeper servers and go through consensus before a response is generated. A write returns only after a quorum of followers has voted for it and the leader has issued the final commit, which hurts availability. Sync requests are also forwarded to another server, but do not actually go through consensus.

Thus, the throughput of read requests scales with the number of servers and the throughput of write requests decreases with the number of servers.
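Adding ZooKeeper servers helps scale read workloads horizontally, but makes writes slower. For that reason ZooKeeper is a poor fit for service registration and discovery.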
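
One way to see why writes get heavier as the ensemble grows: with simple majority quorums, every write must be acknowledged by more servers before it can commit. The arithmetic below is a minimal sketch; the function names are made up for illustration.

```python
def majority_quorum(n_servers):
    """Smallest number of servers that forms a strict majority."""
    return n_servers // 2 + 1


def tolerated_failures(n_servers):
    """Servers that can fail while a majority quorum is still reachable."""
    return n_servers - majority_quorum(n_servers)


for n in (3, 5, 7):
    # Prints: ensemble size, acks needed per write, failures tolerated
    print(n, majority_quorum(n), tolerated_failures(n))
# 3 2 1
# 5 3 2
# 7 4 3
```

Going from 3 to 7 servers raises fault tolerance from 1 to 3 failures, but each write now waits for 4 acknowledgements instead of 2.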

see
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription
https://zookeeper.apache.org/doc/r3.8.0/zookeeperOver.html
for more

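ZooKeeper's characteristics in one sentence:
ZooKeeper is a typical, strongly CP (in CAP-theorem terms) distributed coordination data cluster; its focus is consistency and partition tolerance.
Its consistency guarantee makes writes heavy (a write can only be committed after a quorum of followers has voted for it), so availability drops while a write is in progress.

Of course, no distributed system can synchronize data at exactly the same instant everywhere; copies on different machines are always updated one after another, and this is unavoidable. All we can do is guarantee that the data is eventually consistent and shorten the window during which it is inconsistent.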

Background:

Zab's atomic broadcast borrows the idea of two-phase commit

two-phase commit:

The commit-request phase (or voting phase), in which a coordinator process attempts to prepare all the transaction's participating processes (named participants, cohorts, or workers) to take the necessary steps for either committing or aborting the transaction and to vote, either "Yes": commit (if the transaction participant's local portion execution has ended properly), or "No": abort (if a problem has been detected with the local portion), and
The commit phase, in which, based on voting of the participants, the coordinator decides whether to commit (only if all have voted "Yes") or abort the transaction (otherwise), and notifies the result to all the participants. The participants then follow with the needed actions (commit or abort) with their local transactional resources (also called recoverable resources; e.g., database data) and their respective portions in the transaction's other output (if applicable).

two-phase commit Message flow:

Coordinator                                          Participant
                             QUERY TO COMMIT
                 -------------------------------->
                             VOTE YES/NO             prepare*/abort*
                 <-------------------------------
commit*/abort*               COMMIT/ROLLBACK
                 -------------------------------->
                             ACKNOWLEDGEMENT          commit*/abort*
                 <--------------------------------  
end
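
The message flow above can be sketched as a single function. This is a simplified, synchronous illustration under the assumption that every participant answers; it does not model timeouts, crashes, or durable logging, and the name `two_phase_commit` is chosen for this sketch only.

```python
def two_phase_commit(participants):
    """Run one transaction; `participants` maps a name to a vote function.

    Each vote function returns True ("Yes") or False ("No").
    """
    # Phase 1 (commit-request / voting): collect a vote from every participant.
    votes = {name: vote() for name, vote in participants.items()}

    # Phase 2 (commit): commit only if all voted "Yes", otherwise roll back.
    decision = "COMMIT" if all(votes.values()) else "ROLLBACK"

    # Every participant applies the decision and acknowledges it.
    acks = {name: decision for name in participants}
    return decision, acks
```

For example, `two_phase_commit({"a": lambda: True, "b": lambda: False})` yields a `ROLLBACK` decision because one participant voted "No".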

Disadvantages:

The greatest disadvantage of the two-phase commit protocol is that it is a blocking protocol.
If the coordinator fails permanently, some participants will never resolve their transactions: After a participant has sent an agreement message to the coordinator, it will block until a commit or rollback is received.

Questions to think about

Why can Zab issue the commit once only a majority of followers has voted for the proposal?
The paper puts it this way:

We are able to simplify the two-phase commit protocol because we do not have aborts; followers either acknowledge the leader’s proposal or they abandon the leader. The lack of aborts also mean that we can commit once a quorum of servers ack the proposal rather than waiting for all servers to respond

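In other words, as long as a follower is alive and healthy, it only ever acks the leader's proposal; it never rejects one. It is a follower, so it follows. A follower that does not follow has crashed or has otherwise dropped out of the ZooKeeper service.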
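
The difference from classic two-phase commit can be sketched from the leader's side. This is a toy model, not Zab itself: the function name `broadcast` is invented, liveness is passed in as a plain boolean, and the sketch assumes the leader counts its own acknowledgement toward the quorum.

```python
def broadcast(proposal, followers):
    """Leader-side sketch: commit once a majority of the ensemble has acked.

    `followers` maps a follower name to True (alive, so it acks) or
    False (crashed or partitioned, so it never answers).
    """
    ensemble_size = len(followers) + 1   # followers plus the leader
    quorum = ensemble_size // 2 + 1
    acks = 1                             # assumption: the leader acks its own proposal
    for name, alive in followers.items():
        if alive:
            acks += 1                    # a live follower never votes "No"
        if acks >= quorum:
            return f"COMMIT {proposal}"  # no need to wait for the remaining followers
    return "NO QUORUM"                   # the leader cannot make progress
```

With two followers of which one is down, `broadcast("x=1", {"f1": True, "f2": False})` still commits, whereas two-phase commit would have to wait for (or abort because of) the missing participant.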

Every healthy follower votes yes unconditionally; by contraposition, any follower that has not voted yes is not in a healthy state. Under that premise:

If we extract the properties that we really need from our use of majorities, we have that we only need to guarantee that groups of processes used to validate an operation by voting (e.g., acknowledging a leader proposal) pairwise intersect in at least one server. Using majorities guarantees such a property. However, there are other ways of constructing quorums different from majorities. For example, we can assign weights to the votes of servers, and say that the votes of some servers are more important. To obtain a quorum, we get enough votes so that the sum of weights of all votes is larger than half of the total sum of all weights.
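
The weighted-quorum rule in the passage above reduces to one comparison. The sketch below is only an illustration of that rule; the function name and the example weights are hypothetical.

```python
def is_weighted_quorum(voters, weights):
    """True if the voters' combined weight exceeds half of the total weight."""
    total = sum(weights.values())
    voted = sum(weights[s] for s in set(voters))
    return 2 * voted > total


weights = {"s1": 3, "s2": 1, "s3": 1, "s4": 1}   # hypothetical weights, total 6

print(is_weighted_quorum({"s1", "s2"}, weights))        # True  (weight 4 > 3)
print(is_weighted_quorum({"s2", "s3", "s4"}, weights))  # False (weight 3 is not > 3)
```

Because every quorum must hold more than half of the total weight, any two quorums share at least one server, which is the pairwise-intersection property. With all weights equal to 1 the rule is the ordinary majority.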
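
Why doesn't the leader need to wait for follower replies after issuing the commit?
Once the commit has been issued there is no turning back; whether followers reply or not changes nothing.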

Last edited: 2022.08.04 17:31:29
