Distributed systems theory for the distributed systems engineer 翻译 中英对照

Distributed systems theory for the distributed systems engineer

适合 分布式系统工程师 的 分布式系统理论

Gwen Shapira, who at the time was an engineer at Cloudera and now is spreading the Kafka gospel, asked a question on Twitter that got me thinking.

Gwen Shapira曾在Cloudera做工程师,现在宣传Kafka,他在Twitter问了以下问题,使我有所思考。

I need to improve my proficiency in distributed systems theory. Where do I start? Any recommended books?
我想在分布式理论上有所提升。应该从哪开始?有推荐的书?
— Gwen (Chen) Shapira (@gwenshap) August 7, 2014

My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”,
我第一反应是“可以看:FLP论文、paxos论文、Byzantine将军论文”,
and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed.
我推荐的主要阅读材料,如果你贸然去读,你至少要阅读6个月才会有感觉。
But I’ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program).
由此可知,推荐一吨的理论论文让你阅读,这是了解分布式系统的错误的方式。(除非你在读博士)
Papers are usually deep, usually complex, and require both serious study, and usually significant experience to glean their important contributions and to place them in context.
论文一般是深奥、复杂的,而且需要一系列学习和丰富的经验才能感觉到其贡献、才能其放到对应的场景(以理解和应用)。
What good is requiring that level of expertise of engineers?
工程师了解分布式理论有什么好处?

And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory;
很不幸,几乎没有好的引导文章,来总结、提炼、场景化 分布式系统理论中的重要结论和想法;
particularly material that does so without condescending.
特别是 通俗易懂的引导文章 更没有。
Considering that gap led me to another interesting question:
考虑这样的空白区域,让我想问另一个问题:

What distributed systems theory should a distributed systems engineer know?
一个分布式系统工程师应该了解什么样的分布式系统理论?

A little theory is, in this case, not such a dangerous thing.
这种情况下,了解一点点理论并不是坏事。
So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer.
我日常工作是一个分布式系统工程师,我认为适合我的基本概念,下面会给出这些基本概念。
Let me know what you think I missed!
你认为我缺失的请告知我!

First steps 准备

These four readings do a pretty good job of explaining what about building distributed systems is challenging.
下面四个读物解释了构建分布式系统会遇到的困难。
Collectively they outline a set of abstract but technical difficulties that the distributed systems engineer has to overcome, and set the stage for the more detailed investigation in later sections.
这些读物都勾勒了一些列 抽象而非技术 的困难,分布式系统工程师必须要克服这些困难。这些读物的后面章节有更详细的研究。

Distributed Systems for Fun and Profit is a short book which tries to cover some of the basic issues in distributed systems including the role of time and different strategies for replication.
Distributed Systems for Fun and Profit 是一本小书,它想覆盖分布式系统中的一些基本问题,包括 时钟所起的作用、不同策略的复制。

Notes on distributed systems for young bloods - not theory, but a good practical counterbalance to keep the rest of your reading grounded.
Notes on distributed systems for young bloods - 非理论,而是一个很好的实践,以让你落到实处。

A Note on Distributed Systems - a classic paper on why you can’t just pretend all remote interactions are like local objects.
A Note on Distributed Systems - 一个经典论文,关于 为什么你不能假装所有远程交互像本地对象一样。

The fallacies of distributed computing - 8 fallacies of distributed computing that set the stage for the kinds of things system designers forget.
The fallacies of distributed computing 分布式计算的8个错误的推论,以提醒系统设计者。

You should know about safety and liveness properties:
你应该知道 安全 和 活力:

  • safety properties say that nothing bad will ever happen. For example, the property of never returning an inconsistent value is a safety property, as is never electing two leaders at the same time.

  • 安全 说的是 永远不会发生坏事。比如,不返回不一致的值 是 一种 安全, 同一时刻不会选出两个 主节点 也是 一种 安全。

  • liveness properties say that something good will eventually happen. For example, saying that a system will eventually return a result to every API call is a liveness property, as is guaranteeing that a write to disk always eventually completes.

  • 活力 说的是 好事情终究会发生。比如,对于每个api调用,一个系统终究会返回一个结果,这是一种 活力;保证一次写磁盘最终总能结束,这是一种 活力。
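The two kinds of property can be made concrete with a small sketch. It assumes a hypothetical model in which an execution is a finite list of states, and each state is the set of nodes that currently believe they are leader; the function names and node ids are illustrative only.

```python
from typing import List, Set

# Hypothetical model: a trace is a list of states, and each state is the
# set of node ids that currently believe they are the leader.
Trace = List[Set[str]]


def at_most_one_leader(trace: Trace) -> bool:
    """Safety: no state in the trace ever contains two leaders."""
    return all(len(leaders) <= 1 for leaders in trace)


def eventually_some_leader(trace: Trace) -> bool:
    """Liveness, approximated on a finite trace: some state has a leader."""
    return any(len(leaders) == 1 for leaders in trace)


good = [set(), {"a"}, {"a"}, set(), {"b"}]
bad = [set(), {"a"}, {"a", "b"}]

print(at_most_one_leader(good), eventually_some_leader(good))
print(at_most_one_leader(bad))
```

A safety violation can always be pinned to a finite prefix of an execution (the state with two leaders), whereas a finite trace can never refute a liveness property, because the good thing might still happen later. That is why the liveness check here is only an approximation.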

Failure and Time 失败和时钟

Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:
分布式系统工程师面对的许多困难可以归结为以下两个原因:

  1. Processes may fail
     进程可能失败

  2. There is no good way to tell that they have done so

There is a very deep relationship between what, if anything, processes share about their knowledge of time, what failure scenarios are possible to detect, and what algorithms and primitives may be correctly implemented.
进程间怎么共用时钟、什么样的失败可以检测、什么样的算法和原语可以被正确实现,这三者之间有很深的联系。
Most of the time, we assume that two different nodes have absolutely no shared knowledge of what time it is, or how quickly time passes.
一般情况下,我们假设不同节点绝对无法共用时钟(时刻值或流过了多少时间)
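A minimal sketch of why failure is hard to detect, assuming a simple timeout-based detector (the `TimeoutDetector` class and its method names are hypothetical, and the caller passes time in explicitly so the example is deterministic). A node that is merely slow is indistinguishable from one that has crashed, so the detector can only ever suspect.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TimeoutDetector:
    """Suspects any node whose last heartbeat is older than `timeout`."""
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the local time at which a heartbeat from `node` arrived.
        self.last_seen[node] = now

    def suspected(self, now: float) -> Set[str]:
        # A suspicion is a guess, not proof that the node has failed.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


d = TimeoutDetector(timeout=3.0)
d.heartbeat("a", now=0.0)
d.heartbeat("b", now=0.0)
d.heartbeat("a", now=2.0)
print(d.suspected(now=4.0))   # "b" is suspected: nothing heard for 4 time units

d.heartbeat("b", now=5.0)     # "b" was only slow, not dead
print(d.suspected(now=5.0))   # the earlier suspicion turns out to be wrong
```

Picking the timeout is the trade-off: a short one produces false suspicions, a long one delays the detection of real failures.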

You should know:
你应该知道:

The basic tension of fault tolerance 容错导致的基本矛盾

A system that tolerates some faults without degrading must be able to act as though those faults had not occurred.
一个系统容忍一些错误而没有降级 必须能当成 就像这些错误没有发生过一样。
This means usually that parts of the system must do work redundantly, but doing more work than is absolutely necessary typically carries a cost both in performance and resource consumption.
这意味着系统的一部分要冗余地工作(同样的功能部署多个节点),冗余是绝对必要的,冗余一般会带来性能和资源的消耗。
This is the basic tension of adding fault tolerance to a system.
这就是给一个系统添加冗余的基本矛盾。

You should know:
你应该知道:

  • The quorum technique for ensuring single-copy serialisability. See Skeen’s original paper, but perhaps better is Wikipedia’s entry.

  • 确保串行单复制的多数派技术. 见 Skeen的原始论文, 不过或许更好的是 Wikipedia’s entry.
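    (Translator's note: within the quorum, one node is the primary and the rest are replicas. The order of write requests received by the primary is authoritative [serial], and the primary unilaterally requires the replicas to accept its own sequence of writes [the replicas may not resist or object: they are non-malicious, follow the global rules, and are non-Byzantine].)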

  • About 2-phase-commit, 3-phase-commit and Paxos, and why they have different fault-tolerance properties.

  • 两步提交三步提交Paxos, 以及为什么他们不同于容错.

  • How eventual consistency, and other techniques, seek to avoid this tension at the cost of weaker guarantees about system behaviour. The Dynamo paper is a great place to start, but also Pat Helland’s classic Life Beyond Transactions is a must-read.

  • 最终一致性、其他技术 以 对系统行为做更弱的保证 为代价 来 设法避开 此矛盾 . 可以看 Dynamo 论文 , 不过 必须要读 Pat Helland的论文 经典 Life Beyond Transactions .
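The quorum idea in the first bullet can be illustrated with a brute-force check. This is a sketch under a simplifying assumption: N replicas, reads contact any R of them and writes contact any W of them (the function name is hypothetical). A read is guaranteed to see the latest write only if every read quorum overlaps every write quorum, which holds exactly when R + W > N.

```python
from itertools import combinations


def quorums_always_intersect(n: int, r: int, w: int) -> bool:
    """True if every r-node read set shares a node with every w-node write set."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


# Three replicas, with different read/write quorum sizes.
for r, w in [(2, 2), (1, 3), (1, 2)]:
    print(r, w, quorums_always_intersect(3, r, w), r + w > 3)
```

The tension described above shows up directly: larger quorums give the overlap guarantee but make every operation wait for more replicas, while R = 1, W = 2 on three nodes is cheaper and can return stale data.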

Basic primitives 基本原语

There are few agreed-upon basic building blocks in distributed systems, but more are beginning to emerge. You should know what the following problems are, and where to find a solution for them:
在分布式系统中,很少有约定的基本构建块,更多的是处于形成中的基本构建块。有应该知道下面的问题是什么,并且从哪能找到他们的解决方案:

Fundamental Results 基础结论

Some facts just need to be internalised. There are more than this, naturally, but here’s a flavour:
有些事实只需要主观理解(不需要关注证明).

  • You can’t implement consistent storage and respond to all requests if you might drop messages between processes. This is the CAP theorem.

  • 如果节点间可能丢失消息[:P],那么你不可能 既 实现一致性存储[:C] 又 响应所有时刻的请求[:A]. 这就是 CAP理论.

  • Consensus is impossible to implement in such a way that it both a) is always correct and b) always terminates if even one machine might fail in an asynchronous system with crash-stop failures (the FLP result). The first slides - before the proof gets going - of my Papers We Love SF talk do a reasonable job of explaining the result, I hope. Suggestion: there’s no real need to understand the proof.

  • 在一个异步系统中,一致性不可能以这样一个途径实现:既a) 总是正确的 ; 又b) 总是能结束 即使只有一个节点可能以 崩溃-*停止 失败 (FLP结论). 在看证明之前,看下我以简明的方式解释FLP结论的论文 Papers We Love SF talk . 建议: 没有理解证明的需求.
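    (Translator's note: in an asynchronous system, assume that a node that crashes stays stopped rather than recovering. Then 1. always producing a correct result and 2. returning a result for every request within finite time cannot both be guaranteed. This is the FLP result.)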

  • Consensus is impossible to solve in fewer than 2 rounds of messages in general.

  • 一般地,只进行少于2轮的消息传递,不可能达成一致性 .

  • Atomic broadcast is exactly as hard as consensus - in a precise sense, if you solve atomic broadcast, you solve consensus, and vice versa. Chandra and Toueg prove this, but you just need to know that it’s true.

  • 原子广播和一致性,二者的难度精确的相等。更直白的说,如果你能解原子广播,那么你也能解一致性,反之亦然。 Chandra 和 Toueg 证明了这一点, 但是你只需要知道这个论断是成立的。
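One direction of the equivalence in the last bullet can be sketched in a few lines. Assume an atomic broadcast primitive that delivers the same messages in the same order to everyone; consensus then follows by having every process broadcast its proposal and decide on the first delivered value. The `ToyAtomicBroadcast` class below is a hypothetical single-process stand-in (a shared list), so it shows only the shape of the reduction and models no failures or asynchrony.

```python
from typing import Dict, List, Tuple


class ToyAtomicBroadcast:
    """Stand-in primitive: one shared log is the agreed delivery order."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, str]] = []

    def broadcast(self, sender: str, value: str) -> None:
        self.log.append((sender, value))

    def delivered(self) -> List[Tuple[str, str]]:
        # Every process observes the same sequence of messages.
        return list(self.log)


def decide(ab: ToyAtomicBroadcast) -> str:
    """Consensus rule: decide the value carried by the first delivered message."""
    return ab.delivered()[0][1]


ab = ToyAtomicBroadcast()
proposals = {"p1": "blue", "p2": "red", "p3": "blue"}
for pid, value in proposals.items():
    ab.broadcast(pid, value)

decisions: Dict[str, str] = {pid: decide(ab) for pid in proposals}
print(decisions)
```

Because all processes see the same first message, they agree, and the decided value is one that some process proposed. The hard part, which this sketch assumes away, is implementing the broadcast itself.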

Real systems 真实系统

The most important exercise to repeat is to read descriptions of new, real systems, and to critique their design decisions. Do this over and over again. Some suggestions:
最重要的、应该不断重复的实践是:读新的、真实的系统的描述,并评价他们设计的决定。 下面是建议的系统:

Google:

Not Google:

Postscript 结尾

If you tame all the concepts and techniques on this list, I’d like to talk to you about engineering positions working with the menagerie of distributed systems we curate at Cloudera.
如果你驯服了这个列表中的所有概念和技术,我很乐意和你聊聊Cloudera的分布式系统工程师职位。
