袖珍分布式系统（四）

本文是Distributed systems for fun and profit的第四部分，本文是阅读该文后的一些记录。

Replication

Replication 问题是分布式系统中最主要的问题，本文讨论 Replication 问题不会泛泛而谈，而是会聚焦于：

leader election,
failure detection,
mutual exclusion,
consensus and global snapshots

我们首先得知道Replication 问题本质上是一个group communication 问题，而实现Replication的方法有很多，我们不会讲一些具体的算法，而是会讲一些共性的东西。既然Replication是一个通信问题，那我们就来看下同步和异步两个通信模型：

我们可以将复制步骤分为4步：

(Request) The client sends a request to a server
(Sync) The synchronous portion of the replication takes place
(Response) A response is returned to the client
(Async) The asynchronous portion of the replication takes place

基于上面的4个步骤，又可以分为同步和异步复制

Synchronous replication

同步复制（synchronous replication）又称为：active, or eager, or push, or pessimistic replication，其原理如下图

同步复制中client发送请求，接着客户端阻塞，第一个server发送请求给S2和S3，等待都响应后再回复客户端。

从上述过程的描述中，我们可以知道：

系统的性能由最慢的server决定
系统对于网络延迟非常敏感，意味请求返回需要等待每一个server响应请求
一旦其中一个server fail，系统将只能提供读服务

Asynchronous replication

异步复制在master (/leader / coordinator)收到请求后，只是简单的做一些local处理，然后就返回了，不阻塞客户端，接着异步将请求复制给其他server。

从性能角度看，因为复制是异步进行的，延迟非常小，但是系统的一致性是弱一致的，最重要的是系统无法保证数据的持久性，因为写入master的数据，可能在master复制给其他slaver的前，master就故障了，此时数据的就丢失了。

An overview of major replication approaches

讲完同步和异步复制两个思路后，我们来看一些稍微具体的算法，有许多方法来对复制算法分类，除了上面的同步和异步外，还能这么分：

Replication methods that prevent divergence (single copy systems) and
Replication methods that risk divergence (multi-master systems)

第一类系统遵循着“"behave like a single system”的原则，当partial failures发生的时候，算法能保证系统中只有一份数据是有效的，实现single-copy consistency的算法主要有：

1n messages (asynchronous primary/backup)
2n messages (synchronous primary/backup)
4n messages (2-phase commit, Multi-Paxos)
6n messages (3-phase commit, Paxos with repeated leader election)

这些算法的不同之处在于他们考虑的types of faults不同，上面简单的通过算法中交换消息的数量进行了划分，这么划分的原因是因为作者尝试回答：what are we buying with the added message exchanges?

先看一张图：

上图来自：Google App Engine的co-founder Ryan Barrett在2009年的google i/o上的演讲《Transaction Across DataCenter》（视频： http://www.youtube.com/watch?v=srOgpXECblk）

consistency, latency, throughput, data loss and failover characteristics 归根结底来自于两个不同的复制方法：同步还是异步。下面我们具体看下每一种复制方法。

Primary/backup replication

All updated are performed on the primary, and a log of operations (or alternatively, changes) is shipped across the network to the backup replicas.

最基本的一种复制方法，在复制log上有两种方法：

asynchronous primary/backup replication and
synchronous primary/backup replication

同步方法需要涉及到两个消息（"update" + "acknowledge receipt"），而异步则只有一个("update")。

但是在mysql中，即使是同步复制也不能完全保证数据的一致性，考虑场景：

the primary receives a write and sends it to the backup
the backup persists and ACKs the write
and then primary fails before sending ACK to the client

在primary返回客户端之前，primary挂了，此时backup提交了，客户端认为primay没有提交，如果此时马上切换到backup，backup是已经提交了，造成不一致，那有什么方法能解决呢？这就要多引入一个来回的消息，这就是2PC。

Two phase commit (2PC)

2PC相比较于Primary/backup的1PC，其多出来的一步提供了可以回滚操作的能力。

Note：2PC assumes that the data in stable storage at each node is never lost and that no node crashes forever. Data loss is still possible if the data in the stable storage is corrupted in a crash

2PC基于的假设是数据存储是可靠的，节点不会永久故障，因此一旦这些假设不满足，数据还是有可能丢失的。

在前一章CAP理论中，我们讲过2PC是一个CA系统，其考虑的失败模型中没有考虑network partitions，一旦发送网络分区，只能终止服务，人工接入，因此现在的系统一般都会考虑使用a partition tolerant consensus algorithm，能够很好的处理网络分区

Partition tolerant consensus algorithms

分区容忍的算法比较有名的就是Paxos和Raft，在具体看算法之前，我们先回答第一个问题：

What is a network partition?什么是网络分区

A network partition is the failure of a network link to one or several nodes. The nodes themselves continue to stay active, and they may even be able to receive requests from clients on their side of the network partition

网络分区的一个特点是我们很难网络分区和节点故障区分开，一旦网络分区发生，系统中就会有多个部分都是出于active的状态，在Primary/backup中就会出现两个primary。因此，Partition tolerant consensus algorithms必须要解决的一个问题就是：during a network partition, only one partition of the system remains active

解决的方法主要从：

Majority decisions：在N个节点中，只有有N/2+1个还正常就能正常工作
Roles：有两种思路（all nodes may have the same responsibilities, or nodes may have separate, distinct roles.）通过选出一个master，能使系统变得更有效，最简单的好处就是：操作都经过master，就使得所有的操作都强制排序了。
Epochs：Epochs作用类似于逻辑时钟，能够使得不同节点对当前系统状态有个统一的认知。

除了上面给出的方法外，还需要注意的点有：

practical optimizations:
- avoiding repeated leader election via leadership leases (rather than heartbeats)【防止重复leader选举，手段是通过租期而不是心跳】
- avoiding repeated propose messages when in a stable state where the leader identity does not change【防止重复propose消息】
ensuring that followers and proposers do not lose items in stable storage and that results stored in stable storage are not subtly corrupted (e.g. disk corruption)【对于items要持久化存储防止丢失】
enabling cluster membership to change in a safe manner (e.g. base Paxos depends on the fact that majorities always intersect in one node, which does not hold if the membership can change arbitrarily)
procedures for bringing a new replica up to date in a safe and eﬃcient manner after a crash, disk loss or when a new node is provisioned
procedures for snapshotting and garbage collecting the data required to guarantee safety after some reasonable period (e.g. balancing storage requirements and fault tolerance requirements)

总结

本章作者主要介绍了保证strong consistency的各种算法，以比较同步和异步复制开始，然后逐渐讨论随着考虑的错误变多算法需要怎么调整，下面是算法的一些关键点总结：

Primary/Backup
- Single, static master
- Replicated log, slaves are not involved in executing operations
- No bounds on replication delay
- Not partition tolerant
- Manual/ad-hoc failover, not fault tolerant, "hot backup"
2PC
- Unanimous vote: commit or abort
- Static master
- 2PC cannot survive simultaneous failure of the coordinator and a node during a commit
- Not partition tolerant, tail latency sensitive
Paxos
- Majority vote
- Dynamic master
- Robust to n/2-1 simultaneous failures as part of protocol
- Less sensitive to tail latency