Building Resilient Infrastructure with Nomad: Fault Tolerance and Outage Recovery

SEP 20 2018 CHRISTIE KOEHLER

This is the fourth and final post in the series Building Resilient Infrastructure with Nomad (Part 1, Part 2, Part 3). In this series we explore how Nomad handles unexpected failures, outages, and routine maintenance of cluster infrastructure, often without operator intervention.

In this post we’ll explore how Nomad’s design and use of the Raft consensus algorithm provide resiliency against data loss, and how to recover from outages.

We’ll assume a production deployment with the recommended minimum of 3 or 5 Nomad servers. For information on deploying a Nomad cluster in production, see Bootstrapping a Nomad Cluster.
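As a rough sketch (the file name, data_dir, and values here are assumptions for illustration, not from the post), a minimal production server agent configuration looks something like this:

    # Minimal Nomad server agent config; bootstrap_expect is the number
    # of servers to wait for before attempting to elect a leader.
    cat > server.hcl <<'EOF'
    data_dir = "/opt/nomad/data"

    server {
      enabled          = true
      bootstrap_expect = 3
    }
    EOF

    # Start the agent with this config.
    nomad agent -config=server.hcl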

You’ll recall from Part 2 in this series that a Nomad cluster is composed of nodes running the Nomad agent in either server or client mode. Clients are responsible for running tasks, while servers are responsible for managing the cluster. Client nodes make up the majority of the cluster and are very lightweight: they interface with the server nodes and maintain very little state of their own. Each cluster usually has 3 or 5 server-mode agents and potentially thousands of clients.

Diagram showing Nomad's single-region architecture.

Nomad models infrastructure as regions and datacenters. Regions may contain multiple datacenters. For example, the “us” region might have “us-west” and “us-east” datacenters. Nomad servers manage state and make scheduling decisions for the region to which they are assigned.
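Region and datacenter are ordinary agent configuration attributes. Continuing the hypothetical server.hcl from the sketch above (the values are again assumptions):

    # Place this agent in the "us" region and the "us-west" datacenter;
    # agents in us-east would differ only in the datacenter value.
    cat >> server.hcl <<'EOF'
    region     = "us"
    datacenter = "us-west"
    EOF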

Nomad and Raft

Nomad uses a consensus protocol based on the Raft algorithm to provide data consistency across server nodes within a region. Raft nodes are always in one of three states: follower, candidate, or leader. All nodes initially start out as followers. In this state, nodes can accept log entries from a leader and cast votes. If no entries are received for some time, nodes self-promote to the candidate state. In the candidate state, nodes request votes from their peers. If a candidate receives a majority of votes, it is promoted to leader. The leader must accept new log entries and replicate them to all the other followers.
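You can inspect the Raft state of each server with the nomad operator raft list-peers command. The output below is illustrative only, with hypothetical node names and addresses:

    $ nomad operator raft list-peers
    Node                   ID              Address         State     Voter
    nomad-server-1.global  10.0.1.10:4647  10.0.1.10:4647  leader    true
    nomad-server-2.global  10.0.1.11:4647  10.0.1.11:4647  follower  true
    nomad-server-3.global  10.0.1.12:4647  10.0.1.12:4647  follower  true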

Raft-based consensus remains fault-tolerant only as long as a majority of nodes are available. Without that majority, it is impossible to process log entries or reason about peer membership.

For example, a Raft cluster of 3 nodes can tolerate 1 node failure, because the 2 remaining servers still constitute a majority quorum. A cluster of 5 nodes can tolerate 2 node failures, and so on. The recommended configuration is to run either 3 or 5 Nomad servers per region; this maximizes availability without greatly sacrificing performance. The deployment table below summarizes the potential cluster size options and the fault tolerance of each.

Servers  Quorum Size  Failure Tolerance
1        1            0
2        2            0
3        2            1
4        3            1
5        3            2
6        4            2
7        4            3
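The arithmetic behind the table: a quorum is a strict majority of servers, floor(N/2) + 1, and the failure tolerance is whatever remains. A quick shell sketch:

    # Quorum size and failure tolerance for an n-server cluster.
    n=5
    quorum=$(( n / 2 + 1 ))       # 3
    tolerance=$(( n - quorum ))   # 2
    echo "servers=$n quorum=$quorum tolerance=$tolerance"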

Nomad’s fault tolerance means that there are two types of outages you might experience:

  • failure of a minority of the servers within a region, resulting in no loss of quorum
  • failure of a majority of the servers within a region, resulting in a loss of quorum

Each scenario has a different recovery procedure.

Recovering from the failure of a minority of servers (no loss of quorum)

Nomad’s use of Raft means that the failure of a minority of servers is transparent to clients. Cluster operation continues normally, with jobs running on client nodes and servers continuing to schedule jobs and monitor the cluster.

However, to protect against data loss in the event of further server node failures, the cluster should be restored to full health by repairing or replacing the failed servers and rejoining them to the cluster. This requires operator intervention but the process is fairly straightforward.

For each failed server, first try bringing it back online and having it rejoin the cluster with the same IP address as before. If you need to build a new server to replace the failed one, have the new server join the cluster with the same IP address as the failed server. Either option returns the cluster to a fully healthy state once all the failed servers have rejoined.

If the rebuilt server cannot have the same IP address as the one it is replacing, the failed server needs to be removed from the cluster. This dead server cleanup happens automatically as part of autopilot. Operators can also remove servers manually with the nomad server force-leave command. If that command isn’t able to remove the server, use the nomad operator raft remove-peer command to remove the stale peer server on the fly with no downtime.
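A hedged sketch of that manual cleanup, using hypothetical node names and addresses:

    # See which peers the Raft configuration still contains.
    nomad operator raft list-peers

    # First, try telling the cluster that the failed server has left.
    nomad server force-leave nomad-server-3.global

    # If the stale peer persists, remove it from the Raft configuration
    # directly (no downtime); use the failed server's RPC address.
    nomad operator raft remove-peer -peer-address=10.0.1.12:4647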

Recovering from the failure of a majority of servers (loss of quorum)

If a majority of the servers within a region fail, quorum is lost, resulting in a complete outage at the control plane level and perhaps some loss of data. Fortunately, partial recovery is possible using data on the remaining servers; this requires operator intervention. It’s also important to note that a Nomad server outage doesn’t necessarily mean workloads have stopped: if the Nomad clients and their hosts are operational, workloads will continue to run while the Nomad servers are without quorum.

Briefly, the recovery procedure involves stopping all remaining servers, creating a raft/peers.json file with entries for all remaining servers, and then restarting them. The cluster should be able to elect a leader once the remaining servers are all restarted with an identical raft/peers.json configuration. Any new servers you introduce later, e.g. to restore majority quorum, can start fresh with totally clean data directories and be joined using Nomad’s server join command.

In extreme cases, it should be possible to recover with just a single remaining server by starting that single server with itself as the only peer in the raft/peers.json recovery file.

However, using raft/peers.json for recovery can cause uncommitted Raft log entries to be implicitly committed, so this should only be used after an outage where no other option is available to recover a lost server. Make sure you don't have any automated processes that will put the peers file in place on a periodic basis.
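A minimal sketch of the peers.json step described above, assuming two surviving servers, a data_dir of /opt/nomad/data, and the default RPC port 4647 (all of these values are assumptions):

    # Stop the Nomad agent on every remaining server.
    sudo systemctl stop nomad

    # On each server, write an identical peers.json listing every
    # surviving server. Note: with Raft protocol v3, entries are JSON
    # objects with id/address/non_voter fields rather than plain
    # "ip:port" strings; see the Outage Recovery guide.
    cat > /opt/nomad/data/server/raft/peers.json <<'EOF'
    ["10.0.1.10:4647", "10.0.1.11:4647"]
    EOF

    # Restart the agents; they ingest peers.json, elect a leader, and
    # delete the file once recovery completes.
    sudo systemctl start nomad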

See Outage Recovery for complete details on recovering from either type of outage.

Outages and regions

Diagram showing Nomad's multi-region architecture.

In some cases, for either availability or scalability, you may choose to run multiple regions. Nomad supports federating multiple regions together into a single cluster.

A region typically maps to a large geographic region, for example us or europe, with potentially multiple zones, which map to datacenters such as us-west and us-east. Nodes belonging to the same datacenter should be colocated (they should share a local LAN connection).

Regions are the scheduling boundary. They are fully independent from each other, and do not share jobs, clients, or state. They are loosely coupled using a gossip protocol over WAN, which allows users to submit jobs to any region or query the state of any region transparently. Requests are forwarded to the server in the appropriate region to be processed and the results returned. Data is not replicated between regions. This allows regions to be “bulkheads” against failures in other regions.
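A hedged sketch of federating two regions and querying across them (the addresses and region names are hypothetical):

    # From any server in one region, join a server in the other region
    # over the WAN gossip port (4648 by default) to federate them.
    nomad server join 10.0.1.10

    # Requests can then target any region from anywhere; they are
    # forwarded to that region's servers transparently.
    nomad job status -region=europe
    nomad server members    # lists servers across all federated regions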

Clients are configured to communicate with their regional servers, using remote procedure calls (RPC) to register themselves, send heartbeats for liveness, wait for new allocations, and update the status of allocations.
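As a rough sketch (the file name and addresses are assumptions), a client agent pointed at its regional servers might be configured like this:

    # Minimal Nomad client agent config; retry_join keeps retrying the
    # regional servers until registration succeeds.
    cat > client.hcl <<'EOF'
    region     = "us"
    datacenter = "us-west"
    data_dir   = "/opt/nomad/data"

    client {
      enabled = true
      server_join {
        retry_join = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
      }
    }
    EOF

    nomad agent -config=client.hcl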

Because of Nomad’s two-tier approach, losing several Nomad servers within one datacenter may not cause a loss of quorum, because servers in other datacenters within the same region participate in the Raft consensus and are counted towards quorum. If you have lost only a minority of the Nomad servers within a single region, your cluster will not experience an outage and you won’t have to do any data recovery. Restore your cluster to full health using the procedure described above and detailed in our Outage Guide.

If a majority of the server nodes within a region fail, there will be a loss of quorum and a complete outage at the control plane level, and possibly some loss of data about jobs running in that region. Other regions in the cluster won’t be affected by the outage. Jobs running in other regions will continue to run even if they were submitted via Nomad servers in the affected region. Within the affected region, as long as the Nomad clients and their hosts are operational, workloads will continue to run. The recovery procedure for this scenario is explained briefly above and in our Outage Guide. Once you restore the failed servers and return the cluster to health, Nomad will become aware of all jobs that are still running and reconcile state.

Summary

In this fourth and final post in our series on Building Resilient Infrastructure with Nomad (Part 1, Part 2, Part 3) we covered how Nomad’s infrastructure model and use of the Raft consensus algorithm provide resiliency against data loss, and how to recover from outages. Nomad is designed to handle transient and permanent failures of both clients and servers gracefully, making it easy to run highly reliable infrastructure.
