Principles of Site Reliability Engineering at Google

Over the last several years, the concept of “DevOps” has swept through the engineering ecosystem, but a newer concept is gaining momentum: “Site Reliability Engineering.” The concept was created by Ben Treynor at Google, and in 2014 a conference called SREcon was created to bring together the growing community of like-minded engineers. Google has also released a free book on the subject. The purpose of this blog post is to describe the nine major principles of Site Reliability Engineering at Google.

The first principle is to hire coders. In practice, Google hires Systems Administrators as well as Developers for the Site Reliability Engineer (SRE) position. Nevertheless, the primary duty of an SRE is to write code. In fact, one of the founding questions of site reliability engineering is “what happens when you hire a developer to do operations?” Hopefully, the developer will attempt to automate him/herself out of a job.

As a compute cluster scales to accommodate more users, and as software scales by adding more features, one would expect human resources to scale linearly as well, to manage the additional systems and to troubleshoot the increased surface area of the new features. The alternative to hiring more and more engineers to keep pace with that growth is an intense focus on automation. If a small group of engineers can devote most of their time to automating manual tasks and to auto-remediating issues, then the compute cluster can grow while the engineering group remains small.

So, the first principle of site reliability engineering is to hire great coders and let them leave if they want to leave. The part about letting them leave without a penalty is also important. If the manual work continues to be overwhelming and not enough attention is being paid to automation, then let the engineer transfer back into a more traditional development role of adding features to a product.

The second principle of site reliability engineering is to hire your SREs and your developers from the same staffing pool and to treat them all as developers. An SRE is a developer; rather than adding features, though, the SRE is working on improving the reliability of the system. At Google, it is common for a developer to do a rotational assignment as an SRE in Mission Control. If he/she likes the work, he/she can stay; if not, he/she can go back to doing traditional development.

It is also important that there be no hard line of separation between SREs and developers. Rather, the developers who are adding features continue to share at least 5 percent of the operational on-call workload, and they handle the spillover from the SRE team.

So, the third principle of site reliability engineering is that about 5 percent of the ops work goes to the dev team, plus all overflow. The development team always remains in the operational loop. In fact, if a development team adds features that result in instability (the software product causes a number of incidents in a short period of time), then the SREs can kick the product back to the development team and declare that it is not ready for SRE support. In other words, the developers who created the product have to assume full-time support of it until it is ready for production support.

The fourth principle of site reliability engineering is to cap the SRE operational load at 50 percent (usually 30 percent). In other words, SREs should spend at least half of their time working on automation and improving reliability. One way that Google enforces this is by limiting the number of issues that an SRE can work on in any given shift. Typically, an issue that results in an interruption (or an alert) takes about six hours to process. The fix itself may take only minutes, but the full resolution process takes approximately six hours: it includes a postmortem document, a postmortem review meeting and a set of action items, which are placed into a ticketing system. So an SRE can handle a maximum of two operational issues during a 12-hour shift. If there are more issues, they spill over to the development teams.
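To make the cap concrete, here is a minimal sketch in Python; the six-hour figure comes from the paragraph above, while the routing function and its names are hypothetical, not a description of Google's actual tooling.

```python
# A minimal sketch of the cap described above: two issues per 12-hour
# shift, with everything past the cap spilling to the development team.

HOURS_PER_SHIFT = 12
HOURS_PER_ISSUE = 6  # postmortem doc, review meeting, action items
MAX_ISSUES_PER_SHIFT = HOURS_PER_SHIFT // HOURS_PER_ISSUE  # = 2

def route_issue(issues_handled_this_shift: int) -> str:
    """Route an incoming issue; past the cap, it spills to the dev team."""
    if issues_handled_this_shift >= MAX_ISSUES_PER_SHIFT:
        return "dev-team"  # overflow keeps developers in the ops loop
    return "on-call-sre"

print(route_issue(0))  # on-call-sre
print(route_issue(2))  # dev-team
```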

The fifth principle is that an on-call team has a minimum of 8 engineers in one location (or 6 engineers in each of two locations), handling a maximum of 2 events per shift. The reason for a minimum of 8 engineers is so that each engineer is on-call only two weeks out of every month, working a 12-hour shift. Having enough engineers on the team results in a reasonable workload and minimizes burnout.

The sixth principle of site reliability engineering is that postmortems are blameless and focus on process and technology. The central idea is that when things go wrong, the problem is the system, the process, the environment and the technology stack. Of course, there may have been some human error involved, and it is very likely that the quick remediation of the problem was a result of the outstanding talent on the SRE team. Nevertheless, the focus is on how to make things better, so the focus is on the strategy, the structure and the systems. Could our monitoring, alerting and tools be better? How can we fix the problem so that it does not happen again?

Ideally, an SRE team should not face the same problems repeatedly. The result of a postmortem is a list of action items for changing and improving the system, and there should be ample time in the schedule to work on those action items. One SRE adage is: do it manually once; the second time, automate it. Again, the primary job of an SRE is to work on automation so as to improve the system. So, as the SRE tries to work him/herself out of a job, the cluster can grow and more features can be introduced without having to grow the size of the team.

The seventh principle is to have a written Service Level Objective (SLO) for each service and to measure performance against it. A Service Level Agreement (SLA) is a contract between a service provider and a customer, and SLOs are the agreed-upon means of measuring the performance of the service provider. SLOs are composed of Service Level Indicators (SLIs). An SLI is merely something that you measure: a graph on your dashboard. But when you attach a threshold to an SLI and generate an alert, that threshold should be tied to your SLO. Typically, we measure the availability of a service, and the SLO is a threshold for how much unavailability will be tolerated. Is your objective to have your service available 99.9 percent of the time? If so, this means that you can tolerate 10 minutes and 5 seconds of unavailability per week (or 43 minutes and 50 seconds per month).
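The arithmetic is easy to reproduce. Below is a minimal sketch; the one assumption is an average month of 365.25 / 12 (about 30.44 days), which is what yields the 43-minutes-and-50-seconds figure. The last line previews the 99.99 percent figure used in the next paragraph.

```python
def downtime_budget(availability: float, period_days: float) -> str:
    """Tolerable downtime over a period, given an availability target."""
    seconds = (1.0 - availability) * period_days * 24 * 3600
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} minutes and {secs} seconds"

AVG_MONTH_DAYS = 365.25 / 12  # assumption: average month length

print(downtime_budget(0.999, 7))                # 10 minutes and 5 seconds
print(downtime_budget(0.999, AVG_MONTH_DAYS))   # 43 minutes and 50 seconds
print(downtime_budget(0.9999, AVG_MONTH_DAYS))  # 4 minutes and 23 seconds
```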

Different services will have different SLOs, and the SLO should guide your behavior. For example, if your customer can only tolerate 4 minutes and 23 seconds of unavailability per month (99.99 percent availability), then when you roll out a change, you will roll it out to only 10 percent of the systems in the cluster, leave it running for a few hours, and then roll it out to an additional 10 percent, and so on. In other words, you will be very conservative in your deployments. But if a service is not mission critical and you have an SLO of only 99 percent availability, then you can afford to be less controlled and less conservative in your deployments. It is important to note that “availability” can be multifaceted, but SLOs should be measurable, easily understandable and meaningful. The goal of an SLO is to guide behavior and to put guardrails on action.
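To illustrate how an SLO might pace a rollout, here is a hypothetical sketch; the step sizes and soak times are invented for the example and are not Google's actual release process.

```python
# Hypothetical rollout pacing driven by the SLO: the tighter the SLO,
# the smaller the steps and the longer each one soaks before expanding.

def rollout_plan(availability_slo: float) -> list[tuple[int, int]]:
    """Return (cumulative_percent_of_cluster, soak_hours) deployment steps."""
    if availability_slo >= 0.9999:  # mission critical: very conservative
        return [(10, 4), (20, 4), (50, 8), (100, 0)]
    if availability_slo >= 0.999:
        return [(10, 2), (50, 4), (100, 0)]
    return [(50, 1), (100, 0)]  # 99 percent: less controlled

for percent, soak in rollout_plan(0.9999):
    print(f"deploy to {percent}% of the cluster, then soak for {soak}h")
```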

The eighth principle is to use SLO budgets as your launch criteria. The best way to ensure the stability of a system is to introduce no change at all. Of course, we want to constantly add features to software, and usage growth demands that we continually upgrade the cluster. But your SLOs should guide you with respect to how much change to introduce and on what schedule. The idea of a “budget” is similar to the idea of a bank account: one cannot make withdrawals from an account with a zero balance. Likewise, if you are exceeding your SLO, you must stop introducing change. I believe that Google uses a monthly SLO. So, if a service has an availability target of 99.9 percent, then that service has a budget of 43 minutes and 50 seconds of unavailability per month. Feel free to launch new features as long as you have the budget for it. However, as you approach your budget in a given month, you must curtail adding new features and introducing change until your budget is replenished. By having an SLO budget and allowing it to dictate your behavior, you are ensuring quality and maintaining a high level of customer satisfaction.
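Here is a minimal sketch of such a launch gate, assuming the monthly budget described above; the function names are hypothetical.

```python
# A toy error-budget launch gate: risky change is allowed only while
# unavailability budget remains this month.

AVG_MONTH_SECONDS = (365.25 / 12) * 24 * 3600

def error_budget_seconds(availability_slo: float) -> float:
    """Monthly unavailability budget implied by the SLO."""
    return (1.0 - availability_slo) * AVG_MONTH_SECONDS

def can_launch(availability_slo: float, downtime_so_far_s: float) -> bool:
    """Permit a launch only if the error budget is not exhausted."""
    return downtime_so_far_s < error_budget_seconds(availability_slo)

print(can_launch(0.999, downtime_so_far_s=1200))  # True: ~20 min of ~43.8 used
print(can_launch(0.999, downtime_so_far_s=2700))  # False: budget exhausted
```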

The ninth and final principle of site reliability engineering is “practice, practice, practice.” If you do your job correctly, then you should have a quiet system. In fact, if your system is redundant and resilient, your troubleshooting skills can get rusty and operational readiness can diminish. Netflix introduced a “Chaos Monkey” into their system not only to test for redundancy and resiliency, but also to improve operational readiness. At Google, one of the most popular SRE events is called the “Wheel of Misfortune.” The game starts with a pie chart showing the frequency distribution of the outages seen in the last month or two, and the engineers role-play an outage from the chart. One engineer is selected as the on-call engineer, while another describes an outage scenario. As the two engineers do a dry run of the outage, the other engineers take notes, and there is a mini postmortem at the end. The overall goal is to cut the time it takes to resolve issues, and practice can dramatically reduce times to resolution.
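As a toy illustration of the pie chart, one could weight scenario selection by observed outage frequency; the categories and counts below are invented for the example.

```python
import random

# Invented outage categories with hypothetical counts from the last
# month or two; the "pie chart" is just this frequency distribution.
outage_frequencies = {
    "bad config push": 12,
    "quota exhaustion": 7,
    "dependency outage": 5,
    "network partition": 3,
}

# Spin the wheel: pick a scenario in proportion to how often it occurs.
scenario = random.choices(
    population=list(outage_frequencies),
    weights=list(outage_frequencies.values()),
)[0]
print(f"On-call engineer, you are now debugging: {scenario}")
```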

To review, here are the nine principles of site reliability engineering:

  1. To hire great coders and let them leave if they want to leave.
  2. To hire your SREs and your developers from the same staffing pool and treat them all as developers.
  3. About 5 percent of the ops work goes to the dev team, plus all overflow.
  4. To cap the SRE operational load at 50 percent (usually 30 percent).
  5. An on-call team has a minimum of 8 engineers for one location (or 6 engineers in each of two locations).
  6. Postmortems are blameless and focus on process and technology.
  7. To have a written Service Level Objective (SLO) for each service and to measure performance against it.
  8. To use SLO budgets as your launch criteria.
  9. Practice, practice, practice, and make it fun.

These nine principles of site reliability engineering are not my own; I got them from Ben Treynor’s keynote address at SREcon 2014. These principles were developed at Google and have been tested over time. I hope to make use of these principles in my own work and to let them inform my future deliberations on the role of DevOps.

Source: https://medium.com/@jdavidmitchell/principles-of-site-reliability-engineering-at-google-8382b054e498
