《Site.Reliability.Engineering.2016.3》 SRE: Google Operations Demystified (SRE:Google运维解密)

Introduction

what is SRE?

  • The essence of SRE:

    • availability
    • latency
    • performance
    • efficiency
    • change management
    • monitoring
    • emergency response
    • capacity planning
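  • The difference between 100% and 99.9% availability:

Reaching 100% takes far more effort, yet users can hardly tell 99.9% from 100%, because many intermediaries sit between the service and the user (Wi-Fi, network conditions, and so on). Even if the service itself reaches 100%, those intermediaries may mean the user still experiences only 99.9%.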
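A rough way to put numbers on this, assuming the service and the path to the user fail independently so that their availabilities multiply. The function name and the 99.9% network figure are illustrative assumptions, not values from the book:

```python
def perceived_availability(*components: float) -> float:
    """Availability the user sees when every component in the chain must work.

    Assumes independent failures, so the individual availabilities multiply.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Hypothetical: a perfect service behind a network path that is 99.9% available.
print(f"{perceived_availability(1.0, 0.999):.4%}")
# Hypothetical: a 99.9% service behind the same path.
print(f"{perceived_availability(0.999, 0.999):.4%}")
```

Under these assumptions the extra effort spent going from 99.9% to 100% on the service side changes what the user sees by only about a tenth of a percentage point.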

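Core Tenets

Ensuring a durable focus on engineering

  • Goal: cap operational work at 50% of an SRE's time; when it goes over, SREs build automation to bring the share back under 50%
  • Methods:
    • Shift operational work to the development team
    • Assign bugs and tickets to the development team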

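Pursuing maximum change velocity without violating the SLO

SLO: service level objective

Defining availability

  • What level of availability will users be happy with?
  • What alternatives do users have when they are dissatisfied?
  • How do users' usage patterns change at different availability levels?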

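Monitoring

Valid kinds of monitoring output:

  • alerts: a human must respond and act immediately
  • tickets: comparable to a warning; no immediate action is needed and handling can be deferred
  • logging: nobody needs to look at it; it is recorded so it can be consulted later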
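The three outputs can be read as a small routing decision. The sketch below is one possible encoding; the `Output` enum and the `needs_human` / `urgent` flags are hypothetical names introduced here for illustration:

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act right now
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded only, for later inspection


def route(needs_human: bool, urgent: bool) -> Output:
    """Pick the monitoring output for an event (hypothetical classification flags)."""
    if needs_human and urgent:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    return Output.LOG


print(route(needs_human=True, urgent=True))
print(route(needs_human=True, urgent=False))
print(route(needs_human=False, urgent=False))
```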

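Emergency Response

Metrics:

  • MTTF: mean time to failure, the average time a system runs before it fails

  • MTTR: mean time to repair (or restoration), the average time needed to recover

Method: prepare failure playbooks in advance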
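The two metrics combine into the commonly used steady-state estimate availability = MTTF / (MTTF + MTTR). That formula is not spelled out in these notes, and the hour figures below are made up, so treat this as a sketch:

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given mean time to failure and to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service that fails every 1000 hours on average.
# Shortening recovery (for example with a rehearsed playbook) raises availability.
for mttr in (4.0, 1.0):
    print(f"MTTR {mttr} h -> {steady_state_availability(1000.0, mttr):.4%}")
```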

Change Management

  • Implementing progressive rollouts
  • Quickly and accurately detecting problems
  • Rolling back changes safely when problems arise
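The three practices above can be sketched as a single loop. Everything here is an assumption made for illustration: the stage fractions, the callback names, and the fake health check that fails on the third stage:

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Push a change stage by stage; roll back at the first failed health check."""
    for fraction in stages:
        deploy(fraction)      # progressive rollout
        if not healthy():     # detect problems quickly
            rollback()        # roll back safely
            return False
    return True


log: List[str] = []
ok = progressive_rollout(
    stages=[0.01, 0.10, 0.50, 1.00],
    deploy=lambda f: log.append(f"deploy {f:.0%}"),
    healthy=lambda: len(log) < 3,  # pretend the third stage looks unhealthy
    rollback=lambda: log.append("rollback"),
)
print(ok, log)
```

In a real system the health check would be driven by monitoring rather than a counter; the point is only that each stage is gated on a check before the next one starts.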

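Demand Forecasting and Capacity Planning

Capacity planning has to account for:

  • An accurate forecast of organic demand growth
  • A forecast that also incorporates inorganic demand sources
  • Regular load testing that correlates raw capacity with service capacity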
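A minimal sketch of how the three inputs fit together. The compound-growth model, the parameter names, and all numbers are assumptions chosen for illustration; a real forecast would be fitted to historical data:

```python
import math


def forecast_peak_qps(
    current_qps: float,
    monthly_growth: float,
    months: int,
    inorganic_qps: float = 0.0,
) -> float:
    """Organic demand compounded monthly, plus a one-off inorganic bump (e.g. a launch)."""
    return current_qps * (1 + monthly_growth) ** months + inorganic_qps


def servers_needed(peak_qps: float, qps_per_server: float, spare: int = 1) -> int:
    """Translate demand into machines, using the per-server capacity found by load testing."""
    return math.ceil(peak_qps / qps_per_server) + spare


# Hypothetical inputs: 10k QPS today, 5% monthly growth, a launch adding 2k QPS,
# and load tests showing 500 QPS per server.
peak = forecast_peak_qps(10_000, 0.05, months=6, inorganic_qps=2_000)
print(round(peak), servers_needed(peak, qps_per_server=500))
```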

Provisioning

Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive.

Efficiency and Performance

SRE needs to pay attention to efficiency and performance, which ties in closely with provisioning.

Google Environment

terminology

  • Machine: A piece of hardware (or perhaps a VM)
  • Server: A piece of software that implements a service
  • Racks: Tens of machines are placed in a rack.
  • Row: Racks stand in a row
  • Cluster: One or more rows form a cluster
  • Datacenter: A datacenter building houses multiple clusters
  • Campus: Multiple datacenter buildings that are located close together form a campus

Embracing Risk

Calculating availability

  • Time-based: availability = uptime / (uptime + downtime)
  • Request-based (for distributed services): availability = successful requests / total requests
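Both formulas are easy to check with a short sketch. The function names and the sample figures are illustrative assumptions:

```python
def time_based_availability(uptime: float, downtime: float) -> float:
    """availability = uptime / (uptime + downtime), in any consistent time unit."""
    return uptime / (uptime + downtime)


def request_based_availability(successful: int, total: int) -> float:
    """availability = successful requests / total requests."""
    return successful / total


# Hypothetical 30-day window (43,200 minutes) with 43.2 minutes of downtime.
print(f"{time_based_availability(43_200 - 43.2, 43.2):.3%}")
# Hypothetical day with one million requests, 1,000 of which failed.
print(f"{request_based_availability(999_000, 1_000_000):.3%}")
```

The request-based form is often the more useful one for a distributed service, which is rarely fully up or fully down.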

Risk Tolerance of Consumer Services

  • Target level of availability
  • Types of failures
  • Cost
  • Other service metrics

Risk Tolerance of Infrastructure Services

  • Target level of availability
  • Types of failures
  • Cost

Forming Your Error Budget

  • Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter
  • The actual uptime is measured by a neutral third party: our monitoring system.
  • The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
  • As long as the uptime measured is above the SLO (in other words, as long as there is error budget remaining), new releases can be pushed.
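The bookkeeping above can be sketched as follows. The 90-day quarter, the function names, and the example figures are assumptions made for illustration:

```python
def allowed_downtime_minutes(slo: float, days: int = 90) -> float:
    """Downtime a time-based SLO permits over the window (quarter assumed to be 90 days)."""
    return (1.0 - slo) * days * 24 * 60


def budget_remaining_minutes(slo: float, measured_uptime: float, days: int = 90) -> float:
    """Error budget left: what the SLO allows minus what has already been spent."""
    spent = (1.0 - measured_uptime) * days * 24 * 60
    return allowed_downtime_minutes(slo, days) - spent


def can_release(slo: float, measured_uptime: float) -> bool:
    """New releases may go out only while measured uptime is above the SLO."""
    return measured_uptime > slo


# Hypothetical quarter: 99.9% SLO, 99.95% measured so far.
print(allowed_downtime_minutes(0.999))
print(budget_remaining_minutes(0.999, 0.9995))
print(can_release(0.999, 0.9995), can_release(0.999, 0.998))
```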

Service Level Objective

service level indicator in practice

  • Collecting Indicators(users care about):
    • User-facing serving systems: availability, latency, and throughput
    • Storage systems: latency, availability, and durability
    • Big data systems: data processing pipelines, throughput, end-to-end latency
    • All systems: correctness
    • Others: error rate
  • Aggregation
    • Using percentiles for indicators
  • Standardize Indicators
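To see why percentiles are preferred over a plain average when aggregating an indicator, here is a small sketch using the nearest-rank definition of a percentile. The helper and the latency samples are assumptions for illustration:

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: most requests are fast, a few are very slow.
latencies_ms = [10] * 95 + [1000] * 5

print("mean:", sum(latencies_ms) / len(latencies_ms))
print("p50: ", percentile(latencies_ms, 50))
print("p99: ", percentile(latencies_ms, 99))
```

The mean sits between the two groups and describes neither, while the 50th and 99th percentiles show the typical case and the slow tail separately.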

service level objective in practice

  • example:
    • lower bound ≤ SLI ≤ upper bound.
    • SLI ≤ target
  • Defining Objectives:
    • For maximum clarity, SLOs should specify how they’re measured and the conditions under which they’re valid.
    • eg:
      • 99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers).
      • 90% of Get RPC calls will complete in less than 1 ms
      • 99% of Get RPC calls will complete in less than 10 ms
  • Choosing Targets:
    • Don’t pick a target based on current performance
    • Keep it simple
    • Avoid absolutes
    • Have as few SLOs as possible
    • Perfection can wait: It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.
  • Control Measures:
    • Monitor and measure the system’s SLIs
    • Compare the SLIs to the SLOs, and decide whether or not action is needed
    • If action is needed, figure out what needs to happen in order to meet the target
    • Take that action
  • SLOs Set Expectations:
    • Keep a safety margin
    • Don’t overachieve
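The "measure, compare, decide" loop under Control Measures can be sketched against the two Get RPC targets quoted above (90% under 1 ms, 99% under 10 ms). The function names and the sample data are hypothetical:

```python
from typing import Dict, Sequence, Tuple


def meets_latency_slo(
    latencies_ms: Sequence[float], threshold_ms: float, target_fraction: float
) -> bool:
    """True if at least target_fraction of calls completed in under threshold_ms."""
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms) >= target_fraction


def slo_report(
    latencies_ms: Sequence[float], slos: Dict[str, Tuple[float, float]]
) -> Dict[str, bool]:
    """Compare the measured SLI against each SLO; a False entry means action is needed."""
    return {
        name: meets_latency_slo(latencies_ms, threshold, target)
        for name, (threshold, target) in slos.items()
    }


# Hypothetical measurements: 92 calls at 0.5 ms, 7 at 5 ms, 1 at 50 ms.
samples = [0.5] * 92 + [5.0] * 7 + [50.0]
slos = {"90% < 1 ms": (1.0, 0.90), "99% < 10 ms": (10.0, 0.99)}
print(slo_report(samples, slos))
```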

service level agreements in practice

  • an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
  • SRE’s role is to help them understand the likelihood and difficulty of meeting the SLOs contained in the SLA.
  • It is wise to be conservative in what you advertise to users, as the broader the constituency, the harder it is to change or delete SLAs that prove to be unwise or difficult to work with.

Eliminating Toil

Toil defined

  • manual
  • repetitive
  • automatable
  • tactical
  • no enduring value
  • O(n) with service growth

calculating toil

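  • An individual's on-call time divided by the length of one full rotation. With four engineers who each take one week of on-call, the share of operational time is 1/4 = 25%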
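The same arithmetic as a sketch, using exact fractions. The function name and the default one-week shift are assumptions:

```python
from fractions import Fraction


def oncall_share(team_size: int, shift_weeks: int = 1) -> Fraction:
    """One person's on-call time divided by the length of a full rotation."""
    rotation_weeks = team_size * shift_weeks
    return Fraction(shift_weeks, rotation_weeks)


# Four engineers, one week each: 1/4 of each person's time.
print(oncall_share(4))
# A two-person rotation already reaches the 50% ceiling on operational work.
print(oncall_share(2) <= Fraction(1, 2))
```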

What Qualifies as Engineering

  • Software engineering
  • Systems engineering: configuring and tuning the production environment; one-off work that removes repeated effort, such as initial setup and parameter tuning.
  • Toil: Work directly tied to running a service that is repetitive, manual, etc.
  • Overhead: Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings, bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.

Why too much toil is bad

  • Career stagnation
  • Low morale
  • Creates confusion
  • Slows progress
  • Sets bad precedent
  • Promotes attrition
  • Causes breach of faith

monitoring

Definitions

  • Monitoring
  • White-box monitoring
  • Black-box monitoring
  • Dashboard
  • Alert
  • Root cause
  • Node and machine
  • Push

Four Golden Signals

  • Latency
  • Traffic
  • Errors
  • Saturation

Worrying About Your Tail

  • Use a histogram instead of a mean (average) metric
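A minimal sketch of bucketing latencies into a histogram. The bucket boundaries, labels, and sample values are assumptions chosen for illustration:

```python
import bisect
from collections import Counter
from typing import Iterable

# Illustrative, roughly exponential bucket boundaries in milliseconds.
BOUNDS_MS = [10, 30, 100, 300]
LABELS = ["0-10", "10-30", "30-100", "100-300", "300+"]


def latency_histogram(latencies_ms: Iterable[float]) -> Counter:
    """Count requests per latency bucket; each bucket covers [lower, upper)."""
    counts: Counter = Counter()
    for value in latencies_ms:
        counts[LABELS[bisect.bisect_right(BOUNDS_MS, value)]] += 1
    return counts


# Hypothetical samples: five fast requests and one very slow one.
samples = [3, 7, 12, 25, 40, 2500]
print("mean:", sum(samples) / len(samples))
print(latency_histogram(samples))
```

The single slow request drags the mean above 400 ms even though five of the six requests finished in 40 ms or less; the histogram keeps that outlier visible in its own bucket.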

Choosing an Appropriate Resolution for Measurements

  • Collect
  • Choose the granularity and sample
  • Aggregate
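One way to read these three steps: sample a value every second, round it into coarse buckets, and keep only per-minute bucket counts. The 5% bucket width, the 60-second window, and the CPU figures below are assumptions for illustration:

```python
from collections import Counter
from typing import List, Sequence


def minute_buckets(
    per_second_cpu: Sequence[float], bucket_width: int = 5, window: int = 60
) -> List[Counter]:
    """Aggregate per-second CPU percentages into one bucketed Counter per window."""
    result = []
    for start in range(0, len(per_second_cpu), window):
        chunk = per_second_cpu[start:start + window]
        # Each sample is rounded down to the lower edge of its bucket.
        result.append(Counter(int(v // bucket_width) * bucket_width for v in chunk))
    return result


# Hypothetical two minutes of per-second CPU utilization (percent).
samples = [12] * 30 + [47] * 30 + [91] * 60
for minute, counts in enumerate(minute_buckets(samples)):
    print(minute, dict(counts))
```

Storing 120 raw samples shrinks to a handful of counters, while short bursts within each minute remain visible.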

Principles

  • Alerts on different latency thresholds, at different percentiles, on all kinds of different metrics
  • Extra code to detect and expose possible causes
  • Associated dashboards for each of these possible causes
  • The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
  • Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal.
  • Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
  • Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
  • Every page should be actionable.
  • Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
  • Pages should be about a novel problem or an event that hasn’t been seen before.

Temporary measures

  • Adjust some thresholds
  • Use a temporary workaround as a bridge
Effective Troubleshooting

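The best approach: know how the system is designed and built (not necessarily in great detail), then track down the fault by following the model below.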

model

  1. Problem Report
  2. Triage
  3. Examine
  4. Diagnose
  5. Test/Treat -loop-> 2/3
  6. Cure

Problem Report

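It should contain the following information:

  • expected behavior
  • actual behavior
  • optional: how to reproduce this behavior.

Supporting tools:

  • An alerting platform that shows the information associated with each alert, ideally enough to locate the cause and fix it from that information alone.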

Triage

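  • Severity assessment: stay calm while grading the incident
  • Stopping the bleeding takes priority over finding the root cause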

Examine

  • Monitoring system: watch specific metrics
  • logging:
    • Levels
    • Sampling
    • A log query platform that supports some query language

Diagnose

  • Simplify and reduce
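    • Black-box testing
      • Positive tests
      • Negative tests
    • Divide and conquer
      • Split in two: for example by partition or by region
      • Split by layer
  • Ask "what", "where" and "why": work backward recursively to the cause
  • Event records:
    • Configuration changes
    • Code releases
    • System configuration changes
    • Node changes
    • Others
  • Specialized systems: diagnostic systems built for particular services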
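The divide-and-conquer step can be sketched as a bisection over the stages of a request path. This assumes that once one stage produces bad output, every later stage does too, and that the final output is known to be bad; the stage names and the `output_ok` probe are hypothetical:

```python
from typing import Callable, Sequence


def first_bad_stage(stages: Sequence[str], output_ok: Callable[[int], bool]) -> int:
    """Return the index of the first stage whose output is bad.

    Assumes the last stage's output is bad and badness propagates downstream,
    so the stages look like [ok, ok, ..., bad, bad] and can be bisected.
    """
    lo, hi = 0, len(stages) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if output_ok(mid):
            lo = mid + 1  # fault is somewhere after mid
        else:
            hi = mid      # fault is at mid or before it
    return lo


stages = ["load balancer", "frontend", "backend", "database"]
# Pretend probing shows correct output up to and including the frontend.
culprit = first_bad_stage(stages, output_ok=lambda i: i < 2)
print(stages[culprit])
```

With many stages this needs only about log2(n) probes instead of checking every component in turn.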

Test And Treat

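  • List several possible causes
  • Design the tests
    • Start with the tests that are easiest to run
    • Tests should have mutually exclusive outcomes, so that each result rules some causes in and others out
    • Test results can be misleading.
    • Earlier and later tests can affect each other, for example by raising the load
    • Some tests are hard to carry out; avoid them where possible.
  • Summary:
    • Be clear about what you are testing, which tests to run, and what each result was
    • If the tests are many and complex, document them promptly so the steps do not have to be repeated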

Negative Results Are Magic

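  • Negative results should not be ignored
  • Negative results are critically important
  • The tools and methods used in the tests will be useful in future work
  • Publishing negative results helps the whole industry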

Cure

  • Confirm the cause
  • Write the incident report
  • Fix the problem

Make Troubleshooting Easier

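Principles:

  • Make the service observable: expose useful metrics and logs, and plan for this when the service is designed
  • Design component interfaces that are clean and easy to understand
  • Have a good end-to-end tracing system, which makes it easy to follow requests upstream and downstream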