Google Testing Blog:Optimal Logging

谷歌工程师总结的关于日志记录的最佳实践经验

Google Testing Blog:Optimal Logging(日志记录最佳实践)
原文链接:http://club.oneapm.com/t/optimal-logging/561/2
译自Optimal Logging,by Anthony Vallone,Google Testing Blog

要找到一个系统问题的根本原因,你需要多长时间?5分钟?还是5天?如果你的答案接近5分钟,很大可能是因为你的生产环境和测试环境使用了非常好的日志记录。更常见的情况是,诸如日志、异常处理、甚至测试这类非核心的工作,被当作一种出现问题后的补救方式。同异常处理和测试一样,日志记录真的也需要策略,无论是生产环境还是测试环境。永远不要低估日志的作用。有了使用得当的日志,你甚至可以说debug不是必需的。下面是多年来对我非常有用的日志记录指导原则。

保持适度

切勿记录过多。大量的磁盘空间被日志占用说明你没有想过应该记录什么。如果记录了太多,你就还需要设计出复杂的方法来减少磁盘访问、保留历史记录、归档大量数据、以及在这些数据中查询。最关键的是,你将发现在这么多垃圾中找到有用信息是多么的困难。

唯一一个比记录过多日志还差的事是,记录的过少。日志通常有两个主要目的:定位问题和事件确认。如果你的日志不能明确一个bug的原因,或者某个事务是否执行,你就记录的过少了。

适合记录的:

  • 重要启动配置
  • 错误
  • 警告
  • 持久性数据的更改
  • 主要系统组件间的请求和响应
  • 重要的状态变化
  • 用户交互
  • 有已知失败风险的调用
  • 较长时间的等待
  • 长期运行任务的周期性进度
  • 重要的逻辑分支和导向分支的条件
  • 从高层方法来汇总处理流程的步骤和事件——避免在底层方法中记录复杂流程的每一步骤

不适合记录的:

  • 方法入口——不要记录方法入口,除非它非常重要或者是日志处于调试级别。
  • 循环中的数据——避免在循环的多次迭代中记录日志。如果是小循环,或者是间歇性的记录,倒也无妨。
  • 大消息或者文件的内容——截断或者使用一些有利于调试的方式进行汇总。
  • 良性错误——那些不是真正错误的错误会令日志的读者感到困惑。当异常处理是执行成功的一部分时,有时会遇到这种情况。
  • 重复的错误——不要重复的记录相同或者相似的错误。这样可能会快速的填满日志,并且隐藏掉真正的问题。各种类型错误发生的频率最好有监视器(monitor)处理。日志只需捕获问题的详细信息。

多个日志级别

不要把所有信息都记录在同一个日志级别中。绝大多数的日志库都提供多个级别,系统启动时可以进行指定。这样可以很方便的控制日志详尽程度。

典型的级别有:

  • 调试——最详细,但是只有在开发或者调试时适用。
  • 信息——最常用的级别。
  • 警告——奇怪的或者不在预期之内的一些状态,但是可接受。
  • 错误——有错误发生,但是流程不受影响。
  • 严重——流程无法继续,系统将关闭或者重启。

从实际使用来讲,只需要两种级别的日志配置:

  • 生产环境——除了调试级别,其他全开。如果生产环境发生了问题,日志应该能够指明原因。
  • 开发和调试——编写新代码或是尝试复现问题时,打开全部级别。

测试日志同样重要

日志的质量对于测试代码和产品代码同样重要。当一次测试运行失败时,日志应当明确的指出这个错误是来自测试本身还是生产系统。如果做不到这一点,那么测试的日志是有问题的。

测试日志应该必需包括:

  • 测试执行环境
  • 初始状态
  • 准备步骤
  • 用例步骤
  • 与系统的交互
  • 期望的结果
  • 实际的结果
  • 清理步骤

利用临时日志队列实现条件性的详细信息控制

发生错误时,日志应当包含大量的详细信息。但不幸的是,当遇到一个错误时,导致这个错误发生的详细信息可能已经无法获得了。如果你听从了“不要记录过多”的建议,在错误日志之前的那些日志可能无法提供足够的细节。解决这个问题的一个好的方式是,在内存中创建临时的日志队列。在事务的处理过程中,将每一步的详尽信息追加到队列中。如果事务成功完成,丢弃这个队列,只记录一个汇总。如果发生了错误,就把错误和队列里的全部内容记录下来。这一方法对于系统交互的日志尤其有效。

问题是机遇

当生产环境出现问题时,你必将集中精力寻找并且修复问题,但是也不要忘记考虑一下日志。如果你费了很大力气才找到问题的原因,这将是个非常好的机会来改善你的日志。修复问题前,先修复你的日志记录,使其可以清楚的指明问题原因。如果这个问题又一次发生,将会很容易辨认。

如果无法复现问题,或者测试结果不确定,改进日志以便可以在问题再次发生时将其记录下来。

在整个开发的生命周期内,都应该持续的利用问题来改进日志。写新代码时,试着少用debug,只使用日志。这些日志是否能够说明发生了什么?如果不能,日志就是不充分的。

最好也记录性能数据

记录时间数据可以用来帮助定位性能问题。例如,要找到一个大型系统的超时原因是很困难的,除非你能够追踪每一个重要处理步骤的耗时情况。这是很容易做到的,只需记录那些可能会比较耗时的调用的开始和结束时间即可:

  • 重要的系统调用
  • 网络请求
  • CPU密集运算
  • 连接设备的交互
  • 事务
  • 在多线程和多进程中追踪痕迹

在涉及到多线程或多进程的处理时,要为事务创建独一的标识。事务初始化时创建ID,将它传入每一个为此事务工作的部分。当记录关于此事务的日志时,每一个部分都应该记录下这个ID。这样,在多个事务并行执行时,追踪一个特定的事务会容易很多。

监控和日志相互完善

一个生产服务应该既有日志也有监控。监控提供了一种实时的对于系统状态的统计汇总。它可以提醒你,是否一定比例的某个类型请求失败了,是否系统正在经受不正常的流量访问,性能是否在下降,或者其他的一些异常。在某些情况下,只是这些信息就可以为找到问题原因提供线索。不过,大多数情况下,监控警报只是为了简单的触发你的调查。监控将问题的症状展现给我们。日志则针对各个事务提供了详细的信息和状态,这样你才能全面的理解问题的原因。


英语原文:

by Anthony Vallone

How long does it take to find the root cause of a failure in your system? Five minutes? Five days? If you answered close to five minutes, it’s very likely that your production system and tests have great logging. All too often, seemingly unessential features like logging, exception handling, and (dare I say it) testing are an implementation afterthought. Like exception handling and testing, you really need to have a strategy for logging in both your systems and your tests. Never underestimate the power of logging. With optimal logging, you can even eliminate the necessity for debuggers. Below are some guidelines that have been useful to me over the years.

Channeling Goldilocks

Never log too much. Massive, disk-quota burning logs are a clear indicator that little thought was put in to logging. If you log too much, you’ll need to devise complex approaches to minimize disk access, maintain log history, archive large quantities of data, and query these large sets of data. More importantly, you’ll make it very difficult to find valuable information in all the chatter.

The only thing worse than logging too much is logging too little. There are normally two main goals of logging: help with bug investigation and event confirmation. If your log can’t explain the cause of a bug or whether a certain transaction took place, you are logging too little.

Good things to log:

  • Important startup configuration
  • Errors
  • Warnings
  • Changes to persistent data
  • Requests and responses between major system components
  • Significant state changes
  • User interactions
  • Calls with a known risk of failure
  • Waits on conditions that could take measurable time to satisfy
  • Periodic progress during long-running tasks
  • Significant branch points of logic and conditions that led to the branch
  • Summaries of processing steps or events from high level functions - Avoid logging every step of a complex process in low-level functions.

Bad things to log:

  • Function entry - Don’t log a function entry unless it is significant or logged at the debug level. Data within a loop - Avoid logging from many iterations of a loop. It is OK to log from iterations of small loops or to log periodically from large loops.
  • Content of large messages or files - Truncate or summarize the data in some way that will be useful to debugging.
  • Benign errors - Errors that are not really errors can confuse the log reader. This sometimes happens when exception handling is part of successful execution flow.
  • Repetitive errors - Do not repetitively log the same or similar error. This can quickly fill a log and hide the actual cause. Frequency of error types is best handled by monitoring. Logs only need to capture detail for some of those errors.

There is More Than One Level

Don't log everything at the same log level. Most logging libraries offer several log levels, and you can enable certain levels at system startup. This provides a convenient control for log verbosity.

The classic levels are:

  • Debug - verbose and only useful while developing and/or debugging.
  • Info - the most popular level.
  • Warning - strange or unexpected states that are acceptable.
  • Error - something went wrong, but the process can recover.
  • Critical - the process cannot recover, and it will shutdown or restart.

Practically speaking, only two log configurations are needed:

  • Production - Every level is enabled except debug. If something goes wrong in production, the logs should reveal the cause.
  • Development & Debug - While developing new code or trying to reproduce a production issue, enable all levels.

Test Logs Are Important To

Log quality is equally important in test and production code. When a test fails, the log should clearly show whether the failure was a problem with the test or production system. If it doesn't, then test logging is broken.

Test logs should always contain:

  • Test execution environment
  • Initial state
  • Setup steps
  • Test case steps
  • Interactions with the system
  • Expected results
  • Actual results
  • Teardown steps

Conditional Verbosity With Temporary Log Queues

When errors occur, the log should contain a lot of detail. Unfortunately, detail that led to an error is often unavailable once the error is encountered. Also, if you’ve followed advice about not logging too much, your log records prior to the error record may not provide adequate detail. A good way to solve this problem is to create temporary, in-memory log queues. Throughout processing of a transaction, append verbose details about each step to the queue. If the transaction completes successfully, discard the queue and log a summary. If an error is encountered, log the content of the entire queue and the error. This technique is especially useful for test logging of system interactions.

Failures and Flakiness Are Opportunities

When production problems occur, you’ll obviously be focused on finding and correcting the problem, but you should also think about the logs. If you have a hard time determining the cause of an error, it's a great opportunity to improve your logging. Before fixing the problem, fix your logging so that the logs clearly show the cause. If this problem ever happens again, it’ll be much easier to identify.

If you cannot reproduce the problem, or you have a flaky test, enhance the logs so that the problem can be tracked down when it happens again.

Using failures to improve logging should be used throughout the development process. While writing new code, try to refrain from using debuggers and only use the logs. Do the logs describe what is going on? If not, the logging is insufficient.

Might As Well Log Performance Data

Logged timing data can help debug performance issues. For example, it can be very difficult to determine the cause of a timeout in a large system, unless you can trace the time spent on every significant processing step. This can be easily accomplished by logging the start and finish times of calls that can take measurable time:

  • Significant system calls
  • Network requests
  • CPU intensive operations
  • Connected device interactions
  • Transactions

Following the Trail Through Many Threads and Processes

You should create unique identifiers for transactions that involve processing across many threads and/or processes. The initiator of the transaction should create the ID, and it should be passed to every component that performs work for the transaction. This ID should be logged by each component when logging information about the transaction. This makes it much easier to trace a specific transaction when many transactions are being processed concurrently.

Monitoring and Logging Complement Each Other

A production service should have both logging and monitoring. Monitoring provides a real-time statistical summary of the system state. It can alert you if a percentage of certain request types are failing, it is experiencing unusual traffic patterns, performance is degrading, or other anomalies occur. In some cases, this information alone will clue you to the cause of a problem. However, in most cases, a monitoring alert is simply a trigger for you to start an investigation. Monitoring shows the symptoms of problems. Logs provide details and state on individual transactions, so you can fully understand the cause of problems.

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,080评论 6 493
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,422评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 157,630评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,554评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,662评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,856评论 1 290
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,014评论 3 408
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,752评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,212评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,541评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,687评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,347评论 4 331
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,973评论 3 315
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,777评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,006评论 1 266
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,406评论 2 360
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,576评论 2 349

推荐阅读更多精彩内容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi阅读 7,312评论 0 10
  • 那女孩一听,突然又跑了回来,一下子抱住男孩的头深深来了一记狂吻,吻完竟然狠狠地甩了一巴掌在男孩脸上,然后大声骂道...
    翔言阅读 101评论 0 0
  • 星耀小学亲子共成长,寒假陪读。六年四班崔金浩读书时间长40分,爸爸陪读。
    没事一个阅读 166评论 0 0
  • 您说 不一样的心境 就会读出不同的诗意 本意只是找书 但是着迷般拿起就放不下了 您是否已找到那个三月的伊人 您是否...
    朵朵随笔阅读 243评论 0 3
  • 大一开学一个月,大学生活的雏形开始在眼前展开了。 或许不像高中,所有人都知道大学的轻松自由,然而我感觉到的却是...
    321知识学徒阅读 157评论 0 0