perf-2 Flame graphs

keywords

perf

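flame graph

0. Introduction

We often need to analyze a program's performance to find out where its time is going. This note records how to do that on Linux with the profiling tool perf.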

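1. Performance analysis

1.1. Flame graphs

Before profiling, it helps to understand flame graphs. They are not covered in detail here; see the article 如何读懂火焰图？ (How to read a flame graph?).

The following sections show how to produce one.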

1.2. Collecting data with perf

The command used to collect samples:

sudo perf record -F 99 -p "$pid" --call-graph dwarf -- sleep 10

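Where:

-F sets the sampling frequency in Hz; here it is 99 samples per second (99 rather than 100 avoids sampling in lockstep with periodic timers)
-p selects the PID of the process to profile. To profile the whole system instead, replace -p with -a, which is rarely what you want when investigating one program. With neither -p nor -a, perf profiles only the command given at the end, here sleep
--call-graph dwarf selects how call stacks are collected, here by unwinding with DWARF debug information. For details see the perf-record(1) Linux manual page: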

--call-graph
Setup and enable call-graph (stack chain/backtrace)
recording, implies -g. Default is "fp" (for user space).
The unwinding method used for kernel space is dependent on the
unwinder used by the active kernel configuration, i.e
CONFIG_UNWINDER_FRAME_POINTER (fp) or CONFIG_UNWINDER_ORC (orc)
Any option specified here controls the method used for user space.
Valid options are "fp" (frame pointer), "dwarf" (DWARF's CFI -
Call Frame Information) or "lbr" (Hardware Last Branch Record
facility).
In some systems, where binaries are build with gcc
--fomit-frame-pointer, using the "fp" method will produce bogus
call graphs, using "dwarf", if available (perf tools linked to
the libunwind or libdw library) should be used instead.
Using the "lbr" method doesn't require any compiler options. It
will produce call graphs from the hardware LBR registers. The
main limitation is that it is only available on new Intel
platforms, such as Haswell. It can only get user call chain. It
doesn't work with branch stack sampling at the same time.
When "dwarf" recording is used, perf also records (user) stack dump
when sampled. Default size of the stack dump is 8192 (bytes).
User can change the size by passing the size after comma like
"--call-graph dwarf,4096".
When "fp" recording is used, perf tries to save stack enties
up to the number specified in sysctl.kernel.perf_event_max_stack
by default. User can change the number by passing it after comma
like "--call-graph fp,32".

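sleep 10 is only a placeholder command: perf keeps recording the target process until sleep exits, so it sets the recording duration to 10 seconds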
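As a rough sanity check on these parameters, the sketch below estimates an upper bound on the number of samples and on the amount of stack data that --call-graph dwarf copies. The function names are hypothetical and not part of perf. The estimate assumes one sample per tick for each thread that stays on the CPU and a full 8192-byte dump per sample; real perf.data sizes differ because of headers, idle time, and the approximate nature of frequency-based sampling.

```shell
# Hypothetical helpers for back-of-the-envelope estimates; not part of perf.

# max_samples FREQ_HZ SECONDS BUSY_THREADS
# Approximate upper bound: each on-CPU thread is sampled FREQ_HZ times per second.
max_samples() {
    echo $(( $1 * $2 * $3 ))
}

# max_stack_dump_bytes SAMPLES DUMP_SIZE
# Upper bound on user-stack bytes copied with --call-graph dwarf,DUMP_SIZE.
max_stack_dump_bytes() {
    echo $(( $1 * $2 ))
}
```

For the recording command in this section (99 Hz, 10 s) and a single busy thread, max_samples 99 10 1 prints 990, and max_stack_dump_bytes 990 8192 prints 8110080, i.e. roughly 7.7 MiB of stack dumps at most. If the file grows too large, a smaller dump size such as --call-graph dwarf,4096 is one knob to try, at the cost of possibly truncated deep stacks.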

Once the data has been collected, you can inspect it with perf report even without drawing a flame graph:

sudo perf report

This produces output similar to the following:

Samples: 920  of event 'cycles:ppp', Event count (approx.): 484442404
  Children      Self  Command        Shared Object                             Symbol
+   61.49%     0.00%  daq_py.SHM     libstdc++.so.6.0.25                       [.] 0xffffff806e1c8e13
+   36.23%     0.00%  daq_py.SHM     libcyber.so.4.4.1.5                       [.] 0xffffff806dcc1e3b
+   25.25%     0.00%  daq_py.SHM     libpthread-2.27.so                        [.] start_thread
+   21.82%     0.00%  daq_py.SHM     [kernel.kallsyms]                         [k] el0_svc_naked
+   21.68%    21.68%  daq_py.SHM     [kernel.kallsyms]                         [k] _raw_spin_unlock_irqrestore
+   17.85%     0.00%  daq_py.SHM     libpy_cyber3.so                           [.] 0xffffff806ceb4ca8
+   17.85%     0.00%  daq_py.SHM     libcyber.so.4.4.1.5                       [.] 0xffffff806dcb694b
+   17.85%     0.00%  daq_py.SHM     libcyber.so.4.4.1.5                       [.] 0xffffff806dcc59bf
+   17.85%     0.00%  daq_py.SHM     libcyber.so.4.4.1.5                       [.] 0xffffff806dcc020b
+   17.85%    17.85%  daq_py.SHM     libpy_cyber3.so                           [.] 0x0000000000092ca8

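For the meaning of Children and Self, see the article Linux kernel profiling with perf.

In short: Self is the share of samples taken while executing the function itself, and Children is that share plus the samples taken in everything the function calls, i.e. the cost of the whole call subtree. When troubleshooting, look at Self first: an entry with a high Self value is where the CPU time is actually spent.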
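The difference between the two columns can be illustrated on a tiny data set. The sketch below is not how perf report is implemented; it assumes its input is in the folded format "frame1;frame2;...;leaf COUNT" and that function names contain no spaces. The name self_children is hypothetical.

```shell
# self_children: read folded stacks ("frame1;frame2;...;leaf COUNT") on stdin
# and print "function self children" for every function, sorted by name.
self_children() {
    awk '
    {
        count = $NF                        # last field is the sample count
        stack = $0
        sub(/ [0-9]+$/, "", stack)         # strip the trailing count
        n = split(stack, frames, ";")
        self[frames[n]] += count           # leaf frame: where the CPU actually was
        split("", seen)                    # portable way to clear the array
        for (i = 1; i <= n; i++) {
            if (!(frames[i] in seen)) {    # count a recursive function once per stack
                seen[frames[i]] = 1
                children[frames[i]] += count
            }
        }
    }
    END {
        for (f in children)
            printf "%s %d %d\n", f, self[f], children[f]
    }' | LC_ALL=C sort
}
```

Feeding it the three stacks "main;parse 30", "main;compute;hash 50" and "main;compute 20" gives main a Self of 0 and a Children of 100, while compute gets a Self of 20 and a Children of 70. main therefore tops the Children column without doing any work itself, which is why the Self column is the better place to look for the actual hot spot.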

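1.3. Preparing the data

Convert the binary perf.data into text with perf script. A plain > redirect is used so that warnings printed on stderr do not end up in the data file:

sudo perf script -i perf.data > perf.unfold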
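To give a feel for what perf.unfold contains and how it later becomes one line per unique stack, here is a minimal sketch of stack folding. It is not the FlameGraph implementation. The name fold_stacks is hypothetical, and the code assumes the default perf script layout: a header line starting with the command name, then one indented "address symbol+offset (dso)" line per frame with the leaf first, and a blank line after each sample. It also assumes that symbols and command names contain no spaces.

```shell
# fold_stacks: read `perf script` text on stdin, print folded stacks on stdout.
fold_stacks() {
    awk '
    /^[^ \t]/ {                          # header line: "comm pid time: period event:"
        comm = $1
        stack = ""
        in_sample = 1
        next
    }
    /^[ \t]+[0-9a-f]+ / {                # frame line: "  addr symbol+0xoff (dso)"
        sym = $2
        sub(/\+0x[0-9a-f]+$/, "", sym)   # drop the offset inside the function
        # perf prints the leaf first, so prepend to get root-to-leaf order
        stack = (stack == "") ? sym : (sym ";" stack)
        next
    }
    /^[ \t]*$/ {                         # blank line ends the current sample
        if (in_sample && stack != "") counts[comm ";" stack]++
        in_sample = 0
    }
    END {
        # the last sample may not be followed by a blank line
        if (in_sample && stack != "") counts[comm ";" stack]++
        for (s in counts) print s, counts[s]
    }' | LC_ALL=C sort
}
```

Each output line is a semicolon-separated stack from root to leaf followed by the number of samples that had exactly that stack. For real data, prefer the stackcollapse-perf.pl script from FlameGraph, which handles many more corner cases.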

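1.4. Generating the flame graph

This step relies on the open-source FlameGraph scripts (https://github.com/brendangregg/FlameGraph). The commands are: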

git clone https://github.com/brendangregg/FlameGraph.git
./FlameGraph/stackcollapse-perf.pl perf.unfold > perf.folded
./FlameGraph/flamegraph.pl perf.folded > perf.svg

Open perf.svg in a browser to view it.
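When the full graph is too crowded, one option is to filter the folded file before rendering, since each line of perf.folded is a complete stack followed by its sample count. The sketch below uses two hypothetical helper names, focus_folded and folded_total, and treats the pattern as a fixed string rather than a regular expression.

```shell
# focus_folded PATTERN FILE
# Keep only the folded stacks that contain PATTERN (matched as a fixed string).
focus_folded() {
    grep -F -e "$1" -- "$2"
}

# folded_total: sum the trailing sample counts of folded stacks read from stdin.
folded_total() {
    awk '{ total += $NF } END { print total + 0 }'
}
```

For example, with compute standing in for a function name of interest, focus_folded compute perf.folded | ./FlameGraph/flamegraph.pl > compute.svg renders only the stacks that pass through that function. Comparing focus_folded compute perf.folded | folded_total with folded_total < perf.folded shows what share of all samples those stacks account for.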