crash: a debugging tool for Linux kernel core dumps

Introduction

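When a user-space process exits because of a segmentation fault or another fatal error, a core file is usually written under /var/crash/ (provided core dumping has been configured). For any core file, what we care about most is: which line failed? What was the error? Which function or variable is at fault? Once those three questions are answered, the core file has served its purpose. For a user-space core, "gdb -c /var/crash/xxx.core" is normally enough to look inside. If the kernel itself crashes, however, gdb alone cannot handle it, and we need another tool: crash.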

crash is a self-contained tool that can be used to investigate either live systems or kernel core dumps created by dump facilities such as kdump, kvmdump, xendump, netdump, and diskdump.

Environment setup

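My environment is CentOS 7, kernel version kernel-3.10.0-514.26.2.el7.x86_64.
1. Install the kernel debuginfo packages that match the kernel version of the test environment. Once installed, the environment should look like this: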

[root@xt2 ~]# rpm -qa |grep kernel
kernel-headers-3.10.0-514.16.1.el7.x86_64
kernel-debuginfo-common-x86_64-3.10.0-514.26.2.el7.x86_64
kernel-3.10.0-514.26.2.el7.x86_64
kernel-debuginfo-3.10.0-514.26.2.el7.x86_64
kernel-devel-3.10.0-514.26.2.el7.x86_6
2. Install the crash tool directly with yum install crash.

Usage

1. Check vmcore-dmesg.txt for the general type of error
[260651.860820] BUG: unable to handle kernel NULL pointer dereference at           (null)
[260651.861697] IP: [<          (null)>]           (null)
[260651.862250] PGD 18f073067 PUD 18ca38067 PMD 0
[260651.862723] Oops: 0010 [#1] SMP
[260651.863060] Modules linked in: fuse(OE) ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio loop dm_mod intel_powerclamp iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel snd_intel8x0 snd_ac97_codec lrw gf128mul glue_helper ac97_bus ablk_helper snd_seq ppdev cryptd i2c_piix4 snd_seq_device snd_pcm i2c_core snd_timer video snd soundcore pcspkr parport_pc parport sg ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif cdrom crct10dif_generic ata_generic pata_acpi ahci libahci ata_piix crct10dif_pclmul crct10dif_common crc32c_intel serio_raw libata e1000 fjes [last unloaded: fuse]
[260651.868329] CPU: 1 PID: 10361 Comm: write_file_step Tainted: G           OE  ------------   3.10.0-514.26.2.el7.x86_64 #1
[260651.869389] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[260651.870121] task: ffff880087abde20 ti: ffff88018bea4000 task.ti: ffff88018bea4000
[260651.872062] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[260651.875082] RSP: 0018:ffff88018bea7bf8  EFLAGS: 00010246
[260651.876682] RAX: ffffffffa03de780 RBX: 0000000000000000 RCX: 000000000000000a
[260651.880485] RDX: 0000000000000000 RSI: ffff880097fd4150 RDI: ffff88018f1e0900
[260651.883539] RBP: ffff88018bea7cb8 R08: 0000000000000000 R09: ffff88018bea7c58
[260651.886301] R10: 0000000000000000 R11: 000000000000000a R12: 000000000000000a
[260651.889198] R13: 0000000000001000 R14: ffff880097fd4150 R15: ffff88018bea7e20
[260651.891572] FS:  00007fef1a655740(0000) GS:ffff880198440000(0000) knlGS:0000000000000000
[260651.894542] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[260651.896141] CR2: 0000000000000000 CR3: 0000000188a83000 CR4: 00000000000406e0
[260651.899396] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[260651.902082] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[260651.904294] Stack:
[260651.905478]  ffffffff81181bbe ffff88018bea7c60 ffff88018bea7e68 0000000000000000
[260651.908038]  0000000000001000 ffff880087abde20 ffff88018bea7fd8 0000000097fd4000
[260651.910554]  0000000000000000 0000000000000000 ffffffffa03de780 ffff88018f1e0900
[260651.913064] Call Trace:
[260651.914226]  [<ffffffff81181bbe>] ? generic_file_buffered_write+0x11e/0x2a0
[260651.915669]  [<ffffffff811831a2>] __generic_file_aio_write+0x1e2/0x400
[260651.917126]  [<ffffffff81183419>] generic_file_aio_write+0x59/0xa0
[260651.918235]  [<ffffffffa03dad85>] fuse_file_aio_write+0x175/0x390 [fuse]
[260651.919780]  [<ffffffff81180c2b>] ? unlock_page+0x2b/0x30
[260651.920787]  [<ffffffff811acd74>] ? do_read_fault.isra.42+0xe4/0x130
[260651.921859]  [<ffffffff811fe18d>] do_sync_write+0x8d/0xd0
[260651.922860]  [<ffffffff811fe9fd>] vfs_write+0xbd/0x1e0
[260651.923913]  [<ffffffff811ff6d2>] SyS_pwrite64+0x92/0xc0
[260651.925008]  [<ffffffff81697809>] system_call_fastpath+0x16/0x1b
[260651.926307] Code:  Bad RIP value.
[260651.927594] RIP  [<          (null)>]           (null)
[260651.928938]  RSP <ffff88018bea7bf8>
[260651.930136] CR2: 0000000000000000

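The log above tells us that the kernel dereferenced a NULL pointer: both the faulting address (CR2) and the instruction pointer (RIP) are 0.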
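When the log is long, the few fields that matter can be pulled out mechanically. The following is a minimal Python sketch, not part of crash itself; the function name `summarize_oops` and the tuple layout are choices made for this illustration, and the regular expressions assume the 3.10-era x86_64 oops format shown above.

```python
import re

# Lines copied from the vmcore-dmesg.txt excerpt above (shortened).
OOPS_TEXT = """\
[260651.860820] BUG: unable to handle kernel NULL pointer dereference at           (null)
[260651.872062] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[260651.914226]  [<ffffffff81181bbe>] ? generic_file_buffered_write+0x11e/0x2a0
[260651.918235]  [<ffffffffa03dad85>] fuse_file_aio_write+0x175/0x390 [fuse]
[260651.930136] CR2: 0000000000000000
"""

TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")
RIP_LINE = re.compile(r"RIP: [0-9a-f]{4}:\[<([0-9a-f]+)>\]")
CR2_LINE = re.compile(r"CR2: ([0-9a-f]+)")
TRACE_LINE = re.compile(r"\s*\[<([0-9a-f]+)>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(text):
    """Collect the BUG message, RIP, CR2 and call-trace entries from an oops log."""
    summary = {"bug": None, "rip": None, "cr2": None, "trace": []}
    for line in text.splitlines():
        body = TIMESTAMP.sub("", line)  # drop the "[seconds.micros] " prefix
        if body.startswith("BUG:"):
            summary["bug"] = body[len("BUG:"):].strip()
            continue
        rip = RIP_LINE.match(body)
        if rip:
            summary["rip"] = int(rip.group(1), 16)
            continue
        cr2 = CR2_LINE.match(body)
        if cr2:
            summary["cr2"] = int(cr2.group(1), 16)
            continue
        frame = TRACE_LINE.match(body)
        if frame:
            # Entry layout (hypothetical): (symbol, address, reliable).
            # A leading "?" marks an entry the unwinder is not sure about.
            reliable = frame.group(2) is None
            summary["trace"].append((frame.group(3), int(frame.group(1), 16), reliable))
    return summary


if __name__ == "__main__":
    result = summarize_oops(OOPS_TEXT)
    print("RIP is NULL:", result["rip"] == 0)
    print("CR2 is NULL:", result["cr2"] == 0)
    for symbol, address, reliable in result["trace"]:
        print(hex(address), symbol, "" if reliable else "(unverified)")
```

A RIP of 0 together with a CR2 of 0 is the signature of jumping to a NULL address, as opposed to reading or writing through a NULL data pointer, where RIP would still point at valid kernel code.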

2. Open the dump with crash
Note: to keep each invocation short, I defined an alias for crash: `alias crash='crash /usr/lib/debug/lib/modules/3.10.0-514.26.2.el7.x86_64/vmlinux'`
[root@xt1 127.0.0.1-2019.09.30-11:15:31]# crash vmcore
crash 7.2.3-10.el7
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-514.26.2.el7.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 6
        DATE: Mon Sep 30 11:15:23 2019
      UPTIME: 3 days, 00:24:11
LOAD AVERAGE: 0.33, 0.93, 0.96
       TASKS: 763
    NODENAME: xt1
     RELEASE: 3.10.0-514.26.2.el7.x86_64
     VERSION: #1 SMP Tue Jul 4 15:04:05 UTC 2017
     MACHINE: x86_64  (3407 Mhz)
      MEMORY: 5.9 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at           (null)"
         PID: 10361
     COMMAND: "write_file_step"
        TASK: ffff880087abde20  [THREAD_INFO: ffff88018bea4000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

This confirms once more that a NULL pointer was accessed, but to find out who accessed it we have to keep digging.

3. Inspect the stack of the faulting task with bt
crash> bt
PID: 10361  TASK: ffff880087abde20  CPU: 1   COMMAND: "write_file_step"
 #0 [ffff88018bea7880] machine_kexec at ffffffff81059beb
 #1 [ffff88018bea78e0] __crash_kexec at ffffffff81105822
 #2 [ffff88018bea79b0] crash_kexec at ffffffff81105910
 #3 [ffff88018bea79c8] oops_end at ffffffff81690008
 #4 [ffff88018bea79f0] no_context at ffffffff8167fc96
 #5 [ffff88018bea7a40] __bad_area_nosemaphore at ffffffff8167fd2c
 #6 [ffff88018bea7a88] bad_area at ffffffff81680050
 #7 [ffff88018bea7ab0] __do_page_fault at ffffffff81692f4f
 #8 [ffff88018bea7b10] do_page_fault at ffffffff81692ff5
 #9 [ffff88018bea7b40] page_fault at ffffffff8168f208
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffff88018bea7bf8  RFLAGS: 00010246
    RAX: ffffffffa03de780  RBX: 0000000000000000  RCX: 000000000000000a
    RDX: 0000000000000000  RSI: ffff880097fd4150  RDI: ffff88018f1e0900
    RBP: ffff88018bea7cb8   R8: 0000000000000000   R9: ffff88018bea7c58
    R10: 0000000000000000  R11: 000000000000000a  R12: 000000000000000a
    R13: 0000000000001000  R14: ffff880097fd4150  R15: ffff88018bea7e20
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff88018bea7bf8] generic_file_buffered_write at ffffffff81181bbe
#11 [ffff88018bea7cc0] __generic_file_aio_write at ffffffff811831a2
#12 [ffff88018bea7d40] generic_file_aio_write at ffffffff81183419
#13 [ffff88018bea7d80] fuse_file_aio_write at ffffffffa03dad85 [fuse]
#14 [ffff88018bea7e18] do_sync_write at ffffffff811fe18d
#15 [ffff88018bea7ef0] vfs_write at ffffffff811fe9fd
#16 [ffff88018bea7f30] sys_pwrite64 at ffffffff811ff6d2
#17 [ffff88018bea7f80] system_call_fastpath at ffffffff81697809
    RIP: 00007fef1a169003  RSP: 00007ffef15260c8  RFLAGS: 00010246
    RAX: 0000000000000012  RBX: ffffffff81697809  RCX: 0000000000000000
    RDX: 000000000000000a  RSI: 00000000006020c0  RDI: 0000000000000003
    RBP: 00007ffef1526100   R8: 00007fef1a0c9988   R9: 000000000000000e
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 00007ffef1526220  R14: 0000000000400640  R15: 0000000000000000
    ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

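The stack shows that the last kernel function before the fault is generic_file_buffered_write, recorded with the address ffffffff81181bbe (frame #10). The line "exception RIP: unknown or invalid address" says that the instruction address itself is bad, and RIP: 0000000000000000 below it shows that this address is NULL. In other words, the CPU was directed to execute at address 0 and could not fetch any valid instruction there.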

RIP is the instruction pointer. It points to a memory address, indicating the progress of program execution in memory.
In the core file, RIP holds the address of the instruction that crashed the kernel.
4. Disassemble the address reported for the faulting function with dis to locate the source line
crash> dis -l ffffffff81181bbe
/usr/src/debug/kernel-3.10.0-514.26.2.el7/linux-3.10.0-514.26.2.el7.x86_64/mm/filemap.c: 2984
0xffffffff81181bbe <generic_file_buffered_write+286>:   movslq %eax,%r15
generic_perform_write

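Line 2984 is a function call. Reading the source, the function that contains it is generic_perform_write. Three pointers are involved in that call: a_ops->write_begin, file, and mapping, and we need to find out which one is wrong. Because the fault is an instruction fetch, a_ops->write_begin is the most likely culprit; a bad file or mapping would at most mean a bad argument.
Let us look at the arguments of this call next.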
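The address used with dis comes straight from the call trace, where it is printed as symbol+offset/size. A small Python sketch of that arithmetic follows; the helper name `split_trace_entry` is made up for this illustration. Addresses saved on the stack are return addresses, so the instruction shown by dis is the one after the call, which is consistent with `movslq %eax,%r15` consuming the callee's return value.

```python
import re


def split_trace_entry(address_hex, entry):
    """Split 'symbol+0xOFF/0xSIZE' and derive the start address of the function."""
    match = re.fullmatch(r"([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", entry)
    if match is None:
        raise ValueError("unexpected call-trace entry: " + entry)
    offset = int(match.group(2), 16)
    return {
        "symbol": match.group(1),
        "offset": offset,                            # bytes into the function
        "size": int(match.group(3), 16),             # total size of the function
        "start": int(address_hex, 16) - offset,      # address of the first instruction
    }


if __name__ == "__main__":
    info = split_trace_entry(
        "ffffffff81181bbe", "generic_file_buffered_write+0x11e/0x2a0"
    )
    # 0x11e is 286 decimal, matching <generic_file_buffered_write+286> in the dis output.
    print(info["symbol"], info["offset"], hex(info["start"]))
```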

5. Inspect the function's arguments (the live values and the values at the moment of the fault)

First, let us look at how an x86-64 CPU passes function arguments in registers:

According to the ABI, the first 6 integer or pointer arguments to a function are passed in registers. The first is placed in rdi, the second in rsi, the third in rdx, and then rcx, r8 and r9. Only the 7th argument and onwards are passed on the stack.
The stack frame

With the above in mind, let's see how the stack frame for this C function looks:
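/* utilfunc is defined elsewhere in the quoted article; only its prototype is needed here. */
long utilfunc(long a, long b, long c);

/* a..f arrive in rdi, rsi, rdx, rcx, r8, r9; g and h are passed on the stack. */
long myfunc(long a, long b, long c, long d,
            long e, long f, long g, long h)
{
    long xx = a * b * c * d * e * f * g * h;
    long yy = a + b + c + d + e + f + g + h;
    long zz = utilfunc(xx, yy, xx % yy);
    return zz + 20;
}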

This is the stack frame: (figure in reference 1)


The faulting call site:

status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                        &page, &fsdata);

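First argument: file, address ffff88018f1e0900 (from RDI)
Second argument: mapping, address ffff880097fd4150 (from RSI)
The remaining arguments are plain values or pointers to local variables, so there is no need to dump them.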
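The mapping from argument position to register can be written down explicitly. The sketch below is only an illustration in Python: the register values are copied from the exception frame printed by bt, the parameter names follow the call site shown above, and the helper `locate_arguments` is hypothetical. Reading arguments from the exception registers works here because the fault happened on the very first instruction fetch of the callee, before any callee code could overwrite them; in general, argument registers are clobbered as a function runs.

```python
# System V AMD64 ABI: the first six integer/pointer arguments use these registers, in order.
ARG_REGISTERS = ("RDI", "RSI", "RDX", "RCX", "R8", "R9")

# Register values copied from the exception frame in the bt output.
EXCEPTION_REGS = {
    "RDI": 0xFFFF88018F1E0900,
    "RSI": 0xFFFF880097FD4150,
    "RDX": 0x0,
    "RCX": 0xA,
    "R8": 0x0,
    "R9": 0xFFFF88018BEA7C58,
}

# Arguments in the order used at the write_begin call site.
WRITE_BEGIN_ARGS = ("file", "mapping", "pos", "bytes", "flags", "&page", "&fsdata")


def locate_arguments(arg_names, regs):
    """Pair each argument with the register that carries it, or mark it as on the stack."""
    located = []
    for index, name in enumerate(arg_names):
        if index < len(ARG_REGISTERS):
            register = ARG_REGISTERS[index]
            located.append((name, register, regs.get(register)))
        else:
            # The seventh argument onwards is passed on the stack.
            located.append((name, "stack", None))
    return located


if __name__ == "__main__":
    for name, where, value in locate_arguments(WRITE_BEGIN_ARGS, EXCEPTION_REGS):
        shown = "n/a" if value is None else hex(value)
        print(f"{name:10s} {where:6s} {shown}")
```

Under these assumptions, bytes in RCX is 0xa, which agrees with the 10-byte count visible in the user-space RDX of the pwrite64 system call frame.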

Print the contents of file:

crash> struct file ffff88018f1e0900
struct file {
  f_u = {
    fu_list = {
      next = 0xffff88018f1e0900,
      prev = 0xffff88018f1e0900
    },
    fu_rcuhead = {
      next = 0xffff88018f1e0900,
      func = 0xffff88018f1e0900
    }
  },
  f_path = {
    mnt = 0xffff880182a6c520,
    dentry = 0xffff88019148e0c0
  },
  f_inode = 0xffff880097fd4000,
  f_op = 0xffffffffa03de840,
  ... 
  f_mapping = 0xffff880097fd4150,
  ...
}

Printing mapping works the same way: pass the address held in the corresponding register (RSI) to the struct command. The output looks normal.

crash> struct address_space ffff880097fd4150
struct address_space {
  host = 0xffff880097fd4000,
  ...
  writeback_index = 0,
  a_ops = 0xffffffffa03de780,
  flags = 131290,
  backing_dev_info = 0xffff880184703148,
  ...
}

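From the earlier analysis we know that the instruction address is NULL, so the most likely sequence of events is this: the caller set up all the arguments for write_begin, and when it finally made the call through the write_begin function pointer, that pointer turned out to be NULL, so there was no instruction to execute.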

So we print a_ops to check whether its write_begin function pointer is NULL.
From the output above, the address of a_ops is 0xffffffffa03de780.

crash> struct address_space_operations 0xffffffffa03de780
struct address_space_operations {
  writepage = 0xffffffffa03db3d0,
  readpage = 0xffffffffa03d99c0,
  writepages = 0x0,
  set_page_dirty = 0xffffffff8118cb20 <__set_page_dirty_nobuffers>,
  readpages = 0xffffffffa03d78c0,
  write_begin = 0x0,
  write_end = 0x0,
  bmap = 0xffffffffa03d68d0,
  invalidatepage = 0x0,
  releasepage = 0x0,
  freepage = 0x0,
  direct_IO = 0xffffffffa03da260,
  rh_reserved_get_xip_mem = 0x0,
  migratepage = 0x0,
  launder_page = 0xffffffffa03db370,
  is_partially_uptodate = 0x0,
  is_dirty_writeback = 0x0,
  error_remove_page = 0x0,
  swap_activate = 0x0,
  swap_deactivate = 0x0,
  invalidatepage_range = 0x0
}
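Spotting the empty slots by eye is easy for one struct but tedious across many. Here is a minimal Python sketch that lists the members printed as 0x0; it assumes the `name = 0x...` layout of the struct output above, works on a shortened excerpt of it, and `null_members` is a name invented for this illustration.

```python
import re

# Shortened excerpt of the struct address_space_operations output above.
STRUCT_OUTPUT = """\
struct address_space_operations {
  writepage = 0xffffffffa03db3d0,
  readpage = 0xffffffffa03d99c0,
  writepages = 0x0,
  set_page_dirty = 0xffffffff8118cb20 <__set_page_dirty_nobuffers>,
  write_begin = 0x0,
  write_end = 0x0,
  direct_IO = 0xffffffffa03da260,
}
"""

MEMBER = re.compile(r"^\s*(\w+) = (0x[0-9a-f]+)\b")


def null_members(struct_text):
    """Return the names of members whose printed value is 0x0."""
    names = []
    for line in struct_text.splitlines():
        match = MEMBER.match(line)
        if match and int(match.group(2), 16) == 0:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    print(null_members(STRUCT_OUTPUT))
```

A NULL member is not a bug by itself, since many operations in such a table are optional. It only matters when the code path in the backtrace calls that member without checking it, which is the case for write_begin here.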

Summary

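After all that effort, we finally found the culprit: write_begin is NULL. Comparing the code afterwards showed that the newer kernel source assigns this pointer while the older one does not, so calling this interface unconditionally under the old logic results in a NULL pointer access. I am posting the tools used and the analysis here for reference. Suggestions and corrections are welcome in the comments.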

References

1. https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/
2. http://www.voidcn.com/article/p-eyktsjtm-qz.html
