gdb debugging tips: analyzing a coredump from a crashing C++ program

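I have recently been helping out a Linux project team. The code is all C/C++, so when something crashes we have to analyze the coredump file with gdb. This post records a typical case for future reference.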

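The background: one of our programs exposes a set of IPC interfaces for other applications to call. In one particular system environment, an application calling those interfaces made our process crash every time.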

Analyzing the crash stack

After loading the coredump into gdb, it tells us that the core was generated by /usr/bin/Demo, which crashed with SIGSEGV inside shared_ptr<Database>:

Core was generated by `/usr/bin/Demo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  std::__shared_ptr<Database, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0xf5e1b58c, this@entry=0xf5e1b554)
    at /usr/include/c++/11.5.0/bits/shared_ptr_base.h:1152
1152    /usr/include/c++/11.5.0/bits/shared_ptr_base.h: No such file or directory.
[Current thread is 1 (LWP 13645)]

We print the call stack with the bt command:

(gdb) bt
#0  std::__shared_ptr<Database, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0xf5e1b58c, this@entry=0xf5e1b554)
    at /usr/include/c++/11.5.0/bits/shared_ptr_base.h:1152
#1  std::shared_ptr<Database>::shared_ptr (this=0xf5e1b58c, this@entry=0xf5e1b554) at /usr/include/c++/11.5.0/bits/shared_ptr.h:150
#2  Context::GetDatabase (this=0x0)
    at /home/linjw/workspace/demo_code/src/common/context.cpp:141
#3  0x00688206 in AudioServiceImpl::GetVolume (this=<optimized out>)
    at /home/linjw/workspace/demo_code/impl/src/audio/audio_service_impl.cpp:68
...

The crash did indeed happen while AudioServiceImpl::GetVolume was handling a get-volume request:

int AudioServiceImpl::GetVolume() {
    return context_.lock()->GetDatabase()->Get(DB_VOLUME, 50);
}

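One thing in the stack looks odd: Context::GetDatabase (this=0x0). The this pointer of Context is null, so context_.lock() must have returned an empty pointer, which means GetVolume was called before context_ had been initialized.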
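To make the failure mode concrete, here is a minimal sketch of a GetVolume that checks the result of lock() before dereferencing it. The Database and Context classes below are simplified stand-ins, the value of DB_VOLUME is a placeholder, and returning std::optional is my own choice for the illustration rather than what the real project does.

```cpp
#include <memory>
#include <optional>
#include <utility>

// Hypothetical stand-ins for the real classes; only the shape matters here.
struct Database {
    int Get(int /*key*/, int default_value) const { return default_value; }
};

struct Context {
    std::shared_ptr<Database> GetDatabase() const { return database_; }
    std::shared_ptr<Database> database_ = std::make_shared<Database>();
};

constexpr int DB_VOLUME = 1;  // placeholder key, the real value is unknown

class AudioServiceImpl {
public:
    void OnCreate(std::weak_ptr<Context> context) { context_ = std::move(context); }

    // Returns std::nullopt instead of crashing when the context is not available.
    std::optional<int> GetVolume() const {
        // lock() yields an empty shared_ptr if context_ was never assigned
        // or if the Context has already been destroyed.
        std::shared_ptr<Context> context = context_.lock();
        if (!context) {
            return std::nullopt;
        }
        std::shared_ptr<Database> database = context->GetDatabase();
        if (!database) {
            return std::nullopt;
        }
        return database->Get(DB_VOLUME, 50);
    }

private:
    std::weak_ptr<Context> context_;  // stays empty until OnCreate runs
};
```

With this shape, a request that arrives before OnCreate gets an empty result that the caller can turn into an error reply, instead of a SIGSEGV.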

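Looking at the source, the program's initialization flow is as follows:

[Figure: initialization flow diagram (image.png, not preserved)]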

So most likely an initialization step hung and AudioServiceImpl::OnCreate was never called. When the application then sent a udi request, AudioServiceImpl::GetVolume was invoked on a worker thread and crashed.

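So there are two problems here:

  1. An initialization step hangs.
  2. The initialization flow should be changed so that the IPC interfaces are registered only after initialization has completed.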
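The second point can be sketched roughly as follows. IpcDispatcher and AudioService here are hypothetical and are not the project's real API. The only thing the sketch is meant to show is the ordering: finish initialization first, register the handler last.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical IPC dispatcher: maps a method name to a handler.
class IpcDispatcher {
public:
    void Register(const std::string& name, std::function<int()> handler) {
        handlers_[name] = std::move(handler);
    }

    // Returns std::nullopt when nothing is registered under that name yet.
    std::optional<int> Call(const std::string& name) const {
        auto it = handlers_.find(name);
        if (it == handlers_.end()) {
            return std::nullopt;
        }
        return it->second();
    }

private:
    std::map<std::string, std::function<int()>> handlers_;
};

class AudioService {
public:
    void Start(IpcDispatcher& dispatcher) {
        // Step 1: complete every initialization step.
        volume_ = 50;  // stands in for the real initialization work
        // Step 2: only now make the handler reachable from outside.
        dispatcher.Register("GetVolume", [this] { return volume_; });
    }

private:
    int volume_ = 0;
};
```

A call that arrives before Start finds no handler and can be rejected cleanly. A real dispatcher would also need synchronization, because requests are served on another thread.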

Switching threads

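Next, let's see where initialization got stuck. Initialization runs on the main thread, whose thread id is the same as the process id. The info inferior command shows that the pid is 13639: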

(gdb) info inferior
  Num  Description       Connection           Executable
* 1    process 13639     1 (core)             /home/linjw/gdb/sysroots/usr/bin/Demo
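That the main thread's kernel thread id (the LWP that gdb shows) equals the pid can be checked with a small Linux-only sketch. The helper names are made up for this example. It calls syscall(SYS_gettid) directly because older glibc versions do not provide a gettid() wrapper.

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <thread>

// Kernel thread id of the calling thread, i.e. what gdb lists as "LWP".
pid_t CurrentLwp() {
    return static_cast<pid_t>(syscall(SYS_gettid));
}

// Runs CurrentLwp() on a freshly started thread and returns its result.
pid_t LwpOfNewThread() {
    pid_t lwp = 0;
    std::thread worker([&lwp] { lwp = CurrentLwp(); });
    worker.join();
    return lwp;
}
```

Called from main, CurrentLwp() returns the same value as getpid(), while LwpOfNewThread() returns a different id.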

13639 is therefore also the Target Id (LWP) of our main thread. Listing all threads with info threads, we find that its gdb thread id is 19:

(gdb) info threads
  Id   Target Id         Frame
* 1    LWP 13645         std::__shared_ptr<Database, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0xf5e1b58c, this@entry=0xf5e1b554)
    at /usr/include/c++/11.5.0/bits/shared_ptr_base.h:1152
  2    LWP 13642         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  3    LWP 13652         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  4    LWP 13647         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  5    LWP 13662         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  6    LWP 13654         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  7    LWP 13648         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  8    LWP 13655         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  9    LWP 13644         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
  10   LWP 13656         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
  11   LWP 13643         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
  12   LWP 13657         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
  13   LWP 13659         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
  14   LWP 13649         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  15   LWP 13650         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  16   LWP 13651         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  17   LWP 13660         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
  18   LWP 13661         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  19   LWP 13639         __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47

Then we switch the context to the main thread with the thread 19 command:

(gdb) thread 19
[Switching to thread 19 (LWP 13639)]
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
47      ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.

And print the stack of thread 19 with bt:

(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0xf7a98418 in __GI___ioctl (fd=0, request=1075858688) at ../sysdeps/unix/sysv/linux/ioctl.c:35
#2  0x00699160 in SetEnable (enable=<optimized out>)
    at /home/linjw/workspace/demo_code/impl/src/system/mcu/mcu.cpp:55
#3  0x00620908 in BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}::operator()(std::weak_ptr<Context>) const (context=..., __closure=0x2245258)
    at /home/linjw/workspace/demo_code/./src/common/base_service.h:45
#4  std::__invoke_impl<void, BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}&, std::weak_ptr<Context> >(std::__invoke_other, BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}&, std::weak_ptr<Context>&&) (
    __f=...) at /usr/include/c++/11.5.0/bits/invoke.h:61
#5  std::__invoke_r<void, BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}&, std::weak_ptr<Context> >(BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}&, std::weak_ptr<Context>&&) (__fn=...)
    at /usr/include/c++/11.5.0/bits/invoke.h:111
#6  std::_Function_handler<void (std::weak_ptr<Context>), BaseService::BaseService(std::weak_ptr<ILifecycle>)::{lambda(std::weak_ptr<Context>)#1}>::_M_invoke(std::_Any_data const&, std::weak_ptr<Context>&&) (__functor=..., __args#0=...)
    at /usr/include/c++/11.5.0/bits/std_function.h:290
#7  0x0061a908 in std::function<void (std::weak_ptr<Context>)>::operator()(std::weak_ptr<Context>) const (__args#0=..., this=<optimized out>)
    at /usr/include/c++/11.5.0/bits/std_function.h:247
#8  LifecycleObserver::OnCreate (context=..., this=<optimized out>)
    at /home/linjw/workspace/demo_code/./src/common/lifecycle.h:47
#9  Context::Init (this=0x223f95c, dispatcher=...)
    at /home/linjw/workspace/demo_code/src/common/context.cpp:61
#10 0x005b9fc8 in main (argc=<optimized out>, argv=<optimized out>)
    at /home/linjw/workspace/demo_code/src/app/main.cpp:49

We can see that it is stuck in the ioctl call at line 55 of mcu.cpp:

ioctl(ioctl_cmd.fd, DEMO_REQUEST, (unsigned long)&ioctl_cmd.args);

Inspecting variables

There is another odd thing, though: the stack shows that the fd passed to ioctl is 0:

#1  0xf7a98418 in __GI___ioctl (fd=0, request=1075858688) at ../sysdeps/unix/sysv/linux/ioctl.c:35

An fd of 0 would be standard input. In the code, the fd comes from ioctl_cmd.fd:

bool SetEnable(bool enable)
{
    ...
    ioctl_cmd.fd = open(DEV_PATH, O_RDWR);
    ...
    ioctl(ioctl_cmd.fd, DEMO_REQUEST, (unsigned long)&ioctl_cmd.args);
    ...
}
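As an aside, open() returns -1 on failure and otherwise the lowest unused descriptor, so it would only return 0 if standard input had been closed beforehand. Either way the return value is worth checking before it is passed to ioctl. Below is a minimal sketch with a hypothetical helper, OpenDevice, which is not part of the original code:

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <string>

// Opens a device node for reading and writing.
// On failure returns -1 and stores a readable message in *error.
int OpenDevice(const char* path, std::string* error) {
    int fd = open(path, O_RDWR);
    if (fd < 0) {
        *error = std::string("open ") + path + " failed: " + std::strerror(errno);
    }
    return fd;
}
```

The caller can then skip the ioctl and log the message when the returned fd is negative.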

So we use f 2 to switch to the frame of the SetEnable function:

(gdb) f 2
#2  0x00699160 in SetEnable (enable=<optimized out>)
    at /home/linjw/workspace/demo_code/impl/src/system/mcu/mcu.cpp:55
55      /home/linjw/workspace/demo_code/impl/src/system/mcu/mcu.cpp: No such file or directory.

Then p ioctl_cmd.fd prints the value of the fd, which shows that it should actually be 27:

(gdb) p ioctl_cmd.fd
$1 = 27

We can also print the complete contents of ioctl_cmd with p ioctl_cmd:

(gdb) p ioctl_cmd
$2 = {fd = 27, args = {ArgDemo = 0 '\000', Reserve = '\000' <repeats 12 times>}}

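The fd=0 shown in frame #1 is probably due to some anomaly in the captured coredump. Since the problem reproduces every time once it appears, I added a log line and reproduced it, which confirmed that the fd is indeed not 0. Logging before and after the ioctl also confirmed that the program really does hang there.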

Disassembly

If you want to go more hardcore, you can disassemble the function of the current frame with disassemble:

Dump of assembler code for function _Z21SetEnableb:
   0x00699090 <+0>:     ldrh    r0, [r3, r5]
   0x00699092 <+2>:     movs    r5, r0
   0x00699094 <+4>:     movs    r2, #173        ; 0xad
   0x00699096 <+6>:     movs    r3, r1
   0x00699098 <+8>:     str     r1, [r4, #108]  ; 0x6c
   0x0069909a <+10>:    movs    r5, r1
   0x0069909c <+12>:    lsls    r4, r2, #4
   0x0069909e <+14>:    lsls    r0, r2, #9
   0x006990a0 <+16>:    lsls    r4, r6, #1
   0x006990a2 <+18>:    add     r0, r0
   0x006990a4 <+20>:    ldrh    r6, [r4, r5]
   0x006990a6 <+22>:    movs    r5, r0
   0x006990a8 <+24>:    movs    r2, #157        ; 0x9d
   0x006990aa <+26>:    movs    r3, r1
   0x006990ac <+28>:    str     r5, [r6, #108]  ; 0x6c
   0x006990ae <+30>:    movs    r5, r1
   0x006990b0 <+32>:    lsls    r4, r2, #4
   0x006990b2 <+34>:    lsls    r0, r2, #9
   0x006990b4 <+36>:    lsls    r4, r6, #1
   0x006990b6 <+38>:    subs    r4, #0
   0x006990b8 <+40>:    ldrh    r6, [r6, r5]
   0x006990ba <+42>:    movs    r5, r0
   0x006990bc <+44>:    add     r4, sp, #916    ; 0x394
   0x006990be <+46>:    movs    r3, r2
   ...

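For this part you can refer to a note I wrote earlier: use i registers to view the register values and, together with the assembly, work out by brute force how the exception actually happened: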

(gdb) i registers
r0             0x0                 0
r1             0x40204d00          1075858688
r2             0x78f728            7927592
r3             0x1                 1
r4             0x1                 1
r5             0x78c204            7913988
r6             0x2                 2
r7             0xff995ba4          4288240548
r8             0x78c204            7913988
r9             0x22456b8           35935928
r10            0x223f970           35912048
r11            0x5ec1d5            6210005
r12            0x36                54
sp             0xff995ad0          0xff995ad0
lr             0x699161            6918497
pc             0x699160            0x699160 <SetEnable(bool)+208>
cpsr           0x810030            8454192
fpscr          0x80000000          -2147483648
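In the dump above, r1 holds 0x40204d00, the same value gdb printed as request=1075858688 in frame #1. An ioctl request number packs several fields, and the sketch below splits them apart. It assumes the generic Linux encoding (8 bits nr, 8 bits type, 14 bits size, 2 bits direction). The struct and function names are invented for this example.

```cpp
#include <cstdint>

// Fields of an ioctl request number in the generic Linux encoding.
struct IoctlRequest {
    unsigned dir;   // 0 = none, 1 = write (user to kernel), 2 = read, 3 = both
    unsigned size;  // size of the argument in bytes
    char type;      // driver "magic" character
    unsigned nr;    // command number within the driver
};

// Splits a request number into its fields.
// Assumed layout: bits 0-7 nr, 8-15 type, 16-29 size, 30-31 direction.
constexpr IoctlRequest DecodeIoctl(std::uint32_t request) {
    return IoctlRequest{
        (request >> 30) & 0x3u,
        (request >> 16) & 0x3fffu,
        static_cast<char>((request >> 8) & 0xffu),
        request & 0xffu,
    };
}
```

Under that layout, DecodeIoctl(0x40204d00) gives direction 1 (write), size 32, type 'M' and command number 0.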

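PS: The instruction addresses in gdb's disassemble output are offset from the ones you get by running objdump directly on the executable. That is because the instruction addresses shift once the executable has been loaded into memory. The offset can be confirmed from /proc/{pid}/maps: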

00590000-00776000 r-xp 00000000 103:0d 2184                              /usr/bin/Demo
00785000-0078e000 r--p 001e5000 103:0d 2184                              /usr/bin/Demo
0078e000-00791000 rw-p 001ee000 103:0d 2184                              /usr/bin/Demo
...
f7b10000-f7b28000 r-xp 00000000 103:0d 1039                              /lib/libgcc_s.so.1
f7b28000-f7b37000 ---p 00018000 103:0d 1039                              /lib/libgcc_s.so.1
...
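The arithmetic can be sketched as follows. Mapping, ParseMapsLine and ToFileOffset are names made up for this example. The result is an offset inside the file. It matches the address objdump prints only if the segment's link-time address equals its file offset, which is an assumption about how the binary was linked.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>

struct Mapping {
    std::uint64_t start;   // first address of the mapping
    std::uint64_t end;     // one past the last address
    std::uint64_t offset;  // offset of the mapping within the file
};

// Parses the leading "start-end perms offset" fields of one maps line.
std::optional<Mapping> ParseMapsLine(const char* line) {
    unsigned long long start = 0, end = 0, offset = 0;
    char perms[5] = {};
    if (std::sscanf(line, "%llx-%llx %4s %llx", &start, &end, perms, &offset) != 4) {
        return std::nullopt;
    }
    return Mapping{start, end, offset};
}

// Translates a runtime address into an offset inside the mapped file.
// Returns std::nullopt if the address lies outside the mapping.
std::optional<std::uint64_t> ToFileOffset(const Mapping& m, std::uint64_t address) {
    if (address < m.start || address >= m.end) {
        return std::nullopt;
    }
    return address - m.start + m.offset;
}
```

For the first line above and the pc value 0x699160, this gives 0x699160 - 0x590000 + 0 = 0x109160.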