An Investigation Prompted by a Core Dump

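In an interview yesterday I was asked: "If a program you wrote hits a segmentation fault or a similar problem while running on Linux, how would you track it down?" I figured this was a question about core dumps and rattled off an answer. Then came the follow-up: "What if you forgot to compile with the -g debug option? How would you locate the segfault then?" Uh... that one stumped me. I mumbled something in the end, but I keep thinking: even without debugging information, as long as we have the memory address where things went wrong, there should be a way to locate the error. Is that guess right? And if it is, how exactly is it done?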


To start, I'll borrow the segfault example from the CoolShell (酷壳) article "C语言结构体里的成员数组和指针" (member arrays and pointers in C structs):

crash.c
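The listing for crash.c did not survive in this copy of the post. The sketch below recreates a file that is consistent with the session that follows. The body of main follows the CoolShell example; the exact line layout is an assumption, chosen so that printf(f.a->s); lands on line 15 as in the compiler warning. The scratch directory is only there to keep the sketch self-contained.

```shell
# Recreate crash.c in a scratch directory (the line layout is an assumption).
workdir=$(mktemp -d)
cat > "$workdir/crash.c" <<'EOF'
#include <stdio.h>

struct str {
    int len;
    char s[0];
};

struct foo {
    struct str *a;
};

int main(int argc, char **argv) {
    struct foo f = {0};
    if (f.a->s) {
        printf(f.a->s);
    }
    return 0;
}
EOF

# Line 15 should be the printf call named in the gcc warning.
sed -n '15p' "$workdir/crash.c"
```

Why this crashes: f.a is NULL, so f.a->s is just the address 4 (the offset of s after the 4-byte len). That is non-NULL, so the if passes, and printf then tries to read a format string from address 0x4, which matches the format=0x4 seen in the backtrace below.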

    root@k8s:~/test# gcc -o crash_noDebug crash.c
    crash.c: In function ‘main’:
    crash.c:15:10: warning: format not a string literal and no format arguments [-Wformat-security]
      printf(f.a->s);
              ^
    root@k8s:~/test# ulimit -c
    0
    root@k8s:~/test# ulimit -c unlimited
    root@k8s:~/test# ulimit -c
    unlimited
    root@k8s:~/test# ./crash_noDebug
    Segmentation fault (core dumped)
    root@k8s:~/test# gdb crash_noDebug core
    GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
    Copyright (C) 2016 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from crash_noDebug...(no debugging symbols found)...done.
    [New LWP 18077]
    Core was generated by `./crash_noDebug'.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  strchrnul () at ../sysdeps/x86_64/strchr.S:32
    32      ../sysdeps/x86_64/strchr.S: No such file or directory.
    (gdb) where
    #0  strchrnul () at ../sysdeps/x86_64/strchr.S:32
    #1  0x00007f1ce8cb2208 in __find_specmb (format=0x4 <error: Cannot access memory at address 0x4>) at printf-parse.h:108
    #2  _IO_vfprintf_internal (s=0x7f1ce902a620 <_IO_2_1_stdout_>, format=0x4 <error: Cannot access memory at address 0x4>,
        ap=ap@entry=0x7ffccc8407e8) at vfprintf.c:1312
    #3  0x00007f1ce8cba899 in __printf (format=<optimized out>) at printf.c:33
    #4  0x000000000040055f in main ()
    (gdb)

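Without debugging information gdb cannot map the crash to a source line, but the address in frame #4, 0x000000000040055f, still looks valuable. (gdb can still name main here because the ordinary symbol table is present even without -g.) For now I'll make an assumption: whether or not -g is used does not change where the code ends up in memory. If that holds, I only need to rebuild the program with -g and then find some way to map the address 0x000000000040055f to its source information, and the problem is solved. To test this guess, I produced a core dump again, this time from a build with debugging information.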
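The assumption can also be checked without crashing the program a second time. A minimal sketch, assuming binutils' objdump is installed and that both binaries were built from the same source with the same compiler and the same flags apart from -g. The helper name same_code is made up for illustration. It disassembles .text from each file and compares the two listings, addresses included.

```shell
# same_code A B: succeed if both ELF files disassemble to the same .text
# listing (same instructions at the same addresses). Hypothetical helper.
same_code() {
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
    # Drop the first three lines of objdump output; they contain the file name.
    objdump -d -j .text "$1" 2>/dev/null | sed '1,3d' > "$tmp_a"
    objdump -d -j .text "$2" 2>/dev/null | sed '1,3d' > "$tmp_b"
    # An empty listing means objdump failed, so treat it as "not the same".
    if [ -s "$tmp_a" ] && cmp -s "$tmp_a" "$tmp_b"; then rc=0; else rc=1; fi
    rm -f "$tmp_a" "$tmp_b"
    return "$rc"
}

if [ -f crash_noDebug ] && [ -f crash_withDebug ]; then
    if same_code crash_noDebug crash_withDebug; then
        echo "code is identical: addresses from one build apply to the other"
    else
        echo "code differs: do not reuse addresses across these builds"
    fi
fi
```

One caveat: many newer toolchains build position-independent executables by default, in which case the addresses in a core dump are randomized load addresses and the comparison would have to be made on offsets instead. The 0x400... addresses in this session suggest a fixed-address build.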

    root@k8s:~/test# gcc -o crash_withDebug -g crash.c
    crash.c: In function ‘main’:
    crash.c:15:10: warning: format not a string literal and no format arguments [-Wformat-security]
      printf(f.a->s);
              ^
    root@k8s:~/test# ./crash_withDebug
    Segmentation fault (core dumped)
    root@k8s:~/test# gdb crash_withDebug core
    GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
    Copyright (C) 2016 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from crash_withDebug...done.
    [New LWP 23657]
    Core was generated by `./crash_withDebug'.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  strchrnul () at ../sysdeps/x86_64/strchr.S:32
    32      ../sysdeps/x86_64/strchr.S: No such file or directory.
    (gdb) where
    #0  strchrnul () at ../sysdeps/x86_64/strchr.S:32
    #1  0x00007ffabb54e208 in __find_specmb (format=0x4 <error: Cannot access memory at address 0x4>) at printf-parse.h:108
    #2  _IO_vfprintf_internal (s=0x7ffabb8c6620 <_IO_2_1_stdout_>, format=0x4 <error: Cannot access memory at address 0x4>,
        ap=ap@entry=0x7ffe3a0c4708) at vfprintf.c:1312
    #3  0x00007ffabb556899 in __printf (format=<optimized out>) at printf.c:33
    #4  0x000000000040055f in main (argc=1, argv=0x7ffe3a0c48e8) at crash.c:15
    (gdb)

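The crash is reported at line 15 of crash.c, and the address is again 0x000000000040055f, which indirectly supports the earlier assumption. Of course, in a real project a core dump is often not this easy to reproduce, so we can't count on rebuilding with -g, crashing again, and debugging the new core file. So, if all we have is the address 0x000000000040055f, how do we map it to a position in the code? I'll leave that question here for the moment. I want to study and organize what I know about compilation, memory, and related topics more systematically, and the answer should then follow naturally.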

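The C compilation process is as follows (preprocessing, compilation to assembly, assembling, linking):

[Figure: the compilation process]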
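Each stage can be run on its own with gcc, which is also where a crash.i file like the one used below comes from. A minimal sketch: build_in_stages is a hypothetical helper name, and the intermediate file names are just the usual convention.

```shell
# build_in_stages FILE.c: run the four gcc stages one at a time,
# stopping at the first stage that fails. Hypothetical helper.
build_in_stages() {
    src=$1
    base=${src%.c}
    gcc -E "$src" -o "$base.i" &&      # preprocess:  .c -> .i
    gcc -S "$base.i" -o "$base.s" &&   # compile:     .i -> .s (assembly)
    gcc -c "$base.s" -o "$base.o" &&   # assemble:    .s -> .o
    gcc "$base.o" -o "$base"           # link:        .o -> executable
}

# Usage: build_in_stages crash.c
```

Adding -g to the second command is enough to change the generated assembly, which is what the diff in the next session looks at.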

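So here's the question: at which of these steps does gcc's -g option take effect? Once you know what debugging information actually is, it's not hard to infer that -g acts at the compilation step (C to assembly), because what -g adds is extra directives and sections in the generated assembly. Strictly speaking this is debugging information rather than the symbol table; the symbol table is emitted even without -g.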

    root@k8s:~/test# gcc -S crash.i -o crash_noDebug.S
    crash.c: In function ‘main’:
    crash.c:15:10: warning: format not a string literal and no format arguments [-Wformat-security]
      printf(f.a->s);
              ^
    root@k8s:~/test# gcc -S crash.i -g -o crash_withDebug.S
    crash.c: In function ‘main’:
    crash.c:15:10: warning: format not a string literal and no format arguments [-Wformat-security]
      printf(f.a->s);
              ^
    root@k8s:~/test# diff crash_noDebug.S crash_withDebug.S
    2a3
    > .Ltext0:
    6a8,9
    >      .file 1 "crash.c"
    >      .loc 1 12 0
    15a19
    >      .loc 1 13 0
    16a21
    >      .loc 1 14 0
    20a26
    >      .loc 1 15 0
    26a33
    >      .loc 1 17 0
    27a35
    >      .loc 1 18 0
    33a42,380
    > .Letext0:
    >      .section        .debug_info,"",@progbits
    > .Ldebug_info0:
    ... (omitted) ...
    >      .section        .debug_line,"",@progbits
    > .Ldebug_line0:
    >      .section        .debug_str,"MS",@progbits,1
    > .LASF3:
    >      .string "unsigned int"
    > .LASF13:
    >      .string "/root/test"
    > .LASF0:
    >      .string "long unsigned int"
    > .LASF8:
    >      .string "char"
    > .LASF12:
    >      .string "crash.c"
    > .LASF1:
    >      .string "unsigned char"
    > .LASF14:
    >      .string "main"
    > .LASF6:
    >      .string "long int"
    > .LASF9:
    >      .string "argc"
    > .LASF11:
    >      .string "GNU C11 5.4.0 20160609 -mtune=generic -march=x86-64 -g -fstack-protector-strong"
    > .LASF2:
    >      .string "short unsigned int"
    > .LASF4:
    >      .string "signed char"
    > .LASF5:
    >      .string "short int"
    > .LASF7:
    >      .string "sizetype"
    > .LASF10:
    >      .string "argv"

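The same difference should be visible in the finished binaries. A small sketch, assuming binutils' readelf is available. debug_section_names is a hypothetical filter that pulls the .debug_* section names out of readelf -S output.

```shell
# debug_section_names: read `readelf -S` output on stdin and print the
# distinct .debug_* section names. Hypothetical helper for illustration.
debug_section_names() {
    grep -o '\.debug_[a-z_]*' | sort -u
}

for bin in crash_noDebug crash_withDebug; do
    if [ -f "$bin" ]; then
        echo "== $bin"
        readelf -S -W "$bin" | debug_section_names
    fi
done
```

I would expect the -g build to list sections such as .debug_info and .debug_line and the other build to list none, while the symbol table (and therefore the name main) is present in both unless the binary has been stripped.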

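The last step is to use some Linux binary-analysis tools to find out which code the address 0x000000000040055f corresponds to. Commonly used tools include nm, objdump, and readelf:

nm: lists the symbols in a binary. It only gives symbol addresses, so it can't pinpoint what is at the target address. Pass.

objdump: displays all kinds of information about object files; for example, it can disassemble the .text section. Just what we need.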

    root@k8s:~/test# objdump -d -j .text crash_withDebug

    crash_withDebug:     file format elf64-x86-64

    Disassembly of section .text:

    0000000000400526 <main>:
      400526:       55                      push   %rbp
      400527:       48 89 e5                mov    %rsp,%rbp
      40052a:       48 83 ec 20             sub    $0x20,%rsp
      40052e:       89 7d ec                mov    %edi,-0x14(%rbp)
      400531:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
      400535:       48 c7 45 f0 00 00 00    movq   $0x0,-0x10(%rbp)
      40053c:       00
      40053d:       48 8b 45 f0             mov    -0x10(%rbp),%rax
      400541:       48 83 c0 04             add    $0x4,%rax
      400545:       48 85 c0                test   %rax,%rax
      400548:       74 15                   je     40055f <main+0x39>
      40054a:       48 8b 45 f0             mov    -0x10(%rbp),%rax
      40054e:       48 83 c0 04             add    $0x4,%rax
      400552:       48 89 c7                mov    %rax,%rdi
      400555:       b8 00 00 00 00          mov    $0x0,%eax
      40055a:       e8 a1 fe ff ff          callq  400400 <printf@plt>
      40055f:       b8 00 00 00 00          mov    $0x0,%eax
      400564:       c9                      leaveq
      400565:       c3                      retq
      400566:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)


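readelf: displays information about ELF files (executables, .o object files, shared libraries, and core dump files). I tried the -s option to print the symbols; like nm, it can't pinpoint what is at the target address. Pass.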
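Going back to the objdump listing, one detail is easy to miss: 0x40055f is not the instruction that called printf. It is the return address, the instruction immediately after the 5-byte callq at 0x40055a. The arithmetic below uses only addresses copied from the gdb backtrace and the disassembly; the variable names are mine.

```shell
# Addresses copied from the gdb backtrace and the objdump listing.
ret_addr=0x40055f     # frame #4 in the backtrace
main_start=0x400526   # start of <main> in the disassembly
call_len=5            # "e8 a1 fe ff ff" is a 5-byte callq

# Offset of the return address inside main.
offset=$(printf '0x%x' $(( $ret_addr - $main_start )))
# Address of the call instruction that was in progress.
call_addr=$(printf '0x%x' $(( $ret_addr - $call_len )))

echo "return address = main+$offset"
echo "call site      = $call_addr"
```

The offset agrees with the `<main+0x39>` label that objdump prints for the je target. For a -g build, objdump can also narrow the listing to this window and interleave source lines, for example `objdump -d -S -l --start-address=0x40054a --stop-address=0x400564 crash_withDebug`.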


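I really regret not asking the interviewer what the right answer was. Judging from what I've found so far:

· Get the faulting address from the existing core dump

· Rebuild the executable with the -g option

· Use objdump to see how addresses map to symbols in the code section

This approach can roughly locate the error. Whether there is a more convenient or efficient way, I don't really know. It's a long road ahead...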
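One more direct route, offered as a sketch and not as the answer the interviewer had in mind: binutils' addr2line, or gdb's info line command, can translate an address straight to a function and file:line when given the -g build. Two assumptions to keep in mind. First, the -g build must match the crashed binary's code exactly. Second, the address to look up should lie inside the call instruction (0x40055a here), because the return address 0x40055f already belongs to the statement after the call. resolve_addr is a hypothetical helper name.

```shell
# resolve_addr BINARY ADDR: print the function and file:line for ADDR,
# using the debug info in BINARY. Hypothetical helper for illustration.
resolve_addr() {
    if [ ! -f "$1" ]; then
        echo "no such binary: $1" >&2
        return 1
    fi
    addr2line -f -e "$1" "$2"
}

if [ -f crash_withDebug ]; then
    # 0x40055a is the callq in main; 0x40055f is where it returns to.
    resolve_addr crash_withDebug 0x40055a
    # The same lookup with gdb, without needing the core file.
    gdb -batch -ex 'info line *0x40055a' crash_withDebug
fi
```

gdb appears to compensate for the return-address offset itself when it prints caller frames, which would explain why frame #4 was reported at crash.c:15 even though its address is 0x40055f.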
