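Recently I ran into a problem whose symptoms were hard to explain at first: inside a function, the address held in one of its parameters was passed on to another function, and the callee saw a different address. This is what it looks like when a function's stack has been corrupted (stack smashing), after which the function can no longer run correctly.
The callee then accessed an invalid address, which caused a segmentation fault and produced a core dump file. The problem was tricky.
After some reading, I decided to start with the compiler's stack protection options and with gdb.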
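1) Add compiler options at build time
-fstack-protector and -fstack-protector-all tell the compiler (GCC) to enable stack protection. -fstack-protector guards only the functions the compiler considers vulnerable, such as those with character arrays on the stack, while -fstack-protector-all guards every function. The check runs when a protected function returns, so a smashed stack aborts the program close to where the damage happened instead of crashing somewhere unrelated much later, and the scene can be dumped right there. The options can go into the Makefile. As an aside, anyone working on open source software really has to know Makefiles well; I tried to write one from scratch just for fun and my mind suddenly went blank.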
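2) gdb commands for call stacks and threads
bt shows the call stack of the current thread
bt full shows the call stack together with the local variables of each frame
info threads lists all threads
thread <num> switches to the given thread
f <num> selects the given stack frame (short for frame)
i locals shows the local variables of the selected frame (short for info locals)
i register shows the register values for the selected frame, mainly to check addresses (short for info registers)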
With these commands we can dig a lot of information out of a core dump file.
Here is an example:
gdb /lab/testtools/rhel664/dallas/testRelease/R10A06_dynamic_udpport_5/mnsserv/bin/mhlif core-mhlif-18310-1384802382
(gdb) bt
#0 0x0000003383488611 in memcpy () from /lib64/libc.so.6
#1 0x000000000041a9aa in ReadFromQueue (q=0x647580, msg=0x4fc780004fc71 , size=280,
time=21081) at ltsosdep.c:443
#2 0x000000000041b552 in OSH_ReceiveMsgQMillisec (q=0x647580, msg=0x4fc780004fc71 ,
size=280, time=21081) at ltsosdep.c:1370
#3 0x000000000042d47d in RPS::ReceiveMsg (this=0x2b3100005330, delay=21081) at rps.cc:590
#4 0x000000000042d731 in RPS::Execute (this=0x2b31681ffdf0) at rps.cc:572
#5 0x000000000042dbe8 in StartRps (arg=0x157a680) at rps.cc:181
#6 0x0000003383c077e1 in start_thread () from /lib64/libpthread.so.0
#7 0x00000033834e68ed in clone () from /lib64/libc.so.6
(gdb) bt full
#0 0x0000003383488611 in memcpy () from /lib64/libc.so.6
No symbol table info available.
#1 0x000000000041a9aa in ReadFromQueue (q=0x647580, msg=0x4fc780004fc71 , size=280,
time=21081) at ltsosdep.c:443
row = 0x2b31682a433c
answer = LTS_OK
#2 0x000000000041b552 in OSH_ReceiveMsgQMillisec (q=0x647580, msg=0x4fc780004fc71 ,
size=280, time=21081) at ltsosdep.c:1370
No locals.
#3 0x000000000042d47d in RPS::ReceiveMsg (this=0x2b3100005330, delay=21081) at rps.cc:590
rpsMsg = {msgId = 4501, type = 0 '\000', data = {loadReplayReq = {
fileName = "Ú%\004\000\002\000\000\000\001\000\000\000\235ú\004\000tQ\003\000GP\003\000¸U\000\000Oû\004\000pR\000\000\206ü\004\000\bú\004\000ÅS\000\000vR\000\000\067P\003\000fP\003\000 ü\004\000Úü\004\000¢P\003\000ÿT\000\000\vý\004\000²O\003\000Z\002\002\000Nú\004\000+ú\004\000>ú\004\000\233T\000\000íÿ\001\000ÊT\000\000G\001\002\000M\001\002\000Y\003\002\000£ú\004\000\020ú\004\000\032\000\002\000ÎU\000\000x\000\002\000\035\001\002\000K\002\002\000æù\004\000\206S\000\000\071U\000\000\232ü\004\000õP\003\000ë\000\002\000\202S\003\000Ø\000\002\000xú\004\000\201\001\002\000=T\000\000oR\000\000"..., natType = 48 '0', timeStretch = 11057,
rpsType = 2156588448}, replayConReq = {msIndex = 271834, contextIndex = 2 '\002', resend = 0 '\000', replayId = 0,
sessionId = 1, sessionTime = 326301, destIp1 = {addr64 = {932690803249524, 1402216627852728},
b = "tQ\003\000GP\003\000¸U\000\000Oû\004", addr16 = {20852, 3, 20551, 3, 21944, 0, 64335, 4}, ui = {i1 = 217460,
i2 = 217159, i3 = 21944, ipv4 = 326479}}, destIp2 = {addr64 = {1403552362680944, 92105573988872},
b = "pR\000\000\206ü\004\000\bú\004\000ÅS\000", addr16 = {21104, 0, 64646, 4, 64008, 4, 21445, 0}, ui = {
i1 = 21104, i2 = 326790, i3 = 326152, ipv4 = 21445}}, reqPackets = 21110, timeStretch = 217143, type = 102 'f',
radiotype = 80 'P', kernelMsId = 933081645382874}, msgQid = {_qId = 0x2000425da}, payloadPropReq = {
payloadPropId = 271834, groupId = 2 '\002', msgLength = 0, userBw = 1}, connectionReq = {msIndex = 271834,
contextIndex = 2 '\002', payloadPropId = 0, sessionId = 1, addresses = {GiIpAddr = {addr64 = {932690803249524,
1402216627852728}, b = "tQ\003\000GP\003\000¸U\000\000Oû\004", addr16 = {20852, 3, 20551, 3, 21944, 0, 64335,
4}, ui = {i1 = 217460, i2 = 217159, i3 = 21944, ipv4 = 326479}}, msPortNo = 21104, GiPortNo = 0},
reqPackets = 326152, initiator = 197 'Å', type = 83 'S', radiotype = 0 '\000', kernelMsId = 932622083576438},
rpsDeactReq = {msIndex = 271834, contextIndex = 2 '\002', sendMhlResponse = LTS_TRUE, sessionId = {326301, 217460,
217159, 21944, 326479, 21104, 326790, 326152, 21445, 21110, 217143, 217190, 326688, 326874, 217250, 21759, 326923,
217010, 131674, 326222}, pdpcontextId = 326187, sessionnum = 62 '>'}, moveUpdateDataReq = {msIndex = 271834,
toDevice = 2 '\002', moveIndex = 1, status = 326301}, suspendResumeReq = {msIndex = 271834, sessionId = {2, 1,
326301, 217460, 217159, 21944, 326479, 21104, 326790, 326152, 21445, 21110, 217143, 217190, 326688, 326874,
217250, 21759, 326923, 217010}, sessionnum = 90 'Z', contextIndex = 2 '\002'}, rabCreateReleaseReq = {
msIndex = 271834, contextId = 2 '\002'}, peMoveResp = {msIndex = 271834, toDevice = 2 '\002', moveIndex = 1,
peIndex = 326301, status = 217460}, scalePayloadReq = {scaleFactor = 271834}, magQid = {_qId = 0x2000425da}}}
count =
#4 0x000000000042d731 in RPS::Execute (this=0x2b31681ffdf0) at rps.cc:572
nowTime = 12394937602
nextTime =
count = 844209533
entry =
pEngine = 0x2b3168518270
#5 0x000000000042dbe8 in StartRps (arg=0x157a680) at rps.cc:181
Rps = {mhlifQId = {_qId = 0x2b315c225000}, magifQId = {_qId = 0x2b319454d000}, initQId = {_qId = 0x2b315c225000},
mDeviceNo = 12, mRpsState = RPS_RUNNING_STATE, sessionRepository = {rpsSessionPolymer = {buckets = 100001,
hash_func = 0x42a5c0 , p_dataRepository = 0x2b316c001070}},
log = @0x1538040, apnDev = 10, vpReplayStore = std::vector of length 0, capacity 0,
mpAlreadyLoaded = std::map with 0 elements}
#6 0x0000003383c077e1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7 0x00000033834e68ed in clone () from /lib64/libc.so.6
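In general bt full does not help much more than bt, but it does show the local variables of each frame. Some of those values are unreliable, though. For example, msg is 0x4fc780004fc71 in frames 1 and 2, which is not a plausible pointer, and this differs between frame 3 (0x2b3100005330) and frame 4 (0x2b31681ffdf0), although RPS::Execute presumably calls RPS::ReceiveMsg on the same object. So we still cannot pinpoint the problem exactly.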
(gdb) info threads
16 Thread 0x2b3151cb7100 (LWP 18310) 0x0000003383c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
15 Thread 0x2b315c54b700 (LWP 18428) 0x00000033834df443 in select () from /lib64/libc.so.6
14 Thread 0x2b315c224700 (LWP 18423) 0x00000033834df443 in select () from /lib64/libc.so.6
13 Thread 0x2b31525e5700 (LWP 18422) 0x0000003383c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
12 Thread 0x2b3151fb0700 (LWP 18313) 0x00000033834df443 in select () from /lib64/libc.so.6
11 Thread 0x2b3194873700 (LWP 18535) 0x00000033834df443 in select () from /lib64/libc.so.6
10 Thread 0x2b319454c700 (LWP 18534) 0x00000033834df443 in select () from /lib64/libc.so.6
9 Thread 0x2b3194225700 (LWP 18533) 0x00000033834df443 in select () from /lib64/libc.so.6
8 Thread 0x2b3188425700 (LWP 18531) 0x00000033834df443 in select () from /lib64/libc.so.6
7 Thread 0x2b3188200700 (LWP 18530) 0x0000003383c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
6 Thread 0x2b3178602700 (LWP 18529) 0x0000003383c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
5 Thread 0x2b3178401700 (LWP 18435) 0x00000033834df443 in select () from /lib64/libc.so.6
4 Thread 0x2b3178200700 (LWP 18434) 0x00000033834df443 in select () from /lib64/libc.so.6
3 Thread 0x2b3169f6b700 (LWP 18433) 0x0000003383c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2 Thread 0x2b3169d6a700 (LWP 18432) 0x0000003383c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1 Thread 0x2b3168200700 (LWP 18429) 0x0000003383488611 in memcpy () from /lib64/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 0x2b3168200700 (LWP 18429))]#0 0x0000003383488611 in memcpy () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003383488611 in memcpy () from /lib64/libc.so.6
#1 0x000000000041a9aa in ReadFromQueue (q=0x647580, msg=0x4fc780004fc71 , size=280,
time=21081) at ltsosdep.c:443
#2 0x000000000041b552 in OSH_ReceiveMsgQMillisec (q=0x647580, msg=0x4fc780004fc71 ,
size=280, time=21081) at ltsosdep.c:1370
#3 0x000000000042d47d in RPS::ReceiveMsg (this=0x2b3100005330, delay=21081) at rps.cc:590
#4 0x000000000042d731 in RPS::Execute (this=0x2b31681ffdf0) at rps.cc:572
#5 0x000000000042dbe8 in StartRps (arg=0x157a680) at rps.cc:181
#6 0x0000003383c077e1 in start_thread () from /lib64/libpthread.so.0
#7 0x00000033834e68ed in clone () from /lib64/libc.so.6
(gdb) f 1
#1 0x000000000041a9aa in ReadFromQueue (q=0x647580, msg=0x4fc780004fc71 , size=280,
time=21081) at ltsosdep.c:443
443 ltsosdep.c: No such file or directory.
in ltsosdep.c
(gdb) i locals
row = 0x2b31682a433c
answer = LTS_OK
(gdb) i register
rax 0x2b0000001197 47278999998871
rbx 0x4fc780004fc71 1403492233444465
rcx 0x7 7
rdx 0x118 280
rsi 0x2b31682a4340 47491200992064
rdi 0x4fc780004fc71 1403492233444465
rbp 0x2b31681ffb80 0x2b31681ffb80
rsp 0x2b31681ffb20 0x2b31681ffb20
r8 0x1c0002000ce527 7881307938678055
r9 0x2b310003517c 47489453609340
r10 0x0 0
r11 0x202 514
r12 0x525a00005259 90546500555353
r13 0x2b3100005330 47489453413168
r14 0x20c49ba5e353f7cf 2361183241434822607
r15 0x2b316c106d70 47491266407792
rip 0x41a9aa 0x41a9aa
eflags 0x10203 [ CF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
The above only demonstrates some ways of inspecting a core dump file. While the process is still alive, we can also attach gdb to it directly and debug it.
(gdb) attach 2467
Attaching to process 2467
Reading symbols from /root/algorithm/testBh...done.
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/tls/i686/cmov/libm.so.6...Reading symbols from /usr/lib/debug/lib/tls/i686/cmov/libm-2.11.1.so...done.
done.
Loaded symbols for /lib/tls/i686/cmov/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/tls/i686/cmov/libc.so.6...Reading symbols from /usr/lib/debug/lib/tls/i686/cmov/libc-2.11.1.so...done.
done.
Loaded symbols for /lib/tls/i686/cmov/libc.so.6
Reading symbols from /lib/ld-linux.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.11.1.so...done.
done.
Loaded symbols for /lib/ld-linux.so.2
0x005f7422 in __kernel_vsyscall ()
(gdb) break testBh.cc:38
Breakpoint 1 at 0x80488ff: file testBh.cc, line 38.
(gdb) c
Continuing.
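These commands stop the process, so that we can single-step it or print local variables.
Printing the stacks of all threads
In gdb, use thread apply all bt to print the stack of every thread.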