While hand-writing merge sort today, I ran into this error:
0 94 25 31 47 76 8 6 21 83 0 19 61 12 46 74
0 6 6 8 8 12 12 25 25 25 31 31 31 61 74 94
*** stack smashing detected ***: ./a.out terminated
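On closer inspection, the merged result is wrong as well. The main sorting function merge_sort only checks the boundary, recurses twice, and merges the two sorted subarrays, so I guessed the bug was in the merge function: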
void merge(int* a, int mid, int n) {
    int* buf = new int[n];
    memcpy(buf, a, n * sizeof(int));
    int* pL = buf;
    int* pR = buf + mid;
    int* const pEndL = buf + mid;
    int* const pEndR = buf + n;
    for (; pL != pEndL && pR != pEndR; ++a) {
        if (*pL <= *pR) {
            *a = *pL;
            ++pL;
        }
        else {
            *a = *pR;
            ++pR;
        }
    }
    if (pL == pEndL)
        memcpy(a, pR, (pEndR - pR) * sizeof(int));
    else
        memcpy(a, pL, (pEndR - pL) * sizeof(int));
    delete[] buf;
}
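PS: the copy could be written more compactly as *a = *pL++, and a lot of library source code does exactly that. But the interviewer in my first-round Tencent interview a few days ago said this style is not ideal, because the reader has to work out the relative precedence of ++ and *, so I no longer take the shortcut.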
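To make the precedence point concrete, here is a minimal sketch comparing the two spellings. The helper names copy_compact and copy_explicit are hypothetical and exist only for this illustration; each one copies a single element and reports how far the source pointer moved.

```cpp
#include <cassert>

// Compact form. Postfix ++ binds tighter than unary *, so *p++ parses as
// *(p++): the value at the old position is read, then p advances by one.
int copy_compact(const int* src, int* dst) {
    const int* p = src;
    *dst = *p++;
    return static_cast<int>(p - src);  // distance the pointer advanced
}

// Explicit form, as used in merge above: same effect in two statements.
int copy_explicit(const int* src, int* dst) {
    const int* p = src;
    *dst = *p;
    ++p;
    return static_cast<int>(p - src);
}
```

Both helpers copy the same value and advance the pointer by one element. The difference is purely one of readability.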
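An earlier version of merge that used index access worked fine; the problem appeared only after I switched to pointer access. A careful reader can spot the mistake at a glance, but eyeballing code while debugging is not that reliable.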
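As usual, I checked it with valgrind. (Note that valgrind options belong before the program name. In the command below, --leak-check=full was passed to a.out instead, as the Command: line in the log shows.)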
$ valgrind ./a.out --leak-check=full
==14866== Memcheck, a memory error detector
==14866== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==14866== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==14866== Command: ./a.out --leak-check=full
==14866==
6 47 66 81 63 28 31 86 67 95 81 22 89 59 86 12
6 12 22 22 28 28 47 63 66 67 81 81 81 86 86 95
*** stack smashing detected ***: ./a.out terminated
==14866==
==14866== Process terminating with default action of signal 6 (SIGABRT)
==14866== at 0x540E428: raise (raise.c:54)
==14866== by 0x5410029: abort (abort.c:89)
==14866== by 0x54507E9: __libc_message (libc_fatal.c:175)
==14866== by 0x54F215B: __fortify_fail (fortify_fail.c:37)
==14866== by 0x54F20FF: __stack_chk_fail (stack_chk_fail.c:28)
==14866== by 0x400BBF: main (in /home/xyz/cpp/sort/a.out)
==14866==
==14866== HEAP SUMMARY:
==14866== in use at exit: 72,704 bytes in 1 blocks
==14866== total heap usage: 17 allocs, 16 frees, 73,984 bytes allocated
==14866==
==14866== LEAK SUMMARY:
==14866== definitely lost: 0 bytes in 0 blocks
==14866== indirectly lost: 0 bytes in 0 blocks
==14866== possibly lost: 0 bytes in 0 blocks
==14866== still reachable: 72,704 bytes in 1 blocks
==14866== suppressed: 0 bytes in 0 blocks
==14866== Rerun with --leak-check=full to see details of leaked memory
==14866==
==14866== For counts of detected and suppressed errors, rerun with: -v
==14866== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted (core dumped)
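The LEAK SUMMARY shows no memory leak on the heap. The still reachable block is normal here: it is a one-time allocation made by the C++ runtime at startup (72,704 bytes in one block matches libstdc++'s emergency exception-handling pool), not something this program allocated. The SIGABRT signal was raised from the __stack_chk_fail function, which, as its name suggests, reports an error on the stack.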
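The array being sorted is a fixed-size local array, so it lives on the stack. Stack memory is allocated and released automatically, simply by moving the stack pointer, and valgrind's Memcheck does not track the bounds of individual stack arrays, so it cannot detect an out-of-bounds access there. Still, valgrind did show that the failure was raised in the main function.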
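For background, __stack_chk_fail is called by gcc's stack protector: the compiler places a guard value (a "canary") between the local buffers and the saved return address and checks it before the function returns. The sketch below only simulates that idea on an ordinary heap buffer, so that it stays well-defined. The names kCanary, fill and overrun_detected are hypothetical, and the real check is generated by the compiler rather than written by hand.

```cpp
#include <cassert>
#include <vector>

// Hypothetical guard value; the real stack protector uses a random one.
const int kCanary = 0x5AFEC0DE;

// Writes `count` ints starting at data, with no knowledge of the payload size.
void fill(int* data, int count) {
    for (int i = 0; i < count; ++i)
        data[i] = i;
}

// Returns true if writing `count` ints into an n-int payload clobbered the
// guard word placed directly after the payload.
bool overrun_detected(int n, int count) {
    std::vector<int> buf(n + 1);  // n payload ints plus one guard word
    buf[n] = kCanary;
    assert(count <= n + 1);       // keep the simulated overrun inside the allocation
    fill(buf.data(), count);
    return buf[n] != kCanary;     // a changed guard means something wrote past the payload
}
```

Writing exactly n elements leaves the guard intact, while writing one element too many changes it. That is the same kind of evidence the stack protector sees, and it also explains why the report only names the function whose frame was damaged (main), not the statement that did the damage.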
The test code in my main function looks like this:
int arr[N];
// ...
merge_sort(arr, N);
I tried changing arr to an array allocated on the heap:
int* arr = new int[N];
// ...
merge_sort(arr, N);
// ...
delete[] arr;
After compiling with g++ and the -g option and running valgrind again, the error shows up. The key part of the report is:
==15746== Invalid write of size 8
==15746== at 0x4C326CB: memcpy@@GLIBC_2.14 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15746== by 0x4009EB: merge(int*, int, int) (merge.cpp:31)
==15746== by 0x400A7C: merge_sort(int*, int) (merge.cpp:43)
==15746== by 0x400B22: main (merge.cpp:60)
==15746== Address 0x5ab6cc0 is 0 bytes after a block of size 64 alloc'd
==15746== at 0x4C2E80F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15746== by 0x400A9E: main (merge.cpp:54)
This pinpoints the faulty location: the call to the library function memcpy inside merge (line 31). Here is the code at that line:
$ awk 'NR==31' merge.cpp
memcpy(a, pL, (pEndR - pL) * sizeof(int));
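With the line identified, the mistake is obvious. The merge combines the left region L and the right region R; pEndR is the right boundary of R, and pL is the pointer iterating over the left region. The intent of this line is to copy whatever remains of the left region to the end of the merged array, so the length should be pEndL - pL. With pEndR - pL, memcpy also copies all of R, which writes past the end of the array.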
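For reference, here is a sketch of the merge function with that one-line fix applied. The post does not show merge_sort, so the version below is an assumption reconstructed from its description (boundary check, two recursive calls, then a merge). The assert on mid is also an addition of mine.

```cpp
#include <cassert>
#include <cstring>

// Merges the sorted halves a[0, mid) and a[mid, n) through a scratch copy.
void merge(int* a, int mid, int n) {
    assert(0 <= mid && mid <= n);  // added precondition check, not in the original
    int* buf = new int[n];
    std::memcpy(buf, a, n * sizeof(int));
    int* pL = buf;
    int* pR = buf + mid;
    int* const pEndL = buf + mid;
    int* const pEndR = buf + n;
    for (; pL != pEndL && pR != pEndR; ++a) {
        if (*pL <= *pR) {
            *a = *pL;
            ++pL;
        } else {
            *a = *pR;
            ++pR;
        }
    }
    if (pL == pEndL)
        std::memcpy(a, pR, (pEndR - pR) * sizeof(int));  // rest of the right half
    else
        std::memcpy(a, pL, (pEndL - pL) * sizeof(int));  // fixed: pEndL, not pEndR
    delete[] buf;
}

// Assumed shape of merge_sort, following the description in the post.
void merge_sort(int* a, int n) {
    if (n < 2) return;             // boundary check
    int mid = n / 2;
    merge_sort(a, mid);            // sort the left half
    merge_sort(a + mid, n - mid);  // sort the right half
    merge(a, mid, n);              // merge the two sorted halves
}
```

Calling merge_sort(arr, N) on a heap-allocated arr, as in the test code above, and rerunning valgrind is a reasonable way to check that the Invalid write no longer appears.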
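One last point: for an array of size N, accessing an index that is only slightly out of bounds usually produces no error. It is still undefined behavior, but the address lies within mapped stack memory, possibly occupied by another variable, so the hardware and the OS treat the access as valid. For example: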
$ cat a.cc
#include <stdio.h>
int main() {
    int a[3];
    a[3] = 10;
    printf("%d\n", a[3]);
    return 0;
}
$ g++ a.cc
$ ./a.out
10
The run reports no error, but if a[100] is accessed instead, the program dies with Segmentation fault (core dumped).
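The stack is full of pitfalls that are hard to track down; I previously wrote a similar post about local variables being reused in C programs. Out-of-bounds accesses on stack arrays are hard to detect, so, as in this post, it is worth temporarily moving the array to the heap to let valgrind locate the error precisely.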