耳边一直是spm催版本的声音,好不容易在周四提测了一个版本,周五就被打回来了,问题描述是如下:
偶现,使用一段时间后崩溃
是可忍熟不可忍,很明显被测试打了回来。下面讲一下解这个问题的经过:
1
从日志中看崩溃时间是
03-13 10:56:18.185 1966 1966 I Process : Sending signal. PID: 1966 SIG: 9
...
03-13 10:56:18.353 998 1018 I ActivityManager: Process ***(pid 1966) has died
进程被sig 9了,以为是gc内存不够才被强杀的,但是没有找到具体的gc日志,只能继续找其他问题了
2
终于在前面的日志中找到了端倪,哎,一分钟前的error,一分钟后才出现。。。
03-13 10:55:23.757 1966 1966 E AndroidRuntime: Process: ***, PID: 1966
03-13 10:55:23.757 1966 1966 E AndroidRuntime: java.lang.RuntimeException: Could not read input channel file descriptors from parcel.
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.view.InputChannel.nativeReadFromParcel(Native Method)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.view.InputChannel.readFromParcel(InputChannel.java:148)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.view.InputChannel$1.createFromParcel(InputChannel.java:39)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.view.InputChannel$1.createFromParcel(InputChannel.java:36)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at com.android.internal.view.InputBindResult.<init>(InputBindResult.java:68)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at com.android.internal.view.InputBindResult$1.createFromParcel(InputBindResult.java:112)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at com.android.internal.view.InputBindResult$1.createFromParcel(InputBindResult.java:109)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at com.android.internal.view.IInputMethodManager$Stub$Proxy.windowGainedFocus(IInputMethodManager.java:755)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.view.inputmethod.InputMethodManager.startInputInner(InputMethodManager.java:1283)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.view.inputmethod.InputMethodManager.onWindowFocus(InputMethodManager.java:1491)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.view.ViewRootImpl$ViewRootHandler.handleMessage(ViewRootImpl.java:3767)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.os.Handler.dispatchMessage(Handler.java:111)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.os.Looper.loop(Looper.java:194)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at android.app.ActivityThread.main(ActivityThread.java:5677)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at java.lang.reflect.Method.invoke(Native Method)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at java.lang.reflect.Method.invoke(Method.java:372)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:985)
03-13 10:55:23.757 1966 1966 E AndroidRuntime: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:780)
没有明代码的日志,注定是个不太好解的bug,接着找日志
03-13 10:55:19.899 1966 1966 I art : Alloc concurrent mark sweep GC freed 3(96B) AllocSpace objects, 0(0B) LOS objects, 8% free, 41MB/45MB, paused 1.572ms total 135.273ms
03-13 10:55:19.899 1966 1966 W art : Large object allocation failed: ashmem_create_region failed for 'large object space allocation': Too many open files
03-13 10:55:19.899 1966 1966 E art : Throwing OutOfMemoryError "Failed to allocate a 16396 byte allocation with 4194208 free bytes and 214MB until OOM"
03-13 10:55:19.900 1966 1966 W art : Large object allocation failed: ashmem_create_region failed for 'large object space allocation': Too many open files
03-13 10:55:19.908 1966 1966 I art : Alloc sticky concurrent mark sweep GC freed 37(22KB) AllocSpace objects, 0(0B) LOS objects, 4% free, 41MB/43MB, paused 1.530ms total 7.990ms
03-13 10:55:19.908 1966 1966 W art : Large object allocation failed: ashmem_create_region failed for 'large object space allocation': Too many open files
出现了 Too many open files, 以前也没遇到过这种,所以先进行了一轮百度,原来是fd持有数超过了1024,导致进程崩溃。但是可疑的地方很多, thread, handlerThread, sql, socket, file,binder,resource。。。
根据这个思路查找项目中用到这些的地方,发现并没有出现引用资源没释放等问题。
接着从日志中找到
03-13 10:55:19.033 1966 1966 E SQLiteLog: (14) cannot open file at line 27843 of [9491ba7d73]
03-13 10:55:19.033 1966 1966 E SQLiteLog: (14) os_unix.c:27843: (24) open(/data/data/***/databases) -
03-13 10:55:19.041 1966 1966 E SQLiteLog: (14) cannot open file at line 27843 of [9491ba7d73]
03-13 10:55:19.041 1966 1966 E SQLiteLog: (14) os_unix.c:27843: (24) open(/data/data/***/databases) -
就误区进入了Sql,无所获
3
接着查看fd日志,ls -al /proc/pid/fd
lrwx------ system system 2020-03-14 19:39 0 -> /dev/null
lrwx------ system system 2020-03-14 19:39 1 -> /dev/null
lrwx------ system system 2020-03-14 19:39 10 -> /dev/binder
lrwx------ system system 2020-03-14 19:39 100 -> socket:[20584]
l-wx------ system system 2020-03-14 19:39 101 -> pipe:[19938]
lrwx------ system system 2020-03-14 19:39 102 -> anon_inode:[eventpoll]
lr-x------ system system 2020-03-14 19:39 103 -> /data/app/Tingban_K-radio.apk
lr-x------ system system 2020-03-14 19:39 104 -> pipe:[19940]
l-wx------ system system 2020-03-14 19:39 105 -> pipe:[19940]
lrwx------ system system 2020-03-14 19:39 106 -> anon_inode:[eventpoll]
运行一段时间之后出现大量的
lrwx------ system system 2020-03-14 19:40 220 -> socket:[30466]
lrwx------ system system 2020-03-14 19:40 221 -> socket:[32544]
lrwx------ system system 2020-03-14 19:40 222 -> socket:[32583]
lrwx------ system system 2020-03-14 19:40 223 -> socket:[33258]
lrwx------ system system 2020-03-14 19:40 224 -> socket:[32643]
lrwx------ system system 2020-03-14 19:40 225 -> socket:[33818]
lrwx------ system system 2020-03-14 19:40 226 -> socket:[32699]
lrwx------ system system 2020-03-14 19:40 227 -> socket:[33451]
lrwx------ system system 2020-03-14 19:40 228 -> socket:[32766]
lrwx------ system system 2020-03-14 19:40 229 -> socket:[34855]
但是项目中没有用socket,很是迷茫
因为项目是静置后来运行的,于是乎就去查询线程,看看后台哪些线程在运行
找到ctx_run线程,每隔两秒钟就运行一次,其他线程都是sleep状态。有点苗头了,于是项目一搜,没有这个名称的线程。。。然后查询 ctx_run的线程号,在AS中寻找该线程号的日志,终于找到问题
但是这个日志不是我们项目打的,时间上跟ctx_run的运行时间吻合,并且在该日志上面,找到我们项目打印的日志,定位到这个问题属于第三方库,问题终于结束。
花了20个小时查这个问题,做个记录。
fd-1024问题影响并不止这些,看下面的log:
03-13 11:36:50.172 32093 32303 E Parcel : writeDupFileDescriptor: error 24 dup fd 1023
03-13 11:36:50.172 234 977 E Parcel : dup() failed in Parcel::read, i is 0, fds[i] is -1, fd_count is 1, error: Bad file number
03-13 11:36:50.172 234 977 E BufferQueueProducer: [com.ecarx.xiaoka/com.ecarx.magicbook.ui.MagicActivity](this:0x7f99a27000,id:286,api:1,p:32093,c:234) queueBuffer: fence is NULL
03-13 11:36:50.172 32093 32303 E Surface : queueBuffer: error queuing buffer to SurfaceTexture, -22
从log上看,fd 1023问题导致系统dup操作失败,导致后面surface渲染界面出现error,出现的现象是什么呢?就是原来点击一个按钮会跳转到下一个界面,但是因为surface渲染失败,表面上看不到页面被加载,但是实际上该页面已经加入到堆栈中了。疑难杂症,做个记录。