1.背景
测试同学在下午正常测试的时候突然出现系统重启,然后开机界面一直在转圈无法进入Launcher。
2.调查过程
2.1 问题初诊断
不幸的是车机adb口不好使,首先通过串口进入在Android系统,配置usb模式,让adb好使,通过ADB进入到车机。
进入车机之后使用`uptime`指令查看系统启动时间,看是Android重启,还是QNX重启了。
一看uptime的系统启动时间,发现才10分多钟,明显是QNX重启了。那这个问题从一个问题发散成两个问题了。
QNX故障导致系统整体重启
-
重启后系统无法进入Launcher
此文章主要记录问题2:系统服务发生死锁导致无法进入Launcher。
2.2 深入调查
2.2.1 查看进程启动情况
`ps -A`查看所有进程启动情况
......
mdnsr 733 1 5628 1184 poll_sche+ 0 S mdnsd
root 745 2 0 0 worker_th+ 0 I [kworker/4:3]
radio 747 1 13924 3064 poll_sche+ 0 S ipacm-diag
radio 751 1 22164 6476 futex_wai+ 0 S ipacm
# systemserver启动
system 761 341 4523728 162136 ep_poll 0 S system_server
root 866 2 0 0 worker_th+ 0 I [kworker/u13:2]
root 949 2 0 0 worker_th+ 0 I [kworker/1:4]
root 950 2 0 0 worker_th+ 0 I [kworker/2:3]
root 951 2 0 0 worker_th+ 0 I [kworker/2:4]
root 952 2 0 0 worker_th+ 0 I [kworker/2:5]
root 954 2 0 0 worker_th+ 0 I [kworker/2:6]
root 955 2 0 0 worker_th+ 0 I [kworker/1:5]
root 957 2 0 0 worker_th+ 0 I [kworker/1:6]
root 964 2 0 0 worker_th+ 0 I [kworker/1:7]
root 1006 2 0 0 worker_th+ 0 I [kworker/0:3]
root 1007 2 0 0 worker_th+ 0 I [kworker/1:8]
root 1008 2 0 0 worker_th+ 0 I [kworker/0:4]
root 1012 2 0 0 worker_th+ 0 I [kworker/0:5]
root 1015 2 0 0 worker_th+ 0 I [kworker/0:6]
root 1017 2 0 0 worker_th+ 0 I [kworker/0:7]
root 1020 2 0 0 worker_th+ 0 I [kworker/0:8]
root 1021 2 0 0 worker_th+ 0 I [kworker/0:9]
root 1022 2 0 0 worker_th+ 0 I [kworker/0:10]
root 1023 2 0 0 worker_th+ 0 I [kworker/0:11]
root 1025 2 0 0 worker_th+ 0 I [kworker/0:12]
# 场景服务启动
system 1045 341 4361960 67808 ep_poll 0 S com.gxa.car.scene
可以看到SystemServer已经启动完毕,在startOtherService已经把管理上层场景的服务已经拉起,但是后面就没有服务起来了。因此初步判断是有系统服务卡死了!
使用`kill -3 761`将当前所有所有线程的堆栈取出来。
在`/data/anr`路径下会生成一个名为**“trace_00”**的文件。导出此文件,开始分析是哪些线程发送卡死了。
2.2.2 分析trace日志
正常看trace一半先看main线程,但是看车机现象是一直卡在开机界面,不会重启,因此可以初步判断不是主线程卡死了,如果是主线程卡死了,watchdog会直接触发重启Android。不过还是看一下main线程状态。
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x74726b48 self=0x7b90614c00
| sysTid=729 nice=-2 cgrp=default sched=0/0 handle=0x7c16087548
| state=S schedstat=( 822945383 262318239 1924 ) utm=56 stm=25 core=1 HZ=100
| stack=0x7fdac06000-0x7fdac08000 stackSize=8MB
| held mutexes=
at com.android.server.am.ActivityManagerService.registerReceiver(ActivityManagerService.java:20857)
- waiting to lock <0x033008f9> (a com.android.server.am.ActivityManagerService) held by thread 86
at android.app.ContextImpl.registerReceiverInternal(ContextImpl.java:1488)
at android.app.ContextImpl.registerReceiver(ContextImpl.java:1449)
at com.android.server.net.NetworkStatsService.systemReady(NetworkStatsService.java:398)
at com.android.server.SystemServer.lambda$startOtherServices$4(SystemServer.java:1851)
at com.android.server.-$$Lambda$SystemServer$s9erd2iGXiS7bbg_mQJUxyVboQM.run(lambda:-1)
at com.android.server.am.ActivityManagerService.systemReady(ActivityManagerService.java:15282)
at com.android.server.SystemServer.startOtherServices(SystemServer.java:1777)
at com.android.server.SystemServer.run(SystemServer.java:444)
at com.android.server.SystemServer.main(SystemServer.java:303)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:493)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:838)
main线程registerReceiver()方法中等待<0x033008f9>这把锁。顺藤摸瓜,看看哪个线程持有这把锁。
"Binder:729_7" prio=5 tid=86 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x12e80020 self=0x7b87c51000
| sysTid=1259 nice=-2 cgrp=default sched=0/0 handle=0x7b6e7df4f0
| state=S schedstat=( 3735363 3189481 17 ) utm=0 stm=0 core=2 HZ=100
| stack=0x7b6e6e4000-0x7b6e6e6000 stackSize=1009KB
| held mutexes=
at com.android.server.wm.WindowManagerService.deferSurfaceLayout(WindowManagerService.java:2894)
- waiting to lock <0x09464241> (a com.android.server.wm.WindowHashMap) held by thread 20
at com.android.server.am.ActivityManagerService.handleAppDiedLocked(ActivityManagerService.java:5928)
at com.android.server.am.ActivityManagerService.appDiedLocked(ActivityManagerService.java:6107)
at com.android.server.am.ActivityManagerService$AppDeathRecipient.binderDied(ActivityManagerService.java:1885)
- locked <0x033008f9> (a com.android.server.am.ActivityManagerService)
at android.os.BinderProxy.sendDeathNotice(Binder.java:1193)
名为Binder:729_7的86号线程持有main线程的<0x033008f9>锁,但是该线程执行deferSurfaceLayout方法依赖<0x09464241>锁(锁的名字WindowHashMap的实例,看源码mWindowMap这个对象)释放。继续顺藤摸瓜。
"android.display" prio=5 tid=20 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x12c40f88 self=0x7b907ddc00
| sysTid=806 nice=-4 cgrp=default sched=0/0 handle=0x7b74e0a4f0
| state=S schedstat=( 28085321 36714886 210 ) utm=1 stm=1 core=3 HZ=100
| stack=0x7b74d07000-0x7b74d09000 stackSize=1041KB
| held mutexes=
at com.android.server.policy.PhoneWindowManager.canDismissBootAnimation(PhoneWindowManager.java:7637)
- waiting to lock <0x0a6033e6> (a java.lang.Object) held by thread 13
at com.android.server.wm.WindowManagerService.performEnableScreen(WindowManagerService.java:3455)
- locked <0x09464241> (a com.android.server.wm.WindowHashMap)
at com.android.server.wm.WindowManagerService.access$1100(WindowManagerService.java:272)
at com.android.server.wm.WindowManagerService$H.handleMessage(WindowManagerService.java:4861)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:193)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
android.display线程持有<0x09464241>锁,需要释放该锁,依赖执行canDismissBootAnimation()方法释放<0x0a6033e6>锁(源码中该锁名字叫mLock),继续顺藤摸瓜。
"android.ui" prio=5 tid=13 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x12c408b0 self=0x7b907d7c00
| sysTid=799 nice=-2 cgrp=default sched=0/0 handle=0x7b756324f0
| state=S schedstat=( 24267703 15746873 101 ) utm=2 stm=0 core=0 HZ=100
| stack=0x7b7552f000-0x7b75531000 stackSize=1041KB
| held mutexes=
at com.android.server.wm.WindowManagerService$LocalService.waitForAllWindowsDrawn(WindowManagerService.java:7370)
- waiting to lock <0x09464241> (a com.android.server.wm.WindowHashMap) held by thread 20
at com.android.server.policy.PhoneWindowManager.finishKeyguardDrawn(PhoneWindowManager.java:6885)
at com.android.server.policy.PhoneWindowManager.screenTurningOn(PhoneWindowManager.java:6938)
- locked <0x0a6033e6> (a java.lang.Object)
at com.android.server.policy.PhoneWindowManager.systemBooted(PhoneWindowManager.java:7631)
at com.android.server.wm.WindowManagerService.enableScreenAfterBoot(WindowManagerService.java:3392)
at com.android.server.am.ActivityManagerService.enableScreenAfterBoot(ActivityManagerService.java:7993)
at com.android.server.am.ActivityManagerService.ensureBootCompleted(ActivityManagerService.java:8162)
at com.android.server.am.ActivityManagerService$UiHandler.handleMessage(ActivityManagerService.java:2033)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:193)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
at com.android.server.UiThread.run(UiThread.java:43)
android.ui线程持有<0x0a6033e6>锁,执行依赖waitForAllWindowsDrawn()依赖<0x09464241>释放!!!!
问题终于找到了:
android.display线程持<0x09464241>mWindowMap对象锁,依赖<0x0a6033e6>mLock对象锁释放。
android.ui线程持有<0x0a6033e6>mLock对象锁,依赖<0x09464241>mWindowMap对象锁释放。
发生循环依赖了。导致死锁的场景发生!!!
解决思路:发生死锁一般解决办法就是翻看源码解开其中一个方法的锁即可。
2.3源码调查
从上面分析看主要是执行canDismissBootAnimation()和waitForAllWindowsDrawn()中造成循环依赖了。因此直接调查这两个方法。
/frameworks/base/services/core/java/com/android/server/policy/PhoneWindowManager.java
@Override
public boolean canDismissBootAnimation() {
synchronized (mLock) {
return mKeyguardDrawComplete;
}
}
/frameworks/base/services/core/java/com/android/server/wm/WindowManagerService.java
@Override
public void waitForAllWindowsDrawn(Runnable callback, long timeout) {
boolean allWindowsDrawn = false;
synchronized (mWindowMap) {
mWaitingForDrawnCallback = callback;
getDefaultDisplayContentLocked().waitForAllWindowsDrawn();
mWindowPlacerLocked.requestTraversal();
mH.removeMessages(H.WAITING_FOR_DRAWN_TIMEOUT);
if (mWaitingForDrawn.isEmpty()) {
allWindowsDrawn = true;
} else {
mH.sendEmptyMessageDelayed(H.WAITING_FOR_DRAWN_TIMEOUT, timeout);
checkDrawnWindowsLocked();
}
}
if (allWindowsDrawn) {
callback.run();
}
}
初步一看,canDismissBootAnimation()方法的mLock锁只是为了保证返回mKeyguardDrawComplete局部变量的原子性。而waitForAllWindowsDrawn()方法的mWindowMap是为了保证那么大一段代码的原子性。
解决办法很简单,移除mKeyguardDrawComplete返回的锁即可。
2.4 解决后代码
/frameworks/base/services/core/java/com/android/server/policy/PhoneWindowManager.java
@Override
public boolean canDismissBootAnimation() {
return mKeyguardDrawComplete;
}
3.回顾
咋一想,居然google原生代码也出现了死锁问题,还想着是不是可以取给google提交一个bug和修复方案,万一采纳了呢,那不是牛逼了。
由于当前项目用的Android P的代码,于是赶紧去看看Android Q 谷歌有没有修复该代码。
@Override
public boolean canDismissBootAnimation() {
return mDefaultDisplayPolicy.isKeyguardDrawComplete();
}
# frameworks/base/services/core/java/com/android/server/wm/DisplayPolicy.java
public boolean isKeyguardDrawComplete() {
return mKeyguardDrawComplete;
}
哎哎哎,还是太年轻,too young,too simple了。看来Android P的漏洞,Android Q已经修复了,修复的方法和我开始调查的时候修复的方法一样,还是挺开心的。因此顺利提交代码到项目中了。