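This series already has four posts. Among freeze-and-reboot problems, Watchdog issues are the most common, so this post covers how Watchdog works and the usual routine for analyzing Watchdog problems.
Application and System Stability, Part 1: The General Routine for ANR Analysis
Application and System Stability, Part 2: ANR Detection and Information Collection
Application and System Stability, Part 3: Notes on FD Leaks
Application and System Stability, Part 4: Analysis of a Null Pointer Problem Caused by a Single Thread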
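I. Watchdog Basics
1. What is Watchdog?
A watchdog bites if it is not "fed" on time; here the deadline is one minute. Android runs well over a hundred system services. To keep a hung core service in SystemServer from freezing the screen indefinitely, Android introduced the Watchdog mechanism: when a fault is detected, Watchdog calls Process.killProcess(Process.myPid()) to kill the system_server process. system_server is the first process that zygote forks, and together the two hold up the Java world: if either one dies, the Java world goes down with it. So when the child process system_server dies, zygote kills itself, and every process zygote has spawned is restarted. The effect is a soft reboot of the phone, so the user is not left with a frozen, unusable device.
That is how the system responds once a Watchdog fault occurs. What we as engineers care about is where exactly the Watchdog fired. As with ANR, traces have to be dumped when a Watchdog fires so that the problem can be located and fixed, which in turn requires a mechanism that can detect the timeout in the first place.
The Watchdog code lives in /frameworks/base/services/core/java/com/android/server/Watchdog.java
Two kinds of log are common: "Blocked in handler" and "Blocked in monitor". The difference is explained below.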
11-15 06:56:39.696 24203 24902 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui)
11-15 06:56:39.696 24203 24902 W Watchdog: main thread stack trace:
11-15 06:56:39.696 24203 24902 W Watchdog: at android.os.MessageQueue.nativePollOnce(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog: at android.os.MessageQueue.next(MessageQueue.java:323)
11-15 06:56:39.696 24203 24902 W Watchdog: at android.os.Looper.loop(Looper.java:142)
11-15 06:56:39.696 24203 24902 W Watchdog: at com.android.server.SystemServer.run(SystemServer.java:377)
11-15 06:56:39.696 24203 24902 W Watchdog: at com.android.server.SystemServer.main(SystemServer.java:239)
11-15 06:56:39.696 24203 24902 W Watchdog: at java.lang.reflect.Method.invoke(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog: at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:901)
11-15 06:56:39.696 24203 24902 W Watchdog: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:791)
11-15 06:56:39.696 24203 24902 W Watchdog: ui thread stack trace:
......
10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
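When triaging many logs, it can help to split the kill line into its individual blocked checkers. The following is a minimal sketch, assuming the subject always has the two shapes shown in the logs above ("Blocked in handler on ..." and "Blocked in monitor ... on ..."). WatchdogSubjectParser and its fields are hypothetical names invented for this illustration and are not part of AOSP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log helper, not part of AOSP. */
class WatchdogSubjectParser {
    /** One blocked checker taken from the kill line. */
    static final class Blocked {
        final String kind;     // "handler" or "monitor"
        final String monitor;  // monitor class name; null for handler blocks
        final String checker;  // checker name, e.g. "main thread"
        final String thread;   // thread name, e.g. "android.fg"

        Blocked(String kind, String monitor, String checker, String thread) {
            this.kind = kind;
            this.monitor = monitor;
            this.checker = checker;
            this.thread = thread;
        }
    }

    private static final String MARKER = "WATCHDOG KILLING SYSTEM PROCESS: ";

    // Assumed shape: Blocked in <kind> [<monitor class>] on <checker> (<thread>)
    private static final Pattern ENTRY = Pattern.compile(
            "Blocked in (handler|monitor) (?:(\\S+) )?on (.+?) \\(([^)]+)\\)");

    /** Returns every blocked checker named on the line, or an empty list if it is not a kill line. */
    static List<Blocked> parse(String logLine) {
        List<Blocked> result = new ArrayList<>();
        int idx = logLine.indexOf(MARKER);
        if (idx < 0) {
            return result;
        }
        Matcher m = ENTRY.matcher(logLine.substring(idx + MARKER.length()));
        while (m.find()) {
            result.add(new Blocked(m.group(1), m.group(2), m.group(3), m.group(4)));
        }
        return result;
    }
}
```

A "handler" entry points at a busy message queue on the named thread, while a "monitor" entry also names the Monitor class whose monitor() call did not return.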
2. Initialization
Watchdog extends Thread and is initialized while SystemServer starts up.
public final class SystemServer {
... ...
/**
* Starts a miscellaneous grab bag of stuff that has yet to be refactored
* and organized.
*/
private void startOtherServices() {
......
try {
......
traceBeginAndSlog("InitWatchdog");
final Watchdog watchdog = Watchdog.getInstance(); // obtain the Watchdog singleton
watchdog.init(context, mActivityManagerService); // register a receiver for the system reboot broadcast
Trace.traceEnd(Trace.TRACE_TAG_SYSTEM_SERVER);
......
}
......
mActivityManagerService.systemReady(new Runnable() {
@Override
public void run() {
......
Watchdog.getInstance().start();
......
}
});
}
public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog();
    }

    return sWatchdog;
}
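To implement timeout detection, the Watchdog constructor builds a number of HandlerChecker objects, which fall into two categories:
- Monitor Checker: checks for possible deadlocks on Monitor objects. Core system services such as AMS, PKMS and WMS are all Monitor objects.
- Looper Checker: checks whether a thread's message queue has been busy for too long. Watchdog's own message queue and the global ui, io and display queues are all checked. The message queues of some other important threads, such as those of AMS and PKMS, are also added as Looper Checkers when the corresponding objects are initialized.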
/* This handler will be used to post message back onto the main thread */
final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
private Watchdog() {
    // Calls the parent Thread constructor to set the thread name
    super("watchdog");
    // Initialize handler checkers for each common thread we want to check. Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.

    // The shared foreground thread is the main checker. It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread. We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());
    // Added in Android O: monitoring for FD leaks
    mOpenFdMonitor = OpenFdMonitor.create();
    ......
}
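DEFAULT_TIMEOUT is normally one minute; for installd the timeout is 10 minutes.
The two kinds of HandlerChecker have different focuses:
A Monitor Checker warns us not to hold the object lock of a core system service for a long time, because that blocks many other methods from running.
A Looper Checker warns us not to hog a message queue for a long time, because other messages would never get processed.
Watchdog relies on these two kinds of checker to do its job.
3. Basic principles
3.1 How Checker objects are added
Take AMS as an example: it registers both a Monitor Checker and a Looper Checker, and it implements the Watchdog.Monitor interface by overriding monitor().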
public class ActivityManagerService extends IActivityManager.Stub
implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
......
public ActivityManagerService(Context systemContext) {
......
Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
......
}
......
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
synchronized (this) { }
}
......
}
When AMS is constructed, it calls Watchdog's addMonitor and addThread to register itself and mHandler, its MainHandler instance.
public void addThread(Handler thread) {
    addThread(thread, DEFAULT_TIMEOUT);
}

public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Threads can't be added once the Watchdog is running");
        }
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}

public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor);
    }
}
mMonitorChecker is a HandlerChecker, so addMonitor here delegates to HandlerChecker's addMonitor method, while mHandlerCheckers is an ArrayList, so the new checker is simply added to it.
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }

    public void addMonitor(Monitor monitor) {
        mMonitors.add(monitor);
    }
    ......
}
3.2 Core principle
With the checkers registered, how are they used? Since Watchdog extends Thread, go straight to its run method.
@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        // Tracks whether a debugger is, or recently was, attached
        int debuggerWasConnected = 0;
        synchronized (this) {
            // CHECK_INTERVAL is half of DEFAULT_TIMEOUT, normally 30s
            long timeout = CHECK_INTERVAL;
            // 1. Schedule a check on every HandlerChecker
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }
            .....
            // 2. Wait out the check interval
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            // 3. Evaluate the state, which is one of four values:
            //    COMPLETED, WAITING, WAITED_HALF or OVERDUE
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                // The monitors have returned; reset
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    // Half of the timeout has elapsed: dump stack traces for the first time
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                    waitedHalf = true;
                }
                continue;
            }

            // Reaching this point means at least one HandlerChecker has timed out
            blockedCheckers = getBlockedCheckersLocked();
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }

        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        // Record in the event log that a watchdog fired
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // Dump stack traces for the processes in pids and in getInterestingNativePids()
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, getInterestingNativePids());

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);

        // Pull our own kernel thread stacks as well if we're configured for that
        // (dumpKernelStackTraces)
        if (RECORD_KERNEL_THREADS) {
            dumpKernelStackTraces();
        }

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked. (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    // Add the error to the DropBox
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null,
                            subject, null, stack, null);
                }
            };
        dropboxThread.start();
        ......

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            for (int i=0; i<blockedCheckers.size(); i++) {
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                StackTraceElement[] stackTrace
                        = blockedCheckers.get(i).getThread().getStackTrace();
                for (StackTraceElement element: stackTrace) {
                    Slog.w(TAG, "    at " + element);
                }
            }
            Slog.w(TAG, "*** GOODBYE!");
            // Finally, kill the system process
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}
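Summary of the principle:
- 1. Every service that needs monitoring either calls Watchdog's addMonitor to add a Monitor Checker to the mMonitors list, or calls addThread to add a Looper Checker to the mHandlerCheckers list.
- 2. Once the Watchdog thread starts, its run method loops forever:
  - First, it calls HandlerChecker#scheduleCheckLocked on every entry in mHandlerCheckers.
  - Second, it periodically checks for timeouts. The interval between checks is set by the CHECK_INTERVAL constant, 30 seconds. Each check calls evaluateCheckerCompletionLocked() to evaluate the completion state of the HandlerCheckers:
    COMPLETED means the check has finished.
    WAITING and WAITED_HALF mean the check is still pending but has not timed out; a trace is dumped once on WAITED_HALF.
    OVERDUE means the check has timed out. By default the timeout is one minute.
- 3. If the timeout expires and some HandlerChecker is still incomplete (OVERDUE), getBlockedCheckersLocked() returns the blocked HandlerCheckers, a description is generated, and logs are saved, including runtime stack traces.
- 4. Finally, the SystemServer process is killed.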
That is the rough summary; a few details still deserve a closer look.
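The four states can be modeled in a few lines of plain Java. This is only a sketch under stated assumptions: the article does not show the body of evaluateCheckerCompletionLocked(), so the thresholds here (half of the timeout, then the full timeout) and the numeric ordering of the constants are inferred from the description of the states, and CheckerStateModel is a hypothetical class, not AOSP code.

```java
import java.util.List;

/** Simplified model of the completion states; hypothetical, not AOSP code. */
class CheckerStateModel {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Snapshot of one checker: whether it finished, how long it has waited, its timeout. */
    static final class Checker {
        final boolean completed;
        final long waitedMillis;
        final long waitMaxMillis;

        Checker(boolean completed, long waitedMillis, long waitMaxMillis) {
            this.completed = completed;
            this.waitedMillis = waitedMillis;
            this.waitMaxMillis = waitMaxMillis;
        }

        int state() {
            if (completed) {
                return COMPLETED;
            }
            if (waitedMillis < waitMaxMillis / 2) {
                return WAITING;
            }
            if (waitedMillis < waitMaxMillis) {
                return WAITED_HALF;
            }
            return OVERDUE;
        }
    }

    /** The overall state is the worst state among all checkers. */
    static int evaluate(List<Checker> checkers) {
        int state = COMPLETED;
        for (Checker c : checkers) {
            state = Math.max(state, c.state());
        }
        return state;
    }
}
```

With a 60-second timeout, a checker that has been pending for 30 seconds is WAITED_HALF in this model, which matches the first trace dump, and one pending for 60 seconds is OVERDUE.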
3.2.1 What does HandlerChecker#scheduleCheckLocked do?
public void scheduleCheckLocked() {
    // No monitors AND the message queue is polling (idle): nothing is blocked,
    // so set mCompleted = true and return immediately
    if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked. This avoid having
        // to do a context switch to check the thread. Note that we
        // only do this if mCheckReboot is false and we have no
        // monitors, since those would need to be executed at this point.
        mCompleted = true;
        return;
    }
    ......
    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    // Post this checker to the front of mHandler's message queue
    mHandler.postAtFrontOfQueue(this);
}
If the posted message gets to run, the run method below is entered and calls each monitor(), which tries to acquire the lock.
public final class HandlerChecker implements Runnable {
.......
@Override
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
}
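For a Looper Checker, scheduleCheckLocked looks at whether the thread's message queue is idle. If a monitored queue never gets a chance to go idle, a message may have been blocking it for a long time.
If the message posted by scheduleCheckLocked does get executed, a Monitor Checker calls the monitor() method of each registered implementation, such as AMS.monitor() above. The implementation is usually trivial: it just acquires the object's own lock. If that lock is already held by another thread, monitor() stays blocked until the lock is released, and if that takes too long the checker times out.
If the message posted by scheduleCheckLocked cannot be executed, the message ahead of it is still running and has not finished, so the checker times out as well. It is a neat design: postAtFrontOfQueue kills two birds with one stone, detecting both a lock that is held for too long and a Message in the queue that takes too long.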
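The same "one probe catches both failures" idea can be reproduced outside Android with plain java.util.concurrent. The sketch below is an analogy, not the Watchdog implementation: a single-thread ExecutorService stands in for the Looper thread, the probe only takes and releases a lock like AMS.monitor(), and MiniWatchdog with its helper methods is a hypothetical name. One deliberate difference: an executor queue is FIFO, so the probe waits behind all queued work, whereas postAtFrontOfQueue only waits for the message currently running.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Plain-Java model of one HandlerChecker round; all names here are hypothetical. */
class MiniWatchdog {
    /**
     * Posts a probe to the target thread and waits for it to finish.
     * Returns true if the probe ran and acquired the lock within the timeout.
     */
    static boolean check(ExecutorService targetThread, Object monitoredLock, long timeoutMillis) {
        Future<?> probe = targetThread.submit(() -> {
            synchronized (monitoredLock) {
                // Acquiring the lock is the whole check.
            }
        });
        try {
            probe.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the queue was busy, or the lock was held too long
        } catch (InterruptedException | ExecutionException e) {
            return false;
        }
    }

    /** Grabs the lock on a helper thread; run the returned Runnable to release it. */
    static Runnable holdLock(Object lock) {
        CountDownLatch acquired = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                acquired.countDown();
                awaitQuietly(release);
            }
        });
        holder.setDaemon(true);
        holder.start();
        awaitQuietly(acquired); // return only once the lock is really held
        return release::countDown;
    }

    /** Keeps the target thread busy with one long-running task. */
    static void occupy(ExecutorService targetThread, long millis) {
        targetThread.submit(() -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling check() while holdLock() is active reproduces a "Blocked in monitor" situation, and calling it right after occupy() reproduces "Blocked in handler"; in both cases the caller only sees that the probe did not complete in time.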
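II. Case Analysis
When analyzing a Watchdog problem, first confirm that the traces are usable. From the analysis above, Watchdog dumps traces twice, at 30 seconds and at one minute. Suppose we see the following log.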
09-24 11:25:43.442 1000 1540 2033 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on ActivityManager (ActivityManager)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: ActivityManager stack trace:
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.MessageQueue.nativePollOnce(Native Method)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.MessageQueue.next(MessageQueue.java:325)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.Looper.loop(Looper.java:148)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: *** GOODBYE!
Then look at the ActivityManager thread in the trace.
"ActivityManager" prio=5 tid=12 Blocked
group="main" sCount=1 dsCount=0 flags=1 obj=0x13180c38 self=0x73bb923600
sysTid=1579 nice=-2 cgrp=default sched=0/0 handle=0x73adbcf4f0
state=S schedstat=( 3039883125048 14149853235996 6778200 ) utm=112965 stm=191023 core=6 HZ=100
stack=0x73adacd000-0x73adacf000 stackSize=1037KB
held mutexes=
at com.android.server.am.ActiveServices.serviceTimeout(ActiveServices.java:3486)
waiting to lock <0x0748826a> (a com.android.server.am.ActivityManagerService) held by thread 10
at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2032)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:173)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
ActivityManager is blocked by thread 10, so look at thread 10's trace next.
"Binder:1540_1C" prio=5 tid=10 Native
group="main" sCount=1 dsCount=0 flags=1 obj=0x1318deb8 self=0x73c0817600
sysTid=8946 nice=-4 cgrp=default sched=0/0 handle=0x739db674f0
state=S schedstat=( 2025031009459 6852098325718 5020435 ) utm=136019 stm=66484 core=1 HZ=100
stack=0x739da6d000-0x739da6f000 stackSize=1005KB
held mutexes=
kernel: __switch_to+0x9c/0xd0
kernel: futex_wait_queue_me+0xc4/0x13c
kernel: futex_wait+0xe4/0x204
kernel: do_futex+0x170/0x500
kernel: SyS_futex+0x90/0x1b0
kernel: __sys_trace+0x4c/0x4c
native: #00 pc 000000000001db2c /system/lib64/libc.so (syscall+28)
native: #01 pc 00000000000e74c8 /system/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+152)
native: #02 pc 00000000005227a8 /system/lib64/libart.so (art::GoToRunnable(art::Thread*)+440)
native: #03 pc 00000000005225a8 /system/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+28)
native: #04 pc 0000000000cb8fc0 /system/framework/arm64/boot-framework.oat (Java_android_os_Process_setThreadPriority__II+176)
at android.os.Process.setThreadPriority(Native method)
at com.android.server.ThreadPriorityBooster.boost(ThreadPriorityBooster.java:49)
at com.android.server.wm.WindowManagerThreadPriorityBooster.boost(WindowManagerThreadPriorityBooster.java:58)
at com.android.server.wm.WindowManagerService.boostPriorityForLockedSection(WindowManagerService.java:930)
at com.android.server.wm.WindowManagerService.containsDismissKeyguardWindow(WindowManagerService.java:3116)
locked <0x0b54880e> (a com.android.server.wm.WindowHashMap)
at com.android.server.am.ActivityRecord.hasDismissKeyguardWindows(ActivityRecord.java:1364)
at com.android.server.am.ActivityStack.checkKeyguardVisibility(ActivityStack.java:2070)
at com.android.server.am.ActivityStack.ensureActivitiesVisibleLocked(ActivityStack.java:1924)
at com.android.server.am.ActivityStackSupervisor.ensureActivitiesVisibleLocked(ActivityStackSupervisor.java:3626)
at com.android.server.am.ActivityStackSupervisor.attachApplicationLocked(ActivityStackSupervisor.java:1043)
at com.android.server.am.ActivityManagerService.attachApplicationLocked(ActivityManagerService.java:7471)
at com.android.server.am.ActivityManagerService.attachApplication(ActivityManagerService.java:7538)
locked <0x0748826a> (a com.android.server.am.ActivityManagerService)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:292)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3026)
at android.os.Binder.execTransact(Binder.java:704)
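The hop from the ActivityManager thread to thread 10 follows the "waiting to lock ... held by thread N" line, and that hop can be automated when a trace file has many blocked threads. Below is a minimal sketch, assuming thread headers look like `"name" prio=5 tid=12 Blocked` and lock lines look like the ones in the traces above. LockChain is a hypothetical helper written for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical trace helper: follows "held by thread N" links from one thread. */
class LockChain {
    private static final Pattern HEADER =
            Pattern.compile("^\"([^\"]+)\".*\\btid=(\\d+)\\b.*$");
    private static final Pattern WAITING =
            Pattern.compile("waiting to lock <[^>]+>.*held by thread (\\d+)");

    /** Returns the chain of threads starting at startThreadName, ending at the last holder. */
    static List<String> chainFrom(String traceText, String startThreadName) {
        Map<Integer, String> names = new HashMap<>();
        Map<Integer, Integer> waitsOn = new HashMap<>();
        Integer startTid = null;
        int currentTid = -1;
        for (String raw : traceText.split("\n")) {
            String line = raw.trim();
            Matcher header = HEADER.matcher(line);
            if (header.matches()) {
                currentTid = Integer.parseInt(header.group(2));
                names.put(currentTid, header.group(1));
                if (header.group(1).equals(startThreadName)) {
                    startTid = currentTid;
                }
                continue;
            }
            Matcher waiting = WAITING.matcher(line);
            if (currentTid >= 0 && waiting.find()) {
                waitsOn.put(currentTid, Integer.parseInt(waiting.group(1)));
            }
        }
        List<String> chain = new ArrayList<>();
        Integer tid = startTid;
        // The size bound stops the walk if the links form a cycle (a real deadlock).
        while (tid != null && chain.size() <= names.size()) {
            chain.add(names.getOrDefault(tid, "?") + " (tid=" + tid + ")");
            tid = waitsOn.get(tid);
        }
        return chain;
    }
}
```

The last entry of the chain is the thread worth reading in detail: it holds a lock that others want but is not itself waiting on a Java lock. If a thread name appears twice in the chain, the links form a cycle.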
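Is the setThreadPriority call in tid=10 timing out? The one-minute trace is missing, so we cannot conclude that the thread is stuck there. When traces are dumped, each thread in the Suspended state has its suspend_count_ incremented and is added to the suspended_count_modified_threads list; the threads in that list are then dumped together, and suspend_count_ is decremented for each thread once its dump is complete. A Suspended thread that wants to return from JNI to Java code (the Runnable state) checks suspend_count_ in GoToRunnable and, if it is not 0, waits there until it becomes 0. So this trace only shows that tid=10 was executing the native method setThreadPriority when the traces were dumped. To decide whether it is really stuck there, the two traces have to be compared.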
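Comparing the two dumps for one thread is mechanical enough to script. The sketch below assumes both trace files are available as text and that a thread's Java frames are the lines starting with "at " under its header; TraceDiff is a hypothetical helper. Identical stacks in both dumps are strong evidence that the thread did not move, not proof, since a thread can in principle pass through the same frames twice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: compares one thread's Java frames across two trace dumps. */
class TraceDiff {
    /** Collects the "at ..." frames of the thread with the given tid. */
    static List<String> javaFrames(String traceText, int tid) {
        List<String> frames = new ArrayList<>();
        boolean inThread = false;
        for (String raw : traceText.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A new thread header; check whether it is the thread we want.
                inThread = line.matches(".*\\btid=" + tid + "\\b.*");
                continue;
            }
            if (inThread && line.startsWith("at ")) {
                frames.add(line);
            }
        }
        return frames;
    }

    /** True if the thread exists in the first dump and has the same frames in the second. */
    static boolean looksStuck(String firstDump, String secondDump, int tid) {
        List<String> first = javaFrames(firstDump, tid);
        return !first.isEmpty() && first.equals(javaFrames(secondDump, tid));
    }
}
```

For the tid=10 thread above, looksStuck would only return true if the 30-second and one-minute dumps both showed the same setThreadPriority stack.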
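2.1 Case 1
On some phones, a Watchdog that fires during Monkey testing does not cause a reboot, and the symptom may be a frozen screen. In traces_SystemServer_WDT05_1月_23_50_59.974.txt, every thread turns out to be blocked by thread 73, and the two traces are identical.
Here is what thread 73 is doing: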
"Binder:1300_3" prio=5 tid=73 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x14f89110 self=0x7ee794d600
| sysTid=1774 nice=-10 cgrp=default sched=0/0 handle=0x7ecbc474f0
| state=S schedstat=( 59882636556 104471794509 273786 ) utm=3455 stm=2533 core=6 HZ=100
| stack=0x7ecbb4d000-0x7ecbb4f000 stackSize=1005KB
| held mutexes=
kernel: __switch_to+0x94/0xa8
kernel: binder_thread_read+0x460/0x10a0
kernel: binder_ioctl_write_read+0x21c/0x360
kernel: binder_ioctl+0x50c/0x798
kernel: do_vfs_ioctl+0xb8/0x800
kernel: SyS_ioctl+0x84/0x98
kernel: el0_svc_naked+0x24/0x28
native: #00 pc 00000000000690a4 /system/lib64/libc.so (__ioctl+4)
native: #01 pc 0000000000024638 /system/lib64/libc.so (ioctl+132)
native: #02 pc 0000000000061a10 /system/lib64/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+256)
native: #03 pc 00000000000627a8 /system/lib64/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+340)
native: #04 pc 00000000000624c8 /system/lib64/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+216)
native: #05 pc 0000000000056d98 /system/lib64/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+72)
native: #06 pc 000000000008a86c /system/lib64/libgui.so (???)
native: #07 pc 000000000009ec88 /system/lib64/libgui.so (_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj+260)
native: #08 pc 00000000000fb058 /system/lib64/libandroid_runtime.so (???)
native: #09 pc 0000000001326998 /system/framework/arm64/boot-framework.oat (Java_android_view_SurfaceControl_nativeScreenshot__Landroid_os_IBinder_2Landroid_graphics_Rect_2IIIIZZI+264)
at android.view.SurfaceControl.nativeScreenshot(Native method)
at android.view.SurfaceControl.screenshot(SurfaceControl.java:877)
at com.android.server.wm.DisplayContent.-com_android_server_wm_DisplayContent-mthref-0(DisplayContent.java:2863)
at com.android.server.wm.-$Lambda$OzPvdnGprtQoLZLCvw2GU8IaGyI.$m$0(unavailable:-1)
at com.android.server.wm.-$Lambda$OzPvdnGprtQoLZLCvw2GU8IaGyI.screenshot(unavailable:-1)
at com.android.server.wm.DisplayContent.screenshotApplications(DisplayContent.java:3125)
- locked <0x036ec628> (a com.android.server.wm.WindowHashMap)
at com.android.server.wm.DisplayContent.screenshotApplications(DisplayContent.java:2862)
at com.android.server.wm.AppWindowContainerController.screenshotApplications(AppWindowContainerController.java:749)
at com.android.server.am.ActivityRecord.screenshotActivityLocked(ActivityRecord.java:1650)
at com.android.server.am.ActivityRecord.setVisible(ActivityRecord.java:1675)
at com.android.server.am.ActivityStack.makeInvisible(ActivityStack.java:2078)
at com.android.server.am.ActivityStack.ensureActivitiesVisibleLocked(ActivityStack.java:1896)
at com.android.server.am.ActivityStackSupervisor.ensureActivitiesVisibleLocked(ActivityStackSupervisor.java:3575)
at com.android.server.am.ActivityManagerService.ensureConfigAndVisibilityAfterUpdate(ActivityManagerService.java:20965)
at com.android.server.am.ActivityManagerService.updateDisplayOverrideConfigurationLocked(ActivityManagerService.java:20897)
at com.android.server.am.ActivityManagerService.updateDisplayOverrideConfigurationLocked(ActivityManagerService.java:20867)
at com.android.server.am.ActivityStack.resumeTopActivityInnerLocked(ActivityStack.java:2608)
at com.android.server.am.ActivityStack.resumeTopActivityUncheckedLocked(ActivityStack.java:2246)
at com.android.server.am.ActivityStackSupervisor.resumeFocusedStackTopActivityLocked(ActivityStackSupervisor.java:2148)
at com.android.server.am.ActivityStack.completePauseLocked(ActivityStack.java:1480)
at com.android.server.am.ActivityStack.activityPausedLocked(ActivityStack.java:1406)
at com.android.server.am.ActivityManagerService.activityPaused(ActivityManagerService.java:7542)
- locked <0x08abeada> (a com.android.server.am.ActivityManagerService)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:317)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3018)
at android.os.Binder.execTransact(Binder.java:677)
It ends up stopped at the following two frames:
native: #06 pc 000000000008a86c /system/lib64/libgui.so (???)
native: #07 pc 000000000009ec88 /system/lib64/libgui.so (_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj+260)
Run addr2line -Cfe ./system/lib64/libgui.so 000000000009ec88:
_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj
frameworks/native/libs/gui/SurfaceComposerClient.cpp:1018 (discriminator 1)
1003 status_t ScreenshotClient::captureToBuffer(const sp<IBinder>& display,
1004 Rect sourceCrop, uint32_t reqWidth, uint32_t reqHeight,
1005 int32_t minLayerZ, int32_t maxLayerZ, bool useIdentityTransform,
1006 uint32_t rotation,
1007 sp<GraphicBuffer>* outBuffer) {
1008 sp<ISurfaceComposer> s(ComposerService::getComposerService());
1009 if (s == NULL) return NO_INIT;
1010
1011 sp<IGraphicBufferConsumer> gbpConsumer;
1012 sp<IGraphicBufferProducer> producer;
1013 BufferQueue::createBufferQueue(&producer, &gbpConsumer);
1014 sp<BufferItemConsumer> consumer(new BufferItemConsumer(gbpConsumer,
1015 GRALLOC_USAGE_HW_TEXTURE | GRALLOC_USAGE_SW_READ_NEVER | GRALLOC_USAGE_SW_WRITE_NEVER,
1016 1, true));
1017
1018 status_t ret = s->captureScreen(display, producer, sourceCrop, reqWidth, reqHeight,
1019 minLayerZ, maxLayerZ, useIdentityTransform,
1020 static_cast<ISurfaceComposer::Rotation>(rotation));
1021 if (ret != NO_ERROR) {
1022 return ret;
1023 }
1024 BufferItem b;
1025 consumer->acquireBuffer(&b, 0, true);
1026 *outBuffer = b.mGraphicBuffer;
1027 return ret;
1028}
Line 1018 calls captureScreen, which takes a screenshot, so the Watchdog fired while a screenshot was being taken. Following captureScreen, the corresponding surfaceflinger trace is:
"Binder:820_3" sysTid=1331
#00 pc 0000000000068fb8 /system/lib64/libc.so (__epoll_pwait+8)
#01 pc 000000000001fc68 /system/lib64/libc.so (epoll_pwait+48)
#02 pc 0000000000015c84 /system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+144)
#03 pc 0000000000015b6c /system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+108)
#04 pc 00000000000b921c /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger13captureScreenERKNS_2spINS_7IBinderEEERKNS1_INS_22IGraphicBufferProducerEEENS_4RectEjjiibNS_16ISurfaceComposer8RotationE+672)
#05 pc 0000000000088660 /system/lib64/libgui.so (_ZN7android17BnSurfaceComposer10onTransactEjRKNS_6ParcelEPS1_j+1788)
#06 pc 00000000000b8828 /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger10onTransactEjRKNS_6ParcelEPS1_j+144)
#07 pc 00000000000559ac /system/lib64/libbinder.so (_ZN7android7BBinder8transactEjRKNS_6ParcelEPS1_j+136)
#08 pc 0000000000061ecc /system/lib64/libbinder.so (_ZN7android14IPCThreadState14executeCommandEi+536)
#09 pc 0000000000061c04 /system/lib64/libbinder.so (_ZN7android14IPCThreadState20getAndExecuteCommandEv+156)
#10 pc 0000000000062250 /system/lib64/libbinder.so (_ZN7android14IPCThreadState14joinThreadPoolEb+60)
#11 pc 0000000000082bcc /system/lib64/libbinder.so (_ZN7android10PoolThread10threadLoopEv+24)
#12 pc 0000000000011674 /system/lib64/libutils.so (_ZN7android6Thread11_threadLoopEPv+280)
#13 pc 0000000000066970 /system/lib64/libc.so (_ZL15__pthread_startPv+36)
#14 pc 000000000001f474 /system/lib64/libc.so (__start_thread+68)
The trace of surfaceflinger's main thread is:
----- pid 820 at 2018-01-05 23:49:16 -----
Cmd line: /system/bin/surfaceflinger
ABI: 'arm64'
"surfaceflinger" sysTid=820
#00 pc 00000000000690a4 /system/lib64/libc.so (__ioctl+4)
#01 pc 0000000000024638 /system/lib64/libc.so (ioctl+132)
#02 pc 0000000000015210 /system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState14talkWithDriverEb+256)
#03 pc 0000000000015f58 /system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState15waitForResponseEPNS0_6ParcelEPi+60)
#04 pc 0000000000015d84 /system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState8transactEijRKNS0_6ParcelEPS2_j+216)
#05 pc 00000000000128d4 /system/lib64/libhwbinder.so (_ZN7android8hardware10BpHwBinder8transactEjRKNS0_6ParcelEPS2_jNSt3__18functionIFvRS2_EEE+72)
#06 pc 0000000000038bc4 /system/lib64/android.hardware.graphics.composer@2.1.so (_ZN7android8hardware8graphics8composer4V2_118BpHwComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+240)
#07 pc 0000000000091cc0 /system/lib64/libsurfaceflinger.so (_ZN7android4Hwc28Composer11createLayerEmPm+100)
#08 pc 000000000009a930 /system/lib64/libsurfaceflinger.so (_ZN4HWC27Display11createLayerEPNSt3__110shared_ptrINS_5LayerEEE+72)
#09 pc 00000000000c5304 /system/lib64/libsurfaceflinger.so (_ZN7android10HWComposer11createLayerEi+152)
#10 pc 00000000000b1524 /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger15setUpHWComposerEv+1560)
#11 pc 00000000000b096c /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger20handleMessageRefreshEv+108)
#12 pc 00000000000aa660 /system/lib64/libsurfaceflinger.so (_ZN7android16ExSurfaceFlinger20handleMessageRefreshEv+16)
#13 pc 00000000000b03e4 /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger17onMessageReceivedEi+260)
#14 pc 0000000000015d40 /system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+332)
#15 pc 0000000000015b6c /system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+108)
#16 pc 000000000008b944 /system/lib64/libsurfaceflinger.so (_ZN7android12MessageQueue11waitMessageEv+84)
#17 pc 00000000000af338 /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger3runEv+20)
#18 pc 0000000000002cfc /system/bin/surfaceflinger (main+948)
#19 pc 000000000001b8b0 /system/lib64/libc.so (__libc_init+88)
#20 pc 00000000000028a8 /system/bin/surfaceflinger (do_arm64_start+80)
The main thread is in createLayer and has in turn made a binder call from the surfaceflinger process into /vendor/bin/hw/android.hardware.graphics.composer@2.1-service, so next look at the trace of the corresponding graphics.composer thread.
"HwBinder:738_1" sysTid=1273
#00 pc 000000000001dc2c /system/lib64/libc.so (syscall+28)
#01 pc 0000000000066014 /system/lib64/libc.so (pthread_cond_wait+96)
#02 pc 000000000001fc3c /vendor/lib64/hw/hwcomposer.sdm845.so (_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm+120)
#03 pc 00000000000140e0 /vendor/lib64/hw/android.hardware.graphics.composer@2.1-impl.so (_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+84)
#04 pc 0000000000044840 /system/lib64/android.hardware.graphics.composer@2.1.so (_ZN7android8hardware8graphics8composer4V2_116BsComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+160)
#05 pc 000000000003fd10 /system/lib64/android.hardware.graphics.composer@2.1.so (_ZN7android8hardware8graphics8composer4V2_118BnHwComposerClient10onTransactEjRKNS0_6ParcelEPS5_jNSt3__18functionIFvRS5_EEE+2224)
#06 pc 0000000000011be0 /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware9BHwBinder8transactEjRKNS0_6ParcelEPS2_jNSt3__18functionIFvRS2_EEE+132)
#07 pc 00000000000156fc /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState14executeCommandEi+584)
#08 pc 0000000000015404 /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState20getAndExecuteCommandEv+156)
#09 pc 0000000000015b0c /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState14joinThreadPoolEb+60)
#10 pc 000000000001f5c8 /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware10PoolThread10threadLoopEv+24)
#11 pc 0000000000011674 /system/lib64/vndk-sp/libutils.so (_ZN7android6Thread11_threadLoopEPv+280)
#12 pc 0000000000066970 /system/lib64/libc.so (_ZL15__pthread_startPv+36)
#13 pc 000000000001f474 /system/lib64/libc.so (__start_thread+68)
Finally, this is where things are actually blocked:
#02 pc 000000000001fc3c /vendor/lib64/hw/hwcomposer.sdm845.so (_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm+120)
#03 pc 00000000000140e0 /vendor/lib64/hw/android.hardware.graphics.composer@2.1-impl.so (_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+84)
Run addr2line again:
addr2line -f -e hwcomposer.sdm845.so 1fc3c
_ZN3sdm6Locker4WaitEv
hardware/qcom/display/include/../sdm/include/utils/locker.h:141
addr2line -f -e android.hardware.graphics.composer@2.1-impl.so 140e0
_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE
hardware/interfaces/graphics/composer/2.1/default/ComposerClient.cpp:299
hardware/interfaces/graphics/composer/2.1/default/ComposerClient.cpp#299
295Return<void> ComposerClient::createLayer(Display display,
296 uint32_t bufferSlotCount, createLayer_cb hidl_cb)
297{
298 Layer layer = 0;
299 Error err = mHal.createLayer(display, &layer);
300 if (err == Error::NONE) {
301 std::lock_guard<std::mutex> lock(mDisplayDataMutex);
302
303 auto dpy = mDisplayData.find(display);
304 if (dpy != mDisplayData.end()) {
305 auto ly = dpy->second.Layers.emplace(layer, LayerBuffers()).first;
306 ly->second.Buffers.resize(bufferSlotCount);
307 } else {
308 layer = 0;
309 err = Error::BAD_DISPLAY;
310 }
311 }
312
313 hidl_cb(err, layer);
314 return Void();
315}
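Both addr2line lookups in this case start from a native frame line in a trace. A small sketch that turns such a line into the command string is shown below; it assumes frames have the "#NN pc ADDRESS LIBRARY" shape seen above and that the unstripped libraries sit under a local symbols directory. Addr2LineCommand and the symbolsRoot parameter are hypothetical, and the sketch only builds the string, it does not run the tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: builds an addr2line command line from a native trace frame. */
class Addr2LineCommand {
    // Assumed frame shape: "#07 pc 000000000009ec88 /system/lib64/libgui.so (...)"
    private static final Pattern FRAME =
            Pattern.compile("#\\d+\\s+pc\\s+([0-9a-fA-F]+)\\s+(\\S+)");

    /**
     * Returns the command for the frame, or null if the line is not a native frame.
     * symbolsRoot is the directory that mirrors the device paths, e.g. ".".
     */
    static String forFrame(String frameLine, String symbolsRoot) {
        Matcher m = FRAME.matcher(frameLine);
        if (!m.find()) {
            return null;
        }
        String address = m.group(1);
        String libraryPath = m.group(2);
        return "addr2line -Cfe " + symbolsRoot + libraryPath + " " + address;
    }
}
```

Passing the libgui.so frame from thread 73 with "." as symbolsRoot yields the same command that was typed by hand earlier in this case.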
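It looks like createLayer is the culprit, so the issue was handed to the colleagues who own the low-level display module for further analysis. A few Watchdog questions remain worth thinking about, such as how Watchdog has changed across Android versions and what happens if the Watchdog thread itself is blocked. Watchdog problems are varied and every module's logic is different; for reasons of space, these are left for readers to investigate.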