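I. Introduction
Taken literally, a "watchdog" is a dog that keeps watch, and the name implies guarding something. Watchdogs were first introduced in microcontroller systems. A microcontroller's operating environment is prone to external electromagnetic interference, which can make the program "run away" and leave the whole system unable to work. A watchdog was therefore added to monitor the microcontroller's running state in real time and to take protective action when a fault occurs, for example by restarting the system. This kind of watchdog works at the hardware level and requires dedicated circuitry.
Linux has a watchdog as well. In the Linux kernel, starting the watchdog arms a timer; if nothing writes to /dev/watchdog before the timeout expires, the system restarts. A watchdog implemented with a timer in this way works at the software level.
Android also has a software-level Watchdog, which protects important system services such as AMS, WMS, and PMS. These core services all run inside the system_server process, so when one of them gets stuck, Watchdog typically kills system_server, which causes the Android system to restart. This is why you occasionally see Android restart after system_server was killed by Watchdog.
That was a brief introduction. So how does Watchdog work in Android, and how does it check the system services? This article analyzes the Watchdog source code and mechanism based on Android 8.1.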
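II. Registering and Starting Watchdog
a. Registration
The framework source shows that the constructors of AMS and WMS register the corresponding types of Watchdog check:
1. Monitor registration
A monitor is registered as follows: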
Watchdog.getInstance().addMonitor(this);
The callback method is:
public void monitor() {
synchronized (this) { }
}
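This checks that the object lock of a core system service is not held for too long, because a lock held too long blocks many other methods from running.
2. Handler registration
A handler is registered as follows: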
Watchdog.getInstance().addThread(mHandler);
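This checks whether some message has been hogging the handler's thread for a long time, because other messages would then never get processed.
Either condition leaves the system stuck (System Not Responding). Both are analyzed later.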
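To make the empty synchronized block in monitor() less mysterious, here is a minimal plain-Java sketch of the idea, not framework code. FakeService, LockProbeDemo, and the timeouts are all made up for illustration. One thread holds the service lock, and a probe thread calling monitor() cannot return until the lock is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a core service guarded by its own object lock.
class FakeService {
    // Same shape as the monitor() callback above: it returns only once
    // the service's lock can be acquired.
    public void monitor() {
        synchronized (this) { }
    }
}

class LockProbeDemo {
    // Runs service.monitor() on a helper thread and reports whether it
    // returned within timeoutMillis.
    static boolean probe(FakeService service, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            service.monitor();
            done.countDown();
        }, "probe");
        t.setDaemon(true);
        t.start();
        return await(done, timeoutMillis);
    }

    // Returns {probe result while the lock is held, probe result after release}.
    static boolean[] scenario() {
        FakeService service = new FakeService();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service) {      // simulate a method that will not let go
                held.countDown();
                await(release, 60_000);
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        await(held, 60_000);              // make sure the lock is really taken
        boolean whileHeld = probe(service, 200);
        release.countDown();
        boolean afterRelease = probe(service, 5_000);
        return new boolean[] { whileHeld, afterRelease };
    }

    private static boolean await(CountDownLatch latch, long millis) {
        try {
            return latch.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        boolean[] r = scenario();
        System.out.println("probe returned while lock held: " + r[0]);
        System.out.println("probe returned after release:   " + r[1]);
    }
}
```

The probe does no work of its own. Whether it returns in time is the whole signal, which is the same trick the monitor() callback relies on.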
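b. Startup
Watchdog is itself a Thread. Registration must be completed before it is started; otherwise an exception is thrown (covered in the Watchdog source analysis below). So where is Watchdog started?
Readers familiar with the framework will know the system_server process. All of Android's core services run in it, and if it fails, the Android system restarts. Watchdog is started inside system_server. Let's take a look: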
try {
traceBeginAndSlog("StartServices");
startBootstrapServices();
startCoreServices();
startOtherServices();
SystemServerInitThreadPool.shutdown();
} catch (Throwable ex) {
........
throw ex;
} finally {
traceEnd();
}
When system_server starts, it launches the core system services in the order shown above. Watchdog is started inside startOtherServices(). Here is the relevant code:
private void startOtherServices() {
........
........
traceBeginAndSlog("StartWatchdog");
Watchdog.getInstance().start();
traceEnd();
........
........
}
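Watchdog.getInstance().start() is called only after the core system services have been started, which is consistent with what was said earlier. Next, let's analyze how Watchdog works by reading its source.
III. Watchdog Source Code Analysis
The Watchdog source is located at frameworks/base/services/core/java/com/android/server/Watchdog.java. As mentioned above, Watchdog is itself a Thread. Let's take a look:
a. Initialization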
public class Watchdog extends Thread {
........
final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
final HandlerChecker mMonitorChecker;
.......
.......
private Watchdog() {
super("watchdog");
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
mOpenFdMonitor = OpenFdMonitor.create();
// See the notes on DEFAULT_TIMEOUT.
assert DB ||
DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
.......
.......
}
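The constructor builds several HandlerCheckers and adds them to the mHandlerCheckers list; pay particular attention to mMonitorChecker. A HandlerChecker is what actually checks the system services, and the checks fall into two categories:
Monitor Checker: detects possible deadlocks on Monitor objects. Core system services such as AMS and WMS are Monitor objects.
Handler Checker: detects whether a thread's message queue has been busy for too long. The shared message queues created in the constructor (foreground, main, UI, I/O, and display threads) are all checked. In addition, the message queues of important threads in some core services, such as AMS and PMS, are added as Handler Checkers when the corresponding objects are initialized.
The constructor also calls addMonitor(new BinderThreadMonitor()) to monitor Binder: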
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
private static final class BinderThreadMonitor implements Watchdog.Monitor {
@Override
public void monitor() {
Binder.blockUntilThreadAvailable();
}
}
During each periodic check, this calls down into the native layer. The corresponding code is in frameworks/native/libs/binder/IPCThreadState.cpp:
void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
static_cast<unsigned long>(mProcess->mMaxThreads));
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
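BinderThreadMonitor is also added to mMonitorChecker. Its main purpose is to detect whether the process has run out of Binder threads. For example, suppose mMaxThreads is 15: once 15 Binder threads are executing, monitor() blocks until one of them becomes free, and if it stays blocked past the timeout, Watchdog reports the Binder threads as stuck.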
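The wait loop in blockUntilThreadAvailable() is a standard condition-variable pattern. The sketch below is a rough Java analogue, assuming a hypothetical ThreadBudget class that merely plays the role of the native counters; it is not how Binder itself is implemented. It shows the caller staying blocked while every slot is busy and returning as soon as one call finishes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical analogue of the thread counters guarded by mThreadCountLock.
class ThreadBudget {
    private final ReentrantLock lock = new ReentrantLock();       // plays mThreadCountLock
    private final Condition decremented = lock.newCondition();    // plays mThreadCountDecrement
    private final int maxThreads;
    private int executing;

    ThreadBudget(int maxThreads) { this.maxThreads = maxThreads; }

    void beginCall() {
        lock.lock();
        try { executing++; } finally { lock.unlock(); }
    }

    void endCall() {
        lock.lock();
        try {
            executing--;
            decremented.signalAll();   // wake anyone waiting for a free thread
        } finally { lock.unlock(); }
    }

    // Blocks while every thread in the budget is busy.
    void blockUntilThreadAvailable() throws InterruptedException {
        lock.lock();
        try {
            while (executing >= maxThreads) {
                decremented.await();
            }
        } finally { lock.unlock(); }
    }
}

class ThreadBudgetDemo {
    // Returns {monitor returned while saturated, monitor returned after a call ended}.
    static boolean[] scenario() {
        ThreadBudget budget = new ThreadBudget(2);
        budget.beginCall();
        budget.beginCall();            // both slots are now busy
        CountDownLatch returned = new CountDownLatch(1);
        Thread monitor = new Thread(() -> {
            try {
                budget.blockUntilThreadAvailable();
                returned.countDown();
            } catch (InterruptedException ignored) { }
        }, "monitor");
        monitor.setDaemon(true);
        monitor.start();
        boolean whileSaturated = await(returned, 200);
        budget.endCall();              // one slot frees up
        boolean afterFree = await(returned, 5_000);
        return new boolean[] { whileSaturated, afterFree };
    }

    private static boolean await(CountDownLatch latch, long millis) {
        try {
            return latch.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        boolean[] r = scenario();
        System.out.println("returned while saturated: " + r[0]);
        System.out.println("returned after a call ended: " + r[1]);
    }
}
```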
b. HandlerChecker
Initialization creates several HandlerCheckers and adds them to a list. Let's see what a HandlerChecker is:
/**
* Used for checking status of handle threads and scheduling monitor callbacks.
*/
public final class HandlerChecker implements Runnable {
private final Handler mHandler;
private final String mName;
private final long mWaitMax;
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
private boolean mCompleted;
private Monitor mCurrentMonitor;
private long mStartTime;
HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
public void addMonitor(Monitor monitor) {
mMonitors.add(monitor);
}
public void scheduleCheckLocked() {
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
mCompleted = true;
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);
}
public boolean isOverdueLocked() {
return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
}
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
.......
........
public String describeBlockedStateLocked() {
if (mCurrentMonitor == null) {
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
} else {
return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
+ " on " + mName + " (" + getThread().getName() + ")";
}
}
@Override
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
}
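As you can see, HandlerChecker is a Runnable that checks the running state of a Handler and dispatches the Monitor callbacks. Its main methods are:
addMonitor(): adds a Monitor object to the mMonitors list;
scheduleCheckLocked(): the entry point of a check; posts the checker itself to the message queue for execution;
isOverdueLocked(): reports whether the check has timed out;
getCompletionStateLocked(): returns the completion state;
describeBlockedStateLocked(): builds the failure description, which tells whether a Monitor or a Handler timed out;
run(): performs the check.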
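The thresholds inside getCompletionStateLocked() are easier to see as a pure function. The sketch below restates that logic outside the framework. The class name CompletionState is made up, and the numeric values of the four constants are an assumption here (the snippet above does not show them); only their ordering matters.

```java
// Hypothetical restatement of the state thresholds as a pure function.
class CompletionState {
    // Assumed values: the article's snippet does not show them, only the names.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    static int of(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;            // less than half the timeout has elapsed
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;        // at least half, but not yet the full timeout
        }
        return OVERDUE;                // the full timeout has elapsed
    }

    static String name(int state) {
        switch (state) {
            case COMPLETED: return "COMPLETED";
            case WAITING: return "WAITING";
            case WAITED_HALF: return "WAITED_HALF";
            default: return "OVERDUE";
        }
    }

    public static void main(String[] args) {
        long waitMax = 60_000;         // a one-minute timeout
        for (long latency : new long[] { 10_000, 45_000, 60_000 }) {
            System.out.println(latency + " ms -> " + name(of(false, latency, waitMax)));
        }
    }
}
```

With a 60-second timeout, a check that has been pending for 10 seconds is WAITING, one pending for 45 seconds is WAITED_HALF, and one pending for 60 seconds is OVERDUE.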
c. Registration
Watchdog checks consist of Monitor checks and Handler checks, and each has its own registration entry point:
c.1: Monitor registration
public void addMonitor(Monitor monitor) {
synchronized (this) {
if (isAlive()) {
throw new RuntimeException("Monitors can't be added once the Watchdog is running");
}
mMonitorChecker.addMonitor(monitor);
}
}
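As mentioned earlier, registration has to happen before Watchdog starts. addMonitor() enforces this: if the thread is already running, it throws an exception. Otherwise it calls mMonitorChecker.addMonitor() to add the monitor to mMonitorChecker's mMonitors list.
Note: every core system service registers itself through addMonitor(), which always ends up in mMonitorChecker.addMonitor(). In other words, they are all checked through mMonitorChecker (its run() iterates over mMonitors). mMonitorChecker is created in the Watchdog constructor and then added to the mHandlerCheckers list.
c.2: Handler registration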
public void addThread(Handler thread) {
addThread(thread, DEFAULT_TIMEOUT);
}
public void addThread(Handler thread, long timeoutMillis) {
synchronized (this) {
if (isAlive()) {
throw new RuntimeException("Threads can't be added once the Watchdog is running");
}
final String name = thread.getLooper().getThread().getName();
mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
}
}
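addThread() registers a Handler check. DEFAULT_TIMEOUT is 1 minute. The service's own Handler is passed in and wrapped in a new HandlerChecker, which is then added to the mHandlerCheckers list.
Comparing the two registration methods shows that Monitor checks and Handler checks run separately. The Monitor checks of all core system services run inside a single HandlerChecker, namely mMonitorChecker, whereas Handler checks run in separate HandlerCheckers, one for each registered service Handler.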
d. Running the checks
Since Watchdog is a Thread, checking begins as soon as the thread is started. Let's look at the run() method:
@Override
public void run() {
........
while (true) {
.......
synchronized (this) {
long timeout = CHECK_INTERVAL;
//-----------------Part 1------------------------
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
hc.scheduleCheckLocked();
}
........
//-----------------Part 2------------------------
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
......
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
.........
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
boolean fdLimitTriggered = false;
.......
if (!fdLimitTriggered) {
//-----------------Part 3------------------------
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
continue;
} else if (waitState == WAITED_HALF) {
........
continue;
}
// something is overdue!
//-----------------Part 4------------------------
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
} else {
blockedCheckers = Collections.emptyList();
subject = "Open FD high water mark reached";
}
allowRestart = mAllowRestart;
}
..........
.........
//------------------------Part 5--------------------------------
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
for (int i=0; i<blockedCheckers.size(); i++) {
Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
StackTraceElement[] stackTrace
= blockedCheckers.get(i).getThread().getStackTrace();
for (StackTraceElement element: stackTrace) {
Slog.w(TAG, " at " + element);
}
}
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
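The logic inside run() is somewhat involved, so let's split it into five parts:
Part 1: iterate over the mHandlerCheckers list and call scheduleCheckLocked() on each entry to start a check;
Part 2: wait for the check interval to elapse. The interval between checks is set by the CHECK_INTERVAL constant and defaults to 30 seconds;
Part 3: evaluate the completion state of the HandlerCheckers. COMPLETED means finished; WAITING and WAITED_HALF mean still waiting but not yet timed out; OVERDUE means timed out;
Part 4: if any HandlerChecker has timed out, collect the blocked HandlerCheckers and build a description of them;
Part 5: save logs, print the stack traces, then kill the system process.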
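Parts 3 and 4 boil down to "take the worst state across all checkers, and if it is OVERDUE, list the checkers responsible". The sketch below expresses that as two small functions. LoopDecision and its method names are hypothetical, checkers are represented by name only, and the constant values are assumed; it is meant to illustrate the decision, not to reproduce evaluateCheckerCompletionLocked() or getBlockedCheckersLocked().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of how one loop iteration decides what to do.
class LoopDecision {
    // Assumed values; only the ordering from "best" to "worst" matters here.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Part 3: the overall state is the worst state of any single checker.
    static int worstState(Map<String, Integer> stateByChecker) {
        int state = COMPLETED;
        for (int s : stateByChecker.values()) {
            state = Math.max(state, s);
        }
        return state;
    }

    // Part 4: only checkers that are themselves overdue count as blocked.
    static List<String> overdueCheckers(Map<String, Integer> stateByChecker) {
        List<String> blocked = new ArrayList<>();
        for (Map.Entry<String, Integer> e : stateByChecker.entrySet()) {
            if (e.getValue() == OVERDUE) {
                blocked.add(e.getKey());
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        Map<String, Integer> states = new LinkedHashMap<>();
        states.put("foreground thread", OVERDUE);
        states.put("main thread", COMPLETED);
        states.put("ui thread", WAITED_HALF);
        if (worstState(states) == OVERDUE) {
            System.out.println("blocked checkers: " + overdueCheckers(states));
        } else {
            System.out.println("keep waiting");
        }
    }
}
```

Because the overall state is a maximum, one stuck checker is enough to bring down the process even when every other thread is healthy.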
run() is a while(true) loop: as long as the system process has not been killed, it keeps calling scheduleCheckLocked() on every HandlerChecker. Let's look at that method again:
public void scheduleCheckLocked() {
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
mCompleted = true;
return;
}
// The previous check has not completed yet, so return without scheduling another
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
// Post this HandlerChecker at the front of the message queue so that it runs first
mHandler.postAtFrontOfQueue(this);
}
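For the Monitor checks of the core services, mHandler is always the Handler provided by FgThread (thread name android.fg). For the Handler checks, mHandler is the service's own Handler. The two checks target different things, so they use different Handlers. When the posted HandlerChecker gets its turn on that thread, its run() method executes: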
@Override
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
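Inside HandlerChecker.run(), a Monitor check (mMonitorChecker) has a non-empty mMonitors list, so it iterates over the monitors and calls each monitor(), which tries to acquire the core service's lock; once every call has returned, mCompleted is set to true. A Handler check has an empty mMonitors list, so no monitor() is called and mCompleted is set to true right away.
The difference between the two is this. A Monitor check has to run monitor() to acquire a lock; if the lock cannot be acquired, it stays blocked until the timeout, which points to a deadlock or a lock held too long by another thread. For a Handler check, the mere fact that run() executed shows that the core service's Handler is working and is not blocked by other messages. If mCompleted is still false, the runnable has not been executed, most likely because a long-running message is blocking the Handler's thread.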
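The Handler side of this comparison can be reproduced without Android. In the sketch below, MiniHandler is a hypothetical, heavily simplified stand-in for a Handler plus its thread (a deque drained by one worker), and QueueProbe plays the role of a HandlerChecker with no monitors. Even though the probe is posted at the front of the queue, it cannot run while the worker is stuck inside an earlier message.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

// Hypothetical, heavily simplified stand-in for a Handler and its thread.
class MiniHandler {
    private final BlockingDeque<Runnable> queue = new LinkedBlockingDeque<>();

    MiniHandler(String threadName) {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    queue.take().run();   // messages run strictly one at a time
                }
            } catch (InterruptedException ignored) {
                // the worker stops when interrupted
            }
        }, threadName);
        worker.setDaemon(true);
        worker.start();
    }

    void post(Runnable r) { queue.addLast(r); }
    void postAtFrontOfQueue(Runnable r) { queue.addFirst(r); }
}

// Plays the role of a HandlerChecker with no monitors: running it only sets a flag.
class QueueProbe implements Runnable {
    private final CountDownLatch ran = new CountDownLatch(1);

    @Override
    public void run() { ran.countDown(); }

    boolean completedWithin(long millis) {
        try {
            return ran.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class HandlerProbeDemo {
    // Returns {probe ran while a message hogs the thread, probe ran afterwards}.
    static boolean[] scenario() {
        MiniHandler handler = new MiniHandler("fake-service-thread");
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finish = new CountDownLatch(1);
        handler.post(() -> {              // a message that does not return for a long time
            started.countDown();
            try { finish.await(); } catch (InterruptedException ignored) { }
        });
        try { started.await(); } catch (InterruptedException ignored) { }
        QueueProbe probe = new QueueProbe();
        handler.postAtFrontOfQueue(probe);
        boolean whileBusy = probe.completedWithin(200);
        finish.countDown();               // the stuck message finally returns
        boolean afterwards = probe.completedWithin(5_000);
        return new boolean[] { whileBusy, afterwards };
    }

    public static void main(String[] args) {
        boolean[] r = scenario();
        System.out.println("probe ran while thread was busy: " + r[0]);
        System.out.println("probe ran after the message ended: " + r[1]);
    }
}
```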
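IV. Case Study
When a core system service misbehaves and trips the Watchdog check, the stack traces are written to a file whose name follows this format: 20210810014838_traces_SystemServer_WDT10_8月_01_48_37.661_pid1107. If a Monitor is blocked, the following EventLog entry is printed: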
08-10 01:47:56.394 1107 1440 I watchdog: Blocked in monitor com.android.server.wm.WindowManagerService on foreground thread (android.fg)
If a Handler is blocked, the following EventLog entry is printed instead:
08-10 01:47:56.394 1107 1440 I watchdog: Blocked in handler on xx (xx)
The warning log lines follow:
08-10 01:48:42.967 1107 1440 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.wm.WindowManagerService on foreground thread (android.fg)
08-10 01:48:46.255 1107 1440 W Watchdog: *** GOODBYE!
At this point system_server has been killed and the system restarts.
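When triaging many such logs, it can help to classify the watchdog lines mechanically. The sketch below assumes the message layout produced by describeBlockedStateLocked() for a single blocked checker; the BlockedLine class is hypothetical, and lines that describe several checkers at once are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that classifies a single-checker watchdog log line.
class BlockedLine {
    private static final Pattern MONITOR =
            Pattern.compile("Blocked in monitor (\\S+) on (.+) \\((.+)\\)");
    private static final Pattern HANDLER =
            Pattern.compile("Blocked in handler on (.+) \\((.+)\\)");

    final String monitorClass;   // null when a handler (message queue) is blocked
    final String checkerName;    // e.g. "foreground thread"
    final String threadName;     // e.g. "android.fg"

    private BlockedLine(String monitorClass, String checkerName, String threadName) {
        this.monitorClass = monitorClass;
        this.checkerName = checkerName;
        this.threadName = threadName;
    }

    boolean isMonitorBlocked() { return monitorClass != null; }

    // Returns null when the line is not a "Blocked in ..." watchdog line.
    static BlockedLine parse(String logLine) {
        Matcher m = MONITOR.matcher(logLine);
        if (m.find()) {
            return new BlockedLine(m.group(1), m.group(2), m.group(3));
        }
        m = HANDLER.matcher(logLine);
        if (m.find()) {
            return new BlockedLine(null, m.group(1), m.group(2));
        }
        return null;
    }

    public static void main(String[] args) {
        BlockedLine b = parse("08-10 01:47:56.394 1107 1440 I watchdog: Blocked in monitor "
                + "com.android.server.wm.WindowManagerService on foreground thread (android.fg)");
        System.out.println(b.monitorClass + " blocked on thread " + b.threadName);
    }
}
```

For the line in this case, the parser reports a Monitor block in WindowManagerService on the android.fg thread, which is the thread to look for in the traces file.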
Under the anr directory, find the file 20210810014838_traces_SystemServer_WDT10_8月_01_48_37.661_pid1107, extract it, and locate the android.fg thread:
"android.fg" prio=5 tid=16 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x134011c0 self=0x7a8077c600
| sysTid=1147 nice=0 cgrp=default sched=0/0 handle=0x7a70bfc4f0
| state=S schedstat=( 1586113377 2192768722 6889 ) utm=70 stm=88 core=1 HZ=100
| stack=0x7a70afa000-0x7a70afc000 stackSize=1037KB
| held mutexes=
at com.android.server.wm.WindowManagerService.monitor(WindowManagerService.java:7069)
- waiting to lock <0x05f70fd8> (a com.android.server.wm.WindowHashMap)
at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)
at android.os.Handler.handleCallback(Handler.java:790)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loop(Looper.java:164)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
The monitor is waiting for lock 0x05f70fd8, which is a WindowHashMap. Next, search for the thread that holds 0x05f70fd8:
"android.anim" prio=5 tid=26 Runnable
| group="main" sCount=0 dsCount=0 flags=0 obj=0x133c40c0 self=0x7a70e19600
| sysTid=1199 nice=-4 cgrp=default sched=0/0 handle=0x7a700cd4f0
| state=R schedstat=( 299015045884 448343750812 840845 ) utm=22091 stm=7810 core=3 HZ=100
| stack=0x7a6ffcb000-0x7a6ffcd000 stackSize=1037KB
| held mutexes= "mutator lock"(shared held)
at com.android.server.wm.WindowContainer.forAllWindows(WindowContainer.java:-1)
at com.android.server.wm.AppWindowToken.forAllWindowsUnchecked(AppWindowToken.java:1549)
at com.android.server.wm.AppWindowToken.forAllWindows(AppWindowToken.java:1544)
at com.android.server.wm.WindowContainer.forAllWindows(WindowContainer.java:616)
at com.android.server.wm.WindowContainer.forAllWindows(WindowContainer.java:616)
at com.android.server.wm.WindowContainer.forAllWindows(WindowContainer.java:616)
at com.android.server.wm.DisplayContent$TaskStackContainers.forAllWindows(DisplayContent.java:3434)
at com.android.server.wm.DisplayContent.forAllWindows(DisplayContent.java:1556)
at com.android.server.wm.WindowContainer.forAllWindows(WindowContainer.java:633)
at com.android.server.wm.DisplayContent.updateWallpaperForAnimator(DisplayContent.java:2655)
at com.android.server.wm.WindowAnimator.animate(WindowAnimator.java:202)
- locked <0x05f70fd8> (a com.android.server.wm.WindowHashMap)
at com.android.server.wm.WindowAnimator.lambda$-com_android_server_wm_WindowAnimator_3951(WindowAnimator.java:105)
at com.android.server.wm.-$Lambda$OQfQhd_xsxt9hoLAjIbVfOwa-jY.$m$0(unavailable:-1)
at com.android.server.wm.-$Lambda$OQfQhd_xsxt9hoLAjIbVfOwa-jY.doFrame(unavailable:-1)
at android.view.Choreographer$CallbackRecord.run(Choreographer.java:964)
at android.view.Choreographer.doCallbacks(Choreographer.java:778)
at android.view.Choreographer.doFrame(Choreographer.java:710)
at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:952)
at android.os.Handler.handleCallback(Handler.java:790)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loop(Looper.java:164)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:46)
The lock is held inside WindowAnimator.animate(), so the next step is to find out why that method keeps holding the lock for so long.
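The manual search above (find the "waiting to lock" address, then find the thread that "locked" it) is easy to automate for large trace files. The following is a small sketch, assuming the trace text uses the same line shapes as the excerpts in this section; LockGraph is a hypothetical helper, not an existing tool.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps each waiting thread to the thread holding its lock.
class LockGraph {
    private static final Pattern THREAD = Pattern.compile("^\"([^\"]+)\"");
    private static final Pattern WAITING =
            Pattern.compile("- waiting to lock <(0x[0-9a-fA-F]+)>");
    private static final Pattern LOCKED =
            Pattern.compile("- locked <(0x[0-9a-fA-F]+)>");

    static Map<String, String> whoBlocksWhom(String trace) {
        Map<String, String> waitingFor = new LinkedHashMap<>();  // thread -> lock address
        Map<String, String> heldBy = new LinkedHashMap<>();      // lock address -> thread
        String current = null;
        for (String raw : trace.split("\n")) {
            String line = raw.trim();
            Matcher t = THREAD.matcher(line);
            if (t.find()) {               // a new thread section starts
                current = t.group(1);
                continue;
            }
            if (current == null) {
                continue;
            }
            Matcher w = WAITING.matcher(line);
            if (w.find()) {
                waitingFor.put(current, w.group(1));
                continue;
            }
            Matcher l = LOCKED.matcher(line);
            if (l.find()) {
                heldBy.put(l.group(1), current);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : waitingFor.entrySet()) {
            String holder = heldBy.get(e.getValue());
            if (holder != null) {
                result.put(e.getKey(), holder);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String trace = "\"android.fg\" prio=5 tid=16 Blocked\n"
                + "  at com.android.server.wm.WindowManagerService.monitor(WindowManagerService.java:7069)\n"
                + "  - waiting to lock <0x05f70fd8> (a com.android.server.wm.WindowHashMap)\n"
                + "\"android.anim\" prio=5 tid=26 Runnable\n"
                + "  at com.android.server.wm.WindowAnimator.animate(WindowAnimator.java:202)\n"
                + "  - locked <0x05f70fd8> (a com.android.server.wm.WindowHashMap)\n";
        System.out.println(whoBlocksWhom(trace));
    }
}
```

Run against the two excerpts above, the result maps android.fg to android.anim, matching the conclusion reached by hand.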
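The sections above covered Watchdog's usage and workflow in detail. system_server is a key Android process that hosts the core services. If those services cannot run normally (because of a deadlock or a message queue that stays busy), there is little point in the system continuing to run. Watchdog takes on the job of monitoring system_server, and when system_server gets stuck, it kills the process so that the system restarts.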