这是关于故障诊断文章的第二篇,翻译自《How to Analyze Java Thread Dumps》,原文地址:https://dzone.com/articles/how-analyze-java-thread-dumps
——————————————神奇的分割线————————————
这篇文章的内容最初由Tae Jin Gu发布在Cubrid blog上。
当存在卡顿,或者当机遇Java的web应用相应速度慢于预期时,我们需要使用thread dumps。如果你觉得thread dumps对你来说非常复杂,本文将帮助你理解。本文中我将解释Java中的线程threads是啥,有哪些类型,他们怎样被创建的,怎么管理他们,你怎样对运行应用dump threads,最后将解释如何分析dump文件,找出瓶颈或者是阻塞线程。这篇文章是丰富java应用debug经验的结晶。
java和线程(Thread)
web服务器利用数十到数百个线程来处理大量的并发用户请求。如果两个或者多个线程需要使用同一个资源,线程间必然产生竞争,有时候甚至会发生死锁。
线程竞争是指线程处于等待锁的状态,这个锁被别的线程锁持有了。在web应用中,不同的线程会频繁的访问共享资源。例如,记录日志,需要记录日志的新城必须需要获取相关的锁,访问共享的资源。
死锁是一种特殊类型的线程竞争,在这种情况下,两个或者多个线程等待其他线程完成工作之后才能完成自己的任务。
Java线程的背景知识
线程同步
线程可以与其他线程同时协调工作。当多个线程需要访问共享资源时,为了保证正确性,必须使用线程同步保证同时只有一个线程允许访问共享资源。
java中的线程同步可以使用监视器(monitor)实现。每个java对象有唯一一个监视器。监视器只能被一个线程占有。一个线程如果要占用别的线程获取了的监视器对象,需要再监视器的等待队列中等待其他线程释放该监视器。
线程状态
为了分析线程dump,需要了解线程的各种状态。这些状态在java.lang.Thread.State中列出了。
- NEW:线程刚被创建还没有被执行。
- RUNNABLE: 线程占有cpu资源,正在处理任务(由于操作系统的资源调度,其可能处于WATING状态)
- BLOCKED:为了获取监视器锁(monitor lock),线程在等待其他线程释放锁
- WAITING:线程使用了wait,join或pard方法进行等待
- TIMED_WAITING:线程使用sleep,wait,join或者park方法进行等待。(与WAITING状态不同的是方法参数指定了最大等待时间,WAITING状态可以通过时间也可以通过外部改变来解除)
线程类型
java线程可以分成两类:
- daemon线程(精灵线程,也可认为是后台线程)
- 非daemon线程
当不存在其他非daemon线程时deamon线程停止工作。即使你不创建任何线程,java应用默认就会创建多个线程。他们大多都是daemon线程,主要是用来出来像gc或者JMX等。
运行 static void main(String[] args)的方法都会被创建为非daemon线程,当这个线程停止工作的时候,所有其他的daemon线程也都会停止。
创建thread dump
我们将介绍三种最常用的方法。需要注意的是有许多其他方法获得dump数据。一次thread dump只能展现dump时刻的线程状态,因此为了看到线程状态的改变,推荐的做法是每个5秒dump一次,共dump5-10次。
使用jstack进行thread dump
在jdk1.6及高版本中,你可以在windows系统中使用jstack进行thread dump。
jps -v
jstack -f 5824
通过使用jsp获得java进程的pid,将pid作为jstack的参数以获取thread dump
通过jVisualVM获取thread dump
通过使用像jVisualVM类似的软件生成thread dump文件。
在linux终端中生成
通过ps -ef命令列出所有进程Pid,找出对应的java进程pid
ps -ef | grep java
使用对应的Pid作为kill -3的参数获取thread dump。
线程dump文件中的线程信息
"pool-1-thread-13" prio=6 tid=0x000000000729a000 nid=0x2fb4 runnable [0x0000000007f0f000] java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
- locked <0x0000000780b7e688> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
- locked <0x0000000780b7e688> (a java.io.InputStreamReader)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
- 线程名字:当使用java.lang.Thread类生成一个线程时,线程将被命名为Thread-(NUmber(序号)),而当使用java.util.concurrent.ThreadFacotry类时,将被命名为pool-(number)-thread-(number).
- 优先级:带便线程的优先级
- 线程ID:代表线程的唯一ID(一些重要的信息,包括线程cpu使用率,内存使用率,可以通过线程id获取)
- 线程状态:代表线程的状态
- 线程调用栈:代表线程的调用栈信息
按类型排列线程Dump信息
不能获取锁情况(BLOCKED)
当一个线程获取了锁导致别的线程不能获取的情况,这种情况通常会造成应用程序性能整体变慢。在以下例子中,BLOCKED_TEST pool-1-thread-1占有了<0x0000000780a000b0>锁,而BLOCKED_TEST pool-1-thread-2和BLOCKED_TEST pool-1-thread-3线程则在等待获取<0x0000000780a000b0>锁。
"BLOCKED_TEST pool-1-thread-1" prio=6 tid=0x0000000006904800 nid=0x28f4 runnable [0x000000000785f000]
java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
- locked <0x0000000780a31778> (a java.io.BufferedOutputStream)
at java.io.PrintStream.write(PrintStream.java:432)
- locked <0x0000000780a04118> (a java.io.PrintStream)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85)
- locked <0x0000000780a040c0> (a java.io.OutputStreamWriter)
at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168)
at java.io.PrintStream.newLine(PrintStream.java:496)
- locked <0x0000000780a04118> (a java.io.PrintStream)
at java.io.PrintStream.println(PrintStream.java:687)
- locked <0x0000000780a04118> (a java.io.PrintStream)
at com.nbp.theplatform.threaddump.ThreadBlockedState.monitorLock(ThreadBlockedState.java:44)
- locked <0x0000000780a000b0> (a com.nbp.theplatform.threaddump.ThreadBlockedState)
at com.nbp.theplatform.threaddump.ThreadBlockedState$1.run(ThreadBlockedState.java:7)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
- <0x0000000780a31758> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
"BLOCKED_TEST pool-1-thread-2" prio=6 tid=0x0000000007673800 nid=0x260c waiting for monitor entry [0x0000000008abf000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.nbp.theplatform.threaddump.ThreadBlockedState.monitorLock(ThreadBlockedState.java:43)
- waiting to lock <0x0000000780a000b0> (a com.nbp.theplatform.threaddump.ThreadBlockedState)
at com.nbp.theplatform.threaddump.ThreadBlockedState$2.run(ThreadBlockedState.java:26)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
- <0x0000000780b0c6a0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
"BLOCKED_TEST pool-1-thread-3" prio=6 tid=0x00000000074f5800 nid=0x1994 waiting for monitor entry [0x0000000008bbf000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.nbp.theplatform.threaddump.ThreadBlockedState.monitorLock(ThreadBlockedState.java:42)
- waiting to lock <0x0000000780a000b0> (a com.nbp.theplatform.threaddump.ThreadBlockedState)
at com.nbp.theplatform.threaddump.ThreadBlockedState$3.run(ThreadBlockedState.java:34)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
- <0x0000000780b0e1b8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
处于死锁状态的情况
这种情况的产生是由于线程a需要获取线程b的锁来继续工作,而线程b需要线程a的锁来继续工作。在线程dump中,你可以看到DEADLOCK_TEST-1 thread 持有0x00000007d58f5e48锁并且试着获取0x00000007d58f5e60锁。也可以看到DEADLOCK_TEST-2 thread持有 0x00000007d58f5e60锁,并且尝试着获取0x00000007d58f5e48锁。可以看到,每个线程都在等待获取另外一个线程持有的锁,这种状态将一直持续到其中一个线程放弃锁。
"DEADLOCK_TEST-1" daemon prio=6 tid=0x000000000690f800 nid=0x1820 waiting for monitor entry [0x000000000805f000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.goMonitorDeadlock(ThreadDeadLockState.java:197)
- waiting to lock <0x00000007d58f5e60> (a com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.monitorOurLock(ThreadDeadLockState.java:182)
- locked <0x00000007d58f5e48> (a com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.run(ThreadDeadLockState.java:135)
Locked ownable synchronizers:
- None
"DEADLOCK_TEST-2" daemon prio=6 tid=0x0000000006858800 nid=0x17b8 waiting for monitor entry [0x000000000815f000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.goMonitorDeadlock(ThreadDeadLockState.java:197)
- waiting to lock <0x00000007d58f5e78> (a com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.monitorOurLock(ThreadDeadLockState.java:182)
- locked <0x00000007d58f5e60> (a com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.run(ThreadDeadLockState.java:135)
Locked ownable synchronizers:
- None
"DEADLOCK_TEST-3" daemon prio=6 tid=0x0000000006859000 nid=0x25dc waiting for monitor entry [0x000000000825f000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.goMonitorDeadlock(ThreadDeadLockState.java:197)
- waiting to lock <0x00000007d58f5e48> (a com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.monitorOurLock(ThreadDeadLockState.java:182)
- locked <0x00000007d58f5e78> (a com.nbp.theplatform.threaddump.ThreadDeadLockState$Monitor)
at com.nbp.theplatform.threaddump.ThreadDeadLockState$DeadlockThread.run(ThreadDeadLockState.java:135)
Locked ownable synchronizers:
- None
当持续不断地等着从远程sever接收消息时
这种情况下线程看起来是正常的,由于线程状态一直显示为RUNNABLE。但是,当你对thread dump进行时间排序时,你将发现socketReadThread线程一直在等待从socket中获取信息。
"socketReadThread" prio=6 tid=0x0000000006a0d800 nid=0x1b40 runnable [0x00000000089ef000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
- locked <0x00000007d78a2230> (a java.io.InputStreamReader)
at sun.nio.cs.StreamDecoder.read0(StreamDecoder.java:107)
- locked <0x00000007d78a2230> (a java.io.InputStreamReader)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:93)
at java.io.InputStreamReader.read(InputStreamReader.java:151)
at com.nbp.theplatform.threaddump.ThreadSocketReadState$1.run(ThreadSocketReadState.java:27)
at java.lang.Thread.run(Thread.java:662)
等待的情况
线程保持等待(WAIT)状态。在线程dump中,IoWaitThread线程等待从LinkedBlockingQueue中接收消息。如果LinkedBlockingQueue中一直没有消息进入的话,线程将一直保持等待状态。
"IoWaitThread" prio=6 tid=0x0000000007334800 nid=0x2b3c waiting on condition [0x000000000893f000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007d5c45850> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:440)
at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:629)
at com.nbp.theplatform.threaddump.ThreadIoWaitState$IoWaitHandler2.run(ThreadIoWaitState.java:89)
at java.lang.Thread.run(Thread.java:662)
当没有有效地组织线程资源时
当线程资源没有被合理使用组织时,不需要的线程将会堆积。当发生这种情况的时候,推荐的做法是监控线程或者检查线程结束的条件。
如果使用线程dump解决问题
案例一:当cpu使用率异常飙高时
1、找到拥有最高cpu使用率的线程
ps -mo pid.lwp.stime.time.cpu -C java(待验证)
从应用中找出cpu使用率最高的线程。
找出使用Cpu最多的LWP(线程),将uid(10039)转化成16进制(0x2737)
2、在获得thread dump后,查看相应线程的情况
先对pid为10029的进程进行thread dump,然后找出nid为0x2737的线程
"NioProcessor-2" prio=10 tid=0x0a8d2800 nid=0x2737 runnable [0x49aa5000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x74c52678> (a sun.nio.ch.Util$1)
- locked <0x74c52668> (a java.util.Collections$UnmodifiableSet)
- locked <0x74c501b0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at external.org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:65)
at external.org.apache.mina.common.AbstractPollingIoProcessor$Worker.run(AbstractPollingIoProcessor.java:708)
at external.org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
每个小时都进行多次thread dump,查看对应线程的线程状态,找出问题。
案例二:当处理性能变慢
在租了多次线程dumps之后,找出BLOCKED状态的线程。
" DB-Processor-13" daemon prio=5 tid=0x003edf98 nid=0xca waiting for monitor entry [0x000000000825f000]
java.lang.Thread.State: BLOCKED (on object monitor)
at beans.ConnectionPool.getConnection(ConnectionPool.java:102)
- waiting to lock <0xe0375410> (a beans.ConnectionPool)
at beans.cus.ServiceCnt.getTodayCount(ServiceCnt.java:111)
at beans.cus.ServiceCnt.insertCount(ServiceCnt.java:43)
"DB-Processor-14" daemon prio=5 tid=0x003edf98 nid=0xca waiting for monitor entry [0x000000000825f020]
java.lang.Thread.State: BLOCKED (on object monitor)
at beans.ConnectionPool.getConnection(ConnectionPool.java:102)
- waiting to lock <0xe0375410> (a beans.ConnectionPool)
at beans.cus.ServiceCnt.getTodayCount(ServiceCnt.java:111)
at beans.cus.ServiceCnt.insertCount(ServiceCnt.java:43)
" DB-Processor-3" daemon prio=5 tid=0x00928248 nid=0x8b waiting for monitor entry [0x000000000825d080]
java.lang.Thread.State: RUNNABLE
at oracle.jdbc.driver.OracleConnection.isClosed(OracleConnection.java:570)
- waiting to lock <0xe03ba2e0> (a oracle.jdbc.driver.OracleConnection)
at beans.ConnectionPool.getConnection(ConnectionPool.java:112)
- locked <0xe0386580> (a java.util.Vector)
- locked <0xe0375410> (a beans.ConnectionPool)
at beans.cus.Cue_1700c.GetNationList(Cue_1700c.java:66)
at org.apache.jsp.cue_1700c_jsp._jspService(cue_1700c_jsp.java:120)
如果线程处于blocked状态,找出与线程想获取的锁相关的线程。
通过thread dump,可以确认线程处于blocked状态主要是因为线程获取不到0xe0375410锁。可以通过分析占有对应锁的线程栈信息解决以上问题。
关于为啥上述情况在使用了数据库管理系统的应用中频繁出现,有两个原因。第一个原因是由于不合理的配置。即使事实上线程仍然在运行,但是由于不合理的DBCP配置导致他们不能获得最佳性能。如果多次thread dump进行对比,你会发现之前处于BLOCKED状态的线程处于不同的状态。
第二个原因就是不正常的连接。当与数据库的连接不正常时,线程将会等待直到超时。在这种情况下,即使多次dump进行对比,会发现与数据库相关的线程仍然处于BLOCKED状态。通过适当地改变一些配置,例如超时时间,会较少这种问题持续的时间。
Thread Dump友好的编码实践
线程命名
当使用java.lang.Thread类生成一个线程时,线程将被命名为Thread-(NUmber(序号)),而当使用java.util.concurrent.ThreadFacotry类时,将被命名为pool-(number)-thread-(number).当分析应用中成百上千个线程时,如果所有的线程都使用了默认的名字,分析起来就会变得非常困难,因为很难分辨要分析的线程。
因此,推荐你养成一个号习惯:当创建一个线程的是偶,就它指定一个名字。
当使用java.lang.Thread创建线程的时候,可以通过构造方法给线程命名。
public Thread(Runnable target, String name);
public Thread(ThreadGroup group, String name);
public Thread(ThreadGroup group, Runnable target, String name);
public Thread(ThreadGroup group, Runnable target, String name, long stackSize);
当你使用java.util.concurrent.ThreadFactory创建线程时,你可以通过创建自定义的ThreadFactory命名。如果你不需要特殊功能,你可以使用下面描述的MyThreadFactory:
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;
public class MyThreadFactory implements ThreadFactory {
private static final ConcurrentHashMap<String, AtomicInteger> POOL_NUMBER =
new ConcurrentHashMap<String, AtomicInteger>();
private final ThreadGroup group;
private final AtomicInteger threadNumber = new AtomicInteger(1);
private final String namePrefix;
public MyThreadFactory(String threadPoolName) {
if (threadPoolName == null) {
throw new NullPointerException("threadPoolName");
}
POOL_NUMBER.putIfAbsent(threadPoolName, new AtomicInteger());
SecurityManager securityManager = System.getSecurityManager();
group = (securityManager != null) ? securityManager.getThreadGroup() :
Thread.currentThread().getThreadGroup();
AtomicInteger poolCount = POOL_NUMBER.get(threadPoolName);
if (poolCount == null) {
namePrefix = threadPoolName + " pool-00-thread-";
} else {
namePrefix = threadPoolName + " pool-" + poolCount.getAndIncrement() + "-thread-";
}
}
public Thread newThread(Runnable runnable) {
Thread thread = new Thread(group, runnable, namePrefix + threadNumber.getAndIncrement(), 0);
if (thread.isDaemon()) {
thread.setDaemon(false);
}
if (thread.getPriority() != Thread.NORM_PRIORITY) {
thread.setPriority(Thread.NORM_PRIORITY);
}
return thread;
}
}
可以通过使用MBean获取更多信息
你可以通过使用MBean获取ThreadInfo信息。通过ThreadInfo你可以获取更多通过thread dump很难获取的信息。
ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
long[] threadIds = mxBean.getAllThreadIds();
ThreadInfo[] threadInfos =
mxBean.getThreadInfo(threadIds);
for (ThreadInfo threadInfo : threadInfos) {
System.out.println(
threadInfo.getThreadName());
System.out.println(
threadInfo.getBlockedCount());
System.out.println(
threadInfo.getBlockedTime());
System.out.println(
threadInfo.getWaitedCount());
System.out.println(
threadInfo.getWaitedTime());
}
总结
希望本文可以给广大的码友提供帮助
More
关于如何分析Thread Dump网上有很多文章,得出dump文件之后由很多工具可以使用,例如MAT,TDA,JProfiler等。
http://www.cnblogs.com/zhengyun_ustc/archive/2013/03/18/tda.html 这篇文章对于各种线程的状态与分析总结得很到位。