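Symptoms
While deploying a Java 7 service to production, we found that one machine could not serve traffic after the deployment finished. A large number of threads were BLOCKED after the release, which triggered an alert.
The monitoring showed that the number of live threads in the JVM had exceeded 2,000, which is abnormal in itself. On top of that, nearly 1,800 of them were BLOCKED, which means the service never started properly: something went wrong during startup.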
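Analysis
The excessive thread count and the blocked threads are cause and effect: because threads were blocked, more threads were needed to do the work, so new threads kept being created.
The next step was to find out where the threads were blocked. A quick look showed that most of them were blocked at the same place: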
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:213)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:308)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
Here is the code at the blocking point, org.springframework.beans.factory.support.DefaultSingletonBeanRegistry#getSingleton(java.lang.String, boolean):
/**
 * Return the (raw) singleton object registered under the given name.
 * <p>Checks already instantiated singletons and also allows for an early
 * reference to a currently created singleton (resolving a circular reference).
 * @param beanName the name of the bean to look for
 * @param allowEarlyReference whether early references should be created or not
 * @return the registered singleton object, or {@code null} if none found
 */
protected Object getSingleton(String beanName, boolean allowEarlyReference) {
    Object singletonObject = this.singletonObjects.get(beanName);
    if (singletonObject == null && isSingletonCurrentlyInCreation(beanName)) {
        synchronized (this.singletonObjects) {
            singletonObject = this.earlySingletonObjects.get(beanName);
            if (singletonObject == null && allowEarlyReference) {
                ObjectFactory<?> singletonFactory = this.singletonFactories.get(beanName);
                if (singletonFactory != null) {
                    singletonObject = singletonFactory.getObject();
                    this.earlySingletonObjects.put(beanName, singletonObject);
                    this.singletonFactories.remove(beanName);
                }
            }
        }
    }
    return (singletonObject != NULL_OBJECT ? singletonObject : null);
}
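DefaultSingletonBeanRegistry#getSingleton(java.lang.String, boolean) does contain a synchronized block: a thread must acquire the lock on this.singletonObjects before it can execute that block, otherwise it is BLOCKED. Judging by the thread dumps below, where the same doGetBean call site (AbstractBeanFactory.java:308) leads into AbstractBeanFactory$1.getObject in the main thread, the blocked frames probably belong to the overload that takes an ObjectFactory rather than to this one. The dumps show both sides contending for the same ConcurrentHashMap monitor, so the reasoning is unaffected.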
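At this point, what we know for certain is that calling DefaultSingletonBeanRegistry#getSingleton can block threads through contention on the synchronized lock. According to the alert, however, almost all threads were blocked, which suggested a possible deadlock that kept the lock from ever being released, so that every thread entering the method was blocked. To find the actual cause, we first took a thread dump: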
"xxx-13-thread-1" daemon prio=10 tid=0x00007fa3790a1000 nid=0x48b4c waiting for monitor entry [0x00007fa38e17b000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:213)
- waiting to lock <0x0000000727af5b68> (a java.util.concurrent.ConcurrentHashMap)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:308)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
at org.springframework.aop.aspectj.annotation.BeanFactoryAspectInstanceFactory.getAspectInstance(BeanFactoryAspectInstanceFactory.java:83)
at org.springframework.aop.aspectj.annotation.LazySingletonAspectInstanceFactoryDecorator.getAspectInstance(LazySingletonAspectInstanceFactoryDecorator.java:53)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:627)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:616)
at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:168)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:671)
...
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:736)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:671)
...
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
All blocked threads had the same stack as the one above. The key line is:
waiting to lock <0x0000000727af5b68> (a java.util.concurrent.ConcurrentHashMap)
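0x0000000727af5b68 is the address of the object. The hint that follows it, (a java.util.concurrent.ConcurrentHashMap), also confirms that this is the synchronization object analyzed above. Here is its declaration again: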
/** Cache of singleton objects: bean name --> bean instance */
private final Map<String, Object> singletonObjects = new ConcurrentHashMap<String, Object>(256);
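When a dump contains thousands of threads, reading it by eye gets tedious. As a rough sketch (not part of the original investigation), a few lines of Java can list every thread whose section contains a given marker line, such as the `- waiting to lock <0x0000000727af5b68>` line above. The class name `DumpScanner` and the method `threadsWith` are made up for illustration, and the parsing assumes the jstack text layout shown in this article.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for scanning a jstack-style thread dump held in a String. */
final class DumpScanner {

    /** Returns the names of all threads whose section has a line starting with the marker. */
    static List<String> threadsWith(String dump, String marker) {
        List<String> names = new ArrayList<>();
        String current = null;
        for (String line : dump.split("\\r?\\n")) {
            String t = line.trim();
            if (t.startsWith("\"")) {
                // A thread section starts with the quoted thread name.
                int end = t.indexOf('"', 1);
                current = end > 0 ? t.substring(1, end) : t;
            } else if (current != null && t.startsWith(marker) && !names.contains(current)) {
                names.add(current);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened sample in the same layout as the dumps in this article.
        String dump =
              "\"xxx-13-thread-1\" daemon prio=10 waiting for monitor entry\n"
            + "   java.lang.Thread.State: BLOCKED (on object monitor)\n"
            + "\t- waiting to lock <0x0000000727af5b68> (a java.util.concurrent.ConcurrentHashMap)\n"
            + "\n"
            + "\"main\" prio=10 runnable\n"
            + "   java.lang.Thread.State: RUNNABLE\n"
            + "\t- locked <0x0000000727af5b68> (a java.util.concurrent.ConcurrentHashMap)\n";
        System.out.println("waiting: " + threadsWith(dump, "- waiting to lock <0x0000000727af5b68>"));
        System.out.println("holding: " + threadsWith(dump, "- locked <0x0000000727af5b68>"));
    }
}
```

The same helper works for either marker, so one pass can group the waiters and a second pass can look for the thread that reports having locked the monitor.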
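Now we need to know which thread holds the lock on object 0x0000000727af5b68 without releasing it, causing the other threads to be blocked. To find the owner, search the thread dump for the string "locked <0x0000000727af5b68>". Since an object monitor can be owned by only one thread at a time, there should be exactly one owner. The search turned up the following stack: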
"main" prio=10 tid=0x00007fa4a0018000 nid=0x4889e runnable [0x00007fa4a8fea000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.put(HashMap.java:494)
at org.apache.thrift.meta_data.FieldMetaData.addStructMetaDataMap(FieldMetaData.java:49)
...
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at com.sun.proxy.$Proxy346.<clinit>(Unknown Source)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.reflect.Proxy.newInstance(Proxy.java:764)
at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:755)
at org.springframework.aop.framework.JdkDynamicAopProxy.getProxy(JdkDynamicAopProxy.java:122)
at org.springframework.aop.framework.JdkDynamicAopProxy.getProxy(JdkDynamicAopProxy.java:112)
at org.springframework.aop.framework.ProxyFactory.getProxy(ProxyFactory.java:96)
...
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeCustomInitMethod(AbstractAutowireCapableBeanFactory.java:1759)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1696)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1626)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:553)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:481)
at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:312)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230)
- locked <0x0000000727af5b68> (a java.util.concurrent.ConcurrentHashMap)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:308)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:756)
at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:867)
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:542)
- locked <0x00000007292f2fb0> (a java.lang.Object)
at org.springframework.boot.context.embedded.EmbeddedWebApplicationContext.refresh(EmbeddedWebApplicationContext.java:123)
at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:666)
at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:353)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:300)
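It is the main thread. It holds the lock on object 0x0000000727af5b68, and its state is RUNNABLE, so it is not blocked. This is therefore not a deadlock. More likely, the main thread is stuck in an infinite loop and never releases the lock it holds, so every other thread that needs the lock on 0x0000000727af5b68 is BLOCKED.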
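The same distinction (a real deadlock versus a busy thread that never releases its lock) can also be checked from inside a JVM through the standard `java.lang.management.ThreadMXBean` API. The following is only a miniature reproduction under stated assumptions: the class `LockOwnerCheck`, the thread names `holder` and `waiter`, and the spinning loop are invented for the example and have nothing to do with the Spring or thrift code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical example: who owns the monitor a thread is blocked on, and is it a deadlock? */
final class LockOwnerCheck {

    static String describe(Thread blocked) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(blocked.getId());
        if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
            return "not blocked";
        }
        ThreadInfo owner = info.getLockOwnerId() < 0 ? null : mx.getThreadInfo(info.getLockOwnerId());
        // null means the JVM found no cycle of threads waiting on each other's monitors.
        boolean deadlock = mx.findMonitorDeadlockedThreads() != null;
        return "owner=" + info.getLockOwnerName()
                + " ownerState=" + (owner == null ? "unknown" : owner.getThreadState())
                + " deadlock=" + deadlock;
    }

    /** "holder" spins while holding a lock; "waiter" blocks trying to enter the same monitor. */
    static String demo() {
        final Object lock = new Object();
        final AtomicBoolean stop = new AtomicBoolean(false);
        final CountDownLatch held = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                while (!stop.get()) { /* busy loop: runnable, but never releases the lock */ }
            }
        }, "holder");
        Thread waiter = new Thread(() -> {
            synchronized (lock) { /* reached only after holder leaves its block */ }
        }, "waiter");
        try {
            holder.start();
            held.await();
            waiter.start();
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            return describe(waiter);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            stop.set(true); // let both threads finish
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A runnable owner together with no detected deadlock is the pattern seen in the dump above: the lock is held by a thread that is still executing, just not making progress.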
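The top of the stack is java.util.HashMap.put(HashMap.java:494). That was a jolt: it looked like we had found a bug in thrift. This deserves a closer look.
HashMap has never been thread-safe in any Java version to date. If a Map will be accessed concurrently, you cannot use HashMap, or you will run into problems of one kind or another. We are on Java 7, where concurrent access to a HashMap from multiple threads can send a thread into an infinite loop.
To illustrate, here is the put method of HashMap: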
public V put(K key, V value) {
    if (table == EMPTY_TABLE) {
        inflateTable(threshold);
    }
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key);
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) { // ----- line 494
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}
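The loop never terminates when e.next == e, that is, when a node's next pointer refers to the node itself. To verify this, we took a heap dump and loaded it into the Eclipse Memory Analyzer Tool (referred to as MAT below), then clicked the button shown in the screenshot to get the thread list:
MAT shows each thread's name, its current stack, and the objects it holds, which is very convenient when troubleshooting memory problems. Locate the main thread:
Matching this against the loop in HashMap.put, the e at that moment is the java.util.HashMap.Entry at 0x73ae67608. Its next field points back to itself, which is why the main thread executing this code loops forever.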
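To make the failure mode concrete, here is a small sketch of a bucket chain whose last node points to itself. `Node` is a simplified stand-in for `HashMap.Entry` (key and next pointer only), and the step limit in `stepsUntilEnd` is something added for the demonstration; the real loop in `put` has no such limit, which is exactly the problem.

```java
/** Hypothetical illustration of a self-referencing node in a singly linked bucket chain. */
final class SelfLoopDemo {

    /** Simplified stand-in for a Java 7 HashMap.Entry. */
    static final class Node {
        final String key;
        Node next;
        Node(String key) { this.key = key; }
    }

    /** Walks the chain the way HashMap.put does, but gives up after maxSteps nodes. */
    static int stepsUntilEnd(Node head, int maxSteps) {
        int steps = 0;
        for (Node e = head; e != null; e = e.next) {
            if (++steps >= maxSteps) {
                return -1; // without the limit this loop would never exit
            }
        }
        return steps;
    }

    /** Floyd's cycle detection: true if following next pointers never reaches null. */
    static boolean hasCycle(Node head) {
        Node slow = head;
        Node fast = head;
        while (fast != null && fast.next != null) {
            slow = slow.next;
            fast = fast.next.next;
            if (slow == fast) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node a = new Node("a");
        Node b = new Node("b");
        a.next = b;
        System.out.println("healthy chain, cycle? " + hasCycle(a));
        b.next = b; // the corrupted state seen in the heap dump: next points to itself
        System.out.println("corrupted chain, cycle? " + hasCycle(a));
        System.out.println("bounded walk result: " + stepsUntilEnd(a, 1000));
    }
}
```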
For how the HashMap infinite loop arises in the first place, see the article 为什么HashMap不线程安全 ("Why HashMap Is Not Thread-Safe").
Solution
The HashMap in question belongs to thrift. Here is the relevant class, decompiled from the jar:
//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//

package org.apache.thrift.meta_data;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import org.apache.thrift.TBase;
import org.apache.thrift.TFieldIdEnum;

public class FieldMetaData implements Serializable {
    public final String fieldName;
    public final byte requirementType;
    public final FieldValueMetaData valueMetaData;
    private static Map<Class<? extends TBase>, Map<? extends TFieldIdEnum, FieldMetaData>> structMap = new HashMap();

    public FieldMetaData(String name, byte req, FieldValueMetaData vMetaData) {
        this.fieldName = name;
        this.requirementType = req;
        this.valueMetaData = vMetaData;
    }

    public static void addStructMetaDataMap(Class<? extends TBase> sClass, Map<? extends TFieldIdEnum, FieldMetaData> map) {
        structMap.put(sClass, map);
    }

    public static Map<? extends TFieldIdEnum, FieldMetaData> getStructMetaDataMap(Class<? extends TBase> sClass) {
        if(!structMap.containsKey(sClass)) {
            try {
                sClass.newInstance();
            } catch (InstantiationException var2) {
                throw new RuntimeException("InstantiationException for TBase class: " + sClass.getName() + ", message: " + var2.getMessage());
            } catch (IllegalAccessException var3) {
                throw new RuntimeException("IllegalAccessException for TBase class: " + sClass.getName() + ", message: " + var3.getMessage());
            }
        }
        return (Map)structMap.get(sClass);
    }
}
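Given this diagnosis, there are two ways to fix the problem. One is to make structMap a concurrency-safe ConcurrentHashMap. The other is to synchronize access to structMap, that is, to add the synchronized keyword to the methods (or code blocks) that operate on it.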
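The two options can be compared side by side in a minimal sketch. This is not the thrift code: the `Registry` interface, both implementations, and the `stress` helper are hypothetical names, and `Integer` keys stand in for the `Class` keys of structMap so that many distinct entries can be registered from several threads.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the two fixes: a concurrent map versus synchronized access. */
final class SafeRegistryDemo {

    interface Registry {
        void register(Integer key, String meta);
        String lookup(Integer key);
        int size();
    }

    /** Option 1: the map itself is safe for concurrent use. */
    static final class ConcurrentRegistry implements Registry {
        private final Map<Integer, String> map = new ConcurrentHashMap<>();
        public void register(Integer key, String meta) { map.put(key, meta); }
        public String lookup(Integer key) { return map.get(key); }
        public int size() { return map.size(); }
    }

    /** Option 2: a plain HashMap, with every access guarded by the same monitor. */
    static final class SynchronizedRegistry implements Registry {
        private final Map<Integer, String> map = new HashMap<>();
        public synchronized void register(Integer key, String meta) { map.put(key, meta); }
        public synchronized String lookup(Integer key) { return map.get(key); }
        public synchronized int size() { return map.size(); }
    }

    /** Registers threads * perThread distinct keys concurrently and returns the final size. */
    static int stress(Registry registry, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    registry.register(base + i, "meta-" + (base + i));
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println("concurrent:   " + stress(new ConcurrentRegistry(), 4, 5000));
        System.out.println("synchronized: " + stress(new SynchronizedRegistry(), 4, 5000));
    }
}
```

With the synchronized variant, reads must take the same lock as writes; guarding only the put would still let a reader traverse the table while it is being resized.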
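Excited, I was about to send thrift a PR, but then came across the following code:
As you can see, thrift has already fixed the problem, using the synchronized approach. Upgrading to 0.9.3 or later avoids running into it again.
That change was made to resolve the issue THRIFT-1618. To check whether it matches our problem, we can look the issue up:
The issue's status is CLOSED: it has been resolved, and its description matches our situation.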
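Conclusion
To summarize the analysis above: concurrent access to a HashMap from multiple threads hit the Java 7 HashMap resize race, which left a cycle in a bucket's linked list and sent a thread into an infinite loop. The object lock held by that looping thread was never released, so every other thread requesting it was BLOCKED. Upgrading thrift to 0.9.3 or later solves the problem.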