bug背景:由于甲方数据库机房这边经常网络波动,或有相关防火墙规则修改,导致线上应用的数据库连接会断开,连接池错误,一度只能通过不断重启应用才可以解决。
相关组件:Spting-Boot-1.5.3.RELEASE
、com.alibaba.druid-1.0.31
bug复现:本地将项目启动之后,断网/或者关闭测试库连接,模拟网络波动,bug得以复现。
主要异常信息:
c.a.datasource.pool.ConnectionPool - atomikos connection pool 'pw': error creating proxy of connection an AtomikosXAPooledConnection with a SessionHandleState with 0 context(s)
com.atomikos.datasource.pool.CreateConnectionException: an AtomikosXAPooledConnection with a SessionHandleState with 0 context(s): connection is erroneous
at com.atomikos.jdbc.AtomikosXAPooledConnection.testUnderlyingConnection(AtomikosXAPooledConnection.java:116)
at com.atomikos.datasource.pool.AbstractXPooledConnection.createConnectionProxy(AbstractXPooledConnection.java:71)
at com.atomikos.datasource.pool.ConnectionPool.retrieveFirstAvailableConnection(ConnectionPool.java:209)
at com.atomikos.datasource.pool.ConnectionPool.retrieveFirstAvailableConnectionAndGrowPoolIfNecessary(ConnectionPool.java:173)
at com.atomikos.datasource.pool.ConnectionPool.findOrWaitForAnAvailableConnection(ConnectionPool.java:160)
at com.atomikos.datasource.pool.ConnectionPool.borrowConnection(ConnectionPool.java:151)
at com.atomikos.jdbc.AbstractDataSourceBean.getConnection(AbstractDataSourceBean.java:342)
at com.atomikos.jdbc.AbstractDataSourceBean.getConnection(AbstractDataSourceBean.java:394)
at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122)
at org.hibernate.internal.AbstractSessionImpl$NonContextualJdbcConnectionAccess.obtainConnection(AbstractSessionImpl.java:386)
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:87)
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection(LogicalConnectionManagedImpl.java:112)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl.connection(StatementPreparerImpl.java:47)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$5.doPrepare(StatementPreparerImpl.java:146)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$StatementPreparationTemplate.prepareStatement(StatementPreparerImpl.java:172)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl.prepareQueryStatement(StatementPreparerImpl.java:148)
at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1929)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1898)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1876)
at org.hibernate.loader.Loader.doQuery(Loader.java:919)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:336)
at org.hibernate.loader.Loader.doList(Loader.java:2617)
at org.hibernate.loader.Loader.doList(Loader.java:2600)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2429)
at org.hibernate.loader.Loader.list(Loader.java:2424)
at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:501)
at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:371)
at org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:216)
at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1326)
at org.hibernate.internal.QueryImpl.list(QueryImpl.java:87)
at org.hibernate.jpa.internal.QueryImpl.list(QueryImpl.java:606)
at org.hibernate.jpa.internal.QueryImpl.getSingleResult(QueryImpl.java:529)
at org.hibernate.jpa.criteria.compile.CriteriaQueryTypeQueryAdapter.getSingleResult(CriteriaQueryTypeQueryAdapter.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.orm.jpa.SharedEntityManagerCreator$DeferredQueryInvocationHandler.invoke(SharedEntityManagerCreator.java:372)
at com.sun.proxy.$Proxy1483.getSingleResult(Unknown Source)
由异常信息可知应该是因为分布式事务连接池有问题,遂查看项目连接池相关的配置。核心代码如下:
public static DataSource createDataSource(Map<String, Object> properties, String id) throws Exception {
DruidXADataSource datasource = new DruidXADataSource();
datasource.setFilters("stat,slf4j");
DruidDataSourceFactory.config(datasource, properties);
AtomikosDataSourceBean ds = new AtomikosDataSourceBean();
ds.setXaDataSource(datasource);
ds.setUniqueResourceName(id);
ds.setPoolSize(5);
//这里是解决方案,之前是没有的
//设置线程维护时间间隔 秒为单位
//ds.setMaintenanceInterval(10);
return ds;
}
由此可知,分布式事务连接池底层使用的是druid,直觉告诉我会不会是druid连接池断开重连的机制有问题。遂百度相关配置以及查阅github-issues,进行配置之后发现不好使,改报错还是报错。druid连接池自己的断线重连没问题。于是将目标转移到AtomikosDataSourceBean
这里,通过查阅相关博客得知:分布式事务连接池使用druid底层连接创建分布式数据源,不是直接与数据库建立连接,属于借用关系,连接使用完毕之后会归还给druid进行管理。
通过查阅异常信息-at com.atomikos.jdbc.AtomikosXAPooledConnection.testUnderlyingConnection(AtomikosXAPooledConnection.java:116)
protected void testUnderlyingConnection() throws CreateConnectionException {
if ( isErroneous() ) throw new CreateConnectionException ( this + ": connection is erroneous" );
String testQuery = getTestQuery();
if (testQuery != null) {
if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": testing connection with query [" + testQuery + "]" );
Statement stmt = null;
try {
stmt = connection.createStatement();
//use execute instead of executeQuery - cf case 58830
stmt.execute(testQuery);
stmt.close();
} catch ( Exception e) {
//catch any Exception - cf case 22198
throw new CreateConnectionException ( "Error executing testQuery" , e );
}
if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": connection tested OK" );
}
else {
if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": no test query, skipping test" );
}
}
public boolean isErroneous()
{
return sessionHandleState.isErroneous();
}
得知:AtomikosDataSourceBean 在获取连接时会检测连接的有效性,由于数据库断开连接,druid中的连接都是不可用的,所以抛出此异常。
解决方法:已知druid中数据库连接池在网络恢复之后会重连,剔除掉不可用连接。但是分布式数据源连接池这里会有缓存,导致两边连接状态不同步。即需要解决分布式数据连接池的自检,及时剔除不可用连接。
AtomikosDataSourceBean默认使用的连接池是由com.atomikos.datasource.pool.ConnectionPool
实现的。其中由自己的维护器
private void launchMaintenanceTimer() {
int maintenanceInterval = properties.getMaintenanceInterval();
//不指定间隔顺序的话-默认一分钟之行一次
if ( maintenanceInterval <= 0 ) {
if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": using default maintenance interval..." );
maintenanceInterval = DEFAULT_MAINTENANCE_INTERVAL;
}
maintenanceTimer = new PooledAlarmTimer ( maintenanceInterval * 1000 );
maintenanceTimer.addAlarmTimerListener(new AlarmTimerListener() {
public void alarm(AlarmTimer timer) {
reapPool();
removeConnectionsThatExceededMaxLifetime();
addConnectionsIfMinPoolSizeNotReached();
removeIdleConnectionsIfMinPoolSizeExceeded();
}
});
TaskManager.getInstance().executeTask ( maintenanceTimer );
}
连接池回收处理定时器,会根据maintenanceinterval设置的重复时间进行重复执行(不指定的话默认一分钟一次),每次执行会:
-
reapPool()
是否有超过最大数据获取时间的连接,如果有,这抛出获取时间超时的异常 -
removeConnectionsThatExceededMaxLifetime()
删除超过最大生命周期的没有正在使用的连接 -
addConnectionsIfMinPoolSizeNotReached()
添加连接数达到最小连接 -
removeIdleConnectionsIfMinPoolSizeExceeded()
删除连接数,如果超过最大连接
public synchronized void reapPool()
{
long maxInUseTime = properties.getReapTimeout();
if ( connections == null || maxInUseTime <= 0 ) return;
if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": reaping old connections" );
Iterator<XPooledConnection> it = connections.iterator();
while ( it.hasNext() ) {
XPooledConnection xpc = it.next();
long lastTimeReleased = xpc.getLastTimeAcquired();
boolean inUse = !xpc.isAvailable();
long now = System.currentTimeMillis();
if ( inUse && ( ( now - maxInUseTime * 1000 ) > lastTimeReleased ) ) {
if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": connection in use for more than " + maxInUseTime + "s, reaping it: " + xpc );
xpc.reap();
EventPublisher.publish(new PooledConnectionReapedEvent(properties.getUniqueResourceName(),xpc));
}
}
logCurrentPoolSize();
}
四个方法每个中都会检测连接有效性。剔除不可用连接。boolean inUse = !xpc.isAvailable();
至此,得知分布式事务连接池的检测周期默认一分钟一次、跟druid不一致。线上环境突发网络波动不允许一分钟没响应。解决方案:将分布式事务的连接池周期调整为10s,即可解决。