springboot+atomikos+druid: connection pool hangs after database reconnection

Bug background: the customer's database server room frequently suffers network jitter, and firewall rules there are occasionally changed. As a result, the application's database connections get dropped and the connection pool ends up in an error state; for a while the only workaround was to repeatedly restart the application.

Components involved: Spring-Boot-1.5.3.RELEASE, com.alibaba.druid-1.0.31

Reproducing the bug: start the project locally, then disconnect the network or shut down the test database to simulate network jitter; the bug reproduces.

Key exception:
c.a.datasource.pool.ConnectionPool - atomikos connection pool 'pw': error creating proxy of connection an AtomikosXAPooledConnection with a SessionHandleState with 0 context(s)
com.atomikos.datasource.pool.CreateConnectionException: an AtomikosXAPooledConnection with a SessionHandleState with 0 context(s): connection is erroneous
    at com.atomikos.jdbc.AtomikosXAPooledConnection.testUnderlyingConnection(AtomikosXAPooledConnection.java:116)
    at com.atomikos.datasource.pool.AbstractXPooledConnection.createConnectionProxy(AbstractXPooledConnection.java:71)
    at com.atomikos.datasource.pool.ConnectionPool.retrieveFirstAvailableConnection(ConnectionPool.java:209)
    at com.atomikos.datasource.pool.ConnectionPool.retrieveFirstAvailableConnectionAndGrowPoolIfNecessary(ConnectionPool.java:173)
    at com.atomikos.datasource.pool.ConnectionPool.findOrWaitForAnAvailableConnection(ConnectionPool.java:160)
    at com.atomikos.datasource.pool.ConnectionPool.borrowConnection(ConnectionPool.java:151)
    at com.atomikos.jdbc.AbstractDataSourceBean.getConnection(AbstractDataSourceBean.java:342)
    at com.atomikos.jdbc.AbstractDataSourceBean.getConnection(AbstractDataSourceBean.java:394)
    at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122)
    at org.hibernate.internal.AbstractSessionImpl$NonContextualJdbcConnectionAccess.obtainConnection(AbstractSessionImpl.java:386)
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:87)
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection(LogicalConnectionManagedImpl.java:112)
    at org.hibernate.engine.jdbc.internal.StatementPreparerImpl.connection(StatementPreparerImpl.java:47)
    at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$5.doPrepare(StatementPreparerImpl.java:146)
    at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$StatementPreparationTemplate.prepareStatement(StatementPreparerImpl.java:172)
    at org.hibernate.engine.jdbc.internal.StatementPreparerImpl.prepareQueryStatement(StatementPreparerImpl.java:148)
    at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1929)
    at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1898)
    at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1876)
    at org.hibernate.loader.Loader.doQuery(Loader.java:919)
    at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:336)
    at org.hibernate.loader.Loader.doList(Loader.java:2617)
    at org.hibernate.loader.Loader.doList(Loader.java:2600)
    at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2429)
    at org.hibernate.loader.Loader.list(Loader.java:2424)
    at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:501)
    at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:371)
    at org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:216)
    at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1326)
    at org.hibernate.internal.QueryImpl.list(QueryImpl.java:87)
    at org.hibernate.jpa.internal.QueryImpl.list(QueryImpl.java:606)
    at org.hibernate.jpa.internal.QueryImpl.getSingleResult(QueryImpl.java:529)
    at org.hibernate.jpa.criteria.compile.CriteriaQueryTypeQueryAdapter.getSingleResult(CriteriaQueryTypeQueryAdapter.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.orm.jpa.SharedEntityManagerCreator$DeferredQueryInvocationHandler.invoke(SharedEntityManagerCreator.java:372)
    at com.sun.proxy.$Proxy1483.getSingleResult(Unknown Source)
The exception suggests the problem lies in the distributed-transaction (XA) connection pool, so I looked at the project's pool configuration. The core code:
public static DataSource createDataSource(Map<String, Object> properties, String id) throws Exception {
        DruidXADataSource datasource = new DruidXADataSource();
        datasource.setFilters("stat,slf4j");
        DruidDataSourceFactory.config(datasource, properties);
        AtomikosDataSourceBean ds = new AtomikosDataSourceBean();
        ds.setXaDataSource(datasource);
        ds.setUniqueResourceName(id);
        ds.setPoolSize(5);
        //this line is the fix; it was missing before
        //set the maintenance thread's interval, in seconds
        //ds.setMaintenanceInterval(10);
        return ds;
    }

So the XA pool sits on top of druid, and my first guess was that druid's own disconnect/reconnect mechanism was broken. I searched for the relevant configuration options and browsed the project's GitHub issues, but after applying those settings the same error kept appearing; druid's own reconnection turned out to work fine. I then shifted attention to AtomikosDataSourceBean. From related blog posts: the XA pool builds the distributed data source on top of druid's underlying connections rather than connecting to the database directly. It merely borrows connections, and when a connection is done it is returned to druid for management.
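For reference, druid's dead-connection eviction is driven by its standard validation settings; a minimal sketch of that configuration (the specific values here are illustrative, not taken from the original project):

```java
// Standard DruidDataSource validation settings (values are illustrative):
datasource.setValidationQuery("SELECT 1");            // query used to test connections
datasource.setTestWhileIdle(true);                    // validate idle connections during eviction runs
datasource.setTestOnBorrow(false);                    // skip the per-borrow check for performance
datasource.setTimeBetweenEvictionRunsMillis(60_000);  // eviction-check period in milliseconds
```

With settings along these lines, druid evicts broken connections once the network recovers, which is consistent with druid's side of the problem resolving itself.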

Following the stack trace into com.atomikos.jdbc.AtomikosXAPooledConnection.testUnderlyingConnection(AtomikosXAPooledConnection.java:116):

protected void testUnderlyingConnection() throws CreateConnectionException {
        if ( isErroneous() ) throw new CreateConnectionException ( this + ": connection is erroneous" );
        String testQuery = getTestQuery();
        if (testQuery != null) {
            if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": testing connection with query [" + testQuery + "]" );
            Statement stmt = null;
            try {
                stmt = connection.createStatement();
                //use execute instead of executeQuery - cf case 58830
                stmt.execute(testQuery);
                stmt.close();
            } catch ( Exception e) {
                //catch any Exception - cf case 22198
                throw new CreateConnectionException ( "Error executing testQuery" ,  e );
            }
            if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": connection tested OK" );
        }
        else {
            if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": no test query, skipping test" );
        }
    }
public boolean isErroneous() 
    {
        return sessionHandleState.isErroneous();
    }

Conclusion: AtomikosDataSourceBean validates a connection when handing it out. Because the database connection had been dropped, the underlying druid connections were all unusable, hence the exception.
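Note that testUnderlyingConnection() only issues a query when a test query is configured; otherwise it relies solely on the erroneous flag. Assuming the standard AbstractDataSourceBean setter, a test query can be enabled like this (a hedged sketch; the query text is illustrative):

```java
// Have Atomikos actively validate a borrowed connection on checkout
// via AbstractDataSourceBean.setTestQuery (query text is illustrative):
ds.setTestQuery("SELECT 1");
```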

The fix: we know druid's pool reconnects once the network recovers and evicts the unusable connections, but the XA pool keeps its own cache of connections, so the two pools' views of connection state drift apart. What we need is for the XA pool to self-check and evict unusable connections promptly.

The pool behind AtomikosDataSourceBean is implemented by com.atomikos.datasource.pool.ConnectionPool, which runs its own maintenance timer:

private void launchMaintenanceTimer() {
        int maintenanceInterval = properties.getMaintenanceInterval();
        //if no interval is configured, it runs once a minute by default
        if ( maintenanceInterval <= 0 ) {
            if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": using default maintenance interval..." );
            maintenanceInterval = DEFAULT_MAINTENANCE_INTERVAL;
        }
        maintenanceTimer = new PooledAlarmTimer ( maintenanceInterval * 1000 );
        maintenanceTimer.addAlarmTimerListener(new AlarmTimerListener() {
            public void alarm(AlarmTimer timer) {
                reapPool();
                removeConnectionsThatExceededMaxLifetime();
                addConnectionsIfMinPoolSizeNotReached();
                removeIdleConnectionsIfMinPoolSizeExceeded();
            }
        });
        TaskManager.getInstance().executeTask ( maintenanceTimer );
    }

The maintenance timer fires at the configured maintenanceInterval (once a minute by default), and on each run it:

  1. reapPool(): reaps connections that have been in use longer than the reapTimeout (the holder subsequently gets an error)
  2. removeConnectionsThatExceededMaxLifetime(): removes idle connections that have exceeded their maximum lifetime
  3. addConnectionsIfMinPoolSizeNotReached(): creates connections until the minimum pool size is reached
  4. removeIdleConnectionsIfMinPoolSizeExceeded(): removes idle connections above the minimum pool size
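The reap decision in step 1 boils down to a single time comparison. A minimal, self-contained restatement of that condition (the class and method names here are mine, not Atomikos API):

```java
public class ReapCheck {

    /**
     * Mirrors the condition in ConnectionPool.reapPool(): an in-use
     * connection is reaped once it has been held longer than
     * reapTimeout seconds; reapTimeout <= 0 disables reaping.
     */
    public static boolean shouldReap(boolean inUse, long lastTimeAcquiredMillis,
                                     long nowMillis, long reapTimeoutSeconds) {
        if (reapTimeoutSeconds <= 0) {
            return false; // reaping disabled, same early-out as reapPool()
        }
        return inUse && (nowMillis - reapTimeoutSeconds * 1000 > lastTimeAcquiredMillis);
    }

    public static void main(String[] args) {
        // held for 20s with a 10s reap timeout -> reaped
        System.out.println(shouldReap(true, 0L, 20_000L, 10L));
        // available connections are never reaped
        System.out.println(shouldReap(false, 0L, 20_000L, 10L));
    }
}
```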
public synchronized void reapPool()
    {
        long maxInUseTime = properties.getReapTimeout();
        if ( connections == null || maxInUseTime <= 0 ) return;

        if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": reaping old connections" );

        Iterator<XPooledConnection> it = connections.iterator();
        while ( it.hasNext() ) {
            XPooledConnection xpc = it.next();
            long lastTimeReleased = xpc.getLastTimeAcquired();
            boolean inUse = !xpc.isAvailable();

            long now = System.currentTimeMillis();
            if ( inUse && ( ( now - maxInUseTime * 1000 ) > lastTimeReleased ) ) {
                if ( LOGGER.isDebugEnabled() ) LOGGER.logDebug ( this + ": connection in use for more than " + maxInUseTime + "s, reaping it: " + xpc );
                xpc.reap();
                EventPublisher.publish(new PooledConnectionReapedEvent(properties.getUniqueResourceName(),xpc));
            }
        }
        logCurrentPoolSize();
    }

Each of these maintenance steps checks connection state and evicts unusable connections; note boolean inUse = !xpc.isAvailable(); in reapPool().

So the XA pool's maintenance cycle defaults to once a minute, out of sync with druid, and in production a sudden network blip must not leave the service unresponsive for a whole minute. The fix: shorten the XA pool's maintenance interval to 10 s, which resolves the issue.
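Put together, the fixed factory method differs from the original only by the now-active maintenance-interval line (10 s matches the value chosen above; everything else is unchanged):

```java
public static DataSource createDataSource(Map<String, Object> properties, String id) throws Exception {
    DruidXADataSource datasource = new DruidXADataSource();
    datasource.setFilters("stat,slf4j");
    DruidDataSourceFactory.config(datasource, properties);
    AtomikosDataSourceBean ds = new AtomikosDataSourceBean();
    ds.setXaDataSource(datasource);
    ds.setUniqueResourceName(id);
    ds.setPoolSize(5);
    // run the Atomikos pool maintenance every 10 seconds instead of the 60 s
    // default, so dead connections are evicted on roughly the same cadence
    // as druid's own checks
    ds.setMaintenanceInterval(10);
    return ds;
}
```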

References

- atomikos JTA 源码解读 (Atomikos JTA source-code walkthrough)
- 详解springboot+atomikos+druid 数据库连接失效分析 (detailed analysis of springboot+atomikos+druid connection invalidation)
