数据库连接池配置不当导致的full gc问题记录

背景

线上有个流量比较大的服务，qps日常在2000-3000之间，请求方对耗时要求比较高，响应时间要求在300ms以内，服务上线以后，运行也比较平稳。某天夜里，突然有几千个dubbo请求超时，找了几个请求查看调用链路发现超时的地方各不相同，服务依赖的接口方均表示服务没有异常。
这种情况下，一般就是服务自身的问题了，仔细看了下dubbo的超时，发现超时的都是同一个provider，于是看了下这个出问题的provider，看了下这台机器的gc情况，发现超时时有一次full gc，gc耗时达到了1s以上。

full gc时间过长问题排查

gc日志查看

发现full gc时间过长以后，就去对应的机器上把gc log下载了下来，对gc日志进行分析。

36402.792: [GC (Allocation Failure) 36402.792: [ParNew: 2105440K->7584K(2359296K), 0.0124564 secs] 4517143K->2419327K(4980736K), 0.0129547 secs] [Times: user=0.13 sys=0.00, real=0.01 secs] 
36402.810: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2411743K(2621440K)] 2419601K(4980736K), 0.0037437 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
36402.814: [CMS-concurrent-mark-start]
36403.622: [CMS-concurrent-mark: 0.800/0.808 secs] [Times: user=5.73 sys=0.26, real=0.81 secs] 
36403.623: [CMS-concurrent-preclean-start]
36403.801: [CMS-concurrent-preclean: 0.177/0.178 secs] [Times: user=0.69 sys=0.05, real=0.18 secs] 
36403.801: [CMS-concurrent-abortable-preclean-start]
36403.998: [GC (Allocation Failure) 36403.998: [ParNew: 2104736K->7273K(2359296K), 0.0122722 secs] 4516479K->2419078K(4980736K), 0.0127518 secs] [Times: user=0.13 sys=0.01, real=0.02 secs] 
36404.672: [CMS-concurrent-abortable-preclean: 0.826/0.870 secs] [Times: user=3.39 sys=0.26, real=0.87 secs] 
36404.677: [GC (CMS Final Remark) [YG occupancy: 1097431 K (2359296 K)]36404.677: [GC (CMS Final Remark) 36404.678: [ParNew: 1097431K->7946K(2359296K), 0.0146413 secs] 3509235K->2419797K(4980736K), 0.0151127 secs] [Times: user=0.17 sys=0.00, real=0.02 secs] 
36404.693: [Rescan (parallel) , 0.0044661 secs]36404.697: [weak refs processing, 1.6796757 secs]36406.377: [class unloading, 0.1835045 secs]36406.560: [scrub symbol table, 0.0291043 secs]36406.590: [scrub string table, 0.0029833 secs][1 CMS-remark: 2411851K(2621440K)] 2419797K(4980736K), 2.1253433 secs] [Times: user=2.33 sys=0.00, real=2.13 secs] 
36406.803: [CMS-concurrent-sweep-start]
36409.458: [GC (Allocation Failure) 36409.458: [ParNew: 2105098K->53837K(2359296K), 0.0675226 secs] 4328721K->2277488K(4980736K), 0.0681548 secs] [Times: user=0.86 sys=0.00, real=0.07 secs] 
36410.035: [GC (Allocation Failure) 36410.035: [ParNew: 2150989K->63917K(2359296K), 0.0566114 secs] 4128382K->2041367K(4980736K), 0.0572665 secs] [Times: user=0.71 sys=0.00, real=0.06 secs] 
36410.966: [GC (Allocation Failure) 36410.966: [ParNew: 2161069K->51123K(2359296K), 0.0144844 secs] 3640286K->1530366K(4980736K), 0.0149783 secs] [Times: user=0.16 sys=0.00, real=0.02 secs] 
36411.552: [CMS-concurrent-sweep: 2.287/4.749 secs] [Times: user=23.03 sys=1.41, real=4.75 secs] 
36411.552: [CMS-concurrent-reset-start]
36411.558: [CMS-concurrent-reset: 0.005/0.005 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]

gc日志里面各项代表的意义可以看这篇文章：https://www.jianshu.com/p/a5b3905bf225
我们的老年代收集算法用的是cms算法，现在主要看一下里面会stw的两个阶段耗时，Initial Mark阶段耗时real是0.01秒，耗时很低。然后就是重新标记Final Remark阶段，耗时达到了2.13s，这里面看具体的耗时，weak refs processing, 1.6796757 secs这个占用了比较久的时间，那么这个弱引用处理，究竟在处理什么呢，就要看具体dump出来的堆了。

weak refs processing处理逻辑

weak refs processing的源码如下：

void ReferenceProcessor::process_discovered_references(
  BoolObjectClosure*           is_alive,
  OopClosure*                  keep_alive,
  VoidClosure*                 complete_gc,
  AbstractRefProcTaskExecutor* task_executor) {
  NOT_PRODUCT(verify_ok_to_handle_reflists());
 
  assert(!enqueuing_is_done(), "If here enqueuing should not be complete");
  // Stop treating discovered references specially.
  disable_discovery();
 
  bool trace_time = PrintGCDetails && PrintReferenceGC;
  // Soft references
  {
    TraceTime tt("SoftReference", trace_time, false, gclog_or_tty);
    process_discovered_reflist(_discoveredSoftRefs, _current_soft_ref_policy, true,
                               is_alive, keep_alive, complete_gc, task_executor);
  }
 
  update_soft_ref_master_clock();
 
  // Weak references
  {
    TraceTime tt("WeakReference", trace_time, false, gclog_or_tty);
    process_discovered_reflist(_discoveredWeakRefs, NULL, true,
                               is_alive, keep_alive, complete_gc, task_executor);
  }
 
  // Final references
  {
    TraceTime tt("FinalReference", trace_time, false, gclog_or_tty);
    process_discovered_reflist(_discoveredFinalRefs, NULL, false,
                               is_alive, keep_alive, complete_gc, task_executor);
  }
 
  // Phantom references
  {
    TraceTime tt("PhantomReference", trace_time, false, gclog_or_tty);
    process_discovered_reflist(_discoveredPhantomRefs, NULL, false,
                               is_alive, keep_alive, complete_gc, task_executor);
  }
 
  // Weak global JNI references. It would make more sense (semantically) to
  // traverse these simultaneously with the regular weak references above, but
  // that is not how the JDK1.2 specification is. See #4126360. Native code can
  // thus use JNI weak references to circumvent the phantom references and
  // resurrect a "post-mortem" object.
  {
    TraceTime tt("JNI Weak Reference", trace_time, false, gclog_or_tty);
    if (task_executor != NULL) {
      task_executor->set_single_threaded_mode();
    }
    process_phaseJNI(is_alive, keep_alive, complete_gc);
  }
}

看JVM源码，weak refs processing主要包括SoftReference、WeakReference、FinalReference、PhantomReference以及JNI Weak Reference这五种Reference对象的处理，处理的主要内容是对之前标记的Reference对象重新处理，重新判断是否需要标记（不标记就是要回收的），如果不标记就需要放到refqueue里，等待java ReferenceHandler线程处理。
所以从工作的主要内容看各种Reference的处理时间跟reference的个数成正比。

dump出堆进行分析

怎么查看堆内都有哪些对象呢
初步定位
简单查看对象数目可以使用下面命令，初步定位问题。

// 这里面7代表要查看进程id
jmap -histo:live  7  > dump_7

这个dump_7中就会有堆中对象的个数和占用空间大小信息，这个由于我没有线上机权限，并没有执行。
dump出堆进行分析
这里使用如下命令可以dump出堆，这个操作会影响生产环境应用，慎用

jmap -dump:format=b,file=headdump_7.hprof  7

dump出来堆文件后，将这个文件用mat打开，可以看到内存泄漏暴露大概的地方：

image.png

里面的内容复制出来如下：

The class "com.mysql.cj.jdbc.AbandonedConnectionCleanupThread", loaded by "org.apache.catalina.loader.ParallelWebappClassLoader @ 0x721856020", occupies 119,594,176 (45.21%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]", loaded by "<system class loader>", which occupies 119,594,016 (45.21%) bytes.

Keywords:
com.mysql.cj.jdbc.AbandonedConnectionCleanupThread
org.apache.catalina.loader.ParallelWebappClassLoader @ 0x721856020
java.util.concurrent.ConcurrentHashMap$Node[]

查看详情可以看到，connectionFinalizerPhantomRefs占用了其中大部分空间，下一节会分析下AbandonedConnectionCleanupThread类中的connectionFinalizerPhantomRefs：

image.png

AbandonedConnectionCleanupThread类分析

AbandonedConnectionCleanupThread类主要变量

AbandonedConnectionCleanupThread类维护了一个单例的单线程池，负责定期清理已回收的connection连接。

public class AbandonedConnectionCleanupThread implements Runnable {
   // 存放connection对应的幻象引用集合
    private static final Set<ConnectionFinalizerPhantomReference> connectionFinalizerPhantomRefs = ConcurrentHashMap.newKeySet();
    // 引用队列，从这个里面取连接出来进行释放
    private static final ReferenceQueue<MysqlConnection> referenceQueue = new ReferenceQueue<>();

    private static final ExecutorService cleanupThreadExcecutorService;
    private static Thread threadRef = null;
    private static Lock threadRefLock = new ReentrantLock();

    static {
        cleanupThreadExcecutorService = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "mysql-cj-abandoned-connection-cleanup");
            t.setDaemon(true);
            // Tie the thread's context ClassLoader to the ClassLoader that loaded the class instead of inheriting the context ClassLoader from the current
            // thread, which would happen by default.
            // Application servers may use this information if they attempt to shutdown this thread. By leaving the default context ClassLoader this thread
            // could end up being shut down even when it is shared by other applications and, being it statically initialized, thus, never restarted again.

            ClassLoader classLoader = AbandonedConnectionCleanupThread.class.getClassLoader();
            if (classLoader == null) {
                // This class was loaded by the Bootstrap ClassLoader, so lets tie the thread's context ClassLoader to the System ClassLoader instead.
                classLoader = ClassLoader.getSystemClassLoader();
            }

            t.setContextClassLoader(classLoader);
            return threadRef = t;
        });
        cleanupThreadExcecutorService.execute(new AbandonedConnectionCleanupThread());
    }
}

加入connectionFinalizerPhantomRefs分析

下面看下连接是什么时候加入到connectionFinalizerPhantomRefs的。

// com.mysql.cj.jdbc.AbandonedConnectionCleanupThread
/**
     * Tracks the finalization of a {@link MysqlConnection} object and keeps a reference to its {@link NetworkResources} so that they can be later released.
     * 
     * @param conn
     *            the Connection object to track for finalization
     * @param io
     *            the network resources to close on the connection finalization
     */
    protected static void trackConnection(MysqlConnection conn, NetworkResources io) {
        threadRefLock.lock();
        try {
            if (isAlive()) {
                ConnectionFinalizerPhantomReference reference = new ConnectionFinalizerPhantomReference(conn, io, referenceQueue);
                connectionFinalizerPhantomRefs.add(reference);
            }
        } finally {
            threadRefLock.unlock();
        }
    }

这个trackConnection方法是在生产连接的时候会被调用，代码如下：

// com.mysql.cj.jdbc.ConnectionImpl
/**
     * Creates a connection to a MySQL Server.
     * 
     * @param hostInfo
     *            the {@link HostInfo} instance that contains the host, user and connections attributes for this connection
     * @exception SQLException
     *                if a database access error occurs
     */
    public ConnectionImpl(HostInfo hostInfo) throws SQLException {

        try {
            
          // 省去连接生成过程

        } catch (CJException e1) {
            throw SQLExceptionsMapping.translateException(e1, getExceptionInterceptor());
        }

        try {
            createNewIO(false);

            unSafeQueryInterceptors();
            // 将连接加入到清理线程中的幻象引用集合中
            AbandonedConnectionCleanupThread.trackConnection(this, this.getSession().getNetworkResources());
        } catch (SQLException ex) {
            cleanup(ex);

            // don't clobber SQL exceptions
            throw ex;
        } catch (Exception ex) {
            cleanup(ex);

            throw SQLError
                    .createSQLException(
                            this.propertySet.getBooleanProperty(PropertyKey.paranoid).getValue() ? Messages.getString("Connection.0")
                                    : Messages.getString("Connection.1",
                                            new Object[] { this.session.getHostInfo().getHost(), this.session.getHostInfo().getPort() }),
                            MysqlErrorNumbers.SQL_STATE_COMMUNICATION_LINK_FAILURE, ex, getExceptionInterceptor());
        }

    }

ConnectionFinalizerPhantomReference类

下面看下ConnectionFinalizerPhantomReference连接清理幻象引用的实现,这个类的构造方法中有MysqlConnection 数据库连接对象，NetworkResources 数据库连接资源，ReferenceQueue引用队列，用于在被PhantomReference引用的connection被gc时，JVM会将PhantomReference对象扔到ReferenceQueue中。

 /**
     * {@link PhantomReference} subclass to track {@link MysqlConnection} objects finalization.
     * This class holds a reference to the Connection's {@link NetworkResources} so they it can be later closed.
     */
    private static class ConnectionFinalizerPhantomReference extends PhantomReference<MysqlConnection> {
        private NetworkResources networkResources;

        ConnectionFinalizerPhantomReference(MysqlConnection conn, NetworkResources networkResources, ReferenceQueue<? super MysqlConnection> refQueue) {
            super(conn, refQueue);
            this.networkResources = networkResources;
        }

        void finalizeResources() {
            if (this.networkResources != null) {
                try {
                    this.networkResources.forceClose();
                } finally {
                    this.networkResources = null;
                }
            }
        }
    }

连接关闭的操作

放到引用队列以后，AbandonedConnectionCleanupThread中前面提到的线程池会去执行关闭连接的操作。

public void run() {
        for (;;) {
            try {
                checkThreadContextClassLoader();
                Reference<? extends MysqlConnection> reference = referenceQueue.remove(5000);
                if (reference != null) {
                    // 释放连接资源
                    finalizeResource((ConnectionFinalizerPhantomReference) reference);
                }
            } catch (InterruptedException e) {
                threadRefLock.lock();
                try {
                    threadRef = null;

                    // Finalize remaining references.
                    Reference<? extends MysqlConnection> reference;
                    while ((reference = referenceQueue.poll()) != null) {
                        finalizeResource((ConnectionFinalizerPhantomReference) reference);
                    }
                    connectionFinalizerPhantomRefs.clear();
                } finally {
                    threadRefLock.unlock();
                }
                return;
            } catch (Exception ex) {
                // Nowhere to really log this.
            }
        }
    }

finalizeResource实现：

/**
     * Release resources from the given {@link ConnectionFinalizerPhantomReference} and remove it from the references set.
     * 
     * @param reference
     *            the {@link ConnectionFinalizerPhantomReference} to finalize.
     */
    private static void finalizeResource(ConnectionFinalizerPhantomReference reference) {
        try {
           // 调用引用中回收资源的方法
            reference.finalizeResources();
            reference.clear();
        } finally {
            // 引用从集合中删除
            connectionFinalizerPhantomRefs.remove(reference);
        }
    }

ConnectionFinalizerPhantomReference#finalizeResources方法实现：

 /**
     * {@link PhantomReference} subclass to track {@link MysqlConnection} objects finalization.
     * This class holds a reference to the Connection's {@link NetworkResources} so they it can be later closed.
     */
    private static class ConnectionFinalizerPhantomReference extends PhantomReference<MysqlConnection> {
        private NetworkResources networkResources;

        ConnectionFinalizerPhantomReference(MysqlConnection conn, NetworkResources networkResources, ReferenceQueue<? super MysqlConnection> refQueue) {
            super(conn, refQueue);
            this.networkResources = networkResources;
        }

        void finalizeResources() {
            if (this.networkResources != null) {
                try {
                   // 关闭其中的网络连接资源
                    this.networkResources.forceClose();
                } finally {
                    this.networkResources = null;
                }
            }
        }
    }

看到这里大家应该明白，是jdbc为每个connection都生成了一个ConnectionFinalizerPhantomReference，目的是为了当connection对象回收时，顺便回收相关资源。这其实是一个保底操作，是怕connnection资源被上层的连接池或者使用者忘记close，从而导致资源泄漏。

问题解决方式

现在知道了为什么ConnectionFinalizerPhantomReference会产生，下面分析下为什么会有这么多的ConnectionFinalizerPhantomReference，我们使用的是hikari线程池，hikari连接池参数配置介绍见https://www.jianshu.com/p/eb85103f29d6。
我们现有的配置如下：

最小连接数没设置，默认会和最大连接数一样，是100
maxLifeTime用的是默认值，30min，30分钟后连接会被回收
也就是说每30分钟都会产生100个无效的连接，一天会产生4800个ConnectionFinalizerPhantomReference。

这个问题优化的方式有两种：
调整数据库连接池配置
现在连接数100个过多，默认的10个就够用了，这样连接数的产生就会小10倍，然后可以调整maxLifeTime，数据库默认的连接空闲时间是8小时，maxLifeTime可以调整为6个小时，这样每天只会产生40个连接，这个ConnectionFinalizerPhantomReference过多的问题也就能解决了。
删除connectionFinalizerPhantomRefs集合
数据库连接的释放有连接池保证，这个保底机制其实是多余的，可以起个定时任务线程，去清理这个connectionFinalizerPhantomRefs集合。
下面是一个旧版本的清理示例代码。

image.png

参考连接

PhantomReference导致CMS GC耗时严重
 PhantomReference 引发的GC问题

最后编辑于：2022.05.17 21:57:44

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 211,884评论 6赞 492
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 90,347评论 3赞 385
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 157,435评论 0赞 348
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 56,509评论 1赞 284
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 65,611评论 6赞 386
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 49,837评论 1赞 290
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,987评论 3赞 408
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 37,730评论 0赞 267
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 44,194评论 1赞 303
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 36,525评论 2赞 327
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 38,664评论 1赞 340
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 34,334评论 4赞 330
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,944评论 3赞 313
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 30,764评论 0赞 21
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,997评论 1赞 266
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 46,389评论 2赞 360
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 43,554评论 2赞 349

数据库连接池配置不当导致的full gc问题记录

背景

full gc时间过长问题排查

gc日志查看

weak refs processing处理逻辑

dump出堆进行分析

AbandonedConnectionCleanupThread类分析

AbandonedConnectionCleanupThread类主要变量

加入connectionFinalizerPhantomRefs分析

ConnectionFinalizerPhantomReference类

连接关闭的操作

问题解决方式

参考连接

推荐阅读更多精彩内容