📅 An OOM pitfall I ran into

I recently wrote a small crawler. Its second version would fail with an OOM error after running for about two days.

2017-09-15 20:29:51.038 INFO 15128 --- [pool-1-thread-1] c.cocal.service.Impl.UdpNettyServerImpl : ping ip:1.235.115.177,port:55476
2017-09-15 20:31:31.161 INFO 15128 --- [pool-2-thread-1] c.cocal.service.Impl.UdpNettyServerImpl : queue size is not emputy ! size : 1 ping size : 4350067
2017-09-15 20:30:46.924 INFO 15128 --- [pool-1-thread-2] c.cocal.service.Impl.UdpNettyServerImpl : findNode ip:5.158.127.22,port:7881
2017-09-15 20:31:32.937 ERROR 15128 --- [Druid-ConnectionPool-Create-1406004470] com.alibaba.druid.pool.DruidDataSource : create connection error, url: *****************, errorCode 0, state S1000
java.sql.SQLException: java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964) ~[mysql-connector-java-5.1.43.jar!/:5.1.43]

The symptom was that the JVM kept running full GCs back to back, which left the whole application stuck in Stop The World pauses.

2017-09-14T18:34:41.025+0800: 75039.008: [GC (Allocation Failure) [PSYoungGen: 21184K->2656K(22528K)] 705452K->687132K(708096K), 0.0414546 secs] [Times: user=0.14 sys=0.00, real=0.04 secs]
2017-09-14T18:34:53.920+0800: 75051.902: [GC (Allocation Failure) [PSYoungGen: 21088K->2720K(22528K)] 705564K->687396K(708096K), 0.0331795 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
2017-09-14T18:35:08.749+0800: 75066.732: [GC (Allocation Failure) [PSYoungGen: 21152K->2720K(22528K)] 705828K->687564K(708096K), 0.0268483 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
2017-09-14T18:35:23.320+0800: 75081.303: [GC (Allocation Failure) [PSYoungGen: 21152K->2720K(22528K)] 705996K->687756K(708096K), 0.0341229 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
2017-09-14T18:35:36.746+0800: 75094.729: [GC (Allocation Failure) [PSYoungGen: 21152K->2720K(22528K)] 706188K->687916K(708096K), 0.0176401 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
2017-09-14T18:35:36.763+0800: 75094.746: [Full GC (Ergonomics) [PSYoungGen: 2720K->1617K(22528K)] [ParOldGen: 685196K->685230K(685568K)] 687916K->686847K(708096K), [Metaspace: 53408K->53408K(1097728K)], 3.8734480 secs] [Times: user=13.44 sys=0.08, real=3.87 secs]
2017-09-14T18:35:52.782+0800: 75110.765: [Full GC (Ergonomics) [PSYoungGen: 20049K->1616K(22528K)] [ParOldGen: 685230K->685341K(685568K)] 705279K->686957K(708096K), [Metaspace: 53408K->53408K(1097728K)], 4.6482656 secs] [Times: user=16.01 sys=0.06, real=4.65 secs]
2017-09-14T18:36:09.593+0800: 75127.576: [Full GC (Ergonomics) [PSYoungGen: 20048K->1616K(22528K)] [ParOldGen: 685341K->685458K(685568K)] 705389K->687075K(708096K), [Metaspace: 53408K->53408K(1097728K)], 4.6473164 secs] [Times: user=16.05 sys=0.09, real=4.65 secs]
2017-09-14T18:36:25.969+0800: 75143.952: [Full GC (Ergonomics) [PSYoungGen: 20048K->1642K(22528K)] [ParOldGen: 685458K->685561K(685568K)] 705507K->687203K(708096K), [Metaspace: 53408K->53408K(1097728K)], 3.4603270 secs] [Times: user=12.19 sys=0.00, real=3.46 secs]
2017-09-14T18:36:40.240+0800: 75158.223: [Full GC (Ergonomics) [PSYoungGen: 18432K->1786K(22528K)] [ParOldGen: 685561K->685528K(685568K)] 703993K->687314K(708096K), [Metaspace: 53408K->53408K(1097728K)], 4.6423260 secs] [Times: user=16.09 sys=0.02, real=4.64 secs]

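After analysis, the cause turned out to be this:
One module of the application follows a producer-consumer model. The consumer was slower than the producer, so a large backlog of unconsumed data piled up in the consumer's queue (when the OOM happened the queue held about 4.3 million unconsumed objects = =).
All of those objects were still reachable from the queue, so the JVM could not reclaim that memory, and the heap eventually ran out.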
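Fixes:

Speed up the consumer.
Even so, put a maximum size limit on the consumer queue.
The queue most likely contains many duplicate objects, so add a filtering layer in front of it.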
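
Here is a minimal sketch of what the second and third fixes could look like together. The class name `BoundedDedupQueue`, the capacity of 2, and the choice to drop new items when the queue is full are assumptions made for illustration, not the crawler's actual code. A producer that must not lose items would block or retry instead of dropping.

```java
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: a bounded queue that refuses items which are already waiting in it. */
final class BoundedDedupQueue<T> {
    private final BlockingQueue<T> queue;
    // Tracks items currently waiting, so duplicates can be filtered before enqueueing.
    private final Set<T> pending = ConcurrentHashMap.newKeySet();

    BoundedDedupQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks. Returns false if the item is a duplicate or the queue is full. */
    boolean offer(T item) {
        if (!pending.add(item)) {
            return false;             // same item is already queued
        }
        if (!queue.offer(item)) {
            pending.remove(item);     // queue is full: drop the item instead of growing
            return false;
        }
        return true;
    }

    /** Waits up to the given timeout for an item; returns null if none arrived. */
    T poll(long timeout, TimeUnit unit) throws InterruptedException {
        T item = queue.poll(timeout, unit);
        if (item != null) {
            // A concurrent offer of the same item may still be rejected in the tiny
            // window before this line runs; that only costs one dropped duplicate.
            pending.remove(item);
        }
        return item;
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedDedupQueue<String> q = new BoundedDedupQueue<>(2);
        System.out.println(q.offer("1.235.115.177:55476")); // true
        System.out.println(q.offer("1.235.115.177:55476")); // false (duplicate)
        System.out.println(q.offer("5.158.127.22:7881"));   // true
        System.out.println(q.offer("10.0.0.1:6881"));       // false (queue is full)
        System.out.println(q.poll(10, TimeUnit.MILLISECONDS)); // 1.235.115.177:55476
        System.out.println(q.offer("1.235.115.177:55476")); // true (consumed, so allowed again)
    }
}
```

With a cap like this the backlog can never exceed the configured capacity, so the heap used by the queue stays bounded no matter how far the consumer falls behind.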

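A JVM option I picked up along the way:
GC overhead limit exceeded: a policy in HotSpot 1.6 that uses the time spent in GC to predict that an OOM is coming and throws the error early.
-XX:-UseGCOverheadLimit: this flag turns the policy off.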
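
To get a feel for how "time spent in GC" can be watched from inside an application, here is a rough sketch built on the standard `java.lang.management` MXBeans. The class name `GcTimeProbe` is made up for this example. It is not how HotSpot computes its own limit, only a coarse average since JVM start.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical probe: reports roughly how much of the JVM's uptime went into GC. */
final class GcTimeProbe {

    /** Sums the accumulated collection time (ms) reported by every collector. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Accumulated GC time divided by JVM uptime; an average since start, not a recent window. */
    static double gcFraction() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMillis;
    }

    public static void main(String[] args) {
        System.out.println("GC time so far (ms): " + totalGcMillis());
        System.out.println("Share of uptime spent in GC: " + gcFraction());
    }
}
```

Logging this value periodically next to the queue size would show the share climbing well before the full GCs start to dominate.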

Writing this down here for now so I don't forget it later.
