The Incident
Around 9 a.m. on October 13, 2018, which happened to be a Saturday, we suddenly received an alert that the Kafka process had died. At first we did not pay much attention and asked a colleague to restart it, but it still would not come up. Looking more closely at the logs, we found that a file named hs_err_pid25501.log was being generated, beginning with:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2640), pid=25501, tid=0x00007f035aef6700
Our first guess was that the machine was simply short on memory, so following the hints above we adjusted various JVM startup parameters and reduced the memory allocated to Kafka. The broker finally started normally, and we let our guard down.
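For reference, one common way to shrink the broker's heap is to set KAFKA_HEAP_OPTS before launching, since kafka-server-start.sh only applies its built-in default when that variable is unset. The sizes below are purely illustrative, not the exact values we used:
root@ubuntu # export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
root@ubuntu # bin/kafka-server-start.sh -daemon config/server.properties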
It Happens Again
Around 4 p.m. we received the alert again: the Kafka process was down. This time nothing we tried would keep it up; at best it ran for about 35 minutes before crashing, so we all went to the office to analyze the problem together. On closer inspection, we found that Kafka printed the following error log as it exited:
2018-10-13 19:12:08,103|ERROR|ReplicaFetcherThread-0-5|kafka.utils.Logging$class.error:76|[ReplicaFetcher replicaId=0, leaderId=5, fetcherId=0] Error while processing data for partition mts_mpid-5|
org.apache.kafka.common.errors.KafkaStorageException: Error while rolling log segment for mts_mpid-5 in dir /data8/kafka/logs
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:940)
at kafka.log.AbstractIndex.<init>(AbstractIndex.scala:67)
at kafka.log.OffsetIndex.<init>(OffsetIndex.scala:53)
at kafka.log.LogSegment$.open(LogSegment.scala:560)
at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1340)
at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1316)
at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
at kafka.log.Log.roll(Log.scala:1316)
at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1303)
at kafka.log.Log$$anonfun$append$2.apply(Log.scala:726)
at kafka.log.Log$$anonfun$append$2.apply(Log.scala:640)
at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
at kafka.log.Log.append(Log.scala:640)
at kafka.log.Log.appendAsFollower(Log.scala:623)
at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
at kafka.cluster.Partition$$anonfun$appendRecordsToFollower$1.apply(Partition.scala:560)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:256)
at kafka.cluster.Partition.appendRecordsToFollower(Partition.scala:559)
at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:112)
at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:43)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:183)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
at scala.Option.foreach(Option.scala:257)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937)
... 34 more
We asked the system administrator to check the host for hardware faults or kernel problems. After some investigation he found nothing: the host was perfectly healthy and had never reported a memory allocation error. Memory was plentiful, swap had never been touched, and even after tuning various kernel parameters the error kept coming back.
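For completeness, the checks were along these lines (shown only to illustrate what was ruled out; none of them turned up anything abnormal):
root@ubuntu # free -g
root@ubuntu # swapon -s
root@ubuntu # dmesg | grep -i -E 'oom|out of memory'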
Solving the Problem
A search online turned up solutions for this exception that mostly amounted to switching to a 64-bit JVM or adjusting the memory-related JVM options. We were already running a 64-bit JVM and had already tuned the memory settings, so none of that applied to us.
We asked the system administrator whether there might be a limit on memory-mapped files; he said the defaults were more than generous. We checked anyway, and the current value was 65530:
root@ubuntu # cat /proc/sys/vm/max_map_count
65530
Our Kafka cluster stores a huge amount of data (12 disks * 4 TB each), so the number of files it opens and maps is bound to be large. After starting the Kafka process this time, we checked how many memory mappings the process held:
root@ubuntu # cat /proc/61764/maps | wc -l
53620
It was already over 50,000 and still climbing, so this was almost certainly the problem. We raised the limit:
root@ubuntu # sysctl -w vm.max_map_count=262144
root@ubuntu # sysctl -a|grep vm.max_map_count
vm.max_map_count = 262144
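Note that sysctl -w only changes the running kernel. To keep the setting across a reboot, it also has to be written to /etc/sysctl.conf (or a file under /etc/sysctl.d/) and reloaded:
root@ubuntu # echo "vm.max_map_count=262144" >> /etc/sysctl.conf
root@ubuntu # sysctl -p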
This time the broker finally stayed up, although startup took extraordinarily long, more than an hour, and the number of mapped files did indeed end up exceeding 65530.
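To keep an eye on the broker afterwards, a simple option is to watch the mapping count over time. The pgrep pattern below is an assumption (it relies on the broker's main class kafka.Kafka appearing on the java command line), and the PID is captured once when the command is entered:
root@ubuntu # watch -n 60 "wc -l < /proc/$(pgrep -f kafka.Kafka)/maps"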
At this point the problem was essentially solved. So what caused so many memory-mapped files in the first place? This article will not answer that question yet, because we ran into another problem the very next day, and the answer comes with that one.