Handling an Active NameNode Crash

In the early hours of the morning, the Active NameNode suddenly went down. The key log entries were:

2019-06-07 01:32:50,055 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:50,081 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:51,055 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:51,081 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:52,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:52,082 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:53,058 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:53,082 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:54,058 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:54,083 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:55,059 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:55,083 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:56,060 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18014 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:56,084 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:57,062 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19015 ms (timeout=20000 ms) for a response for startLogSegment(1656036462). Succeeded so far: [10.220.48.13:8485]
2019-06-07 01:32:57,084 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-jrsy-zk02.pekdc1.jdfin.local/10.220.48.26:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-06-07 01:32:58,046 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 1656036462 failed for required journal (JournalAndStream(mgr=QJM to [10.220.48.13:8485, 10.220.48.26:8485, 10.220.48.27:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:408)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1288)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1257)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1395)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5818)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1122)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2222)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1758)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2222)
2019-06-07 01:32:58,049 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2019-06-07 01:32:58,093 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at TX-220-48-33.h.chinabank.com.cn/10.220.48.33
************************************************************/

Checking our monitoring showed that of the 3 JournalNodes, one JN process had already died 7 hours earlier. The cause was an out-of-memory failure, as the JVM fatal-error dump the process left in the /tmp directory shows:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 3145728 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2640), pid=7916, tid=0x00007f86780d3700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
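For reference, the JVM names this fatal-error report hs_err_pid<pid>.log; it normally lands in the process's working directory and falls back to the temporary directory when that is not writable, which is why it turned up under /tmp here. A quick way to locate and read it on the JN host (the pid comes from the dump itself):

    ls -lt /tmp/hs_err_pid*.log
    less /tmp/hs_err_pid7916.log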

The GC statistics of one of the JournalNode processes that was still alive looked like this:

      S0     S1      E      O      M    CCS       YGC      YGCT    FGC     FGCT       GCT
    46.88   0.00  61.72  93.59  97.35  93.57  11230009 26118.504 127717 3519.742 29638.246
    34.38   0.00  22.83  93.90  97.35  93.57  11230011 26118.508 127717 3519.742 29638.250
     0.00  34.38  79.36  96.91  97.35  93.57  11230012 26118.510 127717 3519.742 29638.252
     0.00  31.25  21.89  97.22  97.35  93.57  11230014 26118.514 127717 3519.742 29638.256
    31.25   0.00  49.14  97.37  97.35  93.57  11230015 26118.516 127717 3519.742 29638.259
     0.00  31.25   0.00  98.03  97.35  93.57  11230018 26118.523 127718 3519.742 29638.265
     0.00  34.38  34.07  47.77  97.35  93.57  11230020 26118.527 127718 3519.768 29638.295
     0.00  34.38  58.11  48.10  97.35  93.57  11230022 26118.539 127718 3519.768 29638.307
     0.00  37.50  53.27  48.50  97.35  93.57  11230024 26118.543 127718 3519.768 29638.311
     0.00  34.38  16.37  48.78  97.35  93.57  11230026 26118.547 127718 3519.768 29638.315
    21.88   0.00  55.71  48.87  97.35  93.57  11230027 26118.549 127718 3519.768 29638.317
    31.25   0.00  63.76  49.20  97.35  93.57  11230029 26118.554 127718 3519.768 29638.322
    43.75   0.00  79.80  49.58  97.35  93.57  11230031 26118.558 127718 3519.768 29638.326
     0.00  68.75   9.98  50.15  97.35  93.57  11230034 26118.564 127718 3519.768 29638.332
     0.00  50.00  14.22  50.52  97.35  93.57  11230036 26118.568 127718 3519.768 29638.336
    68.75   0.00  65.19  50.70  97.35  93.57  11230037 26118.570 127718 3519.768 29638.338
    56.25   0.00  20.29  50.98  97.35  93.57  11230039 26118.574 127718 3519.768 29638.342
     0.00  68.75  85.59  53.96  97.35  93.57  11230040 26118.576 127718 3519.768 29638.344
     0.00  34.38  49.55  54.55  97.35  93.57  11230042 26118.581 127718 3519.768 29638.349
     0.00  37.50  31.61  54.84  97.35  93.57  11230044 26118.585 127718 3519.768 29638.353
     0.00  28.12  23.99  58.07  97.35  93.57  11230046 26118.589 127718 3519.768 29638.357
    34.38   0.00  75.00  58.33  97.35  93.57  11230047 26118.591 127718 3519.768 29638.359
    34.38   0.00  15.54  58.62  97.35  93.57  11230049 26118.596 127718 3519.768 29638.363
     0.00  28.12  43.44  58.77  97.35  93.57  11230050 26118.598 127718 3519.768 29638.366
    31.25   0.00  72.77  58.88  97.35  93.57  11230051 26118.600 127718 3519.768 29638.368
    28.12   0.00  15.47  59.17  97.35  93.57  11230053 26118.604 127718 3519.768 29638.372
     0.00  28.12  35.98  59.28  97.35  93.57  11230054 26118.607 127718 3519.768 29638.375
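The columns above match jstat's -gcutil view sampled repeatedly. A minimal invocation, with the JournalNode pid left as a placeholder and a 1000 ms sampling interval, would be:

    jstat -gcutil <jn_pid> 1000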

Analyzing these GC statistics shows that the JN process was collecting almost non-stop: over 11.2 million young GCs and more than 127,000 full GCs, with the old generation repeatedly climbing toward full (98.03% in the sample above) before a full GC knocked it back down. Note that the cumulative counters alone put the average full GC at under 30 ms (see the quick check below), so what matters here is the tail, not the mean: a heap under this much pressure can occasionally stall far longer than average, and evidently one such stall exceeded the 20-second default of dfs.qjournal.start-segment.timeout.ms. The quorum arithmetic then finished the job. With 3 JNs in total and one already dead, the NameNode needed responses from both of the remaining 2 JNs to reach a majority (2 of 3), so neither survivor could afford any hiccup. Unluckily, 7 hours after the first JN died, the NameNode of one of our 7 namespaces (the one that crashed) called startLogSegment just as one of the two surviving JNs went into a long full GC and gave no response within 20 seconds. Opening the new edit log segment therefore failed to reach a quorum within the window set by dfs.qjournal.start-segment.timeout.ms, and the NameNode process aborted.
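A quick sanity check of the average full-GC pause from the cumulative counters in the last jstat sample (shell arithmetic; the numbers come straight from the FGCT and FGC columns above):

    # average full-GC pause = total full-GC seconds / full-GC count
    awk 'BEGIN { printf "%.4f s\n", 3519.768 / 127718 }'   # prints 0.0276 s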

Resolution:

The JN process's startup options showed it was running with only 1 GB of heap. We increased the JN heap size and switched it to the G1 collector to reduce both the frequency and the duration of full GCs, as in the sketch below.
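A minimal sketch of that change, assuming a Hadoop 2.x hadoop-env.sh (the bin/hdfs script appends HADOOP_JOURNALNODE_OPTS for the journalnode command); the 4 GB heap and the pause-time target are illustrative values, not the exact ones we settled on:

    # hadoop-env.sh on each JournalNode host (illustrative values)
    export HADOOP_JOURNALNODE_OPTS="-Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 $HADOOP_JOURNALNODE_OPTS"

Restart the JournalNodes one at a time so that a quorum (2 of 3) stays available throughout.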

Additionally, the value of dfs.qjournal.start-segment.timeout.ms (configured in hdfs-site.xml; default 20 seconds) can be raised to mask this failure mode, as in the sketch below.
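A sketch of that hdfs-site.xml change; the 60-second value is illustrative, and the sibling QJM write timeout dfs.qjournal.write-txns.timeout.ms (also 20 s by default) is worth reviewing at the same time:

    <property>
      <name>dfs.qjournal.start-segment.timeout.ms</name>
      <!-- default is 20000 (20 s); raised so a long JN GC pause no longer kills the NameNode -->
      <value>60000</value>
    </property>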
