【2019-01-04】Spark job hangs at the driver

1. Background

  • Symptom
    The user's Spark job normally completes in about 40 minutes. Occasionally the driver hangs: no new tasks appear on the UI, and the job fails roughly 2 hours later. For reasons on the customer's side, detailed driver logs could not be obtained.

  • Basic information:

Component | Version | Module
Spark | 1.5.1 | dataframe.write.mode(SaveMode.Overwrite).parquet(path)
  • Core business code (a self-contained sketch follows the driver stack below):
dataFrame.write.mode(SaveMode.Overwrite).parquet(path)
sqlContext.sql("load data inpath '" + path + "' overwrite into table " + tableName)
  • Driver stack while hung:
    at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
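
For reference, a minimal, self-contained sketch of the write-then-load flow above (Spark 1.5.x API). The context setup, path, tableName, and the sample DataFrame are placeholder assumptions; note that LOAD DATA generally requires a HiveContext rather than a plain SQLContext:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// Placeholder setup; the real job's context, path and table differ.
val sc = new SparkContext(new SparkConf().setAppName("write-then-load"))
val sqlContext = new HiveContext(sc) // LOAD DATA needs Hive support
val path = "/tmp/output/parquet"     // placeholder path
val tableName = "target_table"       // placeholder table name

// Step 1: overwrite the Parquet output directory.
val dataFrame = sqlContext.range(0, 100).toDF("id")
dataFrame.write.mode(SaveMode.Overwrite).parquet(path)

// Step 2: move the written files into the Hive table.
sqlContext.sql(s"load data inpath '$path' overwrite into table $tableName")

The stack above shows the driver thread blocked in JobWaiter.awaitResult, i.e. an action was submitted and the driver is waiting synchronously for its tasks to finish.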

2. Analysis

  • Analyze the corresponding application logs

At 2019-01-16 15:53:11,748 the log shows:

 WARN  | [driver-heartbeater] | Error sending message [message = Heartbeat(28,[Lscala.Tuple2;@2043010d,BlockManagerId(28, 10.180.181.196, 23640))] in 2 attempts | org.apache.spark.Logging$class.logWarning(Logging.scala:92)

Eventually the heartbeat between executor and driver fails:

2019-01-16 15:53:20,101 | ERROR | [dispatcher-event-loop-43] | Driver 10.180.143.218:23214 disassociated! Shutting down. | org.apache.spark.Logging$class.logError(Logging.scala:75)
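
For context, each executor reports a heartbeat to the driver every spark.executor.heartbeatInterval; once the executor-side retries are exhausted (the "in 2 attempts" above) and nothing is heard within spark.network.timeout, each side gives the peer up for dead. A minimal sketch of the two knobs involved, shown with their Spark 1.5 defaults (the 300s tuning applied later in this case raises the second one):

import org.apache.spark.SparkConf

// Sketch only: the two timeouts governing the heartbeat path above.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s") // how often executors ping the driver (default 10s)
  .set("spark.network.timeout", "120s")           // how long to wait before declaring the peer dead (default 120s)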

There are also many shuffle exception logs:

2019-01-16 14:12:01,709 | WARN  | [shuffle-server-0] | Exception in connection from /10.180.181.153:49936 | org.apache.spark.network.server.TransportChannelHandler.exceptionCaught(TransportChannelHandler.java:79)
io.netty.handler.codec.CorruptedFrameException: negative pre-adjustment length field: -6101251590885211657
        at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:377)
        at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:327)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:230)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:748)
2019-01-16 14:12:01,710 | INFO  | [shuffle-server-0] | DIGEST41:Unmatched MACs | com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1481)
2019-01-16 14:12:01,710 | WARN  | [shuffle-server-0] | Exception in connection from /10.180.181.153:49936 | org.apache.spark.network.server.TransportChannelHandler.exceptionCaught(TransportChannelHandler.java:79)
io.netty.handler.codec.DecoderException: javax.security.sasl.SaslException: DIGEST-MD5: Out of order sequencing of messages from server. Got: 505 Expected: 501
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:300)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
        at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:748)
Caused by: javax.security.sasl.SaslException: DIGEST-MD5: Out of order sequencing of messages from server. Got: 505 Expected: 501
        at com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1489)
        at com.sun.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:213)
        at org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:149)
        at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:127)
        at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:102)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)

Could the shuffle exceptions be the root cause of the failure?

  • Examine the executor logs of a few more nodes
    1. Node HQCDDSJAPP181008_26009: the executor printed no logs at all between 14:13:17,335 and 15:13:17,292.

  • Further analysis of the driver logs

1. Since the driver logs could not be obtained, no deeper analysis was possible. First, rule out interference from speculative execution by asking the customer to disable Spark's speculation mechanism: spark.speculation.
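
A minimal sketch of disabling speculation (it can equally be set in spark-defaults.conf or via --conf on spark-submit):

import org.apache.spark.SparkConf

// Sketch only: turn off speculative re-execution of slow tasks,
// so retried "speculative" attempts cannot muddy the analysis.
val conf = new SparkConf()
  .set("spark.speculation", "false")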

2. With speculation disabled, the job still failed: the number of executor launch failures reached its upper limit (this cap is sketched at the end of this subsection).

Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (150) reached) 

The failures were all accompanied by large numbers of socket exceptions, showing that network communication between executors and the driver had broken down:

Caused by: java.io.IOException: Connection from 10.180.143.218/10.180.143.218:23781 closed
        at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
        at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:94)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)

The network was still the prime suspect.
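
As for the cap that terminated the application, it is a YARN-mode setting; a sketch of setting it explicitly (the value 150 simply mirrors the log above, and in Spark 1.5 the default is derived from the executor count):

import org.apache.spark.SparkConf

// Sketch only: the maximum number of executor failures tolerated before
// the application is failed, matching "Max number of executor failures
// (150) reached" in the log above.
val conf = new SparkConf()
  .set("spark.yarn.max.executor.failures", "150")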

  • 2019-01-21: logs were collected again.

1. Analysis of the AM log shows that it printed nothing at all between 15:07 and 17:25:

2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181113:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181132:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181077:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181045:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181080:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181140:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181164:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181103:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181104:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181033:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181109:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181046:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,500 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181182:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,505 | INFO  | [Reporter] | Received 29 containers from YARN, launching executors on 0 of them. | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:56,565 | INFO  | [dispatcher-event-loop-2] | Driver terminated or disconnected! Shutting down. 10.180.143.218:23073 | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:56,567 | INFO  | [dispatcher-event-loop-2] | Unregistering ApplicationMaster with FAILED (diag message: Driver has been dead, no need to retry the AM.) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:56,573 | INFO  | [dispatcher-event-loop-2] | Waiting for application to be successfully unregistered. | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.unregisterApplicationMaster(AMRMClientImpl.java:394)
2019-01-21 17:25:56,675 | INFO  | [dispatcher-event-loop-2] | Deleting staging directory .sparkStaging/application_1547839891509_15295 | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:58,197 | INFO  | [pool-1-thread-1] | Shutdown hook called | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

2. TID 13203 ran for more than 40 minutes in stage 31. Although that one task in stage 31 was slow, it eventually succeeded. Spark shuffle is demanding on network performance, so we planned the following shuffle tuning to mitigate single-task slowness (a sketch of applying it follows the list):
spark.shuffle.file.buffer=64k
spark.reducer.maxSizeInFlight=96M
spark.network.timeout=300s
spark.rpc.askTimeout=300s
spark.shuffle.io.serverThreads=8
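
A minimal sketch of applying this tuning programmatically (Spark 1.5.x; the same keys can go into spark-defaults.conf):

import org.apache.spark.SparkConf

// Sketch only: the shuffle tuning listed above.
val conf = new SparkConf()
  .set("spark.shuffle.file.buffer", "64k")     // shuffle write buffer per task (default 32k)
  .set("spark.reducer.maxSizeInFlight", "96m") // in-flight shuffle fetch per reducer (default 48m)
  .set("spark.network.timeout", "300s")        // blanket network timeout (default 120s)
  .set("spark.rpc.askTimeout", "300s")         // RPC ask timeout (falls back to spark.network.timeout)
  .set("spark.shuffle.io.serverThreads", "8")  // Netty server threads for the shuffle transport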

3. Conclusion

1. Some tasks ran slowly because of shuffle performance; tune the shuffle parameters to work around it:
spark.shuffle.file.buffer=64k
spark.reducer.maxSizeInFlight=96M
spark.network.timeout=300s
spark.rpc.askTimeout=300s
spark.shuffle.io.serverThreads=8
2. Network problems cannot be ruled out. We attempted to tune the following OS parameters, although changes to the customer's production environment were constrained:
net.ipv4.tcp_keepalive_time = 15
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_intvl = 30
3. Disable SASL (a sketch follows):
spark.authenticate.enableSaslEncryption=false
spark.authenticate=false
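
A minimal sketch of the SASL change, which trades transport authentication/encryption for stability (set before the SparkContext is created, or in spark-defaults.conf):

import org.apache.spark.SparkConf

// Sketch only: disable Spark RPC authentication and SASL encryption,
// removing the DIGEST-MD5 unwrap path seen in the shuffle exceptions above.
val conf = new SparkConf()
  .set("spark.authenticate", "false")
  .set("spark.authenticate.enableSaslEncryption", "false")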
