Overview
Log configuration for Spark on YARN falls into two cases:
- Spark on YARN in client mode
- Spark on YARN in cluster mode
Each is covered in turn below.
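Log configuration for Spark on YARN in client mode
In client mode, a Spark application consists of three parts: the driver, the application master, and the executors. This mode is typically used in test environments.
Driver: can be thought of as the client side of the Spark application; it runs on the machine that submits the job.
Application master: requests resources from YARN's ResourceManager, hands those resources to the actual tasks, and starts and stops them.
Executor: runs in a container on a NodeManager node and executes the actual tasks.
With that in mind, here is how logging is configured for each part: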
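- Driver
# Driver logging in client mode. The driver runs on the submitting machine,
# so the Log4j configuration is given as a local file: URL.
# --files uploads the local file to the YARN containers.
spark-submit \
--class com.hm.spark.Application \
--master yarn \
--deploy-mode client \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/home/hadoop/spark-workspace/log4j.properties" \
--files /home/hadoop/spark-workspace/log4j.properties \
/home/hadoop/spark-workspace/my-spark-etl-assembly-1.0-SNAPSHOT.jar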
- Application master
# Application master logging (spark.yarn.am.extraJavaOptions applies in client mode only).
# --files uploads the local file to the YARN containers.
spark-submit \
--class com.hm.spark.Application \
--master yarn \
--deploy-mode client \
--conf "spark.yarn.am.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
--files /home/hadoop/spark-workspace/log4j.properties \
/home/hadoop/spark-workspace/my-spark-etl-assembly-1.0-SNAPSHOT.jar
- Executor
# Executor logging.
# --files uploads the local file to the YARN containers.
spark-submit \
--class com.hm.spark.Application \
--master yarn \
--deploy-mode client \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
--files /home/hadoop/spark-workspace/log4j.properties \
/home/hadoop/spark-workspace/my-spark-etl-assembly-1.0-SNAPSHOT.jar
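The three client-mode options do not exclude one another, so a single submission can carry all of them. The following is a minimal sketch rather than a definitive script: the helper names `submit_client` and `dry_run` are hypothetical, the driver option uses a `file:` URL on the assumption that the client-mode driver reads the file from the local disk of the submitting machine, and Log4j 1.x is assumed as in the rest of this article (Spark 3.3 and later ship with Log4j 2, which is configured differently).

```shell
#!/bin/sh
# Paths taken from the examples in this article.
LOG4J_FILE=/home/hadoop/spark-workspace/log4j.properties
APP_JAR=/home/hadoop/spark-workspace/my-spark-etl-assembly-1.0-SNAPSHOT.jar

# Hypothetical helper: prints the command instead of running it.
dry_run() {
  printf 'spark-submit %s\n' "$*"
}

# Hypothetical helper: $1 is the command that receives the arguments,
# "spark-submit" for a real run or "dry_run" to only print them.
submit_client() {
  runner=$1
  log4j_name=$(basename "$LOG4J_FILE")
  "$runner" \
    --class com.hm.spark.Application \
    --master yarn \
    --deploy-mode client \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:${LOG4J_FILE}" \
    --conf "spark.yarn.am.extraJavaOptions=-Dlog4j.configuration=${log4j_name}" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=${log4j_name}" \
    --files "$LOG4J_FILE" \
    "$APP_JAR"
}

# Print the assembled command; replace dry_run with spark-submit to submit.
submit_client dry_run
```

Passing the runner as an argument keeps the command inspectable before anything is sent to the cluster.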
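Log configuration for Spark on YARN in cluster mode
In cluster mode, a Spark application consists of two parts: the driver and the executors. This mode is typically used in production.
Driver: takes on both the client role and the application master's responsibilities, and runs in a container on one of the NodeManager nodes.
Executor: runs in a container on a NodeManager node and executes the actual tasks.
With that in mind, here is how logging is configured for each part: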
- Driver
# Driver logging in YARN cluster mode. The driver runs inside the application
# master, so it is configured through spark.driver.extraJavaOptions.
# --files uploads the local file to the YARN containers.
spark-submit \
--class com.hm.spark.Application \
--master yarn \
--deploy-mode cluster \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
--files /home/hadoop/spark-workspace/log4j.properties \
/home/hadoop/spark-workspace/my-spark-etl-assembly-1.0-SNAPSHOT.jar
- Executor
# Executor logging.
# --files uploads the local file to the YARN containers.
spark-submit \
--class com.hm.spark.Application \
--master yarn \
--deploy-mode cluster \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
--files /home/hadoop/spark-workspace/log4j.properties \
/home/hadoop/spark-workspace/my-spark-etl-assembly-1.0-SNAPSHOT.jar
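In cluster mode the driver and executor options can likewise go into one submission. The sketch below is illustrative only: `submit_cluster` and `dry_run` are hypothetical helper names, and it assumes the driver is configured through `spark.driver.extraJavaOptions`, since the cluster-mode driver runs inside the application master's container and can pick up the file shipped with `--files`.

```shell
#!/bin/sh
# Paths taken from the examples in this article.
LOG4J_FILE=/home/hadoop/spark-workspace/log4j.properties
APP_JAR=/home/hadoop/spark-workspace/my-spark-etl-assembly-1.0-SNAPSHOT.jar

# Hypothetical helper: prints the command instead of running it.
dry_run() {
  printf 'spark-submit %s\n' "$*"
}

# Hypothetical helper: $1 is the command that receives the arguments,
# "spark-submit" for a real run or "dry_run" to only print them.
submit_cluster() {
  runner=$1
  # Driver and executors both read the copy uploaded by --files,
  # so the bare file name is enough for each of them.
  log4j_opt="-Dlog4j.configuration=$(basename "$LOG4J_FILE")"
  "$runner" \
    --class com.hm.spark.Application \
    --master yarn \
    --deploy-mode cluster \
    --conf "spark.driver.extraJavaOptions=${log4j_opt}" \
    --conf "spark.executor.extraJavaOptions=${log4j_opt}" \
    --files "$LOG4J_FILE" \
    "$APP_JAR"
}

# Print the assembled command; replace dry_run with spark-submit to submit.
submit_cluster dry_run
```

Because the driver runs on the cluster in this mode, its output ends up in the YARN container logs; when log aggregation is enabled, these can be fetched with `yarn logs -applicationId <application ID>`.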
Contents of the log configuration file
In client mode, a template for the driver's log configuration looks like this:
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
Logging to the console makes it more convenient to read the logs on the driver.
Other log configuration
log4j.rootLogger=INFO,rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=${log}/abc.log
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=2KB
log4j.appender.rolling.maxBackupIndex=10
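A rolling file appender is recommended here: it caps the size of each log file and the number of backups kept, which prevents the logs from growing until they fill the disk.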
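The disk usage that such a configuration allows can be estimated from the two limits. As a rough sketch, assuming the appender keeps one active file plus up to `maxBackupIndex` backups (the function name `rolling_log_bound_kb` is hypothetical, and a file may slightly exceed the limit before it rolls over):

```shell
#!/bin/sh
# Values from the configuration above.
max_file_size_kb=2     # log4j.appender.rolling.maxFileSize=2KB
max_backup_index=10    # log4j.appender.rolling.maxBackupIndex=10

# Hypothetical helper: approximate upper bound in KB for one appender,
# counting the active file plus the retained backups.
rolling_log_bound_kb() {
  echo $(( $1 * ($2 + 1) ))
}

echo "approx. $(rolling_log_bound_kb "$max_file_size_kb" "$max_backup_index") KB per process"
# prints: approx. 22 KB per process
```

The bound applies per JVM, because the driver and every executor write their own set of files. A 2KB limit rolls over after only a few lines, so a real deployment would probably choose a larger `maxFileSize`.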