With a default CDH installation, the logs are all under /var/log, so that is the most convenient place to look first.
yarn
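To view the log of a specific application ID: yarn logs -applicationId application_1546927165868_0023
To see which application IDs have logs, list the directory: hdfs dfs -ls /tmp/logs/root/logs
Example:
After setting up CDH, I ran wordcount to try out MapReduce and got an error: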
root@hadoop-slave1:/home/zhanqian/input# hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar wordcount /input /output/wordcount1
19/01/09 10:24:17 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/10.0.0.81:8032
19/01/09 10:24:18 INFO input.FileInputFormat: Total input paths to process : 2
19/01/09 10:24:18 INFO mapreduce.JobSubmitter: number of splits:2
19/01/09 10:24:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1546927165868_0024
19/01/09 10:24:19 INFO impl.YarnClientImpl: Submitted application application_1546927165868_0024
19/01/09 10:24:19 INFO mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1546927165868_0024/
19/01/09 10:24:19 INFO mapreduce.Job: Running job: job_1546927165868_0024
19/01/09 10:24:28 INFO mapreduce.Job: Job job_1546927165868_0024 running in uber mode : false
19/01/09 10:24:28 INFO mapreduce.Job: map 0% reduce 0%
19/01/09 10:24:28 INFO mapreduce.Job: Job job_1546927165868_0024 failed with state FAILED due to: Application application_1546927165868_0024 failed 2 times due to AM Container for appattempt_1546927165868_0024_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://hadoop-master:8088/proxy/application_1546927165868_0024/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1546927165868_0024_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
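The error above contains no useful information, and neither does the URL it points to. That page leads to hadoop-cmf-yarn-JOBHISTORY-hadoop-master.log.out, but the errors in it are not the root cause; they are knock-on effects. At this point you need the YARN logs for the failed application: run yarn logs -applicationId application_1546927165868_0024 to view them.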
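The application ID can be copied by hand from the job output, but a small helper avoids typos. This is only a sketch: the function name app_id_from_output is made up for illustration, and it assumes the first application_<timestamp>_<sequence> token in the output is the one you want.

```shell
# Hypothetical helper: print the first YARN application ID found on stdin.
app_id_from_output() {
  grep -o 'application_[0-9][0-9]*_[0-9][0-9]*' | head -n 1
}

# Example with a line copied from the job output above.
line='19/01/09 10:24:19 INFO impl.YarnClientImpl: Submitted application application_1546927165868_0024'
printf '%s\n' "$line" | app_id_from_output
```

On a machine with the YARN client, the result can be passed straight to the log command, for example yarn logs -applicationId "$(app_id_from_output < job.out)", where job.out is a hypothetical file holding a saved copy of the job's console output.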
To list the applications currently running on YARN: yarn application -list
To kill a job, pass its application ID to -kill: yarn application -kill application_1437456051228_1725
Executor-side logs
When running Spark in cluster/client mode, the job stalled at the point shown below, with no exception or error reported.
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO scheduler.DAGScheduler: Registering RDD 1 (map at UserAction.scala:598)
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO scheduler.DAGScheduler: Got job 0 (collect at UserAction.scala:609) with 1 output partitions
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at UserAction.scala:609)
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[1] at map at UserAction.scala:598), which has no missing parents
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.8 KB, free 365.9 MB)
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 365.9 MB)
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.81.77.67:17664 (size: 2.3 KB, free: 366.3 MB)
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[1] at map at UserAction.scala:598) (first 15 tasks are for partitions Vector(0))
16-11-2018 15:14:36 CST noah-dp-spark INFO - 18/11/16 15:14:36 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
16-11-2018 15:14:37 CST noah-dp-spark INFO - 18/11/16 15:14:37 INFO spark.ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
16-11-2018 15:14:41 CST noah-dp-spark INFO - 18/11/16 15:14:41 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.81.174.117:39678) with ID 1
16-11-2018 15:14:41 CST noah-dp-spark INFO - 18/11/16 15:14:41 INFO spark.ExecutorAllocationManager: New executor 1 has registered (new total is 1)
16-11-2018 15:14:41 CST noah-dp-spark INFO - 18/11/16 15:14:41 INFO storage.BlockManagerMasterEndpoint: Registering block manager hadoop-slave1:46294 with 366.3 MB RAM, BlockManagerId(1, hadoop-slave1, 46294, None)
16-11-2018 15:14:41 CST noah-dp-spark INFO - 18/11/16 15:14:41 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, hadoop-slave1, executor 1, partition 0, RACK_LOCAL, 5811 bytes)
16-11-2018 15:14:41 CST noah-dp-spark INFO - 18/11/16 15:14:41 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop-slave1:46294 (size: 2.3 KB, free: 366.3 MB)
16-11-2018 15:14:43 CST noah-dp-spark INFO - 18/11/16 15:14:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-slave1:46294 (size: 32.8 KB, free: 366.3 MB)
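The next step is to find the logs. The job is stuck on the hadoop-slave1 node, so go to hadoop-slave1 and look for the logs there.
In Spark on YARN mode, each executor corresponds to one YARN container, so on the executor's node run ps -ef|grep spark.yarn.app.container.log.dir. If the node may be running several applications, filter further by application ID. The command shows the executor's process information, which includes the log path, for example: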
-Djava.io.tmpdir=/data1/hadoop/yarn/local/usercache/ocdp/appcache/application_1521424748238_0051/container_e07_1521424748238_0051_01_000002/tmp '
-Dspark.history.ui.port=18080' '-Dspark.driver.port=59555'
-Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1521424748238_0051/container_e07_1521424748238_0051_01_000002
In other words, this executor's logs are in the directory /data1/hadoop/yarn/log/application_1521424748238_0051/container_e07_1521424748238_0051_01_000002. That is how to find the executor logs at run time.
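Picking the path out of a long process line by eye is error-prone, so the extraction can be scripted. The sketch below is an illustration only: the function name container_log_dirs is made up, and the sample command line is shortened (the leading java and trailing -Xmx1g are placeholders, not taken from a real process).

```shell
# Hypothetical helper: read process listings on stdin and print each distinct
# value of -Dspark.yarn.app.container.log.dir found in them.
container_log_dirs() {
  grep -o "spark\.yarn\.app\.container\.log\.dir=[^ ']*" | cut -d= -f2- | sort -u
}

# Example with a shortened command line (the path is the one shown above).
sample="java -Dspark.driver.port=59555 -Dspark.yarn.app.container.log.dir=/data1/hadoop/yarn/log/application_1521424748238_0051/container_e07_1521424748238_0051_01_000002 -Xmx1g"
printf '%s\n' "$sample" | container_log_dirs
```

On the executor node this would typically be fed from ps, for example ps -ef | grep application_1521424748238_0051 | container_log_dirs, so that only the containers of one application are listed.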
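I hit one more problem: when launching in cluster mode, the job failed after about 14 seconds. I wanted to look at the logs inside the container, but they had already been deleted, because by default they are removed as soon as the run finishes. In CDH I changed the YARN setting yarn.nodemanager.delete.debug-delay-sec = 1000 (the value is in seconds).
With that setting changed, the debug logs of finished runs remain available for inspection.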
hbase
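- First look at the role's stdout and stderr logs in CDH
- If nothing turns up there, look in the logs directory of the relevant process, for example: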
/opt/cm-5.15.1/run/cloudera-scm-agent/process/444-hbase-REGIONSERVER/logs
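Previously the RegionServer crashed right after starting, and the error was only visible in this directory. The cause was insufficient heap memory, because the heap was left at its default setting of 50M: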
Thu Feb 21 16:02:21 CST 2019
JAVA_HOME=/opt/jdk/jdk1.8.0_181
using /opt/jdk/jdk1.8.0_181 as JAVA_HOME
using 5 as CDH_VERSION
using as HBASE_HOME
using /opt/cm-5.15.1/run/cloudera-scm-agent/process/444-hbase-REGIONSERVER as HBASE_CONF_DIR
using /opt/cm-5.15.1/run/cloudera-scm-agent/process/444-hbase-REGIONSERVER as HADOOP_CONF_DIR
using as HADOOP_HOME
CONF_DIR=/opt/cm-5.15.1/run/cloudera-scm-agent/process/444-hbase-REGIONSERVER
CMF_CONF_DIR=/opt/cm-5.15.1/etc/cloudera-scm-agent
Thu Feb 21 16:02:21 CST 2019 Starting znode cleanup thread with HBASE_ZNODE_FILE=/opt/cm-5.15.1/run/cloudera-scm-agent/process/444-hbase-REGIONSERVER/znode25911 for regionserver
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /tmp/hbase_hbase-REGIONSERVER-409cef9ec8084db201a877a119f4f55e_pid25911.hprof ...
Heap dump file created [62523093 bytes in 0.327 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p
/opt/cm-5.15.1/lib/cmf/service/common/killparent.sh"
# Executing /bin/sh -c "kill -9 25911
/opt/cm-5.15.1/lib/cmf/service/common/killparent.sh"...
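When it is not obvious which file in a process directory holds the failure, a recursive search for the error string narrows it down quickly. This is a minimal sketch: the function name oom_files and the file names stdout.log and other.log are made up for the demonstration, which runs on a temporary directory so it works without a cluster.

```shell
# Hypothetical helper: list files under a directory that mention an OutOfMemoryError.
oom_files() {
  grep -rl 'java\.lang\.OutOfMemoryError' "$1"
}

# Demonstration on a throwaway directory with two made-up log files.
demo_dir=$(mktemp -d)
printf 'java.lang.OutOfMemoryError: Java heap space\n' > "$demo_dir/stdout.log"
printf 'using 5 as CDH_VERSION\n' > "$demo_dir/other.log"
oom_files "$demo_dir"   # prints only the path of stdout.log
rm -rf "$demo_dir"
```

On a real host the argument would be the process logs directory, for example oom_files /opt/cm-5.15.1/run/cloudera-scm-agent/process/444-hbase-REGIONSERVER/logs.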