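Opening Ambari showed an HDFS alert, [alert]: Total Blocks:[*], Missing Blocks:[*], which means some file blocks are missing or corrupt. HDFS would not start either; the log kept looping on the following message: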
Retrying after 10 seconds. Reason: Execution of '/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs dfsadmin -fs hdfs://test01.bigdata.hbh:8020 -safemode get | grep 'Safe mode is OFF'' returned 1.
The NameNode stays in safe mode:
[root@test01 ~]# sudo -u hdfs hdfs dfsadmin -fs hdfs://test01.bigdata.hbh:8020 -safemode get
Safe mode is ON
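The check that Ambari keeps retrying can be reproduced by hand. The sketch below wraps the same grep in a small function; the function name safemode_is_off is made up for illustration, and the demo feeds it the output captured above instead of calling a live cluster.

```shell
# Returns 0 if the text on stdin reports safe mode as OFF, 1 otherwise.
# (safemode_is_off is a hypothetical helper name, not part of Hadoop.)
safemode_is_off() {
  grep -q 'Safe mode is OFF'
}

# Demo with the output captured above. On a live cluster, pipe in:
#   sudo -u hdfs hdfs dfsadmin -safemode get
if echo 'Safe mode is ON' | safemode_is_off; then
  echo "NameNode has left safe mode"
else
  echo "NameNode is still in safe mode"
fi
```

With the captured output this prints "NameNode is still in safe mode", which matches the non-zero return code in the Ambari log.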
The NameNode UI shows the following description:
Safe mode is ON. The reported blocks 4156 needs additional 2 blocks to reach the threshold 1.0000 of total blocks 4157. The number of live datanodes 4 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
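This means the share of blocks reported by the DataNodes is below the configured threshold. The threshold is an HDFS setting (dfs.namenode.safemode.threshold-pct); in Ambari's configuration management it is set to 100%, so not a single block may be missing. If it were set to 99%, safe mode should not be triggered here.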
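To make the arithmetic concrete, here is a rough sketch of the block-threshold comparison using the numbers from the NameNode UI message. It assumes the required count is the total block count multiplied by the threshold and truncated to an integer; the function name safemode_block_check is hypothetical, and the real NameNode also applies other conditions (such as the minimum number of live DataNodes) that are not modelled here.

```shell
# Sketch of the block-threshold comparison (assumption: required blocks =
# total blocks * threshold, truncated to an integer).
# $1 = reported blocks, $2 = total blocks, $3 = threshold as a fraction
safemode_block_check() {
  awk -v reported="$1" -v total="$2" -v pct="$3" 'BEGIN {
    required = int(total * pct)
    if (reported >= required) print "threshold met"
    else print "threshold not met"
  }'
}

safemode_block_check 4156 4157 1.0    # the configured 100%
safemode_block_check 4156 4157 0.99   # the 99% alternative
```

Under that assumption, 100% requires all 4157 blocks, so 4156 reported blocks are not enough, while 99% requires only 4115, which the cluster already has.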
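Problem description: the disks on the test cluster are small, only a few tens of GB. An earlier benchmark run filled them up, which caused data blocks to be lost. Since then the system has had trouble starting and keeps reporting that HDFS is in safe mode.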
Background
What is safe mode
How safe mode is triggered
Data that has lost some of its replicas
Check
[hdfs@test01 ~]$ hadoop fsck /user/root/.staging/job_1515575016190_0003/job.jar -files -blocks -locations -racks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Connecting to namenode via http://test01.bigdata.hbh:50070/fsck?ugi=hdfs&files=1&blocks=1&locations=1&racks=1&path=%2Fuser%2Froot%2F.staging%2Fjob_1515575016190_0003%2Fjob.jar
FSCK started by hdfs (auth:SIMPLE) from /172.16.201.200 for path /user/root/.staging/job_1515575016190_0003/job.jar at Fri Jan 26 16:11:15 CST 2018
/user/root/.staging/job_1515575016190_0003/job.jar 272019 bytes, 1 block(s): Under replicated BP-1912246748-192.168.89.173-1513143837848:blk_1073751971_11222. Target Replicas is 10 but found 4 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
0. BP-1912246748-192.168.89.173-1513143837848:blk_1073751971_11222 len=272019 repl=4 [/default-rack/172.16.201.200:50010, /default-rack/172.16.201.201:50010, /default-rack/172.16.201.202:50010, /default-rack/172.16.201.204:50010]
Status: HEALTHY
Total size: 272019 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 272019 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 1 (100.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 4.0
Corrupt blocks: 0
Missing replicas: 6 (60.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Fri Jan 26 16:11:15 CST 2018 in 0 milliseconds
The filesystem under path '/user/root/.staging/job_1515575016190_0003/job.jar' is HEALTHY
[hdfs@test01 ~]$
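The "Missing replicas: 6 (60.0 %)" line follows from the target and live replica counts reported for this file. A minimal sketch of that arithmetic, assuming the percentage is taken relative to the target replica count (which matches the output above); the helper name missing_replicas is made up for illustration:

```shell
# Missing replicas for one block: target minus live, as a share of the target.
# $1 = target replicas, $2 = live replicas
missing_replicas() {
  awk -v target="$1" -v live="$2" 'BEGIN {
    missing = target - live
    printf "%d (%.1f %%)\n", missing, 100 * missing / target
  }'
}

missing_replicas 10 4   # values from the fsck output above
```

The file is still HEALTHY because at least one replica of its block exists. With only 4 DataNodes in the cluster, a target of 10 replicas cannot be reached, so the block stays under-replicated without being missing.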
Corrupt data
[hdfs@test01 ~]$ hadoop fsck /apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793 -files -blocks -locations -racks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Connecting to namenode via http://test01.bigdata.hbh:50070/fsck?ugi=hdfs&files=1&blocks=1&locations=1&racks=1&path=%2Fapps%2Fhbase%2Fdata%2FoldWALs%2Ftest01.bigdata.hbh%252C16020%252C1515637923065.default.1515745933793
FSCK started by hdfs (auth:SIMPLE) from /172.16.201.200 for path /apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793 at Fri Jan 26 16:12:22 CST 2018
/apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793 91 bytes, 1 block(s):
/apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753448
MISSING 1 blocks of total size 91 B
0. BP-1912246748-192.168.89.173-1513143837848:blk_1073753448_12711 len=91 MISSING!
Status: CORRUPT
Total size: 91 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 91 B)
********************************
UNDER MIN REPL'D BLOCKS: 1 (100.0 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 91 B
CORRUPT BLOCKS: 1
********************************
Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 1
Missing replicas: 0
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Fri Jan 26 16:12:22 CST 2018 in 1 milliseconds
The filesystem under path '/apps/hbase/data/oldWALs/test01.bigdata.hbh%2C16020%2C1515637923065.default.1515745933793' is CORRUPT
Fixing the problem
Find out exactly which files are missing or corrupt, then remove them
[root@test01 ~]# sudo -u hdfs hdfs fsck /apps/hbase/data/oldWALs/ | egrep -v '^\.+$' | egrep -v '^$'
Connecting to namenode via http://test01.bigdata.hbh:50070/fsck?ugi=hdfs&path=%2Fapps%2Fhbase%2Fdata%2FoldWALs
FSCK started by hdfs (auth:SIMPLE) from /172.16.201.200 for path /apps/hbase/data/oldWALs at Thu Feb 08 09:57:58 CST 2018
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753450
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: MISSING 1 blocks of total size 91 B..
/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753446
/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606: MISSING 1 blocks of total size 91 B.Status: CORRUPT
Total size: 182 B
Total dirs: 1
Total files: 2
Total symlinks: 0
Total blocks (validated): 2 (avg. block size 91 B)
********************************
UNDER MIN REPL'D BLOCKS: 2 (100.0 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 2
MISSING BLOCKS: 2
MISSING SIZE: 182 B
CORRUPT BLOCKS: 2
********************************
Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 2
Missing replicas: 0
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Thu Feb 08 09:57:58 CST 2018 in 1 milliseconds
The filesystem under path '/apps/hbase/data/oldWALs' is CORRUPT
[root@test01 ~]# sudo -u hdfs hadoop fs -rm /apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606
18/02/08 09:58:17 INFO fs.TrashPolicyDefault: Moved: 'hdfs://test01.bigdata.hbh:8020/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606' to trash at: hdfs://test01.bigdata.hbh:8020/user/hdfs/.Trash/Current/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606
[root@test01 ~]# sudo -u hdfs hdfs fsck /apps/hbase/data/oldWALs/ | egrep -v '^\.+$' | egrep -v '^$'
Connecting to namenode via http://test01.bigdata.hbh:50070/fsck?ugi=hdfs&path=%2Fapps%2Fhbase%2Fdata%2FoldWALs
FSCK started by hdfs (auth:SIMPLE) from /172.16.201.200 for path /apps/hbase/data/oldWALs at Thu Feb 08 09:58:24 CST 2018
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753450
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: MISSING 1 blocks of total size 91 B.Status: CORRUPT
Total size: 91 B
Total dirs: 1
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 91 B)
********************************
UNDER MIN REPL'D BLOCKS: 1 (100.0 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 91 B
CORRUPT BLOCKS: 1
********************************
Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 1
Missing replicas: 0
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Thu Feb 08 09:58:24 CST 2018 in 1 milliseconds
The filesystem under path '/apps/hbase/data/oldWALs' is CORRUPT
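Picking the affected paths out of the fsck output by eye gets tedious when more files are involved. The following sketch extracts them; list_corrupt_paths is a hypothetical helper name, and the pattern assumes the "<path>: CORRUPT blockpool ..." line format seen in the output above.

```shell
# Print the unique paths that fsck flags as CORRUPT; reads fsck output on stdin.
# (list_corrupt_paths is a hypothetical helper, not a Hadoop command.)
list_corrupt_paths() {
  grep ': CORRUPT blockpool ' | sed 's/: CORRUPT blockpool .*$//' | sort -u
}

# Demo on lines copied from the first fsck run above. On a live cluster, pipe in:
#   sudo -u hdfs hdfs fsck /apps/hbase/data/oldWALs/
list_corrupt_paths <<'EOF'
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753450
/apps/hbase/data/oldWALs/test02.bigdata.hbh%2C16020%2C1515637922143..meta.1515745955950.meta: MISSING 1 blocks of total size 91 B..
/apps/hbase/data/oldWALs/test05.bigdata.hbh%2C16020%2C1515637921765.default.1515745929606: CORRUPT blockpool BP-1912246748-192.168.89.173-1513143837848 block blk_1073753446
EOF
```

Each listed path can then be reviewed and removed with hadoop fs -rm as shown above; fsck also has a -list-corruptfileblocks option that reports similar information. Note that the rm above moved the file to the trash rather than deleting it, so its missing block probably still counts toward the cluster-wide total until the trash is emptied or the file is removed with -skipTrash. Rerunning hdfs dfsadmin -safemode get afterwards shows whether the NameNode has left safe mode.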