For copying data between clusters you can use scp, rsync, distcp, and so on. Here I will only cover distcp; scp and rsync were already covered in the Linux chapter, so I won't repeat them.
1、URL
http://hadoop.apache.org/docs/r2.8.2/hadoop-distcp/DistCp.html
2、Overview
DistCp Version 2 (distributed copy) is a tool for large inter-cluster and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into the input of map tasks, each of which copies a partition of the files specified in the source list.
The legacy DistCp implementation had its share of quirks and drawbacks, both in its usage and in its extensibility and performance. The purpose of the DistCp refactor was to fix these shortcomings, enabling it to be used and extended programmatically. New paradigms were introduced to improve runtime and setup performance, while the legacy behavior is preserved by default.
This document aims to describe the design of the new DistCp, its new features, its optimal use, and its deviations from the legacy implementation.
3、Help command
[victor@node1 hadoop]$ hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB
-delete Delete from target, files missing in source
-diff <arg> Use snapshot diff report to identify the
difference between source and target
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied
to <= n
-filters <arg> The path to a file containing a list of
strings for paths to be excluded from the
copy.
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs
are saved
-m <arg> Max number of concurrent maps to use for
copy
-mapredSslConf <arg> Configuration for ssl config file, to use
with hftps://. Must be in the classpath.
-numListstatusThreads <arg> Number of threads to use for building file
listing (max 40).
-overwrite Choose to overwrite target files
unconditionally, even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If
-p is specified with no <arg>, then
preserves replication, block size, user,
group, permission, checksum type and
timestamps. raw.* xattrs are preserved when
both the source and destination paths are
in the /.reserved/raw hierarchy (HDFS
only). raw.* xattr preservation is
independent of the -p flag. Refer to the
DistCp documentation for more details.
-sizelimit <arg> (Deprecated!) Limit number of files copied
to <= n bytes
-skipcrccheck Whether to skip CRC checks between source
and target paths.
-strategy <arg> Copy strategy to use. Default is dividing
work based on file sizes
-tmp <arg> Intermediate work path to be used for
atomic commit
 -update                       Update target, copying only missing files
                               or directories
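To show how several of the options above combine in practice, here is a sketch of a typical incremental-sync invocation. The cluster addresses nn1/nn6 follow this document's examples; since running distcp requires a live Hadoop cluster, the command is only assembled and echoed here.

```shell
# Source and target paths (assumed, matching this document's examples)
SRC=hdfs://nn1:9000/foo/bar
DST=hdfs://nn6:9000/bar/foo
# -update : copy only files missing from, or changed in, the target
# -delete : remove target files that no longer exist in the source
# -p      : preserve replication, block size, user, group, permission, timestamps
# -m 20   : cap the number of concurrent map tasks at 20
CMD="hadoop distcp -update -delete -p -m 20 $SRC $DST"
echo "$CMD"
# On a real cluster, run it directly:
# hadoop distcp -update -delete -p -m 20 "$SRC" "$DST"
```

Note that -update with -delete makes the target mirror the source, so use -delete with care.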
4、Usage
[victor@node1 hadoop]$ bin/hadoop distcp hdfs://nn1:9000/foo/bar hdfs://nn6:9000/bar/foo
This expands the namespace under /foo/bar on cluster nn1, partitions the files among a set of map tasks, and each map task then performs its share of the copy from nn1 to nn6. Note that DistCp operates on absolute paths.
5、Specifying multiple source directories on the command line
[victor@node1 hadoop]$ bin/hadoop distcp \
hdfs://nn1:9000/foo/a \
hdfs://nn1:9000/foo/b \
hdfs://nn6:9000/bar/foo
6、Using the -f option to read multiple sources from a file
[victor@node1 hadoop]$ hadoop distcp -f \
hdfs://nn1:9000/srclist \
hdfs://nn6:9000/bar/foo
//Tip: srclist contains hdfs://nn1:9000/foo/a and hdfs://nn1:9000/foo/b
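As a sketch of how such a srclist is built: write one fully-qualified source path per line, then upload it to HDFS so distcp can read it. The local path /tmp/srclist is an assumption for illustration.

```shell
# Build the source list: one fully-qualified path per line
cat > /tmp/srclist <<'EOF'
hdfs://nn1:9000/foo/a
hdfs://nn1:9000/foo/b
EOF
wc -l < /tmp/srclist   # prints 2
# Upload and use it (requires a running cluster):
# hdfs dfs -put /tmp/srclist hdfs://nn1:9000/srclist
# hadoop distcp -f hdfs://nn1:9000/srclist hdfs://nn6:9000/bar/foo
```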