What happens inside the system when you run a Hadoop command such as hadoop fs -ls / on the command line?
1. Bash handling
You can use the Linux which and ll commands (ll is commonly an alias for ls -l) to trace where the hadoop command comes from:
$ which hadoop
/usr/bin/hadoop
$ ll /usr/bin/hadoop
lrwxrwxrwx 1 root root 24 11-19 18:55 /usr/bin/hadoop -> /etc/alternatives/hadoop
$ ll /etc/alternatives/hadoop
lrwxrwxrwx 1 root root 64 11-19 18:55 /etc/alternatives/hadoop -> /app/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/bin/hadoop
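As the output shows, on this CDH cluster the hadoop command ultimately resolves to the file /app/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/bin/hadoop. Opening that file in vim shows that it in turn hands off to /app/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/bin/hadoop. The core part of that second script is excerpted below: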
#core commands
*)
  # the core commands
  if [ "$COMMAND" = "fs" ] ; then
    CLASS=org.apache.hadoop.fs.FsShell
  elif [ "$COMMAND" = "version" ] ; then
    CLASS=org.apache.hadoop.util.VersionInfo
  elif [ "$COMMAND" = "jar" ] ; then
    CLASS=org.apache.hadoop.util.RunJar
  elif [ "$COMMAND" = "key" ] ; then
    CLASS=org.apache.hadoop.crypto.key.KeyShell
  elif [ "$COMMAND" = "checknative" ] ; then
    CLASS=org.apache.hadoop.util.NativeLibraryChecker
  elif [ "$COMMAND" = "distcp" ] ; then
    CLASS=org.apache.hadoop.tools.DistCp
    CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  elif [ "$COMMAND" = "daemonlog" ] ; then
    CLASS=org.apache.hadoop.log.LogLevel
  elif [ "$COMMAND" = "archive" ] ; then
    CLASS=org.apache.hadoop.tools.HadoopArchives
    CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  elif [ "$COMMAND" = "credential" ] ; then
    CLASS=org.apache.hadoop.security.alias.CredentialShell
  elif [ "$COMMAND" = "s3guard" ] ; then
    CLASS=org.apache.hadoop.fs.s3a.s3guard.S3GuardTool
    CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  elif [ "$COMMAND" = "trace" ] ; then
    CLASS=org.apache.hadoop.tracing.TraceAdmin
  elif [ "$COMMAND" = "classpath" ] ; then
    if [ "$#" -eq 1 ]; then
      # No need to bother starting up a JVM for this simple case.
      echo $CLASSPATH
      exit
    else
      CLASS=org.apache.hadoop.util.Classpath
    fi
  elif [[ "$COMMAND" = -* ]] ; then
    # class and package names cannot begin with a -
    echo "Error: No command named \`$COMMAND' was found. Perhaps you meant \`hadoop ${COMMAND#-}'"
    exit 1
  else
    CLASS=$COMMAND
  fi
  shift
  exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
  ;;
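As the excerpt shows, the shell script sets $CLASS to a different Java class for each subcommand (for example, fs or distcp) and finally uses exec to launch that class with Java. After shift drops the subcommand itself, the remaining arguments ("$@") are passed on to the Java application as its program arguments.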
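The subcommand-to-class mapping performed by the script can be modeled in a few lines of Java. This is only an illustrative sketch: HadoopCommandResolver and its resolve method are hypothetical names, not part of Hadoop, and the table copies just a handful of entries from the excerpt above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that mirrors the "core commands" branch of the script.
// It is not a Hadoop class; it only models the lookup logic.
class HadoopCommandResolver {
    // Subcommand -> main class, copied from the script excerpt (subset only).
    private static final Map<String, String> CORE = new HashMap<>();
    static {
        CORE.put("fs", "org.apache.hadoop.fs.FsShell");
        CORE.put("version", "org.apache.hadoop.util.VersionInfo");
        CORE.put("jar", "org.apache.hadoop.util.RunJar");
        CORE.put("distcp", "org.apache.hadoop.tools.DistCp");
        CORE.put("archive", "org.apache.hadoop.tools.HadoopArchives");
    }

    static String resolve(String command) {
        String known = CORE.get(command);
        if (known != null) {
            return known;
        }
        if (command.startsWith("-")) {
            // Same rule as the script: class names cannot begin with '-'.
            throw new IllegalArgumentException("No command named `" + command
                    + "' was found. Perhaps you meant `hadoop " + command.substring(1) + "'");
        }
        // Fallback branch (CLASS=$COMMAND): treat the argument as a class name.
        return command;
    }

    public static void main(String[] args) {
        System.out.println(resolve("fs"));
        System.out.println(resolve("com.example.MyTool"));
    }
}
```

The fallback branch is why a fully qualified class name can be given directly to the hadoop command: anything that is not a known subcommand and does not start with a dash is handed to the JVM as the class to run.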
2. Java handling
Take the hadoop fs command as an example:
a. fs maps to the org.apache.hadoop.fs.FsShell class, whose main method is as follows:
public static void main(String argv[]) throws Exception {
  FsShell shell = newShellInstance();
  Configuration conf = new Configuration();
  conf.setQuietMode(false);
  shell.setConf(conf);
  int res;
  try {
    res = ToolRunner.run(shell, argv);
  } finally {
    shell.close();
  }
  System.exit(res);
}
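The main method initializes a Configuration and executes the command through ToolRunner's run() method.
b. The code of ToolRunner's run() method is as follows (the two-argument overload called from main passes tool.getConf() on to this three-argument version):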
public static int run(Configuration conf, Tool tool, String[] args)
    throws Exception {
  if (conf == null) {
    conf = new Configuration();
  }
  GenericOptionsParser parser = new GenericOptionsParser(conf, args);
  //set the configuration back, so that Tool can configure itself
  tool.setConf(conf);
  //get the args w/o generic hadoop args
  String[] toolArgs = parser.getRemainingArgs();
  return tool.run(toolArgs);
}
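ToolRunner's run method instantiates a GenericOptionsParser to parse the generic options. If the command line contains any of the options -fs, -jt, -conf, -libjars, -files, -archives, -D, or -tokenCacheFile, they are parsed at this stage and applied to the configuration.
Next, the getRemainingArgs() method returns the arguments that are left over.
Finally, tool.run() is called; this run method is implemented by the tool class itself (here, FsShell).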
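The three steps above (parse generic options into the configuration, collect the remaining arguments, call the tool) can be imitated with plain JDK classes. The sketch below is a simplified stand-in, not the real implementation: MiniToolRunner, MiniTool, and parseGenericOptions are hypothetical names, a Map replaces Configuration, and only -D and -fs are handled. It assumes that generic options come before the tool's own arguments and that -fs sets the fs.defaultFS property.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; none of these types are the real Hadoop classes.
class MiniToolRunner {
    /** Simplified counterpart of Tool: it only sees the non-generic arguments. */
    interface MiniTool {
        int run(Map<String, String> conf, String[] toolArgs);
    }

    /** Consumes leading generic options into conf and returns what is left. */
    static String[] parseGenericOptions(Map<String, String> conf, String[] args) {
        int i = 0;
        while (i < args.length) {
            String arg = args[i];
            if (arg.equals("-D") && i + 1 < args.length) {
                putProperty(conf, args[i + 1]);      // form: -D key=value
                i += 2;
            } else if (arg.startsWith("-D") && arg.contains("=")) {
                putProperty(conf, arg.substring(2)); // form: -Dkey=value
                i += 1;
            } else if (arg.equals("-fs") && i + 1 < args.length) {
                conf.put("fs.defaultFS", args[i + 1]);
                i += 2;
            } else {
                break; // first unrecognized token: everything from here belongs to the tool
            }
        }
        return Arrays.copyOfRange(args, i, args.length);
    }

    private static void putProperty(Map<String, String> conf, String keyValue) {
        int eq = keyValue.indexOf('=');
        if (eq > 0) {
            conf.put(keyValue.substring(0, eq), keyValue.substring(eq + 1));
        }
    }

    /** Same shape as the run method shown above: configure, strip, delegate. */
    static int run(Map<String, String> conf, MiniTool tool, String[] args) {
        if (conf == null) {
            conf = new HashMap<>();
        }
        String[] toolArgs = parseGenericOptions(conf, args);
        return tool.run(conf, toolArgs);
    }

    public static void main(String[] args) {
        String[] input = {"-D", "dfs.replication=2", "-ls", "/"};
        int exit = run(new HashMap<>(), (conf, rest) -> {
            System.out.println("conf=" + conf + " toolArgs=" + Arrays.toString(rest));
            return 0;
        }, input);
        System.out.println("exit=" + exit);
    }
}
```

With this input the tool receives only -ls and /, while dfs.replication ends up in the configuration map, which is the separation the real parser provides for FsShell.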
c. The code of FsShell's run method is as follows:
public int run(String argv[]) throws Exception {
  // initialize FsShell
  init();
  int exitCode = -1;
  ...
  try {
    exitCode = instance.run(Arrays.copyOfRange(argv, 1, argv.length));
  }
  ...
  return exitCode;
}
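The visible part of this method suggests a simple pattern: the first argument names the command (for example -ls), and the rest of the array is handed to the object that implements it. The following sketch illustrates that pattern with JDK classes only. MiniFsShell and its command table are hypothetical; how the real FsShell obtains instance is elided in the excerpt and is not reproduced here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of the dispatch step; not the real FsShell.
class MiniFsShell {
    // Command name -> handler taking the remaining arguments and returning an exit code.
    private final Map<String, Function<String[], Integer>> commands = new HashMap<>();

    MiniFsShell() {
        // Assumed example handler: it only prints what it would list.
        commands.put("-ls", args -> {
            System.out.println("listing " + Arrays.toString(args));
            return 0;
        });
    }

    int run(String[] argv) {
        int exitCode = -1;
        if (argv.length < 1) {
            return exitCode; // nothing to dispatch
        }
        Function<String[], Integer> instance = commands.get(argv[0]);
        if (instance == null) {
            System.err.println(argv[0] + ": Unknown command");
            return exitCode;
        }
        // Drop argv[0] (the command name) and pass the rest to the handler,
        // as the copyOfRange call in the excerpt does.
        exitCode = instance.apply(Arrays.copyOfRange(argv, 1, argv.length));
        return exitCode;
    }

    public static void main(String[] args) {
        System.out.println(new MiniFsShell().run(new String[] {"-ls", "/"}));
    }
}
```

For hadoop fs -ls /, this means the handler for -ls receives only the path /, because the fs subcommand was already consumed by the shell script and -ls is stripped here.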
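You can keep following the code to study the execution logic behind the fs command. This article focuses on how a command is submitted for execution, so it does not go further into those details.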