Hadoop is an open-source distributed computing platform under the Apache Foundation. It is written in Java, has good cross-platform characteristics, and can be deployed on clusters of inexpensive machines. Users can develop distributed programs without knowing the low-level details of the distributed system, and make full use of the cluster for high-speed computation and storage.
Ubuntu version: 18.x ~ 20.x
Hadoop version: 3.2.2 (http://hadoop.apache.org/)
1. Install the JDK
$ sudo apt-get update
$ sudo apt-get install openjdk-8-jdk
$ java -version
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-8u275-b01-0ubuntu1~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)
$ update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
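If you want to confirm the exact install path to use for JAVA_HOME in the next step, one simple check is to resolve the java symlink (the result should match the alternative shown above):
$ readlink -f /usr/bin/java
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
The directory to use for JAVA_HOME is everything before /jre/bin/java.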
2. Set JAVA_HOME
$ sudo vi /etc/profile
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
JRE_HOME=$JAVA_HOME/jre
CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
export JAVA_HOME JRE_HOME CLASS_PATH PATH
$ source /etc/profile
$ echo $JAVA_HOME
3. Install Hadoop on the host
$ wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz # download Hadoop
$ mv ./hadoop-3.2.2.tar.gz ~/apps/ # move it to the directory of your choice
$ cd ~/apps/
$ tar -zvxf hadoop-3.2.2.tar.gz # extracts to ~/apps/hadoop-3.2.2
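Optionally, you can also verify the downloaded tarball against its published checksum. The .sha512 file is published alongside the release on the Apache archive (the URL below assumes the standard archive layout):
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz.sha512
$ sha512sum hadoop-3.2.2.tar.gz # compare the hash with the contents of the .sha512 file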
Configure passwordless SSH access to localhost:
$ cd ~/.ssh
$ ssh-keygen -t rsa
$ cat ./id_rsa.pub >> ./authorized_keys
$ ssh localhost
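If ssh localhost still prompts for a password, the most common cause is directory and file permissions; a typical fix:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys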
4. Configure Hadoop
Create a hadoop group, add the current user to it, and grant the necessary permissions:
$ sudo addgroup hadoop
$ sudo usermod -a -G hadoop xxx # add the current user (xxx) to the hadoop group
$ sudo vim /etc/sudoers # add the hadoop group to sudoers
After the line root ALL=(ALL) ALL
add a line %hadoop ALL=(ALL) ALL (the leading % denotes a group)
$ sudo chmod -R 755 ~/apps/hadoop-3.2.2
$ sudo chown -R xxx:hadoop ~/apps/hadoop-3.2.2 # otherwise ssh/Hadoop will be denied access
$ sudo vim /etc/profile (append to the existing settings; do not remove the JAVA_HOME ... lines added earlier)
HADOOP_HOME=/home/xxx/apps/hadoop-3.2.2
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
export HADOOP_HOME PATH
$ source /etc/profile # source is a shell builtin, so do not prefix it with sudo
$ echo $HADOOP_HOME
5. Run Hadoop in standalone mode
$ hadoop version # check that the configuration succeeded
$ cd ~/apps/hadoop-3.2.2
$ mkdir input
$ cp README.txt input
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount input output
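In standalone mode the result is written to the local filesystem; note that the output directory must not already exist before the run. A quick way to inspect the result:
$ cat ./output/*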
6. Configure Hadoop in pseudo-distributed mode
$ cd ~/apps/hadoop-3.2.2
$ vim ./etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
$ vim ./etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/xxx/apps/hadoop-3.2.2/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
*Note: 'xxx' in the paths above is the Ubuntu user's home directory name; the same applies below.
$ vim ./etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.http.address</name>
<value>0.0.0.0:50070</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/xxx/apps/hadoop-3.2.2/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/xxx/apps/hadoop-3.2.2/tmp/dfs/data</value>
</property>
</configuration>
$ hdfs namenode -format
$ ./sbin/start-dfs.sh
Possible error: localhost: ERROR: JAVA_HOME is not set and could not be found.
Solution: the Java path in Hadoop's own hadoop-env.sh is set incorrectly. The file is in the ./etc/hadoop directory; fix it as follows:
$ vim ./etc/hadoop/hadoop-env.sh
Change the line
export JAVA_HOME=$JAVA_HOME
# the line may also read export JAVA_HOME= (and be commented out)
to
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 # your own Java home path; check it with echo $JAVA_HOME in a terminal
Save and exit, then run ./sbin/start-dfs.sh again.
Run the jps command to check whether startup succeeded; both the NameNode and DataNode processes must appear.
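A healthy pseudo-distributed HDFS typically shows something like the following (PIDs will differ):
$ jps
12321 NameNode
12456 DataNode
12678 SecondaryNameNode
12890 Jps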
Visit http://localhost:50070 to view node information (this port follows the dfs.http.address setting above; the Hadoop 3 default NameNode UI port is 9870).
Stop HDFS: ./sbin/stop-dfs.sh
7. Configure YARN (optional)
$ vim ./etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
$ vim ./etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Start YARN (HDFS must already be running: start-dfs.sh)
$ ./sbin/start-yarn.sh
Start the history server so that completed jobs and their status can be viewed in the web UI:
$ ./sbin/mr-jobhistory-daemon.sh start historyserver
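mr-jobhistory-daemon.sh still works in Hadoop 3.2.x but prints a deprecation warning; the newer equivalent, if you prefer it, is:
$ mapred --daemon start historyserver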
Once everything is up, the cluster resource manager is available at http://localhost:8088/cluster.
Without YARN, jobs are executed by "mapred.LocalJobRunner"; with YARN enabled, they are executed by "mapred.YARNRunner". A benefit of starting YARN is that you can follow job progress in that web UI.
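To confirm that jobs actually go through YARN, a quick sketch (assuming HDFS and YARN are running and the current user is xxx, so relative HDFS paths resolve to /user/xxx):
$ cd ~/apps/hadoop-3.2.2
$ hdfs dfs -mkdir -p /user/xxx
$ hdfs dfs -mkdir input
$ hdfs dfs -put README.txt input
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount input output
$ hdfs dfs -cat output/*
The job should then appear at http://localhost:8088/cluster.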
8. HDFS file operations
$ hdfs dfsadmin -report # view cluster information
$ hdfs dfs -mkdir /test # create folder
$ hdfs dfs -rm -r /test # delete folder
$ hdfs dfs -ls -R / # recursive list folder
$ hdfs dfs -put data.txt /test # put file from local
$ hdfs dfs -get /test/data.txt # get file to local
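A few more commonly used operations (all standard hdfs dfs subcommands):
$ hdfs dfs -cat /test/data.txt # print file contents
$ hdfs dfs -du -h /test # show space used, human-readable
$ hdfs dfs -cp /test/data.txt /test/data.bak # copy within HDFS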
9. Cluster/distributed Hadoop configuration
One master host hadoop-master-vm and one worker host hadoop-slave-vm. Configure passwordless SSH between them (append each machine's ~/.ssh/id_rsa.pub to the other machine's ~/.ssh/authorized_keys).
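One convenient way to exchange the keys, once the /etc/hosts entries below are in place (this assumes the same user xxx exists on both machines and that OpenSSH's ssh-copy-id is available):
$ ssh-copy-id xxx@hadoop-slave-vm # run on the master
$ ssh-copy-id xxx@hadoop-master-vm # run on the worker
$ ssh hadoop-slave-vm # should now log in without a password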
1) On the master hadoop-master-vm, edit /etc/hosts
127.0.0.1 localhost
#127.0.1.1 hadoop-master-vm
192.168.0.3 hadoop-master-vm
192.168.0.4 hadoop-slave-vm
...
2) On the worker hadoop-slave-vm, edit /etc/hosts
127.0.0.1 localhost
#127.0.1.1 hadoop-slave-vm
192.168.0.3 hadoop-master-vm
192.168.0.4 hadoop-slave-vm
...
3) Modify the Hadoop configuration on the master
$ cd ~/apps/hadoop-3.2.2
$ vim ./etc/hadoop/workers
#localhost
hadoop-slave-vm
$ vim ./etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/xxx/apps/hadoop-3.2.2/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master-vm:9000</value>
</property>
</configuration>
$ vim ./etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.http.address</name>
<value>hadoop-master-vm:50070</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/xxx/apps/hadoop-3.2.2/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/xxx/apps/hadoop-3.2.2/tmp/dfs/data</value>
</property>
</configuration>
$ vim ./etc/hadoop/mapred-site.xml # MapReduce settings
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop-master-vm:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop-master-vm:19888</value>
</property>
</configuration>
$ vim ./etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master-vm</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
$ hdfs namenode -format
Synchronize the master's Hadoop directory ~/apps/hadoop-3.2.2 to the worker in full; configure JAVA_HOME, HADOOP_HOME, etc. on the worker following the master's settings.
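A possible way to do the synchronization (assumes rsync is installed on both machines and that the worker already has the same ~/apps path; the trailing slashes matter):
$ rsync -av ~/apps/hadoop-3.2.2/ xxx@hadoop-slave-vm:~/apps/hadoop-3.2.2/ # optionally add --exclude tmp/ to skip local HDFS data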
4) Run on the master
$ ./sbin/start-dfs.sh
$ ./sbin/start-yarn.sh
# $ mr-jobhistory-daemon.sh start historyserver
(1) Check the processes on the master
xxx@hadoop-master-vm:~/apps/hadoop-3.2.2$ jps
6807 NameNode
7752 ResourceManager
7082 SecondaryNameNode
8171 Jps
8156 JobHistoryServer
(2) Check the processes on the worker
xxx@hadoop-slave-vm:~/apps/hadoop-3.2.2$ jps
2368 remoting.jar
6192 DataNode
20802 Jps
(3) View the report on the master or worker
$ hdfs dfsadmin -report
Configured Capacity: 490651459584 (456.95 GB)
Present Capacity: 410068922368 (381.91 GB)
DFS Remaining: 410068877312 (381.91 GB)
DFS Used: 45056 (44 KB)
DFS Used%: 0.00%
...
-------------------------------------------------
Live datanodes (1):
Name: 192.168.0.4:9866 (hadoop-slave-vm)
Hostname: hadoop-slave-vm
Decommission Status : Normal
Configured Capacity: 490651459584 (456.95 GB)
DFS Used: 45056 (44 KB)
Non DFS Used: 55587368960 (51.77 GB)
DFS Remaining: 410068877312 (381.91 GB)
DFS Used%: 0.00%
DFS Remaining%: 83.58%
...
10. Web UI
1) Hadoop node overview
http://hadoop-master-vm:50070/
2) Hadoop cluster
http://hadoop-master-vm:8088/