Hadoop Single-Node Installation

Environment

A virtual machine running Ubuntu 14.04.

Hadoop version: 2.6.0.

Creating a Dedicated User

To isolate Hadoop from other software, create a dedicated user hduser in a new group hadoop to run it:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

Configuring Passwordless SSH Login

Hadoop manages its nodes over SSH, so passwordless SSH login must be configured for the relevant remote machines as well as for the local machine.

First, generate an SSH key pair. By default the public key is written to ~/.ssh/id_rsa.pub (and the private key to ~/.ssh/id_rsa):

ssh-keygen -t rsa -P ""

Append the public key to ~/.ssh/authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now try logging in to localhost over SSH; no password prompt should appear:

ssh localhost
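
If SSH still prompts for a password, overly permissive file permissions are a common cause; one possible fix (assuming the default OpenSSH configuration):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys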

Installing Dependencies

Make sure a JDK is installed on the machine. To install one:

sudo apt-get update
sudo apt-get install default-jdk

Make sure ssh is available on the node and that sshd is running. To install ssh:

sudo apt-get install ssh

Install rsync:

sudo apt-get install rsync

Installing Hadoop

Download the Hadoop distribution:

wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

Extract the archive:

tar xfz hadoop-2.6.0.tar.gz

Set an environment variable for the Hadoop installation root. This guide installs to /usr/local/hadoop, and several later commands rely on this path:

export HADOOP_HOME=/usr/local/hadoop

Move the extracted directory to the Hadoop root:

sudo mv hadoop-2.6.0 $HADOOP_HOME

Change the owner of the installation directory (recursively, so the extracted files are covered as well) to hduser:

sudo chown -R hduser:hadoop $HADOOP_HOME

Configuring Hadoop

1. Determine the JAVA_HOME Value

The value for the JAVA_HOME environment variable can be discovered with:

update-alternatives --config java

On this machine the output is:

There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.

JAVA_HOME is then the part of the path before /jre/bin/java, i.e.:

/usr/lib/jvm/java-7-openjdk-amd64
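
To derive this value automatically instead of reading it off the output, a one-liner such as the following should work (a sketch, assuming the .../jre/bin/java layout of OpenJDK 7 shown above):

readlink -f /usr/bin/java | sed 's:/jre/bin/java::'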

2. Configure ~/.bashrc

The relevant paths go into ~/.bashrc so that they are loaded automatically on login. Edit ~/.bashrc with vim:

vim ~/.bashrc

.bashrc中添加相关环境变量:

#HADOOP
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Apply the new environment variables:

source ~/.bashrc
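
As a quick check that the variables took effect, the hadoop command should now resolve via $HADOOP_HOME/bin:

hadoop version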

3. Configure $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set JAVA_HOME:

sudo vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Find the JAVA_HOME line in the file and change it to:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

4. Configure $HADOOP_HOME/etc/hadoop/core-site.xml

Edit the $HADOOP_HOME/etc/hadoop/core-site.xml file:

sudo vim $HADOOP_HOME/etc/hadoop/core-site.xml

Between <configuration> and </configuration>, add the HDFS configuration (the HDFS endpoint is set to port 9000; fs.default.name is the deprecated alias of fs.defaultFS and still works in 2.6.0):

<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>
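
To sanity-check the effective setting after saving the file, hdfs getconf can read it back (fs.defaultFS is the canonical key):

hdfs getconf -confKey fs.defaultFS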

5. Configure $HADOOP_HOME/etc/hadoop/yarn-site.xml

Edit the $HADOOP_HOME/etc/hadoop/yarn-site.xml file:

sudo vim $HADOOP_HOME/etc/hadoop/yarn-site.xml

Between <configuration> and </configuration>, add the following:

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

6. Configure $HADOOP_HOME/etc/hadoop/mapred-site.xml

A configuration template ships at $HADOOP_HOME/etc/hadoop/mapred-site.xml.template; copy it to $HADOOP_HOME/etc/hadoop/mapred-site.xml first:

cp $HADOOP_HOME/etc/hadoop/mapred-site.xml{.template,}

Edit the $HADOOP_HOME/etc/hadoop/mapred-site.xml file:

sudo vim $HADOOP_HOME/etc/hadoop/mapred-site.xml

Between <configuration> and </configuration>, add the following:

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>

7. Prepare the Data Directories

Suppose the data will be stored under /mnt/hdfs; for convenience, put that path in an environment variable:

export HADOOP_DATA_DIR=/mnt/hdfs

Create the NameNode and DataNode storage directories and change their owner to hduser:

sudo mkdir -p $HADOOP_DATA_DIR/namenode
sudo mkdir -p $HADOOP_DATA_DIR/datanode
sudo chown hduser $HADOOP_DATA_DIR/namenode
sudo chown hduser $HADOOP_DATA_DIR/datanode
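
A quick way to confirm that both directories exist with the expected owner:

ls -ld $HADOOP_DATA_DIR/namenode $HADOOP_DATA_DIR/datanode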

8. Configure $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file:

sudo vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Between <configuration> and </configuration>, add the NameNode and DataNode configuration as follows:

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/mnt/hdfs/namenode</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/mnt/hdfs/datanode</value>
</property>

9. Format the HDFS Filesystem

Switch to the hduser account if you have not already (su - hduser), then format the HDFS filesystem with:

hdfs namenode -format

Starting Hadoop

Start HDFS:

start-dfs.sh

Start YARN:

start-yarn.sh

The HDFS and YARN web consoles listen on ports 50070 and 8088 by default.
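
A quick reachability check from the shell (assuming curl is installed):

curl -sI http://localhost:50070 | head -n 1
curl -sI http://localhost:8088 | head -n 1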

If everything is working, jps shows the running Hadoop services; on this machine the output is:

29117 NameNode
29675 ResourceManager
29278 DataNode
30002 NodeManager
30123 Jps
29469 SecondaryNameNode

Running a Hadoop Job

The classic WordCount example illustrates how to run a job on Hadoop.

1. Prepare the Program Jar

Below is the WordCount source code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenizes each input line and emits a (word, 1) pair per token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums all counts emitted for each word; also reused as the combiner.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Configures the job, wires up mapper/combiner/reducer, and submits it.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compile the code and package it into a jar:

export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class

wc.jar is the packaged Hadoop MapReduce program.
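
If you prefer to invoke javac directly, an equivalent sketch that uses hadoop classpath to supply the Hadoop dependencies:

javac -classpath "$(hadoop classpath)" WordCount.java
jar cf wc.jar WordCount*.class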

2. Prepare the Input Files

The MapReduce program reads its input from HDFS and writes its output back to HDFS. For this test, the input and output directories are wordcount/input and wordcount/output (relative HDFS paths resolve under the user's HDFS home directory, /user/hduser).

Create the input directory on HDFS:

hdfs dfs -mkdir -p wordcount/input

Prepare some text files as test data. The two files used here are:

File 1: input1
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

File 2: input2
Apache Hadoop 2.6.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.4.1.
Here is a short overview of the major features and improvements.
Common
Authentication improvements when using an HTTP proxy server. This is useful when accessing WebHDFS via a proxy server.
A new Hadoop metrics sink that allows writing directly to Graphite.
Specification work related to the Hadoop Compatible Filesystem (HCFS) effort.
HDFS
Support for POSIX-style filesystem extended attributes. See the user documentation for more details.
Using the OfflineImageViewer, clients can now browse an fsimage via the WebHDFS API.
The NFS gateway received a number of supportability improvements and bug fixes. The Hadoop portmapper is no longer required to run the gateway, and the gateway is now able to reject connections from unprivileged ports.
The SecondaryNameNode, JournalNode, and DataNode web UIs have been modernized with HTML5 and Javascript.
YARN
YARN's REST APIs now support write/modify operations. Users can submit and kill applications through REST APIs.
The timeline store in YARN, used for storing generic and application-specific information for applications, supports authentication through Kerberos.
The Fair Scheduler supports dynamic hierarchical user queues, user queues are created dynamically at runtime under any specified parent-queue.

Copy the two files to wordcount/input:

hdfs dfs -copyFromLocal input* wordcount/input/
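
To verify the upload:

hdfs dfs -ls wordcount/input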

3. Run the Program

Run the program on Hadoop:

hadoop jar wc.jar WordCount wordcount/input wordcount/output

The results are written to wordcount/output; list the output directory:

hdfs dfs -ls wordcount/output

View the results (each output line is a word and its count, separated by a tab):

hdfs dfs -cat wordcount/output/part-r-00000
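
With the single reducer configured here there is only one part file; if a job runs with more reducers, a glob catches them all:

hdfs dfs -cat 'wordcount/output/part-r-*'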

