1. Environment setup
(1) Start the Hadoop cluster
[root@bigdata ~]# start-all.sh
[root@bigdata ~]# jps
2096 NameNode
2422 SecondaryNameNode
2232 DataNode
2586 ResourceManager
2813 NodeManager
3037 Jps
(2) Start the HistoryServer
[root@bigdata ~]# mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /root/trainings/hadoop-2.7.3/logs/mapred-root-historyserver-bigdata.out
[root@bigdata ~]# jps
2096 NameNode
3123 Jps
2422 SecondaryNameNode
2232 DataNode
2586 ResourceManager
3084 JobHistoryServer
2813 NodeManager
(3) Start Pig in cluster (MapReduce) mode
[root@bigdata ~]# pig
18/09/26 00:02:32 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
18/09/26 00:02:32 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
18/09/26 00:02:32 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2018-09-26 00:02:32,804 [main] INFO org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2018-09-26 00:02:32,804 [main] INFO org.apache.pig.Main - Logging error messages to: /root/pig_1537891352803.log
2018-09-26 00:02:32,830 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2018-09-26 00:02:33,289 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2018-09-26 00:02:33,289 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://bigdata:9000
2018-09-26 00:02:33,812 [main] INFO org.apache.pig.PigServer - Pig Script ID for the session: PIG-default-4db6a82f-0910-4950-a889-e1d7ee031cce
2018-09-26 00:02:33,812 [main] WARN org.apache.pig.PigServer - ATS is disabled since yarn.timeline-service.enabled set to false
grunt>
(4) Upload the test data to HDFS
grunt> copyFromLocal /root/input/data.txt /input
grunt> cat /input/data.txt
I love Beijing
I love China
Beijing is the capital of China
2. WordCount program
(1) Load the data
grunt> lines = load '/input/data.txt' as (line:chararray);
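The load statement above names no loader, so Pig falls back to its default, PigStorage, which splits each line on tab characters. The sample file contains no tabs, so each whole line ends up in the single field `line`. The Python sketch below models that behaviour. It assumes that fields beyond the declared one-field schema are dropped, and `load_first_field` is a hypothetical helper, not a Pig API.

```python
# Raw lines as they sit in /input/data.txt (content taken from the cat output above).
raw_lines = [
    "I love Beijing\n",
    "I love China\n",
    "Beijing is the capital of China\n",
]

def load_first_field(raw, delimiter="\t"):
    """Model of `load ... as (line:chararray)` with the default loader.

    Assumption: the record is split on the delimiter and only the first
    field is kept, because the declared schema has a single field.
    """
    fields = raw.rstrip("\n").split(delimiter)
    return (fields[0],)

# Each record is a one-field tuple, mirroring the relation `lines`.
lines = [load_first_field(raw) for raw in raw_lines]
print(lines)
```

If the input might contain tabs, passing an explicit loader with a delimiter that never occurs in the data is one way to keep each line in a single field.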
(2) Split each line into words
grunt> words = foreach lines generate flatten(TOKENIZE(line)) as word;
2018-09-26 00:55:07,771 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
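The tokenizing step combines two operations: TOKENIZE turns each line into a bag of words, and flatten un-nests that bag so that every word becomes a record of its own. The Python sketch below mirrors the two stages on the sample data. It assumes TOKENIZE's default delimiters are space, double quote, comma, parentheses and asterisk, and the function name `tokenize` is chosen for illustration only.

```python
import re

# Sample input copied from /input/data.txt.
lines = [
    "I love Beijing",
    "I love China",
    "Beijing is the capital of China",
]

# Assumption: with no explicit delimiter argument, TOKENIZE splits on
# space, double quote, comma, parentheses and asterisk.
DELIMS = re.compile(r'[ ",()*]+')

def tokenize(line):
    """Return the non-empty tokens of one line (the bag TOKENIZE would build)."""
    return [tok for tok in DELIMS.split(line) if tok]

# Stage 1: one bag per input line, as TOKENIZE alone would produce.
bags = [tokenize(line) for line in lines]

# Stage 2: flatten un-nests the bags, giving one record per word.
words = [word for bag in bags for word in bag]

print(bags)
print(words)
```

Without flatten, the relation would hold three records (one bag per line) instead of twelve single-word records, and the later grouping would operate on bags rather than words.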
(3) Group by word
grunt> grpd = group words by word;
(4) Count the occurrences of each word
grunt> cntd = foreach grpd generate group, COUNT(words);
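Grouping and counting can be pictured as building one bag per distinct word and then taking the size of each bag. The Python sketch below illustrates the idea on the twelve words from the sample file. It is a model of the dataflow, not of how Pig executes it on MapReduce, and the variable names simply reuse the relation names from the statements above.

```python
from collections import defaultdict

# The twelve words obtained from the sample file after tokenizing.
words = [
    "I", "love", "Beijing",
    "I", "love", "China",
    "Beijing", "is", "the", "capital", "of", "China",
]

# group words by word: each key ("group") maps to the bag of records sharing it.
grpd = defaultdict(list)
for word in words:
    grpd[word].append(word)

# foreach grpd generate group, COUNT(words): one (key, bag size) pair per group.
cntd = {group: len(bag) for group, bag in grpd.items()}

print(cntd)
```

In the Pig statement, `group` is the field that holds the grouping key and `words` is the bag of matching records, which is why COUNT is applied to `words` rather than to `word`.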
(5) Print the result
grunt> dump cntd;
(MapReduce job log output omitted)
(I,2)
(is,1)
(of,1)
(the,1)
(love,2)
(China,2)
(Beijing,2)
(capital,1)
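As this shows, WordCount in Pig Latin takes only four statements (load, tokenize, group, count), which makes developing MapReduce programs far more efficient than writing them by hand.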