Main contents of this section:
Spark HelloWorld lab:
Use Spark to count how many times each word occurs.
Preparing the wordcount data
# echo "Hello World Bye World" > /file0
# echo "Hello Hadoop Goodbye Hadoop" > /file1
# sudo -u hdfs hdfs dfs -mkdir -p /user/spark/wordcount/input
# sudo -u hdfs hdfs dfs -put file* /user/spark/wordcount/input
# sudo -u hdfs hdfs dfs -chmod 1777 /user/spark/wordcount/input
# sudo -u hdfs hdfs dfs -chown -R spark:spark /user/spark/wordcount/input
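Before starting the job it is worth sanity-checking the upload. A minimal check from spark-shell, assuming the same cluster1 nameservice used in the commands below:

scala> sc.textFile("hdfs://cluster1/user/spark/wordcount/input").collect().foreach(println)

This should print back the two lines written to file0 and file1 above.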
Start spark-shell and run the script interactively:
# sudo -u spark spark-shell
Setting default log level to "WARN".
scala>
scala> val file = sc.textFile("hdfs://cluster1/user/spark/wordcount/input")   // define file, an RDD backed by the input directory
scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)   // split each line on spaces, pair each word with 1, then sum the counts per word
scala> counts.saveAsTextFile("hdfs://cluster1/user/spark/wordcount/output")   // write the result files to HDFS
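The same three steps can also be packaged as a standalone application and launched with spark-submit instead of typed into the shell. A minimal sketch, assuming the same cluster1 paths; the WordCount object name is illustrative and not part of the original lab:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Hypothetical standalone version of the spark-shell session above.
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val file = sc.textFile("hdfs://cluster1/user/spark/wordcount/input")
    val counts = file
      .flatMap(line => line.split(" "))   // break each line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .reduceByKey(_ + _)                 // sum the counts for each word
    counts.saveAsTextFile("hdfs://cluster1/user/spark/wordcount/output")

    sc.stop()
  }
}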
Viewing the output in Pig
# sudo -u hdfs pig
grunt> ls
hdfs://cluster1/user/spark/wordcount/output/_SUCCESS<r 3> 0
hdfs://cluster1/user/spark/wordcount/output/part-00000<r 3> 28
hdfs://cluster1/user/spark/wordcount/output/part-00001<r 3> 23
grunt> cat part-00000
(Bye,1)
(Hello,2)
(World,2)
grunt> cat part-00001
(Goodbye,1)
(Hadoop,2)
grunt>
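The output can also be inspected without leaving Spark; a minimal sketch that reads the result files back in spark-shell:

scala> sc.textFile("hdfs://cluster1/user/spark/wordcount/output").collect().foreach(println)

This prints the same (word, count) pairs that Pig's cat shows above.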