Translation of: https://www.cloudera.com/documentation/enterprise/latest/topics/spark_first.html
Version: 5.14.2
The easiest way to run a Spark application is with the Scala or Python shell.
- To start one of the shell applications, run one of the following commands:
- Scala:
$ SPARK_HOME/bin/spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version ...
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information
...
SQL context available as sqlContext.
scala>
- Python:
$ SPARK_HOME/bin/pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information
...
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version ...
/_/
Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
In a CDH deployment, SPARK_HOME defaults to /usr/lib/spark in package-based installations and to /opt/cloudera/parcels/CDH/lib/spark in parcel-based installations. In a Cloudera Manager deployment, the shells are also available from /usr/bin.
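Before launching a shell from a script, SPARK_HOME can be set explicitly; a minimal sketch, assuming a parcel-based install at the default path from the text above:

```shell
# Parcel-based CDH install: default SPARK_HOME (path from the text above).
# Package-based installs would use /usr/lib/spark instead.
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
echo "$SPARK_HOME"
```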
For a complete list of shell options, run spark-shell or pyspark with the -h flag.
- To run the classic Hadoop word count application, copy an input file to HDFS:
$ hdfs dfs -put input
- Within a shell, run the word count application using the following code, substituting namenode_host, path/to/input, and path/to/output:
- Scala:
scala> val myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
- Python:
>>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1,v2: v1 + v2)
>>> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
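The transformation pipeline above can be sketched in plain Python, without Spark, to show what each step produces; the sample input line is an assumption for illustration only:

```python
# Plain-Python sketch of the word-count pipeline above (no Spark needed):
# flatMap splits every line into words, map pairs each word with 1,
# and reduceByKey sums the counts per word.
from collections import defaultdict

lines = ["to be or not to be"]  # stand-in for the HDFS input file

# flatMap(lambda line: line.split(" "))
words = [w for line in lines for w in line.split(" ")]

# map(lambda word: (word, 1))
pairs = [(w, 1) for w in words]

# reduceByKey(lambda v1, v2: v1 + v2)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same steps run in parallel across the cluster, with reduceByKey shuffling pairs so that all counts for a given word land on one node.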