Download the required components
Component download links:
spark : http://spark.apache.org/downloads.html
hadoop : http://hadoop.apache.org/releases.html
jdk: http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html
hadoop-common : https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip (for Windows 7)
Scala : http://www.scala-lang.org/downloads
Installation
Scala is only needed if you plan to write your Spark programs in Scala; otherwise you can skip installing it.
a. Install the JDK and Scala using the default installer steps.
b. Unpack Spark (e.g. D:\spark-2.0.0-bin-hadoop2.7)
c. Unpack Hadoop (e.g. D:\hadoop2.7)
d. Unpack hadoop-common (for Windows 7)
e. Copy hadoop-common/bin into hadoop/bin (for Windows 7),
that is, copy winutils.exe into Hadoop's bin directory (e.g. hadoop-2.7.4\bin); a scripted version of this copy is sketched right after these steps.
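If you prefer to script step e, here is a minimal sketch of the copy in Python. The source and destination paths are assumptions based on the unpack locations above; adjust them to your own layout.

import os
import shutil

# Assumed locations from the steps above; change them to match your machine.
hadoop_common_bin = r"D:\hadoop-common-2.2.0-bin\bin"
hadoop_bin = r"D:\hadoop2.7\bin"

# Copy every file from hadoop-common\bin (including winutils.exe) into Hadoop's bin.
for name in os.listdir(hadoop_common_bin):
    src = os.path.join(hadoop_common_bin, name)
    if os.path.isfile(src):
        shutil.copy2(src, hadoop_bin)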
Configure environment variables
For reference, here are screenshots from other posts:
https://blog.csdn.net/HHTNAN/article/details/78391409
https://blog.csdn.net/qq_38799155/article/details/78254580
https://blog.csdn.net/hjxinkkl/article/details/57083549?winzoom=1
JAVA_HOME:
and add to Path:
%JAVA_HOME%\bin
CLASSPATH :
SPARK_HOME :
and add to Path:
%SPARK_HOME%\bin
%SPARK_HOME%\sbin
PYTHONPATH :
copy the entire spark\python\pyspark folder into Anaconda3\Lib\site-packages
HADOOP_HOME
note: make sure winutils.exe has been copied into hadoop-2.7.4's bin directory (this is needed for PySpark/Python support),
and add to Path:
%HADOOP_HOME%\bin
Overview: the original post includes screenshots of the user variables that were added and of the system Path variable.
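If PySpark still cannot be imported or started from your Python environment (for example, when an IDE does not inherit the variables above), here is a minimal fallback sketch that sets the same values from inside Python at runtime. The paths are the assumed install locations from the steps above.

import glob
import os
import sys

# Assumed install locations from the steps above; adjust them to your machine.
os.environ.setdefault("HADOOP_HOME", r"D:\hadoop2.7")
os.environ.setdefault("SPARK_HOME", r"D:\spark-2.0.0-bin-hadoop2.7")

spark_home = os.environ["SPARK_HOME"]

# Putting Spark's Python sources (and the bundled py4j zip) on sys.path is an
# alternative to copying the pyspark folder into site-packages.
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))

from pyspark import SparkContext  # should now import without errors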
Finally, a test program:
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    # Use "local" unless a master URL is passed on the command line.
    master = "local"
    if len(sys.argv) == 2:
        master = sys.argv[1]

    sc = SparkContext(master, "WordCount")

    # Split each line into words and count how often each word appears.
    lines = sc.parallelize(["pandas", "i like pandas"])
    result = lines.flatMap(lambda x: x.split(" ")).countByValue()
    for key, value in result.items():
        print("%s %i" % (key, value))

    sc.stop()
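To try it, save the script under any name (wordcount.py is used here purely as an example) and run it either with python wordcount.py or with %SPARK_HOME%\bin\spark-submit wordcount.py. If everything is configured correctly it should print each word with its count (pandas 2, i 1, like 1, in some order).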