1. Versions
spark-2.4.4-bin-hadoop2.7.tgz
hadoop-2.7.7.tar.gz
scala-2.11.12.tgz
jdk-8u391-linux-x64.tar.gz
python3.7 (Python 3.8 and above are not compatible with Spark 2.4; Python 3.5+ is required)
pyspark==3.4.2
findspark==2.0.1
nebula-graph==3.4.0
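Assuming the Python packages above are installed with pip into the Python 3.7 environment (the nebula-graph==3.4.0 entry is read here as the NebulaGraph server version rather than a pip package), a minimal install sketch:
# Hypothetical: install the Python-side dependencies; adjust to your environment
pip3 install pyspark==3.4.2 findspark==2.0.1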
2. Installation
Set up passwordless SSH login
ssh-keygen -t rsa
ssh-copy-id root@node03
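To confirm passwordless login works before continuing, a quick check (node03 is the host used throughout this setup):
# Should print the hostname without prompting for a password
ssh root@node03 hostname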
tar zxvf spark-2.4.4-bin-hadoop2.7.tgz -C /usr/local
tar zxvf hadoop-2.7.7.tar.gz -C /usr/local
tar zxvf scala-2.11.12.tgz -C /usr/local
tar zxvf jdk-8u391-linux-x64.tar.gz -C /usr/local
vim /etc/profile
export JAVA_HOME=/usr/local/jdk1.8.0_391
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export SPARK_HOME=/usr/local/spark-2.4.4-bin-hadoop2.7
export SCALA_HOME=/usr/local/scala-2.11.12
export HADOOP_HOME=/usr/local/hadoop-2.7.7
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
export PATH=$PATH:$MAVEN_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
source /etc/profile
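A quick sanity check that the variables took effect in the current shell (versions should match section 1):
java -version            # expect 1.8.0_391
scala -version           # expect 2.11.12
hadoop version           # expect 2.7.7
spark-submit --version   # expect 2.4.4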
3. Configure Hadoop and Spark
3.1 Hadoop (cd /usr/local/hadoop-2.7.7/)
etc/hadoop/slaves
node03
Add to etc/hadoop/hadoop-env.sh:
JAVA_HOME=/usr/local/jdk1.8.0_391
etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node03:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-2.7.7/data</value>
</property>
</configuration>
etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node03:50090</value>
</property>
</configuration>
etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Add the following to start-all.sh, stop-all.sh, start-dfs.sh, stop-dfs.sh, start-yarn.sh, and stop-yarn.sh in the sbin directory:
HDFS_DATANODE_USER=root
YARN_RESOURCEMANAGER_USER=root
HDFS_NAMENODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
YARN_NODEMANAGER_USER=root
3.2 Spark (/usr/local/spark-2.4.4-bin-hadoop2.7)
conf/slaves
node03
Add to conf/spark-env.sh (copied from conf/spark-env.sh.template):
export SPARK_MASTER_HOST=node03
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/
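Optionally, and not part of the original setup, a few other standard spark-env.sh settings can be pinned in the same file; treat the values as placeholders:
# Optional extras (assumed values -- adjust to the machine's resources)
export JAVA_HOME=/usr/local/jdk1.8.0_391
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.7/etc/hadoop
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=4g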
4. Startup
# Format HDFS
hdfs namenode -format
/usr/local/hadoop-2.7.7/sbin/start-dfs.sh
/usr/local/hadoop-2.7.7/sbin/start-yarn.sh
/usr/local/spark-2.4.4-bin-hadoop2.7/sbin/start-all.sh
[root@node03 ~]# jps
29265 DataNode
21826 Master
100405 Jps
25269 NodeManager
29430 SecondaryNameNode
29591 ResourceManager
21898 Worker
29135 NameNode
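Besides jps, the daemons can be spot-checked from the command line and the web UIs (ports listed below):
# HDFS should report one live datanode
hdfs dfsadmin -report
# YARN should list one running NodeManager
yarn node -list
# Web UIs: NameNode http://node03:50070, YARN http://node03:8088, Spark Master http://node03:8080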
Spark ports:
1) 4040: web UI for jobs running in the current spark-shell / application
2) 7077: Spark Master internal communication port (analogous to Hadoop's NameNode RPC port 8020/9000)
3) 8080: Spark Standalone Master web UI (analogous to the YARN job-monitoring web UI on 8088)
4) 18080: Spark History Server (analogous to the Hadoop JobHistoryServer web UI on 19888)
Hadoop ports:
1. HDFS:
NameNode: 8020 (RPC), 50070 (web UI)
DataNode: 50010 (data transfer), 50020 (IPC), 50075 (web UI)
SecondaryNameNode: 50090 (web UI)
2. YARN:
ResourceManager: 8032 (RPC), 8088 (web UI)
NodeManager: 8042 (web UI)
3. MapReduce:
JobHistoryServer: 10020 (RPC), 19888 (web UI)
4. HBase:
HMaster: 16000 (RPC), 16010 (web UI)
RegionServer: 16020 (RPC), 16030 (web UI)
5. ZooKeeper:
2181 (client port)
Daemon roles (MRv1 ports listed for reference):
NameNode: maintains the HDFS filesystem namespace
8021 JobTracker: coordinates MapReduce jobs (MRv1)
DataNode: stores data blocks and serves them to clients
50060 TaskTracker: runs MapReduce tasks on a worker node (MRv1)
Secondary NameNode: periodically merges the HDFS edit log into the fsimage and ships the result back to the NameNode
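A quick way to confirm the key ports are actually listening on node03 (assuming ss is available; netstat works the same way):
ss -lntp | grep -E '9000|50070|8088|8080|7077'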
5. pyspark
Build nebula-spark-connector-3.0.0.jar; after packaging, the jar is in the nebula-spark-connector/nebula-spark-connector/target directory
$ git clone https://github.com/vesoft-inc/nebula-spark-connector.git -b v3.0.0
$ cd nebula-spark-connector/nebula-spark-connector
$ mvn clean package -Dmaven.test.skip=true -Dgpg.skip -Dmaven.javadoc.skip=true
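After the build, verify the jar and copy it to the path the PySpark script below expects (/root, per the script):
# Still inside nebula-spark-connector/nebula-spark-connector
ls target/nebula-spark-connector-3.0.0.jar
cp target/nebula-spark-connector-3.0.0.jar /root/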
Load nebula-spark-connector-3.0.0.jar in Python 3.7, read Nebula vertices and edges, and save them to HDFS:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Start a local Spark session with the Nebula Spark Connector on the classpath
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.jars", "/root/nebula-spark-connector-3.0.0.jar")
    .config("spark.driver.extraClassPath", "/root/nebula-spark-connector-3.0.0.jar")
    .appName("nebula-connector")
    .getOrCreate()
)

# Read vertices tagged "user" from the "player" space
df = (
    spark.read.format("com.vesoft.nebula.connector.NebulaDataSource")
    .option("type", "vertex")
    .option("spaceName", "player")
    .option("label", "user")
    .option("returnCols", "name,age")
    .option("metaAddress", "<nebula-meta-ip>:9559")
    .option("partitionNumber", 60)
    .load()
)

# Vertices with another tag, saved locally as CSV:
# df = (
#     spark.read.format("com.vesoft.nebula.connector.NebulaDataSource")
#     .option("type", "vertex")
#     .option("spaceName", "player")
#     .option("label", "relation")
#     .option("returnCols", "create_date,type,sub_type,values")
#     .option("metaAddress", "<nebula-meta-ip>:9559")
#     .option("partitionNumber", 60)
#     .load()
# )
# df.show(n=2)
# df.write.format("csv").save("./relation", header=True)

# Edges of type "has":
# df = (
#     spark.read.format("com.vesoft.nebula.connector.NebulaDataSource")
#     .option("type", "edge")
#     .option("spaceName", "player")
#     .option("label", "has")
#     .option("returnCols", "sdate,rtype")
#     .option("metaAddress", "<nebula-meta-ip>:9559")
#     .option("partitionNumber", 60)
#     .load()
# )

df.show(n=2)
df.write.format("csv").save("hdfs://node03:9000/frame", header=True)
spark.stop()
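The script above runs Spark in local[*] mode. To run it on the standalone cluster started in section 4 instead, drop the hard-coded .master("local[*]") from the builder and submit it; a sketch (the script name read_nebula.py is hypothetical):
spark-submit \
  --master spark://node03:7077 \
  --jars /root/nebula-spark-connector-3.0.0.jar \
  --driver-class-path /root/nebula-spark-connector-3.0.0.jar \
  read_nebula.py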
# The saved output is about 1.3 MB by default, split across part files; merge the HDFS files into one
hdfs dfs -cat /frame/part-* | hdfs dfs -copyFromLocal - /input/frame.csv
# Remove the temporary directory
hdfs dfs -rm -r /frame
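A final check that the merged file landed where expected:
hdfs dfs -ls /input
hdfs dfs -cat /input/frame.csv | head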