I. Initializing Spark
1. Create a SparkConf, which carries information about your application, such as the application name, cores, and memory, set as key-value pairs (see the sketch after the code below).
For details see http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkConf
2. Build a SparkContext, which tells Spark how to connect to the cluster (local, standalone, YARN, or Mesos).
Note: Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
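Because SparkConf settings are just key-value pairs, additional properties can be set the same way. A minimal sketch; the property values ("2g", "4") are hypothetical examples, not recommendations:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyApp")                    // application name shown in the UI
  .set("spark.executor.memory", "2g")     // key-value pair: executor memory (example value)
  .set("spark.executor.cores", "4")       // key-value pair: cores per executor (example value)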
When submitting to YARN, HADOOP_CONF_DIR must point at the Hadoop configuration directory, e.g. HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Best practice: In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. That way the same code can run on YARN, standalone, or Mesos without any code changes.
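A minimal sketch of what this looks like in code; the class name, jar name, and master values in the comments are hypothetical examples:
import org.apache.spark.{SparkConf, SparkContext}

// No setMaster() here -- the master is supplied by the launcher, e.g.:
//   spark-submit --master yarn --class com.example.MyApp my-app.jar
//   spark-submit --master local[2] --class com.example.MyApp my-app.jar
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)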
II. Building a Spark Application in IDEA
1. Add the spark-core and scala-library dependencies
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
  <encoding>UTF-8</encoding>
  <scala.version>2.11.8</scala.version>
  <spark.version>2.3.1</spark.version>
  <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
</properties>

<repositories>
  <repository>
    <id>cloudera</id>
    <name>cloudera</name>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

<dependencies>
  <!-- Scala standard library -->
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
  <!-- Spark core, built against Scala 2.11 to match scala.version above -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <!-- Hadoop client from the Cloudera repository declared above -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
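These dependencies alone do not compile Scala sources under Maven; a Scala compiler plugin is also needed in the <build> section. A minimal sketch using scala-maven-plugin (the plugin version shown is only an example):
<build>
  <plugins>
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.2</version>
      <executions>
        <execution>
          <goals>
            <!-- compile Scala sources for main and test -->
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>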
2. Create the SparkContext: two lines to start, one line to stop
Programming template
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextApp {

  def main(args: Array[String]): Unit = {
    // Step 1: create the SparkContext
    val sparkConf = new SparkConf().setAppName("SparkContextApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // Step 2: read the input and apply the business logic
    // TODO...

    // Step 3: stop the SparkContext
    sc.stop()
  }
}
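As a concrete example of the business-logic step, here is a minimal sketch that counts the lines of a text file; the input path file:///tmp/input.txt is a hypothetical placeholder:
import org.apache.spark.{SparkConf, SparkContext}

object LineCountApp {

  def main(args: Array[String]): Unit = {
    // Step 1: create the SparkContext
    val sparkConf = new SparkConf().setAppName("LineCountApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // Step 2: read a file and run the business logic (here: count its lines)
    val lines = sc.textFile("file:///tmp/input.txt") // hypothetical input path
    println(s"line count = ${lines.count()}")

    // Step 3: stop the SparkContext
    sc.stop()
  }
}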