- Final output
+---+--------+
|age|    name|
+---+--------+
| 30|zhangsan|
| 31|    lisi|
| 32|  wangwu|
| 32|     sid|
+---+--------+
1. Scala program
Start from SparkSession
SparkSession is the unified entry point that wraps the SQLContext and HiveContext of older versions.
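import spark.implicits._ brings implicit conversions and encoders into scope. They let you turn RDDs and local Scala collections into DataFrames/Datasets (toDF, toDS) and use the $"column" syntax. Going the other way, from a DataFrame to an RDD, needs no implicit: call the .rdd method explicitly.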
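import org.apache.spark.sql.SparkSession

object Hive_Json {
  def main(args: Array[String]): Unit = {
    // Path to the input file (one JSON object per line)
    val path = "C:/java/spark_practise/src/main/resources/input/people.json"
    // Build a local SparkSession that runs with two worker threads
    val spark = SparkSession.builder()
      .appName("SparkSessionTest")
      .master("local[2]")
      .getOrCreate()
    // Not strictly required below; enables toDF/toDS and the $"col" syntax
    import spark.implicits._

    // Read the JSON file into a DataFrame; the schema is inferred
    val people = spark.read.json(path)
    people.show()

    // Register a temporary view so the data can be queried with SQL
    people.createOrReplaceTempView("people")
    spark.sql("select * from people").show()

    spark.stop()
  }
}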
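To illustrate what import spark.implicits._ brings into scope, here is a minimal sketch. The Person case class and the ImplicitsSketch object are hypothetical names introduced only for this example; they are not part of the original program.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ImplicitsSketch {
  // Hypothetical case class mirroring the records in people.json
  case class Person(name: String, age: Long)

  // toDF() on a local Seq is only available after importing spark.implicits._
  def buildPeople(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(Person("zhangsan", 30), Person("lisi", 31)).toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ImplicitsSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val people = buildPeople(spark)
    // The $"age" column syntax also comes from the implicits
    people.filter($"age" > 30).show()

    // DataFrame to RDD is an explicit call and needs no implicit conversion
    val names = people.rdd.map(row => row.getAs[String]("name")).collect()
    println(names.mkString(", "))

    spark.stop()
  }
}
```

The case class is declared at object level rather than inside a method, which is what Spark needs in order to derive an encoder for it.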
2. JSON file
{"name":"zhangsan","age":30}
{"name":"lisi","age":31}
{"name":"wangwu","age":32}
{"name":"sid","age":32}
3. pom.xml settings
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>spark_practise</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.0</spark.version>
    </properties>
    <dependencies>
        <!-- Scala dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Spark core dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark SQL dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Required when using HiveContext / Hive support -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <!-- This plugin compiles Scala code into class files -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <!-- Compile both main and test Scala sources during the Maven build -->
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
- Reading local files with Spark-Shell
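HDFS paths and local paths are distinguished by the URI prefixes hdfs:// and file:// respectively. When a path has no prefix, the default file system is used, which is set by fs.defaultFS in Hadoop's core-site.xml (under etc/hadoop in the Hadoop installation directory).
To read from HDFS, first upload the file with hadoop fs -put <local file> <remote file> (the older hadoop dfs form is deprecated).
Alternatively, change the path to a local path with the file:// prefix to read the local file directly.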
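The prefix rules above can be sketched as follows. This is only an illustration under some assumptions: PathPrefixSketch and readJson are hypothetical names, the HDFS URI in the comment is a placeholder, and the sample file is a temporary file created just for the demo. In Spark-Shell the spark variable already exists, so only the spark.read.json(...) calls would be needed there.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object PathPrefixSketch {
  // Reads JSON from any URI; the prefix decides which file system is used
  def readJson(spark: SparkSession, uri: String): DataFrame =
    spark.read.json(uri)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PathPrefixSketch")
      .master("local[2]")
      .getOrCreate()

    // The file system used for paths that carry no prefix
    println(spark.sparkContext.hadoopConfiguration.get("fs.defaultFS"))

    // Create a small local sample file so the example does not depend on a fixed path
    val tmp = Files.createTempFile("people", ".json")
    Files.write(tmp, """{"name":"zhangsan","age":30}""".getBytes(StandardCharsets.UTF_8))

    // tmp.toUri yields a file:// URI, so the local file system is used explicitly
    readJson(spark, tmp.toUri.toString).show()

    // An HDFS read would look like this (placeholder host, port and path):
    // readJson(spark, "hdfs://<namenode-host>:<port>/<path>/people.json").show()

    spark.stop()
  }
}
```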