Dataset=RDD+schema
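A Dataset is very close to an RDD, except that it also carries a schema, which in many cases is inferred automatically. The simplest schema consists of a single column named value, whose type may be String, Int, and so on.
The following code creates a Dataset: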
scala> import spark.implicits._
import spark.implicits._
scala> val ds = Seq(("bluejoe", 100), ("alex", 200)).toDS
ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
scala> ds.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(_1,StringType,true), StructField(_2,IntegerType,false))
scala> ds.collect
res1: Array[(String, Int)] = Array((bluejoe,100), (alex,200))
This Dataset contains 2 rows, and each row is a Tuple2, for example (bluejoe,100).
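The single-column case can be seen by building a Dataset from plain strings instead of tuples. This is a minimal sketch, assuming Spark is on the classpath and a local session is acceptable; the names words and columnNames and the app name are illustrative only, and inside spark-shell the existing spark session is simply reused by getOrCreate().

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("value-column-sketch")
  .getOrCreate()
import spark.implicits._

// A Dataset of a simple type such as String gets a single column named "value".
val words = Seq("bluejoe", "alex").toDS()
val columnNames = words.schema.fieldNames.toList

// Prints the inferred schema: one string column.
words.printSchema()
```

With tuples the columns are named _1, _2, and so on; with a single simple type there is only the value column.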
The Dataset ds can be queried with SQL-style operations:
scala> ds.select("_1").collect
res4: Array[org.apache.spark.sql.Row] = Array([bluejoe], [alex])
scala> ds.show
+-------+---+
|     _1| _2|
+-------+---+
|bluejoe|100|
|   alex|200|
+-------+---+
scala> ds.select("_1").show
+-------+
|     _1|
+-------+
|bluejoe|
|   alex|
+-------+
scala> ds.select(ds("_1")).show
+-------+
|     _1|
+-------+
|bluejoe|
|   alex|
+-------+
scala> ds.select($"_1").show
+-------+
|     _1|
+-------+
|bluejoe|
|   alex|
+-------+
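ds.select("_1"), ds.select(ds("_1")), and ds.select($"_1") are equivalent.
Is $"_1" magic? Not at all. $ is simply a method that spark.implicits._ adds to StringContext through an implicit class, so $"_1" is an ordinary string-interpolation call: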
implicit class StringToColumn(val sc: StringContext) {
  def $(args: Any*): ColumnName = {
    new ColumnName(sc.s(args: _*))
  }
}
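The same mechanism can be reproduced without Spark. The sketch below is written for Scala 2 and uses a hypothetical ColName case class as a stand-in for Spark's ColumnName; DollarSyntax and StringToColName are likewise made-up names. It only illustrates how a method named $ on StringContext turns $"..." into valid syntax.

```scala
// Hypothetical stand-in for Spark's ColumnName, so the sketch runs without Spark.
final case class ColName(name: String)

object DollarSyntax {
  // A method named `$` on StringContext makes $"..." a valid interpolator.
  implicit class StringToColName(val sc: StringContext) {
    def $(args: Any*): ColName = ColName(sc.s(args: _*))
  }
}

import DollarSyntax._

// The compiler rewrites this to StringContext("_1").$()
val plain = $"_1"

// Interpolated arguments work as well: StringContext("_", "").$(idx)
val idx = 2
val interpolated = $"_$idx"
```

Here plain is ColName("_1") and interpolated is ColName("_2"), which is why the Spark version accepts any string literal as a column name.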
Columns also support arithmetic operations:
scala> ds.select(ds("_2")+10).show
+---------+
|(_2 + 10)|
+---------+
|      110|
|      210|
+---------+
scala> ds.select($"_2"+10).show
+---------+
|(_2 + 10)|
+---------+
|      110|
|      210|
+---------+
Operators such as + and - are ordinary methods defined on Column, the class that ColumnName extends; the details are omitted here.
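The idea behind such operators can be sketched in a few lines of plain Scala. Expr below is a hypothetical miniature, not Spark's Column class: each operator returns a new expression describing the computation rather than computing a value, which is why the column header in the output above reads (_2 + 10).

```scala
// Hypothetical miniature of a column expression; NOT Spark's actual Column class.
final case class Expr(sql: String) {
  // Nested expressions contribute their text; anything else is treated as a literal.
  private def render(x: Any): String = x match {
    case Expr(s) => s
    case literal => literal.toString
  }

  // Each operator builds a new expression instead of evaluating anything.
  def +(other: Any): Expr = Expr(s"($sql + ${render(other)})")
  def -(other: Any): Expr = Expr(s"($sql - ${render(other)})")
}

val plusTen = Expr("_2") + 10
val chained = Expr("_2") + 10 - Expr("_1")

println(plusTen.sql)  // (_2 + 10)
println(chained.sql)  // ((_2 + 10) - _1)
```

Spark's real implementation builds a tree of Catalyst expressions rather than strings; the string form is used here only to keep the sketch short.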
map() can also be used to transform a Dataset:
scala> ds.map(x=>(x._1.toUpperCase, x._2+10)).show
+-------+---+
|     _1| _2|
+-------+---+
|BLUEJOE|110|
|   ALEX|210|
+-------+---+
As the next example shows, map() produces a new schema:
scala> ds.map(x=>(x._1.toUpperCase, x._2+10, true)).show
+-------+---+----+
|     _1| _2|  _3|
+-------+---+----+
|BLUEJOE|110|true|
|   ALEX|210|true|
+-------+---+----+
Besides turning one tuple into another tuple, map() can also produce instances of a case class:
scala> case class Person(name:String,age:Int){};
defined class Person
scala> val ds2=ds.map(x=>Person(x._1.toUpperCase, x._2+10))
ds2: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
scala> ds2.show
+-------+---+
|   name|age|
+-------+---+
|BLUEJOE|110|
|   ALEX|210|
+-------+---+
Note that each row of the new Dataset is now a Person object:
scala> ds2.collect
res36: Array[Person] = Array(Person(BLUEJOE,110), Person(ALEX,210))
Note that not every type of object can be stored in a Dataset:
scala> import org.apache.spark.sql._
import org.apache.spark.sql._
scala> ds.map(x=>Row(x._1.toUpperCase, x._2+10)).show
<console>:32: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
ds.map(x=>Row(x._1.toUpperCase, x._2+10)).show
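The error says an Encoder must be available for the element type. For a class that is neither a primitive nor a case class, one possible workaround is to supply an encoder explicitly, for instance through Encoders.kryo. The sketch below rests on several assumptions: Tag and KryoSketch are hypothetical names, Kryo serialization is taken to be acceptable, and the code is meant to be compiled as a standalone file rather than pasted into spark-shell.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Hypothetical plain class: not a case class, so spark.implicits._ offers no encoder for it.
class Tag(val label: String) extends Serializable

object KryoSketch {
  // Assumption: a local session for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("kryo-encoder-sketch")
    .getOrCreate()
  import spark.implicits._

  // Assumption: falling back to Kryo serialization is acceptable for this type.
  implicit val tagEncoder: Encoder[Tag] = Encoders.kryo[Tag]

  // map() now compiles because an Encoder[Tag] is in implicit scope.
  val tags: Dataset[Tag] = Seq("bluejoe", "alex").toDS().map(s => new Tag(s.toUpperCase))

  // A Kryo-encoded Dataset stores each object as one binary column named "value".
  val tagColumns: List[String] = tags.schema.fieldNames.toList
  val labels: List[String] = tags.collect().map(_.label).toList
}
```

The trade-off is that the objects become opaque binary blobs, so column-based operations such as select("label") are no longer available on the result.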
DataFrame is an alias for Dataset[Row]
A DataFrame is a Dataset organized into named columns.
A Dataset can be converted to a DataFrame:
scala> val df=ds.toDF
df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> df.collect
res33: Array[org.apache.spark.sql.Row] = Array([bluejoe,100], [alex,200])
Note that each row of the DataFrame is indeed a Row. Looking at the source code:
def toDF(): DataFrame = new Dataset[Row](sparkSession, queryExecution, RowEncoder(schema))
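In effect, toDF() uses a RowEncoder to re-encode each tuple as a Row.
The as() function can also perform the conversion to a DataFrame (RowEncoder must first be imported from org.apache.spark.sql.catalyst.encoders):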
scala> ds.as[Row](RowEncoder(ds.schema)).collect
res55: Array[org.apache.spark.sql.Row] = Array([bluejoe,100], [alex,200])
map() on a DataFrame has a pitfall: because a DataFrame is still a Dataset, each row can be mapped to an arbitrary object, even one that is not a Row:
scala> df.map(x=>(x(0).asInstanceOf[String].toLowerCase, x(1).asInstanceOf[Int]-10)).collect
res43: Array[(String, Int)] = Array((bluejoe,90), (alex,190))
Notice that the result of this map() is no longer a DataFrame. To get a DataFrame back, the result has to go through an awkward toDF() call:
scala> df.map(x=>(x(0).asInstanceOf[String].toLowerCase, x(1).asInstanceOf[Int]-10)).toDF.collect
res44: Array[org.apache.spark.sql.Row] = Array([bluejoe,90], [alex,190])
Or specify the Encoder explicitly:
scala> df.map{x:Row=>Row(x(0).asInstanceOf[String].toLowerCase, x(1).asInstanceOf[Int]-10)}(RowEncoder(ds.schema)).collect
res52: Array[org.apache.spark.sql.Row] = Array([bluejoe,90], [alex,190])
Is that awkward? It really is.
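Two ways to sidestep the awkwardness are sketched below, under the assumption that the transformation can be written either as column expressions or on typed tuples. The names viaSelect, viaTyped and rows are illustrative, and the input is upper-cased here so that the lower-casing has a visible effect.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lower

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("stay-dataframe-sketch")
  .getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("BLUEJOE", 100), ("ALEX", 200)).toDF("_1", "_2")

// Option 1: column expressions return a DataFrame directly, so no Encoder or toDF() is needed.
val viaSelect: DataFrame = df.select(lower($"_1").as("_1"), ($"_2" - 10).as("_2"))

// Option 2: view the rows as typed tuples first, so map() needs no asInstanceOf casts.
val viaTyped: DataFrame = df.as[(String, Int)]
  .map { case (name, n) => (name.toLowerCase, n - 10) }
  .toDF()

val rows = viaSelect.collect().map(r => (r.getString(0), r.getInt(1))).toList
```

Option 1 stays entirely within column expressions, while option 2 still ends with toDF() but avoids the casts on Row fields.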