Writing SparkSQL output to the local filesystem, partitioned by column value

Raw data:

[hadoop@hadoop000 data]$ cat infos.txt 
1,ruoze,30
2,jepson,18
3,spark,30
[hadoop@hadoop000 bin]$ ./spark-shell --master local[2] --jars ~/software/mysql-connector-java-5.1.27.jar
scala> case class Info (id: Int, name: String, age: Int)
scala> val info = spark.sparkContext.textFile("file:///home/hadoop/data/infos.txt")
scala> val infoDF = info.map(_.split(",")).map(x=> Info(x(0).toInt,x(1),x(2).toInt)).toDF
scala> infoDF.show()
+---+------+---+                                                                
| id|  name|age|
+---+------+---+
|  1| ruoze| 30|
|  2|jepson| 18|
|  3| spark| 30|
+---+------+---+
scala> infoDF.write.mode("overwrite").format("json").partitionBy("age").save("file:///home/hadoop/data/output")
[hadoop@hadoop000 ~]$ cd ~/data/output
[hadoop@hadoop000 output]$ ll
total 8
drwxrwxr-x. 2 hadoop hadoop 4096 Sep 24 07:29 age=18
drwxrwxr-x. 2 hadoop hadoop 4096 Sep 24 07:29 age=30
-rw-r--r--. 1 hadoop hadoop    0 Sep 24 07:29 _SUCCESS
[hadoop@hadoop000 output]$ cd age=30
[hadoop@hadoop000 age=30]$ ll
total 8
-rw-r--r--. 1 hadoop hadoop 24 Sep 24 07:29 part-00000-986f0ca2-b9bd-4ff7-8b63-9acf21f9e0d4.c000.json
-rw-r--r--. 1 hadoop hadoop 24 Sep 24 07:29 part-00001-986f0ca2-b9bd-4ff7-8b63-9acf21f9e0d4.c000.json
[hadoop@hadoop000 age=30]$ cat part*
{"id":1,"name":"ruoze"}
{"id":3,"name":"spark"}
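Note that the age field is absent from the JSON records: partitionBy moves the partition column out of the file content and encodes it in the directory name (age=30), and Spark restores it from the path when the data is read back. A minimal pure-Scala sketch of that directory-name encoding (partitionValue is a hypothetical helper, not part of Spark's API):

```scala
// Hypothetical helper: recover the (column, value) pair that
// Spark encodes in a partition directory name such as "age=30".
def partitionValue(dirName: String): Option[(String, String)] =
  dirName.split("=", 2) match {
    case Array(key, value) => Some(key -> value) // "age=30" -> ("age", "30")
    case _                 => None               // e.g. "_SUCCESS" has no partition info
  }
```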

Implementation in IDEA:

import org.apache.spark.sql.SparkSession

object SparkSQLApp {

  case class Info(id: Int, name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkSQLApp")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Parse each CSV line into an Info and convert to a DataFrame
    val info = spark.sparkContext.textFile("file:///E:/BigDataSoftware/data/infos.txt")
    val infoDF = info.map(_.split(",")).map(x => Info(x(0).toInt, x(1), x(2).toInt)).toDF()

    // Write as JSON, one directory per distinct age value.
    // timestampFormat must be overridden here, otherwise the write fails
    infoDF.write
      .mode("overwrite")
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .format("json")
      .partitionBy("age")
      .save("file:///E:/BigDataSoftware/data/output")

    spark.stop()
  }
}
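The split/map parsing step in the code above can be sketched without Spark; parseInfo is a hypothetical helper mirroring the lambda passed to map:

```scala
case class Info(id: Int, name: String, age: Int)

// Hypothetical helper: same logic as the map step in the Spark job,
// turning a CSV line like "1,ruoze,30" into an Info record.
def parseInfo(line: String): Info = {
  val fields = line.split(",")
  Info(fields(0).toInt, fields(1), fields(2).toInt)
}
```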

Without .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ"), the write fails with:

Exception in thread "main" java.lang.IllegalArgumentException: Illegal pattern component: XXX
    at org.apache.commons.lang3.time.FastDateFormat.parsePattern(FastDateFormat.java:577)
    at org.apache.commons.lang3.time.FastDateFormat.init(FastDateFormat.java:444)
    at org.apache.commons.lang3.time.FastDateFormat.<init>(FastDateFormat.java:437)
...

Cause:

The default timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX, and the XXX timezone token is rejected by the version of commons-lang3's FastDateFormat on the classpath, as the stack trace shows. The fix is to override timestampFormat when writing the DataFrame out, using ZZ, which still includes the timezone.
