// Load the movies collection from MongoDB, extract the movie ids,
// and cache the RDD because it is reused by several downstream jobs.
val movieRDD = spark
  .read
  .option("uri", mongoConfig.uri)
  .option("collection", MOVIES_COLLECTION_NAME)
  .format("com.mongodb.spark.sql")
  .load()
  .as[Movie]
  .rdd
  .map(_.mid)
  .cache()
// Load the ratings collection and keep (uid, mid, score) triples,
// again cached so reuse does not trigger another read from MongoDB.
val ratingRDD = spark
  .read
  .option("uri", mongoConfig.uri)
  .option("collection", RATINGS_COLLECTION_NAME)
  .format("com.mongodb.spark.sql")
  .load()
  .as[MovieRating]
  .rdd
  .map(rating => (rating.uid, rating.mid, rating.score))
  .cache()
Note the `cache()` call added at the end of each pipeline: without it, every action that reuses these RDDs would recompute the whole lineage and pull the data from MongoDB again, which is far too expensive.
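`cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`, which drops partitions that do not fit in memory and recomputes them on access. A sketch of an alternative, reusing the `ratingRDD` pipeline above (the `mongoConfig`, collection name, and `MovieRating` case class are assumed from the surrounding code): spilling to disk instead of recomputing, and releasing the cached blocks once the jobs that need them have finished.

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK spills partitions that do not fit in memory to
// local disk, instead of recomputing them (i.e. re-reading MongoDB).
val ratingRDD = spark
  .read
  .option("uri", mongoConfig.uri)
  .option("collection", RATINGS_COLLECTION_NAME)
  .format("com.mongodb.spark.sql")
  .load()
  .as[MovieRating]
  .rdd
  .map(rating => (rating.uid, rating.mid, rating.score))
  .persist(StorageLevel.MEMORY_AND_DISK)

// ... run all jobs that reuse ratingRDD ...

// Release the cached blocks once no further job needs the RDD.
ratingRDD.unpersist()
```

Whether `MEMORY_ONLY` or `MEMORY_AND_DISK` is faster depends on how costly recomputation is relative to disk I/O; for a pipeline whose lineage starts with a remote MongoDB read, spilling to local disk is usually the cheaper option.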