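At the time of writing, the latest Spark release is 2.3.0, which ships an updated API for connecting Spark Streaming to Kafka. The new API is still marked as experimental and may change before it becomes final. This article shows how to use the 2.3.0 API. As the official documentation puts it: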
This version of the integration is marked as experimental, so the API is potentially subject to change.
pom.xml configuration
Add the following dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
</dependencies>
Code
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, TaskContext}

object SparkStreamingNewAPIExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkStreamingNewAPIExample")
    // One batch every 10 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // Settings passed through to the underlying Kafka consumer
    val kafkaParams = scala.collection.Map[String, Object](
      "bootstrap.servers" -> "hostA:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "testGroup",
      "auto.offset.reset" -> "latest",
      "partition.assignment.strategy" -> "org.apache.kafka.clients.consumer.RangeAssignor",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )

    val topics = Array("topic1", "topic2")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.foreachRDD { rdd =>
      // Offset ranges must be read from the RDD that createDirectStream produced
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { item =>
        // Each RDD partition corresponds to one Kafka topic partition
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        println(s"The record from topic [${o.topic}] is in partition ${o.partition} which offset from ${o.fromOffset} to ${o.untilOffset}")
        println(s"The record content is ${item.toList.mkString}")
      }
      rdd.count()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
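Analysis
The code above has Spark Streaming consume topic1 and topic2 once every 10 seconds and print information about each RDD to standard output.
Note that KafkaUtils.createDirectStream differs considerably from its Spark 1.6.x counterpart, in both its parameters and its return value. The return value has changed the most: the elements of the returned RDDs are no longer key-value pairs but the richer ConsumerRecord[K, V] type.
For example, log output like the following shows exactly which Kafka topic, partition, and offsets the data currently being processed by Spark came from.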
The record is in partition 0 which offset from 23 to 25
The record content is ConsumerRecord(topic = topic1, partition = 0, offset = 23, CreateTime = 1487209064531, checksum = 2357653885, serialized key size = -1, serialized value size = 6, key = null, value = aaaaaa)ConsumerRecord(topic = topic1, partition = 0, offset = 24, CreateTime = 1487209065989, checksum = 2696444472, serialized key size = -1, serialized value size = 8, key = null, value = bbbbbbbb)
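The fields visible in the log output above can also be read one by one from each ConsumerRecord. The following is a minimal sketch, not part of the original example: the object name ConsumerRecordFieldsExample and the helper describe are hypothetical, and the record is built by hand (using the values from the log above) so that the snippet runs without a Kafka broker. It assumes kafka-clients is on the classpath, which the spark-streaming-kafka-0-10 dependency should bring in.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord

object ConsumerRecordFieldsExample {
  // Hypothetical helper: formats the main fields of a single record.
  def describe(record: ConsumerRecord[String, String]): String =
    s"topic=${record.topic} partition=${record.partition} " +
      s"offset=${record.offset} key=${record.key} value=${record.value}"

  def main(args: Array[String]): Unit = {
    // Hand-built record mirroring the first entry of the log output above
    val record = new ConsumerRecord[String, String]("topic1", 0, 23L, null, "aaaaaa")
    println(describe(record))
    // prints: topic=topic1 partition=0 offset=23 key=null value=aaaaaa
  }
}
```

Inside the streaming job the same accessors apply to every element of the stream, for instance stream.map(record => (record.key, record.value)) to get back to key-value pairs.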
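Parameter notes
In the code above, enable.auto.commit is set to true. This means offsets are committed automatically once data has been consumed, so if the Spark Streaming program stops for some reason and is later restarted, it does not re-consume data it has already consumed. This raises a problem: from the application's point of view, data that has been consumed may not yet have gone through the business logic, so it is not "fully consumed" in any meaningful sense. With enable.auto.commit set to false, it is up to the application to decide when data counts as consumed. Offsets then have to be committed manually, which only requires adding stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) before rdd.count().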
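Put together, the manual-commit variant might look like the sketch below. It reuses the broker address, group id, and topic names from the example above. The object name ManualCommitExample and the println that stands in for the business logic are assumptions for illustration. The key points are that enable.auto.commit is false and that commitAsync is called only after the processing for the batch has finished.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object ManualCommitExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("ManualCommitExample")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "hostA:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "testGroup",
      "auto.offset.reset" -> "latest",
      // Auto commit is off: the application decides when a batch is done
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Array("topic1", "topic2"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      // Capture the offsets before doing anything else with the RDD
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Placeholder for the real business processing (assumption)
      rdd.foreachPartition { records =>
        records.foreach(record => println(record.value))
      }

      // Commit only after the processing above has completed
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the commit happens after processing, a failure between the two steps can lead to the same batch being processed again on restart, so the business logic should tolerate repeated records.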