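Background
Kafka records, in real time, the data collected by the Flume collection tool or by the real-time interfaces of business systems, and acts as a message buffer that gives the downstream real-time computing framework a reliable data source. Since version 1.3, Spark supports two mechanisms for integrating with Kafka (Receiver-based Approach and Direct Approach); for details, see the official documentation linked at the end of this article. HBase is used for data storage.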
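Implementation approach
Implement a Kafka message producer simulator
Spark-Streaming reads the data from Kafka in real time using the Direct Approach
Spark-Streaming runs the business computation on the data and stores the result in HBase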
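The business computation in the last step is a per-user sum of clicks within each batch. A minimal sketch of that aggregation on plain Scala collections, with no Spark involved; the names ClickEvent and ClickAggregation are made up for illustration, and only the field names follow the simulator's JSON keys:

```scala
// Hypothetical stand-in for one parsed Kafka message (not a class from the article's code).
final case class ClickEvent(uid: String, osType: String, clickCount: Int)

object ClickAggregation {
  // Sum the click counts per user within one batch. This mirrors what
  // map(...) followed by reduceByKey(_ + _) computes for a single batch.
  def clicksPerUser(batch: Seq[ClickEvent]): Map[String, Int] =
    batch
      .map(e => (e.uid, e.clickCount))
      .groupBy(_._1)
      .map { case (uid, pairs) => uid -> pairs.map(_._2).sum }
}
```

Each 5-second batch is aggregated on its own, so the result is a per-batch total rather than a running total.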
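Local virtual machine cluster environment
Because my machine has limited resources, the hadoop/zookeeper/kafka clusters all share the same three hosts, named hadoop1, hadoop2, and hadoop3; hbase runs as a single node on hadoop1.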
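Shortcomings
My experience is limited and the code has some design flaws. For example, the logic that saves the spark-streaming results to hbase performs poorly. Suggestions are welcome so that I can correct it promptly.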
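One common way to reduce the write cost is to open the table handle once per partition instead of once per record. The sketch below only simulates that idea with a fake handle that counts how often it is opened; FakeTable and PartitionWrites are hypothetical names, and no real HBase API is called:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a table handle. It calls `opened` once on construction
// so the caller can count how many handles were created.
final class FakeTable(opened: () => Unit) {
  opened()
  private val buffer = ArrayBuffer.empty[(String, Int)]
  def put(row: (String, Int)): Unit = buffer += row
  // Returns the number of buffered rows that would have been flushed.
  def flushAndClose(): Int = buffer.size
}

object PartitionWrites {
  // One handle per record: the slow pattern. Returns the number of handles opened.
  def perRecord(partition: Iterator[(String, Int)]): Int = {
    var opens = 0
    partition.foreach { rec =>
      val table = new FakeTable(() => opens += 1)
      table.put(rec)
      table.flushAndClose()
    }
    opens
  }

  // One handle per partition: rows are buffered and flushed once at the end.
  def perPartition(partition: Iterator[(String, Int)]): Int = {
    var opens = 0
    val table = new FakeTable(() => opens += 1)
    try partition.foreach(rec => table.put(rec))
    finally table.flushAndClose()
    opens
  }
}
```

With a real table handle the same shape applies: create it at the top of foreachPartition, write every record through it, then flush and close it once.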
Code implementation
Kafka message simulator
package clickstream

import java.util.{Properties, Random, UUID}

import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.codehaus.jettison.json.JSONObject

/**
 * Created by 郭飞 on 2016/5/31.
 */
object KafkaMessageGenerator {

  private val random = new Random()
  private var pointer = -1
  private val os_type = Array("Android", "IPhone OS", "None", "Windows Phone")

  // Random click count in [0, 10); an Int, because the consumer reads it with getInt
  def click(): Int = {
    random.nextInt(10)
  }

  // Cycle through the OS types in order, wrapping around at the end
  def getOsType(): String = {
    pointer = (pointer + 1) % os_type.length
    os_type(pointer)
  }

  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    // Kafka broker addresses on the local VM cluster
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while (true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", UUID.randomUUID().toString)               // randomly generated user ID
        .put("event_time", System.currentTimeMillis.toString) // time the event occurred
        .put("os_type", getOsType())                          // device type
        .put("click_count", click())                          // number of clicks

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}
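The round-robin device-type selection in the simulator could also be written without a mutable index. A minimal sketch using only the Scala standard library; the object name OsTypeCycle is made up for illustration:

```scala
object OsTypeCycle {
  val osTypes: Vector[String] = Vector("Android", "IPhone OS", "None", "Windows Phone")

  // Endless iterator that walks the list in order and starts over at the end.
  def cycle: Iterator[String] = Iterator.continually(osTypes).flatMap(_.iterator)
}
```

A producer loop would create one iterator up front and call next() for every event.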
Spark-Streaming main class
package clickstream

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by 郭飞 on 2016/5/31.
 */
object PageViewStream {
  def main(args: Array[String]): Unit = {
    var masterUrl = "local[2]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("PageViewStream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    // The topic must match the one the simulator writes to
    val topics = Set("user_events")
    // Kafka broker addresses on the local VM cluster
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "serializer.class" -> "kafka.serializer.StringEncoder")

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    // Each record is a (key, value) pair; the value holds the JSON event
    val events = kafkaStream.map(line => JSONObject.fromObject(line._2))

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)

    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          // HBase configuration
          val tableName = "PageViewStream"
          val hbaseConf = HBaseConfiguration.create()
          // The quorum takes host names only; the ZooKeeper port is set on the next line
          hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
          hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
          hbaseConf.set("hbase.defaults.for.version.skip", "true")
          // user ID
          val uid = pair._1
          // click count
          val click = pair._2
          // assemble the row
          val put = new Put(Bytes.toBytes(uid))
          put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
          val StatTable = new HTable(hbaseConf, TableName.valueOf(tableName))
          StatTable.setAutoFlush(false, false)
          // client-side write buffer
          StatTable.setWriteBufferSize(3 * 1024 * 1024)
          StatTable.put(put)
          // commit
          StatTable.flushCommits()
          // release the table handle
          StatTable.close()
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
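The click count is written with Bytes.toBytes on an Int, which stores it as a 4-byte big-endian value, so the ClickStat cell shows up as raw bytes rather than as a readable number. A minimal sketch of the same layout using java.nio.ByteBuffer, useful when checking a stored value by hand; ClickStatCodec is a made-up name and is not part of the HBase API:

```scala
import java.nio.ByteBuffer

object ClickStatCodec {
  // Encode an Int as 4 big-endian bytes (ByteBuffer's default byte order).
  def encode(click: Int): Array[Byte] =
    ByteBuffer.allocate(4).putInt(click).array()

  // Decode 4 big-endian bytes back into an Int.
  def decode(bytes: Array[Byte]): Int =
    ByteBuffer.wrap(bytes).getInt
}
```

In real client code, Bytes.toInt from the HBase utility class does the decoding.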
Maven POM file
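<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.guofei.spark</groupId>
  <artifactId>RiskControl</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>RiskControl</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency><groupId>org.apache.spark</groupId><artifactId>spark-core_2.10</artifactId><version>1.3.0</version></dependency>
    <dependency><groupId>org.apache.spark</groupId><artifactId>spark-streaming_2.10</artifactId><version>1.3.0</version></dependency>
    <dependency><groupId>org.apache.spark</groupId><artifactId>spark-streaming-kafka_2.10</artifactId><version>1.3.0</version></dependency>
    <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase</artifactId><version>0.96.2-hadoop2</version><type>pom</type></dependency>
    <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-server</artifactId><version>0.96.2-hadoop2</version></dependency>
    <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-client</artifactId><version>0.96.2-hadoop2</version></dependency>
    <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-common</artifactId><version>0.96.2-hadoop2</version></dependency>
    <dependency><groupId>commons-io</groupId><artifactId>commons-io</artifactId><version>1.3.2</version></dependency>
    <dependency><groupId>commons-logging</groupId><artifactId>commons-logging</artifactId><version>1.1.3</version></dependency>
    <dependency><groupId>log4j</groupId><artifactId>log4j</artifactId><version>1.2.17</version></dependency>
    <dependency><groupId>com.google.protobuf</groupId><artifactId>protobuf-java</artifactId><version>2.5.0</version></dependency>
    <dependency><groupId>io.netty</groupId><artifactId>netty</artifactId><version>3.6.6.Final</version></dependency>
    <dependency><groupId>org.apache.hbase</groupId><artifactId>hbase-protocol</artifactId><version>0.96.2-hadoop2</version></dependency>
    <dependency><groupId>org.apache.zookeeper</groupId><artifactId>zookeeper</artifactId><version>3.4.5</version></dependency>
    <dependency><groupId>org.cloudera.htrace</groupId><artifactId>htrace-core</artifactId><version>2.01</version></dependency>
    <dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-mapper-asl</artifactId><version>1.9.13</version></dependency>
    <dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-core-asl</artifactId><version>1.9.13</version></dependency>
    <dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-jaxrs</artifactId><version>1.9.13</version></dependency>
    <dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-xc</artifactId><version>1.9.13</version></dependency>
    <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-api</artifactId><version>1.6.4</version></dependency>
    <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-log4j12</artifactId><version>1.6.4</version></dependency>
    <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-client</artifactId><version>2.6.4</version></dependency>
    <dependency><groupId>commons-configuration</groupId><artifactId>commons-configuration</artifactId><version>1.6</version></dependency>
    <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-auth</artifactId><version>2.6.4</version></dependency>
    <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-common</artifactId><version>2.6.4</version></dependency>
    <dependency><groupId>net.sf.json-lib</groupId><artifactId>json-lib</artifactId><version>2.4</version><classifier>jdk15</classifier></dependency>
    <dependency><groupId>org.codehaus.jettison</groupId><artifactId>jettison</artifactId><version>1.1</version></dependency>
    <dependency><groupId>redis.clients</groupId><artifactId>jedis</artifactId><version>2.5.2</version></dependency>
    <dependency><groupId>org.apache.commons</groupId><artifactId>commons-pool2</artifactId><version>2.2</version></dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals><goal>compile</goal><goal>testCompile</goal></goals>
            <configuration>
              <args>
                <arg>-make:transitive</arg>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>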
FAQ
Maven reports an error when importing json-lib
Failure to find net.sf.json-lib:json-lib:jar:2.3 in http://repo.maven.apache.org/maven2 was cached in the local repository
Solution:
http://stackoverflow.com/questions/4173214/maven-missing-net-sf-json-lib
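<dependency>
  <groupId>net.sf.json-lib</groupId>
  <artifactId>json-lib</artifactId>
  <version>2.4</version>
  <classifier>jdk15</classifier>
</dependency>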
Running the Spark-Streaming program reports an error
org.apache.spark.SparkException: Task not serializable
userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    partitionOfRecords.foreach(
      // every object referenced by the code in here must be serializable
    )
  })
})
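The rule can be tried out without Spark. Spark ships closures to executors using Java serialization, so a closure that captures a non-serializable object fails, while one that creates the object inside its own body does not. A minimal sketch under that assumption; SerializationCheck is a made-up name, and ByteArrayOutputStream stands in for any non-serializable resource such as a table handle:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationCheck {
  // True if Java serialization accepts the object.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  // The resource is created outside the closure and captured by it.
  def capturedOutside(): Boolean = {
    val out = new ByteArrayOutputStream() // not Serializable
    val f: Int => Int = x => { out.write(x); out.size() }
    isSerializable(f)
  }

  // The resource is created inside the closure, so nothing is captured.
  def createdInside(): Boolean = {
    val f: Int => Int = x => {
      val out = new ByteArrayOutputStream()
      out.write(x)
      out.size()
    }
    isSerializable(f)
  }
}
```

This is why the HBase configuration and table objects in the main class are created inside the foreachPartition body rather than on the driver.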
Maven packaging reports an error: the dependency jar cannot be found
error:not found: object kafka
ERROR import kafka.javaapi.producer.Producer
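Solution: on the local Windows 10 system, the Maven repository directory Users/郭飞/.m2/ contains Chinese characters; use a repository path without Chinese characters.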
References
Official spark-streaming documentation
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Official documentation on integrating spark-streaming with kafka
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Official documentation on integrating spark-streaming with flume
http://spark.apache.org/docs/latest/streaming-flume-integration.html
Official documentation on integrating spark-streaming with custom data sources
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
Official spark-streaming Scala examples
简单之美 blog