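1. Flume
Installing and configuring Flume is covered by plenty of guides online, so I won't repeat it here.
I'll give one link anyway: https://www.jianshu.com/p/82c77166b5a3
Write the Flume configuration file: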
cd /opt/apache-flume-1.8.0-bin
vim conf/flume_kafka_and_hdfs.conf
Fill it with the following content:
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/syslogdatatest.txt
a1.sources.r1.channels = c1 c2
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/flume/flumeCheckpoint
a1.channels.c1.dataDirs = /home/flume/flumeData, /home/flume/flumeDataExt
a1.channels.c1.capacity = 2000000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 2000000
a1.channels.c2.transactionCapacity = 100
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = hdfs://cn01:9000/flume/events/%Y/%m/%d/%H/%M
a1.sinks.k1.hdfs.filePrefix = cmcc
a1.sinks.k1.hdfs.minBlockReplicas = 1
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.topic = test1
a1.sinks.k2.brokerList = 192.168.10.101:9092,192.168.10.102:9092,192.168.10.103:9092
a1.sinks.k2.requiredAcks = 1
a1.sinks.k2.batchSize = 100
a1.sinks.k2.channel = c2
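Then save and exit.
2. Kafka
For installing and configuring Kafka, here is a guide as well: https://www.jianshu.com/p/3cb394ef41c0
Kafka needs no extra code; it only acts as the message broker. It is enough to start Kafka and create the topic (test1 in this article, which must match the topic in the Flume configuration file).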
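Creating the topic could look like the sketch below. The Kafka install directory and the ZooKeeper address are assumptions (the article does not give them), and the function name create_topic is only for illustration. The --zookeeper flag is the form used by older Kafka releases; newer releases take --bootstrap-server instead.

```shell
# Assumed values, not given in the article: adjust to your installation.
KAFKA_HOME="${KAFKA_HOME:-/opt/kafka}"     # hypothetical install directory
ZOOKEEPER="${ZOOKEEPER:-cn01:2181}"        # hypothetical ZooKeeper address
TOPIC="test1"                              # must match a1.sinks.k2.topic

# Create the topic with a single partition and a single replica.
create_topic() {
  "$KAFKA_HOME/bin/kafka-topics.sh" --create \
    --zookeeper "$ZOOKEEPER" \
    --replication-factor 1 \
    --partitions 1 \
    --topic "$TOPIC"
}

# Run once after the brokers are up:
# create_topic
```

The partition and replica counts here are the smallest values that work for a test; a real deployment would likely choose larger ones.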
3. Spark
For setting up a Spark cluster, see https://www.jianshu.com/p/f9a9147176a7; it is fairly simple.
Write the Scala program:
cd /opt/spark-2.2.1-bin-hadoop2.7
mkdir test
cd test
mkdir -p src/main/scala
vim src/main/scala/DirectKafkaWordCount.scala
Put the following code into DirectKafkaWordCount.scala.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(s"""
        |Usage: DirectKafkaWordCount <brokers> <topics>
        |  <brokers> is a list of one or more Kafka brokers
        |  <topics> is a list of one or more kafka topics to consume from
        |
        """.stripMargin)
      System.exit(1)
    }
    //StreamingExamples.setStreamingLogLevels()
    val Array(brokers, topics) = args

    // Create the streaming context with a 2-second batch interval
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create a direct Kafka stream from the given brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

    // Take each message value, split it into words, count them, and print every batch
    val lines = messages.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the computation and block until it is terminated
    ssc.start()
    ssc.awaitTermination()
  }
}
Save and exit, then write a build file that declares the Spark dependencies.
vim build.sbt
Fill it with the following content.
name := "Simple Project With DirectKafkaWordCount"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.2.1"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.2.1"
Save and exit again.
Finally, compile the project with:
sbt package
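Of course, sbt has to be installed first; there are plenty of guides online.
sbt downloads some dependencies, so just wait. If the final output contains success, the build succeeded.
The test directory now has two new subdirectories, and target/scala-2.11 contains a jar file, which is exactly what we need.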
4. Start the services and submit the job
Start Flume first:
cd /opt/apache-flume-1.8.0-bin
bin/flume-ng agent --conf conf/ --conf-file conf/flume_kafka_and_hdfs.conf --name a1 -Dflume.root.logger=INFO,console
Then open another terminal to run the Spark job, using the following commands.
cd /opt/spark-2.2.1-bin-hadoop2.7
spark-submit --jars /home/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar test/target/scala-2.11/simple-project-with-directkafkawordcount_2.11-1.0.jar cn01:9092,cn02:9092,cn03:9092 test1
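Here --jars is followed by the dependency jar: download the build that matches your Spark version and upload it to the server.
Alternatively, replace the --jars option with --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.1, which downloads the dependency online.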
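Written out in full, the --packages variant might look like the sketch below. The package coordinate, jar path, brokers and topic are copied from the commands in this article; the variable names and the function name submit_with_packages are only for illustration.

```shell
# Same job as the --jars command, but spark-submit resolves the Kafka
# integration from a Maven repository instead of a local assembly jar.
KAFKA_PACKAGE="org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.1"
APP_JAR="test/target/scala-2.11/simple-project-with-directkafkawordcount_2.11-1.0.jar"

submit_with_packages() {
  spark-submit --packages "$KAFKA_PACKAGE" "$APP_JAR" \
    cn01:9092,cn02:9092,cn03:9092 test1
}

# Run from /opt/spark-2.2.1-bin-hadoop2.7 (the machine needs internet access):
# submit_with_packages
```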
OK! You should now see the program running normally.
The last step is to write some content into /home/syslogdatatest.txt for the word count.
Open one more terminal.
vim /home/syslogdatatest.txt
# write some lines, for example:
hello flume
hello kafka
hello spark
apache spark
apache kafka
apache flume
Save and exit.
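As an alternative to editing the file in vim, the sample lines can be appended from the shell, which is the kind of change tail -F is meant to follow. A minimal sketch; the function name append_sample_lines is made up for illustration, and the path in the usage comment is the one from the Flume configuration.

```shell
# Hypothetical helper (not part of the original setup): append the sample
# lines to the file that the Flume exec source is tailing.
append_sample_lines() {
  target_file="$1"
  printf '%s\n' \
    'hello flume' \
    'hello kafka' \
    'hello spark' \
    'apache spark' \
    'apache kafka' \
    'apache flume' >> "$target_file"
}

# Usage on the machine running Flume:
# append_sample_lines /home/syslogdatatest.txt
```

Calling the function again appends another six lines, which is a convenient way to trigger further batches.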
If all goes well, the word counts appear right away in the terminal where the Spark job was submitted.
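To know what to expect, the same counts can be computed locally with standard shell tools. This is only a sanity check, assuming the six sample lines are the entire input of one batch; it uses awk rather than Spark, and word_count is a name chosen for illustration.

```shell
# Hypothetical helper: count the words on stdin and print "word count"
# pairs sorted by word. Mirrors the split / map / reduceByKey steps of the job.
word_count() {
  awk '{ for (i = 1; i <= NF; i++) counts[$i]++ }
       END { for (w in counts) print w, counts[w] }' | LC_ALL=C sort
}

printf '%s\n' 'hello flume' 'hello kafka' 'hello spark' \
  'apache spark' 'apache kafka' 'apache flume' | word_count
# apache 3
# flume 2
# hello 3
# kafka 2
# spark 2
```

Spark prints the pairs as tuples such as (hello,3) and in no particular order. If the lines arrive across more than one 2-second batch, the counts are split between those batches.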
More information is available in the web UI.
END