Background
Driven by the company's business requirements, this write-up studies how to collect data with Filebeat and apply it in big-data scenarios.
Part one starts from the introduction of Filebeat and walks through the full configuration for reading data from Nginx and writing it to HDFS via Kafka. The data flow looks roughly like this:
Nginx(log) -> Filebeat -> Kafka -> Flume -> HDFS
Part two will take a closer look at Filebeat filters and will ingest data from MongoDB.
From Filebeat to HDFS
Nginx-based log collection (output to a file)
SSH into a production server that directly generates a large volume of Nginx logs.
Download Filebeat on that machine, extract it into a dedicated directory, and create a symbolic link pointing to the Filebeat directory, roughly as sketched below.
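A minimal sketch of this step, assuming Filebeat 6.x on 64-bit Linux; the version number, download URL, and the /opt install path are illustrative and should be adapted to the actual environment:

# Download and unpack Filebeat (version and target directory are assumptions)
curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-6.5.4-linux-x86_64.tar.gz
tar -xzf filebeat-6.5.4-linux-x86_64.tar.gz -C /opt
# Symlink so that /opt/filebeat always points at the current version
ln -s /opt/filebeat-6.5.4-linux-x86_64 /opt/filebeat
cd /opt/filebeat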
- Edit filebeat.yml. After making the changes, use
sudo ./filebeat test config
to validate the configuration.
- Disable the input of type log by setting enabled: false (so that the Nginx logs become the input); see the sketch after the file output block below
- Disable the Elasticsearch output
- Add a file output
#------------------------------- File output -----------------------------------
output.file:
  # Boolean flag to enable or disable the output module.
  enabled: true
  # Path to the directory where to save the generated files. The option is
  # mandatory.
  # fbout is the directory that will hold the output files
  path: "/var/log/fbout"
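For reference, the other two changes mentioned above look roughly like this in filebeat.yml (a sketch based on the stock 6.x defaults; the example path under paths comes from the default file):

#=========================== Filebeat inputs =============================
filebeat.inputs:
- type: log
  # Disabled so that the Nginx module, not this generic log input, feeds Filebeat
  enabled: false
  paths:
    - /var/log/*.log

#-------------------------- Elasticsearch output ------------------------------
# Commented out so that only the file (and later Kafka) output is active
#output.elasticsearch:
#  hosts: ["localhost:9200"]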
Use
sudo ./filebeat modules enable nginx
to enable the Nginx module as the input. Then go into the modules.d directory and adjust nginx.yml, pointing it at the directories that hold error.log and access.log.
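A sketch of what the adjusted modules.d/nginx.yml might look like; the log paths are placeholders for the actual Nginx log locations on the server:

- module: nginx
  # Access logs
  access:
    enabled: true
    # Adjust to the real access.log location (path is an assumption)
    var.paths: ["/var/log/nginx/access.log*"]
  # Error logs
  error:
    enabled: true
    # Adjust to the real error.log location (path is an assumption)
    var.paths: ["/var/log/nginx/error.log*"]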
Use
sudo ./filebeat setup -e
to initialize Filebeat, then use
sudo ./filebeat -e
to run Filebeat and start collecting logs.
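At this point events should start appearing under the file output directory. A quick way to check (the file output's default file name is the beat name, i.e. filebeat; treat this as an assumption if a custom filename was configured):

# List the generated output files
ls -lh /var/log/fbout
# Follow the events as they are written (JSON, one event per line)
tail -f /var/log/fbout/filebeat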
Change the output to push to Kafka, then write to HDFS
- Create a new topic with 21 partitions; choosing a sensible partition count improves parallel throughput
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 21 --topic EventFromFilebeat
# Show the topic details
kafka-topics --zookeeper localhost:2181 --topic EventFromFilebeat --describe
- Edit filebeat.yml and add the Kafka output
#------------------------------- Kafka output ----------------------------------
output.kafka:
  # Boolean flag to enable or disable the output module.
  enabled: true
  # The list of Kafka broker addresses from where to fetch the cluster metadata.
  # The cluster metadata contain the actual Kafka brokers events are published
  # to.
  hosts: ["bigdata01:9092","bigdata02:9092","bigdata03:9092"]
  # The Kafka topic used for produced events. The setting can be a format string
  # using any event field. To set the topic from document type use `%{[type]}`.
  topic: EventFromFilebeat
- Map the brokers' IPs to their hostnames in /etc/hosts, otherwise Filebeat will fail to resolve them and report errors
{internal IP of big-data server 01} bigdata01
{internal IP of big-data server 02} bigdata02
{internal IP of big-data server 03} bigdata03
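Before wiring up Flume, it can be worth confirming that events actually reach the topic. A quick check with the console consumer (assuming the same CDH/Confluent-style CLI wrappers as the kafka-topics commands above):

# Tail the topic that Filebeat publishes to
kafka-console-consumer --bootstrap-server bigdata01:9092 --topic EventFromFilebeat --from-beginning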
- Use Flume as the pipeline that ships the data to HDFS
Download Flume on the three machines, extract it, and place it in the appropriate directory.
- Create the configuration file (source, sink, channel, ...); the file name is flume-conf-kafka2hdfs.properties
# flume-conf-kafka2hdfs.properties
# ------------------- Data flow definition ----------------------
# Name of the source
flume2HDFS_agent.sources = source_from_kafka
# Name of the channel; naming it after its type is recommended
flume2HDFS_agent.channels = mem_channel
# Name of the sink; naming it after its destination is recommended
flume2HDFS_agent.sinks = hdfs_sink
#auto.commit.enable = true

## kerberos config ##
#flume2HDFS_agent.sinks.hdfs_sink.hdfs.kerberosPrincipal = flume/datanode2.hdfs.alpha.com@OMGHADOOP.COM
#flume2HDFS_agent.sinks.hdfs_sink.hdfs.kerberosKeytab = /root/apache-flume-1.6.0-bin/conf/flume.keytab

#-------- kafkaSource configuration -----------------
# Define the message source type
# For each one of the sources, the type is defined
flume2HDFS_agent.sources.source_from_kafka.type = org.apache.flume.source.kafka.KafkaSource
flume2HDFS_agent.sources.source_from_kafka.channels = mem_channel
flume2HDFS_agent.sources.source_from_kafka.batchSize = 5000
# Kafka cluster addresses
#flume2HDFS_agent.sources.source_from_kafka.zookeeperConnect = 10.129.142.46:2181,10.166.141.46:2181,10.166.141.47:2181/testkafka
# Per the documentation, configuring the Kafka cluster is enough; zookeeperConnect no longer needs to be set separately
flume2HDFS_agent.sources.source_from_kafka.kafka.bootstrap.servers = bigdata01:9092,bigdata02:9092,bigdata03:9092
# Kafka topic to consume from
#flume2HDFS_agent.sources.source_from_kafka.topic = itil_topic_4097
flume2HDFS_agent.sources.source_from_kafka.kafka.topics = EventFromFilebeat2
# Kafka consumer group id
#flume2HDFS_agent.sources.source_from_kafka.groupId = flume4097
flume2HDFS_agent.sources.source_from_kafka.kafka.consumer.group.id = flumetest

#--------- hdfsSink configuration ------------------
# The channel can be defined as follows.
flume2HDFS_agent.sinks.hdfs_sink.type = hdfs
# Name of the channel the sink should use; note the key here is "channel" (singular)
#Specify the channel the sink should use
flume2HDFS_agent.sinks.hdfs_sink.channel = mem_channel
#flume2HDFS_agent.sinks.hdfs_sink.filePrefix = %{host}
# The NameNode RPC address can be found with: hdfs getconf -nnRpcAddresses
flume2HDFS_agent.sinks.hdfs_sink.hdfs.path = hdfs://bigdata01:8022/user/zhaoyan/nginx/%y-%m-%d/%H%M
#File size to trigger roll, in bytes (0: never roll based on file size)
flume2HDFS_agent.sinks.hdfs_sink.hdfs.rollSize = 0
#Number of events written to file before it rolled (0 = never roll based on number of events)
flume2HDFS_agent.sinks.hdfs_sink.hdfs.rollCount = 0
flume2HDFS_agent.sinks.hdfs_sink.hdfs.rollInterval = 3600
flume2HDFS_agent.sinks.hdfs_sink.hdfs.threadsPoolSize = 30
#flume2HDFS_agent.sinks.hdfs_sink.hdfs.codeC = gzip
#flume2HDFS_agent.sinks.hdfs_sink.hdfs.fileType = CompressedStream
flume2HDFS_agent.sinks.hdfs_sink.hdfs.fileType = DataStream
flume2HDFS_agent.sinks.hdfs_sink.hdfs.writeFormat = Text

#------- memoryChannel configuration -------------------------
# Channel type
# Each channel's type is defined.
flume2HDFS_agent.channels.mem_channel.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# Number of events the channel can hold
# In this case, it specifies the capacity of the memory channel
flume2HDFS_agent.channels.mem_channel.capacity = 100000
# Transaction capacity
flume2HDFS_agent.channels.mem_channel.transactionCapacity = 10000
Note: the configuration file above follows this article.
- Run the following command to start the pipeline as a Kafka consumer, then start Filebeat
# Note: raise the maximum heap size on the command line; editing the flume-ng script did not work
flume-ng agent -Xmx1024m -n flume2HDFS_agent -f ../kafka-hdfs/flume-conf-kafka2hdfs.properties
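To confirm the agent is actually consuming, the consumer group configured above (flumetest) can be inspected; the lag should trend toward zero while Filebeat is publishing:

# Show partitions, current offsets, and lag for the Flume consumer group
kafka-consumer-groups --bootstrap-server bigdata01:9092 --describe --group flumetest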
The data that finally lands in HDFS is shown in the figure below.
(figure: contents in HDFS)
Use
hdfs dfs -cat /path/to/file
to view the contents of a file.
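A sketch of browsing the sink output, following the hdfs.path pattern configured above; the date directory and file name are purely illustrative (FlumeData is the HDFS sink's default file prefix):

# List the time-partitioned directories created by the HDFS sink
hdfs dfs -ls /user/zhaoyan/nginx/
# Look inside one time bucket and print a rolled file
hdfs dfs -ls /user/zhaoyan/nginx/19-05-20/1030
hdfs dfs -cat /user/zhaoyan/nginx/19-05-20/1030/FlumeData.1558319400000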