The Three Components (Source/Channel/Sink)
-
Source: the component that receives incoming data into Flume
-
1. Netcat: a source that receives data over a TCP port
# Define the agent's three components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.0.0.2
a1.sources.r1.port = 44444

# Sink
a1.sinks.k1.type = logger

# Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- log4j configuration file log4j.properties (console output + file output)
# Set the root logger level to INFO and add two appenders, A1 and A2
log4j.rootLogger=INFO, A1, A2

# A1 writes to the console (ConsoleAppender)
log4j.appender.A1=org.apache.log4j.ConsoleAppender

# A1 output layout
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

# A2 writes to a file (FileAppender)
log4j.appender.A2=org.apache.log4j.FileAppender

# Log file name
log4j.appender.A2.File=./logs/log.out

# A2 output layout
log4j.appender.A2.layout=org.apache.log4j.PatternLayout
log4j.appender.A2.layout.ConversionPattern=%m%n
Start the agent
flume-ng agent --name a1 --conf-file /root/trainging/flume-1.9.0/conf/example1.conf --conf /root/trainging/flume-1.9.0/conf/
Send data
telnet 192.0.0.2 44444
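Each newline-terminated line typed into the telnet session becomes one Flume event; by default the netcat source acknowledges every received line with OK.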
-
2. Exec: a source that generates events from the standard output of a command
# Define the agent's three components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/data/flume/access.log

# Sink
a1.sinks.k1.type = logger

# Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
log4j configuration file log4j.properties (console output + file output): same configuration as above
Start the agent
flume-ng agent --name a1 --conf-file /root/trainging/flume-1.9.0/conf/example2.conf --conf /root/trainging/flume-1.9.0/conf/
Send data
echo hello >> /root/data/flume/access.log
-
3. Avro: a highly scalable RPC source (the most commonly used)
# Define the agent's three components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.0.0.2
a1.sources.r1.port = 44444

# Sink
a1.sinks.k1.type = logger

# Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
log4j configuration file log4j.properties (console output + file output): same configuration as above
Start the agent
flume-ng agent --name a1 --conf-file /root/trainging/flume-1.9.0/conf/example3.conf --conf /root/trainging/flume-1.9.0/conf/
Sender-side Java code
package pkg01;

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class class1 {
    public static void main(String[] args) throws EventDeliveryException {
        String ip = "192.0.0.2";
        int port = 44444;
        RpcClient client = RpcClientFactory.getDefaultInstance(ip, port); // Avro
        // RpcClient client = RpcClientFactory.getThriftInstance(ip, port); // Thrift
        Event event = EventBuilder.withBody("hello flume!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!",
                Charset.forName("UTF8"));
        client.append(event);
        client.close();
    }
}
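Note that this client only needs the Flume SDK (the flume-ng-sdk artifact and its transitive dependencies) on the classpath; it does not require a full Flume installation on the sending machine.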
-
-
Channel: a buffer sitting between the Source and the Sink that allows them to run at different rates
-
1. MemoryChannel: a channel held in memory; data lives on the JVM heap
- Use the memory channel when a small amount of data loss is tolerable; use the file channel when no loss is acceptable (see the file channel sketch after the memory channel example below)
-
The memory channel supports transaction semantics, as shown below:
# Define the agent's three components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.0.0.2
a1.sources.r1.port = 44444

# Sink
a1.sinks.k1.type = logger

# Channel: memory channel
a1.channels.c1.type = memory
# Holds at most 100,000 events
a1.channels.c1.capacity = 100000
# Each transaction moves at most 100 events
a1.channels.c1.transactionCapacity = 100
# Channel size limit: 500 MB
a1.channels.c1.byteCapacity = 500000000
# 10% of byteCapacity (500 MB * 10%) is reserved for event headers
a1.channels.c1.byteCapacityBufferPercentage = 10

# Bind the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
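For the no-loss case mentioned above, the file channel persists events to disk so that they survive an agent restart, at the cost of lower throughput than the memory channel. A minimal sketch; the checkpoint and data directories are placeholder paths, not from the original notes:

# File channel instead of memory channel
a1.channels.c1.type = file
# Directory where the channel keeps its checkpoint (placeholder path)
a1.channels.c1.checkpointDir = /root/data/flume/channel/checkpoint
# Comma-separated directories where event data is written (placeholder path)
a1.channels.c1.dataDirs = /root/data/flume/channel/data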
-
-
Sink: takes data from the channel and pushes it to the next Flume Agent, or stores it at the configured destination
-
Sink group concept: see the Sink group section below
-
Sink tuning:
-
Sink transaction semantics: a sink takes events from the channel inside a transaction and commits it only after delivery succeeds; on failure the transaction rolls back and the events remain in the channel
- Typical configuration:
# Sink type
a1.sinks.k1.type = hdfs
# HDFS directory
a1.sinks.k1.hdfs.path = /user/hduser/logs/data_%Y-%m-%d
# File prefix
a1.sinks.k1.hdfs.filePrefix = retail
# File suffix
a1.sinks.k1.hdfs.fileSuffix = .txt
# Roll (close) the file after 60 s, 128 MB, or 100 events, whichever comes first
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 128000000
a1.sinks.k1.hdfs.rollCount = 100
# Close the file after 30 s without writes
a1.sinks.k1.hdfs.idleTimeout = 30
# Keep at most 30 files open
a1.sinks.k1.hdfs.maxOpenFiles = 30
# Uncompressed output
# a1.sinks.k1.hdfs.fileType = DataStream
# Compressed output, snappy codec
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = snappy
# Use the local timestamp for the path escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Write into a new bucket (directory) every 10 minutes
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.roundValue = 10
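With round = true, roundUnit = minute and roundValue = 10, the timestamp used by the time-based path escapes is rounded down to the nearest 10 minutes, so, for example, an event arriving at 11:54:34 is written into the 11:50 bucket.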
-
-
Interceptors: work between the Source and the Channel. After the Source receives data, an interceptor drops or transforms events according to user-defined rules; if several interceptors are chained on one path, they run in sequence
-
Host interceptor: the Agent adds the host IP or hostname to the event headers; the header key is configured with hostHeader and defaults to host
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# Keep an existing header value instead of overwriting it
a1.sources.r1.interceptors.i1.preserveExisting = true
# Use the hostname instead of the IP (IP is the default)
a1.sources.r1.interceptors.i1.useIP = false
# Header key to write to
a1.sources.r1.interceptors.i1.hostHeader = hostname
-
Timestamp interceptor (commonly used): puts a timestamp key into the event headers, which makes it easy for the HDFS Sink to bucket events
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
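The header key written is timestamp, which the HDFS Sink's time-based path escapes (such as %Y-%m-%d) read; with this interceptor in place, hdfs.useLocalTimeStamp = true is no longer needed.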
-
Static interceptor: simply inserts a fixed key/value pair into the event headers
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Zone
a1.sources.r1.interceptors.i1.value = NEW_YORK
-
UUID interceptor: gives every event a unique identifier (UUID)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
-
Regex filtering interceptor: selectively keeps or drops events
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = .*info.*
# Keep matching events
a1.sources.r1.interceptors.i1.excludeEvents = false
a1.sources.r1.interceptors.i2.type = regex_filter
a1.sources.r1.interceptors.i2.regex = .*data3.*
# Drop matching events
a1.sources.r1.interceptors.i2.excludeEvents = true
-
Custom interceptor: written in Java. The client-side snippet below attaches custom headers to outgoing events; a sketch of the interceptor implementation itself follows it
// Client side: attach custom headers to outgoing events
String ip = "192.0.0.2";
int port = 44444;
Map<String, String> headers = new HashMap<String, String>();
headers.put("ClientServer", "Client01srv");
List<Event> events = new ArrayList<Event>();
events.add(EventBuilder.withBody("info ", Charset.forName("UTF8"), headers));
RpcClient client = RpcClientFactory.getDefaultInstance(ip, port); // Avro
client.appendBatch(events);
client.close();
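The snippet above only covers the sending side. The interceptor itself must implement org.apache.flume.interceptor.Interceptor. Below is a minimal sketch, assuming a hypothetical class pkg01.MyInterceptor that stamps a fixed header onto every event; the class name and header values are illustrative, not from the original notes:

package pkg01;

import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical custom interceptor: adds a fixed header to every event
public class MyInterceptor implements Interceptor {
    @Override
    public void initialize() {
        // no setup needed for this example
    }

    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        headers.put("ClientServer", "Client01srv");
        return event; // returning null would drop the event
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // no resources to release
    }

    // Flume instantiates the interceptor through this Builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        @Override
        public void configure(Context context) {
            // read interceptor parameters from the agent config here
        }
    }
}

It would then be wired into the agent config like the built-in interceptors:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = pkg01.MyInterceptor$Builder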
-
-
-
Channel selectors: decide which events the Channel processor delivers to which Channel
-
Replicating channel selector: copies every event from the Source into each of the Channels
# Define the agent's three components
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Source
a1.sources.r1.type = avro
a1.sources.r1.bind = 10.0.1.213
a1.sources.r1.port = 44444

# Sinks
a1.sinks.k1.type = logger
a1.sinks.k2.type = logger

# Channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 100

# Bind the components together
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
-
Multiplexing channel selector: routes an event to a Channel based on the value of a given key in the event headers
# Define the agent's three components
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Source
a1.sources.r1.type = avro
a1.sources.r1.bind = 10.0.1.213
a1.sources.r1.port = 44444
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = op
# op = 1: event goes to channel c1
a1.sources.r1.selector.mapping.1 = c1
# op = 2: event goes to channel c2
a1.sources.r1.selector.mapping.2 = c2
# any other value: event goes to channel c2
a1.sources.r1.selector.default = c2

# Sinks
a1.sinks.k1.type = logger
a1.sinks.k2.type = logger

# Channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 100

# Bind the components together
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
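To drive this selector from the Avro client shown in the custom-interceptor example, set the routing header on each event before sending; for instance, headers.put("op", "1") routes the event to channel c1.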
-
-
Sink groups: several Sinks form one group, which can be used for load balancing and failover
-
Load balancing: all paths carry data; event order is not guaranteed
# Sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
#a1.sinkgroups.g1.processor.selector = random
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinks.k1.type = logger
a1.sinks.k2.type = logger
-
Failover: when one path goes down, the others take over according to priority (the higher value wins), preserving data order
# Sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# k2 (priority 10) is active, k1 (priority 5) is the standby
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
#a1.sinks.k1.type = logger
a1.sinks.k2.type = logger
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 44444
-