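Data processing can broadly be divided into three stages: where the data comes from, what business logic is applied to it, and where the results are written.
Flink follows the same pattern.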
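Introduction to SourceFunction
A custom data source in Flink is written by implementing SourceFunction. Built-in implementations include SocketTextStreamFunction, FromElementsFunction, FlinkKafkaConsumer, and others.
SourceFunction defines two methods: run(SourceContext<T> ctx) and cancel().
The body of run implements the data-producing logic, for example reading from Redis or generating simulated data. Examples follow below.
cancel is called when the job is cancelled, from a different thread than the one executing run. It typically sets a flag so that the loop in run exits, or closes connections.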
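The interplay between run and cancel can be sketched without any Flink dependency. The MiniSource trait below is a hypothetical stand-in that only mirrors the two-method shape of SourceFunction; it is not Flink API, and CounterSource and MiniSourceDemo are likewise illustrative names. The point is that run loops on one thread until cancel, called from another thread, flips a flag, which is why the flag should be @volatile.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for SourceFunction: same two-method shape, no Flink dependency.
trait MiniSource[T] {
  def run(emit: T => Unit): Unit
  def cancel(): Unit
}

// Emits 0, 1, 2, ... until cancel() is called.
class CounterSource(intervalMs: Long) extends MiniSource[Long] {
  // @volatile because cancel() runs on a different thread than run()
  @volatile private var running = true

  override def run(emit: Long => Unit): Unit = {
    var n = 0L
    while (running) {
      emit(n)
      n += 1
      Thread.sleep(intervalMs)
    }
  }

  override def cancel(): Unit = {
    running = false
  }
}

object MiniSourceDemo {
  // Runs the source on a worker thread, cancels it after runForMs,
  // and returns everything that was emitted.
  def collectFor(runForMs: Long): List[Long] = {
    val out = ArrayBuffer.empty[Long]
    val source = new CounterSource(intervalMs = 10)
    val worker = new Thread(new Runnable {
      override def run(): Unit = source.run(v => out += v)
    })
    worker.start()
    Thread.sleep(runForMs)
    source.cancel() // called from the main thread, like a job cancellation
    worker.join()   // after join, reading `out` from this thread is safe
    out.toList
  }

  def main(args: Array[String]): Unit = {
    println(collectFor(100))
  }
}
```

Without the cancel call the worker thread would never leave the loop, which is the same situation a Flink source is in if cancel does not stop run.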
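Custom Flink sources
Sources can first be classified by parallelism: single-parallelism sources (parallelism 1) and parallel sources. A source that implements only SourceFunction always runs with parallelism 1, and calling setParallelism() on it with any other value throws an IllegalArgumentException; operators after it can still set their own parallelism. A parallel source uses the job's parallelism by default.
Sources can also be classified by whether they are a RichFunction. The RichFunction interface provides open, close, getRuntimeContext, and setRuntimeContext, which are used to initialize and release resources, access state, cache data, and so on. For sources, the corresponding base classes are RichSourceFunction and RichParallelSourceFunction.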
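As a rough sketch of the RichFunction variant, the source below extends RichParallelSourceFunction and uses open, close, and getRuntimeContext. It assumes a Flink 1.x Scala API on the classpath, where open(Configuration) is the lifecycle hook; the class and object names are made up for illustration.

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala._

// Illustrative name; each parallel instance tags its records with its subtask index.
class RichParallelDemoSource extends RichParallelSourceFunction[String] {
  @volatile private var running = true
  private var subtask: Int = -1

  // Called once per parallel instance before run(); a natural place to open connections.
  override def open(parameters: Configuration): Unit = {
    subtask = getRuntimeContext.getIndexOfThisSubtask
  }

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    var n = 0L
    while (running) {
      ctx.collect(s"subtask-$subtask:$n")
      n += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    running = false
  }

  // Called once when the instance shuts down; release whatever open() acquired.
  override def close(): Unit = {}
}

object RichParallelDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.addSource(new RichParallelDemoSource).setParallelism(2).print()
    env.execute("RichParallelDemo")
  }
}
```

A real source would typically create its client (a Redis connection, for instance) in open and close it in close, rather than in the constructor, because the function object is serialized and shipped to the task managers before it runs.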
Single-parallelism source implementation
SourceFunction
import java.text.SimpleDateFormat
import java.util.Date

import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

class NoParalleSource extends SourceFunction[String] {
  // @volatile: cancel() is called from a different thread than run()
  @volatile private var isrunning = true

  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    // Emit "<thread id>_<HH:mm:ss>" once per second until cancelled
    while (isrunning) {
      val time = new SimpleDateFormat("HH:mm:ss").format(new Date())
      sourceContext.collect(Thread.currentThread().getId + "_" + time)
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isrunning = false
  }
}

object NoParalleSourceTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // A plain SourceFunction must keep parallelism 1; uncommenting the call below fails
    val stream = env.addSource(new NoParalleSource()) /*.setParallelism(2)*/
    // Concatenate everything received in each 5-second window
    val reduce = stream.timeWindowAll(Time.seconds(5)).reduce(_ + "~" + _)
    reduce.print()
    env.execute(NoParalleSourceTest.getClass.getName)
  }
}
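The commented-out setParallelism(2) in the listing above can be probed directly. The following is a minimal sketch, assuming a Flink 1.x Scala API on the classpath; OneShotSource and NonParallelCheck are illustrative names. It checks whether Flink rejects a parallelism other than 1 on a plain SourceFunction when the job graph is being built, and shows that a downstream operator can still be given its own parallelism.

```scala
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._

// Minimal non-parallel source, defined here only to keep the sketch self-contained.
class OneShotSource extends SourceFunction[String] {
  override def run(ctx: SourceFunction.SourceContext[String]): Unit = ctx.collect("hello")
  override def cancel(): Unit = ()
}

object NonParallelCheck {
  // Returns true if setting parallelism p on the plain source is rejected
  // with an IllegalArgumentException while the job graph is being built.
  def rejectsParallelism(p: Int): Boolean = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    try {
      env.addSource(new OneShotSource).setParallelism(p)
      false
    } catch {
      case _: IllegalArgumentException => true
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"parallelism 1 rejected: ${rejectsParallelism(1)}")
    println(s"parallelism 2 rejected: ${rejectsParallelism(2)}")

    // The source stays at parallelism 1, but the map after it may run in parallel.
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.addSource(new OneShotSource)
      .map(_.toUpperCase).setParallelism(4)
      .print()
    env.execute("NonParallelCheck")
  }
}
```

The check happens at graph construction time, so no job needs to be submitted to see the rejection.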
Parallel source implementation
ParallelSourceFunction
import java.text.SimpleDateFormat
import java.util.Date

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

/**
 * If no parallelism is set on the source, the job's default parallelism is used.
 */
class ParalleSource extends ParallelSourceFunction[String] {
  // @volatile: cancel() is called from a different thread than run()
  @volatile private var isrunning = true

  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    // Each parallel instance emits "<thread id>_<HH:mm:ss>" once per second
    while (isrunning) {
      val time = new SimpleDateFormat("HH:mm:ss").format(new Date())
      sourceContext.collect(Thread.currentThread().getId + "_" + time)
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isrunning = false
  }
}

object ParalleSourceTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Run four parallel instances of the source
    val stream = env.addSource(new ParalleSource()).setParallelism(4)
    // Concatenate everything received in each 5-second window
    val reduce = stream.timeWindowAll(Time.seconds(5)).reduce(_ + "~" + _)
    reduce.print()
    env.execute(ParalleSourceTest.getClass.getName)
  }
}