Dynamic table在flink中是一个逻辑概念,也正是Dynamic table可以让流数据支持table API和SQL。下图是stream和dynamic table转换关系,先将stream转化为dynamic table,再基于dynamic table进行SQL操作生成新的dynamic table,最后将dynamic table转化为stream。本文将重点介绍dynamic table转化为stream的3种方式。
注:本文所涉及的代码全部基于flink 1.9.0以及flink-table-planner-blink_2.11
Append-only stream
官方定义如下:
A dynamic table that is only modified by INSERT changes can be converted into a stream by emitting the inserted rows.
也就是说如果dynamic table只包含了插入新数据的操作那么就可以转化为append-only stream,所有数据追加到stream里面。
样例代码:
public class AppendOnlyExample {
public static void main(String[] args) throws Exception {
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
env.setParallelism(1);
DataStream<Tuple2<String, String>> data = env.fromElements(
new Tuple2<>("Mary", "./home"),
new Tuple2<>("Bob", "./cart"),
new Tuple2<>("Mary", "./prod?id=1"),
new Tuple2<>("Liz", "./home"),
new Tuple2<>("Bob", "./prod?id=3")
);
Table clicksTable = tEnv.fromDataStream(data, "user,url");
tEnv.registerTable("clicks", clicksTable);
Table rTable = tEnv.sqlQuery("select user,url from clicks where user='Mary'");
DataStream ds = tEnv.toAppendStream(rTable, TypeInformation.of(new TypeHint<Tuple2<String, String>>(){}));
ds.print();
env.execute();
}
}
运行结果:
(Mary,./prod?id=8)
(Mary,./prod?id=6)
Retract stream
官方定义
A retract stream is a stream with two types of messages, add messages and retract messages. A dynamic table is converted into an retract stream by encoding an INSERT change as add message, a DELETE change as retract message, and an UPDATE change as a retract message for the updated (previous) row and an add message for the updating (new) row. The following figure visualizes the conversion of a dynamic table into a retract stream.
样例代码:
public class RetractExample {
public static void main(String[] args) throws Exception {
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
env.setParallelism(1);
DataStream<Tuple2<String, String>> data = env.fromElements(
new Tuple2<>("Mary", "./home"),
new Tuple2<>("Bob", "./cart"),
new Tuple2<>("Mary", "./prod?id=1"),
new Tuple2<>("Liz", "./home"),
new Tuple2<>("Bob", "./prod?id=3")
);
Table clicksTable = tEnv.fromDataStream(data, "user,url");
tEnv.registerTable("clicks", clicksTable);
Table rTable = tEnv.sqlQuery("SELECT user,COUNT(url) FROM clicks GROUP BY user");
DataStream ds = tEnv.toRetractStream(rTable, TypeInformation.of(new TypeHint<Tuple2<String, Long>>(){}));
ds.print();
env.execute();
}
}
运行结果:
(true,(Mary,1))
(true,(Bob,1))
(false,(Mary,1))
(true,(Mary,2))
(true,(Liz,1))
(false,(Bob,1))
(true,(Bob,2))
对于toRetractStream
的返回值是一个Tuple2<Boolean, T>
类型,第一个元素为true
表示这条数据为要插入的新数据,false
表示需要删除的一条旧数据。也就是说可以把更新表中某条数据分解为先删除一条旧数据再插入一条新数据。
Upsert stream
官方定义:
An upsert stream is a stream with two types of messages, upsert messages and delete messages. A dynamic table that is converted into an upsert stream requires a (possibly composite) unique key. A dynamic table with unique key is converted into a stream by encoding INSERT and UPDATE changes as upsert messages and DELETE changes as delete messages. The stream consuming operator needs to be aware of the unique key attribute in order to apply messages correctly. The main difference to a retract stream is that UPDATE changes are encoded with a single message and hence more efficient. The following figure visualizes the conversion of a dynamic table into an upsert stream.
这个模式和以上两个模式不同的地方在于要实现将Dynamic table转化成Upsert stream需要实现一个UpsertStreamTableSink,而不能直接使用
StreamTableEnvironment
进行转换。
样例代码:
public class UpsertExample {
public static void main(String[] args) throws Exception {
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
env.setParallelism(1);
DataStream<Tuple2<String, String>> data = env.fromElements(
new Tuple2<>("Mary", "./home"),
new Tuple2<>("Bob", "./cart"),
new Tuple2<>("Mary", "./prod?id=1"),
new Tuple2<>("Liz", "./home"),
new Tuple2<>("Liz", "./prod?id=3"),
new Tuple2<>("Mary", "./prod?id=7")
);
Table clicksTable = tEnv.fromDataStream(data, "user,url");
tEnv.registerTable("clicks", clicksTable);
Table rTable = tEnv.sqlQuery("SELECT user,COUNT(url) FROM clicks GROUP BY user");
tEnv.registerTableSink("MemoryUpsertSink", new MemoryUpsertSink(rTable.getSchema()));
rTable.insertInto("MemoryUpsertSink");
env.execute();
}
private static class MemoryUpsertSink implements UpsertStreamTableSink<Tuple2<String, Long>> {
private TableSchema schema;
private String[] keyFields;
private boolean isAppendOnly;
private String[] fieldNames;
private TypeInformation<?>[] fieldTypes;
public MemoryUpsertSink() {
}
public MemoryUpsertSink(TableSchema schema) {
this.schema = schema;
}
@Override
public void setKeyFields(String[] keys) {
this.keyFields = keys;
}
@Override
public void setIsAppendOnly(Boolean isAppendOnly) {
this.isAppendOnly = isAppendOnly;
}
@Override
public TypeInformation<Tuple2<String, Long>> getRecordType() {
return TypeInformation.of(new TypeHint<Tuple2<String, Long>>(){});
}
@Override
public void emitDataStream(DataStream<Tuple2<Boolean, Tuple2<String, Long>>> dataStream) {
consumeDataStream(dataStream);
}
@Override
public DataStreamSink<?> consumeDataStream(DataStream<Tuple2<Boolean, Tuple2<String, Long>>> dataStream) {
return dataStream.addSink(new DataSink()).setParallelism(1);
}
@Override
public TableSink<Tuple2<Boolean, Tuple2<String, Long>>> configure(String[] fieldNames, TypeInformation<?>[] fieldTypes) {
MemoryUpsertSink memoryUpsertSink = new MemoryUpsertSink();
memoryUpsertSink.setFieldNames(fieldNames);
memoryUpsertSink.setFieldTypes(fieldTypes);
memoryUpsertSink.setKeyFields(keyFields);
memoryUpsertSink.setIsAppendOnly(isAppendOnly);
return memoryUpsertSink;
}
@Override
public String[] getFieldNames() {
return schema.getFieldNames();
}
public void setFieldNames(String[] fieldNames) {
this.fieldNames = fieldNames;
}
@Override
public TypeInformation<?>[] getFieldTypes() {
return schema.getFieldTypes();
}
public void setFieldTypes(TypeInformation<?>[] fieldTypes) {
this.fieldTypes = fieldTypes;
}
}
private static class DataSink extends RichSinkFunction<Tuple2<Boolean, Tuple2<String, Long>>> {
public DataSink() {
}
@Override
public void invoke(Tuple2<Boolean, Tuple2<String, Long>> value, Context context) throws Exception {
System.out.println("send message:" + value);
}
}
}
运行结果:
send message:(true,(Mary,1))
send message:(true,(Bob,1))
send message:(true,(Mary,2))
send message:(true,(Liz,1))
send message:(true,(Liz,2))
send message:(true,(Mary,3))
这种模式的返回值也是一个Tuple2<Boolean, T>
类型,和Retract的区别在于更新表中的某条数据并不会返回一条删除旧数据一条插入新数据,而是看上去真的是更新了某条数据。
Upsert stream番外篇
以上所讲的内容全部都是来自于flink官网,只是附上了与其对应的样例,可以让读者更直观的感受每种模式的输出效果。网上也有很多对官方文档的翻译,但是几乎没有文章或者样例说明在使用UpsertStreamTableSink
的时候什么情况下返回值Tuple2<Boolean, T>
第一个元素是false
?话不多说直接上样例,只要把上面的例子中的sql改为如下
String sql = "SELECT user, cnt " +
"FROM (" +
"SELECT user,COUNT(url) as cnt FROM clicks GROUP BY user" +
")" +
"ORDER BY cnt LIMIT 2";
返回结果:
send message:(true,(Mary,1))
send message:(true,(Bob,1))
send message:(false,(Mary,1))
send message:(true,(Bob,1))
send message:(true,(Mary,2))
send message:(true,(Liz,1))
send message:(false,(Liz,1))
send message:(true,(Mary,2))
send message:(false,(Mary,2))
send message:(true,(Liz,2))
具体的原理可以查看源码,org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecRank
和org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecSortLimit
解析sql的时候通过下面的方法得到不同的strategy,由此影响是否需要删除原有数据的行为。
def getStrategy(forceRecompute: Boolean = false): RankProcessStrategy = {
if (strategy == null || forceRecompute) {
strategy = RankProcessStrategy.analyzeRankProcessStrategy(
inputRel, ImmutableBitSet.of(), sortCollation, cluster.getMetadataQuery)
}
strategy
}
知道什么时候会产生false属性的数据,对于理解JDBCUpsertTableSink
和HBaseUpsertTableSink
的使用会有很大的帮助。
总结
在阅读flink文档的时候,总是想通过代码实战体会其中的原理,很多文章是将官网的描述翻译成了中文,而本文是将官网的描述“翻译”成了代码,希望可以帮助读者理解官网原文中的含义。