Flink dynamic table转成stream实战

Dynamic table在flink中是一个逻辑概念,也正是Dynamic table可以让流数据支持table API和SQL。下图是stream和dynamic table转换关系,先将stream转化为dynamic table,再基于dynamic table进行SQL操作生成新的dynamic table,最后将dynamic table转化为stream。本文将重点介绍dynamic table转化为stream的3种方式。

stream-query-stream

注:本文所涉及的代码全部基于flink 1.9.0以及flink-table-planner-blink_2.11

Append-only stream

官方定义如下:

A dynamic table that is only modified by INSERT changes can be converted into a stream by emitting the inserted rows.

也就是说如果dynamic table只包含了插入新数据的操作那么就可以转化为append-only stream,所有数据追加到stream里面。

样例代码:

public class AppendOnlyExample {
    public static void main(String[] args) throws Exception {
        EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
        env.setParallelism(1);

        DataStream<Tuple2<String, String>> data = env.fromElements(
                new Tuple2<>("Mary", "./home"),
                new Tuple2<>("Bob", "./cart"),
                new Tuple2<>("Mary", "./prod?id=1"),
                new Tuple2<>("Liz", "./home"),
                new Tuple2<>("Bob", "./prod?id=3")
        );

        Table clicksTable = tEnv.fromDataStream(data, "user,url");

        tEnv.registerTable("clicks", clicksTable);
        Table rTable = tEnv.sqlQuery("select user,url from clicks where user='Mary'");

        DataStream ds = tEnv.toAppendStream(rTable, TypeInformation.of(new TypeHint<Tuple2<String, String>>(){}));
        ds.print();
        env.execute();
    }
}

运行结果:

(Mary,./prod?id=8)
(Mary,./prod?id=6)

Retract stream

官方定义

A retract stream is a stream with two types of messages, add messages and retract messages. A dynamic table is converted into an retract stream by encoding an INSERT change as add message, a DELETE change as retract message, and an UPDATE change as a retract message for the updated (previous) row and an add message for the updating (new) row. The following figure visualizes the conversion of a dynamic table into a retract stream.

undo-redo-mode

样例代码:

public class RetractExample {
    public static void main(String[] args) throws Exception {
        EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
        env.setParallelism(1);

        DataStream<Tuple2<String, String>> data = env.fromElements(
                new Tuple2<>("Mary", "./home"),
                new Tuple2<>("Bob", "./cart"),
                new Tuple2<>("Mary", "./prod?id=1"),
                new Tuple2<>("Liz", "./home"),
                new Tuple2<>("Bob", "./prod?id=3")
        );

        Table clicksTable = tEnv.fromDataStream(data, "user,url");

        tEnv.registerTable("clicks", clicksTable);
        Table rTable = tEnv.sqlQuery("SELECT user,COUNT(url) FROM clicks GROUP BY user");

        DataStream ds = tEnv.toRetractStream(rTable, TypeInformation.of(new TypeHint<Tuple2<String, Long>>(){}));
        ds.print();
        env.execute();
    }
}

运行结果:

(true,(Mary,1))
(true,(Bob,1))
(false,(Mary,1))
(true,(Mary,2))
(true,(Liz,1))
(false,(Bob,1))
(true,(Bob,2))

对于toRetractStream的返回值是一个Tuple2<Boolean, T>类型,第一个元素为true表示这条数据为要插入的新数据,false表示需要删除的一条旧数据。也就是说可以把更新表中某条数据分解为先删除一条旧数据再插入一条新数据。

Upsert stream

官方定义:

An upsert stream is a stream with two types of messages, upsert messages and delete messages. A dynamic table that is converted into an upsert stream requires a (possibly composite) unique key. A dynamic table with unique key is converted into a stream by encoding INSERT and UPDATE changes as upsert messages and DELETE changes as delete messages. The stream consuming operator needs to be aware of the unique key attribute in order to apply messages correctly. The main difference to a retract stream is that UPDATE changes are encoded with a single message and hence more efficient. The following figure visualizes the conversion of a dynamic table into an upsert stream.

redo-mode

这个模式和以上两个模式不同的地方在于要实现将Dynamic table转化成Upsert stream需要实现一个UpsertStreamTableSink,而不能直接使用StreamTableEnvironment进行转换。

样例代码:

public class UpsertExample {
    public static void main(String[] args) throws Exception {
        EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
        env.setParallelism(1);

        DataStream<Tuple2<String, String>> data = env.fromElements(
                new Tuple2<>("Mary", "./home"),
                new Tuple2<>("Bob", "./cart"),
                new Tuple2<>("Mary", "./prod?id=1"),
                new Tuple2<>("Liz", "./home"),
                new Tuple2<>("Liz", "./prod?id=3"),
                new Tuple2<>("Mary", "./prod?id=7")
        );

        Table clicksTable = tEnv.fromDataStream(data, "user,url");

        tEnv.registerTable("clicks", clicksTable);
        Table rTable = tEnv.sqlQuery("SELECT user,COUNT(url) FROM clicks GROUP BY user");
        tEnv.registerTableSink("MemoryUpsertSink", new MemoryUpsertSink(rTable.getSchema()));
        rTable.insertInto("MemoryUpsertSink");

        env.execute();
    }

    private static class MemoryUpsertSink implements UpsertStreamTableSink<Tuple2<String, Long>> {
        private TableSchema schema;
        private String[] keyFields;
        private boolean isAppendOnly;

        private String[] fieldNames;
        private TypeInformation<?>[] fieldTypes;

        public MemoryUpsertSink() {

        }

        public MemoryUpsertSink(TableSchema schema) {
            this.schema = schema;
        }

        @Override
        public void setKeyFields(String[] keys) {
            this.keyFields = keys;
        }

        @Override
        public void setIsAppendOnly(Boolean isAppendOnly) {
            this.isAppendOnly = isAppendOnly;
        }

        @Override
        public TypeInformation<Tuple2<String, Long>> getRecordType() {
            return TypeInformation.of(new TypeHint<Tuple2<String, Long>>(){});
        }

        @Override
        public void emitDataStream(DataStream<Tuple2<Boolean, Tuple2<String, Long>>> dataStream) {
            consumeDataStream(dataStream);
        }

        @Override
        public DataStreamSink<?> consumeDataStream(DataStream<Tuple2<Boolean, Tuple2<String, Long>>> dataStream) {
            return dataStream.addSink(new DataSink()).setParallelism(1);
        }

        @Override
        public TableSink<Tuple2<Boolean, Tuple2<String, Long>>> configure(String[] fieldNames, TypeInformation<?>[] fieldTypes) {
            MemoryUpsertSink memoryUpsertSink = new MemoryUpsertSink();
            memoryUpsertSink.setFieldNames(fieldNames);
            memoryUpsertSink.setFieldTypes(fieldTypes);
            memoryUpsertSink.setKeyFields(keyFields);
            memoryUpsertSink.setIsAppendOnly(isAppendOnly);

            return memoryUpsertSink;
        }

        @Override
        public String[] getFieldNames() {
            return schema.getFieldNames();
        }

        public void setFieldNames(String[] fieldNames) {
            this.fieldNames = fieldNames;
        }

        @Override
        public TypeInformation<?>[] getFieldTypes() {
            return schema.getFieldTypes();
        }

        public void setFieldTypes(TypeInformation<?>[] fieldTypes) {
            this.fieldTypes = fieldTypes;
        }
    }

    private static class DataSink extends RichSinkFunction<Tuple2<Boolean, Tuple2<String, Long>>> {
        public DataSink() {
        }

        @Override
        public void invoke(Tuple2<Boolean, Tuple2<String, Long>> value, Context context) throws Exception {
            System.out.println("send message:" + value);
        }
    }
}

运行结果:

send message:(true,(Mary,1))
send message:(true,(Bob,1))
send message:(true,(Mary,2))
send message:(true,(Liz,1))
send message:(true,(Liz,2))
send message:(true,(Mary,3))

这种模式的返回值也是一个Tuple2<Boolean, T>类型,和Retract的区别在于更新表中的某条数据并不会返回一条删除旧数据一条插入新数据,而是看上去真的是更新了某条数据。

Upsert stream番外篇

以上所讲的内容全部都是来自于flink官网,只是附上了与其对应的样例,可以让读者更直观的感受每种模式的输出效果。网上也有很多对官方文档的翻译,但是几乎没有文章或者样例说明在使用UpsertStreamTableSink的时候什么情况下返回值Tuple2<Boolean, T>第一个元素是false?话不多说直接上样例,只要把上面的例子中的sql改为如下

String sql = "SELECT user, cnt " +
             "FROM (" +
                    "SELECT user,COUNT(url) as cnt FROM clicks GROUP BY user" +
                   ")" +
             "ORDER BY cnt LIMIT 2";

返回结果:

send message:(true,(Mary,1))
send message:(true,(Bob,1))
send message:(false,(Mary,1))
send message:(true,(Bob,1))
send message:(true,(Mary,2))
send message:(true,(Liz,1))
send message:(false,(Liz,1))
send message:(true,(Mary,2))
send message:(false,(Mary,2))
send message:(true,(Liz,2))

具体的原理可以查看源码,org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecRankorg.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecSortLimit
解析sql的时候通过下面的方法得到不同的strategy,由此影响是否需要删除原有数据的行为。

  def getStrategy(forceRecompute: Boolean = false): RankProcessStrategy = {
    if (strategy == null || forceRecompute) {
      strategy = RankProcessStrategy.analyzeRankProcessStrategy(
        inputRel, ImmutableBitSet.of(), sortCollation, cluster.getMetadataQuery)
    }
    strategy
  }

知道什么时候会产生false属性的数据,对于理解JDBCUpsertTableSinkHBaseUpsertTableSink的使用会有很大的帮助。

总结

在阅读flink文档的时候,总是想通过代码实战体会其中的原理,很多文章是将官网的描述翻译成了中文,而本文是将官网的描述“翻译”成了代码,希望可以帮助读者理解官网原文中的含义。

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,558评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,002评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 159,036评论 0 349
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,024评论 1 285
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,144评论 6 385
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,255评论 1 292
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,295评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,068评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,478评论 1 305
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,789评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,965评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,649评论 4 336
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,267评论 3 318
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,982评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,223评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,800评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,847评论 2 351