For each checkpoint we create new FlinkKafkaProducer so that new transactions will not clash with transactions created during previous checkpoints (producer.initTransactions() assures that we obtain new producerId and epoch counters).
核心:
snapshot存储pendingTxn
待恢复时pendingTXN对应的ckid<restore id 的pendingtxn一定要保证提交。
默认实现中,
跨绘画事务,新建 txn ,会分配新的pid,
此时若是恢复后,进行resume and commit 可能会误提交
在Kafka中
用相同的TID,producer挂了,新启的producer会abort之前尚未提交的记录,单由于notify在快照之后,故恢复时需要resume txn,kafka不支持resume txn,只能新建produer ,用反射设置PID,epoch,在resume txn。
单强行resume 事务,commit会导致后续相同的Tid的事务被commit。
存储的状态落后于实际提交。
正常情况下,commit后一定能保证成功。
恢复了历史pid 和 epoch 的producer ,commit时,不验证pid ,epoch。
initTransaction
Needs to be called before any other methods when the transactional.id is set in the configuration. This method does the following: 1. Ensures any transactions initiated by previous instances of the producer with the same transactional.id are completed. If the previous instance had failed with a transaction in progress, it will be aborted. If the last transaction had begun completion, but not yet finished, this method awaits its completion. 2. Gets the internal producer id and epoch, used in all future transactional messages issued by the producer.
abort epoch之前的数据,fence zombie
pid是在新建连接时随机分配,用来保证单会话,单partition幂等
public void resumeTransaction(long producerId, short epoch)
恢复TID PID epoch对应的txn,并提交
若不新建连接,则是相同的pid,epoch,会错误提交。
必须用相同的事务id fence之前的事务
流程
recoverandcommit
commit 快照时可能为提交的状态(保存了状态但是尚未notify的数据未收到notify)
之后abort poolsize (abort尚未保存状态)
由于快照失败导致并发度是1时 依然可能会超过poolsize存在。
若扩大poolsize ,在notify之后才真正回收,但是无法加入状态。