Reading notes on "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking"

1. Introduction

(Figure: document AI tasks)

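Existing document AI methods: text is handled BERT-style; the methods differ mainly in the pre-training objective used for the image modality.
Starting point of LayoutLMv3: to overcome the discrepancy between the pre-training objectives of the text and image modalities, it proposes a unified masking objective for text and image. Visual features are not extracted with a CNN or similar backbone, but with a ViT-like approach.
Code and models: https://aka.ms/layoutlmv3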

2. Model

(Figure: LayoutLMv3)

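text embedding: an OCR tool extracts the text and the corresponding 2D positions. The text embedding combines a word embedding with position embeddings: the word embedding is initialized from pre-trained RoBERTa weights, and the position embeddings comprise 1D positions and 2D layout positions.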
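
As a rough illustration of how the pieces of the text embedding fit together, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the table names, the tiny sizes, the random initialization (in the paper the word embedding comes from RoBERTa), and in particular the choice to simply sum the word, 1D-position and 2D-layout vectors.

```python
import random

def make_table(num_rows, dim, rng):
    """Randomly initialised lookup table standing in for a learned embedding matrix."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]

def text_embedding(token_ids, boxes, tables):
    """Sum word, 1D-position and 2D-layout embeddings for each token.

    boxes holds one (x0, y0, x1, y1) tuple per token, with coordinates already
    quantised to integer buckets that index the x / y / w / h tables.
    Summation is an illustrative assumption, not the paper's exact scheme.
    """
    out = []
    for pos, (tok, (x0, y0, x1, y1)) in enumerate(zip(token_ids, boxes)):
        parts = [
            tables["word"][tok],      # word embedding
            tables["pos_1d"][pos],    # 1D position in the token sequence
            tables["x"][x0], tables["y"][y0],   # top-left corner
            tables["x"][x1], tables["y"][y1],   # bottom-right corner
            tables["w"][x1 - x0], tables["h"][y1 - y0],  # box width and height
        ]
        out.append([sum(vals) for vals in zip(*parts)])
    return out
```

Each token ends up with a single vector of the same dimension as the word embedding, so the layout information travels with the token into the Transformer.
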
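image embedding: earlier document AI methods extract image features with a CNN or Faster R-CNN, which is computationally expensive. Inspired by ViT, the image is first resized to $H \times W$ and then split into patches of size $P \times P$ (giving $\frac{HW}{P^2}$ patches in total). Each image patch is linearly projected to a $D$-dimensional vector, and a learned 1D position embedding is added to every patch (the paper reports that 2D position embeddings brought no improvement, so a 1D position embedding is enough).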
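
The patch arithmetic above can be checked with a small sketch. It assumes a single-channel image stored as a list of rows, a bias-free projection matrix `weight` with $D$ rows of length $P^2$, and a `pos_table` with one learned vector per patch; these names and simplifications are illustrative and not taken from the paper's code.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches

def embed_patches(patches, weight, pos_table):
    """Linearly project each flattened patch to D dims and add a 1D position embedding."""
    out = []
    for idx, tile in enumerate(patches):
        projected = [sum(w * v for w, v in zip(row, tile)) for row in weight]
        out.append([p + e for p, e in zip(projected, pos_table[idx])])
    return out
```

For example, assuming $H = W = 224$ and $P = 16$ (values chosen here for illustration), `patchify` returns $224 \times 224 / 16^2 = 196$ patches.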

Objective: $L = L_{MLM} + L_{MIM} + L_{WPA}$

Masked Language Modeling (MLM): randomly masks 30% of the text tokens using span masking, with span lengths drawn from a Poisson distribution with $\lambda=3$.

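MLM loss, as given in the paper:

$L_{MLM}(\theta) = -\sum_{l=1}^{L'} \log p_\theta\left(y_l \mid X^{M'}, Y^{L'}\right)$

where $X^{M'}$ and $Y^{L'}$ denote the corrupted (masked) image-token and text-token sequences, $M'$ and $L'$ denote the masked positions, and $y_l$ is the correct text token at masked position $l$.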
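
The span-masking step for MLM can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, zero-length Poisson draws are bumped to 1, spans may overlap, and the last span is clipped so the masked count matches the 30% budget exactly.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def span_mask(num_tokens, mask_ratio=0.3, lam=3.0, seed=0):
    """Return sorted indices of masked positions covering mask_ratio of the tokens."""
    rng = random.Random(seed)
    budget = int(round(num_tokens * mask_ratio))
    masked = set()
    while len(masked) < budget:
        length = max(1, sample_poisson(lam, rng))    # assumption: empty spans become length 1
        length = min(length, budget - len(masked))   # never exceed the masking budget
        start = rng.randrange(0, num_tokens - length + 1)
        masked.update(range(start, start + length))  # overlapping spans simply merge
    return sorted(masked)
```

Calling `span_mask(100)` yields 30 masked positions grouped into short contiguous runs rather than isolated tokens.
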
Masked Image Modeling (MIM): as in BEiT, a blockwise masking strategy randomly masks about 40% of the image tokens. The prediction targets are discrete tokens, obtained by passing the image through an image tokenizer.

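MIM loss, as given in the paper:

$L_{MIM}(\theta) = -\sum_{m=1}^{M'} \log p_\theta\left(x_m \mid X^{M'}, Y^{L'}\right)$

where $x_m$ is the correct discrete image token at masked image position $m$.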
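
Blockwise masking for MIM can be illustrated with a simplified sketch. It is loosely modeled on the BEiT procedure (sample a block area and aspect ratio, place the block at a random location, repeat until enough patches are masked), but the function name, the `min_block` value and the log-uniform aspect-ratio range are assumptions made here, and the final count may slightly overshoot the target.

```python
import math
import random

def blockwise_mask(grid_h, grid_w, mask_ratio=0.4, min_block=4, seed=0):
    """Return a set of masked patch indices covering at least mask_ratio of the grid."""
    rng = random.Random(seed)
    target = int(mask_ratio * grid_h * grid_w)
    masked = set()
    while len(masked) < target:
        remaining = target - len(masked)
        # Block area between min_block and what is still needed (order-safe bounds).
        area = rng.randint(min(min_block, remaining), max(min_block, remaining))
        # Aspect ratio sampled log-uniformly in [0.3, 1/0.3] (assumed range).
        aspect = math.exp(rng.uniform(math.log(0.3), math.log(1 / 0.3)))
        block_h = min(grid_h, max(1, round(math.sqrt(area * aspect))))
        block_w = min(grid_w, max(1, round(math.sqrt(area / aspect))))
        top = rng.randint(0, grid_h - block_h)
        left = rng.randint(0, grid_w - block_w)
        for r in range(top, top + block_h):
            for c in range(left, left + block_w):
                masked.add(r * grid_w + c)  # row-major patch index
    return masked
```

Masking rectangular blocks rather than independent patches removes whole local regions, so the model cannot recover a masked patch just by copying its immediate neighbours.
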
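Word-Patch Alignment (WPA): the WPA objective learns the alignment between text words and image patches. It predicts whether the image patch corresponding to a text word is masked.
Masked text tokens are left out of the WPA loss, to keep the model from learning a correspondence between masked text words and image patches. The labels are assigned as follows: for an unmasked text token, if its corresponding image tokens are also unmasked, the label is "aligned"; otherwise, the label is "unaligned".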

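WPA loss, as given in the paper:

$L_{WPA}(\theta) = -\sum_{l=1}^{L-L'} \log p_\theta\left(z_l \mid X^{M'}, Y^{L'}\right)$

where $L - L'$ is the number of unmasked text tokens and $z_l$ is the binary aligned/unaligned label of the text token at position $l$.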
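
The WPA labeling rule can be written down directly. In this sketch the function name and input layout are hypothetical: `token_patches[i]` is assumed to list the indices of the image patches that overlap text token `i` (for instance, derived from its OCR bounding box), and `None` marks tokens that are excluded from the loss.

```python
def wpa_labels(text_masked, token_patches, patch_masked):
    """Assign a WPA label to every text token.

    text_masked:   list of bool, True if the text token is masked
    token_patches: list of lists, patch indices corresponding to each token
    patch_masked:  list of bool, True if the image patch is masked
    """
    labels = []
    for is_masked, patches in zip(text_masked, token_patches):
        if is_masked:
            labels.append(None)            # masked text tokens are excluded from the WPA loss
        elif any(patch_masked[p] for p in patches):
            labels.append("unaligned")     # some corresponding image patch is masked
        else:
            labels.append("aligned")       # token and all its patches are unmasked
    return labels
```

For instance, with `text_masked = [False, False, True]`, `token_patches = [[0, 1], [2], [3]]` and `patch_masked = [False, False, True, False]`, the function returns `["aligned", "unaligned", None]`.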

Initialization: the text embedding is initialized from RoBERTa, and the image tokenizer is initialized from the pre-trained image tokenizer in DiT.
Pre-training: on the IIT-CDIP dataset (about 11 million document images).
Tasks:
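- Form and receipt understanding: the FUNSD dataset, whose entity labels are question, answer, header, and other.
  
  FUNSD data (figure from https://zhuanlan.zhihu.com/p/588160901; orange regions are `header`, light blue regions are `question`, green regions are `answer`, and pink regions are `other`)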
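- Document image classification: the RVL-CDIP dataset (a subset of IIT-CDIP consisting of scanned document images in 16 classes, such as letter, form, email, resume, and memo).
  
  RVL-CDIP data (figure from https://www.pudn.com/news/6228daeb9ddf223e1ad26a85.html)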
- Document visual question answering: the DocVQA dataset. The input is a document image and a question; the output is the answer.