A Survey of the InternVL Series

Basic paradigm of MLLMs:

[Figure: basic MLLM paradigm]

1. Main focus: InternVL 2.0-40B

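InternVL2-40B: 40.07B parameters in total; ViT: 5.54B; MLP: 143.17M; LLM: 34.39B
Stage 1 training data: the pre-training dataset used in InternVL 1.5 is extended with data collected from a variety of sources. These datasets cover multiple tasks, including captioning, visual question answering, detection, grounding, and OCR.
Stage 2 training data: includes video data such as EgoTaskQA, Mementos, STAR, NTU RGB+D, VideoChat2IT, and LSMDC-QA, as well as medical data such as Medical-Diff-VQA, Pathology-VQA, PMC-CaseReport, PMC-VQA, Slake, and VQA-RAD. SROIE, FUNSD, and POIE are also included to further strengthen the model's ability to recognize handwritten text.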
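As a quick sanity check on the parameter breakdown listed for InternVL2-40B, the components can be summed and compared with the stated 40.07B total. This is only an arithmetic sketch: it assumes the 143.17M figure refers to the MLP projector, and the dictionary keys are illustrative names.

```python
# Parameter counts in billions, as listed for InternVL2-40B.
# Assumption: the 143.17M entry is the MLP projector between ViT and LLM.
components = {
    "vit": 5.54,
    "mlp_projector": 0.14317,
    "llm": 34.39,
}

total = sum(components.values())
print(f"sum of components: {total:.2f}B")

# Share of the total taken by each component.
shares = {name: count / total for name, count in components.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The language model accounts for the bulk of the parameters, while the projector is well under 1% of the total.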

1.1 Model evaluation and datasets:

OmniCorpus, the largest image-text interleaved dataset:
Paper title: OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper link: https://arxiv.org/pdf/2406.08418
GitHub: https://github.com/OpenGVLab/Om
MM-NIAH (the first benchmark for evaluating multimodal long-document comprehension)
Paper title: Needle In A Multimodal Haystack
Paper link: https://arxiv.org/pdf/2406.07230 [proposes the "Needle In A Multimodal Haystack" (MM-NIAH) benchmark]
GitHub: OpenGVLab/MM-NIAH (official implementation of the paper "Needle In A Multimodal Haystack")
Three types of evaluation tasks: multimodal retrieval, counting, and reasoning.

1.2 Mini-InternVL:

Technical report: https://internvl.github.io/blog/2024-05-25-Mini-InternVL-1.5/
OpenGVLab: The mini version of InternVL 1.5 is here! 8% of the parameters for 80% of the performance, and it runs on a single 1080Ti!

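1.3 InternVL-1.5:

Paper title: How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper link: https://arxiv.org/pdf/2404.1682
The model has strong bilingual capability, especially on Chinese-related tasks. It is based on the "ViT-MLP-LLM" architecture, combining the InternViT-6B vision encoder with the InternLM2-20B language model, and uses a dynamic-resolution strategy and a data translation pipeline to improve its support for different languages and image resolutions.
Model structure: a ViT-MLP-LLM architecture in which an MLP projector connects the pre-trained InternViT-6B to InternLM2-20B. A simple pixel shuffle operation is also applied to reduce the number of visual tokens to one quarter.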


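1.4 InternVL:

InternVL model structure and training method (three progressive stages; TODO: add figure): vision-language contrastive training, vision-language generative training, and supervised fine-tuning. These stages make effective use of public data from different sources. Depending on the stage, components are trainable, frozen, or share weights.
Training process:
Stage 1 (vision-language contrastive training): LLaMA-7B encodes the text into $T_f$ and InternViT-6B extracts the visual features $I_f$. Following the CLIP objective, a symmetric cross-entropy loss is minimized over the similarity scores of image-text pairs. The dataset contains 4.98 billion image-text pairs after cleaning. All parameters are fully trainable.
Stage 2 (vision-language generative training): InternViT-6B is connected to QLLaMA and trained with a generative objective. Specifically, QLLaMA inherits the LLaMA-7B weights from stage 1; both InternViT-6B and QLLaMA are kept frozen, and only the newly added learnable queries and cross-attention layers are trained, using filtered high-quality data. Low-quality captions are filtered out further, reducing the data from the 4.98 billion pairs of stage 1 to about 1 billion.
Stage 3 (supervised fine-tuning): there are two variants, one using InternViT-6B alone and the other using the full InternVL model. SFT is performed on about 4M high-quality caption, VQA, and multi-turn dialogue samples.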
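The stage 1 objective can be written out explicitly. The following is a sketch of the standard CLIP-style symmetric cross-entropy loss, not a formula copied from the InternVL paper; the batch size $N$, the temperature $\tau$, and the per-sample superscripts are notation introduced here for illustration.

```latex
% Assumptions: a batch of N image-text pairs; I_f^{(i)} and T_f^{(j)} are the
% pooled image and text features; \tau is a temperature (learnable in CLIP).
s_{ij} = \frac{I_f^{(i)} \cdot T_f^{(j)}}{\lVert I_f^{(i)} \rVert \, \lVert T_f^{(j)} \rVert}

\mathcal{L}_{i \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii} / \tau)}{\sum_{j=1}^{N} \exp(s_{ij} / \tau)}

\mathcal{L}_{t \to i} = -\frac{1}{N} \sum_{j=1}^{N} \log \frac{\exp(s_{jj} / \tau)}{\sum_{i=1}^{N} \exp(s_{ij} / \tau)}

\mathcal{L} = \tfrac{1}{2} \left( \mathcal{L}_{i \to t} + \mathcal{L}_{t \to i} \right)
```

Matching pairs sit on the diagonal of the similarity matrix, so each term is a cross-entropy whose target is the diagonal entry, computed once over rows (image to text) and once over columns (text to image).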

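1.5 InternViT model: [InternViT-6B (https://zhuanlan.zhihu.com/p/427388113)]

ViT: the image is split into fixed-size patches, the patches are flattened into a sequence, position embeddings are added, and the sequence is fed into a standard Transformer encoder. A [CLS]-like token is prepended to the sequence for classification tasks. The main special component is the patch embedding; the rest is essentially the same as a standard Transformer.

ViT model structure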

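PatchEmbedding: the input x has shape (B, C, H, W), where B is the batch size, C is the number of channels (usually 3), and H and W are the image height and width. The output has shape (B, N, E), where B is still the batch size, N is the number of patches each image is split into, and E is embed_size, the length of the vector each patch is projected to. The projection is equivalent to a fully connected layer applied to each flattened patch and is implemented as a convolution whose kernel size and stride both equal the patch size, so E can also be read as the number of features per patch.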

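import torch.nn as nn


def to_2tuple(x):
    # Local stand-in for timm's to_2tuple helper: turn an int into (int, int).
    return tuple(x) if isinstance(x, (tuple, list)) else (x, x)


class PatchEmbed(nn.Module):
    """ 2D Image to Patch Embedding
    """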
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True):
        super().__init__()
        # img_size = (img_size, img_size)
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
        self.num_patches = self.grid_size[0] * self.grid_size[1]
        self.flatten = flatten
        # Conv2d arguments: in_channels, out_channels, kernel_size, stride
        # (C, H, W) -> (embed_dim, grid_size[0], grid_size[1])
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        B, C, H, W = x.shape
        assert H == self.img_size[0] and W == self.img_size[1], \
            f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
        x = self.proj(x)
        if self.flatten:
            x = x.flatten(2).transpose(1, 2)  # BCHW -> BNC
        x = self.norm(x)
        return x
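
To make the shape bookkeeping concrete, here is a small sketch that computes the output shape (B, N, E) of the patch embedding from the input size without running the model. The function name `patch_embed_output_shape` is hypothetical; the defaults mirror those of `PatchEmbed` above, and the 448/14 setting in the second call is an assumption about the InternViT-6B-448px configuration.

```python
def patch_embed_output_shape(batch_size, img_size=224, patch_size=16, embed_dim=768):
    """Return (B, N, E) for a square image split into non-overlapping square patches."""
    grid = img_size // patch_size      # patches per side
    num_patches = grid * grid          # N
    return (batch_size, num_patches, embed_dim)


# Default ViT-Base style setting: 224 / 16 = 14 patches per side.
print(patch_embed_output_shape(2))  # (2, 196, 768)

# Assumed InternViT-style setting: 448 / 14 = 32 patches per side.
print(patch_embed_output_shape(1, img_size=448, patch_size=14))  # (1, 1024, 768)
```

With 1024 patch tokens per 448×448 image, reducing the token count to one quarter leaves 256 tokens.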

1.6 Pixel Shuffle, Dynamic High Resolution, and Cross-Attention

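PixelShuffle: see torch.nn.PixelShuffle(upscale_factor), which rearranges a tensor of shape $(B, C * r * r, H, W)$ into $(B, C, H * r, W * r)$. InternVL applies the rearrangement in the opposite, downscaling direction (comparable to torch.nn.PixelUnshuffle with $r = 2$), going from $(B, C, H, W)$ to $(B, 4C, H/2, W/2)$, which is how the number of visual tokens drops to one quarter.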
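The following is a minimal pure-Python sketch of the downscaling direction used to cut the number of visual tokens to one quarter: each 2×2 block of neighbouring tokens is merged into a single token with four times the channels, which is the inverse of what `torch.nn.PixelShuffle` does. The function name `merge_tokens_2x2` and the channel ordering inside a merged token are illustrative assumptions, not InternVL's exact layout.

```python
def merge_tokens_2x2(grid):
    """Space-to-depth with factor 2 on an H x W grid of token vectors.

    grid[y][x] is a list of C floats. Returns an (H/2) x (W/2) grid whose
    tokens are lists of 4*C floats, so the token count drops to one quarter.
    """
    height, width = len(grid), len(grid[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    merged = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            # Concatenate the four neighbours: top-left, top-right,
            # bottom-left, bottom-right (this ordering is an assumption).
            token = grid[y][x] + grid[y][x + 1] + grid[y + 1][x] + grid[y + 1][x + 1]
            row.append(token)
        merged.append(row)
    return merged


# A 4 x 4 grid of 1-channel tokens numbered 0..15.
grid = [[[float(4 * y + x)] for x in range(4)] for y in range(4)]
small = merge_tokens_2x2(grid)
print(len(small), len(small[0]), len(small[0][0]))  # 2 2 4
print(small[0][0])                                  # [0.0, 1.0, 4.0, 5.0]
```

Sixteen tokens with one channel become four tokens with four channels: no values are discarded, they are only moved from the spatial axes into the channel axis.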
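DynamicHighResolution: the input image is resized so that both sides are multiples of 448, for example with a 2:3 ratio ( $H=448*2, W = 448*3$ ), and 448×448 tiles are then cropped from the resized image according to the predefined grid. The 2:3 ratio in this example is the one that find_closest_aspect_ratio selects as the best match for the image's aspect_ratio. In addition, the whole image is resized to 448×448 as a thumbnail and appended to the end of the sequence.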
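The tiling logic can be sketched in a few lines of pure Python. This is a simplified illustration, not InternVL's implementation: the function names `closest_tile_grid` and `tile_boxes`, the `max_tiles=12` default, and the tie-breaking rule (fewer tiles wins) are all assumptions made for the example.

```python
def closest_tile_grid(width, height, max_tiles=12, min_tiles=1):
    """Pick the (cols, rows) grid whose aspect ratio best matches the image.

    Simplified: ties are broken in favour of fewer tiles, which is not
    necessarily the rule InternVL uses.
    """
    image_ratio = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    return min(
        candidates,
        key=lambda g: (abs(image_ratio - g[0] / g[1]), g[0] * g[1]),
    )


def tile_boxes(cols, rows, tile_size=448):
    """Crop boxes (left, top, right, bottom) on the resized image, row by row."""
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


# A 1200 x 800 image has a width:height ratio of 3:2.
cols, rows = closest_tile_grid(width=1200, height=800)
boxes = tile_boxes(cols, rows)
num_views = len(boxes) + 1  # tiles plus the 448 x 448 thumbnail appended last
print((cols, rows), (cols * 448, rows * 448), num_views)  # (3, 2) (1344, 896) 7
```

For this input the image would be resized to 1344×896 (W = 448×3, H = 448×2), cut into six tiles, and followed by one thumbnail, giving seven 448×448 views in total.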
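Cross Attention: the usual QKV attention, except that the Query comes from the representation of one modality (e.g. text), while the Key and Value come from the representation of another modality.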
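Written out, cross-attention is ordinary scaled dot-product attention in which the two inputs differ. The symbols $X_a$, $X_b$, $W_Q$, $W_K$, $W_V$, and $d_k$ below are notation introduced here for illustration, shown for a single head.

```latex
% Assumptions: X_a \in \mathbb{R}^{n \times d} is the querying modality
% (e.g. text tokens or learnable queries), X_b \in \mathbb{R}^{m \times d} is
% the other modality (e.g. image features); W_Q, W_K, W_V are learned
% projections and d_k is the key dimension.
Q = X_a W_Q, \qquad K = X_b W_K, \qquad V = X_b W_V

\mathrm{CrossAttn}(X_a, X_b) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```

The output has one row per query, so its length follows $X_a$ (n rows), while the content of each row is a weighted mix of values drawn from $X_b$.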