Paper: https://arxiv.org/abs/2308.12966
1. Qwen-VL model parameter breakdown
Vision Encoder 1.9B; VL Adapter 0.08B; LLM 7.7B; Total 9.6B
Language model: Qwen-7B
Vision model: ViT-bigG
V-L Adapter: single-layer cross-attention
Data: for image input, the image is wrapped with special image start/end tags; the data also includes region-level descriptions, detection (grounding) annotations, and similar information.
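For intuition, a multimodal training sample is essentially one flat text sequence in which images and regions appear as tagged spans. The snippet below is only an illustrative sketch: the tag names (<img>, </img>, <ref>, <box>) follow the released Qwen-VL tokenizer conventions, while the path and coordinates are made up.

# Hypothetical example of a tagged grounding sample (values invented for illustration;
# Qwen-VL normalizes box coordinates to a 0-1000 range).
sample = (
    "Picture 1: <img>demo.jpg</img>\n"
    "<ref>the dog on the left</ref><box>(102,305),(478,890)</box>"
)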
Training:
1. Pre-training stage: data processing:
Figure: data filtering and processing in the pre-training stage
2. Multi-task pre-training: the vision encoder's input resolution is raised from 224×224 to 448×448, the language model is unfrozen, and all parameters are trained.
Figure: data used in multi-task pre-training
The organization of the multi-task training data is shown in the figure: the black part is the data input, and the blue part is the output that is used to compute the loss (a minimal masking sketch follows the training-pipeline figure below).
Figure: data organization format for multi-task training
3. Vision-language adapter: its main job is to compress the image token sequence to a fixed length of 256 and to align it with the LLM's text representations. The adapter randomly initializes 256 vectors as query vectors and uses the image features produced by the ViT as the keys and values of an attention layer; the query vectors attend over these keys and values, and the result is returned.
Figure: Qwen-VL training pipeline
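As a minimal sketch of "only the blue (output) part contributes to the loss": a common way to implement this is to clone the input ids as labels and mask the prompt positions with the ignore index of the cross-entropy loss. The code below is an assumption about this standard recipe, not code from the Qwen-VL repository; prompt_len, vocab_size, and the tensors are placeholders.

import torch
import torch.nn as nn

# Hypothetical shapes: batch of 2, sequence of 16 tokens, the first 10 tokens are the prompt ("black" part).
vocab_size, prompt_len = 32000, 10
input_ids = torch.randint(0, vocab_size, (2, 16))
logits = torch.randn(2, 16, vocab_size)

labels = input_ids.clone()
labels[:, :prompt_len] = -100  # prompt/input positions are ignored by the loss
# standard next-token shift: position t predicts token t + 1
loss = nn.CrossEntropyLoss(ignore_index=-100)(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
)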
How image information is fused:
Concrete steps:
- Encode the image with the vision transformer;
- Feed the image features into the vision-language adapter, which further compresses and encodes the image information;
- Write the resulting image features into the span reserved for the image in the LLM's hidden states, then feed the hidden states, which now contain both image and text representations, into the LLM for encoding.
Reference code:
# -------------------------------------[step 1]-------------------------------------------
# 1. The tokenizer converts the images and text in the query into input_ids; the image is encoded first:
self.visual = VisionTransformer(**config.visual)
...
if past_key_values is None and torch.any(input_ids == self.config.visual['image_start_id']):
    bos_pos = torch.where(input_ids == self.config.visual['image_start_id'])
    eos_pos = torch.where(input_ids == self.config.visual['image_start_id'] + 1)
    assert (bos_pos[0] == eos_pos[0]).all()
    img_pos = torch.stack((bos_pos[0], bos_pos[1], eos_pos[1]), dim=1)
    images = []
    for i, a, b in img_pos:
        # the ids between the image start/end markers encode the image path;
        # strip the padding (image_start_id + 2) and decode the remaining bytes
        image = input_ids[i][a + 1 : b - 1].tolist()
        image = image[ : image.index(self.config.visual['image_start_id'] + 2)]
        images.append(bytes(image).decode('utf-8'))
    # print(images)
    # ['demo.jpg']
    # self.visual.encode is the encode function of VisionTransformer
    images = self.visual.encode(images)
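For context, VisionTransformer.encode roughly loads each image path or URL, applies the model's image preprocessing transform, stacks the results into a batch, and runs the forward pass shown in step 2 below. The sketch follows the public Qwen-VL implementation, but treat details such as the image_transform attribute name as assumptions; it also requires `from PIL import Image` and `import requests`.

# Sketch of VisionTransformer.encode (not verbatim from this post):
def encode(self, image_paths):
    images = []
    for image_path in image_paths:
        if image_path.startswith("http://") or image_path.startswith("https://"):
            image = Image.open(requests.get(image_path, stream=True).raw)
        else:
            image = Image.open(image_path)
        image = image.convert("RGB")
        images.append(self.image_transform(image))
    images = torch.stack(images, dim=0)
    return self(images)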
# -------------------------------------[step 2]-------------------------------------------
# 2. Visual encoding mainly goes through VisionTransformer.forward plus the adapter to obtain the image representation
def forward(self, x: torch.Tensor):
    x = x.to(
        dtype=self.transformer.get_cast_dtype(),
        device=self.transformer.get_cast_device(),
    )
    # to patches
    x = self.conv1(x)  # shape = [*, width, grid, grid]
    x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
    x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
    x = x + get_abs_pos(self.positional_embedding, x.size(1))
    x = self.ln_pre(x)
    x = x.permute(1, 0, 2)  # NLD -> LND
    # run the patches through the VisionTransformer blocks to get the image representation
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD
    # pass through the vision-language adapter (Resampler) module
    x = self.attn_pool(x)
    x = self.ln_post(x)
    x = x @ self.proj
    # return the final image representation
    return x
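To make the shapes concrete, here is a standalone check of the patchify step above. The sizes are assumptions (a 448×448 input, 14×14 patches, and width 1664 as commonly reported for ViT-bigG), not values read out of this code:

import torch
import torch.nn as nn

width, patch = 1664, 14  # assumed ViT-bigG settings
conv1 = nn.Conv2d(3, width, kernel_size=patch, stride=patch, bias=False)
x = torch.randn(1, 3, 448, 448)
x = conv1(x)                               # [1, 1664, 32, 32]
x = x.reshape(x.shape[0], x.shape[1], -1)  # [1, 1664, 1024]
x = x.permute(0, 2, 1)                     # [1, 1024, 1664]: 1024 patch tokens per image
print(x.shape)                             # torch.Size([1, 1024, 1664])

At 448×448 this yields 32×32 = 1024 patch tokens per image, which the Resampler below compresses to 256.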
class Resampler(nn.Module):
    """
    A 2D perceiver-resampler network with one cross attention layers by
        (grid_size**2) learnable queries and 2d sincos pos_emb
    Outputs:
        A tensor with the shape of (grid_size**2, embed_dim)
    """
    def __init__(
        self,
        grid_size,
        embed_dim,
        num_heads,
        kv_dim=None,
        norm_layer=nn.LayerNorm
    ):
        super().__init__()
        self.num_queries = grid_size ** 2
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.pos_embed = nn.Parameter(
            torch.from_numpy(get_2d_sincos_pos_embed(embed_dim, grid_size)).float()
        ).requires_grad_(False)
        # randomly initialized query vectors
        self.query = nn.Parameter(torch.zeros(self.num_queries, embed_dim))
        trunc_normal_(self.query, std=.02)
        if kv_dim is not None and kv_dim != embed_dim:
            self.kv_proj = nn.Linear(kv_dim, embed_dim, bias=False)
        else:
            self.kv_proj = nn.Identity()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.ln_q = norm_layer(embed_dim)
        self.ln_kv = norm_layer(embed_dim)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    def forward(self, x, attn_mask=None):
        # absolute positional embedding, interpolated to the incoming feature length
        pos_embed = get_abs_pos(self.pos_embed, x.size(1))
        # project the ViT features: their dimension may not match the query vectors' embed_dim
        x = self.kv_proj(x)
        x = self.ln_kv(x).permute(1, 0, 2)
        N = x.shape[1]  # N is the batch size
        q = self.ln_q(self.query)
        # cross attention between the queries q and the ViT output x
        out = self.attn(
            self._repeat(q, N) + self.pos_embed.unsqueeze(1),  # add positional embeddings to the learned queries
            x + pos_embed.unsqueeze(1),  # add positional embeddings to x again (patches already received them inside the ViT)
            x,
            attn_mask=attn_mask)[0]
        # return the compressed representation
        return out.permute(1, 0, 2)

    def _repeat(self, query, N: int):
        return query.unsqueeze(1).repeat(1, N, 1)
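A hypothetical usage sketch of the Resampler above: the constructor arguments (grid_size=16 so that 16² = 256 queries, embed_dim=4096 to match the LLM hidden size, kv_dim=1664 to match the ViT width) are assumptions consistent with the description in this post, not values quoted from the Qwen-VL config, and the sketch relies on the class above plus its helpers (get_2d_sincos_pos_embed, get_abs_pos, trunc_normal_).

# Assumed sizes; requires the Resampler class and helper functions above.
resampler = Resampler(grid_size=16, embed_dim=4096, num_heads=32, kv_dim=1664)
vit_feats = torch.randn(2, 1024, 1664)  # [batch, num_patches, vit_width]
out = resampler(vit_feats)
print(out.shape)  # torch.Size([2, 256, 4096]): 1024 patch tokens compressed to 256 per image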
# -------------------------------------[step 3]-------------------------------------------
# 3. Write the features produced by the vision transformer and the adapter into the LLM's
#    hidden_states, then feed them into the LLM for encoding:
# images has shape [bsz, 256, hidden_size]
# The slice a + 1 : b has length 256: these are the 256 positions the tokenizer reserved for the
# image when tokenizing the input. Once the image encoding is available, it is written into the
# positions of hidden_states that were reserved for it.
for idx, (i, a, b) in enumerate(img_pos):
    hidden_states[i][a + 1 : b] = images[idx]
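Note that this assignment leaves the hidden states at positions a and b (the image start/end markers) untouched: exactly the 256 placeholder positions in between are overwritten by the adapter's 256 output vectors, so from this point on the image and text tokens flow through the LLM as one ordinary sequence.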