Vision-language (VL) pretrained models have achieved impressive performance on multimodal reasoning and zero-shot recognition tasks. Many of these VL models are pretrained on unlabeled image-caption pairs collected from the internet. In this paper, we study whether primitive concepts, such as color and shape attributes, emerge automatically from these pretrained VL models. We propose to learn compositional derivations that map primitive concept activations into composite concepts, a task we show to be straightforward when ground-truth primitive concept annotations are available. This compositional derivation learning (CompDL) framework allows us to quantitatively measure the usefulness and interpretability of the learned derivations by jointly considering the entire set of candidate primitive concepts. Our study reveals that state-of-the-art pretrained VL models learn primitive concepts that are highly useful as visual descriptors, as demonstrated by their strong performance on fine-grained visual recognition tasks, but that these concepts fail to provide interpretable compositional derivations, which highlights a limitation of existing VL models.
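To make the setup concrete, here is a minimal sketch (not the authors' released code) of the general idea: score an image against text prompts for candidate primitive concepts with a CLIP-like model, then fit a linear "compositional derivation" from those activations to composite class labels. It assumes the open-source `clip` package; the concept list, prompt template, and class count are illustrative placeholders, and the paper's actual probing and evaluation setup may differ.

```python
# Sketch: probe a CLIP-like VL model for primitive concept activations,
# then learn a linear mapping from activations to composite classes.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate primitive concepts (hypothetical examples).
primitives = ["red", "blue", "striped", "spotted", "round", "elongated"]
prompts = [f"a photo of a {p} object" for p in primitives]
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(prompts).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def primitive_activations(images: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between image embeddings and primitive-concept prompts."""
    with torch.no_grad():
        img_feat = model.encode_image(images.to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return img_feat @ text_feat.T  # shape: (batch, num_primitives)

# A linear compositional derivation: primitive activations -> composite classes.
num_classes = 200  # e.g. fine-grained categories; placeholder value
derivation = nn.Linear(len(primitives), num_classes).to(device)
optimizer = torch.optim.Adam(derivation.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of preprocessed images and labels."""
    acts = primitive_activations(images).float()
    logits = derivation(acts)
    loss = criterion(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The learned weight matrix of `derivation` can then be inspected: if the model's primitive activations were truly compositional, each composite class would be explained by a small, semantically sensible set of primitives, which is the interpretability question the paper measures.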
Original post: https://aiqianji.com/blog/article/124
Download PDF: https://arxiv.org/pdf/2203.17271v1.pdf