Paper | Open Vocabulary Object Detection with Pseudo Bounding-Box Labels (ECCV 2022)

0. Preface

https://arxiv.org/abs/2111.09452

The main challenge in going from classical OD to OVD is that existing OD datasets cover only a limited set of categories (e.g., the most widely used, COCO, has only 80 classes), so recognizing novel categories is difficult.

Based on this observation, the idea of this paper is to automatically generate pseudo object-detection annotations from large-scale image-caption pairs and use them to train the detector.

1. Introduction

  1. OD is limited to a fixed set of object categories (e.g., 80 for COCO)

  2. to reduce the human labor needed for annotation: zero-shot OD (ZSOD) & OVD

  3. ZSOD: transfer from base to novel by exploring the correlations between base and novel categories;

  4. OVD: transfer from base to novel with the help of image captions (personal note: this characterization isn't entirely accurate, and OVD does not necessarily have to use captions)

  5. both ZSOD and OVD are limited by the small size of the base category set

  6. This paper: automatically generates bounding-box annotations for objects at scale using existing resources.

  7. Existing pre-trained vision-language models implicitly encode strong localization ability.

  8. This paper: improve OVD using pseudo bounding-box annotations generated from large-scale image-caption pairs.

(Figure: left — annotation via human labor; right — image-caption pairs + VL model --> pseudo annotations)

2. Method

two components:

  1. a pseudo bounding-box label generator;

  2. an open vocabulary object detector.

2.1 Pseudo bounding-box label generator

(Figure: pipeline of the pseudo bounding-box label generator)
  1. predefine objects of interest

  2. input: {image, caption} pairs

  3. image --> image encoder --> visual embedding per region

  4. caption --> text encoder --> text embedding per token

input {visual embeddings, text embeddings} --> multi-modal encoder --> multi-modal features via image-text interaction with cross-attention.
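As a rough illustration of this fusion step, here is a minimal single-head cross-attention sketch; the actual model is a multi-layer, multi-head multi-modal transformer, and the function name and shapes here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cross_attend(text_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
    # text_emb:   (T, d) per-token text embeddings, used as queries
    # visual_emb: (R, d) per-region visual embeddings, used as keys/values
    d = text_emb.size(-1)
    scores = text_emb @ visual_emb.T / d ** 0.5  # (T, R) scaled dot-product
    attn = F.softmax(scores, dim=-1)             # each token attends over image regions
    return attn @ visual_emb                     # (T, d) tokens fused with visual context
```

These per-token attention maps over regions are what Grad-CAM is applied to in the next step.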

  1. for each object in the predefined set (e.g., "racket"), use Grad-CAM to visualize its activation map.

  2. apply a proposal generator to obtain multiple candidate boxes.

  3. the box with the largest overlap with the activation map is taken as the pseudo box (see the sketch below).
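A minimal sketch of this selection step, assuming "overlap" is measured as the fraction of activation mass that falls inside each proposal (the paper's exact scoring may differ, and select_pseudo_box is an illustrative name):

```python
import numpy as np

def select_pseudo_box(activation_map: np.ndarray, proposals) -> tuple:
    # activation_map: (H, W) Grad-CAM heatmap for one object word, e.g., "racket"
    # proposals: iterable of (x1, y1, x2, y2) integer boxes from the proposal generator
    total = activation_map.sum() + 1e-8  # guard against an all-zero map
    best_box, best_score = None, -1.0
    for x1, y1, x2, y2 in proposals:
        covered = activation_map[y1:y2, x1:x2].sum() / total
        if covered > best_score:         # keep the proposal covering the most activation
            best_score, best_box = covered, (x1, y1, x2, y2)
    return best_box
```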

2.2 Open vocabulary object detector with pseudo bounding-boxes

  • because the pipeline is two-step (pseudo labels are generated offline), any OVD model can be used.

  • in this paper, a typical framework for OVD is thus selected.

(Figure: open vocabulary detector trained with pseudo bounding-box labels)
  1. input: image, large-scale object vocabulary set
  2. image --> feature extractor --> object proposals --> RoI --> region-based visual embedding
  3. category texts in the large-scale object vocabulary set + "background" --> text encoder --> text embeddings
  4. training: encourage each paired {region-based visual embedding, text embedding} to be similar, and unmatched pairs to be dissimilar (see the sketch below)
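A minimal sketch of such a training objective, assuming cosine-similarity logits over the category texts plus "background", trained with softmax cross-entropy (the function name, temperature, and exact loss form are assumptions rather than the paper's verbatim formulation):

```python
import torch
import torch.nn.functional as F

def region_classification_loss(region_emb, text_emb, labels, tau=0.01):
    # region_emb: (N, d) RoI features projected into the text embedding space
    # text_emb:   (C+1, d) embeddings of C category names plus "background"
    # labels:     (N,) index of each region's matched category (C = background)
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.T / tau  # cosine similarities as class scores
    return F.cross_entropy(logits, labels)
```

Pulling the matched text embedding's score up (and the others down) via cross-entropy is what "encourage the paired embeddings" amounts to in this sketch.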