Paper | Open Vocabulary Object Detection with Pseudo Bounding-Box Labels (ECCV 2022)

0. Preface

https://arxiv.org/abs/2111.09452

The main challenge in going from classical OD to OVD is that existing OD datasets cover only a limited set of categories (e.g., the most widely used, COCO, has only 80 classes), so recognizing novel categories is difficult.

Based on this observation, the idea of this paper is to automatically generate pseudo object-detection annotations from large-scale image-caption pairs and use them to train the detector.

1. Introduction

  1. OD is limited to a fixed set of object categories (e.g., 80 for COCO)

  2. to reduce the human labor needed for annotation: zero-shot OD (ZSOD) & OVD

  3. ZSOD: transfer from base to novel by exploring the correlations between base and novel categories;

  4. OVD: transfer from base to novel with the help of image captions (personal note: this characterization isn't entirely accurate, and OVD does not necessarily have to use captions)

  5. both ZSOD and OVD are limited by the small size of the base category set

  6. This paper: automatically generates bounding-box annotations for objects at scale using existing resources.

  7. Existing pre-trained vision-language models implicitly encode strong localization ability.

  8. This paper: improve OVD using pseudo bounding-box annotations generated from large-scale image-caption pairs.

(Figure: left — annotation via human labor; right — image-caption pairs + VL model --> pseudo annotations)

2. Method

two components:

  1. a pseudo bounding-box label generator;

  2. an open vocabulary object detector.

2.1 Pseudo bounding-box label generator

(Figure: pipeline of the pseudo bounding-box label generator)
  1. predefine objects of interest

  2. input: {image, caption} pairs

  3. image --> image encoder --> visual embedding per region

  4. caption --> text encoder --> text embedding per token

input {visual embeddings, text embeddings} --> multi-modal encoder --> multi-modal features via image-text interaction with cross-attention.
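As a rough illustration of this fusion step, here is a minimal single-head cross-attention sketch; the actual model is a multi-layer, multi-head multi-modal transformer, and the function name and shapes here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cross_attend(text_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
    # text_emb:   (T, d) per-token text embeddings, used as queries
    # visual_emb: (R, d) per-region visual embeddings, used as keys/values
    d = text_emb.size(-1)
    scores = text_emb @ visual_emb.T / d ** 0.5  # (T, R) scaled dot-product
    attn = F.softmax(scores, dim=-1)             # each token attends over image regions
    return attn @ visual_emb                     # (T, d) tokens fused with visual context
```

These per-token attention maps over regions are what Grad-CAM is applied to in the next step.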

  1. for each object in the predefined set (e.g., "racket"), use Grad-CAM to visualize its activation map.

  2. apply a proposal generator to obtain multiple candidate boxes.

  3. the box with the largest overlap with the activation map is taken as the pseudo box (see the sketch below).
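A minimal sketch of this selection step, assuming "overlap" is measured as the fraction of activation mass that falls inside each proposal (the paper's exact scoring may differ, and select_pseudo_box is an illustrative name):

```python
import numpy as np

def select_pseudo_box(activation_map: np.ndarray, proposals) -> tuple:
    # activation_map: (H, W) Grad-CAM heatmap for one object word, e.g., "racket"
    # proposals: iterable of (x1, y1, x2, y2) integer boxes from the proposal generator
    total = activation_map.sum() + 1e-8  # guard against an all-zero map
    best_box, best_score = None, -1.0
    for x1, y1, x2, y2 in proposals:
        covered = activation_map[y1:y2, x1:x2].sum() / total
        if covered > best_score:         # keep the proposal covering the most activation
            best_score, best_box = covered, (x1, y1, x2, y2)
    return best_box
```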

2.2 Open vocabulary object detector with pseudo bounding-boxes

  • because the pipeline is two-step (pseudo labels are generated offline), any OVD model can be used.

  • in this paper, a typical framework for OVD is thus selected.

(Figure: open vocabulary detector trained with pseudo bounding-box labels)
  1. input: image, large-scale object vocabulary set
  2. image --> feature extractor --> object proposals --> RoI --> region-based visual embedding
  3. category texts in the large-scale object vocabulary set + "background" --> text encoder --> text embeddings
  4. training: encourage each paired {region-based visual embedding, text embedding} to be similar, and unmatched pairs to be dissimilar (see the sketch below)
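A minimal sketch of such a training objective, assuming cosine-similarity logits over the category texts plus "background", trained with softmax cross-entropy (the function name, temperature, and exact loss form are assumptions rather than the paper's verbatim formulation):

```python
import torch
import torch.nn.functional as F

def region_classification_loss(region_emb, text_emb, labels, tau=0.01):
    # region_emb: (N, d) RoI features projected into the text embedding space
    # text_emb:   (C+1, d) embeddings of C category names plus "background"
    # labels:     (N,) index of each region's matched category (C = background)
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.T / tau  # cosine similarities as class scores
    return F.cross_entropy(logits, labels)
```

Pulling the matched text embedding's score up (and the others down) via cross-entropy is what "encourage the paired embeddings" amounts to in this sketch.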