Paper Site: http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf
Problem Definition
Apply a text retrieval approach to object recognition: search for and localize all occurrences of a user-outlined object in a video.
Contribution and Discussion
The analogy with text retrieval has demonstrated its worth: the system retrieves objects immediately at run time throughout a movie database, despite significant viewpoint changes in many frames.
Possible improvements: adding more visual descriptors for some scene types to improve the ranking, and defining the object of interest over more than a single frame so that the search covers all of its visual aspects.
It is tempting to follow other successes of the text retrieval community, such as latent semantic indexing to find content and automatic clustering to find the principal objects that occur throughout the movie.
Method
The vocabulary is constructed from a subpart of the movie, and its matching accuracy and expressive power are evaluated on the remainder of the movie.
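A minimal sketch of building a visual vocabulary by clustering descriptors from a training subset. These notes do not say which clustering algorithm is used; plain k-means is assumed here, and the function name `kmeans`, the 2-D toy descriptors, and the first-k-points initialisation are all illustrative choices.

```python
def kmeans(points, k, iterations=10):
    """Cluster descriptor vectors into k centres (the visual 'words')."""
    # Assumption: initialise with the first k points so the result is reproducible.
    centres = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centre (squared Euclidean).
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster is empty
                centres[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centres


# Toy descriptors standing in for the training subpart of the movie.
train = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1), (9.8, 10.1), (0.1, -0.1), (10.2, 9.9)]
vocabulary = kmeans(train, k=2)
```

Descriptors from the remainder of the movie would then be assigned to the nearest centre in `vocabulary` to test how well the vocabulary generalises.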
-
Represent each document by a vector of word frequencies. Apply a weighting to the components of this vector rather than using the frequency vector directly for indexing.
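The weighting step can be sketched as follows. These notes do not name the weighting scheme; tf-idf, the standard choice in text retrieval, is assumed here, with each component computed as (word count / words in document) * log(number of documents / documents containing the word). The names `tfidf_vectors` and `cosine_similarity` and the toy frames are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents, vocab_size):
    """documents: list of lists of visual-word ids in range(vocab_size)."""
    n_docs = len(documents)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))
    vectors = []
    for words in documents:
        counts = Counter(words)
        vec = [0.0] * vocab_size
        for word, count in counts.items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            vec[word] = tf * idf
        vectors.append(vec)
    return vectors


def cosine_similarity(a, b):
    """Normalised dot product between two weighted vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Three toy "frames", each a list of visual-word ids.
frames = [[0, 0, 1, 2], [0, 1, 1, 3], [0, 3, 3, 3]]
weighted = tfidf_vectors(frames, vocab_size=4)
```

Word 0 occurs in every frame, so its weight is zero everywhere: the weighting downplays words that do not help tell documents apart, which raw frequencies would not do.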
-
Evaluate on 164 frames from 48 shots taken at 19 different 3D locations in the movie Run Lola Run. The entire frame is used as the query region, and retrieval performance is measured over all 164 frames, using each in turn as the query.
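The leave-each-frame-as-query protocol can be sketched as below. These notes do not state the performance measure; a normalized rank of the relevant frames (0 when all relevant frames are returned first, about 0.5 for a random ordering) is assumed here as one reasonable choice. The function names, the toy similarity matrix, and the decision to count the query frame itself as relevant are illustrative assumptions.

```python
def normalized_rank(ranked_ids, relevant_ids):
    """0.0 means every relevant frame is ranked ahead of all others."""
    n = len(ranked_ids)
    relevant = set(relevant_ids)
    n_rel = len(relevant)
    # Sum of 1-based rank positions of the relevant frames.
    rank_sum = sum(
        pos for pos, fid in enumerate(ranked_ids, start=1) if fid in relevant
    )
    # Subtract the best achievable sum, then scale by n * n_rel.
    return (rank_sum - n_rel * (n_rel + 1) / 2) / (n * n_rel)


def evaluate_all_queries(similarity, location_of):
    """Use each frame in turn as the query and average the normalized rank.

    similarity: square matrix (list of lists) of frame-to-frame scores.
    location_of: location label for each frame; frames sharing the query's
    label (including the query itself) are treated as relevant.
    """
    n = len(location_of)
    scores = []
    for q in range(n):
        ranked = sorted(range(n), key=lambda j: -similarity[q][j])
        relevant = [j for j in range(n) if location_of[j] == location_of[q]]
        scores.append(normalized_rank(ranked, relevant))
    return sum(scores) / len(scores)


# Toy data: four frames from two locations.
similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
location_of = ["a", "a", "b", "b"]
mean_rank = evaluate_all_queries(similarity, location_of)
```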
Descriptors are computed for stable regions in each keyframe, and their mean values are computed using two frames on either side of the keyframe. The descriptors are vector quantized using the centres clustered from the ground truth set. A stop list and spatial consistency are used to remove redundant descriptors.
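A rough sketch of the quantization and stop-list steps described above. It assumes nearest-centre assignment under Euclidean distance and a stop list made of the most frequent visual words; the cut-off fraction, the function names, and the toy centres are hypothetical, and the spatial consistency check is not shown.

```python
from collections import Counter


def quantize(descriptor, centres):
    """Return the index of the nearest cluster centre (the visual word id)."""
    return min(
        range(len(centres)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, centres[k])),
    )


def build_stop_list(frames_words, top_fraction):
    """Collect the most frequent visual words across all frames."""
    counts = Counter(w for words in frames_words for w in words)
    n_stop = int(len(counts) * top_fraction)
    return {word for word, _ in counts.most_common(n_stop)}


def apply_stop_list(words, stop_list):
    """Drop stopped words from one frame's word list."""
    return [w for w in words if w not in stop_list]


# Toy centres and per-frame descriptors (2-D for readability).
centres = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frame_descriptors = [
    [(0.1, 0.0), (0.9, 1.1), (0.2, -0.1)],
    [(0.0, 0.1), (4.8, 5.2)],
]

frames_words = [[quantize(d, centres) for d in frame] for frame in frame_descriptors]
# Assumption: stop roughly the top third of words, purely for illustration.
stop_list = build_stop_list(frames_words, top_fraction=0.34)
filtered = [apply_stop_list(words, stop_list) for words in frames_words]
```

In this toy case word 0 appears in both frames and is the only word stopped, leaving each frame with the word that actually distinguishes it.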