Retrieval of Image Sequences from Multiple Paragraph Queries

People

  • Gunhee Kim
  • Seungwhan Moon
  • Leonid Sigal

Publication

  • Gunhee Kim, Seungwhan Moon, and Leonid Sigal
    Ranking and Retrieval of Image Sequences from Multiple Paragraph Queries
    28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, USA, June 7-12, 2015 (acceptance rate: 602/2123 ≈ 28.4%)
    [Paper (PDF)] [Presentation (PDF)] [Poster (PDF)] [Extended Abstract (PDF)]

Description

Research Motivation

The objective of this research is to rank and retrieve image sequences from a natural language text query consisting of multiple sentences or paragraphs. While most previous work has dealt with the relation between a single natural language sentence and an image or a video, our work extends this to the relation between paragraphs and image sequences. Our approach leverages the vast user-generated resources of blog posts and photo streams on the Web. We use blog posts as text-image parallel training data, since they co-locate informative text with representative images that users have carefully selected, and we exploit large-scale photo streams to augment the pool of images available for retrieval.
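
To make this data setup concrete, the following minimal sketch (in Python, with hypothetical types and field names rather than our actual data structures) shows how co-located (passage, image) pairs can be harvested from blog posts for training, while photo streams only enlarge the pool of retrievable images:

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass
class BlogPost:
    # A post is modeled as alternating text passages and the images the
    # author placed next to them; this co-location is the weak supervision.
    passages: List[str]
    images: List[str]  # image identifiers/paths, aligned with passages

@dataclass
class PhotoStream:
    # A user's time-ordered photo set; it only enlarges the pool of
    # candidate images for retrieval and carries no text supervision.
    photos: List[str]

def parallel_pairs(posts: List[BlogPost]) -> Iterator[Tuple[str, str]]:
    """Yield (passage, co-located image) training pairs."""
    for post in posts:
        yield from zip(post.passages, post.images)

def candidate_images(posts: List[BlogPost],
                     streams: List[PhotoStream]) -> List[str]:
    """The retrieval pool: blog images augmented with photo-stream images."""
    pool = [img for post in posts for img in post.images]
    pool += [photo for stream in streams for photo in stream.photos]
    return pool

# Example: one toy post and stream (paths are placeholders).
post = BlogPost(passages=["We rode the carousel."], images=["img_001.jpg"])
stream = PhotoStream(photos=["p1.jpg", "p2.jpg"])
print(list(parallel_pairs([post])))        # [('We rode the carousel.', 'img_001.jpg')]
print(candidate_images([post], [stream]))  # ['img_001.jpg', 'p1.jpg', 'p2.jpg']
```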

One key application of our method is to visualize visitors' text-only reviews on TRIPADVISOR or YELP by automatically retrieving the most illustrative image sequences. This application is significant because images let general users grasp key concepts and sentiment more easily and quickly. The visuals are especially useful for new visitors: for example, a user who has never been to Disneyland may not fully understand reviews about Bug's land without illustrations of its attractions, which our approach can generate.


Figure 1. A depiction of our problem statement with Disneyland examples. We leverage (a) blog posts to learn the mapping between sentences and images, and (b) photo streams to augment the image samples. (c) Our objective is to rank and retrieve image sequences that best describe a given text query consisting of multiple sentences or paragraphs.

Method and Experiments

We design a latent structural SVM framework to learn the semantic relevance relations between text and image sequences. We evaluate the image retrieval performance of our method on a newly collected Disneyland dataset, which consists of more than 10K blog posts with 120K associated images and 6K photo streams comprising more than 540K unique images. We present comprehensive empirical studies comparing the image-sequence retrieval accuracy of five text segmentation methods, three text description methods, and two text-to-image embedding methods, as well as their combinations; the latent structural SVM lets us integrate these multiple algorithmic outputs efficiently in a unified way. We also visualize users' reviews on TRIPADVISOR and YELP, and evaluate the results with crowdsourcing-based user studies on Amazon Mechanical Turk.
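
At a high level, a latent structural SVM scores a (paragraph, image sequence) pair by maximizing over a latent alignment of text segments to images, and is trained so that the ground-truth sequence outscores alternatives by a margin. The sketch below illustrates this generic formulation in NumPy; the elementwise-product joint feature and the random embeddings are hypothetical placeholders, not our actual model, which additionally integrates the multiple segmentation, description, and embedding outputs described above.

```python
import numpy as np

def joint_feature(segment_vec, image_vec):
    """Hypothetical joint feature: the elementwise product of a text-segment
    embedding and an image embedding (assumed same-dimensional)."""
    return segment_vec * image_vec

def best_alignment(w, segments, images):
    """Latent inference: for each text segment, pick the image in the
    sequence with the highest compatibility under the current weights w."""
    return [
        int(np.argmax([w @ joint_feature(seg, img) for img in images]))
        for seg in segments
    ]

def sequence_score(w, segments, images):
    """Score of a (paragraph, image sequence) pair, maximized over the
    latent segment-to-image alignment."""
    alignment = best_alignment(w, segments, images)
    return sum(w @ joint_feature(seg, images[j])
               for seg, j in zip(segments, alignment))

def ranking_hinge_loss(w, segments, pos_images, neg_images, margin=1.0):
    """Margin-based ranking loss: the ground-truth image sequence should
    outscore a negative sequence by at least `margin`."""
    return max(0.0,
               margin
               - sequence_score(w, segments, pos_images)
               + sequence_score(w, segments, neg_images))

# Toy usage with random embeddings (dimensions are arbitrary).
rng = np.random.default_rng(0)
d = 16
w = rng.normal(size=d)
segments = [rng.normal(size=d) for _ in range(3)]  # 3 text segments
pos_seq = [rng.normal(size=d) for _ in range(5)]   # ground-truth images
neg_seq = [rng.normal(size=d) for _ in range(5)]   # negative images
print(ranking_hinge_loss(w, segments, pos_seq, neg_seq))
```

Training would minimize this loss, plus a regularizer on w, over many (query, negative) pairs, e.g., with a subgradient or cutting-plane method, as is standard for structural SVMs.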

Funding

  • This research is supported by Disney Research.