Joint Summarization of Large Sets of Web Images and Videos



  • Gunhee Kim, Seungwhan Moon, and Leonid Sigal
    Joint Photo Stream and Blog Post Summarization and Exploration
    28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, USA, Jun 7-12, 2015 (Acceptance = 602 / 2123 ~ 28.4 %)
    [Paper (PDF)] [Presentation (PDF)] [Poster (PDF)] [Extended Abstract (PDF)]


Motivation of Research

Photographs taken by general users can be regarded as personal statements of the stories they want to remember and tell about their experiences. As illustrated in Fig. 1, tens of thousands of people visit Disneyland even in a single day, and many of them take large streams of photos of their experiences with family or friends. In addition, some of the more enthusiastic visitors also write travel blogs, in which their personal stories unfold with itineraries, commentaries, impressions, and fun facts about the attractions.

The objective of this research is to exploit large collections of photo streams and blog posts in a mutually beneficial way for summarization and exploration. Blogs usually consist of sequences of images and associated text; they are written as stories, digesting key events into concise sentences and representative images. Thus we can transfer the semantic knowledge associated with blog pictures to the aligned photo stream images, many of which have noisy or no semantic labels. Specifically, we show that blog posts improve image localization accuracy (i.e., finding where photos were taken) and automatic image titling (i.e., creating descriptive titles for images).

In the reverse direction, each blog benefits from a large set of photo streams, which can be used to interpolate various photo paths between consecutive images in the blog. Each blog reflects a single person's experience and contains only a small number of selected images. Hence, photo-path interpolation using photo streams allows blog authors to explore alternative paths taken by other visitors who followed a similar itinerary.


Figure 1. Motivation for joint summarization and exploration between large collections of photo streams and blog posts. (a) The input is two-fold: a set of photo streams and blog posts from Disneyland, which are captured by multiple users. (b) Blogs benefit photo stream summarization by transferring semantic knowledge: Examples are automatic image titling and attraction-based image localization. (c) Photo streams enhance blog posts by allowing interpolation between blog images. Two blog images of an attraction entrance, used as a query, result in an illustration of what happens inside the attraction.

Method and Experiments

We formulate the joint problem of aligning blogs to photo streams and summarizing photo streams in a unified latent ranking SVM framework. We alternate between the two coupled latent SVM problems: first fixing the summarization and solving for the alignment from blog images to photo streams, then fixing the alignment and solving for the summarization.
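The alternation between the two coupled problems can be illustrated with a toy sketch. This is not the latent ranking SVM formulation: the scoring rule (dot-product similarity plus a fixed bonus for images in the current summary), the vote-based summary, and all function and parameter names are assumptions introduced only to show the control flow of fixing one variable while solving for the other.

```python
# Toy alternating updates: a stand-in for the alternation described above,
# NOT the latent ranking SVM itself. All names and scoring rules are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def align(blog_feats, stream_feats, summary, bonus=0.5):
    """Fix the summary; match each blog image to its best-scoring stream image.
    Stream images currently in the summary receive an additive bonus."""
    alignment = []
    for b in blog_feats:
        scores = [dot(b, s) + (bonus if j in summary else 0.0)
                  for j, s in enumerate(stream_feats)]
        alignment.append(max(range(len(scores)), key=scores.__getitem__))
    return alignment

def summarize(stream_feats, alignment, k):
    """Fix the alignment; keep the k stream images with the most blog matches
    (ties broken by lower index)."""
    votes = [0] * len(stream_feats)
    for j in alignment:
        votes[j] += 1
    ranked = sorted(range(len(stream_feats)), key=lambda j: (-votes[j], j))
    return set(ranked[:k])

def alternate(blog_feats, stream_feats, k, max_iters=20):
    """Alternate the two steps until neither result changes or max_iters is hit."""
    summary = set(range(min(k, len(stream_feats))))  # arbitrary initial summary
    alignment = None
    for _ in range(max_iters):
        new_alignment = align(blog_feats, stream_feats, summary)
        new_summary = summarize(stream_feats, new_alignment, k)
        if new_alignment == alignment and new_summary == summary:
            break
        alignment, summary = new_alignment, new_summary
    return alignment, summary

# Two blog images and four stream images with made-up 2-D features.
blog = [[1.0, 0.0], [0.0, 1.0]]
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
alignment, summary = alternate(blog, stream, k=2)
```

In this toy setting each step is a simple argmax or top-k selection. In the actual framework each step would instead solve a latent ranking SVM problem with the other variable held fixed.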

For evaluation, we collect a large-scale Disneyland dataset of 10K blogs (120K associated images) and 6K photo streams (540K images). We perform quantitative experiments and user studies via Amazon Mechanical Turk, and demonstrate that blog posts and photo streams are mutually beneficial for summarization, exploration, semantic knowledge transfer, and photo interpolation.


  • This research is supported by Disney Research.