Joint Summarization of Large Sets of Web Images and Videos
People
Publication
Description
Motivation of Research
Photographs taken by general users can be regarded as personal statements of the stories they want to remember and tell about their experiences. As illustrated in Fig. 1, tens of thousands of people visit Disneyland even in a single day, and many of them take large streams of photos of their special experiences with family or friends. In addition, some of the more enthusiastic visitors also write travel blogs, in which their personal stories unfold with itineraries, commentaries, impressions, and fun facts about the attractions.
The objective of this research is to take advantage of large collections of photo streams and blog posts in a mutually beneficial way for the purposes of summarization and exploration. Blogs usually consist of sequences of images and associated text; they are written as stories that digest key events into concise sentences and representative images. We can therefore transfer the semantic knowledge associated with blog pictures to the aligned photo stream images, many of which have noisy semantic labels or none at all. Specifically, we show that blog posts improve the accuracy of image localization (i.e., finding where photos were taken) and of automatic image titling (i.e., creating descriptive titles for images).
In the reverse direction, each blog benefits from a large set of photo streams, which can interpolate various photo paths between consecutive images in the blog. Each blog is written from a single person's experience and contains only a small number of selected images. Hence, the photo-path interpolation achieved with photo streams allows blog authors to explore alternative paths taken by other visitors who followed a similar itinerary.
Method and Experiments
We formulate the joint problem of aligning blogs to photo streams and summarizing photo streams in a unified latent ranking SVM framework. We alternate between the two coupled latent SVM problems: first we fix the summarization and solve for the alignment from blog images to photo streams, then we fix the alignment and solve for the summarization. For evaluation, we collect a large-scale Disneyland dataset of 10K blogs (120K associated images) and 6K photo streams (540K images). We perform quantitative experiments and user studies via Amazon Mechanical Turk, and demonstrate that blog posts and photo streams are mutually beneficial for summarization, exploration, semantic knowledge transfer, and photo interpolation.
Funding