Jointly Aligning and Cosegmenting Multiple Photo Streams



  • Gunhee Kim and Eric P. Xing
    Jointly Aligning and Segmenting Multiple Web Photo Streams for the Inference of Collective Photo Storylines
    26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), Portland, Oregon, USA, Jun 23-28, 2013. (Oral) (Acceptance = 60 / 1870 ~ 3.2 %)
    [Paper (PDF)] [Supplementary (PDF)] [Presentation (PPTX)] [Poster (PDF)]

Matlab example code

We are working on a journal version and will post the code after the journal submission.


Motivation of Research

Suppose that we query and download millions of photo streams associated with the keyword scuba diving from the photo-sharing site Flickr. The photo streams are neither aligned nor calibrated, since they are taken by different users at different times and locations. At the same time, they are likely to share common storylines: sequences of events and activities that recur across the scuba diving photo streams (e.g. riding a boat, wearing equipment, underwater exploration, and so on).

Our ultimate goal is to build such collective storylines from the photo streams of millions of users. In this paper, as a first technical step, we propose a method to jointly perform alignment of multiple photo streams and cosegmentation of aligned images, as shown in the figure below. In the alignment step, the images of different photo streams are matched based on visual content and associated metadata. In the cosegmentation step, the aligned images are segmented together to facilitate image understanding, such as pixel-level classification. We close the loop between the two tasks so that solving one improves the performance of the other.


Figure 1. Motivation for jointly aligning and segmenting multiple photo streams, with an example of three photo streams of scuba+diving. The input is any number of photo streams of a specific activity, taken by various users at different times and places. The output is two-fold. (a) Photo stream alignment. The images of different photo streams are matched (shown in the same colors). (b) Image cosegmentation. The shared regions in the aligned images are jointly segmented.
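
As a rough illustration of the alignment idea described above (matching images across photo streams using visual content and metadata), the following Python sketch greedily matches each image of one stream to its lowest-cost counterpart in another. It is not the method of the paper, which uses a message-passing optimization over the whole image set. The function names, the feature vectors, the use of a normalized timestamp as the only metadata, and the `time_weight` parameter are all assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_streams(stream_a, stream_b, time_weight=0.5):
    """Greedily match every image in stream_a to one image in stream_b.

    Each image is a (feature_vector, normalized_time) pair, where
    normalized_time in [0, 1] is a hypothetical stand-in for metadata
    (e.g. relative position of the photo within its stream).
    The cost mixes visual dissimilarity with the time gap.
    """
    matches = []
    for i, (feat_a, t_a) in enumerate(stream_a):
        best_j, best_cost = None, float("inf")
        for j, (feat_b, t_b) in enumerate(stream_b):
            visual_cost = 1.0 - cosine_similarity(feat_a, feat_b)
            time_cost = abs(t_a - t_b)
            cost = visual_cost + time_weight * time_cost
            if cost < best_cost:
                best_j, best_cost = j, cost
        matches.append((i, best_j))
    return matches


# Toy data: two streams with two images each (made-up features and times).
stream_a = [([1.0, 0.0], 0.1), ([0.0, 1.0], 0.8)]
stream_b = [([0.1, 0.9], 0.7), ([0.9, 0.1], 0.2)]
matches = match_streams(stream_a, stream_b)
print(matches)
```

A greedy nearest-match like this treats every image independently and ignores the cosegmentation feedback; it is meant only to make the notion of a combined visual and metadata matching cost concrete.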


We design a scalable message-passing-based optimization framework to jointly achieve both tasks for the whole input image set at once. Please see the paper for details. For evaluation, we collect about 1.5 million images from 13 thousand photo streams of 15 outdoor recreational activities on Flickr.

Take-home Message

We proposed a scalable approach to jointly aligning and segmenting multiple uncalibrated Web photo streams of different users in an unsupervised, bottom-up way. The empirical results suggest that our method can be a key component in achieving our ultimate goal: inferring collective photo storylines from Web images, which is the next direction of our work.


  • This research is supported by NSF IIS-1115313 and AFOSR FA9550010247.