A Memory Network Approach for Story-based
Temporal Summarization of 360° Videos

Sangho Lee      Jinyoung Sung      Youngjae Yu      Gunhee Kim     

Vision and Learning Lab
Computer Science and Engineering
Seoul National University, Seoul, Korea

CVPR 2018

Paper Poster Presentation BibTex


We address the problem of story-based temporal summarization of long 360° videos. We propose a novel memory network model named Past-Future Memory Network (PFMN), in which we first compute the scores of 81 normal field of view (NFOV) region proposals cropped from the input 360° video, and then recover a latent, collective summary using the network with two external memories that store the embeddings of previously selected subshots and future candidate subshots. Our major contributions are two-fold. First, our work is the first to address story-based temporal summarization of 360° videos. Second, our model is the first attempt to leverage memory networks for video summarization tasks. For evaluation, we perform three sets of experiments. First, we investigate the view selection capability of our model on the Pano2Vid dataset. Second, we evaluate the temporal summarization with a newly collected 360° video dataset. Finally, we test our model's performance in another domain, using the image-based VIST storytelling dataset. We verify that our model achieves state-of-the-art performance on all the tasks.

Motivation and Objective

Temporal Summarization of 360° Videos

Unlike normal videos, 360° videos should be summarized both spatially and temporally. That is, given a long 360° video, we need to spatially decide "where to look" within the unlimited field of view (FOV) of the 360° video, and temporally summarize it into a concise subset of key subshots. However, existing works on 360° video summarization, including AutoCam, Deep 360 Pilot, and CVS, have only addressed spatial summarization, while the characteristics of 360° videos hinder temporal summarization. In this paper, we focus on developing a novel model that addresses the problems inherent in the temporal summarization of 360° videos.

Two Problems in Temporal Summarization

We focus on two problems that hinder temporal summarization of 360° videos. First, each subshot contains many regions that are irrelevant to the plot of the video, so we need to identify key content such as objects, persons, and events. Second, supervised learning is difficult to apply because very few pairs of original videos and their edited summaries exist.

Figure 1. The need to identify key content.

Our Solutions: Past-Future Memory Network (PFMN)

Figure 2. Architecture of Past-Future Memory Network (PFMN).

Our model named Past-Future Memory Network (PFMN) has three key characteristics.

1) View Selection

To find key objects in each subshot, PFMN performs view selection using a deep ranking network.
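A deep ranking network of this kind is typically trained with a pairwise margin loss that pushes a well-framed NFOV crop to score higher than a poorly framed one. The sketch below illustrates that idea in plain NumPy; the feature vectors, linear scorer, and margin are illustrative stand-ins, not the actual PFMN view-scoring network.

```python
import numpy as np

def ranking_margin_loss(score_pos, score_neg, margin=1.0):
    """Pairwise margin ranking loss: penalize the model whenever the positive
    (well-framed) view does not outscore the negative view by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))

# Toy linear scorer over hypothetical view features.
w = np.array([0.5, -0.2, 0.8])
good_view = np.array([1.0, 0.1, 0.9])   # e.g., a crop centered on a person
bad_view  = np.array([0.2, 0.9, 0.1])   # e.g., a crop of empty background

s_pos = float(w @ good_view)            # 1.2
s_neg = float(w @ bad_view)             # 0.0
loss = ranking_margin_loss(s_pos, s_neg)
print(loss)  # 0.0 — the good view already outscores the bad one by the margin
```

When the positive view fails to beat the negative by the margin, the loss is positive and its gradient pulls the two scores apart.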

2) Past and Future Memories

To fully exploit both what it has already selected and what remains in the video, PFMN stores past and future information in two separate external memories: the past memory and the future memory.
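Such external memories are commonly read with soft attention: the query (here, a candidate subshot embedding) attends over the past and future memory slots separately, and the two reads are combined to score the candidate. The NumPy sketch below shows this generic read operation; the dimensions, random embeddings, and concatenation are assumptions for illustration, not the exact PFMN formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def memory_read(query, memory):
    """Soft attention read: weight each memory slot by its similarity
    to the query, then return the weighted sum of slots."""
    attn = softmax(memory @ query)   # (num_slots,) attention weights
    return attn @ memory             # (dim,) read vector

rng = np.random.default_rng(0)
dim = 8
past_memory   = rng.normal(size=(3, dim))  # embeddings of already-selected subshots
future_memory = rng.normal(size=(5, dim))  # embeddings of remaining candidate subshots
query = rng.normal(size=dim)               # embedding of the current candidate

past_read   = memory_read(query, past_memory)
future_read = memory_read(query, future_memory)
# Both reads feed into scoring the candidate, e.g., via concatenation:
context = np.concatenate([past_read, future_read])
print(context.shape)  # (16,)
```

Keeping the two memories separate lets the model weigh "does this fit what I have picked so far?" and "is something better still coming?" independently.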

3) Story-based Summarization

To train without any ground truth, we exploit the latent, collective storyline of a set of videos on the same topic to summarize 360° videos.
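Given a learned, storyline-aware scorer, a summary of this flavor can be produced by greedily picking the highest-scoring next subshot. The loop below is a generic greedy selection sketch, not the exact PFMN inference procedure; `score_fn` is a placeholder for the learned, memory-conditioned scorer.

```python
def greedy_summarize(subshots, k, score_fn):
    """Generic greedy selection: at each step, pick the candidate whose score
    (given the already-selected subshots and the remaining candidates) is highest."""
    selected, remaining = [], list(range(len(subshots)))
    for _ in range(min(k, len(remaining))):
        best = max(remaining,
                   key=lambda i: score_fn(subshots[i], selected, remaining))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: each subshot reduced to a single "story relevance" number,
# and the scorer simply returns that number.
subshots = [0.2, 0.9, 0.5, 0.7]
summary = greedy_summarize(subshots, k=2,
                           score_fn=lambda s, selected, remaining: s)
print(summary)  # [1, 3]
```

In the full model the scorer would consult both memories, so each pick depends on what has been selected and what remains, not just on the subshot in isolation.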


View Selection

Figure 3. The results of view selection. Higher scores mean better views.

Temporal Summarization

Figure 4. The results of temporal summarization. Red boxes indicate the matches with the GT and prediction.


We thank Joonil Na, Jaemin Cho, and Juyong Kim for helpful discussions. This work was supported by the Visual Display Business (RAK0117ZZ-21RF) of Samsung Electronics, and an IITP grant funded by the Korea government (MSIT) (No. 2017-0-01772). Gunhee Kim is the corresponding author.


Sangho Lee, Jinyoung Sung, Youngjae Yu and Gunhee Kim.
"A Memory Network Approach for Story-based Temporal Summarization of 360° Vides"
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  author = {Sangho Lee and Jinyoung Sung and Youngjae Yu and Gunhee Kim},
  title = "{A Memory Network Approach for Story-based Temporal Summarization of 360\deg~Videos}",
  booktitle = {CVPR},
  year = {2018}
This page uses a template from the project page of Satoshi Iizuka et al., "Globally and Locally Consistent Image Completion".