AI Learns to Summarize Videos!

Somya Mishra
8 min read · Apr 26, 2021

Artificial Intelligence, and especially the field of neural-network-based learning algorithms, is making great leaps in a variety of areas. This leaves us wondering whether learning algorithms could summarize videos, or fetch videos based on what we are looking for, rather than making us endlessly scroll through huge collections of videos on the internet. The global video market is taking center stage: according to Forbes, over a billion hours of video are watched on YouTube each day, and the streaming site is claimed to receive over 400 hours of new video content every single minute!

Global Online Video Viewing Statistics (Photo Credit)

Many other such statistics show that online video content is constantly growing and is likely to remain a mainstream medium for sharing all kinds of information, partly because of recent advances in AI-based video retrieval and automatic summarization.

This article aims to provide an overview of recent advances and existing deep-learning techniques surveyed in the research paper Video Summarization Using Deep Neural Networks: A Survey.

What is Video Summarization?

Video summarization is the process of shortening a video by selecting the frames that capture its most informative parts and generating a concise yet complete synopsis from them. The synopsis is typically produced in one of two forms:

1. Static video summarization: a set of video key-frames (a.k.a. a video storyboard)

2. Dynamic video summarization: a set of video key-fragments (a.k.a. a video skim)

One reason video skims are advantageous over video storyboards is that they include audio and motion, which make the narration more expressive and explanatory, and generally more interesting to watch.

Process of Static Video Summary Composition (Photo Credit)
Process of Dynamic Video Summarization (Photo Credit)

Video summarization has many wide-ranging applications, the most important being facilitating effective content browsing, enhancing the viewing experience, and increasing consumers’ engagement and content consumption, given the plethora of videos on the internet.

How does it work?

The process of video summarization begins by representing the visual content of the video as feature vectors. These deep feature vectors are extracted for all the frames, or for a subset of them selected via frame sampling (for instance, processing 2 frames per second). Extracted at the frame level, these feature vectors capture the video content in fine detail and are then used to select frames for the video summary.

In most deep-learning-based video summarization techniques, these feature vectors are extracted with the help of pre-trained neural networks such as GoogLeNet (the most commonly used), AlexNet, and variations of ResNet and VGGNet, all belonging to the families of Convolutional Neural Networks (CNNs) and Deep Convolutional Neural Networks (DCNNs).
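To make this concrete, here is a minimal sketch of the feature-extraction step using PyTorch and torchvision’s pre-trained GoogLeNet, with OpenCV handling the 2-frames-per-second sampling. The function name and sampling rate are illustrative choices, not prescribed by the survey.

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained GoogLeNet; replacing the classifier head with an identity
# exposes the 1024-dimensional penultimate (pool5) features.
googlenet = models.googlenet(pretrained=True)
googlenet.fc = torch.nn.Identity()
googlenet.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(video_path, sample_fps=2):
    """Sample `sample_fps` frames per second and return one 1024-d vector per frame."""
    cap = cv2.VideoCapture(video_path)
    step = max(int(round(cap.get(cv2.CAP_PROP_FPS) / sample_fps)), 1)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                feats.append(googlenet(preprocess(rgb).unsqueeze(0)).squeeze(0))
        idx += 1
    cap.release()
    return torch.stack(feats)  # shape: (num_sampled_frames, 1024)
```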

The extracted features are then passed through a deep summarizer network, which returns either a set of selected video key-frames (a static video storyboard) or a set of selected video key-fragments that are concatenated chronologically to form a short video skim.
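For the skim case, a common protocol in the literature (used, for example, on the SumMe and TVSum benchmarks) is to score each fragment by the mean importance of its frames and then select fragments with a 0/1 knapsack under a budget of roughly 15% of the video’s duration. The sketch below shows only this selection step and assumes the video has already been segmented into fragments (often done with Kernel Temporal Segmentation); all names are illustrative.

```python
def select_fragments(frame_scores, fragments, budget_ratio=0.15):
    """0/1 knapsack: pick fragments maximizing total score within a length budget.

    frame_scores : per-frame importance scores (list of floats)
    fragments    : list of (start, end) frame-index pairs, consecutive and non-overlapping
    """
    values = [sum(frame_scores[s:e]) / max(e - s, 1) for s, e in fragments]
    lengths = [e - s for s, e in fragments]
    budget = int(budget_ratio * sum(lengths))

    # Classic dynamic-programming knapsack over fragment lengths.
    n = len(fragments)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if lengths[i - 1] <= c:
                dp[i][c] = max(dp[i][c],
                               dp[i - 1][c - lengths[i - 1]] + values[i - 1])

    # Backtrack to recover the chosen fragments, then restore chronological order.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(fragments[i - 1])
            c -= lengths[i - 1]
    return sorted(chosen)
```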

Types of Deep-Learning-based Video Summarization Approaches

Based on the utilized type of data, deep-learning-based video summarization can be divided into two major approaches:

  1. Unimodal approaches where feature extraction is based on the visual modality of the videos and summarization is learnt in a (weakly-)supervised or unsupervised manner.
  2. Multimodal approaches where summarization is learnt in a supervised manner using the available textual metadata.

Based on the adopted training strategy, deep-learning-based video summarization can be divided into three major approaches:

  1. Supervised approaches that depend on datasets with manually labeled ground-truth annotations, which can take the form of complete video summaries or of frame-level importance scores that indicate how frames/fragments should be chosen for the summary.
  2. Unsupervised approaches that do not need laboriously hand-labeled data; instead, they require a significant collection of original videos for their training.
  3. Weakly-supervised approaches that rely on less expensive, easily obtained weak labels, on the premise that, although weaker than full human annotations, they can still train decently strong predictive models.
High-level representation of the typical deep-learning based video summarization pipeline (Photo Credit)

Further Classification of Video Summarization Approaches

All three broad categories of video summarization approaches (supervised, unsupervised, and weakly-supervised) rely on analyzing the video content and can be further divided into the subclasses below.

Supervised approaches entail:

  1. Learning frame importance by modeling the temporal dependency among frames (see the sketch after this list)
  2. Learning frame importance by modeling the spatiotemporal structure of the video
  3. Learning summarization by fooling a discriminator that tries to distinguish a machine-generated summary from a human-generated one
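A minimal PyTorch sketch of the first subclass follows: a bidirectional LSTM maps pre-extracted frame features to per-frame importance scores and is trained against human-annotated scores. The class name, layer sizes, and random stand-in data are illustrative; published models such as vsLSTM add considerably more machinery.

```python
import torch
import torch.nn as nn

class FrameImportanceLSTM(nn.Module):
    """Bidirectional LSTM that scores each frame's importance in [0, 1]."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, feats):              # feats: (batch, frames, feat_dim)
        out, _ = self.lstm(feats)
        return self.head(out).squeeze(-1)  # (batch, frames) importance scores

# One supervised training step against ground-truth frame-level scores.
model = FrameImportanceLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

feats = torch.randn(1, 300, 1024)   # stand-in for extracted GoogLeNet features
gt_scores = torch.rand(1, 300)      # stand-in for human-annotated importance
loss = criterion(model(feats), gt_scores)
loss.backward()
optimizer.step()
```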

Unsupervised approaches entail:

  1. Learning summarization by fooling a discriminator that tries to distinguish the original video (or set of key-frames) from a summary-based reconstruction of it
  2. Learning summarization by targeting specific desired properties for the summary (illustrated after this list)
  3. Building object-oriented summaries by modeling the key motion of important visual objects
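The second subclass can be illustrated with the diversity and representativeness rewards popularized by reinforcement-learning-based methods such as Zhou et al.’s DR-DSN. The sketch below is a hedged approximation of those two properties, not the exact formulation of any one paper.

```python
import torch
import torch.nn.functional as F

def diversity_reward(feats, picks):
    """Mean pairwise dissimilarity among the selected frames' features."""
    n = len(picks)
    if n < 2:
        return torch.tensor(0.0)
    x = F.normalize(feats[picks], dim=1)
    sim = x @ x.t()                            # cosine similarities
    off_diag = sim.sum() - sim.diagonal().sum()
    return 1.0 - off_diag / (n * (n - 1))

def representativeness_reward(feats, picks):
    """Reward selections whose frames lie close to every frame of the video."""
    dists = torch.cdist(feats, feats[picks])   # (frames, selected)
    return torch.exp(-dists.min(dim=1).values.mean())

feats = torch.randn(300, 1024)                 # per-frame deep features
picks = torch.tensor([10, 75, 150, 220, 290])  # indices chosen by the summarizer
reward = diversity_reward(feats, picks) + representativeness_reward(feats, picks)
```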

Review of the Summarization Approaches

Early deep-learning-based approaches for video summarization adopted combinations of CNNs and RNNs, where the CNNs are used as pre-trained components and the RNNs (mostly LSTMs) model the temporal dependency among video frames. Most of these methods are supervised and try to learn how to estimate the importance of video frames/fragments from manually labeled ground-truth data.

Although current research and ongoing studies lean towards the development of supervised algorithms, it is highly recommended to explore the potential and learning capability of unsupervised methods.

This is mainly for the following reasons:

  1. Generating ground-truth human-labeled training data is highly expensive and time-consuming
  2. A video can have multiple valid summaries, based on annotations from multiple human annotators
  3. These ground-truth summaries might differ significantly from one another, making it hard to train models using typical supervised methods

Evaluation Protocols and Measures

a. Evaluating video storyboards:

For static video summarization tasks, the first key-frame-based summaries were evaluated through human judgment: judges assessed the relevance of each key-frame to the video content, the amount of redundant or missing information, and the informativeness of the produced synopsis. However, this technique was time-consuming and not completely reliable, so subsequent work adopted objective measures and ground-truth summaries for evaluation.

A few measures have been proposed, the most commonly used being “Comparison of User Summaries”, which evaluates the generated summary according to its overlap with predefined key-frame-based user summaries. The comparison is performed on a key-frame basis, and the quality of the generated summary is quantified by computing accuracy and error rates from the numbers of matched and non-matched key-frames, respectively. Depending on the problem at hand, Precision, Recall, and F-score are sometimes preferred over plain accuracy.
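Assuming the key-frame matches between the generated summary and a user summary have already been established (typically via a visual-similarity threshold), these measures reduce to a few lines; the function name below is illustrative.

```python
def keyframe_fscore(num_matched, num_generated, num_user):
    """Precision, recall, and F-score for a generated key-frame set vs. one user summary."""
    precision = num_matched / num_generated if num_generated else 0.0
    recall = num_matched / num_user if num_user else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# e.g., 8 of 12 generated key-frames match a 10-key-frame user summary
print(keyframe_fscore(8, 12, 10))  # approximately (0.667, 0.8, 0.727)
```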

b. Evaluating video skims:

For dynamic video summarization tasks, the commonly used evaluation methodologies assess the quality of video skims by their alignment with human preferences, relying mostly on objective measures: the video is divided into consecutive, non-overlapping fragments so that the user-generated summaries can be compared with the automatically created one.

Based on the scores computed by a video summarization algorithm for the fragments of a given video, a subset of the key-fragments is then selected to form a concise summary. The final comparison between this summary and the user-created summaries is computed as pairwise F-scores.
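Representing each summary as a binary per-frame mask makes this pairwise comparison straightforward. In common benchmark practice (e.g., SumMe and TVSum), the per-video result is the maximum or the average F-score over all user summaries; the helper names below are illustrative.

```python
import numpy as np

def skim_fscore(machine_mask, user_mask):
    """F-score between two binary per-frame selection masks."""
    overlap = np.logical_and(machine_mask, user_mask).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / machine_mask.sum()
    recall = overlap / user_mask.sum()
    return 2 * precision * recall / (precision + recall)

def evaluate_skim(machine_mask, user_masks, reduce="max"):
    """Pairwise F-scores against every user summary, reduced to a single number."""
    scores = [skim_fscore(machine_mask, u) for u in user_masks]
    return max(scores) if reduce == "max" else sum(scores) / len(scores)
```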

Future Scope

Given the state of the art in automatic generation of video summaries, future work should primarily target the following aspects:

  1. Research work and developments in the field of unsupervised video summarization that uses a combination of adversarial and reinforcement learning.
  2. Development of multimodal summarization approaches that calculate importance according to the visual and audio modality of the video and use audio segmentation for video summary generation.
  3. Advanced multi-head attention mechanisms for better estimation of variable-range temporal dependencies among parts of the video.
  4. Extension of LSTM architectures with high-capacity memory networks, to capture long-range dependencies of the visual content.
  5. Combinations of architectures that use both 3D-CNNs and convolutional LSTMs to model the spatiotemporal structure of the video.
  6. Use of augmented training data in combination with curriculum learning approaches to improve the learning of unsupervised video summarization.
  7. Development of better evaluation measures for accurate performance comparison of different summarization methods.
Illustration of Video Summarization based on different types of features (Photo Credit)

Conclusion

All in all, video summarization is an exciting area of AI research with a lot of potential for further study. The review of different deep-learning-based approaches makes it evident that the techniques have evolved considerably over the last few years and that great potential lies ahead.

An analysis of the best-performing supervised models shows that they learn frame importance by modeling the variable-range temporal dependency among video frames/fragments with recurrent neural networks and tailored attention mechanisms. These models perform even better when the memorization capacity of the LSTMs is extended with memory networks.

Among the best-performing unsupervised models, those using Generative Adversarial Networks (GANs) to learn how to generate a representative video summary show promising outputs, making them a forerunner in this area, which therefore deserves further study.

GANs could work particularly well for heterogeneous collections of videos that follow no common pattern and differ greatly from each other. These unsupervised techniques could be just the start of a new era in deep-learning-based video summarization!

Unsupervised Learning using GANs (Photo Credit)

Furthermore, knowing how difficult it is to create large-scale human-annotated datasets for training summarization models in a supervised way, further research should focus on developing unsupervised video summarization methods that eliminate the need for manually annotated datasets.

Beyond these proposals, it is time to start implementing these methodologies in real-life applications, allowing video streaming websites to enhance the user experience and facilitate time-efficient video viewing.

I hope this article inspires you to further explore the area of Video Summarization using Deep Learning based methods.

Happy learning! :)
