Presentation of the paper "Explaining video summarization based on the focus of attention", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at IEEE ISM 2022, Dec. 2022, Naples, Italy.
In this paper we propose a method for explaining video summarization. We start by formulating the problem as the creation of an explanation mask which indicates the parts of the video that most influenced the estimates of a video summarization network about the frames’ importance. Then, we explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation signals, and we examine various attention-based signals that have been studied as explanations in the NLP domain. We evaluate the performance of these signals by investigating the video summarization network’s input-output relationship according to different replacement functions, and by utilizing measures that quantify the capability of explanations to spot the most and least influential parts of a video. We run experiments using an attention-based network (CA-SUM) and two datasets (SumMe and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our method to explain the video summarization results using clues about the focus of the attention mechanism.
1.
Explaining video summarization based on
the focus of attention
E. Apostolidis1,2, G. Balaouras1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
24th IEEE International Symposium
on Multimedia (ISM 2022)
2. 2
Overview
• Explainable video summarization: why is it important?
• Related work
• Proposed method
• Experimental evaluations
• Conclusions
3. 3
Current practice for producing a video summary
• An editor has to watch the entire video content and decide which parts should be included in the summary
• Different summaries of the same video could be needed for distribution via different communication channels
• A laborious task that can be significantly accelerated by video summarization technologies
Image source: https://www.premiumbeat.com/blog/3-tips-for-difficult-video-edit/
4. 4
Goal of video summarization technologies
“Generate a short visual synopsis that summarizes the video content by selecting the most informative and important parts of it”
This synopsis can be made of:
• A set of representative video key-fragments (a.k.a. video skim)
• A set of representative video key-frames (a.k.a. video storyboard)
[Figure: the video content is analyzed to select key-fragments and key-frames, forming a video skim and a video storyboard, respectively]
Video title: “Susan Boyle's First Audition - I Dreamed a Dream - Britain's Got Talent 2009”
Video source: https://www.youtube.com/watch?v=deRF9oEbRso
5. 5
Why is explainable video summarization important?
• Video summarization technologies can drastically reduce the resources needed for video summary production, in terms of both time and human effort
• However, their outcome needs to be curated by the editor to ensure that all needed parts have been selected
• Content curation could be facilitated if the editor gets explanations about the suggestions of the used technology
Image source: https://www.appier.com/en/blog/what-is-supervised-learning
Such explanations would increase the editor’s trust in the used technology, thus facilitating and accelerating content curation
6. 6
Works on explainable networks for video analysis tasks
• (Aakur, 2018) extraction of explainable representations for video activity interpretation
• (Bargal, 2018) spatio-temporal cues contributing to the network’s classification/captioning output, used to spot fragments linked to a specific action or a phrase from the caption
• (Zhuo, 2019) spatio-temporal graph of semantic-level video states and state-transition analysis for video action reasoning
• (Stergiou, 2019) heatmaps visualizing the focus of attention and explaining networks for action classification and recognition
• (Manttari, 2020) perturbation-based method to spot the video fragment with the greatest impact on the video classification results
• (Li, 2021) generic perturbation-based method for spatio-temporally-smooth explanations of video classification networks
• (Gkalelis, 2022) in-degrees of graph attention networks’ adjacency matrices to explain video event recognition in terms of salient objects and frames
7. 7
Proposed method: Problem formulation
Typical video summarization pipeline
1. Video frames are represented using pre-trained CNNs (e.g., GoogleNet)
2. Video summarization networks estimate the frames’ importance
3. Given a video fragmentation and a time budget, the video summary is formed by selecting the fragments that maximize the summary’s total importance (a Knapsack problem; a minimal sketch follows below)
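To make step 3 concrete, here is a minimal, hypothetical sketch of Knapsack-based fragment selection in Python; the durations, scores and budget are illustrative, not the paper's actual settings.

```python
# Hypothetical fragment durations (in seconds), importance scores and time
# budget; classic 0/1 Knapsack dynamic programming over fragments.

def select_fragments(durations, scores, budget):
    """Pick the subset of fragments that maximizes total importance
    without exceeding the time budget."""
    # best[b] = (total score, selected fragment indices) achievable within budget b
    best = [(0.0, [])] * (budget + 1)
    for k, (d, s) in enumerate(zip(durations, scores)):
        for b in range(budget, d - 1, -1):   # downward: each fragment used at most once
            cand = best[b - d][0] + s
            if cand > best[b][0]:
                best[b] = (cand, best[b - d][1] + [k])
    return sorted(best[budget][1])

print(select_fragments([4, 6, 5, 7, 3], [0.9, 0.4, 0.7, 0.8, 0.2], budget=18))  # -> [0, 2, 3]
```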
8. 8
Proposed method: Problem formulation
Explanation’s goal/output
• A video-fragment-level explanation mask indicating the most influential video fragments for the network’s estimates of the frames’ importance (see the sketch below)
Assumptions
• The video is split into fixed-size fragments; the summary is made of the M top-scoring ones
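A minimal sketch of how such a fragment-level explanation mask could be built from per-fragment explanation scores; the function name and the toy scores are hypothetical.

```python
import numpy as np

def explanation_mask(fragment_scores, M=5):
    """Keep the M fragments with the highest explanation scores."""
    scores = np.asarray(fragment_scores)
    mask = np.zeros(scores.size, dtype=bool)
    mask[np.argsort(scores)[-M:]] = True   # True = influential fragment
    return mask

print(explanation_mask([0.2, 0.9, 0.1, 0.7, 0.4, 0.8, 0.3], M=3))
# -> [False  True False  True False  True False]
```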
9. 9
Proposed method: Attention as explanation
• Studied in the NLP domain [Jain, 2019; Serrano, 2019; Wiegreffe, 2019; Kobayashi, 2020; Chrysostomou, 2021; Liu, 2022] and elsewhere
• Can it be used on attention-based video summarization networks?
• Various possible explanation signals can be formed using the attention matrix (one such signal is sketched below)
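One plausible way to form such a signal from the attention matrix, sketched under assumptions: frames are scored by the average attention they receive (column mean), and frame scores are mean-pooled per fixed-size fragment. These aggregation choices are illustrative, not necessarily the paper's exact ones.

```python
import numpy as np

def fragment_explanation_scores(attention, fragment_size=20):
    # attention: (T, T) matrix; attention[i, j] = attention of frame i on frame j
    frame_scores = attention.mean(axis=0)            # attention received per frame
    T = len(frame_scores)
    n_frags = T // fragment_size
    return frame_scores[: n_frags * fragment_size].reshape(
        n_frags, fragment_size).mean(axis=1)         # one score per fragment

A = np.random.rand(200, 200)
A /= A.sum(axis=1, keepdims=True)                    # row-stochastic, like a softmax
print(fragment_explanation_scores(A))                # 10 fragment-level scores
```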
11. 11
Modeling the network’s input-output relationship
Replacement functions (sketched below)
• Slice out: completely removes the specified part
• Input mask: replaces the specified part with a mask composed of the feature representations of black/white frames
• Randomization: replaces 50% of the elements of each feature representation within the specified part
• Attention mask: sets the attention weights associated with the specified part equal to zero
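A hedged sketch of the four replacement functions; X holds the frame features (T x D), A the attention matrix (T x T), and frames s:e belong to the fragment being replaced. The black/white-frame feature and the values used for randomization are illustrative choices, not the paper's exact ones.

```python
import numpy as np

def slice_out(X, s, e):
    return np.delete(X, np.s_[s:e], axis=0)       # remove the fragment entirely

def input_mask(X, s, e, mask_feat):
    Xk = X.copy()
    Xk[s:e] = mask_feat                           # e.g., CNN feature of a black frame
    return Xk

def randomize(X, s, e, rng=np.random.default_rng(0)):
    Xk = X.copy()
    sel = rng.random(Xk[s:e].shape) < 0.5         # 50% of the elements, chosen at random
    Xk[s:e][sel] = rng.random(int(sel.sum()))     # replaced with random values (assumption)
    return Xk

def attention_mask(A, s, e):
    Ak = A.copy()
    Ak[:, s:e] = 0.0                              # zero the weights towards the fragment
    return Ak                                     # (which weights exactly is an assumption)
```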
12. 12
Quantifying a video fragment’s influence
• Quantify the k-th video fragment’s influence on the network’s output, based on the Difference of Estimates (see the sketch below):
ΔE(X, X_k) = τ(y, y_k)
• X: original feature vectors
• X_k: updated feature vectors after replacing the k-th fragment
• y: network’s output for X
• y_k: network’s output for X_k
• τ: Kendall’s τ correlation coefficient
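A minimal sketch of computing ΔE with SciPy's kendalltau; `model` and `replace_fn` are placeholders for the trained summarization network (frame features in, frame-importance scores out) and one of the replacement functions above.

```python
from scipy.stats import kendalltau

# Note: for "slice out" the removed positions must also be dropped from y so
# that both score sequences have equal length (our assumption).

def delta_E(model, X, k, replace_fn):
    y = model(X)                       # scores for the original features
    y_k = model(replace_fn(X, k))      # scores after replacing fragment k
    tau, _ = kendalltau(y, y_k)        # high τ -> small change -> low influence
    return tau
```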
13. 13
Experiments: Evaluation measures (a sketch of all four measures follows the list)
• Discoverability+ (D+): evaluates whether fragments with high explanation scores have a significant influence on the network’s estimates; D+ = Mean(ΔE) after replacing the top-1%, 5%, 10%, 15%, 20% of fragments (batch) and the 5 top-scoring fragments (1-by-1)
• Discoverability- (D-): evaluates whether fragments with low explanation scores have a small influence on the network’s estimates; D- = Mean(ΔE) after replacing the bottom-1%, 5%, 10%, 15%, 20% of fragments (batch) and the 5 lowest-scoring fragments (1-by-1)
• Sanity Violation (SV): quantifies the ability of explanations to discriminate important from unimportant video fragments; SV = % of cases where the sanity test is violated (D+ > D-)
• Rank Correlation (RC): measures the (Spearman) correlation between the fragment-level explanation scores and the ΔE values obtained after replacing each fragment
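A sketch of how these measures could be computed from collected ΔE values for a single video; since ΔE is a correlation, strong influence shows up as a LOW ΔE, so the sanity-test direction below reflects our reading of the definitions above.

```python
import numpy as np
from scipy.stats import spearmanr

# dE_top: ΔE values after replacing the top-scored fragments
# dE_bottom: ΔE values after replacing the bottom-scored fragments
# dE_all: ΔE after replacing each fragment; expl_scores: per-fragment scores

def evaluation_measures(dE_top, dE_bottom, dE_all, expl_scores):
    d_plus = float(np.mean(dE_top))          # Discoverability+
    d_minus = float(np.mean(dE_bottom))      # Discoverability-
    violated = d_plus > d_minus              # counts towards Sanity Violation (SV)
    rc, _ = spearmanr(expl_scores, dE_all)   # Rank Correlation (RC)
    return d_plus, d_minus, violated, rc
```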
14. 14
Experiments: Datasets
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
• 25 videos capturing multiple events (e.g., cooking and sports) from first-person and third-person view
• Video length: 1 to 6 min
TVSum (https://github.com/yalesong/tvsum)
• 50 videos of various genres (e.g., news, “how-to”, documentary, vlog, egocentric) from 10 categories of the TRECVid MED task
• Video length: 1 to 11 min
15. 15
Experiments: Evaluation settings
• Frame sampling: 2 fps
• Feature extraction: GoogleNet (pool5 layer) trained on ImageNet
• Highlighted fragments in the explanation mask: 5
• Size of video fragments: 20 frames (10 sec)
• Data splits: 5
• Video summarization network: CA-SUM (Apostolidis, 2022); trained models on SumMe and TVSum available at: https://zenodo.org/record/6562992
17. 20
Experimental results: Quantitative analysis
Fragments’ replacement in batch mode
Explanations formed using the attention weights are the most competitive on both datasets
On average, they achieve higher D- and lower D+ scores, and pass the sanity test in ~66% and ~80% of cases on SumMe and TVSum, respectively
18. 20
Experimental results: Quantitative analysis
Fragments’ replacement in batch mode
Explanations formed using the norm-based weighted attention weights are also good, but less effective, especially in terms of SV
19. 20
Experimental results: Quantitative analysis
Fragments’ replacement in batch mode
The use of gradients to form explanations results in clearly worse performance
The sanity test is violated in 56% and 82% of cases on SumMe and TVSum, respectively
21. 21
Experimental results: Quantitative analysis
Fragments’ replacement in 1-by-1 mode
Explanations formed using the attention weights are the best-performing ones
They pass the sanity test in 65% and 80% of cases on SumMe and TVSum, respectively
They assign scores that are more representative of each fragment’s influence
22. 21
Experimental results: Quantitative analysis
Fragments’ replacement in 1-by-1 mode
Explanations formed using the norm-based weighted attention weights also perform well
23. 21
Experimental results: Quantitative analysis
Fragments’ replacement in 1-by-1 mode
Explanations formed using gradients typically perform worse
They violate the sanity test in 57% and 80% of cases on SumMe and TVSum, respectively
They assign explanation scores that are neutrally or negatively correlated with the fragments’ influence on the network
24. 22
Experimental results: Quantitative analysis
[Result tables: replacement in batch mode | replacement in 1-by-1 mode]
The use of the inherent attention weights to form explanations for the CA-SUM model is the best option
25. 23
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue boxes indicate the 5 (most important) fragments of the video summary
Video summary (blue bounding boxes)
• Mainly associated with the dog (4 / 5 selected fragments)
• Contains visually diverse fragments showing the dog (3 / 4 are clearly different)
• Contains a fragment showing the dog’s owner (to further increase diversity)
26. 23
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the dog
• Pays less attention to speaking persons, dog products, and the pet store
• Models the video’s context based on the dog
27. 24
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue boxes indicate the 5 (most important) fragments of the video summary
Video summary (blue bounding boxes)
• Associated with the motorcycle riders doing tricks (5 / 5 selected fragments)
• Contains visually diverse fragments (all fragments are clearly different)
28. 24
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the tricks performed by the riders
• Pays less attention to the logo of the TV show and the interview
29. 25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue boxes indicate the 5 (most important) fragments of the video summary
Video summary (blue bounding boxes)
• Contains parts showing the bird and the courtyard (e.g., paving, chair)
• Misses parts showing the dog and the bird playing together
30. 25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the courtyard (3 / 5 fragments)
• Pays less attention to parts showing the dog and the bird playing (1 fragment)
31. 25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue boxes indicate the 5 (most important) fragments of the video summary
Forming explanations as proposed can lead to useful clues about the focus of attention, and assist the explanation of the video summarization results
32. 26
Concluding remarks
• First attempt at explaining the outcomes of video summarization networks
• Focused on attention-based network architectures and considered several related explanation signals studied in the NLP domain and elsewhere
• Introduced evaluation measures to assess the explanations’ ability to spot the most and least influential parts of the video for the network’s predictions
• Modeled the network’s input-output relationship using various replacement functions
• Conducted experiments using the CA-SUM network, and the SumMe and TVSum datasets for video summarization
• Using the attention weights to form explanations as proposed makes it possible to spot the focus of the attention mechanism and assists the explanation of the summarization results
33. 27
References
• S. N. Aakur et al., “An inherently explainable model for video activity interpretation,” in AAAI 2018
• E. Apostolidis et al., “Summarizing videos using concentrated attention and considering the uniqueness and diversity of the video frames,” in ACM ICMR 2022
• S. A. Bargal et al., “Excitation backprop for RNNs,” in CVPR 2018
• G. Chrysostomou et al., “Improving the faithfulness of attention-based explanations with task-specific information for text classification,” in ACL 2021
• N. Gkalelis et al., “ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network,” IEEE Access, vol. 10, pp. 108797-108816, 2022
• S. Jain et al., “Attention is not Explanation,” in NAACL-HLT 2019
• G. Kobayashi et al., “Attention is not only a weight: Analyzing transformers with vector norms,” in EMNLP 2020
34. 27
References
• Z. Li et al., “Towards visually explaining video understanding networks with perturbation,” in IEEE WACV 2021
• Y. Liu et al., “Rethinking attention-model explainability through faithfulness violation test,” in ICML 2022, vol. 162
• J. Manttari et al., “Interpreting video features: A comparison of 3D conv. networks and conv. LSTM networks,” in ACCV 2020
• S. Serrano et al., “Is attention interpretable?” in ACL 2019
• A. Stergiou et al., “Saliency tubes: Visual explanations for spatiotemporal convolutions,” in IEEE ICIP 2019
• S. Wiegreffe et al., “Attention is not not explanation,” in EMNLP 2019
• T. Zhuo et al., “Explainable video action reasoning via prior knowledge and state transitions,” in ACM MM 2019
35. Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/XAI-SUM
This work was supported by the EU’s Horizon 2020 research and innovation programme under grant agreement 951911 AI4Media