Explaining video summarization based on
the focus of attention
E. Apostolidis1,2, G. Balaouras1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
24th IEEE International Symposium
on Multimedia (ISM 2022)
2
• Explainable video summarization: why is it important?
• Related work
• Proposed method
• Experimental evaluations
• Conclusions
Overview
3
Current practice for producing a video summary
• An editor has to watch the entire video
content and decide which parts should be
included in the summary
• Different summaries of the same video
could be needed for distribution via
different communication channels
• Laborious task that can be significantly
accelerated by video summarization
technologies
Image source: https://www.premiumbeat.com/
blog/3-tips-for-difficult-video-edit/
4
Goal of video summarization technologies
“Generate a short visual synopsis that summarizes the video content
by selecting the most informative and important parts of it”
This synopsis can be made of:
• A set of representative video key-fragments (a.k.a. video skim)
• A set of representative video key-frames (a.k.a. video storyboard)
[Figure: the video content is analyzed into key-fragments and key-frames, which form the video skim and the video storyboard, respectively]
Video title: “Susan Boyle's First Audition -
I Dreamed a Dream - Britain's Got Talent 2009”
Video source: https://www.youtube.com/watch?v=deRF9oEbRso
5
Why is explainable video summarization important?
• Video summarization technologies can
drastically reduce the needed resources
for video summary production in terms
of both time and human effort
• However, their outcome needs to be
curated by the editor to ensure that all
needed parts have been selected
• Content curation could be facilitated if
the editor gets explanations about the
suggestions of the used technology
Image source: https://www.appier.com/en/blog/
what-is-supervised-learning
Such explanations would increase the editor’s trust in the used technology,
thus facilitating and accelerating content curation
6
Works on explainable networks for video analysis tasks
• (Aakur, 2018) extraction of explainable representations for video activity interpretation
• (Bargal, 2018) spatio-temporal cues contributing to network’s classification/captioning
output, to spot fragments linked to specific action/phrase from caption
• (Zhuo, 2019) spatio-temporal graph of semantic-level video states and state transition
analysis for video action reasoning
• (Stergiou, 2019) heatmaps visualizing focus of attention and explaining networks for
action classification and recognition
• (Manttari, 2020) perturbation-based method to spot the video fragment with the
greatest impact on the video classification results
• (Li, 2021) generic perturbation-based method for spatio-temporally-smooth
explanations of video classification networks
• (Gkalelis, 2022) in-degrees of graph attention networks’ adjacency matrices to explain
video event recognition, in terms of salient objects and frames
7
Typical video summarization pipeline
1. Video frames are represented using pre-trained CNNs (e.g., GoogleNet)
2. Video summarization networks estimate the frames’ importance
3. Given a video fragmentation and a time budget, the video summary is formed
by selecting the fragments that maximize the total importance within the budget (Knapsack problem; see the sketch below)
Proposed method: Problem formulation
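The fragment-selection step above can be illustrated with a 0/1 Knapsack over fragment-level importance scores. The following is a minimal, illustrative sketch, assuming that a fragment's importance is the mean of its frames' scores and that the summary budget is 15% of the video length (a common convention for SumMe/TVSum); the function and variable names are not taken from any specific codebase.

```python
def select_fragments(frame_scores, fragments, budget_ratio=0.15):
    """Pick the fragments that maximize total importance within a length budget.

    frame_scores: per-frame importance estimates produced by the summarizer.
    fragments:    list of (start, end) frame indices, end exclusive.
    Returns the indices of the fragments forming the summary.
    """
    values = [sum(frame_scores[s:e]) / max(e - s, 1) for s, e in fragments]
    lengths = [e - s for s, e in fragments]
    budget = int(budget_ratio * len(frame_scores))

    # Classic dynamic-programming solution of the 0/1 Knapsack problem.
    n = len(fragments)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for cap in range(budget + 1):
            dp[i][cap] = dp[i - 1][cap]
            if lengths[i - 1] <= cap:
                cand = dp[i - 1][cap - lengths[i - 1]] + values[i - 1]
                if cand > dp[i][cap]:
                    dp[i][cap] = cand

    # Backtrack to recover which fragments were selected.
    selected, cap = [], budget
    for i in range(n, 0, -1):
        if dp[i][cap] != dp[i - 1][cap]:
            selected.append(i - 1)
            cap -= lengths[i - 1]
    return sorted(selected)
```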
8
Explanation’s goal/output
• A video-fragment-level explanation mask indicating the most influential video
fragments for the network’s estimates about the frames’ importance
Assumptions
• Video is split into fixed-size fragments; the summary is made of the M top-scoring ones
Proposed method: Problem formulation
9
• Studied in the NLP domain [Jain, 2019; Serrano, 2019; Wiegreffe, 2019;
Kobayashi, 2020; Chrysostomou, 2021; Liu, 2022] and elsewhere
• Can it be used to explain attention-based video summarization networks?
• Various possible explanation signals can be formed using the Attention matrix
Proposed method: Attention as explanation
10
Attention-based explanation signals
• Inherent Attention (IA): $\{a_{i,i}\}_{i=1}^{T}$
• Grad of Attention (GoA): $\{\nabla a_{i,i}\}_{i=1}^{T}$
• Grad Attention (GA): $\{a_{i,i} \cdot \nabla a_{i,i}\}_{i=1}^{T}$
• Input Norm Attention (NA): $\{\|a_{i,i} \cdot \mathbf{v}_i\|\}_{i=1}^{T}$
• Input Norm Grad Attention (NGA): $\{\|a_{i,i} \cdot \nabla a_{i,i} \cdot \mathbf{v}_i\|\}_{i=1}^{T}$
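As a rough illustration of how these fragment-level signals could be obtained from a trained attention-based summarizer, the sketch below derives them from the diagonal of the attention matrix, its gradient, and the norms of the input feature vectors, and then averages frame-level values over fixed-size fragments. The tensor names and the exact use of the vector norm are assumptions made for illustration, not the CA-SUM internals.

```python
import torch

def explanation_signals(attn, attn_grad, feats, frag_size=20):
    """attn, attn_grad: (T, T) attention matrix and its gradient (computed by the caller).
    feats: (T, D) input frame features. Returns one score per fragment for each signal."""
    a = torch.diagonal(attn)        # a_{i,i}
    g = torch.diagonal(attn_grad)   # gradient of a_{i,i}
    v = feats.norm(dim=1)           # ||v_i|| (assumed use of the input norm)

    frame_signals = {
        "IA":  a,          # Inherent Attention
        "GoA": g,          # Grad of Attention
        "GA":  a * g,      # Grad Attention
        "NA":  a * v,      # Input Norm Attention
        "NGA": a * g * v,  # Input Norm Grad Attention
    }

    # Average the frame-level values over fixed-size fragments (e.g., 20 frames).
    T = a.shape[0]
    frags = [(s, min(s + frag_size, T)) for s in range(0, T, frag_size)]
    return {name: torch.stack([sig[s:e].mean() for s, e in frags])
            for name, sig in frame_signals.items()}
```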
11
Replacement functions
• Slice out: completely removes the specified part
• Input Mask: replaces the specified part with a mask composed of black/white
frames’ feature representations
• Randomization: replaces 50% of the elements of each feature representation
within the specified part
• Attention Mask: sets the attention weights associated with the specified part
equal to zero
Modeling network’s input-output relationship
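The four replacement functions above can be sketched as simple operations on a (T, D) matrix of frame features (or on the attention matrix, for the last one). This is a hedged illustration under stated assumptions: the black/white-frame representation is assumed to be a precomputed feature vector supplied by the caller, the randomized elements are filled with random values, and zeroing "the attention weights associated with the specified part" is read here as zeroing the corresponding rows and columns.

```python
import numpy as np

def slice_out(feats, start, end):
    """Completely remove the frames of the specified fragment."""
    return np.concatenate([feats[:start], feats[end:]], axis=0)

def input_mask(feats, start, end, mask_vector):
    """Replace the fragment with the (precomputed) feature representation
    of black/white frames, passed in as a (D,) mask_vector."""
    out = feats.copy()
    out[start:end] = mask_vector
    return out

def randomization(feats, start, end, ratio=0.5, rng=None):
    """Replace `ratio` of the elements of each frame feature in the fragment
    (here with random values; an assumption, as the slide does not specify)."""
    rng = rng if rng is not None else np.random.default_rng()
    out = feats.copy()
    d = feats.shape[1]
    for i in range(start, end):
        idx = rng.choice(d, size=int(ratio * d), replace=False)
        out[i, idx] = rng.standard_normal(len(idx))
    return out

def attention_mask(attn, start, end):
    """Zero the attention weights associated with the fragment
    (rows and columns of the (T, T) attention matrix; an assumed reading)."""
    out = attn.copy()
    out[start:end, :] = 0.0
    out[:, start:end] = 0.0
    return out
```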
12
• Quantify the kth video fragment’s influence on the network’s output, based on
the Difference of Estimates
$\Delta E(\mathbf{X}, \mathbf{X}_k) = \tau(\mathbf{y}, \mathbf{y}_k)$
• $\mathbf{X}$: original feature vectors
• $\mathbf{X}_k$: updated feature vectors after replacing the kth fragment
• $\mathbf{y}$: network’s output for $\mathbf{X}$
• $\mathbf{y}_k$: network’s output for $\mathbf{X}_k$
• $\tau$: Kendall’s τ correlation coefficient
Quantifying video fragment’s influence
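Given a model and one of the replacement functions sketched earlier, the ΔE values can be measured with an off-the-shelf Kendall's τ, e.g. from scipy. Below is a minimal sketch with illustrative names only; note that with "Slice out" the removed positions must also be dropped from the original output before comparing, since Kendall's τ expects sequences of equal length.

```python
from scipy.stats import kendalltau

def fragment_influence(model, feats, fragments, replace_fn):
    """Compute DE(X, X_k) = tau(y, y_k) for every fragment.

    model:      callable mapping a (T, D) feature matrix to per-frame scores y.
    fragments:  list of (start, end) frame indices, end exclusive.
    replace_fn: replacement function with signature (feats, start, end),
                e.g. functools.partial(input_mask, mask_vector=...) from the sketch above.
    """
    y = model(feats)
    deltas = []
    for start, end in fragments:
        y_k = model(replace_fn(feats, start, end))
        tau, _ = kendalltau(y, y_k)  # high tau = the output barely changed
        deltas.append(tau)
    return deltas
```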
13
• Discoverability+ (D+): evaluates whether fragments with high explanation scores have a
significant influence on the network’s estimates; D+ = Mean(ΔE) after replacing the
top-1%, 5%, 10%, 15%, 20% of fragments (batch) and the 5 top-scoring fragments (1-by-1)
• Discoverability- (D-): evaluates whether fragments with low explanation scores have a
small influence on the network’s estimates; D- = Mean(ΔE) after replacing the bottom-
1%, 5%, 10%, 15%, 20% of fragments (batch) and the 5 lowest-scoring fragments (1-by-1)
• Sanity Violation (SV): quantifies the ability of explanations to discriminate important
from unimportant video fragments; SV = % of cases where the sanity test (D+ indicating a
greater influence than D-) is violated
• Rank Correlation (RC): measures the (Spearman) correlation between fragment-level
explanation scores and the ΔE values obtained after replacing each fragment (see the
sketch after this slide)
Experiments: Evaluation measures
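SV and RC can then be computed from the per-fragment explanation scores and the measured ΔE values. The sketch below is one possible reading, under the assumption that ΔE is the Kendall's τ defined earlier (so a more influential fragment yields a lower ΔE); the exact convention used by the authors may differ.

```python
from scipy.stats import spearmanr

def sanity_violation(d_plus_list, d_minus_list):
    """Percentage of cases where replacing the top-scoring fragments does NOT
    change the output more than replacing the bottom-scoring ones
    (with DE = Kendall's tau, "changes more" means a lower mean DE; assumption)."""
    violations = [dp >= dm for dp, dm in zip(d_plus_list, d_minus_list)]
    return 100.0 * sum(violations) / len(violations)

def rank_correlation(explanation_scores, delta_e_per_fragment):
    """Spearman correlation between fragment-level explanation scores and the
    DE values obtained after replacing each fragment one by one."""
    rho, _ = spearmanr(explanation_scores, delta_e_per_fragment)
    return rho
```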
14
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
• 25 videos capturing multiple events (e.g., cooking and sports) from
first-person and third-person view
• Video length: 1 to 6 min
TVSum (https://github.com/yalesong/tvsum)
• 50 videos of various genres (e.g., news, “how-to”, documentary, vlog,
egocentric) from 10 categories of the TRECVid MED task
• Video length: 1 to 11 min
Experiments: Datasets
15
• Frame sampling: 2 fps
• Feature extraction: GoogleNet (pool5 layer) trained on ImageNet
• Highlighted fragments in explanation mask: 5
• Size of video fragments: 20 frames (10 sec)
• Data splits: 5
• Video summarization network: CA-SUM (Apostolidis, 2022); trained models
on SumMe and TVSum, available at: https://zenodo.org/record/6562992
Experiments: Evaluation settings
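For context, the frame-feature extraction named above (GoogleNet pool5 features of frames sampled at 2 fps) typically looks like the torchvision-based sketch below; the preprocessing values are the standard ImageNet ones and are an assumption, since the slide does not specify them.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# GoogLeNet trained on ImageNet; we keep the 1024-d output of the final
# average-pooling layer (the "pool5" features) via a forward hook.
net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

_features = {}
net.avgpool.register_forward_hook(
    lambda module, inputs, output: _features.update(pool5=output.flatten(1)))

def frame_features(pil_image):
    """Return the 1024-d pool5 feature vector of one sampled video frame."""
    with torch.no_grad():
        net(preprocess(pil_image).unsqueeze(0))
    return _features["pool5"].squeeze(0)  # shape: (1024,)
```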
16
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
Explanations formed using
the attention weights are the
most competitive on both
datasets
On average they achieve higher D- and lower D+ scores, and pass the sanity test
in ~66% and 80% of cases on SumMe and TVSum, respectively
20
20
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
Explanations formed using the norm-weighted attention signals are also good,
but less effective, especially in terms of SV
20
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
The use of gradients to form
explanations results in clearly
worse performance
Sanity test is violated in 56%
and 82% of cases on SumMe
and TVSum, respectively
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
Explanations formed using the
attention weights are the
best-performing ones
Pass the sanity test in 65%
and 80% of cases on SumMe
and TVSum, respectively
Assign scores that are more
representative of each
fragment’s influence
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
Explanations formed using the norm-weighted attention signals also perform well
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
Explanations formed using
gradients typically perform
worse
Violate the sanity test in 57%
and 80% of cases on SumMe
and TVSum, respectively
Assign explanation scores that are uncorrelated or negatively correlated with
the fragments’ influence on the network’s output
22
Experimental results: Quantitative analysis
Replacement in batch mode Replacement in 1-by-1 mode
The use of the inherent attention weights to form explanations
for the CA-SUM model is the best option
23
Experimental results: Qualitative analysis
Video summary (blue bounding boxes)
• Mainly associated with the dog (4 / 5 selected fragments)
• Contains visually diverse fragments showing the dog (3 / 4 are clearly different)
• Contains a fragment showing the dog’s owner (to further increase diversity)
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
23
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the dog
• Pays less attention to speaking persons, dog products, and the pet store
• Models the video’s context based on the dog
24
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
Video summary (blue bounding boxes)
• Associated with the motorcycle riders doing tricks (5 / 5 selected fragments)
• Contains visually diverse fragments (all fragments are clearly different)
24
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the tricks performed by the riders
• Pays less attention to the logo of the TV-show and the interview
25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue
boxes indicate the 5 (most important) fragments of the video summary
Video summary (blue bounding boxes)
• Contains parts showing the bird and the courtyard (e.g., paving, chair)
• Misses parts showing the dog and the bird playing together
25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue
boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the courtyard (3 / 5 fragments)
• Pays less attention to parts showing the dog and the bird playing (1 fragment)
25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue
boxes indicate the 5 (most important) fragments of the video summary
Forming explanations as proposed can provide useful clues about the focus
of attention and assist the explanation of the video summarization results
26
Concluding remarks
• First attempt at explaining the outcomes of video summarization networks
• Focused on attention-based network architectures and considered several related
explanation signals studied in the NLP domain and elsewhere
• Introduced evaluation measures to assess the explanations’ ability to spot the most
and least influential parts of the video for the network’s predictions
• Modeled network’s input-output relationship using various replacement functions
• Conducted experiments using the CA-SUM network, and SumMe and TVSum
datasets for video summarization
• Using the attention weights to form explanations as proposed allows spotting the
focus of the attention mechanism and assists the explanation of the summarization results
27
References
• S. N. Aakur et al., “An inherently explainable model for video activity interpretation,” in AAAI 2018
• E. Apostolidis et al., “Summarizing videos using concentrated attention and considering the
uniqueness and diversity of the video frames,” in ACM ICMR 2022
• S. A. Bargal et al., “Excitation backprop for RNNs,” in CVPR 2018
• G. Chrysostomou et al., “Improving the faithfulness of attention-based explanations with task-
specific information for text classification,” in ACL 2021
• N. Gkalelis et al., “ViGAT: Bottom-up event recognition and explanation in video using factorized
graph attention network,” IEEE Access, vol. 10, pp. 108797–108816, 2022
• S. Jain et al., “Attention is not Explanation,” in NAACL-HLT 2019
• G. Kobayashi et al., “Attention is not only a weight: Analyzing transformers with vector norms,” in
EMNLP 2020
27
References
• Z. Li et al., “Towards visually explaining video understanding networks with perturbation,” in
IEEE WACV 2021
• Y. Liu et al., “Rethinking attention-model explainability through faithfulness violation test,” in
ICML 2022, vol. 162
• J. Manttari et al., “Interpreting video features: A comparison of 3D conv. networks and conv.
LSTM networks,” in ACCV 2020
• S. Serrano et al., “Is attention interpretable?” in ACL 2019
• A. Stergiou et al., “Saliency tubes: Visual explanations for spatiotemporal convolutions,” in IEEE
ICIP 2019
• S. Wiegreffe et al., “Attention is not not explanation,” in EMNLP 2019
• T. Zhuo et al., “Explainable video action reasoning via prior knowledge and state transitions,” in
ACM MM 2019
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/XAI-SUM
This work was supported by the EU’s Horizon 2020 research and innovation programme
under grant agreement 951911 (AI4Media)