retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Section I.1: Video summarization
problem definition and literature
overview
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Tutorial’s structure and time schedule
2
Part I: Automatic video summarization
 Section I.1: Video summarization problem definition and literature overview (20’)
 Q&A (5’)
 Section I.2: In-depth discussion on a few unsupervised GAN-based methods (20’)
 Q&A (5’)
 Section I.3: Datasets, evaluation protocols and results, and future directions (20’)
20’ Q&A and break, then we are back with the tutorial’s Part II: Video summaries re-use and
recommendation
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
3
Video is everywhere!
Problem definition
Hours of video content uploaded on
YouTube every minute
 Captured by smart-devices and instantly
shared online
 Constantly and rapidly increasing volumes of video content
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-
age-video-sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
5
But how to find what we are looking for in endless collections of video content?
Problem definition - video consumption side
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
Quickly inspect a video’s
content by checking its
synopsis!
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
7
But how to reach different audiences for a given media item?
Problem definition - video editing side
Image source: https://marketingland.com/social-media-audience-critical-content-marketing-223647
Use of technologies for content adaptation, re-use and re-purposing!
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
8
Video summary: a short visual summary that encapsulates the flow of the story and
the essential parts of the full-length video
Original video
Video summary (storyboard)
Problem definition
Source: https://www.youtube.com/watch?v=deRF9oEbRso
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
9
Problem definition
General applications of video summarization
 Professional CMS: effective indexing,
browsing, retrieval & promotion of media
assets!
 Video sharing platforms: improved viewer
experience, enhanced viewer engagement &
increased content consumption!
Source: https://www.redbytes.in/how-to-build-an-app-like-hotstar/ Source: Screenshot of the BBC News channel on YouTube
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
10
Problem definition
General applications of video summarization
Audience- and channel-specific content adaptation: video content re-use and re-distribution in
the most appropriate way!
Image source: https://www.databagg.com/online-video-sharing
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
11
Problem definition
Domain-specific applications of video summarization
Full movie (e.g. 1h 30’-2h) Movie trailer (2’30’’)
J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer
Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–1808.
Source: https://www.youtube.com/watch?v=wb49-oV0F78
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
12
Problem definition
Domain-specific applications of video summarization
Full game (e.g. 1h 30’)
Game’s synopsis & highlights (1’32’’)
Source: https://www.youtube.com/watch?v=oo-2IFTifUU
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
13
Problem definition
Domain-specific applications of video summarization
Video samples extracted from: https://www.youtube.com/watch?v=gk3qTMlcadk
Raw CCTV material (e.g. 24h) Summary of important actions/events (with timestamps)
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
14
Literature overview
Taxonomy of deep learning
based methods for automatic
video summarization
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
15
Literature overview
Supervised approaches: using video semantics and metadata
 [Zhang, 2016; Kaufman, 2017] learn and transfer the summary structure of
semantically-similar videos
 [Panda, 2017] metadata-driven video categorization and summarization by
maximizing relevance with the video category
 [Song, 2016; Zhou, 2018a] category-driven summarization by category feature
preservation (keep main parts of a wedding when summarizing a wedding video)
 [Otani, 2016; Yuan, 2019] maximize relevance of visual (video) and textual
(metadata) data in a common latent space
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
16
Literature overview
Supervised approaches: considering temporal structure and dependency
 [Zhang, 2016b] estimate frames’ importance by modeling their variable-range
temporal dependency using RNNs
 [Zhao, 2018] models and encodes the temporal structure of the video for
defining the key-fragments using hierarchies of RNNs
 [Ji, 2019] video-to-summary as a sequence-to-sequence learning problem using
attention-driven encoder-decoder network
 [Feng, 2018; Wang, 2019] estimate frames’ importance by modeling their long-
range dependency using high-capacity memory networks
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
17
Literature overview
Supervised approaches: imitating human summaries
 [Zhang, 2019] summarization by confusing a trainable discriminator when making
the distinction between a machine- and a human-generated summary; model the
variable-range temporal dependency using RNNs and Dilated Temporal Units
 [Fu, 2019] key-fragment selection by confusing a trainable discriminator when distinguishing between machine-selected and human-selected key-fragments; fragmentation based on an attention-based Pointer Network, and discrimination using a 3D-CNN classifier
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
18
Literature overview
Supervised approaches: targeting specific properties of the summary
 [Chu, 2019] models spatiotemporal information based on raw frames and optical
flow maps, and learns frames’ importance from human annotations via a label
distribution learning process
 [Elfeki, 2019] uses CNNs and RNNs to form spatiotemporal feature vectors and estimates the level of activity and importance of each frame to create the summary
 [Chen, 2019] summarization based on reinforcement learning and reward functions associated with the diversity and representativeness of the video summary
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
19
Literature overview
Unsupervised approaches: inferring the original video
 [Mahasseni, 2017] SUM-GAN trains a summarizer to fool a discriminator when
distinguishing the original from the summary-based reconstructed video using
adversarial learning
 [Jung, 2019] CSNet extends [Mahasseni, 2017] with a chunk and stride network and
attention mechanism to assess variable-range dependencies and select the video key-
frames
 [Apostolidis, 2020] SUM-GAN-AAE extends [Mahasseni, 2017] with a stepwise, fine-
grained training strategy and an attention auto-encoder to improve the key-fragment
selection process
 [Rochan, 2019] UnpairedVSN learns video summarization from unpaired data based on
an adversarial process that defines a mapping function of a raw video to a human
summary
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
20
Literature overview
Unsupervised approaches: targeting specific properties of the summary
 [Zhou, 2018b] DR-DSN learns to create representative and diverse summaries via
reinforcement learning and relevant reward functions
 [Gonuguntla, 2019] EDSN extracts spatiotemporal information and learns
summarization by rewarding the maintenance of main spatiotemporal patterns in
the summary
 [Zhang, 2018] OnlineMotionAE extracts the key motions of appearing objects and
uses an online motion auto-encoder model to generate summaries that include the
main objects in the video and the attractive actions made by each of these objects
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
 DL-based video summarization methods mainly rely on combinations of CNNs and RNNs
 Pre-trained CNNs are used to represent the visual content; RNNs (mostly LSTMs) are used to
model the temporal dependency among video frames
 The proposed video summarization approaches are mostly supervised
 Best supervised approaches utilize tailored attention mechanisms or memory networks to
capture variable- and long-range temporal dependencies respectively
 For unsupervised video summarization, GANs are the central direction, while RL is another, less common approach
 Best unsupervised approaches rely on VAE-GAN architectures that have been enhanced with
attention mechanisms
Some concluding remarks
21
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
 The generation of ground-truth data can be an expensive and laborious process
 Video summarization is a subjective task and multiple summaries can be proposed for a video
 Human annotations that vary a lot make it hard to train a method with typical supervised training approaches
 Unsupervised video summarization algorithms overcome the need for ground-truth data and can be trained using only an adequately large collection of videos
 Unsupervised learning makes it possible to train a summarization method on different types of video content (e.g. TV shows, news) and then perform content-specific video summarization
Some concluding remarks
23
Unsupervised video summarization has great advantages, increases the applicability
of summarization technologies, and its potential should be investigated
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Short break; coming up:
Section I.2: Discussion on a few
unsupervised GAN-based
methods
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Section I.2: Discussion on a few
unsupervised GAN-based
methods
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN method [Mahasseni, 2017]
 Problem formulation: video summarization via selecting a
sparse subset of frames that optimally represent the video
 Main idea: learn summarization by minimizing the distance
between videos and a distribution of their summarizations
 Goal: select a set of keyframes such that a distance between
the deep representations of the selected keyframes and the
video is minimized
 Challenge: how to define a good distance?
 Solution: use a Discriminator network and train it with the
Summarizer in an adversarial manner
28
B. Mahasseni, M. Lam and S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks," 2017 IEEE
CVPR, Honolulu, HI, 2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318.
Courtesy of
Mahasseni et al.
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN method [Mahasseni, 2017]
 Deep features of video frames in Frame Selector
=> normalized importance scores
 Weighted features in Encoder => latent
representation e
 Latent representation e in Decoder => sequence of
features for the frames of input video
 Original & reconstructed features in Discriminator
=> distance estimation and binary classification as
“video” or “summary”
29
Training pipeline and loss functions
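To make the data flow of these components concrete, here is a minimal PyTorch-style sketch of the Frame Selector and the Discriminator; the layer sizes, the bidirectional LSTM and the 1024-dimensional (e.g. GoogLeNet pool5) frame features are illustrative assumptions, not the exact configuration of [Mahasseni, 2017].

```python
# Illustrative sketch only: dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class FrameSelector(nn.Module):
    """Maps per-frame CNN features to normalized importance scores in [0, 1]."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.score = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, feats):               # feats: (1, T, feat_dim)
        h, _ = self.lstm(feats)
        return self.score(h)                # (1, T, 1) importance scores

class Discriminator(nn.Module):
    """Classifies a feature sequence as original 'video' or reconstructed 'summary'."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.cls = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats):
        h, _ = self.lstm(feats)
        return self.cls(h[:, -1])           # probability that the sequence is an original video

# One forward pass (encoder/decoder stand for the VAE of SUM-GAN):
#   scores        = FrameSelector()(feats)       # frame-level importance
#   weighted      = scores * feats               # importance-weighted features
#   e             = encoder(weighted)            # latent representation e
#   reconstructed = decoder(e)                   # reconstructed feature sequence
#   p_video, p_summary = D(feats), D(reconstructed)
```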
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN method [Mahasseni, 2017]
 Train Frame Selector and Encoder by minimizing L_sparsity + L_prior + L_reconst
 Train Decoder by minimizing L_reconst + L_GAN
 Train Discriminator by maximizing L_GAN
 Update all components via backward propagation
using Stochastic Gradient Variational Bayes
estimation
34
Training pipeline and loss functions
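To illustrate which loss updates which component, the schematic training step below assumes the models sketched earlier plus a compute_losses() routine returning the individual loss terms; it is only a sketch and omits the Stochastic Gradient Variational Bayes details of the original work.

```python
# Schematic adversarial training step (illustrative; not the authors' exact code).
import torch

opt_sel_enc = torch.optim.Adam(list(selector.parameters()) + list(encoder.parameters()), lr=1e-4)
opt_dec     = torch.optim.Adam(decoder.parameters(), lr=1e-4)
opt_disc    = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def train_step(compute_losses):
    # compute_losses() recomputes a dict with 'sparsity', 'prior', 'reconst', 'gan' terms
    L = compute_losses()                      # 1) Frame Selector + Encoder
    opt_sel_enc.zero_grad(); (L['sparsity'] + L['prior'] + L['reconst']).backward(); opt_sel_enc.step()
    L = compute_losses()                      # 2) Decoder
    opt_dec.zero_grad(); (L['reconst'] + L['gan']).backward(); opt_dec.step()
    L = compute_losses()                      # 3) Discriminator: maximize L_GAN = minimize -L_GAN
    opt_disc.zero_grad(); (-L['gan']).backward(); opt_disc.step()
```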
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN method [Mahasseni, 2017]
35
Inference stage and video summarization
 Deep features of video frames in Frame Selector => normalized frame-level importance scores
 Video fragmentation using KTS
 Fragment-level importance scores
 Key-fragment selection as a Knapsack problem (sketched below)
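A minimal sketch of the fragment-level part of this pipeline is given below; the KTS change points are assumed to be computed beforehand, fragment scores are taken as the average of the frame scores inside each fragment, and the 15% duration budget follows the commonly used evaluation protocol.

```python
# Illustrative key-fragment selection: frame scores -> fragment scores -> 0/1 knapsack.
import numpy as np

def select_key_fragments(frame_scores, fragments, budget_ratio=0.15):
    """fragments: list of (start, end) frame-index pairs produced by KTS."""
    values  = [float(np.mean(frame_scores[s:e])) for s, e in fragments]   # fragment-level scores
    lengths = [e - s for s, e in fragments]
    budget  = int(budget_ratio * sum(lengths))                            # max summary length

    # Dynamic-programming solution of the 0/1 knapsack problem
    dp = [[0.0] * (budget + 1) for _ in range(len(fragments) + 1)]
    for i in range(1, len(fragments) + 1):
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if lengths[i - 1] <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - lengths[i - 1]] + values[i - 1])

    # Backtrack to recover the selected fragments
    selected, c = [], budget
    for i in range(len(fragments), 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(fragments[i - 1])
            c -= lengths[i - 1]
    return sorted(selected)
```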
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN-sl method [Apostolidis, 2019]
36
E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the
Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production,
Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
 Builds on the SUM-GAN architecture
 Contains a linear compression layer that
reduces the size of CNN feature vectors
 Follows an incremental and fine-grained
approach to train the model’s components
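As a small illustration of the linear compression layer, a single learnable projection applied to every frame's CNN feature vector is enough; the 1024-to-512 sizes below are assumptions for illustration and not necessarily the exact values used in the paper.

```python
# Illustrative linear compression of per-frame CNN features (sizes are assumptions).
import torch.nn as nn

feat_dim, compressed_dim = 1024, 512
linear_compression = nn.Linear(feat_dim, compressed_dim)
# compressed_feats = linear_compression(cnn_feats)   # cnn_feats: (1, T, feat_dim)
```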
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN-sl method [Apostolidis, 2019]
 Step-wise training process
39
Training pipeline and loss functions
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN-sl method [Apostolidis, 2019]
43
Inference stage and video summarization
 Deep features of video frames in LC layer and Frame Selector => normalized frame-level importance scores
 Video fragmentation using KTS
 Fragment-level importance scores
 Key-fragment selection as a Knapsack problem
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN-AAE method [Apostolidis, 2020]
 Builds on the SUM-GAN-sl algorithm
 Introduces an attention mechanism by
replacing the VAE of SUM-GAN-sl with a
deterministic attention auto-encoder
44
E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven
Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp.
492-504, Jan. 2020. Best paper award
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN-AAE method [Apostolidis, 2020]
46
The attention auto-encoder: Processing pipeline
 Weighted feature vectors are fed to the Encoder
 The Encoder's output (V) and the Decoder's previous hidden state are fed to the Attention component
 For t > 1: the hidden state of the previous Decoder step (ht-1) is used
 For t = 1: the hidden state of the last Encoder step (He) is used
 Attention weights (αt) are computed using:
 an Energy score function
 a Soft-max function
 αt is multiplied with V to form the Context Vector vt'
 vt' is combined with the Decoder's previous output yt-1
 The Decoder gradually reconstructs the video
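A generic additive (Bahdanau-style) attention step that matches the description above could be sketched as follows; the exact energy score function of SUM-GAN-AAE is not reproduced here, so treat this purely as an illustrative example.

```python
# Illustrative attention step (generic additive/Bahdanau-style energy function).
import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W_enc = nn.Linear(dim, dim)
        self.W_dec = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, 1)

    def forward(self, V, dec_hidden):
        # V: (T, dim) Encoder outputs; dec_hidden: (dim,) previous Decoder hidden state
        energy = self.v(torch.tanh(self.W_enc(V) + self.W_dec(dec_hidden))).squeeze(-1)  # (T,)
        alpha = torch.softmax(energy, dim=0)                 # attention weights a_t
        context = (alpha.unsqueeze(-1) * V).sum(dim=0)       # context vector v_t'
        return context, alpha

# At t = 1 the last Encoder hidden state replaces the Decoder's previous hidden state;
# the returned context vector is then combined with the Decoder's previous output y_{t-1}.
```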
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN-AAE method [Apostolidis, 2020]
 Training is performed in an incremental way as in SUM-GAN-sl
 No prior loss is used
54
Training pipeline and loss functions
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
The SUM-GAN-AAE method [Apostolidis, 2020]
55
Inference stage and video summarization
 Deep features of video frames in LC layer and Frame Selector => normalized frame-level importance scores
 Video fragmentation using KTS
 Fragment-level importance scores
 Key-fragment selection as a Knapsack problem
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
 Much smoother series of importance scores
The SUM-GAN-AAE method [Apostolidis, 2020]
56
Impact of the introduced attention mechanism
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
 Much faster and more stable training of the model
The SUM-GAN-AAE method [Apostolidis, 2020]
57
Impact of the introduced attention mechanism
Average (over 5 splits) learning curve of SUM-GAN-sl and SUM-GAN-AAE on SumMe
Loss curves for the SUM-GAN-sl and SUM-GAN-AAE
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
 The most common strategy for learning summarization in an unsupervised way
 A mechanism to build a representative summary by maximizing the ability to infer (reconstruct) the full video from it
 Summarization performance is superior to other unsupervised learning approaches (e.g.
reinforcement learning) and comparable to a few supervised learning methods
 Step-wise training facilitates the training of complex GAN-based architectures
 Introduction of attention mechanisms is beneficial to the quality of the created summary
 There is room for further improving GAN-based unsupervised video summarization via: a)
combination with reinforcement learning approaches, b) extension with memory networks
Some concluding remarks
58
Using GANs for video summarization
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Short break; coming up:
Section I.3: Datasets, evaluation
protocols and results, and future
directions
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Section I.3: Datasets, evaluation
protocols and results, and future
directions
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Datasets
61
 SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
 25 videos capturing multiple events (e.g. cooking and sports)
 video length: 1 to 6 min
 annotation: fragment-based video summaries (15-18 per video)
 TVSum (https://github.com/yalesong/tvsum)
 50 videos from 10 categories of the TRECVid MED task
 video length: 1 to 11 min
 annotation: frame-level importance scores (20 per video)
Most commonly used
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Datasets
62
 Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
 50 videos of various genres (e.g. documentary, educational, historical, lecture)
 video length: 1 to 4 min
 annotation: keyframe-based video summaries (5 per video)
 Youtube (https://sites.google.com/site/vsummsite/download)
 50 videos of diverse content (e.g. cartoons, news, sports, commercials) collected from websites
 video length: 1 to 10 min
 annotation: keyframe-based video summaries (5 per video)
Less commonly used
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Evaluation protocols
63
Early approach
 Agreement between an automatically-created (A) and a user-defined (U) summary is expressed by the F-Score computed over matched keyframes (see the sketch after this list)
 Matching of a pair of frames is based on color histograms, the Manhattan distance and a
predefined similarity threshold
 80% of video samples are used for training and the remaining 20% for testing
 The final evaluation outcome occurs by:
 Computing the average F-Score for a test video given the different user summaries for this video
 Computing the average of the calculated F-Score values for the different test videos
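A minimal sketch of this keyframe-matching evaluation is shown below; the 16-bin per-channel histograms and the 0.5 distance threshold are illustrative assumptions rather than the exact values used in the early works.

```python
# Illustrative keyframe matching via color histograms and Manhattan (L1) distance.
import numpy as np

def color_histogram(frame, bins=16):
    """frame: HxWx3 uint8 array -> concatenated, normalized per-channel histogram."""
    hist = np.concatenate([np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
                           for c in range(3)]).astype(float)
    return hist / hist.sum()

def keyframe_f_score(auto_keyframes, user_keyframes, threshold=0.5):
    matched, remaining = 0, list(user_keyframes)
    for a in auto_keyframes:
        ha = color_histogram(a)
        for i, u in enumerate(remaining):
            if np.abs(ha - color_histogram(u)).sum() < threshold:  # Manhattan distance
                matched += 1
                del remaining[i]            # each user keyframe can be matched only once
                break
    p, r = matched / len(auto_keyframes), matched / len(user_keyframes)
    return 2 * p * r / (p + r) if matched else 0.0
```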
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Evaluation protocols
64
Established approach
 The generated summary should not exceed 15% of the video length
 Agreement between an automatically-generated (A) and a user-defined (U) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring the temporal overlap (∩) between them, where ‖·‖ denotes duration; the formulas are written out after this list
 Typical metrics for computing Precision and Recall at the frame-level
 80% of video samples are used for training and the remaining 20% for testing
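Written out, with A and U denoting the automatically-generated and the user-defined summary and ‖·‖ denoting duration, the quantities described above are:

```latex
P = \frac{\|A \cap U\|}{\|A\|}, \qquad
R = \frac{\|A \cap U\|}{\|U\|}, \qquad
F\text{-}Score = \frac{2 \cdot P \cdot R}{P + R} \times 100\%
```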
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Evaluation protocols
68
Established approach - A side note
 TVSum annotations need conversion from frame-level importance scores to key-fragments
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem
Human annotations in TVSum: frame-level importance scores
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Evaluation protocols
74
Established approach
 Slight but important distinction w.r.t. what is eventually used as ground-truth summary
 Most used approach: a separate F-Score_i is computed against each of the N available user summaries, and the per-video score is obtained as follows (illustrated in the sketch after this list):
SumMe: F-Score = max{F-Score_i}, i = 1..N
TVSum: F-Score = mean{F-Score_i}, i = 1..N
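A small sketch of this aggregation, assuming an f_score() routine that compares a generated summary against a single user summary (e.g. via the temporal-overlap F-Score defined earlier):

```python
# Illustrative per-video aggregation of the N per-user F-Scores.
def evaluate_video(generated, user_summaries, dataset):
    scores = [f_score(generated, u) for u in user_summaries]   # F-Score_1 ... F-Score_N
    return max(scores) if dataset == "SumMe" else sum(scores) / len(scores)  # TVSum: mean
```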
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Evaluation protocols
76
Established approach
 Slight but important distinction w.r.t. what is eventually used as ground-truth summary
 Alternative approach: a single F-Score is computed against a single ground-truth summary
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Results: comparison of unsupervised methods
77
Method Reference
Online Motion AE [Zhang, 2018]
SUM-FCNunsup [Rochan, 2018]
DR-DSN [Zhou, 2018b]
EDSN [Gonuguntla, 2019]
UnpairedVSN [Rochan, 2019]
PCDL [Zhao, 2019]
ACGAN [He, 2019]
Tessellation [Kaufman, 2017]
SUM-GAN-sl [Apostolidis, 2019]
SUM-GAN-AAE [Apostolidis, 2020]
CSNet [Jung, 2019]
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
 Best-performing unsupervised methods rely
on Generative Adversarial Networks
 The use of attention mechanisms allows the
identification of important parts of the video
 The best method on TVSum (Tessellation) appears to be dataset-tailored, as it shows random-level performance on SumMe
 The use of rewards and reinforcement learning is less competitive than the use of GANs
 A few methods show random-level performance on at least one of the used datasets
Results: comparison of unsupervised methods
78
Method SumMe-FSc SumMe-Rnk TVSum-FSc TVSum-Rnk AVG-Rnk
Random summary 40.2 10 54.4 9 9.5
Online Motion AE 37.7 11 51.5 11 11
SUM-FCNunsup 41.5 8 52.7 10 9
DR-DSN 41.4 9 57.6 6 7.5
EDSN 42.6 7 57.3 7 7
UnpairedVSN 47.5 4 55.6 8 6
PCDL 42.7 6 58.4 4 5
ACGAN 46.0 5 58.5 3 4
Tessellation 41.4 7 64.1 1 4
SUM-GAN-sl 47.8 3 58.4 4 3.5
SUM-GAN-AAE 48.9 2 58.3 5 3.5
CSNet 51.3 1 58.8 2 1.5
General remarks
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
 tbd
Results: comparison of supervised methods
85
Method Reference
vsLSTM [Zhang, 2016b]
dppLSTM [Zhang, 2016b]
SASUMwsup [Wei. 2018]
ActionRanking [Elfeki, 2019]
ESS-VS [Zhang, 2016a]
H-RNN [Zhao, 2017]
vsLSTM+Att [Lebron Casas, 2019]
DSSE [Yuan, 2019b]
DR-DSNsup [Zhou, 2018b]
Tessellationsup [Kaufman, 2017]
Method Reference
dppLSTM+Att [Lebron Casas, 2019]
WS-HRL [Chen, 2019]
UnpairedVSNsup [Rochan, 2019]
SUM-FCN [Rochan, 2018]
SF-CVS [Huang, 2020]
SASUMsup [Wei, 2018]
CRSum [Yuan, 2019c]
PCDLsup [Zhao, 2019]
MAVS [Feng, 2018]
HSA-RNN [Zhao, 2018]
Method Reference
DQSN [Zhou, 2018a]
ACGANsup [He, 2019]
SUM-DeepLab [Rochan, 2018]
CSNetsup [Yuan, 2019a]
SMLD [Chu, 2019]
H-MAN [Liu, 2019]
VASNet [Fajtl, 2019]
SMN [Wang, 2019]
* SUM-GAN-AAE [Apostolidis, 2020]
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
 tbd
Results: comparison of supervised methods
86
Method SumMe-FSc TVSum-FSc AVG-Rnk
Random summary 40.2 54.4 22.5
vsLSTM 37.6 54.2 24.5
dppLSTM 38.6 54.7 23
SASUMwsup 40.6 53.9 22.5
ActionRanking 40.1 56.3 21.5
ESS-VS 40.9 - 20
H-RNN 41.1 57.7 17.5
vsLSTM+Att 43.2 - 17
DSSE - 57.0 17
DR-DSNsup 42.1 58.1 16
Tessellationsup 37.2 63.4 15
dppLSTM+Att 43.8 - 14
WS-HRL 43.6 58.4 14
UnpairedVSNsup 48.0 56.1 13
SUM-FCN 47.5 56.8 13
SF-CVS 46.0 58.0 13
SASUMsup 45.3 58.2 12.5
CRSum 47.3 58.0 12
PCDLsup 43.7 59.2 12
MAVS 40.3 66.8 11.5
HSA-RNN 44.1 59.8 10
DQSN - 58.6 10
ACGANsup 47.2 59.4 9
SUM-DeepLab 48.8 58.4 8
CSNetsup 48.6 58.5 8
SMLD 47.6 61.0 6
H-MAN 51.8 60.4 4
VASNet 49.7 61.4 3.5
SMN 58.3 64.5 1.5
* SUM-GAN-AAE 48.9 58.3 8.5
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Qualitative comparison
92
Keyframe-based overview of video #15 of TVSum (1 keyframe / shot)
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Qualitative comparison
93
Generated summaries by five summarization methods
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Qualitative comparison
97
Video #15 of TVSum: “How to Clean Your Dog’s Ears - Vetoquinol USA”
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Qualitative comparison
98
Automatically generated summaries
VASNet SUM-GAN-AAE DR-DSN
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Use of video summarization technologies
99
Tool for content adaptation / re-purposing
 Developed by CERTH-ITI
 Employs GAN-based methods for unsupervised learning [Apostolidis 2019, 2020]
 Enables content adaptation for distribution via
multiple communication channels
 Facilitates summary creation based on the audience needs for: Twitter, Facebook (feed & stories), Instagram (feed & stories), YouTube, TikTok
E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the
Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production,
Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven
Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961,
pp. 492-504, Jan. 2020.
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Use of video summarization technologies
100
Tool for content adaptation / re-purposing
 Learns content-specific summarization
 Separate models can be trained and used for
different video content (e.g. TV shows)
 Creating these models does not require manually-
generated training data (it’s (almost) for free)
E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the
Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production,
Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven
Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961,
pp. 492-504, Jan. 2020.
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Use of video summarization technologies
101
Tool for content adaptation / re-purposing
 Try it with your video at: http://multimedia2.iti.gr/videosummarization/service/start.html
 Demo video: https://youtu.be/LbjPLJzeNII
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Future directions
102
 Unsupervised video summarization based on combining adversarial and reinforcement
learning
 Advanced attention mechanisms and memory networks for capturing long-range temporal
dependencies among parts of the video
 Exploiting augmented/extended training data
 Introducing editorial rules in unsupervised video summarization
 Examining the potential of transfer learning in video summarization
Analysis-oriented
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Future directions
103
 There is a lack of integrated technologies for automating video summarization and CERTH’s
web application is one of the first complete tools
 Automated summarization that is adaptive to the distribution channel / targeted audience or
the video content has a strong potential!
 Further applications of video summarization should be investigated by:
 monitoring the modern media/social media ecosystem
 identifying new application domains for content adaptation / re-purposing
 translating the needs of these application domains into analysis requirements
Application-oriented
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Apostolidis, 2019] E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, and I. Patras, “A stepwise, label-based approach for
improving the adversarial training in unsupervised video summarization,” in Proc. of the 1st Int. Workshop on AI for Smart TV
Content Production, Access and Delivery, ser. AI4TV ’19. New York, NY, USA: ACM, 2019, pp. 17–25.
[Apostolidis, 2020] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, “Unsupervised video summarization via
attention-driven adversarial learning,” in Proc. of the Int. Conf. on Multimedia Modeling. Springer, 2020, pp. 492–504.
[Bahdanau, 2015] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in
Proc. of the 3rd Int. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
[Chen, 2019] Y. Chen, L. Tao, X. Wang, and T. Yamasaki, “Weakly supervised video summarization by hierarchical reinforcement learning,” in Proc. of the ACM Multimedia Asia, 2019, pp. 1–6.
[Cho, 2014] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–
decoder approaches,” in Proc. of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111.
[Chu, 2019] W.-T. Chu and Y.-H. Liu, “Spatiotemporal modeling and label distribution learning for video summarization,” in Proc.
of the 2019 IEEE 21st Int. Workshop on Multimedia Signal Processing (MMSP). IEEE, 2019, pp. 1–6.
[Elfeki, 2019] M. Elfeki and A. Borji, “Video summarization via actionness ranking,” in Proc. of the IEEE Winter Conference on
Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, Jan 2019, pp. 754–763.
Key references
104
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Fajtl, 2019] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Asian
Conf. on Computer Vision (ACCV) 2019 Workshops, G. Carneiro and S. You, Eds. Cham: Springer International Publishing,
2019, pp. 39–54.
[Feng, 2018] L. Feng, Z. Li, Z. Kuang, and W. Zhang, “Extractive video summarizer with memory augmented neural networks,” in
Proc. of the 26th ACM Int. Conf. on Multimedia, ser. MM ’18. New York, NY, USA: ACM, 2018, pp. 976–983.
[Fu, 2019] T. Fu, S. Tai, and H. Chen, “Attentive and adversarial learning for video summarization,” in Proc. of the IEEE Winter
Conf. on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, pp. 1579–1587.
[Gonuguntla, 2019] N. Gonuguntla, B. Mandal, N. Puhan et al., “Enhanced deep video summarization network,” in Proc. of the
2019 British Machine Vision Conference (BMVC), 2019.
[Goyal, 2017] A. Goyal, N. R. Ke, A. Lamb, R. D. Hjelm, C. J. Pal, J. Pineau, and Y. Bengio, “Actual: Actor-critic under adversarial
learning,” ArXiv, vol. abs/1711.04755, 2017.
[Gygli, 2014] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Proc. of the
European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 505–520.
[Gygli, 2015] M. Gygli, H. Grabner, and L. V. Gool, “Video summarization by learning submodular mixtures of objectives,” in Proc.
of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3090–3098.
[Haarnoja, 2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor,” in Proc. of the 35th Int. Conf. on Machine Learning (ICML), 2018.
Key references
105
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[He, 2019] X. He, Y. Hua, T. Song, Z. Zhang, Z. Xue, R. Ma, N. Robertson, and H. Guan, “Unsupervised video summarization with
attentive conditional generative adversarial networks,” in Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New
York, NY, USA: ACM, 2019, pp. 2296–2304.
[Hochreiter, 1997] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–
1780, 1997.
[Huang, 2020] C. Huang and H. Wang, “A novel key-frames selection framework for comprehensive video summarization,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 30, no. 2, pp. 577–589, 2020.
[Ji, 2019] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” IEEE
Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
[Jung, 2019] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, “Discriminative feature learning for unsupervised video
summarization,” in Proc. of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8537–8544.
[Kaufman, 2017] D. Kaufman, G. Levi, T. Hassner, and L. Wolf, “Temporal tessellation: A unified approach for video analysis,” in
Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 94–104.
[Kulesza, 2012] A. Kulesza and B. Taskar, Determinantal Point Processes for Machine Learning. Hanover, MA, USA: Now
Publishers Inc., 2012.
[Lal, 2019] S. Lal, S. Duggal, and I. Sreedevi, “Online video summarization: Predicting future to better summarize present,” in
Proc. of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 471–480.
Key references
106
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Lebron Casas, 2019] L. Lebron Casas and E. Koblents, “Video summarization with LSTM and deep attention models,” in
MultiMedia Modeling, I. Kompatsiaris, B. Huet, V. Mezaris, C. Gurrin, W.-H. Cheng, and S. Vrochidis, Eds. Cham: Springer
International Publishing, 2019, pp. 67–79.
[Liu, 2019] Y.-T. Liu, Y.-J. Li, F.-E. Yang, S.-F. Chen, and Y.-C. F. Wang, “Learning hierarchical self-attention for video
summarization,” in Proc. of the 2019 IEEE Int. Conf. on Image Processing (ICIP). IEEE, 2019, pp. 3377–3381.
[Mahasseni, 2017] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial LSTM
networks,” in Proc. of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2982–
2991.
[Otani, 2016] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, “Video summarization using deep semantic features,” in Proc. of the 13th Asian Conference on Computer Vision (ACCV’16), 2016.
[Panda, 2017] R. Panda, A. Das, Z. Wu, J. Ernst, and A. K. Roy-Chowdhury, “Weakly supervised summarization of web videos,” in
Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 3677–3686.
[Pfau, 2016] D. Pfau and O. Vinyals, “Connecting generative adversarial networks and actor-critic methods,” in NIPS Workshop
on Adversarial Training, 2016.
[Potapov, 2014] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proc. of the
European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 540–555.
Key references
107
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Rochan, 2018] M. Rochan, L. Ye, and Y. Wang, “Video summarization using fully convolutional sequence networks,” in Proc. of
the European Conference on Computer Vision (ECCV) 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham:
Springer International Publishing, 2018, pp. 358–374.
[Rochan, 2019] M. Rochan and Y. Wang, “Video summarization by learning from unpaired data,” in Proc. of the 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[Savioli, 2019] N. Savioli, “A hybrid approach between adversarial generative networks and actor-critic policy gradient for low
rate high-resolution image compression,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019.
[Smith, 2017] J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie
Trailer Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–
1808.
[Song, 2015] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “TVSum: Summarizing web videos using titles,” in Proc. of the 2015
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5179–5187.
[Song, 2016] X. Song, K. Chen, J. Lei, L. Sun, Z. Wang, L. Xie, and M. Song, “Category driven deep recurrent neural network for
video summarization,” in Proc. of the 2016 IEEE Int. Conf. on Multimedia Expo Workshops (ICMEW), July 2016, pp. 1–6.
[Szegedy, 2015] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A.
Rabinovich, “Going deeper with convolutions,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2015, pp. 1–9.
Key references
108
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Vinyals, 2015] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in Advances in Neural Information Processing Systems
28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2692–2700.
[Wang, 2019] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, and T. Tan, “Stacked memory network for video summarization,” in
Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New York, NY, USA: ACM, 2019, pp. 836–844.
[Wang, 2016] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good
practices for deep action recognition,” in Proc. of the European Conference on Computer Vision – ECCV 2016, B. Leibe, J.
Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 20–36.
[Wei, 2018] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao, “Video summarization via semantic attended networks,” in Proc. of
the 2018 AAAI Conf. on Artificial Intelligence (AAAI), 2018.
[Yu, 2017] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence generative adversarial nets with policy gradient,” in Proc. of
the 2017 AAAI Conf. on Artificial Intelligence, ser. (AAAI). AAAI Press, 2017, pp. 2852–2858.
[Yuan, 2019a] L. Yuan, F. E. H. Tay, P. Li, L. Zhou, and J. Feng, “Cycle-SUM: Cycle-consistent adversarial LSTM networks for
unsupervised video summarization,” in Proc. of the 2019 AAAI Conf. on Artificial Intelligence (AAAI), 2019.
[Yuan, 2019b] Y. Yuan, T. Mei, P. Cui, and W. Zhu, “Video summarization by learning deep side semantic embedding,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 226–237, Jan 2019.
[Yuan, 2019c] Y. Yuan, H. Li, and Q. Wang, “Spatiotemporal modeling for video summarization using convolutional recurrent
neural network,” IEEE Access, vol. 7, pp. 64 676–64 685, 2019.
Key references
109
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Zhang, 2016a] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video
summarization,” in Proc. of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp.
1059–1067.
[Zhang, 2016b] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in Proc. of
the European Conference on Computer Vision (ECCV) 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer
International Publishing, 2016, pp. 766–782.
[Zhang, 2018] Y. Zhang, X. Liang, D. Zhang, M. Tan, and E. P. Xing, “Unsupervised object-level video summarization with online
motion auto-encoder,” Pattern Recognition Letters, 2018.
[Zhang, 2019] Y. Zhang, M. Kampffmeyer, X. Zhao, and M. Tan, “DTR-GAN: Dilated temporal relational adversarial network for
video summarization,” in Proc. of the ACM Turing Celebration Conference - China, ser. ACM TURC ’19. New York, NY, USA:
ACM, 2019, pp. 89:1–89:6.
[Zhao, 2017] B. Zhao, X. Li, and X. Lu, “Hierarchical recurrent neural network for video summarization,” in Proc. of the 2017 ACM
on Multimedia Conference, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 863–871.
[Zhao, 2018] B. Zhao, X. Li, and X. Lu, “HSA-RNN: Hierarchical structure-adaptive RNN for video summarization,” in Proc. of the
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7405–7414.
[Zhao, 2019] B. Zhao, X. Li, and X. Lu, “Property-constrained dual learning for video summarization,” IEEE Transactions on
Neural Networks and Learning Systems, 2019.
Key references
110
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Zhou, 2018a] K. Zhou, T. Xiang, and A. Cavallaro, “Video summarisation by classification with deep reinforcement learning,” in
Proc. of the 2018 British Machine Vision Conference (BMVC), 2018.
[Zhou, 2018b] K. Zhou and Y. Qiao, “Deep reinforcement learning for unsupervised video summarization with diversity-
representativeness reward,” in Proc. of the 2018 AAAI Conference on Artificial Intelligence (AAAI), 2018.
Key references
111
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Vasileios Mezaris
bmezaris@iti.gr
Evlampios Apostolidis
apostolid@iti.gr
CERTH-ITI, Greece
info@retv-project.eu
This work has received funding from the
European Union’s Horizon 2020 research
and innovation programme under grant
agreement H2020-780656 ReTV
Questions?
Following the Q&A session and the
break, we will be back with Part II of
the tutorial, on video summaries re-use and recommendation
Icme2020 tutorial video_summarization_part1
  • 11. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 11 Problem definition Domain-specific applications of video summarization Full movie (e.g. 1h 30’-2h) Movie trailer (2’30’’) J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–1808. Source: https://www.youtube.com/watch?v=wb49-oV0F78
  • 12. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 12 Problem definition Domain-specific applications of video summarization Full game (e.g. 1h 30’) Game’s synopsis & highlights (1’32’’) Source: https://www.youtube.com/watch?v=oo-2IFTifUU
  • 13. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 13 Problem definition Domain-specific applications of video summarization Video samples extracted from: https://www.youtube.com/watch?v=gk3qTMlcadk Raw CCTV material (e.g. 24h) Summary of important actions/events (with timestamps)
  • 14. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 14 Literature overview Taxonomy of deep learning based methods for automatic video summarization
  • 15. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 15 Literature overview Supervised approaches: using video semantics and metadata  [Zhang, 2016a; Kaufman, 2017] learn and transfer the summary structure of semantically-similar videos  [Panda, 2017] metadata-driven video categorization and summarization by maximizing relevance with the video category  [Song, 2016; Zhou, 2018a] category-driven summarization by category feature preservation (keep main parts of a wedding when summarizing a wedding video)  [Otani, 2016; Yuan, 2019b] maximize relevance of visual (video) and textual (metadata) data in a common latent space
  • 16. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 16 Literature overview Supervised approaches: considering temporal structure and dependency  [Zhang, 2016b] estimate frames’ importance by modeling their variable-range temporal dependency using RNNs  [Zhao, 2018] models and encodes the temporal structure of the video for defining the key-fragments using hierarchies of RNNs  [Ji, 2019] video-to-summary as a sequence-to-sequence learning problem using attention-driven encoder-decoder network  [Feng, 2018; Wang, 2019] estimate frames’ importance by modeling their long- range dependency using high-capacity memory networks
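To make the RNN-based recipe of the preceding slide concrete, here is a minimal PyTorch-style sketch of a bidirectional LSTM that maps per-frame CNN features to frame-level importance scores, broadly in the spirit of [Zhang, 2016b]; the feature and hidden sizes, the sigmoid regression head and the MSE target are illustrative assumptions rather than the exact configuration of any cited method.

```python
# Minimal sketch (assumed sizes): per-frame CNN features -> BiLSTM -> frame importance scores.
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, frame_feats):              # (batch, n_frames, feat_dim)
        h, _ = self.rnn(frame_feats)             # models variable-range temporal dependencies
        return self.head(h).squeeze(-1)          # (batch, n_frames) importance scores in [0, 1]

# Supervised training step: regress towards human-annotated frame-level importance.
model = FrameScorer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
features = torch.randn(1, 300, 1024)             # e.g. GoogleNet features of a 300-frame video
targets = torch.rand(1, 300)                     # toy ground-truth importance scores
loss = nn.functional.mse_loss(model(features), targets)
loss.backward()
optimizer.step()
```

At inference time, such frame-level scores are converted into a key-fragment summary, as discussed in the inference slides further below.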
  • 17. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 17 Literature overview Supervised approaches: imitating human summaries  [Zhang, 2019] summarization by confusing a trainable discriminator when making the distinction between a machine- and a human-generated summary; model the variable-range temporal dependency using RNNs and Dilated Temporal Units  [Fu, 2019] key-fragment selection by confusing a trainable discriminator when making the distinction between the machine- and a human-selected key-fragments; fragmentation based on attention-based Pointer Network, and discrimination using a 3D-CNN classifier
  • 18. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 18 Literature overview Supervised approaches: targeting specific properties of the summary  [Chu, 2019] models spatiotemporal information based on raw frames and optical flow maps, and learns frames’ importance from human annotations via a label distribution learning process  [Elfeki, 2019] uses CNNs and RNNs to form spatiotemporal feature vectors and estimates the level of activity and importance of each frame to create the summary  [Chen, 2019] summarization based on reinforcement learning and reward functions associated with the diversity and representativeness of the video summary
  • 19. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 19 Literature overview Unsupervised approaches: inferring the original video  [Mahasseni, 2017] SUM-GAN trains a summarizer to fool a discriminator when distinguishing the original from the summary-based reconstructed video using adversarial learning  [Jung, 2019] CSNet extends [Mahasseni, 2017] with a chunk and stride network and attention mechanism to assess variable-range dependencies and select the video key- frames  [Apostolidis, 2020] SUM-GAN-AAE extends [Mahasseni, 2017] with a stepwise, fine- grained training strategy and an attention auto-encoder to improve the key-fragment selection process  [Rochan, 2019] UnpairedVSN learns video summarization from unpaired data based on an adversarial process that defines a mapping function of a raw video to a human summary
  • 20. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 20 Literature overview Unsupervised approaches: targeting specific properties of the summary  [Zhou, 2018b] DR-DSN learns to create representative and diverse summaries via reinforcement learning and relevant reward functions  [Gonuguntla, 2019] EDSN extracts spatiotemporal information and learns summarization by rewarding the maintenance of main spatiotemporal patterns in the summary  [Zhang, 2018] OnlineMotionAE extracts the key motions of appearing objects and uses an online motion auto-encoder model to generate summaries that include the main objects in the video and the attractive actions made by each of these objects
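As an illustration of the reward-driven alternative, the snippet below sketches diversity and representativeness rewards in the spirit of DR-DSN [Zhou, 2018b]: diversity is the mean pairwise dissimilarity of the selected frames, and representativeness rewards selections whose frames lie close to every frame of the video. The exact normalizations and temporal restrictions of the original paper are omitted, and all tensors are toy values.

```python
# Sketch of diversity / representativeness rewards for a reinforcement-learning summarizer.
import torch
import torch.nn.functional as F

def diversity_reward(feats, picks):
    """Mean pairwise dissimilarity (1 - cosine similarity) among the selected frames."""
    x = F.normalize(feats[picks], dim=1)
    sim = x @ x.t()
    n = x.size(0)
    if n < 2:
        return torch.tensor(0.0)
    mean_pairwise_sim = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    return 1.0 - mean_pairwise_sim

def representativeness_reward(feats, picks):
    """exp(-mean distance of every video frame to its nearest selected frame)."""
    dists = torch.cdist(feats, feats[picks])     # (n_frames, n_selected)
    return torch.exp(-dists.min(dim=1).values.mean())

feats = torch.randn(300, 1024)                   # per-frame CNN features of a toy video
picks = torch.tensor([10, 55, 120, 220])         # frame indices sampled by the policy network
reward = diversity_reward(feats, picks) + representativeness_reward(feats, picks)
```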
  • 21. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  DL-based video summarization methods mainly rely on combinations of CNNs and RNNs  Pre-trained CNNs are used to represent the visual content; RNNs (mostly LSTMs) are used to model the temporal dependency among video frames  The proposed video summarization approaches are mostly supervised  Best supervised approaches utilize tailored attention mechanisms or memory networks to capture variable- and long-range temporal dependencies respectively  For unsupervised video summarization GANs are the central direction and RL is another but less common approach  Best unsupervised approaches rely on VAE-GAN architectures that have been enhanced with attention mechanisms Some concluding remarks 21
  • 22. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  The generation of ground-truth data can be an expensive and laborious process  Video summarization is a subjective task and multiple summaries can be proposed for a video  Human annotations that vary a lot make it hard to train a method with the typical supervised training approaches  Unsupervised video summarization algorithms overcome the need for ground-truth data and can be trained using only an adequately large collection of videos  Unsupervised learning allows to train a summarization method using different types of video content (TV shows, news) and then perform content-wise video summarization Some concluding remarks 22
  • 23. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  The generation of ground-truth data can be an expensive and laborious process  Video summarization is a subjective task and multiple summaries can be proposed for a video  Human annotations that vary a lot make it hard to train a method with the typical supervised training approaches  Unsupervised video summarization algorithms overcome the need for ground-truth data and can be trained using only an adequately large collection of videos  Unsupervised learning allows to train a summarization method using different types of video content (TV shows, news) and then perform content-wise video summarization Some concluding remarks 23 Unsupervised video summarization has great advantages, increases the applicability of summarization technologies, and its potential should be investigated
  • 24. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Vasileios Mezaris, Evlampios Apostolidis CERTH-ITI, Greece Tutorial at IEEE ICME 2020 Short break; coming up: Section I.2: Discussion on a few unsupervised GAN-based methods Video Summarization and Re-use Technologies and Tools Part I: Automatic video summarization
  • 25. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Vasileios Mezaris, Evlampios Apostolidis CERTH-ITI, Greece Tutorial at IEEE ICME 2020 Section I.2: Discussion on a few unsupervised GAN-based methods Video Summarization and Re-use Technologies and Tools Part I: Automatic video summarization
  • 26. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Problem formulation: video summarization via selecting a sparse subset of frames that optimally represent the video  Main idea: learn summarization by minimizing the distance between videos and a distribution of their summarizations  Goal: select a set of keyframes such that a distance between the deep representations of the selected keyframes and the video is minimized 26 B. Mahasseni, M. Lam and S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks," 2017 IEEE CVPR, Honolulu, HI, 2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318. Courtesy of Mahasseni et al.
  • 27. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Problem formulation: video summarization via selecting a sparse subset of frames that optimally represent the video  Main idea: learn summarization by minimizing the distance between videos and a distribution of their summarizations  Goal: select a set of keyframes such that a distance between the deep representations of the selected keyframes and the video is minimized  Challenge: how to define a good distance? 27 B. Mahasseni, M. Lam and S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks," 2017 IEEE CVPR, Honolulu, HI, 2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318. Courtesy of Mahasseni et al.
  • 28. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Problem formulation: video summarization via selecting a sparse subset of frames that optimally represent the video  Main idea: learn summarization by minimizing the distance between videos and a distribution of their summarizations  Goal: select a set of keyframes such that a distance between the deep representations of the selected keyframes and the video is minimized  Challenge: how to define a good distance?  Solution: use a Discriminator network and train it with the Summarizer in an adversarial manner 28 B. Mahasseni, M. Lam and S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks," 2017 IEEE CVPR, Honolulu, HI, 2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318. Courtesy of Mahasseni et al.
  • 29. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Deep features of video frames in Frame Selector => normalized importance scores  Weighted features in Encoder => latent representation e  Latent representation e in Decoder => sequence of features for the frames of input video  Original & reconstructed features in Discriminator => distance estimation and binary classification as “video” or “summary” 29 Training pipeline and loss functions
  • 30. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Deep features of video frames in Frame Selector => normalized importance scores  Weighted features in Encoder => latent representation e  Latent representation e in Decoder => sequence of features for the frames of input video  Original & reconstructed features in Discriminator => distance estimation and binary classification as “video” or “summary” 30 Training pipeline and loss functions
  • 31. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Deep features of video frames in Frame Selector => normalized importance scores  Weighted features in Encoder => latent representation e  Latent representation e in Decoder => sequence of features for the frames of input video  Original & reconstructed features in Discriminator => distance estimation and binary classification as “video” or “summary” 31 Training pipeline and loss functions
  • 32. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Deep features of video frames in Frame Selector => normalized importance scores  Weighted features in Encoder => latent representation e  Latent representation e in Decoder => sequence of features for the frames of input video  Original & reconstructed features in Discriminator => distance estimation and binary classification as “video” or “summary” 32 Training pipeline and loss functions
  • 33. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Deep features of video frames in Frame Selector => normalized importance scores  Weighted features in Encoder => latent representation e  Latent representation e in Decoder => sequence of features for the frames of input video  Original & reconstructed features in Discriminator => distance estimation and binary classification as “video” or “summary” 33 Training pipeline and loss functions
  • 34. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Train Frame Selector and Encoder by minimizing L_sparsity + L_prior + L_reconst  Train Decoder by minimizing L_reconst + L_GAN  Train Discriminator by maximizing L_GAN  Update all components via backward propagation using Stochastic Gradient Variational Bayes estimation 34 Training pipeline and loss functions
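The training scheme above can be rendered roughly as the toy iteration below. This is a condensed sketch, not the original implementation: the components are stand-in modules, the stepwise component-wise updates are merged into a single summarizer update versus a discriminator update, the VAE prior loss is dropped because the toy encoder is deterministic, and the choice of MSE for reconstruction and sigma = 0.15 for the sparsity target are assumptions.

```python
# One toy adversarial iteration in the spirit of SUM-GAN (stand-in modules, assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 1024                                                         # per-frame feature size (assumption)
selector = nn.Sequential(nn.Linear(D, 1), nn.Sigmoid())          # frame selector -> importance scores
encoder = nn.GRU(D, D, batch_first=True)                         # stands in for the VAE encoder
decoder = nn.GRU(D, D, batch_first=True)                         # reconstructs the frame sequence
discriminator = nn.Sequential(nn.Linear(D, 1), nn.Sigmoid())     # classifies "video" vs "summary"

opt_sum = torch.optim.Adam(list(selector.parameters()) + list(encoder.parameters())
                           + list(decoder.parameters()), lr=1e-4)
opt_dis = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

frames = torch.randn(1, 300, D)                                  # features of a 300-frame video
sigma = 0.15                                                     # keep roughly 15% of the frames

scores = selector(frames)                                        # (1, 300, 1) importance scores
weighted = frames * scores                                       # summary-weighted features
latent, _ = encoder(weighted)
recon, _ = decoder(latent)                                       # summary-based reconstruction

# Discriminator step: distinguish the original video from the reconstruction (maximize L_GAN)
d_real = discriminator(frames).mean()
d_fake = discriminator(recon.detach()).mean()
dis_loss = F.binary_cross_entropy(d_real, torch.tensor(1.0)) + \
           F.binary_cross_entropy(d_fake, torch.tensor(0.0))
opt_dis.zero_grad(); dis_loss.backward(); opt_dis.step()

# Summarizer step: sparsity + reconstruction losses, plus fooling the discriminator (L_GAN)
sparsity_loss = (scores.mean() - sigma) ** 2
recon_loss = F.mse_loss(recon, frames)
gan_loss = F.binary_cross_entropy(discriminator(recon).mean(), torch.tensor(1.0))
opt_sum.zero_grad()
(sparsity_loss + recon_loss + gan_loss).backward()
opt_sum.step()
```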
  • 35. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN method [Mahasseni, 2017]  Deep features of video frames in Frame Selector => normalized importance scores    35 Inference stage and video summarization 35 Video fragmentation using KTS Fragment-level importance scores Key-fragment selection as a Knapsack problem Frame-level importance scores
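The inference pipeline above (frame-level scores averaged per KTS fragment, then key-fragment selection as a 0/1 knapsack under a 15%-of-duration budget) can be sketched as follows. KTS itself is not shown (fragment boundaries are assumed given), the knapsack is the textbook dynamic-programming variant, and all numbers are toy values.

```python
# Sketch: frame scores -> fragment scores -> knapsack selection of key-fragments (toy values).
def fragment_scores(frame_scores, fragments):
    """fragments: list of (start, end) frame indices; returns the mean score of each fragment."""
    return [sum(frame_scores[s:e]) / (e - s) for s, e in fragments]

def knapsack_select(values, lengths, budget):
    """0/1 knapsack: maximize total value while the total length stays within the budget."""
    n = len(values)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if lengths[i - 1] <= b:
                best[i][b] = max(best[i][b],
                                 best[i - 1][b - lengths[i - 1]] + values[i - 1])
    picks, b = [], budget                               # backtrack to recover the selection
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            picks.append(i - 1)
            b -= lengths[i - 1]
    return sorted(picks)

frame_scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.6, 0.2, 0.1, 0.3,
                0.9, 0.8, 0.1, 0.2, 0.4, 0.1, 0.2, 0.9, 0.3, 0.1]   # toy frame-level scores
fragments = [(0, 2), (2, 7), (7, 10), (10, 16), (16, 20)]           # toy KTS fragments
values = fragment_scores(frame_scores, fragments)
lengths = [e - s for s, e in fragments]
budget = int(0.15 * len(frame_scores))                               # summary <= 15% of the video
key_fragments = knapsack_select(values, lengths, budget)             # indices of selected fragments
```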
  • 36. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-sl method [Apostolidis, 2019] 36 E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.  Builds on the SUM-GAN architecture  Contains a linear compression layer that reduces the size of CNN feature vectors  Follows an incremental and fine-grained approach to train the model’s components
  • 37. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-sl method [Apostolidis, 2019] 37 E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.  Builds on the SUM-GAN architecture  Contains a linear compression layer that reduces the size of CNN feature vectors  Follows an incremental and fine-grained approach to train the model’s components
  • 38. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-sl method [Apostolidis, 2019]  Builds on the SUM-GAN architecture  Contains a linear compression layer that reduces the size of CNN feature vectors  Follows an incremental and fine-grained approach to train the model’s components 38 E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
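For completeness, the linear compression layer mentioned above is simply a learned linear projection applied to every frame feature before it enters the frame selector and the rest of the model; the 1024-to-256 sizes below are assumptions, not the dimensions reported in the paper.

```python
import torch
import torch.nn as nn

lc = nn.Linear(1024, 256)            # linear compression of the per-frame CNN feature vectors
frames = torch.randn(300, 1024)      # e.g. 300 frames of GoogleNet features
compressed = lc(frames)              # (300, 256), forwarded to the frame selector
```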
  • 39. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-sl method [Apostolidis, 2019]  Step-wise training process 39 Training pipeline and loss functions
  • 40. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-sl method [Apostolidis, 2019] 40  Step-wise training process Training pipeline and loss functions
  • 41. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-sl method [Apostolidis, 2019] 41  Step-wise training process Training pipeline and loss functions
  • 42. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-sl method [Apostolidis, 2019] 42  Step-wise training process Training pipeline and loss functions
  • 43. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-sl method [Apostolidis, 2019]  Deep features of video frames in LC layer and Frame Selector => normalized importance scores    43 Inference stage and video summarization 43 Video fragmentation using KTS Fragment-level importance scores Key-fragment selection as a Knapsack problem Frame-level importance scores
  • 44. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-AAE method [Apostolidis, 2020]  Builds on the SUM-GAN-sl algorithm  Introduces an attention mechanism by replacing the VAE of SUM-GAN-sl with a deterministic attention auto-encoder 44 E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020. Best paper award
  • 45. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-AAE method [Apostolidis, 2020]  Builds on the SUM-GAN-sl algorithm  Introduces an attention mechanism by replacing the VAE of SUM-GAN-sl with a deterministic attention auto-encoder 45 E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020. Best paper award
  • 46. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-AAE method [Apostolidis, 2020] 46 The attention auto-encoder: Processing pipeline
  • 47. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-AAE method [Apostolidis, 2020] 47 The attention auto-encoder: Processing pipeline  Weighted feature vectors fed to the Encoder
  • 48. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-AAE method [Apostolidis, 2020] 48 The attention auto-encoder: Processing pipeline  Weighted feature vectors fed to the Encoder  Encoder’s output (V) and Decoder’s previous hidden state fed to the Attention component  For t > 1: use the hidden state of the previous Decoder’s step (h1)  For t = 1: use the hidden state of the last Encoder’s step (He)
  • 49. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-AAE method [Apostolidis, 2020] 49 The attention auto-encoder: Processing pipeline  Weighted feature vectors fed to the Encoder  Encoder’s output (V) and Decoder’s previous hidden state fed to the Attention component  Attention weights (αt) computed using:  Energy score function  Soft-max function
  • 50. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  Weighted feature vectors fed to the Encoder  Encoder’s output (V) and Decoder’s previous hidden state fed to the Attention component  Attention weights (αt) computed using:  Energy score function  Soft-max function The SUM-GAN-AAE method [Apostolidis, 2020] 50 The attention auto-encoder: Processing pipeline
  • 51. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  Weighted feature vectors fed to the Encoder  Encoder’s output (V) and Decoder’s previous hidden state fed to the Attention component  Attention weights (αt) computed using:  Energy score function  Soft-max function  αt multiplied with V and form Context Vector vt’ The SUM-GAN-AAE method [Apostolidis, 2020] 51 The attention auto-encoder: Processing pipeline
  • 52. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  Weighted feature vectors fed to the Encoder  Encoder’s output (V) and Decoder’s previous hidden state fed to the Attention component  Attention weights (αt) computed using:  Energy score function  Soft-max function  αt multiplied with V and form Context Vector vt’  vt’ combined with Decoder’s previous output yt-1 The SUM-GAN-AAE method [Apostolidis, 2020] 52 The attention auto-encoder: Processing pipeline
  • 53. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  Weighted feature vectors fed to the Encoder  Encoder’s output (V) and Decoder’s previous hidden state fed to the Attention component  Attention weights (αt) computed using:  Energy score function  Soft-max function  αt multiplied with V and form Context Vector vt’  vt’ combined with Decoder’s previous output yt-1  Decoder gradually reconstructs the video The SUM-GAN-AAE method [Apostolidis, 2020] 53 The attention auto-encoder: Processing pipeline
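One decoding step of the attention auto-encoder described in the preceding slides could be sketched as below. An additive (Bahdanau-style) energy function and a GRU decoder cell are assumed for brevity, the exact score function and recurrent unit of SUM-GAN-AAE may differ, and all dimensions are illustrative.

```python
# Sketch of a single attention-driven decoding step (assumed energy function and sizes).
import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.W_enc = nn.Linear(dim, dim, bias=False)   # projects the encoder outputs V
        self.W_dec = nn.Linear(dim, dim, bias=False)   # projects the decoder's previous hidden state
        self.v = nn.Linear(dim, 1, bias=False)         # energy score function
        self.cell = nn.GRUCell(2 * dim, dim)           # decoder cell fed with [context ; y_{t-1}]

    def forward(self, V, h_prev, y_prev):
        # energy scores over the encoder outputs, then soft-max -> attention weights a_t
        energy = self.v(torch.tanh(self.W_enc(V) + self.W_dec(h_prev).unsqueeze(0))).squeeze(-1)
        a_t = torch.softmax(energy, dim=0)             # (n_frames,)
        context = (a_t.unsqueeze(-1) * V).sum(dim=0)   # context vector v_t' = a_t-weighted sum of V
        h_t = self.cell(torch.cat([context, y_prev]).unsqueeze(0),
                        h_prev.unsqueeze(0)).squeeze(0)
        return h_t                                      # next decoder state / reconstructed feature

V = torch.randn(300, 512)                               # encoder outputs for 300 frames
h_prev = torch.randn(512)                               # previous decoder hidden state (He at t = 1)
y_prev = torch.randn(512)                               # decoder output at the previous step
h_t = AttentionStep()(V, h_prev, y_prev)
```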
  • 54. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-AAE method [Apostolidis, 2020]  Training is performed in an incremental way as in SUM-GAN-sl  No prior loss is used 54 Training pipeline and loss functions
  • 55. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project The SUM-GAN-AAE method [Apostolidis, 2020]  Deep features of video frames in LC layer and Frame Selector => normalized importance scores    55 Inference stage and video summarization 55 Video fragmentation using KTS Fragment-level importance scores Key-fragment selection as a Knapsack problem Frame-level importance scores
  • 56. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  Much smoother series of importance scores The SUM-GAN-AAE method [Apostolidis, 2020] 56 Impact of the introduced attention mechanism
  • 57. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  Much faster and more stable training of the model The SUM-GAN-AAE method [Apostolidis, 2020] 57 Impact of the introduced attention mechanism Average (over 5 splits) learning curve of SUM-GAN-sl and SUM-GAN-AAE on SumMe; Loss curves for the SUM-GAN-sl and SUM-GAN-AAE
  • 58. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project  The most common strategy for learning summarization in an unsupervised way  A mechanism to build a representative summary by maximizing how well the full video can be inferred (reconstructed) from it  Summarization performance is superior to other unsupervised learning approaches (e.g. reinforcement learning) and comparable to a few supervised learning methods  Step-wise training facilitates the training of complex GAN-based architectures  Introduction of attention mechanisms is beneficial to the quality of the created summary  There is room for further improving GAN-based unsupervised video summarization via: a) combination with reinforcement learning approaches, b) extension with memory networks Some concluding remarks 58 Using GANs for video summarization
  • 59. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Vasileios Mezaris, Evlampios Apostolidis CERTH-ITI, Greece Tutorial at IEEE ICME 2020 Short break; coming up: Section I.3: Datasets, evaluation protocols and results, and future directions Video Summarization and Re-use Technologies and Tools Part I: Automatic video summarization
  • 60. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Vasileios Mezaris, Evlampios Apostolidis CERTH-ITI, Greece Tutorial at IEEE ICME 2020 Section I.3: Datasets, evaluation protocols and results, and future directions Video Summarization and Re-use Technologies and Tools Part I: Automatic video summarization
  • 61. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Datasets 61  SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)  25 videos capturing multiple events (e.g. cooking and sports)  video length: 1 to 6 min  annotation: fragment-based video summaries (15-18 per video)  TVSum (https://github.com/yalesong/tvsum)  50 videos from 10 categories of TRECVid MED task  video length: 1 to 11 min  annotation: frame-level importance scores (20 per video) Most commonly used
  • 62. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Datasets 62  Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)  50 videos of various genres (e.g. documentary, educational, historical, lecture)  video length: 1 to 4 min  annotation: keyframe-based video summaries (5 per video)  Youtube (https://sites.google.com/site/vsummsite/download)  50 videos of diverse content (e.g. cartoons, news, sports, commercials) collected from websites  video length: 1 to 10 min  annotation: keyframe-based video summaries (5 per video) Less commonly used
  • 63. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Evaluation protocols 63 Early approach  Agreement between automatically-created (A) and user-defined (U) summary is expressed by an F-Score computed over the matched pairs of frames between A and U  Matching of a pair of frames is based on color histograms, the Manhattan distance and a predefined similarity threshold  80% of video samples are used for training and the remaining 20% for testing  The final evaluation outcome is obtained by:  Computing the average F-Score for a test video given the different user summaries for this video  Computing the average of the calculated F-Score values for the different test videos
  • 64. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Evaluation protocols 64 Established approach  The generated summary should not exceed 15% of the video length  Agreement between automatically-generated (A) and user-defined (U) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring the temporal overlap (∩) (|| || means duration): P = ||A ∩ U|| / ||A||, R = ||A ∩ U|| / ||U||, F-Score = 2·P·R / (P + R)  Typical metrics for computing Precision and Recall at the frame-level  80% of video samples are used for training and the remaining 20% for testing
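A compact sketch of this protocol, with the two summaries represented as binary frame-level vectors and the per-video aggregation over multiple user summaries included (the max/mean convention is detailed a few slides further down); the vectors and durations below are toy values.

```python
# F-Score between an automatic summary A and user summaries U, on binary frame vectors.
import numpy as np

def f_score(machine, user):
    overlap = np.logical_and(machine, user).sum()         # ||A ∩ U|| measured in frames
    if overlap == 0:
        return 0.0
    precision = overlap / machine.sum()                    # ||A ∩ U|| / ||A||
    recall = overlap / user.sum()                          # ||A ∩ U|| / ||U||
    return 100.0 * 2 * precision * recall / (precision + recall)

n_frames = 1000
machine = np.zeros(n_frames, dtype=int); machine[100:220] = 1    # a 12%-long automatic summary
users = [np.zeros(n_frames, dtype=int) for _ in range(3)]        # toy user-defined summaries
users[0][150:300] = 1; users[1][90:200] = 1; users[2][400:520] = 1
per_user = [f_score(machine, u) for u in users]
f_summe_style = max(per_user)                      # SumMe convention: best-matching user summary
f_tvsum_style = float(np.mean(per_user))           # TVSum convention: average over user summaries
```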
  • 68. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Evaluation protocols 68 Established approach - A side note  TVSum annotations need conversion from frame-level importance scores to key-fragments: starting from the human annotations in TVSum (frame-level importance scores), the video is fragmented using KTS, fragment-level importance scores are computed, and key-fragment selection is formulated as a Knapsack problem (see the sketch below)
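As a minimal Python sketch of this conversion (under stated assumptions, not the benchmark implementation): fragment boundaries are taken as given (in the benchmark they come from KTS, which is not re-implemented here), fragment scores are the average of the frame scores they contain, and fragments are picked with a 0/1 Knapsack under a 15%-of-duration budget. Names such as `fragment_scores`, `knapsack_select` and `kts_fragments` are illustrative, not taken from any released codebase.

```python
def fragment_scores(frame_scores, fragments):
    """Average the frame-level importance scores within each (start, end) fragment (end exclusive)."""
    return [sum(frame_scores[s:e]) / (e - s) for s, e in fragments]

def knapsack_select(values, lengths, budget):
    """0/1 Knapsack: pick fragment indices that maximize total importance within the length budget."""
    n = len(values)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if lengths[i - 1] <= b:
                dp[i][b] = max(dp[i][b], dp[i - 1][b - lengths[i - 1]] + values[i - 1])
    selected, b = [], budget  # backtrack to recover the selected fragments
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            selected.append(i - 1)
            b -= lengths[i - 1]
    return sorted(selected)

# Toy example: 40 frames, 5 hypothetical KTS fragments, budget of 15% of the frames
frame_scores = [0.2] * 10 + [0.9] * 5 + [0.1] * 10 + [0.7] * 5 + [0.3] * 10
kts_fragments = [(0, 10), (10, 15), (15, 25), (25, 30), (30, 40)]
values = fragment_scores(frame_scores, kts_fragments)  # approx. [0.2, 0.9, 0.1, 0.7, 0.3]
lengths = [e - s for s, e in kts_fragments]            # [10, 5, 10, 5, 10]
budget = int(0.15 * len(frame_scores))                 # 6 frames
print(knapsack_select(values, lengths, budget))        # -> [1], the highest-scoring short fragment
```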
  • 69. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Evaluation protocols 69 Established approach  Slight but important distinction w.r.t. what is eventually used as ground-truth summary  Most used approach
  • 74. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 74 Evaluation protocols Established approach  Slight but important distinction w.r.t. what is eventually used as ground-truth summary  Most used approach: an F-Score_i is computed between the machine summary and each of the N user summaries of a video, and the per-video score is then SumMe: F-Score = max_{i=1..N} F-Score_i, TVSum: F-Score = mean_{i=1..N} F-Score_i (see the sketch below)
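A minimal sketch of this aggregation step, assuming the machine summary and each user summary are represented as equal-length binary per-frame selection vectors (this representation and the helper names `fscore` / `video_fscore` are assumptions made here for illustration):

```python
def fscore(machine, user):
    """Frame-level F-Score (%) between a machine summary and a single user summary
    (both given as equal-length binary per-frame selection vectors)."""
    overlap = sum(1 for m, u in zip(machine, user) if m and u)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(machine)
    recall = overlap / sum(user)
    return 2 * precision * recall / (precision + recall) * 100

def video_fscore(machine, user_summaries, dataset):
    """Aggregate the per-user F-Scores: max over users for SumMe, mean over users for TVSum."""
    scores = [fscore(machine, u) for u in user_summaries]
    return max(scores) if dataset == "SumMe" else sum(scores) / len(scores)

# Toy usage with two hypothetical user summaries for an 8-frame video
machine = [1, 1, 0, 0, 1, 0, 0, 0]
users = [[1, 0, 0, 0, 1, 0, 0, 0], [0, 1, 1, 0, 0, 0, 1, 0]]
print(video_fscore(machine, users, "SumMe"))  # max of the two per-user F-Scores
print(video_fscore(machine, users, "TVSum"))  # mean of the two per-user F-Scores
```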
  • 76. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project 76 F-Score Evaluation protocols Established approach  Slight but important distinction w.r.t. what is eventually used as ground-truth summary  Alternative approach
  • 77. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Results: comparison of unsupervised methods 77
Method: Reference
Online Motion AE: [Zhang, 2018]
SUM-FCNunsup: [Rochan, 2018]
DR-DSN: [Zhou, 2018b]
EDSN: [Gonuguntla, 2019]
UnpairedVSN: [Rochan, 2019]
PCDL: [Zhao, 2019]
ACGAN: [He, 2019]
Tessellation: [Kaufman, 2017]
SUM-GAN-sl: [Apostolidis, 2019]
SUM-GAN-AAE: [Apostolidis, 2020]
CSNet: [Jung, 2019]
  • 78. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Results: comparison of unsupervised methods 78 General remarks
 Best-performing unsupervised methods rely on Generative Adversarial Networks
 The use of attention mechanisms allows the identification of important parts of the video
 Best on TVSum is a dataset-tailored method, as it has random-level performance on SumMe
 The use of rewards and reinforcement learning is less competitive than the use of GANs
 A few methods show random performance in at least one of the used datasets
Method (SumMe FSc / Rnk | TVSum FSc / Rnk | AVG Rnk):
Random summary: 40.2 / 10 | 54.4 / 9 | 9.5
Online Motion AE: 37.7 / 11 | 51.5 / 11 | 11
SUM-FCNunsup: 41.5 / 8 | 52.7 / 10 | 9
DR-DSN: 41.4 / 9 | 57.6 / 6 | 7.5
EDSN: 42.6 / 7 | 57.3 / 7 | 7
UnpairedVSN: 47.5 / 4 | 55.6 / 8 | 6
PCDL: 42.7 / 6 | 58.4 / 4 | 5
ACGAN: 46.0 / 5 | 58.5 / 3 | 4
Tessellation: 41.4 / 7 | 64.1 / 1 | 4
SUM-GAN-sl: 47.8 / 3 | 58.4 / 4 | 3.5
SUM-GAN-AAE: 48.9 / 2 | 58.3 / 5 | 3.5
CSNet: 51.3 / 1 | 58.8 / 2 | 1.5
  • 85. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Results: comparison of supervised methods 85
Method: Reference
vsLSTM: [Zhang, 2016b]
dppLSTM: [Zhang, 2016b]
SASUMwsup: [Wei, 2018]
ActionRanking: [Elfeki, 2019]
ESS-VS: [Zhang, 2016a]
H-RNN: [Zhao, 2017]
vsLSTM+Att: [Lebron Casas, 2019]
DSSE: [Yuan, 2019b]
DR-DSNsup: [Zhou, 2018b]
Tessellationsup: [Kaufman, 2017]
dppLSTM+Att: [Lebron Casas, 2019]
WS-HRL: [Chen, 2019]
UnpairedVSNsup: [Rochan, 2019]
SUM-FCN: [Rochan, 2018]
SF-CVS: [Huang, 2020]
SASUMsup: [Wei, 2018]
CRSum: [Yuan, 2019c]
PCDLsup: [Zhao, 2019]
MAVS: [Feng, 2018]
HSA-RNN: [Zhao, 2018]
DQSN: [Zhou, 2018a]
ACGANsup: [He, 2019]
SUM-DeepLab: [Rochan, 2018]
CSNetsup: [Yuan, 2019a]
SMLD: [Chu, 2019]
H-MAN: [Liu, 2019]
VASNet: [Fajtl, 2019]
SMN: [Wang, 2019]
* SUM-GAN-AAE: [Apostolidis, 2020]
  • 86. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Results: comparison of supervised methods 86
Method (SumMe FSc | TVSum FSc | AVG Rnk):
Random summary: 40.2 | 54.4 | 22.5
vsLSTM: 37.6 | 54.2 | 24.5
dppLSTM: 38.6 | 54.7 | 23
SASUMwsup: 40.6 | 53.9 | 22.5
ActionRanking: 40.1 | 56.3 | 21.5
ESS-VS: 40.9 | - | 20
H-RNN: 41.1 | 57.7 | 17.5
vsLSTM+Att: 43.2 | - | 17
DSSE: - | 57.0 | 17
DR-DSNsup: 42.1 | 58.1 | 16
Tessellationsup: 37.2 | 63.4 | 15
dppLSTM+Att: 43.8 | - | 14
WS-HRL: 43.6 | 58.4 | 14
UnpairedVSNsup: 48.0 | 56.1 | 13
SUM-FCN: 47.5 | 56.8 | 13
SF-CVS: 46.0 | 58.0 | 13
SASUMsup: 45.3 | 58.2 | 12.5
CRSum: 47.3 | 58.0 | 12
PCDLsup: 43.7 | 59.2 | 12
MAVS: 40.3 | 66.8 | 11.5
HSA-RNN: 44.1 | 59.8 | 10
DQSN: - | 58.6 | 10
ACGANsup: 47.2 | 59.4 | 9
SUM-DeepLab: 48.8 | 58.4 | 8
CSNetsup: 48.6 | 58.5 | 8
SMLD: 47.6 | 61.0 | 6
H-MAN: 51.8 | 60.4 | 4
VASNet: 49.7 | 61.4 | 3.5
SMN: 58.3 | 64.5 | 1.5
* SUM-GAN-AAE: 48.9 | 58.3 | 8.5
  • 92. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Qualitative comparison 92 Keyframe-based overview of video #15 of TVSum (1 keyframe / shot)
  • 93. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Qualitative comparison 93 Generated summaries by five summarization methods
  • 97. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Qualitative comparison 97 Video #15 of TVSum: “How to Clean Your Dog’s Ears - Vetoquinol USA”
  • 98. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Qualitative comparison 98 Automatically generated summaries VASNet SUM-GAN-AAE DR-DSN
  • 99. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Use of video summarization technologies 99 Tool for content adaptation / re-purposing  Developed by CERTH-ITI  Builds on GAN-based methods for unsupervised learning [Apostolidis 2019, 2020]  Enables content adaptation for distribution via multiple communication channels  Facilitates summary creation based on the audience needs for: Twitter, Facebook (feed & stories), Instagram (feed & stories), YouTube, TikTok E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019. E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020.
  • 100. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Use of video summarization technologies 100 Tool for content adaptation / re-purposing  Learns content-specific summarization  Separate models can be trained and used for different video content (e.g. TV shows)  Creating these models does not require manually-generated training data (it comes (almost) for free) E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019. E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020.
  • 101. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Use of video summarization technologies 101 Tool for content adaptation / re-purposing  Try it with your video at: http://multimedia2.iti.gr/videosummarization/service/start.html  Demo video: https://youtu.be/LbjPLJzeNII
  • 102. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Future directions 102  Unsupervised video summarization based on combining adversarial and reinforcement learning  Advanced attention mechanisms and memory networks for capturing long-range temporal dependencies among parts of the video  Exploiting augmented/extended training data  Introducing editorial rules in unsupervised video summarization  Examining the potential of transfer learning in video summarization Analysis-oriented
  • 103. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Future directions 103  There is a lack of integrated technologies for automating video summarization and CERTH’s web application is one of the first complete tools  Automated summarization that is adaptive to the distribution channel / targeted audience or the video content has a strong potential!  Further applications of video summarization should be investigated by:  monitoring the modern media/social media ecosystem  identifying new application domains for content adaptation / re-purposing  translating the needs of these application domains into analysis requirements Application-oriented
  • 104. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project [Apostolidis, 2019] E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, and I. Patras, “A stepwise, label-based approach for improving the adversarial training in unsupervised video summarization,” in Proc. of the 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery, ser. AI4TV ’19. New York, NY, USA: ACM, 2019, pp. 17–25. [Apostolidis, 2020] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, “Unsupervised video summarization via attention-driven adversarial learning,” in Proc. of the Int. Conf. on Multimedia Modeling. Springer, 2020, pp. 492–504. [Bahdanau, 2015] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. of the 3rd Int. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Chen, 2019] Y. Chen, L. Tao, X. Wang, and T. Yamasaki, “Weakly supervised video summarization by hierarchical reinforcement learning,” in Proc. of the ACM Multimedia Asia, 2019, pp. 1–6. [Cho, 2014] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proc. of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111. [Chu, 2019] W.-T. Chu and Y.-H. Liu, “Spatiotemporal modeling and label distribution learning for video summarization,” in Proc. of the 2019 IEEE 21st Int. Workshop on Multimedia Signal Processing (MMSP). IEEE, 2019, pp. 1–6. [Elfeki, 2019] M. Elfeki and A. Borji, “Video summarization via actionness ranking,” in Proc. of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, Jan 2019, pp. 754–763. Key references 104
  • 105. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project [Fajtl, 2019] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Asian Conf. on Computer Vision (ACCV) 2019 Workshops, G. Carneiro and S. You, Eds. Cham: Springer International Publishing, 2019, pp. 39–54. [Feng, 2018] L. Feng, Z. Li, Z. Kuang, and W. Zhang, “Extractive video summarizer with memory augmented neural networks,” in Proc. of the 26th ACM Int. Conf. on Multimedia, ser. MM ’18. New York, NY, USA: ACM, 2018, pp. 976–983. [Fu, 2019] T. Fu, S. Tai, and H. Chen, “Attentive and adversarial learning for video summarization,” in Proc. of the IEEE Winter Conf. on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, pp. 1579–1587. [Gonuguntla, 2019] N. Gonuguntla, B. Mandal, N. Puhan et al., “Enhanced deep video summarization network,” in Proc. of the 2019 British Machine Vision Conference (BMVC), 2019. [Goyal, 2017] A. Goyal, N. R. Ke, A. Lamb, R. D. Hjelm, C. J. Pal, J. Pineau, and Y. Bengio, “Actual: Actor-critic under adversarial learning,” ArXiv, vol. abs/1711.04755, 2017. [Gygli, 2014] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Proc. of the European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 505–520. [Gygli, 2015] M. Gygli, H. Grabner, and L. V. Gool, “Video summarization by learning submodular mixtures of objectives,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3090–3098. [Haarnoja, 2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. of the 35th Int. Conf. on Machine Learning (ICML), 2018. Key references 105
  • 106. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project [He, 2019] X. He, Y. Hua, T. Song, Z. Zhang, Z. Xue, R. Ma, N. Robertson, and H. Guan, “Unsupervised video summarization with attentive conditional generative adversarial networks,” in Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New York, NY, USA: ACM, 2019, pp. 2296–2304. [Hochreiter, 1997] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735– 1780, 1997. [Huang, 2020] C. Huang and H. Wang, “A novel key-frames selection framework for comprehensive video summarization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 2, pp. 577–589, 2020. [Ji, 2019] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019. [Jung, 2019] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, “Discriminative feature learning for unsupervised video summarization,” in Proc. of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8537–8544. [Kaufman, 2017] D. Kaufman, G. Levi, T. Hassner, and L. Wolf, “Temporal tessellation: A unified approach for video analysis,” in Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 94–104. [Kulesza, 2012] A. Kulesza and B. Taskar, Determinantal Point Processes for Machine Learning. Hanover, MA, USA: Now Publishers Inc., 2012. [Lal, 2019] S. Lal, S. Duggal, and I. Sreedevi, “Online video summarization: Predicting future to better summarize present,” in Proc. of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 471–480. Key references 106
  • 107. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project [Lebron Casas, 2019] L. Lebron Casas and E. Koblents, “Video summarization with LSTM and deep attention models,” in MultiMedia Modeling, I. Kompatsiaris, B. Huet, V. Mezaris, C. Gurrin, W.-H. Cheng, and S. Vrochidis, Eds. Cham: Springer International Publishing, 2019, pp. 67–79. [Liu, 2019] Y.-T. Liu, Y.-J. Li, F.-E. Yang, S.-F. Chen, and Y.-C. F. Wang, “Learning hierarchical self-attention for video summarization,” in Proc. of the 2019 IEEE Int. Conf. on Image Processing (ICIP). IEEE, 2019, pp. 3377–3381. [Mahasseni, 2017] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial LSTM networks,” in Proc. of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2982–2991. [Otani, 2016] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, “Video summarization using deep semantic features,” in Proc. of the 13th Asian Conference on Computer Vision (ACCV’16), 2016. [Panda, 2017] R. Panda, A. Das, Z. Wu, J. Ernst, and A. K. Roy-Chowdhury, “Weakly supervised summarization of web videos,” in Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 3677–3686. [Pfau, 2016] D. Pfau and O. Vinyals, “Connecting generative adversarial networks and actor-critic methods,” in NIPS Workshop on Adversarial Training, 2016. [Potapov, 2014] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proc. of the European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 540–555. Key references 107
  • 108. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project [Rochan, 2018] M. Rochan, L. Ye, and Y. Wang, “Video summarization using fully convolutional sequence networks,” in Proc. of the European Conference on Computer Vision (ECCV) 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham: Springer International Publishing, 2018, pp. 358–374. [Rochan, 2019] M. Rochan and Y. Wang, “Video summarization by learning from unpaired data,” in Proc. of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. [Savioli, 2019] N. Savioli, “A hybrid approach between adversarial generative networks and actor-critic policy gradient for low rate high-resolution image compression,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019. [Smith, 2017] J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–1808. [Song, 2015] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “TVSum: Summarizing web videos using titles,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5179–5187. [Song, 2016] X. Song, K. Chen, J. Lei, L. Sun, Z. Wang, L. Xie, and M. Song, “Category driven deep recurrent neural network for video summarization,” in Proc. of the 2016 IEEE Int. Conf. on Multimedia Expo Workshops (ICMEW), July 2016, pp. 1–6. [Szegedy, 2015] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1–9. Key references 108
  • 109. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project [Vinyals, 2015] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2692–2700. [Wang, 2019] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, and T. Tan, “Stacked memory network for video summarization,” in Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New York, NY, USA: ACM, 2019, pp. 836–844. [Wang, 2016] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in Proc. of the European Conference on Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 20–36. [Wei, 2018] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao, “Video summarization via semantic attended networks,” in Proc. of the 2018 AAAI Conf. on Artificial Intelligence (AAAI), 2018. [Yu, 2017] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence generative adversarial nets with policy gradient,” in Proc. of the 2017 AAAI Conf. on Artificial Intelligence, ser. (AAAI). AAAI Press, 2017, pp. 2852–2858. [Yuan, 2019a] L. Yuan, F. E. H. Tay, P. Li, L. Zhou, and J. Feng, “Cycle-SUM: Cycle-consistent adversarial lstm networks for unsupervised video summarization,” in Proc. of the 2019 AAAI Conf. on Artificial Intelligence (AAAI), 2019. [Yuan, 2019b] Y. Yuan, T. Mei, P. Cui, and W. Zhu, “Video summarization by learning deep side semantic embedding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 226–237, Jan 2019. [Yuan, 2019c] Y. Yuan, H. Li, and Q. Wang, “Spatiotemporal modeling for video summarization using convolutional recurrent neural network,” IEEE Access, vol. 7, pp. 64 676–64 685, 2019. Key references 109
  • 110. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project [Zhang, 2016a] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video summarization,” in Proc. of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1059–1067. [Zhang, 2016b] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in Proc. of the European Conference on Computer Vision (ECCV) 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 766–782. [Zhang, 2018] Y. Zhang, X. Liang, D. Zhang, M. Tan, and E. P. Xing, “Unsupervised object-level video summarization with online motion auto-encoder,” Pattern Recognition Letters, 2018. [Zhang, 2019] Y. Zhang, M. Kampffmeyer, X. Zhao, and M. Tan, “DTR-GAN: Dilated temporal relational adversarial network for video summarization,” in Proc. of the ACM Turing Celebration Conference - China, ser. ACM TURC ’19. New York, NY, USA: ACM, 2019, pp. 89:1–89:6. [Zhao, 2017] B. Zhao, X. Li, and X. Lu, “Hierarchical recurrent neural network for video summarization,” in Proc. of the 2017 ACM on Multimedia Conference, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 863–871. [Zhao, 2018] B. Zhao, X. Li, and X. Lu, “HSA-RNN: Hierarchical structure-adaptive RNN for video summarization,” in Proc. of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7405–7414. [Zhao, 2019] B. Zhao, X. Li, and X. Lu, “Property-constrained dual learning for video summarization,” IEEE Transactions on Neural Networks and Learning Systems, 2019. Key references 110
  • 111. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project [Zhou, 2018a] K. Zhou, T. Xiang, and A. Cavallaro, “Video summarisation by classification with deep reinforcement learning,” in Proc. of the 2018 British Machine Vision Conference (BMVC), 2018. [Zhou, 2018b] K. Zhou and Y. Qiao, “Deep reinforcement learning for unsupervised video summarization with diversity- representativeness reward,” in Proc. of the 2018 AAAI Conference on Artificial Intelligence (AAAI), 2018. Key references 111
  • 112. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project Vasileios Mezaris bmezaris@iti.gr Evlampios Apostolidis apostolid@iti.gr CERTH-ITI, Greece info@retv-project.eu This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV Questions? Following the Q&A session and the break, we will be back with Part II of the tutorial, on video summaries re-use and recommendation