Tutorial on "Video Summarization and Re-use Technologies and Tools", delivered at IEEE ICME 2020. These slides correspond to the first part of the tutorial, presented by Vasileios Mezaris and Evlampios Apostolidis. This part deals with automatic video summarization, and includes a presentation of the video summarization problem definition and a literature overview; an in-depth discussion on a few unsupervised GAN-based methods; and a discussion on video summarization datasets, evaluation protocols and results, and future directions.
1. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Section I.1: Video summarization
problem definition and literature
overview
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
2.
Tutorial’s structure and time schedule
Part I: Automatic video summarization
Section I.1: Video summarization problem definition and literature overview (20’)
Q&A (5’)
Section I.2: In-depth discussion on a few unsupervised GAN-based methods (20’)
Q&A (5’)
Section I.3: Datasets, evaluation protocols and results, and future directions (20’)
20’ Q&A and break, then we are back with the tutorial’s Part II: Video summaries re-use and
recommendation
3.
Video is everywhere!
Problem definition
Hours of video content uploaded on
YouTube every minute
Captured by smart-devices and instantly
shared online
Constantly and rapidly increasing
volumes of video content
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-
age-video-sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
5.
But how to find what we are looking for in endless collections of video content?
Problem definition - video consumption side
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
Quickly inspect a video’s
content by checking its
synopsis!
7.
But how to reach different audiences for a given media item?
Problem definition - video editing side
Image source: https://marketingland.com/social-media-audience-critical-content-marketing-223647
Use of technologies for
content adaptation, re-use
and re-purposing!
[Image: audience reactions in speech bubbles ("Good", "Very interesting", "Boring", "Nice", "Very detailed")]
8.
Video summary: a short visual summary that encapsulates the flow of the story and
the essential parts of the full-length video
Original video
Video summary (storyboard)
Problem definition
Source: https://www.youtube.com/watch?v=deRF9oEbRso
9.
Problem definition
General applications of video summarization
Professional CMS: effective indexing,
browsing, retrieval & promotion of media
assets!
Video sharing platforms: improved viewer
experience, enhanced viewer engagement &
increased content consumption!
Source: https://www.redbytes.in/how-to-build-an-app-like-hotstar/ Source: Screenshot of the BBC News channel on YouTube
10.
Problem definition
General applications of video summarization
Audience- and channel-specific content adaptation: video content re-use and re-distribution in
the most appropriate way!
Image source: https://www.databagg.com/online-video-sharing
11.
Problem definition
Domain-specific applications of video summarization
Full movie (e.g. 1h 30’-2h) Movie trailer (2’30’’)
J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer
Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–1808.
Source: https://www.youtube.com/watch?v=wb49-oV0F78
12.
Problem definition
Domain-specific applications of video summarization
Full game (e.g. 1h 30’)
Game’s synopsis & highlights (1’32’’)
Source: https://www.youtube.com/watch?v=oo-2IFTifUU
13.
Problem definition
Domain-specific applications of video summarization
Video samples extracted from: https://www.youtube.com/watch?v=gk3qTMlcadk
Raw CCTV material (e.g. 24h) Summary of important actions/events (with timestamps)
14.
Literature overview
Taxonomy of deep learning
based methods for automatic
video summarization
15.
Literature overview
Supervised approaches: using video semantics and metadata
[Zhang, 2016; Kaufman, 2017] learn and transfer the summary structure of
semantically-similar videos
[Panda, 2017] metadata-driven video categorization and summarization by
maximizing relevance with the video category
[Song, 2016; Zhou, 2018a] category-driven summarization by category feature
preservation (keep main parts of a wedding when summarizing a wedding video)
[Otani, 2016; Yuan, 2019] maximize relevance of visual (video) and textual
(metadata) data in a common latent space
16.
Literature overview
Supervised approaches: considering temporal structure and dependency
[Zhang, 2016b] estimate frames’ importance by modeling their variable-range
temporal dependency using RNNs
[Zhao, 2018] models and encodes the temporal structure of the video for
defining the key-fragments using hierarchies of RNNs
[Ji, 2019] video-to-summary as a sequence-to-sequence learning problem using an
attention-driven encoder-decoder network
[Feng, 2018; Wang, 2019] estimate frames’ importance by modeling their long-
range dependency using high-capacity memory networks
17.
Literature overview
Supervised approaches: imitating human summaries
[Zhang, 2019] summarization by confusing a trainable discriminator when making
the distinction between a machine- and a human-generated summary; model the
variable-range temporal dependency using RNNs and Dilated Temporal Units
[Fu, 2019] key-fragment selection by confusing a trainable discriminator when
making the distinction between the machine- and a human-selected key-fragments;
fragmentation based on attention-based Pointer Network, and discrimination using
a 3D-CNN classifier
18.
Literature overview
Supervised approaches: targeting specific properties of the summary
[Chu, 2019] models spatiotemporal information based on raw frames and optical
flow maps, and learns frames’ importance from human annotations via a label
distribution learning process
[Elfeki, 2019] uses CNNs and RNNs to form spatiotemporal feature vectors and
estimates the level of activity and importance of each frame to create the summary
[Chen, 2019] summarization based on reinforcement learning and reward functions
associated with the diversity and representativeness of the video summary
19.
Literature overview
Unsupervised approaches: inferring the original video
[Mahasseni, 2017] SUM-GAN trains a summarizer to fool a discriminator when
distinguishing the original from the summary-based reconstructed video using
adversarial learning
[Jung, 2019] CSNet extends [Mahasseni, 2017] with a chunk and stride network and
attention mechanism to assess variable-range dependencies and select the video key-
frames
[Apostolidis, 2020] SUM-GAN-AAE extends [Mahasseni, 2017] with a stepwise, fine-
grained training strategy and an attention auto-encoder to improve the key-fragment
selection process
[Rochan, 2019] UnpairedVSN learns video summarization from unpaired data based on
an adversarial process that defines a mapping function of a raw video to a human
summary
20.
Literature overview
Unsupervised approaches: targeting specific properties of the summary
[Zhou, 2018b] DR-DSN learns to create representative and diverse summaries via
reinforcement learning and relevant reward functions
[Gonuguntla, 2019] EDSN extracts spatiotemporal information and learns
summarization by rewarding the maintenance of main spatiotemporal patterns in
the summary
[Zhang, 2018] OnlineMotionAE extracts the key motions of appearing objects and
uses an online motion auto-encoder model to generate summaries that include the
main objects in the video and the attractive actions made by each of these objects
21.
DL-based video summarization methods mainly rely on combinations of CNNs and RNNs
Pre-trained CNNs are used to represent the visual content; RNNs (mostly LSTMs) are used to
model the temporal dependency among video frames
The proposed video summarization approaches are mostly supervised
Best supervised approaches utilize tailored attention mechanisms or memory networks to
capture variable- and long-range temporal dependencies respectively
For unsupervised video summarization, GANs are the central direction and reinforcement
learning (RL) is another, less common approach
Best unsupervised approaches rely on VAE-GAN architectures that have been enhanced with
attention mechanisms
Some concluding remarks
23.
The generation of ground-truth data can be an expensive and laborious process
Video summarization is a subjective task and multiple summaries can be proposed for a video
Human annotations that vary a lot make it hard to train a method with the typical supervised
training approaches
Unsupervised video summarization algorithms overcome the need for ground-truth data and
can be trained using only an adequately large collection of videos
Unsupervised learning makes it possible to train a summarization method on different types of
video content (e.g. TV shows, news) and then perform content-specific video summarization
Some concluding remarks
Unsupervised video summarization has great advantages, increases the applicability
of summarization technologies, and its potential should be investigated
24.
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Short break; coming up:
Section I.2: Discussion on a few
unsupervised GAN-based
methods
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
25.
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Section I.2: Discussion on a few
unsupervised GAN-based
methods
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
28.
The SUM-GAN method [Mahasseni, 2017]
Problem formulation: video summarization via selecting a
sparse subset of frames that optimally represent the video
Main idea: learn summarization by minimizing the distance
between videos and a distribution of their summarizations
Goal: select a set of keyframes such that a distance between
the deep representations of the selected keyframes and the
video is minimized
Challenge: how to define a good distance?
Solution: use a Discriminator network and train it with the
Summarizer in an adversarial manner
B. Mahasseni, M. Lam and S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks," 2017 IEEE
CVPR, Honolulu, HI, 2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318.
Courtesy of
Mahasseni et al.
29.
The SUM-GAN method [Mahasseni, 2017]
Deep features of video frames in Frame Selector
=> normalized importance scores
Weighted features in Encoder => latent
representation e
Latent representation e in Decoder => sequence of
features for the frames of input video
Original & reconstructed features in Discriminator
=> distance estimation and binary classification as
“video” or “summary”
Training pipeline and loss functions
34.
The SUM-GAN method [Mahasseni, 2017]
Train Frame Selector and Encoder by minimizing
Lsparsity + Lprior + Lreconst
Train Decoder by minimizing Lreconst + LGAN
Train Discriminator by maximizing LGAN
Update all components via backward propagation
using Stochastic Gradient Variational Bayes
estimation
Training pipeline and loss functions
35.
The SUM-GAN method [Mahasseni, 2017]
Deep features of video frames in Frame Selector
=> normalized importance scores
Inference stage and video summarization
Frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem
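The inference pipeline above can be sketched in pure Python. This is an illustrative sketch, not the authors' code: `change_points` stands in for the KTS output (fragment boundaries as (start, end) frame-index pairs, end exclusive), and all names are hypothetical:

```python
import numpy as np

def select_key_fragments(frame_scores, change_points, budget_ratio=0.15):
    """Fragment-level scoring + Knapsack selection of key-fragments."""
    n_frames = len(frame_scores)
    capacity = int(n_frames * budget_ratio)        # max frames in the summary
    lengths = [e - s for s, e in change_points]    # fragment durations (frames)
    # Fragment-level score = mean of the frame-level scores in the fragment
    values = [float(np.mean(frame_scores[s:e])) for s, e in change_points]
    # Classic 0/1 Knapsack over fragments, maximizing total importance
    dp = [[0.0] * (capacity + 1) for _ in range(len(lengths) + 1)]
    for i in range(1, len(lengths) + 1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if lengths[i - 1] <= c:
                dp[i][c] = max(dp[i][c],
                               dp[i - 1][c - lengths[i - 1]] + values[i - 1])
    # Backtrack to recover the selected fragments
    selected, c = [], capacity
    for i in range(len(lengths), 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(change_points[i - 1])
            c -= lengths[i - 1]
    return sorted(selected)
```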
36.
The SUM-GAN-sl method [Apostolidis, 2019]
E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the
Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production,
Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
Builds on the SUM-GAN architecture
Contains a linear compression layer that
reduces the size of CNN feature vectors
Follows an incremental and fine-grained
approach to train the model’s components
39.
The SUM-GAN-sl method [Apostolidis, 2019]
Step-wise training process
Training pipeline and loss functions
43.
The SUM-GAN-sl method [Apostolidis, 2019]
Deep features of video frames in LC layer and
Frame Selector => normalized importance scores
Inference stage and video summarization
Frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem
44.
The SUM-GAN-AAE method [Apostolidis, 2020]
Builds on the SUM-GAN-sl algorithm
Introduces an attention mechanism by
replacing the VAE of SUM-GAN-sl with a
deterministic attention auto-encoder
E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven
Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp.
492-504, Jan. 2020. Best paper award
46.
The SUM-GAN-AAE method [Apostolidis, 2020]
The attention auto-encoder: Processing pipeline
Weighted feature vectors fed to the Encoder
Encoder’s output (V) and the Decoder’s previous
hidden state fed to the Attention component
For t = 1: use the hidden state of the last Encoder’s step (He)
For t > 1: use the hidden state of the previous Decoder’s step (h1)
Attention weights (αt) computed using:
Energy score function
Soft-max function
αt multiplied with V to form the Context Vector vt’
vt’ combined with the Decoder’s previous output yt-1
The Decoder gradually reconstructs the video
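One decoding step of the pipeline above can be sketched as follows. This is a hypothetical numpy illustration assuming an additive (energy-based) score function; the parameter names (`W_enc`, `W_dec`, `v`) are assumptions, not taken from [Apostolidis, 2020]:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_step(V, h_prev, W_enc, W_dec, v):
    """One attention-driven decoding step.

    V      : (T, d) Encoder outputs, one row per input frame
    h_prev : (d,)   previous Decoder hidden state (He at t = 1)
    W_enc, W_dec, v : learnable parameters of the energy score function
    """
    # Energy score of every Encoder output w.r.t. the Decoder state
    energies = np.tanh(V @ W_enc + h_prev @ W_dec) @ v   # (T,)
    alphas = softmax(energies)                           # attention weights αt
    context = alphas @ V                                 # context vector vt'
    return alphas, context
```

The context vector would then be combined with the Decoder’s previous output yt-1 to produce the next reconstructed frame feature.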
54.
The SUM-GAN-AAE method [Apostolidis, 2020]
Training is performed in an incremental way as in SUM-GAN-sl
No prior loss is used
Training pipeline and loss functions
55.
The SUM-GAN-AAE method [Apostolidis, 2020]
Deep features of video frames in LC layer and
Frame Selector => normalized importance scores
Inference stage and video summarization
Frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem
56.
Much smoother series of importance scores
The SUM-GAN-AAE method [Apostolidis, 2020]
Impact of the introduced attention mechanism
57.
Much faster and more stable training of the model
The SUM-GAN-AAE method [Apostolidis, 2020]
Impact of the introduced attention mechanism
Average (over 5 splits) learning curve of SUM-GAN-sl and SUM-GAN-AAE on SumMe
Loss curves for the SUM-GAN-sl and SUM-GAN-AAE
58.
The most common strategy for learning summarization in an unsupervised way
A mechanism to build a representative summary, by learning to reconstruct (infer) the full
video from it
Summarization performance is superior to other unsupervised learning approaches (e.g.
reinforcement learning) and comparable to a few supervised learning methods
Step-wise training facilitates the training of complex GAN-based architectures
Introduction of attention mechanisms is beneficial to the quality of the created summary
There is room for further improving GAN-based unsupervised video summarization via: a)
combination with reinforcement learning approaches, b) extension with memory networks
Some concluding remarks
Using GANs for video summarization
59.
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Short break; coming up:
Section I.3: Datasets, evaluation
protocols and results, and future
directions
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
60.
Vasileios Mezaris,
Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
Section I.3: Datasets, evaluation
protocols and results, and future
directions
Video Summarization and Re-use
Technologies and Tools
Part I: Automatic video summarization
61.
Datasets
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
25 videos capturing multiple events (e.g. cooking and sports)
video length: 1 to 6 min
annotation: fragment-based video summaries (15-18 per video)
TVSum (https://github.com/yalesong/tvsum)
50 videos from 10 categories of TRECVid MED task
video length: 1 to 11 min
annotation: frame-level importance scores (20 per video)
Most commonly used
62.
Datasets
Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
50 videos of various genres (e.g. documentary, educational, historical, lecture)
video length: 1 to 4 min
annotation: keyframe-based video summaries (5 per video)
YouTube (https://sites.google.com/site/vsummsite/download)
50 videos of diverse content (e.g. cartoons, news, sports, commercials) collected from websites
video length: 1 to 10 min
annotation: keyframe-based video summaries (5 per video)
Less commonly used
63.
Evaluation protocols
Early approach
Agreement between the automatically-created (A) and user-defined (U) summary is expressed by
an F-Score computed over matched frame pairs
Matching of a pair of frames is based on color histograms, the Manhattan distance and a
predefined similarity threshold
80% of video samples are used for training and the remaining 20% for testing
The final evaluation outcome occurs by:
Computing the average F-Score for a test video given the different user summaries for this video
Computing the average of the calculated F-Score values for the different test videos
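The frame-matching rule of this early protocol can be sketched as follows (hypothetical numpy illustration; the threshold value is an assumption, not the one used in the literature):

```python
import numpy as np

def frames_match(hist_a, hist_b, threshold=0.5):
    """Decide whether two keyframes match, via color histograms and
    Manhattan (L1) distance; `threshold` is a hypothetical cutoff."""
    # Normalize so the distance is comparable across frames
    ha = hist_a / hist_a.sum()
    hb = hist_b / hist_b.sum()
    return float(np.abs(ha - hb).sum()) <= threshold
```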
64.
Evaluation protocols
Established approach
The generated summary should not exceed 15% of the video length
Agreement between the automatically-generated (A) and user-defined (U) summary is expressed
by the F-Score (%), with (P)recision and (R)ecall measuring the temporal overlap (∩) (|| ||
means duration):
P = ||A ∩ U|| / ||A||, R = ||A ∩ U|| / ||U||, F-Score = 2·P·R / (P + R)
Precision and Recall are typically computed at the frame level
80% of video samples are used for training and the remaining 20% for testing
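The frame-level F-Score can be sketched as follows (illustrative numpy implementation; the binary per-frame encoding of the two summaries is an assumption for clarity):

```python
import numpy as np

def summary_f_score(auto_sel, user_sel):
    """F-Score (%) between an automatic and a user summary, both given as
    binary per-frame selection vectors (1 = frame is in the summary)."""
    a = np.asarray(auto_sel, dtype=bool)
    u = np.asarray(user_sel, dtype=bool)
    overlap = np.logical_and(a, u).sum()   # ||A ∩ U|| in frames
    if overlap == 0:
        return 0.0
    precision = overlap / a.sum()          # ||A ∩ U|| / ||A||
    recall = overlap / u.sum()             # ||A ∩ U|| / ||U||
    return 100.0 * 2 * precision * recall / (precision + recall)
```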
68.
Evaluation protocols
Established approach - A side note
TVSum annotations need conversion from frame-level importance scores to key-fragments
Human annotations in TVSum: frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem
74. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
74
F-ScoreN
F-Score2
F-Score1
Evaluation protocols
Established approach
Slight but important distinction w.r.t. what is eventually used as ground-truth summary
Most used approach
SumMe: F-Score = max{F-Score_i}, i = 1..N
TVSum: F-Score = mean{F-Score_i}, i = 1..N
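Under this protocol the machine summary is compared against each of the N user summaries at the frame level, and the per-user F-Scores are then reduced with max (SumMe) or mean (TVSum). A minimal sketch, assuming binary 0/1 frame vectors:

```python
def f_score(machine, user):
    """F-measure between two binary frame-level summaries (0/1 per frame)."""
    overlap = sum(m & u for m, u in zip(machine, user))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(machine)
    recall = overlap / sum(user)
    return 2 * precision * recall / (precision + recall)

def evaluate(machine, user_summaries, dataset):
    """Reduce the per-user F-Scores: max for SumMe, mean for TVSum."""
    scores = [f_score(machine, u) for u in user_summaries]
    return max(scores) if dataset == "SumMe" else sum(scores) / len(scores)

machine = [1, 1, 0, 0, 1, 0]
users = [[1, 0, 0, 0, 1, 0], [0, 1, 1, 0, 0, 0]]
print(evaluate(machine, users, "SumMe"))  # → 0.8 (max over annotators)
print(evaluate(machine, users, "TVSum"))  # → 0.6 (mean over annotators)
```

Note that the choice of reduction matters: max rewards matching any single annotator, while mean rewards agreement with the consensus.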
76. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
76
F-Score
Evaluation protocols
Established approach
Slight but important distinction w.r.t. what is eventually used as ground-truth summary
Alternative approach
78. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Best-performing unsupervised methods rely on Generative Adversarial Networks
The use of attention mechanisms helps to identify the important parts of the video
The best method on TVSum appears dataset-tailored, as it shows random-level performance on SumMe
The use of rewards and reinforcement learning is less competitive than the use of GANs
A few methods show random-level performance on at least one of the used datasets
Results: comparison of unsupervised methods
78
Method SumMe TVSum AVG
FSc Rnk FSc Rnk Rnk
Random summary 40.2 10 54.4 9 9.5
Online Motion AE 37.7 11 51.5 11 11
SUM-FCNunsup 41.5 8 52.7 10 9
DR-DSN 41.4 9 57.6 6 7.5
EDSN 42.6 7 57.3 7 7
UnpairedVSN 47.5 4 55.6 8 6
PCDL 42.7 6 58.4 4 5
ACGAN 46.0 5 58.5 3 4
Tessellation 41.4 7 64.1 1 4
SUM-GAN-sl 47.8 3 58.4 4 3.5
SUM-GAN-AAE 48.9 2 58.3 5 3.5
CSNet 51.3 1 58.8 2 1.5
General remarks
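The per-dataset ranks and the AVG column can be reproduced with a simple dense ranking (ties share a rank), shown here on an illustrative subset of the table:

```python
def dense_ranks(scores):
    """Dense ranking, highest score first; tied scores share a rank."""
    order = sorted(set(scores), reverse=True)
    rank_of = {s: i + 1 for i, s in enumerate(order)}
    return [rank_of[s] for s in scores]

methods = ["DR-DSN", "SUM-GAN-AAE", "CSNet"]
summe = [41.4, 48.9, 51.3]   # F-Scores on SumMe
tvsum = [57.6, 58.3, 58.8]   # F-Scores on TVSum
r_summe, r_tvsum = dense_ranks(summe), dense_ranks(tvsum)
avg = [(a + b) / 2 for a, b in zip(r_summe, r_tvsum)]
for m, a in zip(methods, avg):
    print(m, a)
```

Averaging ranks rather than raw F-Scores avoids letting the dataset with higher absolute scores (TVSum) dominate the overall comparison.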
97. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Qualitative comparison
97
Video #15 of TVSum: “How to Clean Your Dog’s Ears” - Vetoquinol USA
99. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Use of video summarization technologies
99
Tool for content adaptation / re-purposing
Developed by CERTH-ITI
Builds on GAN-based methods for unsupervised
learning [Apostolidis 2019, 2020]
Enables content adaptation for distribution via
multiple communication channels
Facilitates summary creation based on the audience
needs for: Twitter, Facebook (feed & stories),
Instagram (feed & stories), YouTube, TikTok
E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the
Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production,
Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven
Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961,
pp. 492-504, Jan. 2020.
100. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Use of video summarization technologies
100
Tool for content adaptation / re-purposing
Learns content-specific summarization
Separate models can be trained and used for
different video content (e.g. TV shows)
Creating these models does not require manually-
generated training data (it’s (almost) for free)
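The channel-dependent adaptation can be pictured as fitting the top-scored fragments into a per-channel duration budget; the budget values and channel names below are illustrative assumptions, not the tool's actual settings:

```python
# Illustrative per-channel summary-duration targets (seconds); the actual
# limits used by the CERTH tool are not specified here - these are assumptions.
CHANNEL_BUDGETS = {
    "twitter": 45,
    "facebook_stories": 20,
    "instagram_feed": 60,
    "tiktok": 15,
}

def budget_in_fragments(channel, fragment_durations):
    """Greedily keep top-scored fragments that fit the channel's budget.

    fragment_durations: durations (s) of fragments, already sorted by
    decreasing importance; returns the indices of the kept fragments.
    """
    budget = CHANNEL_BUDGETS[channel]
    total, kept = 0.0, []
    for i, d in enumerate(fragment_durations):
        if total + d <= budget:
            kept.append(i)
            total += d
    return kept

print(budget_in_fragments("tiktok", [6.0, 5.0, 7.0, 3.0]))  # → [0, 1, 3]
```

The same trained model can thus serve multiple channels by only changing the selection budget, which is what makes the single-pipeline adaptation practical.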
101. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Use of video summarization technologies
101
Tool for content adaptation / re-purposing
Try it with your video at: http://multimedia2.iti.gr/videosummarization/service/start.html
Demo video: https://youtu.be/LbjPLJzeNII
102. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Future directions
102
Unsupervised video summarization based on combining adversarial and reinforcement
learning
Advanced attention mechanisms and memory networks for capturing long-range temporal
dependencies among parts of the video
Exploiting augmented/extended training data
Introducing editorial rules in unsupervised video summarization
Examining the potential of transfer learning in video summarization
Analysis-oriented
103. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Future directions
103
There is a lack of integrated technologies for automating video summarization and CERTH’s
web application is one of the first complete tools
Automated summarization that is adaptive to the distribution channel / targeted audience or
the video content has a strong potential!
Further applications of video summarization should be investigated by:
monitoring the modern media/social media ecosystem
identifying new application domains for content adaptation / re-purposing
translating the needs of these application domains into analysis requirements
Application-oriented
104. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Apostolidis, 2019] E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, and I. Patras, “A stepwise, label-based approach for
improving the adversarial training in unsupervised video summarization,” in Proc. of the 1st Int. Workshop on AI for Smart TV
Content Production, Access and Delivery, ser. AI4TV ’19. New York, NY, USA: ACM, 2019, pp. 17–25.
[Apostolidis, 2020] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, “Unsupervised video summarization via
attention-driven adversarial learning,” in Proc. of the Int. Conf. on Multimedia Modeling. Springer, 2020, pp. 492–504.
[Bahdanau, 2015] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in
Proc. of the 3rd Int. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
[Chen, 2019] Y. Chen, L. Tao, X. Wang, and T. Yamasaki, “Weakly supervised video summarization by hierarchical reinforcement
learning,” in Proc. of the ACM Multimedia Asia, 2019, pp. 1–6.
[Cho, 2014] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–
decoder approaches,” in Proc. of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111.
[Chu, 2019] W.-T. Chu and Y.-H. Liu, “Spatiotemporal modeling and label distribution learning for video summarization,” in Proc.
of the 2019 IEEE 21st Int. Workshop on Multimedia Signal Processing (MMSP). IEEE, 2019, pp. 1–6.
[Elfeki, 2019] M. Elfeki and A. Borji, “Video summarization via actionness ranking,” in Proc. of the IEEE Winter Conference on
Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, Jan 2019, pp. 754–763.
Key references
104
105. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Fajtl, 2019] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Asian
Conf. on Computer Vision (ACCV) 2019 Workshops, G. Carneiro and S. You, Eds. Cham: Springer International Publishing,
2019, pp. 39–54.
[Feng, 2018] L. Feng, Z. Li, Z. Kuang, and W. Zhang, “Extractive video summarizer with memory augmented neural networks,” in
Proc. of the 26th ACM Int. Conf. on Multimedia, ser. MM ’18. New York, NY, USA: ACM, 2018, pp. 976–983.
[Fu, 2019] T. Fu, S. Tai, and H. Chen, “Attentive and adversarial learning for video summarization,” in Proc. of the IEEE Winter
Conf. on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, pp. 1579–1587.
[Gonuguntla, 2019] N. Gonuguntla, B. Mandal, N. Puhan et al., “Enhanced deep video summarization network,” in Proc. of the
2019 British Machine Vision Conference (BMVC), 2019.
[Goyal, 2017] A. Goyal, N. R. Ke, A. Lamb, R. D. Hjelm, C. J. Pal, J. Pineau, and Y. Bengio, “ACtuAL: Actor-critic under adversarial
learning,” ArXiv, vol. abs/1711.04755, 2017.
[Gygli, 2014] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Proc. of the
European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 505–520.
[Gygli, 2015] M. Gygli, H. Grabner, and L. V. Gool, “Video summarization by learning submodular mixtures of objectives,” in Proc.
of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3090–3098.
[Haarnoja, 2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor,” in Proc. of the 35th Int. Conf. on Machine Learning (ICML), 2018.
Key references
105
106. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[He, 2019] X. He, Y. Hua, T. Song, Z. Zhang, Z. Xue, R. Ma, N. Robertson, and H. Guan, “Unsupervised video summarization with
attentive conditional generative adversarial networks,” in Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New
York, NY, USA: ACM, 2019, pp. 2296–2304.
[Hochreiter, 1997] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–
1780, 1997.
[Huang, 2020] C. Huang and H. Wang, “A novel key-frames selection framework for comprehensive video summarization,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 30, no. 2, pp. 577–589, 2020.
[Ji, 2019] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” IEEE
Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
[Jung, 2019] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, “Discriminative feature learning for unsupervised video
summarization,” in Proc. of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8537–8544.
[Kaufman, 2017] D. Kaufman, G. Levi, T. Hassner, and L. Wolf, “Temporal tessellation: A unified approach for video analysis,” in
Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 94–104.
[Kulesza, 2012] A. Kulesza and B. Taskar, Determinantal Point Processes for Machine Learning. Hanover, MA, USA: Now
Publishers Inc., 2012.
[Lal, 2019] S. Lal, S. Duggal, and I. Sreedevi, “Online video summarization: Predicting future to better summarize present,” in
Proc. of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 471–480.
Key references
106
107. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Lebron Casas, 2019] L. Lebron Casas and E. Koblents, “Video summarization with LSTM and deep attention models,” in
MultiMedia Modeling, I. Kompatsiaris, B. Huet, V. Mezaris, C. Gurrin, W.-H. Cheng, and S. Vrochidis, Eds. Cham: Springer
International Publishing, 2019, pp. 67–79.
[Liu, 2019] Y.-T. Liu, Y.-J. Li, F.-E. Yang, S.-F. Chen, and Y.-C. F. Wang, “Learning hierarchical self-attention for video
summarization,” in Proc. of the 2019 IEEE Int. Conf. on Image Processing (ICIP). IEEE, 2019, pp. 3377–3381.
[Mahasseni, 2017] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial LSTM
networks,” in Proc. of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2982–
2991.
[Otani, 2016] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, “Video summarization using deep semantic
features,” in Proc. of the 13th Asian Conference on Computer Vision (ACCV’16), 2016.
[Panda, 2017] R. Panda, A. Das, Z. Wu, J. Ernst, and A. K. Roy-Chowdhury, “Weakly supervised summarization of web videos,” in
Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 3677–3686.
[Pfau, 2016] D. Pfau and O. Vinyals, “Connecting generative adversarial networks and actor-critic methods,” in NIPS Workshop
on Adversarial Training, 2016.
[Potapov, 2014] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proc. of the
European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 540–555.
Key references
107
108. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Rochan, 2018] M. Rochan, L. Ye, and Y. Wang, “Video summarization using fully convolutional sequence networks,” in Proc. of
the European Conference on Computer Vision (ECCV) 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham:
Springer International Publishing, 2018, pp. 358–374.
[Rochan, 2019] M. Rochan and Y. Wang, “Video summarization by learning from unpaired data,” in Proc. of the 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[Savioli, 2019] N. Savioli, “A hybrid approach between adversarial generative networks and actor-critic policy gradient for low
rate high-resolution image compression,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019.
[Smith, 2017] J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie
Trailer Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–
1808.
[Song, 2015] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “TVSum: Summarizing web videos using titles,” in Proc. of the 2015
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5179–5187.
[Song, 2016] X. Song, K. Chen, J. Lei, L. Sun, Z. Wang, L. Xie, and M. Song, “Category driven deep recurrent neural network for
video summarization,” in Proc. of the 2016 IEEE Int. Conf. on Multimedia Expo Workshops (ICMEW), July 2016, pp. 1–6.
[Szegedy, 2015] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A.
Rabinovich, “Going deeper with convolutions,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2015, pp. 1–9.
Key references
108
109. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Vinyals, 2015] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in Advances in Neural Information Processing Systems
28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2692–2700.
[Wang, 2019] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, and T. Tan, “Stacked memory network for video summarization,” in
Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New York, NY, USA: ACM, 2019, pp. 836–844.
[Wang, 2016] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good
practices for deep action recognition,” in Proc. of the European Conference on Computer Vision – ECCV 2016, B. Leibe, J.
Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 20–36.
[Wei, 2018] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao, “Video summarization via semantic attended networks,” in Proc. of
the 2018 AAAI Conf. on Artificial Intelligence (AAAI), 2018.
[Yu, 2017] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence generative adversarial nets with policy gradient,” in Proc. of
the 2017 AAAI Conf. on Artificial Intelligence, ser. (AAAI). AAAI Press, 2017, pp. 2852–2858.
[Yuan, 2019a] L. Yuan, F. E. H. Tay, P. Li, L. Zhou, and J. Feng, “Cycle-SUM: Cycle-consistent adversarial LSTM networks for
unsupervised video summarization,” in Proc. of the 2019 AAAI Conf. on Artificial Intelligence (AAAI), 2019.
[Yuan, 2019b] Y. Yuan, T. Mei, P. Cui, and W. Zhu, “Video summarization by learning deep side semantic embedding,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 226–237, Jan 2019.
[Yuan, 2019c] Y. Yuan, H. Li, and Q. Wang, “Spatiotemporal modeling for video summarization using convolutional recurrent
neural network,” IEEE Access, vol. 7, pp. 64 676–64 685, 2019.
Key references
109
110. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Zhang, 2016a] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video
summarization,” in Proc. of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp.
1059–1067.
[Zhang, 2016b] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in Proc. of
the European Conference on Computer Vision (ECCV) 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer
International Publishing, 2016, pp. 766–782.
[Zhang, 2018] Y. Zhang, X. Liang, D. Zhang, M. Tan, and E. P. Xing, “Unsupervised object-level video summarization with online
motion auto-encoder,” Pattern Recognition Letters, 2018.
[Zhang, 2019] Y. Zhang, M. Kampffmeyer, X. Zhao, and M. Tan, “DTR-GAN: Dilated temporal relational adversarial network for
video summarization,” in Proc. of the ACM Turing Celebration Conference - China, ser. ACM TURC ’19. New York, NY, USA:
ACM, 2019, pp. 89:1–89:6.
[Zhao, 2017] B. Zhao, X. Li, and X. Lu, “Hierarchical recurrent neural network for video summarization,” in Proc. of the 2017 ACM
on Multimedia Conference, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 863–871.
[Zhao, 2018] B. Zhao, X. Li, and X. Lu, “HSA-RNN: Hierarchical structure-adaptive RNN for video summarization,” in Proc. of the
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7405–7414.
[Zhao, 2019] B. Zhao, X. Li, and X. Lu, “Property-constrained dual learning for video summarization,” IEEE Transactions on
Neural Networks and Learning Systems, 2019.
Key references
110
111. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
[Zhou, 2018a] K. Zhou, T. Xiang, and A. Cavallaro, “Video summarisation by classification with deep reinforcement learning,” in
Proc. of the 2018 British Machine Vision Conference (BMVC), 2018.
[Zhou, 2018b] K. Zhou and Y. Qiao, “Deep reinforcement learning for unsupervised video summarization with diversity-
representativeness reward,” in Proc. of the 2018 AAAI Conference on Artificial Intelligence (AAAI), 2018.
Key references
111
112. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Vasileios Mezaris
bmezaris@iti.gr
Evlampios Apostolidis
apostolid@iti.gr
CERTH-ITI, Greece
info@retv-project.eu
This work has received funding from the
European Union’s Horizon 2020 research
and innovation programme under grant
agreement H2020-780656 ReTV
Questions?
Following the Q&A session and the break, we will be back with Part II of the tutorial, on video summaries re-use and recommendation