2. Definitions, Background, Related Work
Multimedia Remixing Support System
Video Clip Sequence Creation
Music Clip Selection
Shot Extraction
Conclusion and Future Work
3. From Wikipedia…
A remix is a song that has been edited to sound different from the original version.
The person who remixed it might have changed the pitch of the singers' voice,
changed the tempo and speed and has made the song shorter or longer, or
instead of hearing just one person singing they might have duplicated the voice to
make it sound like two people are singing, or make the voice echo.
Remixes should not be confused with edits, which usually involve shortening a
final stereo master for marketing or broadcasting purposes. … A remix song
recombines audio pieces from a recording to create an altered version of the song.
In recent years the concept of the remix has been applied analogously to
other media. …. Scary Movie series is famous for its comic remix of various well-
known horror movies such as Ring, Scream, and Saw.
4. Video Remix: a video clip made by recombining various media
components to create an altered version of the original videos.
Video transition effects
(Cut, fade-in/out,
dissolve, etc.)
Audio clips
(music, sound effects,
voices, etc.)
Original video clips
Video remixes (e.g. movie trailers)
Video clip
selection & arrangement
Multimedia stream
Combination
How can we create video remixes of good quality?
from “The School
of Rock” (2003)
5. Semantic Aspect:
What should we present? (Semantic Content)
Highlights of Sports Games, etc.
Affective Aspect:
How should we present the video content?
(Aesthetic Compatibility, Film Syntax)
Commercial Films, Movie Trailers, etc.
How to arrange video clips or what music clip to augment
to enhance the expressive quality
Two aspects in video remixing
Video Summarization
6. Problem of Video Remixing
Video Remix = A sequence of D video scenes
A video scene = A sequence of L video shots (Shot-Scene Relation)
A video shot = An excerpt from a video clip
A sequence of D music clips (Scene-Music Relation)
A music clip: one per video scene, to maintain the feeling of continuity in a scene
8. Video clip selection and arrangement
Focused on how various types of video clips are arranged in sequence.
For example…
Film Syntax
• A scene has to have at least three video clips [Sundaram01].
• Two video shots of extremely different shot sizes should not be connected [Kumano02].
• The duration of a shot recorded with the camera fixed is up to 15 seconds [Kumano02].
Aesthetic Compatibility
• Shots with similar emotional impact should be connected [Canini10].
[Sundaram01] H. Sundaram, et al., “Condensing computable scenes using visual complexity and film syntax analysis,” Proc. ICME, pp.389-392, 2001.
[Kumano02] M. Kumano, et al., “Video editing support system based on video content analysis,” Proc. ACCV, pp.628-633, 2002.
[Canini10] L. Canini, et al., “Interactive video mashup based on emotional identity,” Proc. European Signal Processing Conf., pp.1499-1503, 2010.
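The film-syntax rules listed on this slide can be checked mechanically once each shot carries a few attributes. The following is a minimal Python sketch and not the cited authors' implementation: the `Shot` fields, the ordinal shot-size scale, and the `max_size_jump` threshold are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    duration_sec: float
    shot_size: int       # hypothetical ordinal scale: 0 = close-up ... 4 = long shot
    camera_fixed: bool


def film_syntax_violations(scene: List[Shot], max_size_jump: int = 2) -> List[str]:
    """Return descriptions of the film-syntax rules that the scene breaks."""
    violations = []
    # [Sundaram01]: a scene has at least three video clips.
    if len(scene) < 3:
        violations.append("scene has fewer than three clips")
    # [Kumano02]: shots of extremely different sizes should not be connected.
    # "Extremely different" is modeled here by the assumed max_size_jump.
    for i, (a, b) in enumerate(zip(scene, scene[1:])):
        if abs(a.shot_size - b.shot_size) > max_size_jump:
            violations.append(f"shot size jump between shots {i} and {i + 1}")
    # [Kumano02]: a shot recorded with the camera fixed lasts up to 15 seconds.
    for i, shot in enumerate(scene):
        if shot.camera_fixed and shot.duration_sec > 15.0:
            violations.append(f"fixed-camera shot {i} longer than 15 s")
    return violations


good_scene = [Shot(4.0, 1, True), Shot(6.0, 2, False), Shot(3.0, 2, True)]
bad_scene = [Shot(20.0, 0, True), Shot(5.0, 4, False)]
```

Here `good_scene` passes all three checks, while `bad_scene` breaks each rule once. Hand-written rules like these are what the example-based approach in the later slides tries to avoid having to specify explicitly.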
9. Music clip selection
Focused on which types of music clips are mixed with video shots.
For example…
Aesthetic Compatibility
Determined heuristically
• Dynamics, motion, and pitch of the image and audio streams coincide with each other [Mulhem03].
• Novelty, velocity, and brightness of the image and audio streams coincide with each other [Yoon09].
Determined statistically
• Brightness of the image and audio streams, and the rhythm of the audio stream and the optical flow in the image stream, coincide with each other [Cristani10].
[Mulhem03] P. Mulhem, et al., “Pivot vector space approach for audio-video mixing,” IEEE Multimedia, 10(2), pp.28-40, 2003.
[Yoon09] J.-C. Yoon, et al., “Automated music video generation using multi-level feature-based segmentation,” MTAP, 41(2), pp.197-214, 2009.
[Cristani10] M. Cristani, et al., “Toward an automatically generated soundtrack from low-level cross-modal correlations for automotive scenarios,” Proc. ACM Multimedia, pp.551-559, 2010.
11. It is difficult to explicitly define the rules and
know-how about how the video and music clips
should be arranged, considering the aesthetic
compatibility.
The rules and structures commonly used in
professionally created examples can be modeled
by standard machine learning techniques.
Non-professional users can be supported through an
interface based on models that implicitly
describe shot-scene and scene-music relations
considering aesthetic compatibility.
12. A Set of Video Remix Examples
Professionally Created Video Remixes
13. A Set of Video Remix Examples
Target: Remixing original video clips based on Examples
A Set of Music Clips
A Set of Original Video Clips
video remix
14. [Figure: system overview. Through the interface, the user creates a video remix from a set of video clips and a set of music clips. A video remix template (scenes and shots) generated from a set of video remix examples drives the video clip suggestions.]
I) Video Clip Sequence Creation
II) Music Clip Selection
III) Shot Extraction (Video and Music Synchronization)
15. N. Nitta and N. Babaguchi, “Example-based video remixing,” Multimedia Tools and Applications,
51(2), pp.649-673, 2011
N. Nitta and N. Babaguchi, “Example-based home video remixing,” Proc. ICME, 2011
16. Overview of Procedure I)
[Figure: Video Remix Examples → Template Generation → Template (symbol sequence: B A B C G E) → Interface. Home (Personal) Videos → Segmentation → Video Clips, each rated by Suitability to Template [Nitta2011] and Perceived Quality [Tao2007] → Interface.]
T. Mei, et al., "Home Video Visual Quality Assessment With Spatiotemporal Factors," IEEE Trans.
Circuits and Systems for Video Technology, vol.17, no.6, pp.699-706, 2007.
17. Example-based Template Generation
[Figure: Video Remix Examples → Feature Extraction → Symbolization → Symbol Sequence → HMM → GA → Video Remix Template.]
Low-level Features: Shot Length, Brightness, Motion Intensity, w/wo Camera Work, w/wo Human Objects
Sequences of video shots are symbolized (symbols a, b, c, …, i) into symbol sequences.
HMM states: Slow Scene, Active Scene
Video Remix Template = New Symbol Sequence (A Sequence of L Shots) & State Sequence (A Sequence of D Scenes)
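One way to read the pairing of a symbol sequence with a state sequence: given the symbols of a run of shots, the most likely HMM state sequence labels each shot with a scene type. Below is a minimal sketch of Viterbi decoding for a discrete-symbol HMM. The two states, the two-symbol alphabet, and every probability are hypothetical values chosen for illustration, and the GA-based generation of a new symbol sequence named on the slide is not reproduced here.

```python
import numpy as np


def viterbi(symbols, start_p, trans_p, emit_p):
    """Most likely state sequence of a discrete-symbol HMM (log domain)."""
    n_states = len(start_p)
    n_steps = len(symbols)
    log_delta = np.full((n_steps, n_states), -np.inf)
    backptr = np.zeros((n_steps, n_states), dtype=int)
    log_delta[0] = np.log(start_p) + np.log(emit_p[:, symbols[0]])
    for t in range(1, n_steps):
        # scores[i, j]: best log-probability of being in state i at t-1, then moving to j
        scores = log_delta[t - 1][:, None] + np.log(trans_p)
        backptr[t] = np.argmax(scores, axis=0)
        log_delta[t] = np.max(scores, axis=0) + np.log(emit_p[:, symbols[t]])
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(log_delta[-1]))]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]


# Hypothetical parameters: state 0 = slow scene, state 1 = active scene.
start_p = np.array([0.5, 0.5])
trans_p = np.array([[0.8, 0.2],
                    [0.2, 0.8]])
emit_p = np.array([[0.9, 0.1],    # slow scene mostly emits symbol 0
                   [0.1, 0.9]])   # active scene mostly emits symbol 1
shot_symbols = [0, 0, 0, 1, 1, 1]
scene_states = viterbi(shot_symbols, start_p, trans_p, emit_p)
```

With these toy parameters the six shots split into a slow scene followed by an active scene, which is the kind of shot-to-scene grouping the template encodes.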
18. From Shot to Video Clip
Shots in the target video are divided into video clips based on the camerawork.
A Home Video:
                          Video Clip 1   Video Clip 2   Video Clip 3
Suitability to Template   0.3            0.2            0.7
Perceived Quality         0.7            0.5            0.6
19. Video clip selection
[Figure: interface. The video remix template is shown in a timeline presentation. Video clips are shown in a 3D book-style presentation (spine, fore edges) together with their suitability to the template and perceived quality.]
21. Video remix examples: 61 action movie trailers
Video clips: 265 home (personal) video clips recording a sports
field day held by a kindergarten
Subjective evaluation by 8 subjects
Compared with a video clip sequence created by considering only
the perceived quality of video clips
22. With Template*: Subjective Score 3
Without Template: Subjective Score 3.5
* Selected video clips are shortened according to the template
Created Video Clip Sequence
Using action movie trailers as examples
resulted in creating a sequence of many short video clips
23. N. Nitta and N. Babaguchi, “Example-based video remixing support system,”
Proc. ACM Multimedia, pp.563-572, 2011
24. Video Clip Sequence (Scene)
Overview of Procedure II)
A Set of Video Remix Examples (Scenes)
A set of Music Clips
visually similar
video remix examples
similar music clips
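The two-step lookup outlined for Procedure II (find visually similar example scenes, then find music clips similar to the music mixed with them) can be sketched as a pair of nearest-neighbor searches. This is an illustrative sketch only: the function name, the feature vectors, the Euclidean distance, and `k` are assumptions, and in the approach described on the following slides the music distances would be taken in a transformed feature space.

```python
import numpy as np


def select_music(scene_feat, example_scene_feats, example_music_feats,
                 candidate_music_feats, k=3):
    """For each of the k example scenes closest to the input scene, return the
    index of the candidate music clip closest to that example's music."""
    scene_dists = np.linalg.norm(example_scene_feats - scene_feat, axis=1)
    nearest_examples = np.argsort(scene_dists)[:k]
    selected = []
    for e in nearest_examples:
        music_dists = np.linalg.norm(
            candidate_music_feats - example_music_feats[e], axis=1)
        selected.append(int(np.argmin(music_dists)))
    return selected


# Hypothetical toy features: 2-D visual features, 1-D audio features.
example_scene_feats = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
example_music_feats = np.array([[0.0], [1.0], [9.0]])
candidate_music_feats = np.array([[0.1], [0.9], [8.0]])
scene = np.array([0.2, 0.1])
picks = select_music(scene, example_scene_feats, example_music_feats,
                     candidate_music_feats, k=2)
```

In this toy setup the input scene is closest to the first two examples, so the first two candidate clips are returned.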
25. Evaluate the compatibility among video scenes and music clips by their
distances in the video scene and music feature spaces
Learn non-linear mapping of music feature space so that the distances
among video scenes and the mixed music clips would be correlated
[Suzuki07]
Music Clip Feature Space
(Music Clips
Mixed to Example Video Scenes)
Video scene feature space
(Example Video Scenes)
Expected Music clip feature space
(Music Clips
Mixed to Example Video Scenes)
[Suzuki07] K. Suzuki, et al., “A similarity-based neural network for facial expression analysis,” Pattern Recognition Letters, 28(9), pp.1104-1111, 2007
26. Music Clip Selection
Video Scenes・・・Visual Features
Music Clips・・・Audio Features
[Zettl99]
Emotion-based Music Classification
[Zettl99] H. Zettl, “Sight Sound Motion: Applied Media Aesthetics,” Wadsworth Publishing, 1999
27. Consists of 2 Neural Networks
Input: Audio features $x_i^A$ and $x_i^B$ of music clips A and B
Output: Transformed audio features $y_j^A$ and $y_j^B$ of music clips A and B
Learn the weights $w_{l,m}$ of the neural networks so that the difference between the distance $d_{AB}$ of $y^A$ and $y^B$ and the distance $T_{AB}$ of the video scenes mixed with music clips A and B is minimized.
$w_{l,m}$: Weight for the edge between nodes $l$ and $m$.
[Figure: Input A ($x_i^A$) → Neural Network A → $y_j^A$ and Input B ($x_i^B$) → Neural Network B → $y_j^B$; a distance calculation on the two outputs gives $d_{AB}$, which is compared with the teacher $T_{AB}$ (distance of the video scenes mixed with music clips A and B).]
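The training objective on this slide, matching the distance of the transformed audio features to the teacher distance of the corresponding video scenes, can be sketched for a single pair of music clips. This is a simplified illustration rather than the network of [Suzuki07]: it assumes one tanh layer, weights shared between networks A and B, a squared-error loss, and plain gradient descent, and all input values are hypothetical.

```python
import numpy as np


def forward(W, x):
    """One tanh layer standing in for the (possibly deeper) network."""
    return np.tanh(W @ x)


def loss_and_grad(W, xA, xB, T_AB, eps=1e-12):
    """Squared difference between the output distance d_AB and the teacher T_AB."""
    yA, yB = forward(W, xA), forward(W, xB)
    diff = yA - yB
    d_AB = np.sqrt(np.sum(diff ** 2) + eps)
    loss = (d_AB - T_AB) ** 2
    # Gradient of the loss with respect to yA (the gradient for yB is its negative).
    g = 2.0 * (d_AB - T_AB) * diff / d_AB
    # Backpropagate through tanh into the shared weight matrix.
    grad = np.outer(g * (1 - yA ** 2), xA) - np.outer(g * (1 - yB ** 2), xB)
    return loss, grad


# Hypothetical audio features of music clips A and B, and a teacher distance.
xA = np.array([1.0, 0.0, 0.5])
xB = np.array([0.0, 1.0, 0.5])
T_AB = 1.0
W = np.array([[0.1, -0.1, 0.0],
              [0.0, 0.1, 0.1]])

initial_loss, _ = loss_and_grad(W, xA, xB, T_AB)
for _ in range(500):
    _, grad = loss_and_grad(W, xA, xB, T_AB)
    W -= 0.05 * grad   # assumed learning rate
final_loss, _ = loss_and_grad(W, xA, xB, T_AB)
```

After training, the distance between the two transformed feature vectors approaches the teacher distance. A real setup would sum this loss over many pairs of music clips.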
29. Video Remix Examples: 61 Action Movie Trailers
Video Scene Examples :45 Scenes
Music Clips:180 Music Clips of Various Genres
(Movie Soundtracks, Classical Music, Japanese-pop, Western-pop, etc.)
Video Clips:
Shots extracted from Original Movies
265 Home Video Clips recording a sports field day held by a kindergarten
Video Clip Sequence:
Made by Procedure I)
30. Input: 10 video scenes randomly extracted from movie trailers (without audio stream)
10 subjects rated (1: very bad – 10: very good) the 10 video scenes mixed with
Video1) 3 music clips selected by the proposed approach
Video2) Music clips most similar to the music excerpts mixed with the 3 least similar video scenes
Video3) Music clip mixed with the video scenes in the movie trailers (baseline: professional)
Video4) 3 music clips selected in the same way as for Video1) without music feature space transformation
Video5) 3 music clips selected in the same way as for Video2) without music feature space transformation
Video1 – Video2 = 1.72±0.34 (95% confidence interval)
⇒ indicates the effectiveness of similarity-based music clip selection
Video1 – Video4 = 1.11±0.35
⇒ indicates the effectiveness of music feature space transformation
Video1 → closest to Video3
⇒ selected music clips are subjectively closest to professionally selected ones
Average Subjective Scores
Video1: 6.1
Video2: 4.4
Video3: 7.2
Video4: 5.0
Video5: 4.5
33. With Template*: Subjective Score 5.3
Without Template: Subjective Score 3.8
Video Clip Sequence after Music Mixing
Subjective score improved largely after music mixing
Created video clip sequence and selected music clips
are synergetic in improving the expressive quality.
* Selected video clips are shortened according to the template
34. Y. Kurihara, N. Nitta, and N. Babaguchi, “Automatic appropriate segment extraction from shots
based on learning from example videos,” Proc. PSIVT, pp.1082-1093, 2009
Y. Kurihara, N. Nitta, and N. Babaguchi, “Appropriate segment extraction from shots based on
temporal patterns of example videos,” Proc. MMM, pp.253-264, 2008
35. Video Clip Sequence: Video Clip 1, Video Clip 2, Video Clip 3
Video Remix: Shot 1, Shot 2, Shot 3
A video clip needs to be shortened.
A video clip contains redundant parts.
Which part of a video clip should be extracted as a shot?
Shot Extraction from Selected Video Clip
36. k frames
Discarded part
(Non-shot)
Selected Part
(Shot)
Video Clip Example Video Clip
Shot
Extraction
Feature Extraction
Pattern
Scan for the k frames
that best match the shot HMM
Feature Extraction
ShotSymbolization
Symbol
Sequence
Shot HMM
Non-shot HMM
Overview of Procedure III)
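The scan described on this slide can be sketched as a sliding window: every k-frame window of the symbolized video clip is scored under the shot HMM, and the best-scoring window is extracted. The sketch below assumes a discrete-symbol HMM scored with the forward algorithm. The HMM parameters and the symbol sequence are hypothetical, and the non-shot HMM shown on the slide is not used here.

```python
import numpy as np


def log_likelihood(symbols, start_p, trans_p, emit_p):
    """Forward algorithm in the log domain for a discrete-symbol HMM."""
    log_alpha = np.log(start_p) + np.log(emit_p[:, symbols[0]])
    for s in symbols[1:]:
        # Log-sum-exp over the previous states for each current state.
        m = log_alpha[:, None] + np.log(trans_p)
        log_alpha = np.logaddexp.reduce(m, axis=0) + np.log(emit_p[:, s])
    return float(np.logaddexp.reduce(log_alpha))


def extract_shot(clip_symbols, k, shot_hmm):
    """Return the start frame of the k-frame window that best matches shot_hmm."""
    scores = [log_likelihood(clip_symbols[s:s + k], *shot_hmm)
              for s in range(len(clip_symbols) - k + 1)]
    return int(np.argmax(scores))


# Hypothetical shot HMM (start, transition, emission) that favors symbol 1.
shot_hmm = (np.array([0.5, 0.5]),
            np.array([[0.9, 0.1], [0.1, 0.9]]),
            np.array([[0.2, 0.8], [0.3, 0.7]]))
clip_symbols = [0, 0, 1, 1, 1, 0, 0, 0]
start_frame = extract_shot(clip_symbols, k=3, shot_hmm=shot_hmm)
```

Here the window starting at frame 2 contains only the favored symbol, so it is selected. Scoring the same windows under a non-shot HMM and comparing the two likelihoods would be a natural extension.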
37. •Shot Classification
action and conversation
•Feature extraction
Shot
Action Conversation Scenery ・・・
※VSTD: Volume Standard Deviation
LVFR: Low Volume Signal Ratio
ERSB: Energy Ratio of Frequency SubBand
ZCR: Zero Crossing Ratio
Each type of shot is characterized
by different features
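Two of the audio features named on this slide are easy to illustrate. The slide gives only the feature names, so the definitions below are assumptions based on common usage: ZCR as the fraction of consecutive sample pairs whose signs differ, and VSTD as the standard deviation of per-frame RMS volume normalized by the maximum frame volume. The frame length and the sample values are hypothetical.

```python
import numpy as np


def zero_crossing_ratio(samples):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.sign(samples)
    return float(np.mean(signs[:-1] * signs[1:] < 0))


def volume_std(samples, frame_len):
    """Std of per-frame RMS volume, normalized by the maximum frame volume."""
    n_frames = len(samples) // frame_len
    frames = np.reshape(samples[:n_frames * frame_len], (n_frames, frame_len))
    volume = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.std(volume) / np.max(volume))


# Hypothetical signal that alternates in sign at constant amplitude.
signal = np.array([1.0, -1.0, 1.0, -1.0])
zcr = zero_crossing_ratio(signal)
vstd = volume_std(signal, frame_len=2)
```

For this alternating signal every adjacent pair crosses zero and every frame has the same volume, so the ZCR is at its maximum and the VSTD is zero.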
38. Experiments
Examples: Movies + Trailers
Video Clips: Shots in Movies
Shots: Shots in Trailers
Shot extraction from 69 video clips (shots in movies)
Shot Length (k) = Length of the corresponding shots in trailers
(on average 32.3% of the video clip length)
             Action   Conversation
Training     10       12
Test         47       22
39. Objective Evaluation
Video Clip (Action)
Ground Truth
(Shot in Trailer)
Extracted Shot
82 frames
k = 9 frames
Difference: 3 frames (0.3 sec)
• Compare Extracted Shot
with Ground Truth
• 1 frame = 0.1 sec
43. Objective Evaluation
$\text{accuracy} = \dfrac{\#\ \text{of correct extractions}}{\#\ \text{of video clips}}$
※Correct Extraction : Shot was extracted within T-frame Difference
1 frame = 0.1 sec
Correct shots were extracted
from 72.5%(50/69) of video clips when T=5
        Action        Conversation
T=2     53% (25/47)   50% (11/22)
T=3     60% (28/47)   64% (14/22)
T=5     72% (34/47)   73% (16/22)
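The accuracy measure can be written as a small function. This is a sketch under one assumption: "within T-frame difference" is read as an absolute start-frame difference of at most T. The per-clip differences in the example are hypothetical; only the 50-out-of-69 figure comes from the slide.

```python
def extraction_accuracy(frame_differences, tolerance):
    """Share of video clips whose extracted shot lies within `tolerance`
    frames of the ground-truth shot."""
    correct = sum(1 for d in frame_differences if abs(d) <= tolerance)
    return correct / len(frame_differences)


# Hypothetical frame differences for five video clips.
example_differences = [3, 0, 7, 5, 12]
acc_T5 = extraction_accuracy(example_differences, tolerance=5)

# Overall figure reported on the slide for T = 5: 50 correct out of 69 clips.
overall_T5 = 50 / 69
```

With the hypothetical differences, three of five clips count as correct at T = 5. The reported overall accuracy follows from the two per-class counts: (34 + 16) / (47 + 22) = 50 / 69.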
44. 14 subjects watched the original long video clips and then
three kinds of short extracted shots:
①Ground Truth
②Extracted Shot
③Random Shot
in random order, and ranked them.
(Ties were allowed.)
46. Subjective Evaluation
Action: 18 video clips
Conversation: 13 video clips
                 Rank 1   Rank 2   Rank 3
Ground Truth     69.1%    26.9%    4.0%
Extracted Shot   53.9%    38.9%    7.2%
Random Shot      7.1%     12.9%    80.0%
Extracted Shot ≒ Ground Truth >> Random Shot
47. Created Video Remix
With Template: Subjective Score 6.2
Without Template: Subjective Score 3.9
          Proposed             Comparative
          I     II    III      I     II    III
Score     3     5.3   6.2      3.5   3.8   3.9
Length (min:sec): 0:36, 0:43, 10:56, 10:59
48. Introduced an example-based approach for video
remixing
Video Clip Sequence Creation
Music Clip Selection
Shot Extraction
Interface
Experiments using movie trailers as remix examples
and movies and home videos as video clips
Verified the effectiveness of using remix examples
Subjective score: With Support (6.2), Without Support (3.9)
Conclusion
49. Improvement of Interface
More investigations using various types/genres
of video remix examples
How many examples do we need?
Good examples may reduce the number of
examples needed.