ISMIR 2012 Tutorial 2
Music Affect Recognition: The State-of-the-Art and Lessons Learned
Xiao Hu, Ph.D., The University of Hong Kong
Yi-Hsuan Eric Yang, Ph.D., Academia Sinica, Taiwan
Speaker
(speaker introduction slides)

The Audience
Do you believe that music is powerful? Why do you think so?
Have you searched for music by affect?
Have you searched for other things (photos, videos) by affect?
Have you questioned the difference between emotion and mood?
Is your research related to affect?
Music Affect:
(four example slides, images only)
Agenda
Grand challenges on music affect
Music affect taxonomy and annotation
Automatic music affect analysis
  Categorical approach
  Multimodal approach
  Dimensional approach
  Temporal approach
Beyond music
Conclusion
Emotion or Mood?
Mood: "relatively permanent and stable"
Emotion: "temporary and evanescent"
"Most of the supposed [psychological] studies of emotion in music are actually concerned with mood and association." (Leonard Meyer)
Meyer, Leonard B. (1956). Emotion and Meaning in Music. Chicago: University of Chicago Press.
Expressed or Induced?
Designated/indicated/expressed by a music piece
Induced/evoked/felt by a listener
Both are studied in MIR
They mainly differ in the way labels are collected:
  "indicate how you feel when listening to the music" (induced)
  "indicate the mood conveyed by the music" (expressed)

Which Moods? 1/2
Different websites / studies use different terms
Thayer's stress-energy model gives 4 clusters
Farnsworth's 10 adjective groups
Tellegen-Watson-Clark model
Which Moods? 2/2
Lack of a general theory of emotions
  Ekman's 6 basic emotions: anger, joy, surprise, disgust, sadness, fear
Verbalization of emotional states is often a "distortion" (Meyer, 1956)
  "unspeakable feelings"
  "a restful feeling throughout ... like one of going downstream while swimming"

Sources of Music Emotion
Intrinsic (structural characteristics of the music)
  e.g., modality -> happy vs. sad
  What about melody?
Extrinsic emotion (semantic context related to, but outside, the music)
Lee et al. (2012) identified a range of factors in people's assessment of music mood:
  lyrics, tempo, instrumentation, genre, delivery, and even cultural context
Little is known about the mapping of these factors to music mood
Lee, J. H., Hill, T., & Work, L. (2012). What does music mood mean for real users? In Proceedings of the iConference.
Let's Ask the Users... (Lee et al., 2012)
(survey results slide)

Data, Data, Data!
Mood annotations are an extremely scarce resource:
  Annotations are time consuming
  Consistency is low across annotators
Existing public datasets on mood:
  MoodSwings Turk dataset: 240 30-sec clips; arousal-valence scores
  MIREX mood classification task: 600 30-sec clips; in 5 mood clusters
  MIREX tag classification task (mood sub-task): 3,469 30-sec clips; in 18 mood-related tag groups
  Yang's emotion regression dataset: 193 25-sec clips; 11-level arousal and valence scales
Suboptimal Performance
MIREX Mood Classification (2012): accuracy 46%-68%
MIREX Tag Classification, mood subtask (2011)

Newer Challenges
Cross-cultural applicability
  Existing efforts focus on Western music
  OS1 @ ISMIR 2012 (tomorrow): Yang & Hu, Cross-cultural Music Mood Classification: A Comparison on English and Chinese Songs
Personalization
  The ultimate solution to the subjectivity problem
Contextualization
  Even the same person's emotional responses change across times, locations, and occasions
  PS1 @ ISMIR 2012 (tomorrow): Watson & Mandryk, Modeling Musical Mood From Audio Features and Listening Context on an In-Situ Data Set
Summary of Challenges
Terminology
Models and categories: no consensus
Sources and factors: no clear mapping between sources and affects
Data scarcity
Suboptimal performances
Newer issues: cross-cultural, personalization, contextualization, ...

Agenda
Next: Music affect taxonomy and annotation
Music Affect Taxonomy and Annotation
Background
  What are taxonomies?
  Taxonomy vs. folksonomy
Developing music mood taxonomies
  Taxonomy from editorial labels
  Taxonomies from social tags
Annotations
  Experts
  Subjects
  Crowdsourcing (e.g., MTurk, games)
  Derived from online services

Taxonomy
A domain-oriented controlled vocabulary
Contains labels (metadata)
Commonly used on websites: pick lists, browsable directories, etc.
Taxonomy vs. Folksonomy
Taxonomy
  Controlled, structured vocabulary
  Often requires expert knowledge
  Top-down and bottom-up approaches
Folksonomy
  Uncontrolled, unstructured vocabulary
  Social tags freely applied by users
  Commonality exists across large numbers of tags

Models in Music Psychology 1/2
Categorical: Hevner's adjective circle (1936)
Hevner, K. (1936). Experimental studies of the elements of expression in music. American Journal of Psychology, 48.
Models in Music Psychology 2/2
Dimensional: Russell's circumplex model
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39: 1161-1178.

Borrow from Psychology to MIR
Thayer's stress-energy model gives 4 clusters
Farnsworth's 10 adjective groups
Tellegen-Watson-Clark model
Grounded in music perception research, but lacking the social context of music listening (Juslin & Laukka, 2004)
Juslin, P. N. & Laukka, P. (2004). Expression, perception, and induction of musical emotions: a review and a questionnaire study of everyday listening. JNMR.
Taxonomy Built from Editorial Labels
Editorial labels:
  Given by professional editors of online repositories
  Have a certain level of control
  Rooted in realistic social contexts
allmusic.com: "the most comprehensive music reference source on the planet"
  288 mood labels created and assigned to music works
Mood Label Clustering
(figures: clustering of mood labels for albums and for songs, clusters C1-C5)

A Taxonomy of 5 Mood Clusters
Cluster 1: passionate, rousing, confident, boisterous, rowdy
Cluster 2: rollicking, cheerful, fun, sweet, amiable/good-natured
Cluster 3: literate, poignant, wistful, bittersweet, autumnal, brooding
Cluster 4: humorous, silly, campy, quirky, whimsical, witty, wry
Cluster 5: aggressive, fiery, tense/anxious, intense, volatile, visceral
Hu, X., & Downie, J. S. (2007). Exploring Mood Metadata: Relationships with Genre, Artist and Usage Metadata. In Proceedings of ISMIR.
Taxonomy from Social Tags
Social tags (last.fm: "the largest music tagging site for Western music")
Pros:
  Users' perspectives
  Large quantity
Cons:
  Non-standardized
  Ambiguous
Approach: combine linguistic resources and human expertise

The Method
   1,586 terms in WordNet-Affect (a lexicon of affective words)
 - 202 evaluation terms in General Inquirer ("good", "great", "poor", etc.)
 - 135 non-affect / ambiguous terms identified by experts ("cold", "chill", "beat", etc.)
 = 1,249 candidate terms
476 of these terms appear as last.fm tags
Group the tags by WordNet-Affect and experts => 36 mood categories
(A toy sketch of this filtering follows.)
Hu, X. (2010). Music and Mood: Where Theory and Reality Meet. In Proceedings of the 5th iConference (Best Student Paper).
2-D Mood Taxonomy / Comparison to Russell's 2-D Model
(figures: the 36 categories in a 2-dimensional representation, compared with Russell's model)
Our Taxonomy
(figure: the categories arranged along valence and arousal axes)

Laurier et al. (2009): Taxonomy from Social Tags 1/2
Manually compiled 120 mood words from the literature
Crawled 6.8M social tags from last.fm
107 unique tags matched the mood words
80 tags had more than 100 occurrences
  Most used:  sad, fun, melancholy, happy
  Least used: rollicking, solemn, rowdy, tense
Laurier et al. (2009). Music mood representations from social tags. ISMIR.
Laurier et al. (2009): Taxonomy from Social Tags 2/2
Used LSA to project the tag-track matrix to a space of 100 dimensions
Clustering trials with varied numbers of clusters (maximizing intra-cluster similarity and inter-cluster dissimilarity)
  Cluster 1 (+A -V): angry, aggressive, visceral, rousing, intense, confident, anger
  Cluster 2 (-A -V): sad, bittersweet, sentimental, tragic, depressing, sadness, spooky
  Cluster 3 (-A +V): tender, soothing, sleepy, tranquil, quiet, calm, serene
  Cluster 4 (+A +V): happy, joyous, bright, cheerful, humorous, gay, amiable
(A toy sketch of the LSA + clustering step follows.)

Agreement between Laurier's and the 5-Cluster Taxonomy
Pairwise scores among the 5 MIREX clusters, computed in Laurier's 100-dimensional space (diagonal = 0):
      C1    C2    C3    C4    C5
C1    0    .74   .13   .20   .11
C2          0    .86   .82   .88
C3                0    .32   .27
C4                      0    .53
C5                            0
Laurier et al. (2009). Music mood representations from social tags. ISMIR.
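A rough sketch of the tag-space procedure, assuming scikit-learn as a stand-in for the original tooling; the tag-track matrix is synthetic, and because this toy matrix has only 80 tags we project to fewer than the paper's 100 dimensions.

```python
# Sketch: LSA (truncated SVD) on a tag-track matrix, then k-means on tags.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tag_track = rng.poisson(0.2, size=(80, 5000)).astype(float)  # 80 tags x 5,000 tracks (synthetic counts)

# The paper projects to 100 dimensions; a 80-tag toy matrix supports fewer.
tag_space = TruncatedSVD(n_components=50, random_state=0).fit_transform(tag_track)

# The paper's trials settled on 4 clusters (angry / sad / tender / happy).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(tag_space)
print(np.bincount(labels))  # cluster sizes
```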
Summary on Taxonomy
What are taxonomies?
Taxonomy vs. folksonomy
Developing music mood taxonomies
  From editorial labels
  From social tags

Mood Annotations
All annotation needs three things: a taxonomy, music, and people
People:
  Experts
  Subjects
  Crowdsourcing (e.g., MTurk, games)
  Annotations derived from online services
Expert Annotation
The MIREX Audio Mood Classification (AMC) task
  5-cluster taxonomy
  1,250 tracks selected from the APM libraries
  A Web-based annotation system called Evalutron 6000 (E6K)
  Dataset built from agreements among experts

Expert Annotation: MIREX AMC
2,468 judgments collected (3,750 planned)
Each expert had 250 clips; 8 of 21 experts finished all assignments
Each clip had 2 or 3 judgments
Avg. Cohen's kappa: 0.5

Agreements       C1   C2   C3   C4   C5   Total  Accuracy
3 of 3 judges    21   24   56   21   31   153    0.59
2 of 3 judges    41   35   18   26   14   134    0.38
2 of 2 judges    58   61   46   73   75   313    0.54
Total           120  120  120  120  120   600

Lessons: (1) missed judgments -> low accuracy; (2) need more motivated annotators
(A sketch of the kappa computation follows.)
Hu, X., Downie, J. S., Laurier, C., Bay, M., & Ehmann, A. (2008). The 2007 MIREX Audio Mood Classification Task: Lessons Learned. In ISMIR.
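For reference, Cohen's kappa (the agreement statistic quoted above) can be computed as below; the labels are invented and scikit-learn is an assumed convenience, not the tool used in the study.

```python
# Sketch: Cohen's kappa between two annotators' cluster labels.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["C1", "C3", "C5", "C2", "C3", "C4", "C1", "C5"]  # hypothetical judgments
annotator_b = ["C1", "C3", "C4", "C2", "C3", "C4", "C2", "C5"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```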
Crowdsourcing: Amazon Mechanical Turk
Lee & Hu (2012): compare expert and MTurk annotations
  The same 1,250 music clips as in MIREX AMC
  The same 5 clusters
  Annotators: "Turkers" who work on Human Intelligence Tasks (HITs) for very low payment
Advantages of MTurk: plenty of labor
Disadvantages of MTurk: quality control

Annotation: Amazon Mechanical Turk
Each HIT had 27 clips, including 2 duplicates for a consistency check
Each clip had 2 judges
Paid 0.55 USD per HIT
Qualification test before proceeding to the task
186 HITs collected; 100 HITs accepted
Avg. Cohen's kappa: 0.48
Lee, J. H. & Hu, X. (2012). Generating Ground Truth for Music Mood Classification Using Mechanical Turk. In Proceedings of the Joint Conference on Digital Libraries.
Comparison: Stats on Collecting Data
                                      Evalutron 6000        MTurk
Number of judgments collected         2,468 (incomplete)    2,500 (complete)
Total time for collecting judgments   38 days (+ additional 19 days
                                      in-house assessment)
Cost for collecting all judgments     $0                    $60.50
Average time spent per music clip     21.54 seconds         17.46 seconds

Comparison: Agreement Rates
% of clips with agreements:
         Evalutron 6000   MTurk
C1       40.2%            39.6%
C2       60.2%            48.9%
C3       70.5%            69.5%
C4       39.6%            46.3%
C5       70.8%            60.0%
Other    16.9%            21.3%
Comparison: Confusions among Clusters
Disagreed clusters          E6K   MTurk
Cluster 1 & Cluster 2       20    95
Cluster 2 & Cluster 4       31    86
Cluster 1 & Cluster 5       13    74
⁞                            ⁞     ⁞
Cluster 3 & Cluster 4        6    27
Cluster 2 & Cluster 5        1    22
Cluster 3 & Cluster 5        1    20
Total                      253   595

Confusions Shown in Russell's Model
(figure: the five clusters placed in Russell's valence-arousal space)
Comparison: System Performances (MIREX 2007)
(figure: system performances under the Evalutron 6000 and MTurk ground truths)

Crowdsourcing: Games
MoodSwings (Kim et al., 2008)
  2-player Web-based game to collect annotations of music pieces in the arousal-valence space
  Time-varying annotations are collected at a rate of 1 sample per second
  Players "score" for agreement with their competitor
Kim, Y. E., Schmidt, E., and Emelle, L. (2008). MoodSwings: a collaborative game for music mood label collection. ISMIR.
MoodSwings: Challenges
Needs a pair of players
  Simulated AI player:
    Randomly following the real player -> less challenging
    Based on a prediction model -> needs training data
Attracting players (true for all games)
  Must be challenging and fun
  Music: more recent and entertaining
  Game interface: sleek, aesthetic
  Research values: variety of music and mood

MoodSwings: MTurk Version
Single-person game
No competition, no scores
Monetary reward (0.25 USD / 11 pieces)
Consistency checks:
  2 identical pieces whose labels must fall within experts' decision boundary
  Must not label all clips the same way
Speck, J. A., Schmidt, E. M., Morton, B. G., and Kim, Y. E. (2011). A comparative study of collaborative vs. traditional music mood annotation. ISMIR.
Morton, B. G., Speck, J. A., Schmidt, E. M., and Kim, Y. E. (2010). Improving music emotion labeling using human computation. In HCOMP.
Annotation Derived from Music 2.0
Pros:
  Grounded in real-life usage
  Larger datasets, supporting multi-label annotation
  No manual annotation required
Cons:
  Need mood-related social tags
  Need clever ways to filter out noise
  May be culturally dependent

Cross-Cultural Issue in Annotation
A survey of 30 clips with American and Chinese listeners
  C1: passionate; C2: cheerful; C3: bittersweet; C4: humorous; C5: aggressive
  Example: "Got to Get You into My Life" by The Beatles
Hu, X. & Lee, J. H. (2012). A Cross-cultural Study of Music Mood Perception between American and Chinese Listeners. ISMIR (PS3 - Thursday!).
Summary on Annotation
Expert annotation for small datasets
Crowdsourcing with careful designs
Music 2.0 for super-size datasets
??

Agenda
Next: Automatic music affect analysis (categorical approach)
Categorical and Multimodal Approaches
Classification problem and framework
Audio features and classification models
Existing experiments
Multimodal classification
Cross-cultural classification

Automatic Approaches: Categorical vs. Dimensional
Categorical
  Pros: intuitive; natural language
  Cons: terms are ambiguous; difficult to offer fine-grained differentiation
Dimensional
  Pros: continuous affective scales; good user interfaces
  Cons: less intuitive; difficult to annotate
Automatic Classification (Supervised Learning)
Training examples (e.g., Song X -> Happy, Song Y -> Sad, ...) are used to train a classifier;
the trained classifier then predicts labels for new examples, e.g.,
  "Here Comes the Sun" -> Happy
  "I Will Be Back" -> Sad
  "Down with the Sickness" -> Angry

A Framework for Multimodal Mood Classification
Dataset construction: social tags, lyrics, MP3s, ...
Feature extraction:
  Textual features: linguistic, stylistic, ...
  Audio features: tempo, timbral, ...
Feature generation and selection: F-score, language modeling, PCA, ...
Classification: SVM, kNN, ...
Multimodal combination: feature concatenation, late fusion, hybrid methods, ...
Evaluation and analysis: performance comparison, learning curves, feature comparison, ...
(A code sketch of this pipeline follows.)
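A minimal sketch of the supervised pipeline above, assuming scikit-learn in place of the Weka/LibSVM-style tools named later in this tutorial; the feature matrix and 5-cluster labels are synthetic stand-ins for real extracted features.

```python
# Sketch: supervised mood classification over pre-extracted feature vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 63))     # stand-in for 63-dim audio/text features per clip
y = rng.integers(0, 5, size=600)   # stand-in labels: 5 mood clusters (C1..C5)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=3)   # per-fold accuracy
print(scores.mean())                        # ~0.2 on random data (chance level)
```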
Audio Features
Energy: mean and standard deviation of root-mean-square energy (Marsyas, MIR Toolbox)
Rhythm: fluctuation pattern and tempo (MIR Toolbox, PsySound)
Pitch: pitch class profile, the intensity of the 12 semitones of the musical octave in the Western twelve-tone scale (MIR Toolbox, PsySound)
Tonal: key clarity, musical mode (major/minor), and harmonic change, e.g., chord change (MIR Toolbox)
Timbre: mean and standard deviation of the first 13 MFCCs, delta MFCCs, and delta-delta MFCCs (Marsyas, MIR Toolbox)
Psychoacoustic: perceptual loudness, volume, sharpness (dull/sharp), timbre width (flat/rough), spectral and tonal dissonance (dissonant/consonant) (PsySound)

Classification Models
Generic supervised learning algorithms: neural networks, k-nearest neighbors (k-NN), maximum likelihood, decision trees, support vector machines (SVM), Gaussian mixture models (GMM), etc.
Tools: generic machine learning packages such as Weka, RapidMiner, LibSVM, SVMLight
SVM seems superior

MIREX AMC 2007 Results
(results figure)
The Audio Signal's "Glass Ceiling"
Aucouturier & Pachet (2004): a "semantic gap" between low-level music features and high-level human perception

MIREX AMC performance (5 classes):
Year   Top 3 accuracies
2007   61.50%, 60.50%, 59.67%
2008   63.67%, 58.20%, 56.00%
2009   65.67%, 65.50%, 63.67%
2010   63.83%, 63.50%, 63.17%
2011   69.50%, 67.17%, 66.67%
2012   67.83%, 67.67%, 67.17%

Aucouturier, J.-J., & Pachet, F. (2004). Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1).

Multimodal Classification
Music is described by multiple independent sources: social tags, metadata, lyrics, audio
  (Bischoff et al., 2009; Schuller et al., 2011; Yang & Lee, 2004; Laurier et al., 2009; Hu & Downie, 2010)
Improving classification performance by combining multiple independent sources
Lyric Features
Basic features: content words, part-of-speech, function words
Lexicon features: words in WordNet-Affect
Psycholinguistic features:
  Psychological categories in GI (General Inquirer)
  Scores in ANEW (Affective Norms for English Words)
Stylistic features: punctuation marks; interjection words
Statistics: e.g., number of words per minute

Lyric Feature Examples
ANEW examples:
Word    Valence  Arousal  Dominance
Happy   8.21     6.49     6.63
Sad     1.61     4.13     3.45
Thrill  8.05     8.02     6.54
Kiss    8.26     7.32     6.93
Dead    1.94     5.73     2.84
Dream   6.73     4.53     5.53
Angry   2.85     7.17     5.55
Fear    2.76     6.96     3.22

Top General Inquirer (GI) features in category "Aggressive":
WlbPhys: words connoting the physical aspects of well-being, including its absence (blood, dead, drunk, pain)
Perceiv: words referring to the perceptual process of recognizing or identifying something by means of the senses (dazzle, fantasy, hear, look, make, tell, view)
Exert: action words (hit, kick, drag, upset)
TIME: words indicating time (noon, night, midnight)
COLL: words referring to all human collectivities (people, gang, party)
WlbLoss: words related to a loss in a state of well-being, including being upset (burn, die, hurt, mad)

(A sketch of the ANEW feature follows.)
Hu, X. & Downie, J. S. (2010). Improving Mood Classification in Music Digital Libraries by Combining Lyrics and Audio. JCDL.
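As an illustration of the ANEW lexicon features above, a minimal sketch that averages word-level valence/arousal over a lyric; the four-entry dictionary copies scores from the slide's table, and the whitespace tokenizer is deliberately naive.

```python
# Sketch: mean ANEW valence/arousal of the lyric words found in the lexicon.
ANEW = {  # word: (valence, arousal, dominance), values copied from the slide
    "happy": (8.21, 6.49, 6.63),
    "sad":   (1.61, 4.13, 3.45),
    "kiss":  (8.26, 7.32, 6.93),
    "fear":  (2.76, 6.96, 3.22),
}

def anew_features(lyric: str):
    """Return (mean valence, mean arousal) over covered words, or None."""
    hits = [ANEW[w] for w in lyric.lower().split() if w in ANEW]
    if not hits:
        return None  # no lexicon coverage
    n = len(hits)
    return (sum(h[0] for h in hits) / n, sum(h[1] for h in hits) / n)

print(anew_features("A sad kiss goodbye"))  # (4.935, 5.725)
```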
Lyric Classification Results
No significant difference between the top feature combinations
(figures: distributions of the features "!", "hey", and "number of words per minute" across mood categories)
Combine with an Audio-based Classifier
A leading system in MIREX AMC 2007 and 2008: Marsyas
  Music Analysis, Retrieval and Synthesis for Audio Signals
  Led by Prof. Tzanetakis at the University of Victoria
  Uses audio spectral features
  marsyas.info; finalist in the SourceForge Community Choice Awards 2009

Hybrid Methods
Late fusion: combine the lyric classifier's prediction and the audio classifier's prediction into a final prediction
  Dominant, due to its clarity and its avoidance of the "curse of dimensionality"
Feature concatenation (early fusion): concatenate lyric and audio features, then train a single classifier
(A sketch of both schemes follows.)
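A compact sketch contrasting the two schemes, with logistic regression as an assumed stand-in classifier (it exposes the class posteriors that late fusion needs); all data are synthetic.

```python
# Sketch: late fusion (average posteriors) vs. early fusion (concatenate features).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio, X_lyric = rng.normal(size=(200, 30)), rng.normal(size=(200, 50))
y = rng.integers(0, 5, size=200)  # 5 mood clusters

# Late fusion: one classifier per modality, then average the class probabilities.
audio_clf = LogisticRegression(max_iter=1000).fit(X_audio, y)
lyric_clf = LogisticRegression(max_iter=1000).fit(X_lyric, y)
late_proba = (audio_clf.predict_proba(X_audio) + lyric_clf.predict_proba(X_lyric)) / 2
late_pred = late_proba.argmax(axis=1)

# Early fusion: a single classifier over the concatenated feature space.
X_both = np.hstack([X_audio, X_lyric])
early_pred = LogisticRegression(max_iter=1000).fit(X_both, y).predict(X_both)
```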
Effectiveness
(figure comparing accuracies: audio, lyrics, hybrid with late fusion, hybrid with early fusion)

Audio vs. Lyrics / Learning Curves
(figures)
Hu, X. & Downie, J. S. (2010). When Lyrics Outperform Audio for Music Mood Classification: A Feature Analysis. ISMIR.
Top Lyric Features / Top Lyric Features in "Calm"
(figures)
Other Textual Features Used in Music Mood Classification
Top affective words based on SentiWordNet
  SentiWordNet assigns each WordNet synset three sentiment scores: positivity, negativity, objectivity
Simple syntactic structures: negation, modifiers
Lyric rhyme patterns (inspired by poems)
Contextual features (beyond lyrics): social tags, blogs, playlists, etc.
Cross-Cultural Mood Classification
Tomorrow, Oral Session 1
Cross-cultural model applicability:
  23 mood categories based on AllMusic.com
  Train on songs in one culture and classify songs in the other
Yang & Hu (2012). Cross-cultural Music Mood Classification: A Comparison on English and Chinese Songs. ISMIR.

Summary of Categorical and Multimodal Approaches
Natural language labels are intuitive to end users
Based on supervised learning techniques
Studies mostly focus on feature engineering
Multimodal approaches improve performance (effectiveness and efficiency)
Cross-cultural mood classification: just started
Challenges:
  Ambiguity inherent in terms (Meyer's "distortion")
  Hierarchy of mood categories
  Connections between features and mood categories
Agenda
Next: Dimensional approach

Dimensional Approach
What the dimensional model is, and why use it
Computational model for dimensional music emotion recognition
Issues:
  Difficulty of emotion rating
  Subjectivity of emotion perception
  Context of music listening
  Usability of UI
Categorical Approach vs. Dimensional Approach
(figures: audio spectrum -> Hevner's model (1936); audio spectrum -> circumplex model (Russell, 1980))
What is the Dimensional Model
An alternative conceptualization of emotions based on their placement along broad affective dimensions
Obtained by analyzing "similarity ratings" of emotion words or facial expressions via factor analysis or multidimensional scaling (a toy MDS sketch follows)
For example, Russell (1980) asked 343 subjects to describe their emotional states using 28 emotion words, and used four different methods to analyze the correlations between the emotion ratings
Many studies identify similar dimensions [psp80]

The Valence-Arousal (VA) Emotion Model
Activation‒Arousal: energy or neurophysiological stimulation level
Evaluation‒Valence: pleasantness; positive and negative affective states
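To illustrate the methodology (not Russell's actual data), a sketch that recovers a 2-D configuration from a hypothetical matrix of averaged dissimilarity ratings, using multidimensional scaling in scikit-learn.

```python
# Sketch: MDS on invented emotion-word dissimilarity ratings.
import numpy as np
from sklearn.manifold import MDS

words = ["happy", "excited", "sad", "calm"]
dissim = np.array([  # hypothetical averaged dissimilarity ratings (0-10)
    [0, 2, 9, 5],
    [2, 0, 8, 7],
    [9, 8, 0, 4],
    [5, 7, 4, 0],
], dtype=float)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)  # 2-D placement, cf. valence/arousal axes
for w, (x, y) in zip(words, coords):
    print(f"{w:8s} {x:+.2f} {y:+.2f}")
```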
More Dimensions
The world of emotions is not 2D (Fontaine et al., 2007)
3rd dimension: potency‒control
  Feelings of power/weakness; dominance/submission
  Anger ↔ fear; pride ↔ shame; interest ↔ disappointment
4th dimension: predictability
  Surprise; stress ↔ fear; contempt ↔ disgust
However, the 2D model seems to work fine for music emotion

Why the Dimensional Model 1/3
Free of emotion words
  Emotion words are not always precise and consistent
  We often cannot find proper words to express our feelings
  Different people understand the words differently
  Emotion words are difficult to translate and might not exist with the exact same meaning in different languages (Russell, 1991)
Semantic overlap between emotion categories
  Cheerful, happy, joyous, party/celebratory
  Melancholy, gloomy, sad, sorrowful
Difficult to determine how many and which categories to use in a mood classification system
No Consensus on Mood Taxonomy in MIR
Katayose et al. [icpr98] (4): gloomy, urbane, pathetic, serious
Feng et al. [sigir03] (4): happy, angry, fear, sad
Li et al. [ismir03], Wieczorkowska et al. [imtci04] (13): happy, light, graceful, dreamy, longing, dark, sacred, dramatic, agitated, frustrated, mysterious, passionate, bluesy
Wang et al. [icsp04] (6): joyous, robust, restless, lyrical, sober, gloomy
Tolos et al. [ccnc05] (3): happy, aggressive, melancholic+calm
Lu et al. [taslp06] (4): exuberant, anxious/frantic, depressed, content
Yang et al. [mm06] (4): happy, angry, sad, relaxed
Skowronek et al. [ismir07] (12): arousing, angry, calming, carefree, cheerful, emotional, loving, peaceful, powerful, sad, restless, tender
Wu et al. [mmm08] (8): happy, light, easy, touching, sad, sublime, grand, exciting
Hu et al. [ismir08] (5): passionate, cheerful, bittersweet, witty, aggressive
Trohidis et al. [ismir08] (6): surprised, happy, relaxed, quiet, sad, angry

Why the Dimensional Model 2/3
Reliable and economical model
  Only two variables (valence, arousal) instead of tens or hundreds of mood tags
  Easy to compare the performance of different systems
Suitable for continuous measurements
  Emotions may change over time as the music unfolds (e.g., arousal: very angry -> angry -> neutral)
  Emotion intensity is more precise and intuitive than emotion words
Why the Dimensional Model 3/3
A ready canvas for user interaction
  Emotion-based retrieval; song collection navigation
  (The example interface uses three dimensions: valence, arousal, synthetic/acoustic)

Mapping Songs to the VA Space
Assumption:
  View the VA space as a continuous, Euclidean space
  View each point as an emotional state (valence, arousal)
Goal:
  Given a short music clip (e.g., 10 to 30 seconds), automatically compute a pair of valence and arousal (VA) values that best quantify (summarize) the expressed emotion of the overall clip
Research on time-dependent, second-by-second emotion recognition (emotion tracking) will be introduced in the next session
How to Predict Emotion Values 1/3
Sol (A): divide the emotion space into several mood classes
  For example, into 16 classes (e.g., Moody by Crayonroom)
Pros:
  Standard classification problem y = f(x), where x is a feature vector and y is a discrete label (1‒16)
Cons:
  Poor granularity of the emotion space (not really VA values)

How to Predict Emotion Values 2/3
Sol (B): further exploit the "geographic information" (Yang et al., 2006)
  Perform binary classification for each quadrant, then apply arithmetic operations to the probability estimates:
    Valence = u1 + u4 - u2 - u3
    Arousal = u1 + u2 - u3 - u4
    (u denotes the likelihood of each quadrant)
Pros: easy to compute
Cons: lacks theoretical foundation
(A numeric sketch follows.)
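A numeric sketch of Sol (B)'s arithmetic, with invented quadrant likelihoods:

```python
# Sketch: combine four per-quadrant likelihoods into VA values.
u = {1: 0.50, 2: 0.20, 3: 0.10, 4: 0.20}  # quadrant likelihoods, sum to 1

valence = u[1] + u[4] - u[2] - u[3]  # quadrants 1 and 4 lie on the positive-valence side
arousal = u[1] + u[2] - u[3] - u[4]  # quadrants 1 and 2 lie on the high-arousal side
print(valence, arousal)  # 0.4 0.4
```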
How to Predict Emotion Values 3/3
Sol (C): regression (Yang et al., 2007, 2008; MacDorman et al., 2007; Eerola et al., 2009)
  Given features, predict a numerical value: one regressor for valence, one for arousal
  yv = fv(x), ya = fa(x), where x is a feature vector and yv, ya are numerical values
Pros:
  Regression analysis is theoretically sound and well developed
  Many good off-the-shelf regression algorithms
Cons:
  Requires ground-truth "emotion values"
  Need to ask human subjects to "rate" the emotion values of songs

Linear Regression: Example
Linear regression: f(x) = w^T x + b
Possible (hypothesized) w for valence and arousal:

          loudness      tempo        pitch level  harmony                 mode
          (loud/soft)   (fast/slow)  (high/low)   (consonant/dissonant)   (major/minor)
valence   0             0            0            1                       1
arousal   1             1            1            0                       0

Positive valence = consonant harmony & major mode
High arousal = loud volume & fast tempo & high pitch
Nonlinear regression functions can also be used (a worked sketch of the weight table follows)
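A worked version of the hypothesized weight table; the binary feature vector and weights are illustrative, not learned from data.

```python
# Sketch: f(x) = w^T x (+ b, omitted here) with the slide's hypothesized weights.
import numpy as np

# feature order: [loudness, tempo, pitch level, harmony, mode]
w_valence = np.array([0, 0, 0, 1, 1])
w_arousal = np.array([1, 1, 1, 0, 0])

x = np.array([1, 1, 0, 1, 1])  # loud, fast, low-pitched, consonant, major (invented clip)
print("valence:", w_valence @ x)  # 2 -> consonant harmony & major mode push valence up
print("arousal:", w_arousal @ x)  # 2 -> loudness & fast tempo push arousal up
```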
Computational Framework
Emotion annotation: obtain y for the training data
Feature extraction: obtain x
Regression model training: obtain w
Automatic prediction: obtain y for the test data
(Training data flow through annotation and feature extraction into regressor training; test data flow through feature extraction into automatic prediction.)

Feature Extraction: Get x
Marsyas-0.2 (C): MFCC, LPCC, spectral properties (centroid, moment, flatness, crest factor)
MIR Toolbox (Matlab): spectral features, rhythm features, pitch, key clarity, harmonic change, mode
MA Toolbox (Matlab): MFCC, spectral histogram, periodicity histogram, fluctuation pattern
PsySound (Matlab): psychoacoustic model-based features (loudness, sharpness, roughness, virtual pitch, volume, timbre width, dissonance)
Rhythm Pattern (Matlab): rhythm pattern, beat histogram, tempo
EchoNest API (Python): timbre, pitch, loudness, key, mode, tempo
MPEG-7 audio encoder (Java): spectral properties, harmonic ratio, noise level, fundamental frequency type
Relevant Features [Gomez and Danuser, 2007]
(figure: sound intensity, tempo, rhythm, pitch range, mode, consonance)

Example Matlab Code for Extracting MFCC
Using the MA Toolbox: take 20 coefficients, discard the DC value, and take the mean and standard deviation along time
(code shown on slide; a Python approximation follows)
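The slide's Matlab code did not survive the export; below is a rough Python approximation of the same recipe using librosa (an assumed substitute for the MA Toolbox), as far as the slide indicates it: 20 MFCCs, drop the DC-like 0th coefficient, then mean and standard deviation along time.

```python
# Sketch: MFCC summary features for one clip.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=None, mono=True)  # "clip.wav" is a placeholder path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)    # shape: (20, n_frames)
mfcc = mfcc[1:, :]                                    # drop the DC-like 0th coefficient

# Summarize each coefficient's trajectory: mean and std along time.
feature = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(feature.shape)  # (38,)
```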
Emotion Annotation: Get y
Rate the VA values of each song
  Ordinal rating scale; scroll bar
Only the training data need annotated y; the y for the test data is predicted automatically by the regression model

Example System
Data set (Yang et al., 2008):
  195 pop songs (Chinese, Japanese, and English)
  Each song rated by 10+ subjects; ground truth set by averaging
  Marsyas and PsySound used to extract features
Model learning (get w):
  Linear regression
  AdaBoost.RT (nonlinear)
  Support vector regression (SVR) (nonlinear)
Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H.-H. Chen (2008). A regression approach to music emotion recognition. IEEE TASLP, 16(2).
Performance Evaluation
Evaluation metric: R² statistics
  Squared correlation between estimates and ground truth
  The higher the better: R² = 1 is a perfect fit; R² = 0 is a random guess
10-fold cross validation
  9/10 of the data for training, 1/10 for testing
  Repeat 20 times and average the results

Quantitative Results
Method                                      R² of valence   R² of arousal
Linear regression                           0.109           0.568
AdaBoost.RT [ijcnn04]                       0.117           0.553
SVR (support vector regression) [sc04]      0.222           0.570
SVR + RReliefF (feature selection) [ml03]   0.254           0.609

Results:
  SVR (nonlinear) performs the best
  Feature selection with the RReliefF algorithm offers gains (valence: 0.254, arousal: 0.609)
  Valence is more difficult to model (it is more subjective)
  Typical R² values: valence 0.25-0.35, arousal 0.60-0.85
(A sketch of this evaluation protocol follows.)
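A sketch of the evaluation protocol under stated assumptions: scikit-learn's SVR and its "r2" scorer (the coefficient of determination, a close relative of the squared correlation defined on the slide), with synthetic features standing in for the 195-song set. The study repeated the 10-fold split 20 times; for brevity this sketch runs it once.

```python
# Sketch: 10-fold cross-validated R^2 for valence and arousal regressors.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(195, 45))                     # stand-in audio features
y_arousal = X[:, 0] + 0.1 * rng.normal(size=195)   # arousal: strongly feature-driven
y_valence = 0.3 * X[:, 1] + rng.normal(size=195)   # valence: noisier, harder to fit

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, y in [("valence", y_valence), ("arousal", y_arousal)]:
    r2 = cross_val_score(SVR(kernel="rbf"), X, y, cv=cv, scoring="r2")
    print(f"R^2 of {name}: {r2.mean():.3f}")
```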