EMO PLAYER
SEMINAR REPORT
Submitted in partial fulfillment of the requirements for the award of
Bachelor of Technology degree in Electronics & Communication Engineering
By
NISAMUDEEN P (MQAMEEC020)
DEPARTMENT OF ELECTRONICS & COMMUNICATION
ENGINEERING
MALABAR COLLEGE OF ENGINEERING & TECHNOLOGY
DESAMANGALAM, PALLUR (P.O),
THRISSUR. KERALA - 679532.
BONAFIDE CERTIFICATE
This is to certify that this Seminar Report is the bonafide work of NISAMUDEEN P
(MQAMEEC020) who carried out the Seminar entitled EMO PLAYER under our
supervision from __________ to ___________.
Mr. VYSHAK VITTAL (Guide, Assistant Professor, ECE Department)
Mr. ANSHAD A.S (HOD, ECE Department)
Name of the Examiners Signature
1)
2)
DECLARATION
I NISAMUDEEN P (MQAMEEC020) hereby declare that the Seminar Report
entitled EMO PLAYER done by me under the guidance of Mr. VYSHAK VITTAL at
Malabar College of Engineering & Technology is submitted in partial fulfillment of the
requirements for the award of Bachelor of Technology degree in Electronics &
Communication Engineering.
DATE:
PLACE: SIGNATURE OF THE CANDIDATE
NISAMUDEEN P
ACKNOWLEDGEMENT
I thank ALMIGHTY GOD, who laid the foundation for knowledge and wisdom, who has always been the source of my strength and inspiration, and who guides me throughout.
I express my sincere thanks to Dr. V. NANDANAN, Principal, for giving me permission to present this seminar work and for providing all the facilities for its successful completion. I express a deep sense of gratitude to Prof. ANSHAD, HOD, Department of Electronics & Communication, for providing the best facilities and atmosphere for creative work, along with guidance and encouragement.
I would like to thank my coordinator, Ms. GOPIKA GOPINATH V, Assistant Professor, and my guide, Mr. VYSHAK VITTAL, Assistant Professor, for their judicious advice, expert suggestions, guidance and support during every stage of this work. I express sincere thanks to all the lecturers in the ECE Department for their valuable guidance and encouragement throughout the work.
I also express my sincere thanks to all my classmates and friends, and my heartfelt gratitude to my loving parents for their encouragement, prayers and inspiration during every stage of this work.
CONTENTS
1. LIST OF FIGURES
2. LIST OF TABLES
3. ABSTRACT
4. CHAPTER 1: INTRODUCTION
5. CHAPTER 2: LITERATURE REVIEW
 2.1 Face detection methods
 2.2 Feature extraction methods
 2.3 Expression Recognition
6. CHAPTER 3: CURRENT MUSIC SYSTEMS
 3.1 Limitations of existing system
7. CHAPTER 4: PROPOSED SYSTEM
 4.1 Emotion Extraction Module
 4.1.1 Face Detection and Tracking
 4.1.2 Facial Expression Recognition
 4.2 Audio Feature Extraction Module
 4.2.1 MIR 1.5 Toolbox
 4.2.2 Chroma Toolbox
 4.2.3 Auditory Toolbox
 4.2.4 MFCC
 4.3 Emotion-Audio Integration Module
 4.3.1 Audio Emotion Recognition
 4.4 Classifier
 4.5 Databases
 4.6 Music
8. CHAPTER 5: Viola and Jones Algorithm
 5.1 Advantages
 5.2 Disadvantages
 5.3 MATLAB code on Frontal CART for object detection
9. CHAPTER 6: Experimental Results
 6.1 EMOTION EXTRACTION
 6.2 AUDIO FEATURE EXTRACTION
 6.3 EMOTION BASED MUSIC PLAYER
10. CHAPTER 7: CONCLUSION
11. CHAPTER 8: APPLICATIONS
12. CHAPTER 9: FUTURE SCOPE
13. CHAPTER 10: REFERENCES
LIST OF FIGURES
1. Framework of recognizing emotions with privileged information
2. Functional block diagram of a basic music system
3. Flow chart of the proposed system
4. Feature extraction from facial image
5. Regions involved in expression analysis
6. Face region grids (bounding box)
7. Audio preprocessing
8. Chroma Toolbox
9. Phases of the MFCC computation
10. Integration of the three modules
11. Mapping of modules
LIST OF TABLES
Table 1: Time Estimation of Various Modules of Emotion Extraction Module
Table 2: Estimated Accuracy for Different Categories of Audio Features
Table 3: Average Time Estimation of Various Modules of Proposed System
ABSTRACT
The human face is an important part of an individual's body, and it plays an especially important role in conveying an individual's behavior and emotional state. Manually segregating a list of songs and generating an appropriate playlist based on an individual's emotional features is a tedious, time-consuming, labor-intensive and uphill task. Various algorithms have been proposed and developed to automate the playlist generation process; however, the existing algorithms are computationally slow, less accurate and sometimes even require additional hardware such as EEG systems or sensors. The proposed system generates a playlist automatically from the extracted facial expression, thereby reducing the effort and time involved in performing this process manually. The proposed system therefore reduces the computational time involved in obtaining the results and the overall cost of the designed system, while increasing its overall accuracy. Testing of the system is done on both user-dependent (dynamic) and user-independent (static) datasets, with facial expressions captured using an inbuilt camera. The accuracy of the emotion detection algorithm used in the system is around 85-90% for real-time images and around 98-100% for static images. On an average calculated estimate, the proposed algorithm takes around 0.95-1.05 s to generate an emotion-based music playlist. Thus, it yields better accuracy in terms of performance and computational time, and reduces the design cost, compared with the algorithms discussed in the literature survey. Implementation and testing of the proposed algorithm are carried out using only an inbuilt camera; hence, the proposed algorithm successfully reduces the overall cost of the system. Also, on average, the proposed algorithm takes about 1.10 s to generate a playlist based on a facial expression, which yields better performance, in terms of computational time, than the algorithms in the existing literature.
CHAPTER 1
INTRODUCTION
Music plays a very important role in enhancing an individual's life: it is an important medium of entertainment for music lovers and listeners, and it sometimes even serves a therapeutic purpose. In today's world, with ever-increasing advancements in the field of multimedia and technology, various music players have been developed with features such as fast forward, reverse, variable playback speed (seek and time compression), local playback, streaming playback with multicast streams, volume modulation and genre classification. Although these features satisfy the user's basic requirements, the user still has to manually browse through the playlist and select songs based on his current mood and behavior. In other words, a user repeatedly faces the need to browse through his playlist according to his mood and emotions. Using traditional music players, a user had to manually browse through his playlist and select songs that would suit his mood and emotional experience. This task was labor intensive, and an individual often faced the dilemma of landing on an appropriate list of songs.
Emotions are synonymous with the aftermath of interplay between an individual’s
cognitive gauging of an event and the corresponding physical response towards it. Among
the various ways of expressing emotions, including human speech and gesture, a facial
expression is the most natural way of relaying them. The ability to understand human
emotions is desirable for human-computer interaction. In the past decade, considerable
amounts of research have been done on emotion recognition from voice, visual behavior, and
physiological signals, separately or jointly. Very good progress has been achieved in this field, and several commercial products have been developed, such as smile detection in cameras. Emotion is a subjective response to an outer stimulus. A facial expression is a
discernible manifestation of the emotive state, cognitive activity, motive, and
psychopathology of a person. Homo sapiens have been blessed with an ability to interpret
and analyze an individual’s emotional state.
The existing systems that provide privileged information for emotion recognition are:
• Audio Emotion Recognition (AER)
• Music Information Retrieval (MIR)
• Emotion recognition with the help of privileged information
AER is a technique that classifies a received audio signal into various classes of emotions and moods by considering its audio features, whereas MIR is a field that extracts critical information from an audio signal by exploring audio features such as pitch, energy, MFCC and flux.
Although both Audio Emotion Recognition (AER) and Music Information Retrieval (MIR) avoid the manual segregation of songs and the manual generation of playlists, neither fully realises a music player controlled by human emotion. The advent of AER and MIR equipped traditional systems with a feature that automatically parsed the playlist into different classes of emotions: AER categorises an audio signal under various classes of emotions based on certain audio features, while MIR extracts crucial information and various audio features (e.g. pitch, energy, flux, MFCC, kurtosis) from an audio signal. Yet such systems did not incorporate mechanisms that enabled a music player to be fully controlled by human emotions.
Although human speech and gesture are common ways of expressing emotions, facial expression is the most ancient and natural way of expressing feelings, emotions and mood. In the above two approaches, AER and MIR, emotion recognition does not depend on the face of the user at all.
The third approach classifies emotions from EEG signals and tags the emotions of videos by integrating privileged information from the EEG feature space and the video feature space, and compares this information with the music playlist system for automatic playlist generation, as shown in Figure 1.
Figure 1: Framework of recognizing emotions with privileged information
The main objective of this paper is to design an efficient and accurate algorithm that generates a playlist based on the current emotional state and behavior of the user. The paper presents a novel approach that removes the burden of manually browsing through a playlist of songs in order to make a selection.
CHAPTER 2
LITERATURE REVIEW
Various techniques and approaches have been proposed and developed to classify the emotional state and behavior of humans. The proposed approaches have focused only on some of the basic emotions.
2.1 Face detection methods
The techniques for face detection can be divided into two groups: holistic, where the face is treated as a whole unit, and analytic, where the co-occurrence of characteristic facial elements is studied. Pantic and Rothkrantz proposed a system which processes images of the frontal and profile face views: vertical and horizontal histogram analysis is used to find the face boundaries, and the face contour is then obtained by thresholding the image with HSV color space values. Kobayashi and Hara used images captured in monochrome mode to find the face brightness distribution, with the position of the face estimated by iris localization. For the purpose of feature recognition, facial features have been categorized into two major categories: appearance-based feature extraction and geometric-based feature extraction. The geometric-based feature extraction technique in the system proposed by Changbo considered only the shape or major prominent points of some important facial features, such as the mouth and eyes; around 58 major landmark points were considered in crafting an Active Shape Model (ASM). Appearance-based features, such as texture, have also been considered in different areas of work and development. An efficient method for coding and implementing extracted facial features, together with a multi-orientation, multi-resolution set of Gabor filters, was proposed by Michael Lyons.
2.2 Feature extraction methods
Pantic and Rothkrantz selected a set of facial points from frontal and profile face images. The expression is measured by the distance between the positions of those points in the initial image (neutral face) and the peak image (affected face).
Cohn developed a geometric feature based system in which the optical flow algorithm is performed only in 13x13 pixel regions surrounding facial landmarks. Shan investigated the Local Binary Pattern (LBP) method for texture encoding in facial expression description; two methods of feature extraction were proposed, in the first of which features are extracted from a fixed set of patches and in the second from the most probable patches found by boosting. In Jung Hyun Kim's work, music mood tags and arousal-valence (A-V) values from a total of 20 subjects were tested and analyzed, and based on the results of the analysis the A-V plane was classified into 8 regions (clusters) depicting mood, using the efficient k-means clustering algorithm. Thayer proposed a very useful 2-dimensional (stress vs. energy) model in which emotions are depicted by a 2-dimensional coordinate system, lying either on the 2 axes or in the 4 quadrants formed by the plot.
2.3 Expression Recognition
The last part of the FER system is based on machine learning theory; precisely, it is the classification task. The input to the classifier is the set of features that were retrieved from the face region in the previous stage. Classification requires supervised training, so the training set should consist of labeled data. There are many different machine learning techniques for the classification task, namely K-Nearest Neighbors, Artificial Neural Networks, Support Vector Machines, Hidden Markov Models, Expert Systems with rule-based classifiers, Bayesian Networks and boosting techniques (AdaBoost, GentleBoost). The field of Facial Expression Recognition (FER) synthesized algorithms that excelled in furnishing such demands: FER enabled computer systems to monitor an individual's emotional state effectively and react appropriately. While various systems have been designed to recognize facial expressions, their time and memory complexity is relatively high, and hence they fail to achieve real-time performance. Their feature extraction techniques are also less reliable and are responsible for reducing the overall accuracy of the system. An accurate statistics-based approach was proposed by Renuka R. Londhe; that paper focused mainly on the study of the changes in curvatures on the face and the intensities of the corresponding pixels of the images.
Artificial Neural Networks (ANN) were used to classify the extracted features into the 6 major universal emotions: anger, disgust, fear, happiness, sadness and surprise.
Some of the drawbacks of the existing systems are as follows:
i. Existing systems are highly complex in terms of time and storage for recognizing
facial expressions in a real environment.
ii. Existing systems lack accuracy in generating a playlist based on the current emotional
experience of a user.
iii. Existing systems employed additional hardware like EEG systems and sensors that
increased the overall cost of the system.
iv. Some of the existing systems imposed a requirement of human speech for generating a playlist in accordance with a user's facial expressions.
Several approaches have been proposed and adopted to classify human emotions successfully. This paper primarily focuses on resolving the drawbacks of the existing systems by designing an automated emotion based music player that generates a customized playlist from the facial features extracted from the user, thus avoiding the use of any additional hardware. It also includes a mood randomizer function that, after some duration, shifts the mood-generated playlist to another randomized playlist of the same mood level. Various approaches have been proposed to reduce the human effort and time needed for manually segregating songs from a playlist into different classes of emotions and moods. Numerous approaches have been designed to extract facial features, or audio features from an audio signal, but very few of the systems designed are capable of generating an emotion based music playlist using human emotions, and the existing designs that can generate an automated playlist rely on additional hardware such as sensors or EEG systems, thereby increasing the cost of the proposed design.
CHAPTER 3
CURRENT MUSIC SYSTEMS
The features available in the existing music players present in computer systems are as
follows:
i. Manual selection of Songs
ii. Party Shuffle
iii. Playlists
iv. Music squares, where the user has to classify the songs manually according to particular emotions, with only four basic emotions: Passionate, Calm, Joyful and Excitement.
3.1 Limitations of existing system:
i. It requires the user to manually select the songs.
ii. Randomly played songs may not match the mood of the user.
iii. The user has to classify the songs into various emotions and then manually select a particular emotion in order to play the songs.
Figure 2: Functional block diagram of a basic music system
CHAPTER 4
PROPOSED SYSTEM
The proposed automatic playlist generation scheme is a combination of multiple schemes. In this paper, we consider the notion of collecting human emotion from the user's
expressions, and explore how this information could be used to improve the user experience
with music players. We present a new emotion based and user-interactive music system,
EMO PLAYER. It aims to provide user-preferred music with emotion awareness. The system
starts recommendation with expert knowledge. If the user does not like the recommendation,
he/she can decline the recommendation and select the desired music himself/herself. This
paper is based on the idea of automating much of the interaction between the music player
and its user. It introduces a "smart" music player, EMO PLAYER that learns its user's
emotions, and tailors its music selections accordingly. After an initial training period, EMO
PLAYER is able to use its internal algorithms to make an educated selection of the song that
would best fit its user's emotion.
The proposed algorithm revolves around an automated music recommendation system that
generates a subset of the original playlist or a customized playlist in accordance with a user’s
facial expressions. A user's facial expression helps the music recommendation system to
decipher the current emotional state of a user and recommend a corresponding subset of the
original playlist. It is composed of three main modules: Facial expression recognition
module, Audio emotion recognition module and a System integration module. Facial
expression recognition and audio emotion recognition modules are two mutually exclusive
modules. Hence, the system integration module maps the two sub-spaces by constructing and
querying a meta-data file. It is composed of meta-data corresponding to each audio file.
Figure 3 depicts the flowchart of the proposed algorithm.
Figure 3: Flow chart of the proposed system
4.1 Emotion Extraction Module
An image of the user is captured using a webcam, or it can be accessed from an image stored on the hard disk. The acquired image undergoes image enhancement in the form of tone mapping in order to restore the original contrast of the image. After enhancement, the image is converted into binary format and the face is detected using the Viola and Jones algorithm, where the Frontal CART property of the algorithm is used; it detects only upright, forward-facing faces, with the merging threshold value set in the range of 16-20. The output of the Viola and Jones face detection block forms the input to the facial feature extraction block. To increase accuracy and to obtain real-time performance, only the features of the eyes and mouth are extracted, as these two are sufficient to depict the emotions accurately. For extracting the features of the mouth and eyes, certain calculations and measurements are taken into consideration. Equations (1)-(4) give the bounding box calculations for extracting the features of the mouth:

X_start(mouth) = X_mid(nose) - (X_end(nose) - X_start(nose))   (1)
X_end(mouth)   = X_mid(nose) + (X_end(nose) - X_start(nose))   (2)
Y_start(mouth) = Y_mid(nose) + 15                              (3)
Y_end(mouth)   = Y_start(mouth) + 103                          (4)

where (X_start(mouth), Y_start(mouth)) and (X_end(mouth), Y_end(mouth)) are the start and end points of the bounding box for the mouth, (X_mid(nose), Y_mid(nose)) is the midpoint of the nose, and X_start(nose) and X_end(nose) are the start and end points of the nose. Classification is performed using a Support Vector Machine (SVM), which classifies the features into 7 classes of emotions.
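A minimal MATLAB sketch of this bounding-box arithmetic is given below. It assumes the nose bounding box comes from a Viola and Jones nose detector in the [x y width height] format used by the Computer Vision Toolbox; the helper name and that format are illustrative assumptions, while the offsets 15 and 103 come directly from equations (3) and (4).

% Sketch only: derive the mouth bounding box from a detected nose box
% using equations (1)-(4). noseBox is assumed to be a [x y width height]
% rectangle such as the one returned by vision.CascadeObjectDetector('Nose').
function mouthBox = mouthFromNose(noseBox)
    xStartNose = noseBox(1);
    xEndNose   = noseBox(1) + noseBox(3);
    xMidNose   = noseBox(1) + noseBox(3) / 2;
    yMidNose   = noseBox(2) + noseBox(4) / 2;

    xStartMouth = xMidNose - (xEndNose - xStartNose);   % equation (1)
    xEndMouth   = xMidNose + (xEndNose - xStartNose);   % equation (2)
    yStartMouth = yMidNose + 15;                        % equation (3)
    yEndMouth   = yStartMouth + 103;                    % equation (4)

    % Express the result as a [x y width height] rectangle
    mouthBox = [xStartMouth, yStartMouth, ...
                xEndMouth - xStartMouth, yEndMouth - yStartMouth];
end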
4.1.1 Face Detection and Tracking
The first part of the system is a module for face detection and landmark localization in the image. The algorithm for face detection is based on the work by Viola and Jones. In this approach the image is represented by a set of Haar-like features; the possible types are two-rectangle, three-rectangle and four-rectangle features, evaluated inside a bounding box. A bounding box is the physical extent of a given object and can be used to graphically define the physical shape of the object. A feature value is calculated by subtracting the sum of the pixels covered by the white rectangle from the sum of the pixels under the gray rectangle. Two-rectangle features detect contrast between two vertically or horizontally adjacent regions, three-rectangle features detect a contrasting region placed between two similar regions, and four-rectangle features detect similar regions placed diagonally. Rectangle features can be computed very rapidly using an intermediate representation of the image called the integral image: the sum over any rectangle requires only 4 references into the integral image. Given this representation of the image in rectangular features, the classifier needs to be trained to decide whether the image contains the searched object (a face) or not. The number of features is much higher than the number of pixels in the original image; however, it has been shown that even a small set of well-chosen features can build a strong classifier. That is why the AdaBoost algorithm can be used for training: each step selects the most discriminative feature, the one that best separates the positive and negative examples. The method is widely used in the area of face detection; however, it can be trained to detect any object. The algorithm is quick and efficient and can be used in real-time applications. The facial image obtained from the face detection stage forms the input to the feature extraction stage. To obtain real-time performance and to reduce time complexity, only the eyes and mouth are considered for expression recognition; the combination of these two features is adequate to convey emotions accurately. Figure 4 depicts the schematic for feature extraction.
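As a concrete illustration of the integral-image idea described above, the short MATLAB sketch below builds the integral image with cumulative sums and evaluates the sum of any rectangle from 4 references; the input file name and the example feature coordinates are placeholders, not values from the report.

% Minimal sketch of the integral image and the 4-reference rectangle sum.
I = double(rgb2gray(imread('face.jpg')));   % assumed RGB input image

% Integral image, zero-padded so that ii(r+1, c+1) = sum of I(1:r, 1:c)
ii = cumsum(cumsum(I, 1), 2);
ii = padarray(ii, [1 1], 0, 'pre');

% Sum of the pixels in rows r1..r2, columns c1..c2 using only 4 references
rectSum = @(r1, c1, r2, c2) ii(r2+1, c2+1) - ii(r1, c2+1) ...
                          - ii(r2+1, c1)   + ii(r1, c1);

% Example two-rectangle (horizontal) Haar-like feature value:
% sum under the white rectangle minus sum under the adjacent gray rectangle
white = rectSum(10, 10, 20, 20);
gray  = rectSum(10, 21, 20, 31);
featureValue = white - gray;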
4.1.2 Facial Expression Recognition
Figure 4: Feature Extraction from Facial Image
All RGB and grayscale images are converted into binary images. The preprocessed image is fed into the face detection block, where face detection is carried out using the Viola and Jones algorithm. The default 'Frontal CART' property, with a merging threshold of 16, is employed to carry out this process; the 'Frontal CART' property only detects frontal faces that are upright and forward facing. Since Viola and Jones by default produces multiple bounding boxes around the face, the merging threshold of 16 helps in coagulating these multiple boxes into a single bounding box. In the case of multiple faces, this threshold keeps the face closest to the camera and filters out the more distant faces.
EMO PLAYER SEMINAR REPORT| 2016
MCET 22 nzmotp@gmail.com
Figure 5: Regions involved in Expression analysis
The binary image obtained from the face detection block forms the input to the feature extraction block, where the eyes and mouth are extracted. The eyes are extracted using the Viola and Jones method; to extract the mouth, however, certain measurements are considered: first the bounding box of the nose is calculated using Viola and Jones, and then the bounding box of the mouth is deduced from it, as sketched below.
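For illustration, a rough MATLAB sketch of this detection chain is given below, assuming the Computer Vision Toolbox cascade detectors. The 'EyePairBig' and 'Nose' models and the image file name are illustrative choices, and the mouth box reuses the mouthFromNose helper sketched earlier in this section.

% Sketch: face, eye and nose detection feeding the mouth-box arithmetic.
img = imread('user_frame.jpg');                     % assumed captured frame

faceDet = vision.CascadeObjectDetector('ClassificationModel', ...
              'FrontalFaceCART', 'MergeThreshold', 16);
eyeDet  = vision.CascadeObjectDetector('EyePairBig');
noseDet = vision.CascadeObjectDetector('Nose');

faceBox = step(faceDet, img);                       % single upright frontal face
faceImg = imcrop(img, faceBox(1, :));               % restrict search to that face

eyeBox  = step(eyeDet,  faceImg);                   % eyes via Viola and Jones
noseBox = step(noseDet, faceImg);                   % nose via Viola and Jones

% Mouth region deduced from the nose box, per equations (1)-(4)
mouthBox = mouthFromNose(noseBox(1, :));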
Figure 6: Face region grids (Bounding box)
4.2 Audio Feature Extraction Module
In this module a list of songs forms the input. Since songs are audio files, they require a certain amount of preprocessing: stereo signals obtained from the Internet are converted to 16-bit PCM mono signals at a sampling rate of 44.1 kHz. The conversion is done using Audacity. The preprocessed signal then undergoes audio feature extraction, where features such as tonality and rhythm are extracted using the MIR 1.5 Toolbox, pitch is extracted using the Chroma Toolbox, and other features such as centroid, spectral flux, spectral roll-off, kurtosis and 15 MFCC coefficients are extracted using the Auditory Toolbox.
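The report performs this conversion in Audacity. Purely for illustration, a roughly equivalent MATLAB sketch of the stereo-to-mono, 16-bit PCM, 44.1 kHz preprocessing step could look as follows (file names are placeholders).

% Sketch: convert a stereo track to a 16-bit PCM mono WAV at 44.1 kHz.
% The report does this step in Audacity; this is only an illustrative
% MATLAB equivalent with placeholder file names.
[x, fs] = audioread('song_stereo.mp3');       % stereo signal from the internet

if size(x, 2) > 1
    x = mean(x, 2);                           % average the channels to get mono
end

targetFs = 44100;
if fs ~= targetFs
    x = resample(x, targetFs, fs);            % resample to 44.1 kHz
end

audiowrite('song_mono.wav', x, targetFs, 'BitsPerSample', 16);  % 16-bit PCM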
Figure 7: Audio Preprocessing
Audio signals are categorized into 8 types, viz. joy, sad, anger, joy-anger, sad-anger, joy-excitement, joy-surprise and others.
1. Songs that are cheerful, energetic and playful are classified under Joy.
2. Songs that are very depressing are classified under Sad.
3. Songs that reflect attitude or revenge are classified under Anger.
4. Songs with anger in a playful mode are classified under the Joy-Anger category.
5. Songs with a very depressed and angry mood are classified under the Sad-Anger category.
6. Songs that reflect the excitement of joy are classified under the Joy-Excitement category.
7. Songs that reflect the surprise of joy are classified under the Joy-Surprise category.
8. All other songs fall under the 'Others' category.
Feature extraction from the converted files is done by three toolboxes: the MIR 1.5 Toolbox, the Chroma Toolbox and the Auditory Toolbox.
4.2.1 MIR 1.5 Toolbox
Features such as tonality, rhythm and structure are extracted using an integrated set of functions written in MATLAB. It is one of the important toolboxes for audio feature extraction. In addition to the basic computational processes, the toolbox also includes higher-level musical feature extraction tools, whose alternative strategies and multiple combinations can be selected by the user. The toolbox was initially conceived in the context of the Brain Tuning project financed by the European Union (FP6-NEST); one main objective was to investigate the relation between musical features, music-induced emotion and the associated neural activity.
4.2.2 Chroma Toolbox
The Chroma Toolbox is a set of MATLAB implementations for extracting various types of novel pitch-based and chroma-based audio features. It contains feature extractors for pitch features as well as parameterized families of variants of chroma-like features.
Figure 8: Chroma Toolbox
4.2.3 Auditory Toolbox
In the Auditory Toolbox, features such as centroid, spectral flux, spectral roll-off, kurtosis and 15 MFCC coefficients are extracted, with the implementation in MATLAB. The MFCCs return a description of the spectral shape of the sound; the computation of the cepstral coefficients follows the scheme presented in Figure 9.
Figure 9: Phases of the MFCC computation
4.2.4 MFCC
MFCC are perceptually motivated features that are also based on the Short-time Fourier
Transform (STFT). After taking the log-amplitude of the magnitude spectrum, the FFT bins
are grouped and smoothed according to the perceptually motivated Mel-frequency scaling.
Finally, in order to decorrelate the resulting feature vectors, a Discrete Cosine Transform
(DCT) is performed. Although typically 13 coefficients are used for speech representation, it
was found that the first five coefficients are adequate for music representation (Tzanetakis,
2002).
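A minimal MATLAB sketch of this pipeline is shown below. The 25 ms frames, 10 ms hop and 40 triangular mel filters are assumptions rather than the Auditory Toolbox's exact settings, and the log is applied before the mel grouping to follow the order described above (many implementations apply it after the filterbank instead).

% Minimal MFCC sketch: STFT -> log-magnitude -> mel grouping -> DCT.
function c = mfcc_sketch(x, fs, numCoeffs)
    frameLen = round(0.025 * fs);                 % 25 ms analysis window
    hop      = round(0.010 * fs);                 % 10 ms hop
    nfft     = 2^nextpow2(frameLen);
    numFilters = 40;

    % Short-time Fourier transform and log-amplitude of the magnitude spectrum
    [S, f] = spectrogram(x, hamming(frameLen), frameLen - hop, nfft, fs);
    logMag = log(abs(S) + eps);

    % Triangular mel filterbank between 0 Hz and fs/2
    hz2mel = @(hz) 2595 * log10(1 + hz / 700);
    mel2hz = @(m) 700 * (10.^(m / 2595) - 1);
    edges  = mel2hz(linspace(hz2mel(0), hz2mel(fs / 2), numFilters + 2));
    fb = zeros(numFilters, numel(f));
    for k = 1:numFilters
        rise = (f - edges(k))   / (edges(k+1) - edges(k));
        fall = (edges(k+2) - f) / (edges(k+2) - edges(k+1));
        fb(k, :) = max(0, min(rise, fall))';
    end

    % Group and smooth the FFT bins with the mel filterbank, then decorrelate via DCT
    melSpec = fb * logMag;
    c = dct(melSpec);
    c = c(1:numCoeffs, :);                        % e.g. the first 13 coefficients
end

For example, mfcc_sketch(x, 44100, 13) would keep the first 13 coefficients per frame, as is typical for speech, while a music application might keep only the first five.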
4.3 Emotion-Audio Integration Module
Emotions extracted for the songs are stored as meta-data in a database, and mapping is performed by querying this meta-data database. The emotion extraction module and the audio feature extraction module are finally mapped and combined using the emotion-audio integration module.
4.3.1 Audio Emotion Recognition
In the music emotion recognition block, the playlist of a user forms the input. An audio signal initially undergoes a certain amount of preprocessing. Since music files acquired from the internet are usually stereo signals, all stereo signals are converted to 16-bit PCM mono signals at a sampling rate of 44.1 kHz; the conversion is performed using Audacity. Converting a stereo signal into a mono signal is crucial to avoid the mathematical complexity of processing the similar content of both channels of a stereo signal. An 80-second window is then extracted from the entire audio signal. Audio files are usually very large in size and computationally expensive in terms of memory; hence, to reduce the memory cost incurred during audio feature extraction, an 80-second window is extracted from each audio file. This also ensures uniformity in the size of each audio file. During extraction, the first 20 seconds of an audio file are discarded, which helps ensure efficient retrieval of significant information from the file. Figure 7 depicts the process of extraction of the 80-second window, and a small sketch of this step follows.
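A minimal MATLAB sketch of the windowing step, assuming a mono signal already at 44.1 kHz and a placeholder file name, is:

% Sketch: skip the first 20 s of the track and keep the next 80 s.
[x, fs] = audioread('song_mono.wav');     % mono signal, 44.1 kHz (see preprocessing)

startSample = round(20 * fs) + 1;         % discard the first 20 seconds
endSample   = min(startSample + round(80 * fs) - 1, numel(x));
window80s   = x(startSample:endSample);   % 80-second analysis window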
This pre-processed signal then undergoes audio feature extraction, where the centroid, spectral flux, spectral roll-off, kurtosis, zero-crossing rate and 13 MFCC coefficients are extracted. The toolboxes used for audio feature extraction are MIR 1.5, the Auditory Toolbox and the Chroma Toolbox. Music-based emotion recognition is then carried out using the Support Vector Machine's 'one vs. other' approach, and audio signals are classified under 6 categories, viz. Joy, Sad, Anger, Sad-Anger, Joy-Anger and Others. While various mood models have been proposed in the literature, they fail to capture the perceptions of a user in real time; they are figments of the theoretical aspects that researchers have associated with different audio signals. The mood model and the classes considered in this paper take into account how a user may associate different moods with an audio signal and a song. While composing a song, an artist may not maintain a uniform mood across the entire excerpt: songs are usually based on a theme or a situation, and a song may convey sadness in its first half while the second half becomes cheerful. Hence the model adopted in this paper takes all these aspects into consideration and generates paired classes for such songs, in addition to the individual classes. The model adopted is as follows: songs that are cheerful, energetic and playful are classified under Joy; songs that are very depressing are classified under Sad; songs that reflect attitude, 'anger associated with patriotism', or revenge are classified under Anger; the Joy-Anger category is associated with songs that possess anger in a playful mode; the Sad-Anger category is composed of songs that revolve around the theme of being extremely depressed and angry; all other songs fall under the 'Others' category; and when a user is detected with emotions such as surprise or fear, songs from the Others category are suggested.
Figure 10: Integration of the three modules (emotion extraction, audio feature extraction and emotion-audio integration, linked through the meta-data database)
Figure 11: Mapping of modules
Figure 11 depicts the mapping between the facial emotion recognition module and the audio emotion recognition module. The name of each file and the emotions associated with the song are recorded in a database as its meta-data, and the final mapping between the two blocks is carried out by querying this meta-data database. Since the facial emotion recognition and audio emotion recognition modules are two mutually exclusive components, system integration is performed using an intermediate mapping module that relates the two blocks with the help of the audio meta-data file. While the fear and surprise categories of facial emotions are mapped onto the 'Others' category of audio emotions, each of the joy, sad and anger categories of facial emotions is mapped onto two categories of audio emotions, as shown in the diagram. During playlist generation, when the system classifies a facial image under one of these three emotions, it considers both the individual and the paired classes; for example, if a facial image is classified under joy, the system will display songs under both the Joy and Joy-Anger categories.
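A minimal sketch of this mapping and querying step in MATLAB is given below. The table and column names are placeholders rather than the report's actual schema, and the paired class for anger is not spelled out in the report, so it is marked as an assumption in the code.

% Sketch: map a detected facial emotion onto audio-emotion categories and
% query the meta-data for songs that fall into those categories.
facialToAudio = containers.Map( ...
    {'joy', 'sad', 'anger', 'fear', 'surprise'}, ...
    {{'Joy', 'Joy-Anger'}, ...        % joy  -> individual class + paired class
     {'Sad', 'Sad-Anger'}, ...        % sad  -> individual class + paired class
     {'Anger', 'Sad-Anger'}, ...      % assumption: paired class chosen for anger
     {'Others'}, ...                  % fear     -> Others
     {'Others'}}, ...                 % surprise -> Others
    'UniformValues', false);

detectedEmotion = 'joy';                              % output of the facial module
targetClasses   = facialToAudio(detectedEmotion);

meta = readtable('audio_metadata.csv');               % placeholder meta-data file
mask = ismember(meta.AudioEmotion, targetClasses);    % songs in the matching classes
playlist = meta.FileName(mask);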
4.4 Classifier
Many classifiers have been applied to expression recognition such as neural network
(NN), support vector machines (SVM), linear discriminant analysis (LDA), K-nearest
neighbor, multinomial logistic ridge regression (MLR), hidden Markov models (HMM), tree
augmented naive Bayes, and others.
In the proposed method, a Support Vector Machine with a Radial Basis Function (RBF) kernel is used as the classifier. The Support Vector Machine is an adaptive learning system which receives labeled training data and transforms it into a higher-dimensional feature space; a separating hyper-plane that maximizes the margin is then computed to determine the best separation between classes. The greatest advantage of the SVM is that, even with a small set of training data, it generalizes well. Shan evaluated the performance of LBP features as facial expression descriptors and obtained a result of 89% using an SVM with an RBF kernel trained on the CK database. Recently, Bartlett conducted similar experiments: they selected 313 image sequences from the Cohn-Kanade database, coming from 90 subjects with 1 to 6 emotions per subject. The facial images were converted into a Gabor magnitude representation using a bank of 40 Gabor filters, and 10-fold cross-validation experiments were performed using SVMs with linear, polynomial and RBF kernels. The linear and RBF kernels performed best, with recognition rates of 84.8% and 86.9% respectively. As a powerful machine learning technique for data classification, the SVM performs an implicit mapping of the data into a higher (possibly infinite) dimensional feature space, and then finds a linear separating hyper-plane with the maximal margin to separate the data in this higher-dimensional space.
SVM finds the hyper-plane that maximizes the distance between the support vectors and the
hyper-plane. SVM allows domain-specific selection of the kernel function. Though new
kernels are being proposed, the most frequently used kernel functions are the linear,
polynomial, and Radial Basis Function (RBF) kernels.
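Purely as an illustration of this kind of classifier, a multi-class SVM with an RBF kernel can be trained in MATLAB (Statistics and Machine Learning Toolbox) roughly as follows; the placeholder feature matrix, the label set and the one-vs-all coding are assumptions rather than the report's exact setup.

% Sketch: multi-class emotion classification with RBF-kernel SVMs.
% X is assumed to be an N-by-D matrix of extracted facial features and
% Y an N-by-1 categorical vector of emotion labels (placeholder data here).
X = randn(100, 20);
Y = categorical(randi(7, 100, 1), 1:7, ...
    {'anger', 'disgust', 'fear', 'joy', 'neutral', 'sad', 'surprise'});

t = templateSVM('KernelFunction', 'rbf', 'Standardize', true);
model = fitcecoc(X, Y, 'Learners', t, 'Coding', 'onevsall');   % one-vs-other SVMs

predicted = predict(model, X(1, :));          % classify one new feature vector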
4.5 Databases
One of the most important aspects of developing any new recognition or detection system is the choice of the database that will be used to test the new system. For the purpose of training, two facial expression databases are used. The first is the FG-NET Facial Expression and Emotion Database [4], which consists of MPEG video files with spontaneously displayed emotions, gathered from 18 subjects (9 female and 9 male). The system has to be trained with captured video frames in which the displayed emotion is very representative; the training set consists of images of seven states, neutral plus six emotional states (surprise, fear, disgust, sadness, happiness and anger). The second database is the Cohn-Kanade Facial Expression Database [5], which contains 486 image sequences displayed by 97 posers. Each sequence displays the emotion from its onset to its peak; however, only the last image of a sequence is used for training. The training and testing sets formed from the Cohn-Kanade database contain images divided into seven classes: neutral, surprise, fear, disgust, sadness, happiness and anger. With appropriate training sets, classifier training can be performed. The database was planned and designed in a general and expandable way: it supports the current needs but also allows, for instance, different mood models (both categorical and dimensional), user accounts, lists of artists, albums, genres, features and classification profiles, and saving different classification and tracking values for the same song based on different combinations of features and classifiers.
4.6 Music
The last and most important part of the system is playing music according to the user's current emotional state. Once the facial expression of the user has been classified, the corresponding emotion state is found. A musical database consisting of songs from a variety of domains and pertaining to a number of emotions is maintained. Every song in each emotion category of the database is assigned a weight, and each detected emotion is likewise assigned a weight. The algorithm finds the song closest in weight to the classified emotion, and this song is then played.
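A minimal sketch of this weight-matching step, with made-up weights purely for illustration, is:

% Sketch: pick the song whose weight is closest to the detected emotion's weight.
songNames   = {'track_a.mp3', 'track_b.mp3', 'track_c.mp3'};   % placeholder songs
songWeights = [0.2, 0.55, 0.9];                                % assumed song weights

emotionWeight = 0.6;                         % weight assigned to the detected emotion

[~, idx] = min(abs(songWeights - emotionWeight));
selectedSong = songNames{idx};               % the song closest in weight is played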
CHAPTER 5
Viola and Jones algorithm
The Viola-Jones object detection framework, proposed in 2001 by Paul Viola and Michael Jones, was the first object detection framework to provide competitive object detection rates in real time. Although it can be trained to detect a variety of object classes, it was motivated primarily by the problem of face detection. The problem to be solved is the detection of faces in an image. A human can do this easily, but a computer needs precise instructions and constraints. To make the task more manageable, Viola-Jones requires full-view, frontal, upright faces: in order to be detected, the entire face must point towards the camera and should not be tilted to either side. While these constraints might seem to diminish the algorithm's utility somewhat, in practice these limits on pose are quite acceptable, because the detection step is most often followed by a recognition step. Here the 'Frontal CART' property is used, which only detects frontal faces that are upright and forward facing; this property is employed with a merging threshold of 16, a value that helps in coagulating the multiple detected boxes into a single bounding box.
5.1 Advantages
• Extremely fast feature computation and efficient feature selection.
• Scale and location invariant detector: instead of scaling the image itself (e.g. pyramid filters), the features are scaled.
• Such a generic detection scheme can be trained to detect other types of objects (e.g. cars, hands).
5.2 Disadvantages
• The detector is most effective only on frontal images of faces.
• It can hardly cope with 45° face rotation, either around the vertical or the horizontal axis.
• It is sensitive to lighting conditions.
• Multiple detections of the same face may occur due to overlapping sub-windows.
5.3 MATLAB code on Frontal CART for object detection
function [] = Viola_Jones_img(Img)
% Viola_Jones_img(Img)
%   Img - input image
%   Example call: Viola_Jones_img(imread('name_of_the_picture.jpg'))

% Frontal CART face detector with a merging threshold of 16, as used above
faceDetector = vision.CascadeObjectDetector('ClassificationModel', ...
    'FrontalFaceCART', 'MergeThreshold', 16);

bboxes = step(faceDetector, Img);   % one [x y w h] box per detected face

figure, imshow(Img), title('Detected faces'); hold on
for i = 1:size(bboxes, 1)
    rectangle('Position', bboxes(i, :), 'LineWidth', 2, 'EdgeColor', 'y');
end
end
CHAPTER 6
Experimental results
Testing and implementation were performed using either MATLAB R2013a or the 2014 release of MATLAB, on a Windows 7/8 32-bit operating system with an Intel Core i3 processor. Facial expression extraction was done on both user-independent and user-dependent datasets: a dataset consisting of facial images of 25 individuals was selected for the user-independent experiment, and a dataset of 10 individuals was selected for the user-dependent experiment. Images of size 4000x3000 were used for the static and dynamic dataset experiments. The implementation and experimentation reported in this paper were carried out using MATLAB R2013a on a Windows 8 32-bit operating system with an Intel i5 M460 (2.53 GHz) processor. For facial emotion recognition, two experiments were carried out:
i. using a user-independent dataset;
ii. using a user-dependent dataset.
6.1 EMOTION EXTRACTION
A user-independent dataset of 30 images and a user-dependent dataset of 5 images were selected for the extraction of emotions. The estimated time for the various modules of the emotion extraction module is given in Table 1.
Table 1 Time Estimation of Various modules of Emotion Extraction Module
6.2 AUDIO FEATURE EXTRACTION
A dataset of around 200 songs was considered for experimentation and testing of the audio feature extraction module; the songs were collected from various Bollywood music sites such as Djmaza.in and Songs.pk. The estimated accuracy for the various categories of emotions is given in Table 2.
Table 2 Estimated Accuracy for different categories of Audio Feature
6.3 EMOTION BASED MUSIC PLAYER
Table 3 gives the average time taken by the proposed algorithm (i.e. the algorithm that generates a playlist based on facial expressions) and by its facial expression recognition and system integration modules.
Table 3 Average Time Estimation of Various modules of Proposed System
CHAPTER 7
CONCLUSION
Experimental results have shown that the time required for audio feature extraction is negligible (around 0.0006 s) and, since the songs are processed and stored beforehand, the total estimated time of the proposed system is essentially the time required for the extraction of facial features (around 0.9994 s). The various classes of emotion also yield a better accuracy rate compared with previously existing systems. The total computational time taken is about 1.000 s, which is very low and thus helps in achieving better real-time performance and efficiency. The system therefore aims at providing Windows users with a cheaper, accurate, emotion-based music system that is free of additional hardware. The emotion-based music system will be of great advantage to users looking for music based on their mood and emotional behavior. It will help reduce the searching time for music, thereby reducing unnecessary computational time and increasing the overall accuracy and efficiency of the system. The system will not only reduce physical stress but will also act as a boon for music therapy systems and may assist a music therapist in treating a patient. With the additional features mentioned above, it will be a complete system for music lovers and listeners. A wide variety of image processing techniques was developed to meet the requirements of the facial expression recognition system. Apart from the theoretical background, this work provides ways to design and implement an emotion-based music player. The proposed system is able to process video of facial behavior, recognize the displayed actions in terms of basic emotions, and then play music based on these emotions. The major strengths of the system are full automation as well as user and environment independence. Even though the system cannot handle occlusions and significant head rotations, head shifts are allowed. In future work, we would like to focus on improving the recognition rate of our system.
CHAPTER 8
APPLICATIONS
Music therapists use music to enhance social or interpersonal, affective, cognitive, and
behavioral functioning. Research indicates that music therapy is effective at reducing muscle
tension and anxiety, and at promoting relaxation, verbalization, interpersonal relationships,
and group cohesiveness. This can set the stage for open communication and provide a
starting place for non-threatening support and processing symptoms associated with or
exacerbated by trauma and disaster, such as the 9/11 event. A therapist can talk with a client,
but a qualified music therapist can use music to actively link a client to their psycho-
emotional state quickly. In certain settings, the active use of music therapy interventions has
resulted in a shorter length of stay (treatment period) and more efficient response to the
client’s overall intervention plan.
The use of songs in music therapy with cancer patients and their families effectively provides
important means for support and tools for change. Human contact and professional support
can often assist in diminishing the suffering involved in cancer. There exists an inherent
association between songs and human contact since lyrics represent melodic verbal
communication. Thus, the use of songs in music therapy can have naturally meaningful
applications.
The proposed algorithm was successful in crafting a mechanism that can find its application in music therapy systems and can help a music therapist treat a patient suffering from disorders such as acute depression, stress or aggression. The system is prone to giving unpredictable results in difficult lighting conditions; removing this drawback is therefore envisaged as part of future work.
CHAPTER 9
FUTURE SCOPE
In the future, we would like to develop a mood-enhancing music player which starts with the user's current emotion (which may be sad) and then plays music of positive emotions, thereby eventually giving the user a joyful feeling. Finally, we would like to improve the time efficiency of our system in order to make it appropriate for use in different applications.
CHAPTER 10
REFERENCES
[1] Y. Chang, C. Hu, R. Feris, and M. Turk, "Manifold based analysis of facial expression," Image and Vision Computing, vol. 24, pp. 605-614, June 2006.
[2] A. Habibzad, Ninavin, and M. K. Mirnia, "A new algorithm to classify face emotions through eye and lip features by using particle swarm optimization."
[3] B. Han, S. Rho, R. B. Dannenberg, and E. Hwang, "SMERS: music emotion recognition using support vector regression," 10th ISMIR, 2009.
[4] A. I. Goldman and C. S. Sripada, "Simulationist models of face-based emotion recognition."
[5] C. A. Cervantes and K.-T. Song, "Embedded design of an emotion-aware music player," IEEE International Conference on Systems, Man, and Cybernetics, pp. 2528-2533, 2013.
[6] F. Guney, "Emotion recognition using face images," Bogazici University, Istanbul, Turkey 34342.
[7] J.-J. Wong and S.-Y. Cho, "Facial emotion recognition by adaptive processing of tree structures."
[8] K. Hevner, "The affective character of the major and minor modes in music," The American Journal of Psychology, vol. 47(1), pp. 103-118, 1935.
[9] S. Strupp, N. Schmitz, and K. Berns, "Visual-based emotion detection for natural man-machine interaction."
[10] M. Lyons and S. Akamatsu, "Coding facial expressions with Gabor wavelets," IEEE Conference on Automatic Face and Gesture Recognition, March 2000.
[11] K.-C. Huang, Y.-H. Kuo, and M.-F. Horng, "Emotion recognition by a novel triangular facial feature extraction method."
[12] S. Dornbush, K. Fisher, K. McKay, A. Prikhodko, and Z. Segall, "XPod: a human activity and emotion aware mobile music player," UMBC Ebiquity, November 2005.
[13] R. R. Londhe and V. P. Pawar, "Analysis of facial expression and recognition based on statistical approach," International Journal of Soft Computing and Engineering (IJSCE), vol. 2, May 2012.
[14] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39(6), pp. 1161-1178, 1980.
[15] S. Jun, S. Rho, B. Han, and E. Hwang, "A fuzzy inference-based music emotion recognition system," VIE, 2008.
[16] R. E. Thayer, The Biopsychology of Mood and Arousal, Oxford University Press, 1989.
[17] Z. Zeng et al., "A survey of affect recognition methods: audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, January 2009.
Your emotions and the impact of your reactions pdfBettina Pickering
 
Facial emotion recognition
Facial emotion recognitionFacial emotion recognition
Facial emotion recognitionAnukriti Dureha
 
Audit free cloud storage via deniable attribute based encryption
Audit free cloud storage via  deniable attribute based encryptionAudit free cloud storage via  deniable attribute based encryption
Audit free cloud storage via deniable attribute based encryptionMano Sriram
 
Emotion recognition using facial expressions and speech
Emotion recognition using facial expressions and speechEmotion recognition using facial expressions and speech
Emotion recognition using facial expressions and speechLakshmi Sarvani Videla
 
Offline signature verification based on geometric feature extraction using ar...
Offline signature verification based on geometric feature extraction using ar...Offline signature verification based on geometric feature extraction using ar...
Offline signature verification based on geometric feature extraction using ar...Cen Paul
 

En vedette (20)

MusicMood - Machine Learning in Automatic Music Mood Prediction Based on Song...
MusicMood - Machine Learning in Automatic Music Mood Prediction Based on Song...MusicMood - Machine Learning in Automatic Music Mood Prediction Based on Song...
MusicMood - Machine Learning in Automatic Music Mood Prediction Based on Song...
 
Music Mood Detection (Lyrics based Approach)
Music Mood Detection (Lyrics based Approach)Music Mood Detection (Lyrics based Approach)
Music Mood Detection (Lyrics based Approach)
 
Facial expression and emotions 1
Facial expression and emotions 1Facial expression and emotions 1
Facial expression and emotions 1
 
MOODetector: Automatic Music Emotion Recognition
MOODetector: Automatic Music Emotion RecognitionMOODetector: Automatic Music Emotion Recognition
MOODetector: Automatic Music Emotion Recognition
 
Facial emotion recognition corporatemeet
Facial emotion recognition corporatemeetFacial emotion recognition corporatemeet
Facial emotion recognition corporatemeet
 
From Music Information Retrieval to Music Emotion Recognition
From Music Information Retrieval to Music Emotion RecognitionFrom Music Information Retrieval to Music Emotion Recognition
From Music Information Retrieval to Music Emotion Recognition
 
Emotion Detection from Text
Emotion Detection from TextEmotion Detection from Text
Emotion Detection from Text
 
Emotion detection from text using data mining and text mining
Emotion detection from text using data mining and text miningEmotion detection from text using data mining and text mining
Emotion detection from text using data mining and text mining
 
Applications of Emotions Recognition
Applications of Emotions RecognitionApplications of Emotions Recognition
Applications of Emotions Recognition
 
SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...
SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...
SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...
 
Music player
Music playerMusic player
Music player
 
Final Year Project - Enhancing Virtual Learning through Emotional Agents (Doc...
Final Year Project - Enhancing Virtual Learning through Emotional Agents (Doc...Final Year Project - Enhancing Virtual Learning through Emotional Agents (Doc...
Final Year Project - Enhancing Virtual Learning through Emotional Agents (Doc...
 
Social Networking - An Ethical Hacker's View
Social Networking - An Ethical Hacker's ViewSocial Networking - An Ethical Hacker's View
Social Networking - An Ethical Hacker's View
 
Pengenalan Wajah Dua Dimensi Menggunakan Multi-Layer Perceptron Berdasarkan N...
Pengenalan Wajah Dua Dimensi Menggunakan Multi-Layer Perceptron Berdasarkan N...Pengenalan Wajah Dua Dimensi Menggunakan Multi-Layer Perceptron Berdasarkan N...
Pengenalan Wajah Dua Dimensi Menggunakan Multi-Layer Perceptron Berdasarkan N...
 
Human emotion recognition
Human emotion recognitionHuman emotion recognition
Human emotion recognition
 
Your emotions and the impact of your reactions pdf
Your emotions and the impact of your reactions pdfYour emotions and the impact of your reactions pdf
Your emotions and the impact of your reactions pdf
 
Facial emotion recognition
Facial emotion recognitionFacial emotion recognition
Facial emotion recognition
 
Audit free cloud storage via deniable attribute based encryption
Audit free cloud storage via  deniable attribute based encryptionAudit free cloud storage via  deniable attribute based encryption
Audit free cloud storage via deniable attribute based encryption
 
Emotion recognition using facial expressions and speech
Emotion recognition using facial expressions and speechEmotion recognition using facial expressions and speech
Emotion recognition using facial expressions and speech
 
Offline signature verification based on geometric feature extraction using ar...
Offline signature verification based on geometric feature extraction using ar...Offline signature verification based on geometric feature extraction using ar...
Offline signature verification based on geometric feature extraction using ar...
 

Similaire à Emotion based music player

IRJET - EMO-MUSIC(Emotion based Music Player)
IRJET - EMO-MUSIC(Emotion based Music Player)IRJET - EMO-MUSIC(Emotion based Music Player)
IRJET - EMO-MUSIC(Emotion based Music Player)IRJET Journal
 
Emotion Based Music Player System
Emotion Based Music Player SystemEmotion Based Music Player System
Emotion Based Music Player SystemIRJET Journal
 
SMART MUSIC PLAYER BASED ON EMOTION DETECTION
SMART MUSIC PLAYER BASED ON EMOTION DETECTIONSMART MUSIC PLAYER BASED ON EMOTION DETECTION
SMART MUSIC PLAYER BASED ON EMOTION DETECTIONIRJET Journal
 
B8_Mini project_Final review ppt.pptx
B8_Mini project_Final review ppt.pptxB8_Mini project_Final review ppt.pptx
B8_Mini project_Final review ppt.pptxEgguIqbal
 
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...IRJET Journal
 
Mood based Music Player
Mood based Music PlayerMood based Music Player
Mood based Music PlayerIRJET Journal
 
Moodify – Music Player Based on Mood
Moodify – Music Player Based on MoodMoodify – Music Player Based on Mood
Moodify – Music Player Based on MoodIRJET Journal
 
IRJET- The Complete Music Player
IRJET- The Complete Music PlayerIRJET- The Complete Music Player
IRJET- The Complete Music PlayerIRJET Journal
 
IRJET- Study of Effect of PCA on Speech Emotion Recognition
IRJET- Study of Effect of PCA on Speech Emotion RecognitionIRJET- Study of Effect of PCA on Speech Emotion Recognition
IRJET- Study of Effect of PCA on Speech Emotion RecognitionIRJET Journal
 
Nithin Xavier research_proposal
Nithin Xavier research_proposalNithin Xavier research_proposal
Nithin Xavier research_proposalNithin Xavier
 
Mehfil : Song Recommendation System Using Sentiment Detected
Mehfil : Song Recommendation System Using Sentiment DetectedMehfil : Song Recommendation System Using Sentiment Detected
Mehfil : Song Recommendation System Using Sentiment DetectedIRJET Journal
 
Facial recognition to detect mood and recommend songs
Facial recognition to detect mood and recommend songsFacial recognition to detect mood and recommend songs
Facial recognition to detect mood and recommend songsIRJET Journal
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET Journal
 
A new alley in Opinion Mining using Senti Audio Visual Algorithm
A new alley in Opinion Mining using Senti Audio Visual AlgorithmA new alley in Opinion Mining using Senti Audio Visual Algorithm
A new alley in Opinion Mining using Senti Audio Visual AlgorithmIJERA Editor
 

Similaire à Emotion based music player (20)

IRJET - EMO-MUSIC(Emotion based Music Player)
IRJET - EMO-MUSIC(Emotion based Music Player)IRJET - EMO-MUSIC(Emotion based Music Player)
IRJET - EMO-MUSIC(Emotion based Music Player)
 
Emotion Based Music Player System
Emotion Based Music Player SystemEmotion Based Music Player System
Emotion Based Music Player System
 
SMART MUSIC PLAYER BASED ON EMOTION DETECTION
SMART MUSIC PLAYER BASED ON EMOTION DETECTIONSMART MUSIC PLAYER BASED ON EMOTION DETECTION
SMART MUSIC PLAYER BASED ON EMOTION DETECTION
 
B8_Mini project_Final review ppt.pptx
B8_Mini project_Final review ppt.pptxB8_Mini project_Final review ppt.pptx
B8_Mini project_Final review ppt.pptx
 
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
 
Mood based Music Player
Mood based Music PlayerMood based Music Player
Mood based Music Player
 
Ijsrdv8 i10550
Ijsrdv8 i10550Ijsrdv8 i10550
Ijsrdv8 i10550
 
Audio-
Audio-Audio-
Audio-
 
STUDY ON AIR AND SOUND POLLUTION MONITORING SYSTEM USING IOT
STUDY ON AIR AND SOUND POLLUTION MONITORING SYSTEM USING IOTSTUDY ON AIR AND SOUND POLLUTION MONITORING SYSTEM USING IOT
STUDY ON AIR AND SOUND POLLUTION MONITORING SYSTEM USING IOT
 
Isolation Of Lactic Acid Producing Bacteria And Production Of Lactic Acid Fro...
Isolation Of Lactic Acid Producing Bacteria And Production Of Lactic Acid Fro...Isolation Of Lactic Acid Producing Bacteria And Production Of Lactic Acid Fro...
Isolation Of Lactic Acid Producing Bacteria And Production Of Lactic Acid Fro...
 
Study On An Emotion Based Music Player
Study On An Emotion Based Music PlayerStudy On An Emotion Based Music Player
Study On An Emotion Based Music Player
 
Moodify – Music Player Based on Mood
Moodify – Music Player Based on MoodMoodify – Music Player Based on Mood
Moodify – Music Player Based on Mood
 
IRJET- The Complete Music Player
IRJET- The Complete Music PlayerIRJET- The Complete Music Player
IRJET- The Complete Music Player
 
IRJET- Study of Effect of PCA on Speech Emotion Recognition
IRJET- Study of Effect of PCA on Speech Emotion RecognitionIRJET- Study of Effect of PCA on Speech Emotion Recognition
IRJET- Study of Effect of PCA on Speech Emotion Recognition
 
Nithin Xavier research_proposal
Nithin Xavier research_proposalNithin Xavier research_proposal
Nithin Xavier research_proposal
 
Mehfil : Song Recommendation System Using Sentiment Detected
Mehfil : Song Recommendation System Using Sentiment DetectedMehfil : Song Recommendation System Using Sentiment Detected
Mehfil : Song Recommendation System Using Sentiment Detected
 
Emotion classification for musical data using deep learning techniques
Emotion classification for musical data using deep learning  techniquesEmotion classification for musical data using deep learning  techniques
Emotion classification for musical data using deep learning techniques
 
Facial recognition to detect mood and recommend songs
Facial recognition to detect mood and recommend songsFacial recognition to detect mood and recommend songs
Facial recognition to detect mood and recommend songs
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound Recognition
 
A new alley in Opinion Mining using Senti Audio Visual Algorithm
A new alley in Opinion Mining using Senti Audio Visual AlgorithmA new alley in Opinion Mining using Senti Audio Visual Algorithm
A new alley in Opinion Mining using Senti Audio Visual Algorithm
 

Dernier

Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 

Dernier (20)

Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 

Emotion based music player

6.2 AUDIO FEATURE EXTRACTION 25
10. CHAPTER 7: CONCLUSION 26
11. CHAPTER 8: APPLICATIONS 27
12. CHAPTER 9: FUTURE SCOPE 28
13. CHAPTER 10: REFERENCES 29
LIST OF FIGURES
1. Framework of recognizing emotions with privileged information 03
2. Functional block diagram for basic music system 07
3. Flow chart 09
4. Feature Extraction from Facial Image 12
5. Regions involved in Expression analysis 13
6. Face region grids (Bounding box) 17
7. Audio Preprocessing 14
8. Chroma Toolbox 16
9. Phases of the MFCC computation 16
10. Integration of three modules 18
11. Mapping of modules 19
LIST OF TABLES
Table 1: Time Estimation of Various modules of Emotion Extraction Module 28
Table 2: Estimated Accuracy for different categories of Audio Feature 29
Table 3: Average Time Estimation of Various modules of Proposed System 29
ABSTRACT

The human face is an important part of an individual's body and plays a central role in conveying an individual's behavior and emotional state. Manually segregating a list of songs and generating an appropriate playlist based on an individual's emotional features is a tedious, time-consuming, labor-intensive and uphill task. Various algorithms have been proposed and developed to automate the playlist generation process; however, the existing algorithms are computationally slow, less accurate and sometimes require additional hardware such as EEG systems or sensors. The proposed system generates a playlist automatically based on the extracted facial expression, thereby reducing the effort and time involved in performing the process manually. It reduces the computational time involved in obtaining the results and the overall cost of the designed system, while improving the overall accuracy.

Testing of the system is done on both user-dependent (dynamic) and user-independent (static) datasets. Facial expressions are captured using an inbuilt camera. The accuracy of the emotion detection algorithm used in the system is around 85-90% for real-time images and around 98-100% for static images. On an average estimate, the proposed algorithm takes around 0.95-1.05 s to generate an emotion-based music playlist; over the full implementation and testing with the inbuilt camera, the average time to generate a playlist from a facial expression is about 1.10 s. Thus it yields better accuracy and computational time, and a lower design cost, compared with the algorithms covered in the literature survey.
CHAPTER 1
INTRODUCTION

Music plays a very important role in enhancing an individual's life: it is an important medium of entertainment for music lovers and listeners, and sometimes even serves a therapeutic purpose. In today's world, with ever-increasing advancements in the field of multimedia and technology, various music players have been developed with features like fast forward, reverse, variable playback speed (seek and time compression), local playback, streaming playback with multicast streams, volume modulation, genre classification and so on. Although these features satisfy the user's basic requirements, the user still faces the task of manually browsing through the playlist and selecting songs based on his current mood and behavior. Using traditional music players, a user has to manually browse through his playlist and select songs that would soothe his mood and emotional experience. This task is labor intensive, and an individual often faces the dilemma of landing on an appropriate list of songs.

Emotions are the aftermath of the interplay between an individual's cognitive appraisal of an event and the corresponding physical response towards it. Among the various ways of expressing emotions, including human speech and gesture, a facial expression is the most natural way of relaying them. The ability to understand human emotions is desirable for human-computer interaction. In the past decade, considerable research has been done on emotion recognition from voice, visual behavior and physiological signals, separately or jointly. Very good progress has been achieved in this field, and several commercial products have been developed, such as smile detection in cameras. Emotion is a subjective response to an outer stimulus. A facial expression is a discernible manifestation of the emotive state, cognitive activity, motive and psychopathology of a person, and human beings are naturally able to interpret and analyze an individual's emotional state.
The existing systems providing privileged information for emotion recognition are:
• Audio Emotion Recognition (AER)
• Music Information Retrieval (MIR)
• Emotion recognition with the help of privileged information (e.g. EEG)

AER is a technique that categorizes a received audio signal under various classes of emotions and moods, based on its audio features. MIR is a field that extracts crucial information from an audio signal by exploring audio features such as pitch, energy, flux, MFCC and kurtosis. The advent of AER and MIR equipped traditional systems with a feature that automatically parsed the playlist into different classes of emotions. Although AER and MIR augmented the capabilities of traditional music players by eradicating the need for manual segregation of the playlist and annotation of songs based on a user's emotion, such systems did not incorporate mechanisms that enabled a music player to be fully controlled by human emotions.

Although human speech and gesture are common ways of expressing emotions, facial expression is the most ancient and natural way of expressing feelings, emotions and mood. In the two methods above, AER and MIR, emotion recognition does not depend on the face of the user at all.
In the third approach, a novel method classifies emotions from EEG signals and tags the emotions of videos by integrating privileged information from the EEG feature space and the video feature space. This information is compared with the music playlist system for automatic generation, as shown in Figure 1.

Figure 1: Framework of recognizing emotions with privileged information (EEG and video feature spaces feeding EEG and video models that output an emotion tag)

The main objective of this paper is to design an efficient and accurate algorithm that generates a playlist based on the current emotional state and behavior of the user. This removes the burden on the user of manually browsing through the playlist of songs to make a selection.
CHAPTER 2
LITERATURE REVIEW

Various techniques and approaches have been proposed and developed to classify the human emotional state of behavior. The proposed approaches have focused only on some of the basic emotions.

2.1 Face detection methods

The techniques for face detection can be divided into two groups: holistic, where the face is treated as a whole unit, and analytic, where the co-occurrence of characteristic facial elements is studied. Pantic and Rothkrantz proposed a system which processes images of frontal and profile face views. Vertical and horizontal histogram analysis is used to find face boundaries; the face contour is then obtained by thresholding the image with HSV color space values. Kobayashi and Hara used images captured in monochrome mode to find the face brightness distribution, with the position of the face estimated by iris localization.

For the purpose of feature recognition, facial features have been categorized into two major categories: appearance-based feature extraction and geometric-based feature extraction. The geometric-based technique considers only the shape or major prominent points of some important facial features, such as the mouth and eyes, in the system proposed by Changbo; around 58 major landmark points were considered in crafting an ASM. Appearance-based features, such as texture, have also been considered in different areas of work; an efficient method for coding and implementing extracted facial features together with a multi-orientation and multi-resolution set of Gabor filters was proposed by Michael Lyons.

2.2 Feature extraction methods

Pantic and Rothkrantz selected a set of facial points from frontal and profile face images. The expression is measured by the distance between the positions of those points in the initial image (neutral face) and the peak image (affected face).
Cohn developed a geometric feature based system in which the optical flow algorithm is performed only in 13x13 pixel regions surrounding facial landmarks. Shan investigated the Local Binary Pattern (LBP) method for texture encoding in facial expression description; two methods of feature extraction were proposed, one extracting features from a fixed set of patches and the other from the most probable patches found by boosting.

The music mood tags and A-V values from a total of 20 subjects were tested and analyzed in Jung Hyun Kim's work, and based on the results of the analysis the A-V plane was classified into 8 regions (clusters) depicting mood, using the efficient k-means clustering algorithm from data mining. Thayer proposed a very useful 2-dimensional (stress vs. energy) model, with emotions depicted in a 2-dimensional coordinate system lying either on the 2 axes or in the 4 quadrants formed by the plot.

2.3 Expression Recognition

The last part of the FER system is based on machine learning theory; precisely, it is the classification task. The input to the classifier is the set of features retrieved from the face region in the previous stage. Classification requires supervised training, so the training set should consist of labeled data. There are many machine learning techniques for the classification task, namely: K-Nearest Neighbors, Artificial Neural Networks, Support Vector Machines, Hidden Markov Models, Expert Systems with rule based classifiers, Bayesian Networks and Boosting techniques (AdaBoost, GentleBoost).

The field of Facial Expression Recognition (FER) has produced algorithms that excel in meeting such demands. FER enables computer systems to monitor an individual's emotional state effectively and react appropriately. While various systems have been designed to recognize facial expressions, their time and memory complexity is relatively high and hence they fail to achieve real-time performance. Their feature extraction techniques are also less reliable and reduce the overall accuracy of the system. An accurate statistics-based approach was proposed by Renuka R. Londhe; that work focused mainly on the study of changes in the curvatures of the face and the intensities of the corresponding pixels of images.
Artificial Neural Networks (ANN) were used to classify the extracted features into 6 major universal emotions: anger, disgust, fear, happiness, sadness and surprise.

Some of the drawbacks of the existing systems are as follows:
i. Existing systems are highly complex in terms of time and storage for recognizing facial expressions in a real environment.
ii. Existing systems lack accuracy in generating a playlist based on the current emotional experience of a user.
iii. Existing systems employ additional hardware like EEG systems and sensors, which increases the overall cost of the system.
iv. Some of the existing systems impose a requirement of human speech for generating a playlist in accordance with a user's facial expressions.

Several approaches have been proposed and adopted to classify human emotions successfully. This paper primarily aims at resolving the drawbacks of the existing systems by designing an automated emotion based music player that generates a customized playlist from the user's extracted facial features, thus avoiding the employment of any additional hardware. It also includes a mood randomizer and appetizer function that shifts the mood-generated playlist to another randomized playlist of the same mood level after some duration.

To reduce the human effort and time needed for manual segregation of songs from a playlist in correlation with different classes of emotions and moods, various approaches have been proposed. Numerous approaches have been designed to extract facial features and audio features from an audio signal, but very few of the designed systems are capable of generating an emotion based music playlist using human emotions, and the existing designs that do generate an automated playlist use additional hardware like sensors or EEG systems, thereby increasing the cost of the proposed design.
CHAPTER 3
CURRENT MUSIC SYSTEMS

The features available in the existing music players present in computer systems are as follows:
i. Manual selection of songs
ii. Party shuffle
iii. Playlists
iv. Music squares, where the user has to classify the songs manually according to particular emotions, for only four basic emotions: passionate, calm, joyful and excited.

3.1 Limitations of the existing system
i. It requires the user to manually select the songs.
ii. Randomly played songs may not match the mood of the user.
iii. The user has to classify the songs into various emotions and then manually select a particular emotion in order to play the songs.

Figure 2: Functional block diagram for a basic music system (selection of playlists followed by selection of songs)
CHAPTER 4
PROPOSED SYSTEM

The proposed automatic playlist generation scheme is a combination of multiple schemes. In this paper, we consider the notion of collecting human emotion from the user's expressions, and explore how this information can be used to improve the user experience with music players. We present a new emotion based, user-interactive music system, EMO PLAYER. It aims to provide user-preferred music with emotion awareness. The system starts its recommendations with expert knowledge; if the user does not like a recommendation, he or she can decline it and select the desired music manually.

This paper is based on the idea of automating much of the interaction between the music player and its user. It introduces a "smart" music player, EMO PLAYER, that learns its user's emotions and tailors its music selections accordingly. After an initial training period, EMO PLAYER is able to use its internal algorithms to make an educated selection of the song that would best fit its user's emotion.

The proposed algorithm revolves around an automated music recommendation system that generates a subset of the original playlist, or a customized playlist, in accordance with a user's facial expressions. The user's facial expression helps the music recommendation system decipher the current emotional state of the user and recommend a corresponding subset of the original playlist. The system is composed of three main modules: a facial expression recognition module, an audio emotion recognition module and a system integration module. Facial expression recognition and audio emotion recognition are two mutually exclusive modules; hence, the system integration module maps the two sub-spaces by constructing and querying a meta-data file, composed of meta-data corresponding to each audio file. Figure 3 depicts the flowchart of the proposed algorithm.
Figure 3: Flow chart of the proposed algorithm
4.1 Emotion Extraction Module

The image of a user is captured using a webcam, or it can be accessed from an image stored on the hard disk. The acquired image undergoes image enhancement in the form of tone mapping in order to restore the original contrast of the image. After enhancement, all images are converted into binary image format and the face is detected using the Viola-Jones algorithm, where the Frontal CART property of the algorithm is used; it detects only upright, forward-facing faces, with a merging threshold set in the range of 16-20.

The output of the Viola-Jones face detection block forms the input to the facial feature extraction block. To increase accuracy and obtain real-time performance, only the features of the eyes and mouth are used, as these are sufficient to depict emotions accurately. For extracting the features of the mouth and eyes, certain calculations and measurements are taken into consideration. Equations (1)-(4) define the bounding box used to extract the mouth from the nose bounding box (a small sketch implementing these equations is given at the end of this section):

X_start(mouth) = X_mid(nose) - (X_end(nose) - X_start(nose))    (1)
X_end(mouth)   = X_mid(nose) + (X_end(nose) - X_start(nose))    (2)
Y_start(mouth) = Y_mid(nose) + 15                               (3)
Y_end(mouth)   = Y_start(mouth) + 103                           (4)

where (X_start(mouth), Y_start(mouth)) and (X_end(mouth), Y_end(mouth)) are the start and end points of the bounding box for the mouth, (X_mid(nose), Y_mid(nose)) is the midpoint of the nose, and X_start(nose), X_end(nose) are the start and end points of the nose. Classification is performed using a Support Vector Machine (SVM), which classifies the expression into 7 classes of emotions.
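As a minimal illustration of equations (1)-(4), the MATLAB sketch below derives the mouth bounding box from a nose bounding box returned by a detector, assuming MATLAB's [x y width height] box convention; the variable names and the noseBox value are hypothetical and only serve to make the arithmetic concrete.

% Hypothetical nose bounding box [x y width height] from a nose detector.
noseBox = [210 240 60 50];

xStartNose = noseBox(1);                      % left edge of the nose box
xEndNose   = noseBox(1) + noseBox(3);         % right edge of the nose box
xMidNose   = noseBox(1) + noseBox(3) / 2;     % horizontal midpoint of the nose
yMidNose   = noseBox(2) + noseBox(4) / 2;     % vertical midpoint of the nose

% Equations (1)-(4): mouth bounding box derived from the nose box.
xStartMouth = xMidNose - (xEndNose - xStartNose);   % (1)
xEndMouth   = xMidNose + (xEndNose - xStartNose);   % (2)
yStartMouth = yMidNose + 15;                        % (3) fixed vertical offset
yEndMouth   = yStartMouth + 103;                    % (4) fixed mouth height

mouthBox = [xStartMouth, yStartMouth, ...
            xEndMouth - xStartMouth, yEndMouth - yStartMouth];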
4.1.1 Face detection and tracking

The first part of the system is a module for face detection and landmark localization in the image. The algorithm for face detection is based on the work by Viola and Jones. In this approach the image is represented by a set of Haar-like features; the possible feature types, evaluated inside a bounding box, are two-rectangle, three-rectangle and four-rectangle features. A bounding box is the physical shape of a given object and can be used to graphically define the physical shapes of objects. A feature value is calculated by subtracting the sum of the pixels covered by the white rectangle from the sum of the pixels under the gray rectangle. Two-rectangle features detect contrast between two vertically or horizontally adjacent regions, three-rectangle features detect a contrasted region placed between two similar regions, and four-rectangle features detect similar regions placed diagonally.

Rectangle features can be computed very rapidly using an intermediate representation of the image called the integral image. A feature can thus be computed rapidly because the value of each rectangle requires only 4 pixel references (a small sketch of this computation is given at the end of this subsection).

Given the representation of the image in rectangle features, a classifier needs to be trained to decide whether the image contains the searched object (a face) or not. The number of features is much higher than the number of pixels in the original image; however, it has been proven that even a small set of well-chosen features can build a strong classifier. That is why the AdaBoost algorithm is used for training: each step selects the most discriminative feature, the one which best separates positive and negative examples. The method is widely used in the area of face detection, but it can be trained to detect any object. The algorithm is quick and efficient and can be used in real-time applications.

The facial image obtained from the face detection stage forms the input to the feature extraction stage. To obtain real-time performance and reduce time complexity, only the eyes and mouth are considered for expression recognition; the combination of these two features is adequate to convey emotions accurately. Figure 4 depicts the schematic for feature extraction.
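To make the integral-image idea concrete, here is a small MATLAB sketch (not taken from the report) that builds an integral image with cumulative sums and evaluates one two-rectangle Haar-like feature using only four lookups per rectangle; the file name and window coordinates are arbitrary assumptions.

% Build an integral image and evaluate one two-rectangle Haar-like feature.
I = imread('face.jpg');                        % hypothetical input image
if size(I, 3) == 3
    I = rgb2gray(I);
end
I = double(I);

ii = cumsum(cumsum(I, 1), 2);                  % ii(r,c) = sum of I(1:r,1:c)
iiPad = padarray(ii, [1 1], 0, 'pre');         % zero row/column so edge rectangles need no special case

% Sum of pixels in the rectangle with corners (r1,c1) and (r2,c2): 4 lookups.
rectSum = @(r1, c1, r2, c2) iiPad(r2+1, c2+1) - iiPad(r1, c2+1) ...
                          - iiPad(r2+1, c1)   + iiPad(r1, c1);

% Two-rectangle feature over a 24x24 window at the top-left corner:
leftHalf  = rectSum(1, 1, 24, 12);
rightHalf = rectSum(1, 13, 24, 24);
featureValue = rightHalf - leftHalf;           % contrast between adjacent regions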
4.1.2 Facial Expression Recognition

Figure 4: Feature extraction from a facial image

All RGB and grayscale images are converted into binary images. The preprocessed image is fed into the face detection block, where face detection is carried out using the Viola-Jones algorithm. The default 'Frontal CART' classification model with a merging threshold of 16 is employed for this process. The 'Frontal CART' model detects only frontal faces that are upright and forward facing. Since Viola-Jones by default produces multiple bounding boxes around the face, the merging threshold of 16 helps in coagulating these multiple boxes into a single bounding box. In the case of multiple faces, this threshold retains the face closest to the camera and filters out the more distant faces.
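A minimal sketch of this detection step using the cascade detector from MATLAB's Computer Vision Toolbox, assuming a test image 'photo.jpg'; the report's merging threshold of 16 is set explicitly, and the annotation step is only for visual checking.

% Detect an upright, forward-facing face with the frontal CART model.
I = imread('photo.jpg');                       % webcam frame or stored image
G = rgb2gray(I);                               % detector operates on intensity values

detector = vision.CascadeObjectDetector('FrontalFaceCART');
detector.MergeThreshold = 16;                  % merge overlapping boxes; keeps the dominant face

faceBox = step(detector, G);                   % [x y width height] per detected face
annotated = insertObjectAnnotation(I, 'rectangle', faceBox, 'Face');
imshow(annotated);                             % visual check of the detection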
Figure 5: Regions involved in expression analysis

The binary image obtained from the face detection block forms the input to the feature extraction block, where the eyes and mouth are extracted. The eyes are extracted using the Viola-Jones method; to extract the mouth, however, certain measurements are considered. First the bounding box for the nose is computed using Viola-Jones, and then the bounding box of the mouth is deduced from the nose bounding box, as given in equations (1)-(4).

Figure 6: Face region grids (bounding box)
4.2 Audio Feature Extraction Module

In this module a list of songs forms the input. As songs are audio files, they require a certain amount of preprocessing. Stereo signals obtained from the internet are converted to 16-bit PCM mono signals at a sampling rate of around 48.6 kHz; the conversion is done using Audacity (an equivalent MATLAB sketch is shown below). The pre-processed signal then undergoes audio feature extraction, where features like rhythm and tonality are extracted using the MIR 1.5 Toolbox, pitch is extracted using the Chroma Toolbox, and other features like centroid, spectral flux, spectral roll-off, kurtosis and 15 MFCC coefficients are extracted using the Auditory Toolbox.

Figure 7: Audio preprocessing
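The report performs this conversion in Audacity; purely as an illustration, an equivalent MATLAB sketch is given here, assuming an input file 'song.wav' and taking 48.6 kHz as the nominal target rate quoted above (a 44.1 kHz rate is quoted later in the report, so treat the value as configurable).

% Stereo -> 16-bit PCM mono conversion with resampling (illustrative only).
[x, fs] = audioread('song.wav');               % stereo signal from the playlist
mono = mean(x, 2);                             % average the two channels -> mono

targetFs = 48600;                              % nominal target sampling rate in Hz
mono = resample(mono, targetFs, fs);           % rational-factor resampling

audiowrite('song_mono.wav', mono, targetFs, 'BitsPerSample', 16);   % 16-bit PCM output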
Audio signals are categorized into 8 types: sad, joy-anger, joy-surprise, joy-excitement, joy, anger, sad-anger and others.
1. Songs that convey cheerfulness, energy and playfulness are classified under joy.
2. Songs that are very depressing are classified under sad.
3. Songs that reflect mere attitude or revenge are classified under anger.
4. Songs with anger in a playful mode are classified under the joy-anger category.
5. Songs with a very depressed mood together with an angry mood are classified under the sad-anger category.
6. Songs which reflect the excitement of joy are classified under the joy-excitement category.
7. Songs which reflect the surprise of joy are classified under the joy-surprise category.
8. All other songs fall under the 'others' category.

The feature extraction from the converted files is done by three toolbox units: the MIR 1.5 Toolbox, the Chroma Toolbox and the Auditory Toolbox.

4.2.1 MIR 1.5 Toolbox

Features like tonality, rhythm and structure are extracted using an integrated set of functions written in MATLAB. It is one of the important toolboxes for audio feature extraction. In addition to the basic computational processes, the toolbox also includes higher-level musical feature extraction tools, whose alternative strategies, and their multiple combinations, can be selected by the user. The toolbox was initially conceived in the context of the Brain Tuning project financed by the European Union (FP6-NEST); one main objective was to investigate the relation between musical features, music-induced emotion and the associated neural activity. A brief usage sketch is given below.
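A hedged sketch of how such rhythm- and tonality-related features might be pulled out with MIRtoolbox 1.5, assuming the toolbox is on the MATLAB path; the function names follow the MIRtoolbox documentation and the file name is an assumption carried over from the preprocessing sketch.

% Illustrative MIRtoolbox 1.5 calls for rhythm- and tonality-related features.
a = miraudio('song_mono.wav');                 % load the preprocessed audio into a MIRtoolbox object

tempoEstimate = mirgetdata(mirtempo(a));       % rhythm: tempo estimate in BPM
keyEstimate   = mirgetdata(mirkey(a));         % tonality: estimated key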
4.2.2 Chroma Toolbox

The Chroma Toolbox consists of MATLAB implementations for extracting various types of novel pitch-based and chroma-based audio features. It contains feature extractors for pitch features as well as parameterized families of variants of chroma-like features.

Figure 8: Chroma Toolbox

4.2.3 Auditory Toolbox

In the Auditory Toolbox, features like centroid, spectral flux, spectral roll-off, kurtosis and 15 MFCC coefficients are extracted, with the implementation in MATLAB. MFCC returns a description of the spectral shape of the sound; the computation of the cepstral coefficients follows the scheme presented in Figure 9. A small from-scratch sketch of two of these frame-level features is given below.

Figure 9: Phases of the MFCC computation
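To make two of the features named above concrete, here is a from-scratch frame-level computation of spectral centroid and spectral roll-off, written directly rather than through the Auditory Toolbox; the file name, frame length and the 85% roll-off threshold are illustrative assumptions.

% Frame-level spectral centroid and spectral roll-off for one audio frame.
[x, fs] = audioread('song_mono.wav');          % preprocessed mono signal (assumed file)
N = 1024;                                      % frame length in samples
frame = x(1:N, 1) .* hamming(N);               % one Hamming-windowed frame

mag = abs(fft(frame));
mag = mag(1:N/2);                              % positive-frequency half of the spectrum
f   = (0:N/2 - 1)' * fs / N;                   % bin centre frequencies in Hz

centroid = sum(f .* mag) / sum(mag);           % spectral centroid (Hz)

cumEnergy = cumsum(mag .^ 2);                  % cumulative spectral energy
rollIdx   = find(cumEnergy >= 0.85 * cumEnergy(end), 1);
rolloff   = f(rollIdx);                        % frequency below which 85% of the energy lies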
4.2.4 MFCC

MFCCs are perceptually motivated features that are based on the Short-Time Fourier Transform (STFT). After taking the log-amplitude of the magnitude spectrum, the FFT bins are grouped and smoothed according to the perceptually motivated Mel-frequency scaling. Finally, in order to decorrelate the resulting feature vectors, a Discrete Cosine Transform (DCT) is performed. Although typically 13 coefficients are used for speech representation, it was found that the first five coefficients are adequate for music representation (Tzanetakis, 2002).

4.3 Emotion-Audio Integration Module

The emotions extracted for the songs are stored as meta-data in the database, and mapping is performed by querying this meta-data. The emotion extraction module and the audio feature extraction module are finally mapped and combined using the emotion-audio integration module.

4.3.1 Audio Emotion Recognition

In the music emotion recognition block, the playlist of a user forms the input. An audio signal initially undergoes a certain amount of preprocessing. Since music files acquired from the internet are usually stereo signals, all stereo signals are converted to 16-bit PCM mono signals at a sampling rate of 44.1 kHz. The conversion is performed using Audacity. Conversion of a stereo signal into a mono signal is crucial to reduce the mathematical complexity of processing the similar content of both channels of a stereo signal.

An 80-second window is then extracted from the entire audio signal. Audio files are usually very large and are computationally expensive in terms of memory; hence, to reduce the memory complexity incurred during audio feature extraction, an 80-second window is extracted from each audio file. This also ensures uniformity in the size of each audio file. During extraction, the first 20 seconds of an audio file are discarded, which helps ensure an efficient retrieval of significant information from the file. Figure 7 depicts the process of extraction of the 80-second window; a short sketch of the extraction follows.
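A minimal sketch of the 80-second window extraction described above, assuming the preprocessed mono file from the earlier step.

% Extract an 80-second window, skipping the first 20 seconds of the track.
[x, fs] = audioread('song_mono.wav');

startSample = round(20 * fs) + 1;                               % discard the first 20 s
stopSample  = min(startSample + round(80 * fs) - 1, length(x)); % keep the next 80 s
window80 = x(startSample:stopSample, 1);                        % 80-second mono excerpt

audiowrite('song_window80.wav', window80, fs, 'BitsPerSample', 16);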
This pre-processed signal then undergoes audio feature extraction, where the spectral centroid, spectral flux, spectral roll-off, kurtosis, zero-crossing rate and 13 MFCC coefficients are extracted. The toolboxes used for audio feature extraction are MIR 1.5, the Auditory Toolbox and the Chroma Toolbox. Music-based emotion recognition is then carried out using the Support Vector Machine's 'one vs. other' approach (a minimal sketch of this setup is given after Figure 10). Audio signals are classified under six categories: Joy, Sad, Anger, Sad-Anger, Joy-Anger and Others.

While various mood models have been proposed in the literature, they fail to capture the perceptions of a user in real time; they are largely theoretical constructs that researchers have associated with different audio signals. The mood model and the classes considered in this report take into account how a user may associate different moods with an audio signal and a song. While composing a song, an artist may not maintain a uniform mood across the entire excerpt: songs are usually based on a theme or a situation, and a song may convey sadness in its first half while the second half becomes cheerful. Hence the model adopted here considers all these aspects and generates paired classes for such songs, in addition to the individual classes. The model adopted is as follows:
• Songs that are cheerful, energetic and playful are classified under Joy.
• Songs that are very depressing are classified under Sad.
• Songs that reflect attitude, 'anger associated with patriotism' or revenge are classified under Anger.
• The Joy-Anger category is associated with songs that express anger in a playful mode.
• The Sad-Anger category is composed of songs that revolve around the theme of being extremely depressed and angry.
• All other songs fall under the 'Others' category; when a user is detected with emotions such as surprise or fear, songs from this category are suggested.

Figure 10: Integration of the three modules (Emotion Extraction Module, Audio Feature Extraction Module and Emotion-Audio integration module, linked through the meta-data database)
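The following is a minimal sketch of the 'one vs. other' classification, assuming a precomputed feature matrix X (one row of audio features per song) and a cell array of labels y over the six categories. It uses fitcsvm from newer MATLAB releases as a stand-in for whatever SVM implementation the original work relied on, so it should be read as an illustration of the scheme rather than the project's actual code.

% One-vs-other SVM classification of songs into six emotion categories.
% X: nSongs x nFeatures matrix of audio features, y: cell array of labels.
categories = {'Joy','Sad','Anger','Sad-Anger','Joy-Anger','Others'};
models = cell(numel(categories), 1);
for k = 1:numel(categories)
    isPos = strcmp(y, categories{k});            % current class vs. the rest
    models{k} = fitcsvm(X, isPos, ...
        'KernelFunction', 'rbf', 'KernelScale', 'auto', 'Standardize', true);
end

% Prediction: pick the category whose binary SVM gives the largest score.
scores = zeros(size(X, 1), numel(categories));
for k = 1:numel(categories)
    [~, s] = predict(models{k}, X);
    scores(:, k) = s(:, 2);                      % score of the positive class
end
[~, idx] = max(scores, [], 2);
predicted = categories(idx);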
Figure 11: Mapping of modules

Figure 11 depicts the mapping between the facial emotion recognition module and the audio emotion recognition module. The name of each file and the emotions associated with the song are recorded in a database as its meta-data, and the final mapping between the two blocks is carried out by querying this meta-data database. Since facial emotion recognition and audio emotion recognition are two mutually exclusive components, system integration is performed using an intermediate mapping module that relates the two blocks with the help of the audio meta-data file. While the fear and surprise categories of facial emotions are mapped onto the 'Others' category of audio emotions, each of the joy, sad and anger categories of facial emotions is mapped onto two categories of audio emotions, as shown in the diagram. During playlist generation, when the system classifies a facial image under one of the three emotions, it considers both the individual and the paired classes. For example, if a facial image is classified under joy, the system will display songs under the Joy and Joy-Anger categories (a minimal sketch of this mapping step is given below).
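A rough sketch of such a mapping and meta-data query in MATLAB is shown below. Only the joy pairing (Joy, Joy-Anger) is spelled out in the report, so the sad and anger pairings here are plausible guesses, and the metadata file name and its column names (Name, Emotion) are assumptions for illustration.

% Map each detected facial emotion to the audio emotion categories it covers.
faceToAudio = containers.Map('KeyType', 'char', 'ValueType', 'any');
faceToAudio('joy')      = {'Joy', 'Joy-Anger'};      % pairing given in the report
faceToAudio('sad')      = {'Sad', 'Sad-Anger'};      % assumed pairing
faceToAudio('anger')    = {'Anger', 'Joy-Anger'};    % assumed pairing
faceToAudio('fear')     = {'Others'};
faceToAudio('surprise') = {'Others'};

meta = readtable('metadata.csv');        % columns assumed: Name, Emotion
detected = 'joy';                        % output of the facial emotion module
wanted = faceToAudio(detected);          % e.g. {'Joy','Joy-Anger'}
playlist = meta.Name(ismember(meta.Emotion, wanted));
disp(playlist);                          % songs suggested for this emotion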
4.4 Classifier
Many classifiers have been applied to expression recognition, such as neural networks (NN), support vector machines (SVM), linear discriminant analysis (LDA), K-nearest neighbour, multinomial logistic ridge regression (MLR), hidden Markov models (HMM), tree-augmented naive Bayes and others. In the proposed method, a Support Vector Machine with a Radial Basis Function (RBF) kernel is used as the classifier. The Support Vector Machine is an adaptive learning system which receives labeled training data and transforms it into a higher-dimensional feature space; a separating hyper-plane is then computed with respect to margin maximization to determine the best separation between classes. The greatest advantage of the SVM is that, even with a small set of training data, it generalizes well.

Shan et al. evaluated the performance of LBP features as facial expression descriptors and obtained a result of 89% using an SVM with an RBF kernel trained on the CK database. More recently, Bartlett et al. conducted similar experiments. They selected 313 image sequences from the Cohn-Kanade database; the sequences came from 90 subjects, with 1 to 6 emotions per subject. The facial images were converted into a Gabor magnitude representation using a bank of 40 Gabor filters. They then performed 10-fold cross-validation experiments using SVMs with linear, polynomial and RBF kernels. The linear and RBF kernels performed best, with recognition rates of 84.8% and 86.9% respectively (a sketch of such an RBF-kernel, cross-validated SVM is given below).

As a powerful machine learning technique for data classification, the SVM performs an implicit mapping of the data into a higher (possibly infinite) dimensional feature space and then finds a linear separating hyper-plane with the maximal margin in this higher-dimensional space, i.e. the hyper-plane that maximizes the distance between the support vectors and the hyper-plane. The SVM allows domain-specific selection of the kernel function. Although new kernels continue to be proposed, the most frequently used kernel functions are the linear, polynomial and Radial Basis Function (RBF) kernels.
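As an illustration of the kind of experiment described above, the sketch below trains an RBF-kernel SVM on precomputed facial feature vectors and estimates accuracy with 10-fold cross-validation. The variables feats and labels are assumed to exist already (for instance from LBP or Gabor feature extraction), and fitcecoc is used here only because the facial data has more than two classes; this is not the authors' code.

% RBF-kernel SVM with 10-fold cross-validation (sketch).
% feats:  nImages x nFeatures matrix of facial descriptors (e.g. LBP/Gabor)
% labels: nImages x 1 categorical vector of expression classes
template = templateSVM('KernelFunction', 'rbf', 'KernelScale', 'auto', ...
                       'Standardize', true);
model = fitcecoc(feats, labels, 'Learners', template);   % multi-class wrapper

cvmodel  = crossval(model, 'KFold', 10);                 % 10-fold cross-validation
accuracy = 1 - kfoldLoss(cvmodel);                       % recognition rate
fprintf('10-fold cross-validation accuracy: %.1f%%\n', 100*accuracy);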
4.5 Databases
One of the most important aspects of developing any new recognition or detection system is the choice of the database that will be used for testing it. For the purpose of training, two facial expression databases are used. The first is the FG-NET Facial Expression and Emotion Database [4], which consists of MPEG video files with spontaneously displayed emotions; the database contains examples gathered from 18 subjects (9 female and 9 male). The system has to be trained with captured video frames in which the displayed emotion is very representative. The training set consists of images of seven states: neutral plus six emotional states (surprise, fear, disgust, sadness, happiness and anger). The second database is the Cohn-Kanade Facial Expression Database [5], which contains 486 image sequences displayed by 97 posers. Each sequence displays the emotion from onset to peak; however, only the last image of a sequence is used for training. The training and testing sets formed from the Cohn-Kanade database contain images divided into the same seven classes: neutral, surprise, fear, disgust, sadness, happiness and anger. With appropriate training sets in place, the classifier training procedure can be performed.

The database was planned and designed in a general and expandable way. It supports the current needs but also allows for different mood models, including different types (both categorical and dimensional), user accounts, lists of artists, albums, genres, features and classification profiles, and for saving different classification and tracking values for the same song based on different combinations of features and classifiers.

4.6 Music
The last and most important part of this system is the playing of music according to the user's current emotional state. Once the facial expression of the user has been classified, the user's corresponding emotional state is found. A musical database consisting of songs from a variety of domains and pertaining to a number of emotions is maintained. Every song in each emotion category of the database is assigned a weight, and each detected emotion is also assigned a weight. The algorithm finds the song closest in weight to the classified emotion, and this song is then played (a sketch of this weight-matching step follows below).
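The report does not specify the exact weighting scheme, so the sketch below only illustrates the idea of picking the song whose stored weight is closest to the weight of the detected emotion; the weight values, file names and playback step are all assumptions.

% Sketch of weight-based song selection within one emotion category.
% songNames/songWeights describe songs already filed under the detected
% emotion; emotionWeight is the weight assigned to the detected emotion.
songNames   = {'track_a.mp3', 'track_b.mp3', 'track_c.mp3'};  % placeholders
songWeights = [0.35, 0.60, 0.80];        % assumed per-song weights
emotionWeight = 0.55;                    % assumed weight of the detected emotion

[~, idx] = min(abs(songWeights - emotionWeight));  % closest match in weight
chosen = songNames{idx};
fprintf('Playing %s\n', chosen);
[y, fs] = audioread(chosen);             % one simple way to play the track
sound(y, fs);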
CHAPTER 5
Viola and Jones algorithm

The Viola–Jones object detection framework, proposed in 2001 by Paul Viola and Michael Jones, was the first object detection framework to provide competitive object detection rates in real time. Although it can be trained to detect a variety of object classes, it was motivated primarily by the problem of face detection. The problem to be solved is the detection of faces in an image. A human can do this easily, but a computer needs precise instructions and constraints. To make the task more manageable, Viola–Jones requires full-view, frontal, upright faces: in order to be detected, the entire face must point towards the camera and should not be tilted to either side. While it may seem that these constraints diminish the algorithm's utility, in practice these limits on pose are quite acceptable, because the detection step is most often followed by a recognition step.
In this work, the Viola–Jones frontal CART model is used; the 'frontal cart' property detects only frontal faces that are upright and forward facing. This model is employed with a merging threshold of 16, a value that helps consolidate multiple overlapping detection boxes into a single bounding box.

5.1 Advantages
• Extremely fast feature computation and efficient feature selection.
• Scale- and location-invariant detector; instead of scaling the image itself (e.g. pyramid filters), the features are scaled.
• Such a generic detection scheme can be trained for the detection of other types of objects (e.g. cars, hands).
5.2 Disadvantages
• The detector is most effective only on frontal images of faces.
• It can hardly cope with 45° face rotation around either the vertical or the horizontal axis.
• It is sensitive to lighting conditions.
• Multiple detections of the same face may be obtained, due to overlapping sub-windows.

5.3 MATLAB code on frontal cart for object detection

function [] = Viola_Jones_img(Img)
% Viola_Jones_img(Img)
%   Img - input image
%   Example call: Viola_Jones_img(imread('name_of_the_picture.jpg'))
faceDetector = vision.CascadeObjectDetector;       % cascade face detector
bboxes = step(faceDetector, Img);                  % detect faces in the image
figure, imshow(Img), title('Detected faces'); hold on
for i = 1:size(bboxes, 1)
    rectangle('Position', bboxes(i,:), 'LineWidth', 2, 'EdgeColor', 'y');
end
end
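Since the text above states that the frontal CART model with a merging threshold of 16 is used, the detector in this function would presumably be configured along the following lines (a sketch; all other detector properties are left at their defaults, and the image name is a placeholder).

% Configure the cascade detector as described in Chapter 5: frontal CART
% classification model and a merge threshold of 16, which consolidates
% overlapping detections into a single bounding box.
faceDetector = vision.CascadeObjectDetector('FrontalFaceCART', ...
                                            'MergeThreshold', 16);
bboxes = step(faceDetector, imread('name_of_the_picture.jpg'));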
CHAPTER 6
Experimental results

Testing and implementation were carried out in MATLAB R2013a (and the 2014 release) on 32-bit Windows 7/8 systems with Intel Core i3 and i5 M460 (2.53 GHz) processors. Facial expression extraction was performed on both user-independent and user-dependent datasets: a dataset consisting of facial images of 25 individuals was selected for the user-independent experiment, and a dataset of 10 individuals was selected for the user-dependent experiment. Images of size 4000×3000 were used for the static and dynamic dataset experiments. For facial emotion recognition, two experiments were carried out:
i. using a user-independent dataset;
ii. using a user-dependent dataset.

6.1 EMOTION EXTRACTION
A user-independent dataset of 30 images and a user-dependent dataset of 5 images were selected for the extraction of emotions. The estimated time for the various modules of the Emotion Extraction Module is given in Table 1.

Table 1: Time estimation of various modules of the Emotion Extraction Module
6.2 AUDIO FEATURE EXTRACTION
A dataset of around 200 songs was considered for the experimentation and testing of the audio feature extraction module; the songs were collected from various Bollywood music sites such as Djmaza.in and Songs.pk. The estimated accuracy for the various categories of emotions is given in Table 2.

Table 2: Estimated accuracy for different categories of audio features

6.3 EMOTION BASED MUSIC PLAYER
Table 3 gives the average time taken by the proposed algorithm (i.e. the algorithm that generates a playlist based on facial expressions) and by its facial expression recognition and system integration modules.

Table 3: Average time estimation of various modules of the proposed system
CHAPTER 7
CONCLUSION

Experimental results have shown that the time required for audio feature extraction is negligible (around 0.0006 s) because the songs are processed and stored beforehand, so the total estimation time of the proposed system is essentially the time required for the extraction of facial features (around 0.9994 s). The various classes of emotion also yield a better accuracy rate than previous existing systems. The overall computational time of about 1.000 s is very small, which helps in achieving better real-time performance and efficiency. The system thus aims at providing Windows users with a cheap, accurate, emotion-based music system that requires no additional hardware.

The Emotion Based Music System will be of great advantage to users looking for music based on their mood and emotional behavior. It will reduce the time spent searching for music, thereby cutting unnecessary computation and increasing the overall accuracy and efficiency of the system. The system will not only reduce physical stress but will also be a boon for music therapy systems and may assist a music therapist in treating a patient. With the additional features mentioned above, it will be a complete system for music lovers and listeners.

A wide variety of image processing techniques was employed to meet the requirements of the facial expression recognition system. Apart from the theoretical background, this work provides ways to design and implement an emotion-based music player. The proposed system is able to process video of facial behavior, recognize the displayed actions in terms of basic emotions and then play music based on these emotions. The major strengths of the system are full automation as well as user and environment independence. Even though the system cannot handle occlusions and significant head rotations, head shifts are allowed. In future work, we would like to focus on improving the recognition rate of the system.
CHAPTER 8
APPLICATIONS

Music therapists use music to enhance social or interpersonal, affective, cognitive and behavioral functioning. Research indicates that music therapy is effective at reducing muscle tension and anxiety, and at promoting relaxation, verbalization, interpersonal relationships and group cohesiveness. This can set the stage for open communication and provide a starting place for non-threatening support and for processing symptoms associated with, or exacerbated by, trauma and disaster, such as the 9/11 event. A therapist can talk with a client, but a qualified music therapist can use music to quickly and actively link a client to their psycho-emotional state. In certain settings, the active use of music therapy interventions has resulted in a shorter length of stay (treatment period) and a more efficient response to the client's overall intervention plan.

The use of songs in music therapy with cancer patients and their families effectively provides important means of support and tools for change. Human contact and professional support can often help diminish the suffering involved in cancer, and there exists an inherent association between songs and human contact, since lyrics represent melodic verbal communication. Thus, the use of songs in music therapy can have naturally meaningful applications.

The proposed algorithm succeeds in crafting a mechanism that can find application in music therapy systems and can help a music therapist treat a patient suffering from disorders such as acute depression, stress or aggression. The system is prone to give unpredictable results in difficult lighting conditions; removing this drawback is therefore planned as part of the future work.
CHAPTER 9
FUTURE SCOPE

In the future we would like to develop a mood-enhancing music player that starts with the user's current emotion (which may be sad) and then plays music of more positive emotions, eventually giving the user a joyful feeling. Finally, we would like to improve the time efficiency of our system in order to make it suitable for use in different applications.
CHAPTER 10
REFERENCES

[1]. Chang, C. Hu, R. Feris and M. Turk, "Manifold based analysis of facial expression", Image and Vision Computing, vol. 24, pp. 605–614, June 2006.
[2]. A. Habibzad, Ninavin and Mir Kamal Mirnia, "A new algorithm to classify face emotions through eye and lip features by using particle swarm optimization".
[3]. Byeong-jun Han, Seungmin Rho, Roger B. Dannenberg and Eenjun Hwang, "SMERS: music emotion recognition using support vector regression", 10th ISMIR, 2009.
[4]. Alvin I. Goldman and Chandra Sekhar Sripada, "Simulationist models of face-based emotion recognition".
[5]. Carlos A. Cervantes and Kai-Tai Song, "Embedded Design of an Emotion-Aware Music Player", IEEE International Conference on Systems, Man, and Cybernetics, pp. 2528–2533, 2013.
[6]. Fatma Guney, "Emotion Recognition using Face Images", Bogazici University, Istanbul, Turkey 34342.
[7]. Jia-Jun Wong and Siu-Yeung Cho, "Facial emotion recognition by adaptive processing of tree structures".
[8]. K. Hevner, "The affective character of the major and minor modes in music", The American Journal of Psychology, vol. 47(1), pp. 103–118, 1935.
[9]. Samuel Strupp, Norbert Schmitz and Karsten Berns, "Visual-Based Emotion Detection for Natural Man-Machine Interaction".
[10]. Michael Lyons and Shigeru Akamatsu, "Coding Facial Expressions with Gabor Wavelets", IEEE Conf. on Automatic Face and Gesture Recognition, March 2000.
[11]. Kuan-Chieh Huang, Yau-Hwang Kuo and Mong-Fong Horng, "Emotion Recognition by a Novel Triangular Facial Feature Extraction Method".
[12]. S. Dornbush, K. Fisher, K. McKay, A. Prikhodko and Z. Segall, "XPod - A Human Activity and Emotion Aware Mobile Music Player", UMBC Ebiquity, November 2005.
[13]. Renuka R. Londhe and Vrushshen P. Pawar, "Analysis of Facial Expression and Recognition Based On Statistical Approach", International Journal of Soft Computing and Engineering (IJSCE), Volume 2, May 2012.
[14]. Russell, "A circumplex model of affect", Journal of Personality and Social Psychology, vol. 39(6), pp. 1161–1178, 1980.
[15]. Sanghoon Jun, Seungmin Rho, Byeong-jun Han and Eenjun Hwang, "A fuzzy inference-based music emotion recognition system", VIE, 2008.
[16]. Thayer, "The Biopsychology of Mood and Arousal", Oxford University Press, 1989.
[17]. Z. Zeng, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, January 2009.