Emotion recognition using facial expressions and speech
1. Non-intrusive vision- and acoustic-based
emotion recognition of the
driver in an Advanced Driver
Assistance System
2. Motivation
• Driving is one of the most dangerous tasks in
our everyday lives.
• Some statistics in Vijayawada
http://www.aptransport.org/html/accidents.htm
3. • The majority of road accidents are mainly due
to the driver's inattentiveness.
• The main cause of the driver's poor attention
while driving is the driver's emotions or moods
(for example sadness, anger, joy, pleasure,
despair and irritation).
• Emotions are generally measured by
analyzing head movement patterns, eyelid
movements, facial expressions, or all of these
together.
• In this project, we develop a system that identifies
the driver's emotions using non-intrusive
methods.
4. Emotion
• Researchers have identified more than 300 distinct emotions.
However, not all of them are experienced in day-to-day life.
• Palette theory states that any emotion is a composition of 6 primary emotions, just as
any color is a combination of 3 primary colors.
• Anger, disgust, fear, happiness, sadness and surprise are considered the
primary or basic emotions, also referred to as archetypal emotions.
http://2wanderlust.files.wordpress.com/2009/03/picture-2.png
5. Face recognition techniques
The different face recognition techniques are:
• Model based: a 3D model is constructed based on the facial variations in the image.
Disadvantages:
* Needs an expensive camera (stereo vision).
* Constructing the 3D model is difficult and time-consuming.
• Appearance based: performance depends on the quality of the extracted features.
• Feature based: the technique describes the position and size of each feature
(eye, nose, mouth or face outline).
Disadvantages:
* Extracting features under different poses (viewing conditions) and lighting
conditions is a very complex task.
* For applications with a large database and a large set of features of
different sizes and positions, identifying feature points is difficult.
6. Feature Extraction from the Visual
Information
• The appearance-based linear subspace techniques
extract global features, as these techniques use
statistical properties of the image such as the mean
and variance.
• Challenge: The major difficulty in applying these
techniques to large databases is that the
computational load and memory requirements for
calculating features increase dramatically with
database size.
• Solution: Nonlinear feature extraction techniques
are introduced to improve the performance of
feature extraction.
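As a rough illustration of how a linear subspace technique derives a global feature from the mean and variance of the images, the sketch below computes the mean image and the single direction of largest variance by power iteration, then projects an image onto it. This is a minimal sketch and not the method used in this project: the function names (mean_image, top_component, project) and the idea of treating each image as a flat list of pixel values are assumptions made for the example.

```python
import math


def mean_image(images):
    """Pixel-wise mean over a list of flattened images."""
    count = len(images)
    return [sum(img[i] for img in images) / count for i in range(len(images[0]))]


def top_component(images, mean, iterations=100):
    """Direction of largest variance, found by power iteration.

    The covariance matrix is never formed explicitly: each step applies
    X^T (X v), where X holds the mean-centred images as rows.
    """
    centred = [[p - m for p, m in zip(img, mean)] for img in images]
    dim = len(mean)
    # Non-uniform start vector (an assumption) to reduce the chance of
    # starting exactly orthogonal to the dominant direction.
    v = [float(i + 1) for i in range(dim)]
    for _ in range(iterations):
        scores = [sum(x * vi for x, vi in zip(row, v)) for row in centred]
        w = [sum(s * row[i] for s, row in zip(scores, centred)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    return v


def project(image, mean, component):
    """Global feature: coordinate of the centred image along the component."""
    return sum((p - m) * c for p, m, c in zip(image, mean, component))
```

Every training image enters both the mean and the variance computation, which is why the cost and memory of such techniques grow with the size of the database.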
7. Nonlinear feature extraction techniques
• Radon transform
• Wavelet transform
The Radon transform based nonlinear feature
extraction gives the direction of the local
features.
When features are extracted using the Radon
transform, the variations in the facial
frequency content are also boosted. The wavelet
transform gives the spatial and frequency
components present in an image.
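To make the idea of "direction of the local features" concrete, the following is a minimal sketch of a discrete Radon transform: for each angle, pixel intensities are summed along parallel lines, and the angle with the sharpest projection peak indicates the dominant line direction. The function names (radon_projections, dominant_angle) and the nearest-bin accumulation are assumptions chosen for brevity, not the implementation from [1].

```python
import math


def radon_projections(image, angles_deg):
    """Approximate Radon transform of a 2D list of pixel intensities.

    For each angle, every pixel is assigned to the nearest bin of its signed
    distance rho = x*cos(t) + y*sin(t) from the image centre, and the
    intensities in each bin are summed.
    """
    rows, cols = len(image), len(image[0])
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    half = int(math.ceil(math.hypot(rows, cols) / 2.0))
    sinogram = []
    for angle in angles_deg:
        t = math.radians(angle)
        cos_t, sin_t = math.cos(t), math.sin(t)
        bins = [0.0] * (2 * half + 1)
        for r in range(rows):
            for c in range(cols):
                rho = (c - cx) * cos_t + (r - cy) * sin_t
                bins[int(round(rho)) + half] += image[r][c]
        sinogram.append(bins)
    return sinogram


def dominant_angle(image, angles_deg):
    """Angle whose projection has the highest single peak.

    With this parameterisation, angle 0 sums along vertical lines (x constant)
    and angle 90 sums along horizontal lines (y constant).
    """
    sinogram = radon_projections(image, angles_deg)
    peaks = [max(projection) for projection in sinogram]
    return angles_deg[peaks.index(max(peaks))]
```

A vertical stroke in the image concentrates all of its intensity in one bin at angle 0 and spreads out at other angles, so the peak height across angles encodes the stroke's orientation.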
9. Feature Extraction from acoustic information
The important voice features to consider for emotion classification are:
• Fundamental frequency (F0) or Pitch,
• Intensity (Energy),
• Speaking rate,
• Voice quality.
Many other features may be extracted or calculated from the voice information:
• the formants,
• the vocal tract cross-section areas,
• the MFCC (Mel Frequency Cepstral Coefficients),
• the Linear Frequency Cepstral Coefficients (LFCC),
• Linear Predictive Coding (LPC) and
• the Teager energy operator based features.
Pitch is the fundamental frequency of an audio signal (the highness or lowness of a sound).
The MFCC is a “spectrum of the spectrum” used to find the number of voices in the
speech.
The Teager energy operator is used to find the number of harmonics due to nonlinear air
flow in the vocal tract.
The LPC provides an accurate and economical representation of the envelope of the
short-time power spectrum.
The LFCC is similar to the MFCC but without the perceptually oriented transformation
into the Mel frequency scale; it emphasizes changes or periodicity in the spectrum
while being relatively robust against noise. These features are measured from the
mean, range, variance and transmission duration between utterances.
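Three of the features above are simple enough to sketch directly on a frame of speech samples: intensity as short-time energy, pitch via the autocorrelation peak, and the discrete Teager energy operator x[n]^2 - x[n-1]*x[n+1]. This is a minimal sketch under stated assumptions: the function names, the 50 to 400 Hz pitch search range, and the unnormalised autocorrelation are illustrative choices, not taken from the cited works.

```python
import math


def short_time_energy(frame):
    """Intensity of a frame: mean of the squared samples."""
    return sum(x * x for x in frame) / len(frame)


def teager_energy(frame):
    """Discrete Teager energy operator: x[n]^2 - x[n-1] * x[n+1]."""
    return [frame[n] ** 2 - frame[n - 1] * frame[n + 1]
            for n in range(1, len(frame) - 1)]


def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Pitch estimate from the lag that maximises the autocorrelation.

    The search range (f0_min, f0_max) is an assumed range for speech.
    Returns None if no lag has positive correlation.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


def summarize(values):
    """Mean, range and variance of a per-frame feature track."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "range": max(values) - min(values), "variance": variance}
```

In use, each function would be applied frame by frame over an utterance, and summarize would then reduce each per-frame track (for example the F0 values) to the statistics mentioned above.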
10. Advantages and Disadvantages of using
acoustic features for detecting emotions
Advantages:
• We can often detect a speaker's emotion even if we cannot
understand the language.
• Speech is easy to record even under extreme environmental
conditions (temperature, high humidity and bright light), and
requires only cheap, durable and maintenance-free sensors.
Disadvantages:
• Depends on age and gender. Angry males show higher
levels of energy than angry females. Males have been found to
express anger with a slow speech rate, whereas females
employ a fast speech rate.
11. Previous Work On Emotion Detection
From Speech
• Schuller et al. [3] used a Hidden Markov Model based approach for speech emotion
recognition. They achieved an overall accuracy of about 87%.
• In [4], using spectral features and GMM supervector based SVMs, emotion
recognition reached an accuracy of more than 90% in some cases.
• Many other approaches to emotion recognition have been tried, such as the decision tree
based approach in [5] and the rough set and SVM based approach in [6].
• ANN and HMM based multi-level speech emotion recognition was done in [7].
• Some authors have done comparative studies of two or more approaches for emotion
detection using speech [8] [9].
• Speaker-dependent and speaker-independent studies have also been done
[9], showing that different approaches give different accuracy levels for the
two cases.
• The features used affect emotion recognition [10], and hence a proper
feature set must be chosen.
• Since a large number of features can be extracted from audio, some work on
feature selection methods has also been done [11].
12. Recent work
• Using 3D shape information:
Increased availability of 3D databases and
affordable 3D sensors.
3D shape information provides invariance
against head pose and illumination conditions.
• Using thermal cameras.
• Integration of audio, video and body language.
13. References
[1] H. D. Vankayalapati and K. Kyamakya, "Nonlinear Feature Extraction Approaches for Scalable Face Recognition Applications," ISAST Transactions on Computers and Intelligent Systems, vol. 2, 2009.
[2] Sandeep Kotte, "Extraction of visual and acoustic features of the driver for real-time driver monitoring system."
[3] B. Schuller, G. Rigoll and M. Lang, "Hidden Markov Model-based Speech Emotion Recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), 2003.
[4] Hao Hu, Ming-Xing Xu and Wei Wu, "GMM Supervector Based SVM with Spectral Features for Speech Emotion Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), 2007.
[5] Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee and Shrikanth S. Narayanan, "Emotion recognition using a hierarchical binary decision tree approach," in Proceedings of Interspeech, 2009.
[6] Jian Zhou, Guoyin Wang, Yong Yang and Peijun Chen, "Speech Emotion Recognition Based on Rough Set and SVM," 5th IEEE International Conference on Cognitive Informatics (ICCI 2006), 2006.
[7] Xia Mao, Lijiang Chen and Liqin Fu, "Multi-Level Speech Emotion Recognition based on HMM and ANN," World Congress on Computer Science and Information Engineering, 2009.
[8] A. A. Razak, R. Komiya, M. Izani and Z. Abidin, "Comparison between Fuzzy and NN method for Speech Emotion Recognition," Third International Conference on Information Technology and Applications (ICITA 2005), 2005.
[9] Theodoros Iliou and Christos-Nikolaos Anagnostopoulos, "SVM-MLP-PNN Classifiers on Speech Emotion Recognition Field: A Comparative Study," Fifth International Conference on Digital Telecommunications, 2010.
[10] Anton Batliner, Stefan Steidl, Bjorn Schuller, Dino Seppi, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Vered Aharonson, Loic Kessous and Noam Amir, "Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech," Computer Speech & Language, 2009.
[11] Ling Cen, Wee Ser and Zhu Liang Yu, "Speech Emotion Recognition Using Canonical Correlation Analysis and Probabilistic Neural Network," Seventh International Conference on Machine Learning and Applications, 2008.
[12] Dimitrios Ververidis and Constantine Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Communication, 48(9):1162-1181, 2006.
[13] K. Sreenivasa Rao and Shashidhar G. Koolagudi, "Emotion Recognition using Speech Features."
Editor's notes
It should be noted that the principal axis of PCA for an image rotates when the image rotates. The Radon transform, computed with respect to this axis, yields robust features.
LDA is often referred to as Fisher's Linear Discriminant (FLD). The images in the training set are divided into the corresponding classes. LDA then finds a set of vectors such that the Fisher Discriminant Criterion is maximized.
The wavelet decomposition splits the image into four regions: one low-frequency region LL (approximate component), and three high-frequency regions, namely LH (horizontal component), HL (vertical component), and HH (diagonal component). The low-frequency region in decompositions at different levels is the blurred version of the input image, while the high-frequency regions contain the finer detail or edge information contained in the input image.
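The four regions of a wavelet decomposition can be illustrated with a single-level 2D Haar transform, the simplest wavelet. This is a minimal sketch: the function name haar_decompose is hypothetical, the choice of the Haar wavelet is an assumption (the slides do not name the wavelet used), and the LH/HL labelling follows the horizontal/vertical naming given in the notes, which differs between texts.

```python
def haar_decompose(image):
    """Single-level 2D Haar decomposition of an image with even dimensions.

    Each 2x2 block (tl tr / bl br) yields one coefficient per subband:
      LL: scaled sum of the block (blurred, half-resolution image)
      LH: top row minus bottom row (responds to horizontal edges)
      HL: left column minus right column (responds to vertical edges)
      HH: diagonal difference
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image dimensions must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            tl, tr = image[r][c], image[r][c + 1]
            bl, br = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((tl + tr + bl + br) / 2.0)
            lh_row.append((tl + tr - bl - br) / 2.0)
            hl_row.append((tl - tr + bl - br) / 2.0)
            hh_row.append((tl - tr - bl + br) / 2.0)
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh
```

Applying haar_decompose again to the returned LL region gives the next decomposition level, which is how the progressively blurred versions of the input image arise.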