International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 09 | Sep 2020 www.irjet.net p-ISSN: 2395-0072
SPEECH EMOTION RECOGNITION
Darshan K.A.1, Dr. B.N. Veerappa2
1U.B.D.T. College of Engineering, Davanagere, Karnataka, India
2Department of Studies in Computer Science and Engineering, U.B.D.T. College of Engineering, Davanagere, Karnataka, India
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract – The aim of this paper is to document the development of a speech emotion recognition system using a CNN. A model is designed to recognize the emotion in a speech sample, and various parameters are modified to improve its accuracy. The paper also aims to identify the factors affecting the accuracy of the model and the key factors needed to improve its efficiency. It concludes with a discussion of the accuracy of different CNN architectures, the parameters needed to improve accuracy, and possible areas of improvement.
Key Words: ASR, Deep Neural Network, Convolutional Neural Network, MFCC.
1. INTRODUCTION
Speech is a fundamental means of human communication: a sequence of words in a pre-established language, and an essential medium for conveying information. Speech technology is computing technology that enables an electronic device to recognize, analyse and understand spoken words or audio. For several years, research and studies have sought to understand the building blocks of the human brain that make up human intelligence. The human brain is a complex organ that has inspired much exploration in Artificial Intelligence. The aim of Artificial Intelligence is to develop systems that are intelligent and capable enough to exhibit behaviour and thought resembling human behaviour and thought. One of the important fields of research in AI and machine learning is Automatic Speech Recognition (ASR), which aims to design machines that can understand human speech and interact with people through speech.
In order to communicate effectively with people, these systems need to understand the emotions in speech. Therefore, there is a need to develop machines that can recognize paralinguistic information, such as emotion, to enable the kind of clear, effective communication humans have. Emotion is one important piece of paralinguistic information carried along with speech, and many machine learning algorithms have been developed and tested to classify the emotions it conveys. Developing machines that interpret paralinguistic data such as emotion improves human-machine interaction, making it clearer and more natural. In this study, Convolutional Neural Networks are used to predict the emotions in speech samples. The RAVDESS and SAVEE datasets, which contain speech samples produced by 24 actors and 4 actors respectively, are used to train the network. The results showed an accuracy of 77%.
2. SYSTEM DESIGN
Convolutional Neural Networks (CNNs) are used to differentiate speech samples based on their emotion. Databases such as RAVDESS and SAVEE are utilized to train and assess the CNN models. Keras (TensorFlow's high-level API for building and training deep learning models) is used as the programming framework to implement the CNN models. The seven experimental stages of the present work are explained in this section.
2.1 Database
▪ Ryerson Audio-Visual Database of Emotional Speech
and Song (RAVDESS)
The Ryerson Audio-Visual Database of Emotional Speech and Song contains recordings of 24 professional actors (12 female, 12 male) articulating two lexically-similar sentences in a neutral North American accent. Speech samples include emotions such as calm, happy, sad, angry, fearful, surprise, and disgust. Each emotion is produced at two levels of emotional intensity (normal, strong), with an additional neutral emotion.
▪ Surrey Audio-Visual Expressed Emotion (SAVEE)
The audio folder of this dataset consists of speech samples recorded by four male speakers, with 15 sentences for each emotion class. The starting letters of the file name represent the emotion class: 'su', 'sa', 'n', 'h', 'f', 'd' and 'a' represent 'surprise', 'sadness', 'neutral', 'happiness', 'fear', 'disgust', and 'anger' respectively. For example, 'a03' denotes the third anger sentence.
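This prefix-to-class mapping can be expressed directly in code. Below is a minimal sketch (the function and variable names are illustrative, not from the paper); note that the two-letter prefixes must be matched before the one-letter ones:

```python
# Map a SAVEE file name (e.g. "a03.wav", "su05.wav") to its emotion
# class via the letter prefix. Two-letter codes are checked first so
# that "sa.." is not misread as a one-letter code.
SAVEE_PREFIXES = [
    ("su", "surprise"), ("sa", "sadness"),
    ("n", "neutral"), ("h", "happiness"),
    ("f", "fear"), ("d", "disgust"), ("a", "anger"),
]

def emotion_from_savee_name(filename):
    for prefix, emotion in SAVEE_PREFIXES:
        if filename.startswith(prefix):
            return emotion
    raise ValueError(f"unrecognized SAVEE file name: {filename}")

print(emotion_from_savee_name("a03.wav"))   # -> anger
print(emotion_from_savee_name("su05.wav"))  # -> surprise
```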
2.2 Pre-processing
The first step involves organizing the audio files. The emotion in an audio sample can be determined from the unique identifier at the 3rd position of the file name, which represents the type of emotion. The dataset consists of five different emotions:
1. Calm 2. Happy 3. Sad 4. Angry 5. Fearful
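As a concrete illustration, the sketch below parses a RAVDESS-style file name such as "03-01-05-01-02-01-12.wav", where the 3rd dash-separated field is the emotion code; the helper name and the restriction to the five classes above are assumptions of this sketch, not the paper's exact code:

```python
import os

# RAVDESS emotion codes, keeping only the five classes used here
# (assumed from the standard RAVDESS naming convention).
RAVDESS_EMOTIONS = {
    "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful",
}

def emotion_from_ravdess_name(path):
    """Return the emotion name, or None for files outside the five classes."""
    fields = os.path.basename(path).split("-")
    return RAVDESS_EMOTIONS.get(fields[2])  # 3rd identifier in the name

print(emotion_from_ravdess_name("03-01-05-01-02-01-12.wav"))  # -> angry
```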
2.3 Defining Labels
Labels are defined based on the number of classes into which the speech is to be classified. Some of the class configurations are as follows:
Class: positive and negative
Positive: Calm, Happy.
Negative: Fearful, Sad, Angry.
Class: Angry, Sad, Happy, Fearful, Calm.
Class: Angry, Sad, Happy, Fearful, Calm, Neutral, Disgust, Surprised.
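A hypothetical sketch of how these configurations might be encoded (all names here are illustrative):

```python
# Two-class grouping: the five emotions folded into positive/negative.
SENTIMENT = {
    "calm": "positive", "happy": "positive",
    "fearful": "negative", "sad": "negative", "angry": "negative",
}

# Integer indices for the five-class configuration, as expected by a
# softmax output layer with one unit per class.
FIVE_CLASSES = ["angry", "sad", "happy", "fearful", "calm"]
LABEL_INDEX = {name: i for i, name in enumerate(FIVE_CLASSES)}
```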
2.4 Feature Extraction
The shape of the speech signal determines what sound comes out. If the shape is determined accurately, then a correct representation of the sound being generated is obtained. The job of Mel Frequency Cepstral Coefficients (MFCCs) is to represent it correctly, and MFCCs are used as the input features. Loading the audio data and converting it into MFCC format is done with the Python package librosa.
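A minimal extraction sketch follows; the sample rate, clip duration, and number of coefficients are illustrative assumptions, not necessarily the settings used in the experiments:

```python
import librosa
import numpy as np

def extract_mfcc(path, n_mfcc=40, duration=3.0, sr=22050):
    # Load (and resample) up to `duration` seconds of audio.
    y, sr = librosa.load(path, sr=sr, duration=duration)
    # Compute the MFCC matrix, shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Average over time to obtain a fixed-length feature vector.
    return np.mean(mfcc, axis=1)

# Requires an actual audio file on disk; the name is illustrative.
features = extract_mfcc("03-01-05-01-02-01-12.wav")
print(features.shape)  # (40,)
```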
2.5 Designing the Dimensions of the Model
This step includes designing the layers of the CNN model, choosing the activation function and the appropriate pooling options, and defining the softmax output to classify the speech into the different classes.
2.6 Model Training and Testing
The model is trained with the training dataset and tested with the test dataset. Actual values are compared with the predicted values, and this comparison gives the accuracy of the model.
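The sketch below shows this train/test cycle in Keras. The random arrays stand in for the MFCC features and labels produced in the earlier steps, and the tiny model stands in for the architecture of Section 2.7; the split ratio and epoch count are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

X = np.random.randn(200, 40, 1).astype("float32")  # stand-in MFCC features
y = np.random.randint(0, 5, size=200)              # stand-in labels, 5 classes

# Hold out a test split; stratify keeps class proportions balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = keras.Sequential([
    keras.layers.Conv1D(32, 5, activation="relu", input_shape=(40, 1)),
    keras.layers.MaxPooling1D(2),
    keras.layers.Flatten(),
    keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))

# Compare predicted labels against the actual test labels.
loss, accuracy = model.evaluate(X_test, y_test)
print(f"test accuracy: {accuracy:.2%}")
```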
2.7 Architecture of CNN
In the current study, the deep neural network architecture realized is a convolutional neural network. In the proposed architecture, a max-pooling layer is placed after each convolutional layer. To introduce non-linearity into the model, the Rectified Linear Unit (ReLU) activation function is used in both the convolutional and fully connected layers.
Fig 1: Architecture of CNN
Batch normalization is used to improve the stability of the neural network: it normalizes the output of the preceding activation layer, reducing the amount by which the hidden unit values shift and allowing each layer in the network to learn more independently. A dense layer is used, in which all the neurons in a layer are connected to the neurons in the next layer; it is a fully connected layer. A softmax unit is used to compute a probability distribution over the classes, and the number of softmax outputs depends on the number of emotion classes to classify. The model took between 10 and 14 hours to train. CPUs consume a lot of time training the model; GPUs can be used instead to speed up the training process, as their many cores accelerate the computation and save much time. Figure 1 shows the Convolutional Neural Network architecture used in this paper. A lighter CNN architecture is also used to classify among a greater number of classes, and good results are obtained.
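A minimal Keras sketch of this kind of layer stack is shown below: Conv → BatchNorm → ReLU → MaxPool blocks feeding a dense softmax classifier. The filter counts, kernel sizes, and 40-coefficient MFCC input shape are illustrative assumptions, not the paper's exact values:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(num_classes, n_mfcc=40):
    return keras.Sequential([
        layers.Input(shape=(n_mfcc, 1)),
        # First convolutional block: conv, batch norm, ReLU, max pooling.
        layers.Conv1D(64, 5, padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling1D(2),
        # Second convolutional block.
        layers.Conv1D(128, 5, padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling1D(2),
        # Fully connected classifier with one softmax output per class.
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_cnn(num_classes=10)
model.summary()
```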
3. RESULTS AND DISCUSSIONS
The first experiment is carried out with the deeper CNN architecture. Batch normalization, max pooling, and the ReLU activation function are used. The model is trained for 250 epochs, with 70% of the dataset used for training and 30% for testing. The model is trained to classify between two different classes of male voice, i.e., male happy and male sad. For this model the accuracy was 87%, which is a very good result. The figures below show the emotion distribution and the corresponding confusion matrix.
Fig 2: Distribution of Emotion in Dataset
Fig 3: Confusion Matrix for First Experiment
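A minimal sketch of computing and plotting a confusion matrix like the ones in these figures, assuming the trained `model` and the `X_test`/`y_test` arrays from the Section 2.6 sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predicted class = index of the largest softmax probability.
y_pred = np.argmax(model.predict(X_test), axis=1)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot(cmap="Blues")
plt.show()
```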
The second experiment was carried out with the lighter CNN architecture. The ReLU activation function and max pooling are used. Two datasets (RAVDESS and SAVEE) are used to train the model, which is trained for 1000 epochs. 80% of the dataset is used for training and 20% for testing. The model is trained to classify among ten different classes of male and female voice.
For this model the accuracy is 77%, which is a very good result. The figure below shows the corresponding confusion matrix.
Fig 4: Confusion Matrix for Second Experiment
3.1 Real Time Analysis
The user selects a speech sample to find out the emotion in it. The model extracts the MFCC features from the sample and predicts its emotion according to the pre-defined emotion classes.
Fig 5: User Selects a Speech Sample
Fig 6: MFCC features Extraction from a Speech Sample
Fig 7: Predicted Emotion for a Speech Sample
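This end-to-end prediction step can be sketched as follows, assuming the trained `model` and the extract_mfcc() helper from the Section 2.4 sketch; the class names follow the five-class configuration and the file name is illustrative:

```python
import numpy as np

CLASS_NAMES = ["angry", "sad", "happy", "fearful", "calm"]

def predict_emotion(path, model):
    features = extract_mfcc(path)       # shape (40,)
    batch = features.reshape(1, -1, 1)  # (1, 40, 1) as expected by Conv1D
    probs = model.predict(batch)[0]     # softmax probabilities per class
    return CLASS_NAMES[int(np.argmax(probs))]

print(predict_emotion("sample.wav", model))
```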
4. CONCLUSION AND FUTURE WORK
Various experiments were conducted by changing several parameters, such as the dimensions of the model, the number of epochs, and the partition ratio between the training and test datasets. Different accuracies were found for the different experiments. A lighter CNN architecture with 80% training data and 20% test data gave better results than the deeper CNN architecture when classifying among ten classes; the accuracy of this model was found to be 77%. The performance of the deeper CNN model was very good when the classification was between two classes, because a greater number of training samples was available per class. When the same model was used to classify among ten classes, the training dataset was divided into ten labels, leaving fewer training samples for each class, which led to poorer accuracy. With a greater number of training samples available for each class, and with GPUs to speed up the training process, higher accuracy can be achieved in future enhancements.