Research Issues in Speech Processing




                    Dr. M. Sabarimalai Manikandan
                        msm.sabari@gmail.com
Speech Production: the source-filter model
The speech signal conveys the information contained in the spoken word
         highly non-stationary signal
         short segments of speech (20 to 30 ms) can be treated as quasi-stationary
         most acoustical energy lies in the frequency range of 100-6000 Hz




        Vocal tract transfer function can be modeled by an all-pole filter
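
The source-filter idea can be sketched in a few lines: an idealized voiced source (an impulse train) is passed through an all-pole filter H(z) = 1 / (1 + a1 z^-1 + ... + ap z^-p). This is only an illustrative sketch, assuming NumPy; the function names, the 8 kHz sampling rate, the 100 Hz pitch and the single formant near 700 Hz are assumptions, not values from the slides.

```python
import numpy as np

def all_pole_filter(excitation, a):
    """Filter `excitation` through H(z) = 1 / (1 + a[0] z^-1 + ... + a[p-1] z^-p)."""
    p = len(a)
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        # Feedback from past outputs: this recursion is what makes the filter all-pole.
        for k in range(1, p + 1):
            if n - k >= 0:
                acc -= a[k - 1] * y[n - k]
        y[n] = acc
    return y

def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Denominator coefficients [a1, a2] of a two-pole resonator (one formant)."""
    r = np.exp(-np.pi * bandwidth_hz / fs)   # pole radius sets the bandwidth
    theta = 2 * np.pi * freq_hz / fs         # pole angle sets the centre frequency
    return np.array([-2 * r * np.cos(theta), r * r])

fs = 8000                        # assumed sampling rate (Hz)
f0 = 100                         # assumed pitch of the voiced source (Hz)
source = np.zeros(fs // 10)      # 100 ms of excitation
source[:: fs // f0] = 1.0        # impulse train: idealized glottal pulses
a = resonator_coeffs(700, 100, fs)           # hypothetical single formant
speech_like = all_pole_filter(source, a)     # vowel-like output signal
```

A realistic vocal tract model would cascade several such resonators (one per formant); a single one is used here to keep the sketch short.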
Speech Processing Tasks


Speech recognition (recognizing lexical content)
Speech synthesis (text-to-speech)
Speaker recognition (recognizing who is speaking)
Speech understanding and vocal dialog
Speech coding (data rate reduction)
Speech enhancement (noise reduction)
Speech transmission (noise-free communication)
Voice conversion
Speech Processing
Speech measurements
       Short-time energy (STE)
       Zero crossing rate (ZCR)
       Autocorrelation (AC)
       Pitch period or frequency
       Formants

Speech signal components
       Speech-Silence or Non-speech
       Voiced speech-Unvoiced speech
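
The measurements above are usually computed frame by frame. The sketch below, assuming NumPy, computes short-time energy, zero crossing rate and an autocorrelation-based pitch estimate on a synthetic signal (a 125 Hz sine standing in for voiced speech, followed by low-level white noise standing in for unvoiced speech). Frame length, hop size, pitch search range and all function names are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames (one per row); leftover samples are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1            # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def pitch_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch estimate (Hz) from the autocorrelation peak within [fmin, fmax]."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    lag = lag_min + np.argmax(ac[lag_min : lag_max + 1])
    return fs / lag

fs = 8000                                          # assumed sampling rate (Hz)
t = np.arange(800) / fs
voiced = np.sin(2 * np.pi * 125 * t)               # stand-in for voiced speech
unvoiced = 0.05 * np.random.default_rng(0).standard_normal(800)
signal = np.concatenate([voiced, unvoiced])

frames = frame_signal(signal, frame_len=240, hop=80)   # 30 ms frames, 10 ms hop
ste = short_time_energy(frames)
zcr = zero_crossing_rate(frames)
f0_est = pitch_autocorr(frames[0], fs)                 # pitch of the first (voiced) frame
```

In this toy signal the voiced frames show high energy and a low zero crossing rate, the noise frames the opposite, which is the basis of simple voiced/unvoiced and speech/silence decisions.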
Speech Processing
Speech representations or models
       Temporal features
          •   Low energy rate
          •   Zero crossing rate (ZCR)
          •   4Hz modulation energy
          •   Pitch contour

       Spectral features
           •    Spectral Centroid (sharpness)
           •    Spectral Flux (rate of change)
           •    Spectral Roll-Off (spectral shape)
           •    Spectral Flatness (deviation of the spectral form)
       Linear Predictive Coefficients (LPC)
       Cepstral coefficients
       Mel Frequency Cepstral Coefficients (MFCC): based on the human auditory system
       Harmonic features: sinusoidal harmonic modelling
       Perceptual features: model of the human hearing process
       First order derivative (DELTA)
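
As an illustration of the spectral features listed above, the sketch below computes centroid, roll-off and flatness for a single frame, plus the flux between two frames. It assumes NumPy; the Hann window, the 85% roll-off fraction, the squared-difference definition of flux and the function names are assumptions chosen for the example, since definitions vary between toolkits.

```python
import numpy as np

def spectral_features(frame, fs, rolloff=0.85):
    """Return (centroid_hz, rolloff_hz, flatness) for one frame."""
    window = np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(frame * window))
    power = mag ** 2 + 1e-12                     # small floor avoids log(0)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Centroid: magnitude-weighted mean frequency.
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    # Roll-off: frequency below which `rolloff` of the total power lies.
    cumulative = np.cumsum(power)
    rolloff_hz = freqs[np.searchsorted(cumulative, rolloff * cumulative[-1])]
    # Flatness: geometric mean over arithmetic mean of the power spectrum.
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    return centroid, rolloff_hz, flatness

def spectral_flux(frame_a, frame_b):
    """Sum of squared differences between two magnitude spectra."""
    mag_a = np.abs(np.fft.rfft(frame_a))
    mag_b = np.abs(np.fft.rfft(frame_b))
    return float(np.sum((mag_b - mag_a) ** 2))

fs = 8000                                        # assumed sampling rate (Hz)
n = np.arange(256)
tone = np.sin(2 * np.pi * 1000 * n / fs)         # tonal test frame
noise = np.random.default_rng(0).standard_normal(256)   # noise-like test frame

tone_centroid, tone_rolloff, tone_flatness = spectral_features(tone, fs)
noise_centroid, noise_rolloff, noise_flatness = spectral_features(noise, fs)
```

A pure tone gives a flatness close to zero while white noise gives a much larger value, which is why flatness is often read as a tonal versus noise-like indicator.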
Elements of the speech signal
Phonemes: the smallest units of speech sounds
       Vowels and Consonants
       ~12 to 21 different vowel sounds used in the English language

       Consonants involve rapid and sometimes subtle changes in sound
              according to the manner of articulation:
                   •    plosive (p, b, t, etc.)
                   •    fricative (f, s, sh, etc.)
                   •    nasal (m, n, ng)
                   •    liquid (r, l) and
                   •    semivowel (w, y)

       Consonants are more independent of language than vowels are.

Syllable: one or more phonemes

Word: one or more syllables
Automatic Speech Recognition
Speech recognition systems have two main uses:

    Dictation: translation of the spoken word into written text
    Computer control: control of the computer and software
    applications by speaking commands

Systems also differ in the speakers they support:

    Speaker-dependent system: operates for a single speaker
    Speaker-independent system: operates for any speaker
    of a particular type
    Speaker-adaptive system: adapts its operation to the
    characteristics of new speakers

    The size of the vocabulary affects the complexity, processing
    requirements and accuracy of the system
Speech Recognition: Applications

Automatic translation
Vehicle navigation systems
Human-computer interaction
Content-based spoken audio search
Home automation
Pronunciation evaluation
Robotics
Video games
Transcription of speech into mobile text messages
Assistive technology for people with disabilities
Speech Recognition System

Sampling of speech

Acoustic signal processing:
   •     Linear Prediction Cepstral Coefficients (LPCC)
   •     Mel Frequency Cepstral Coefficients (MFCC)
   •     Perceptual Linear Prediction Cepstral Coefficients (PLPCC)

Recognition of phonemes, groups of phonemes and words:
   •    Dynamic Time Warping (DTW)
   •    Hidden Markov models (HMMs)
   •    Gaussian mixture models (GMMs)
   •    Neural Networks (NNs)
   •    Expert systems and combinations of techniques
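
Of the recognition techniques above, dynamic time warping is the simplest to show in full: it aligns two feature sequences of different lengths and returns the accumulated distance along the best alignment path. The sketch below assumes NumPy and uses toy one-dimensional sequences in place of real LPCC or MFCC frames; the function names and the template labels are hypothetical.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Accumulated Euclidean cost of the best monotonic alignment of two sequences."""
    a = np.asarray(seq_a, dtype=float)
    b = np.asarray(seq_b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]               # treat scalars as 1-D feature vectors
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Allowed moves: advance in a, advance in b, or advance in both.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def recognize(utterance, templates):
    """Return the label of the stored template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

templates = {
    "word_a": [0, 1, 2, 3, 2, 1, 0],     # hypothetical stored word templates
    "word_b": [3, 2, 1, 0, 1, 2, 3],
}
slow_word_a = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]   # same shape, spoken slower
best_label = recognize(slow_word_a, templates)
```

The slowed-down utterance still matches its template because the warping path absorbs the difference in speaking rate, which is the property that made DTW attractive for isolated-word recognition.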
Automatic Speaker Recognition
Speaker recognition: the process of automatically recognizing who is
speaking by using the speaker-specific information included in speech
sounds

Speaker identity: physiological and behavioral characteristics of the speech
production model of an individual speaker
         the spectral envelope (vocal tract characteristics)
         the supra-segmental features (voice source characteristics) of
         speech

Applications:
    •    banking over a telephone network
    •    telephone shopping and database access services
    •    voice dialing and voice mail
    •    information and reservation services
    •    security control for confidential information
    •    forensics and surveillance applications
Speaker Recognition
Speaker identification: the process of determining which registered speaker
provides input speech sounds

[Figure: speaker identification block diagram. Input speech -> feature extraction -> similarity with the reference template or model of each registered speaker (#1, #2, ..., #N) -> maximum selection -> identification result (speaker ID)]
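
The identification scheme can be sketched as a similarity score against every registered speaker followed by maximum selection. This is a toy illustration assuming NumPy: the reference templates are made-up three-dimensional vectors, the utterance is summarized by its mean feature vector, and cosine similarity stands in for whatever similarity measure a real system would use.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def identify_speaker(features, references):
    """features: (frames, dims) array; references: dict of speaker id -> template vector.

    Returns the best-matching speaker id and the score for every registered speaker.
    """
    utterance = features.mean(axis=0)        # crude utterance-level summary
    scores = {sid: cosine_similarity(utterance, ref) for sid, ref in references.items()}
    best = max(scores, key=scores.get)       # maximum selection
    return best, scores

# Hypothetical reference templates for three registered speakers.
references = {
    "speaker_1": np.array([1.0, 0.0, 0.2]),
    "speaker_2": np.array([0.1, 1.0, 0.3]),
    "speaker_3": np.array([0.2, 0.1, 1.0]),
}
rng = np.random.default_rng(0)
# Simulated feature frames from speaker_2: the template plus small noise.
features = references["speaker_2"] + 0.05 * rng.standard_normal((50, 3))
speaker_id, scores = identify_speaker(features, references)
```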
Speaker Recognition
Speaker verification: the process of accepting or rejecting the
identity claim of a speaker.
[Figure: speaker verification block diagram. Input speech -> feature extraction -> similarity with the reference template or model of the claimed speaker (#M) -> decision against a threshold -> verification result (accept / reject)]




         Open Set and Closed Set Recognition

         Text-dependent and Text-independent Recognition
                 •   Vector quantization
                 •   Gaussian mixture models (GMM)
                 •   Dynamic time warping (DTW)
                 •   Hidden Markov model (HMM)
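
Verification differs from identification in that only the claimed speaker's model is scored and the score is compared with a threshold. A minimal sketch, assuming NumPy; the template vectors, the cosine similarity measure and the threshold of 0.9 are illustrative assumptions (in practice the threshold is tuned on held-out data to balance false accepts against false rejects).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def verify_speaker(features, claimed_reference, threshold=0.9):
    """Accept the identity claim if the similarity score reaches the threshold."""
    utterance = features.mean(axis=0)
    score = cosine_similarity(utterance, claimed_reference)
    return score >= threshold, score

rng = np.random.default_rng(1)
claimed = np.array([0.1, 1.0, 0.3])          # hypothetical model of the claimed speaker
impostor = np.array([1.0, 0.0, 0.2])         # hypothetical different speaker

genuine_features = claimed + 0.05 * rng.standard_normal((50, 3))
impostor_features = impostor + 0.05 * rng.standard_normal((50, 3))

genuine_accepted, genuine_score = verify_speaker(genuine_features, claimed)
impostor_accepted, impostor_score = verify_speaker(impostor_features, claimed)
```

The same thresholding step can be added after maximum selection to turn closed-set identification into open-set identification, where an unknown speaker is rejected instead of being assigned to the nearest registered one.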
Text-to-Speech (TTS) System
    Synthesis of Speech for effective human machine communications
                     reading email messages
                     call center help desks and customer care
                     announcement machines



[Figure: TTS pipeline. Raw or tagged text -> text analysis (document structure detection, text normalization, linguistic analysis) -> phonetic analysis (homograph disambiguation, grapheme-to-phoneme conversion) -> prosodic analysis (pitch, duration) -> speech synthesis (voice rendering) -> synthetic speech]




              Synthetic speech should be intelligible and natural
Speech Synthesis

Text-to-speech (TTS) synthesis systems
       Approach
       TTS system performance measure
          • Synthetic Speech Intelligibility
          • Synthetic speech naturalness

Speech Intelligibility Tests
      Segmental level analysis
          • the Rhyme Test
          • the Modified Rhyme Test
          • the Diagnostic Rhyme Test
      Supra-segmental analysis
          • the Harvard Psychoacoustic Sentences (HPS)
          • the Haskins syntactic sentences
Speech Coding (Compression)
Speech Coding for efficient transmission and storage of speech
           narrowband and broadband wired telephony
           cellular communications
           Voice over IP (VoIP) to utilize the Internet
           Telephone answering machines
           IVR systems
           Prerecorded messages
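
As one concrete example of data rate reduction, the sketch below applies mu-law companding, the logarithmic compression idea behind 8-bit telephone speech coding. It is not a technique named on the slide and it is not a bit-exact G.711 codec (which uses a piecewise-linear approximation); it assumes NumPy and samples normalized to the range [-1, 1], and the function names are hypothetical.

```python
import numpy as np

MU = 255.0   # companding constant commonly used for 8-bit mu-law

def mu_law_encode(x, mu=MU):
    """Compress samples in [-1, 1] logarithmically and quantize them to 8-bit codes."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1.0) / 2.0 * 255.0).astype(np.uint8)

def mu_law_decode(codes, mu=MU):
    """Map 8-bit codes back to samples in [-1, 1] by inverting the compression curve."""
    compressed = codes.astype(float) / 255.0 * 2.0 - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

fs = 8000                                     # assumed telephone-band sampling rate (Hz)
t = np.arange(fs // 10) / fs
speech_like = 0.5 * np.sin(2 * np.pi * 200 * t)   # stand-in for a speech segment
codes = mu_law_encode(speech_like)            # one byte per sample
decoded = mu_law_decode(codes)
max_error = float(np.max(np.abs(decoded - speech_like)))
```

Relative to 16-bit linear PCM, storing one byte per sample halves the bit rate, and the logarithmic curve keeps the quantization error small for low-amplitude samples, where speech spends most of its time.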
Speech-Assisted Translation Corrector System

 Objective: Develop a speech-assisted translation corrector (SATC)
 system which provides a grammatically correct sentence for a
 translated sentence from the machine translation
[Figure: SATC system block diagram. Input sentence (example: "He came here") -> multilingual machine translation -> translated sentence with grammatical errors -> speech-assisted translation corrector system -> grammatically correct sentence (text, speech, storage). Translator: a speech signal is produced from the words in the translated sentence.]



“An MT system is correct and complete if it can analyze all of the grammatical structures
encountered in the source language, and it can generate all of the grammatical structures
necessary in the target language translation.”
8/25/2011
SATC System: Requirements and Challenging Tasks

   Creation of large-scale, rich multilingual speech databases is a crucial
 task for research and development in language and speech technology

            Indian languages
            speakers (10 Males and 10 Females)
            age groups ( <20, 15-40, >40)
            audio format: 16-bit stereo, and sampling rate of 44.1 kHz
            annotation and assessment of speech databases


   Development of multilingual text to speech interface

   Development of spoken word matching module

   Development of speech signal processing (SSP) tools
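
The audio format listed above (16-bit stereo at 44.1 kHz) fixes the raw data rate of the recordings, which is worth working out when planning database storage. In the sketch below the format parameters and the 20 speakers come from the slide; the 30 minutes of speech per speaker is a purely hypothetical figure used to show the arithmetic.

```python
SAMPLE_RATE_HZ = 44_100     # from the slide
BITS_PER_SAMPLE = 16        # from the slide
CHANNELS = 2                # stereo, from the slide

# Uncompressed PCM data rate.
bits_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
bytes_per_minute = bits_per_second // 8 * 60

def corpus_size_bytes(n_speakers, minutes_per_speaker):
    """Raw storage needed for one language, ignoring file headers and annotations."""
    return n_speakers * minutes_per_speaker * bytes_per_minute

# 20 speakers (10 male, 10 female); 30 minutes each is an assumed figure.
size_one_language = corpus_size_bytes(n_speakers=20, minutes_per_speaker=30)
print(f"{bits_per_second} bit/s, {size_one_language / 1e9:.2f} GB per language")
```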



Major Problems in Speech Processing
Acoustic variability: the same phonemes pronounced in
different contexts will have different acoustic realizations
(coarticulation effect)

The signal is different when speech is uttered in various
environments:
       noise
       reverberation
       different types of microphones.

Speaking variability: when the same speaker speaks normally,
shouts, whispers, uses a creaky voice, or has a cold

Speaker variability: different speakers have different
timbres and different speaking habits
Major Problems in Speech Processing
Linguistic variability: the same sentence can be pronounced
in many different ways, using many different words,
synonyms, and many different syntactic structures and
prosodic schemes

Phonetic variability: due to the different possible
pronunciations of the same words by speakers having
different regional accents

Lombard effect: noise modifies the utterance of the words (as
people tend to speak louder)
Major Problems in Speech Processing
Continuous speech:
   words are connected together (not separated by pauses or
   silences).

   It is difficult to find the start and end points of words

   The production of each phoneme is affected by the
   production of surrounding phonemes

   The start and end of words are affected by the preceding
   and following words

   The rate of speech also matters (fast speech tends to be harder to recognize)
