SPEECH RECOGNITION USING WAVELET TRANSFORM
MAIN PROJECT '10

                              1. INTRODUCTION



Automatic speech recognition (ASR) aims at converting spoken language to text. Scientists all over the globe have worked in the domain of speech recognition for many decades, and it remains an intensive area of research. Recent advances in soft computing techniques have given added importance to automatic speech recognition. Large variation in speech signals, together with factors such as native accent and varying pronunciation, makes the task very difficult. ASR is hence a complex task, and achieving a good recognition result requires considerable intelligence.

Speech recognition is currently used in many real-time applications, such as cellular telephones, computers, and security systems. However, these systems are far from perfect at correctly classifying human speech into words. Speech recognizers consist of a feature extraction stage and a classification stage. The parameters from the feature extraction stage are compared in some form with parameters extracted from signals stored in a database or template; alternatively, the parameters can be fed to a neural network.

Speech word recognition systems commonly carry out some kind of classification based on speech features, which are usually obtained via Fourier Transforms (FTs), Short Time Fourier Transforms (STFTs), or Linear Predictive Coding (LPC) techniques. However, these methods have some disadvantages: they assume signal stationarity within a given time frame and may therefore lack the ability to analyze localized events correctly. The wavelet transform copes with some of these problems. Other factors influencing the selection of Wavelet Transforms (WT) over conventional methods include their ability to determine localized features. In this project, the Discrete Wavelet Transform method is used for speech processing.




                        2. LITERATURE SURVEY



           Designing a machine that mimics human behavior, particularly the capability of
speaking naturally and responding properly to spoken language, has intrigued engineers and
scientists for centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a
system model for speech analysis and synthesis, the problem of automatic speech recognition
has been approached progressively, from a simple machine that responds to a small set of
sounds to a sophisticated system that responds to fluently spoken natural language and takes
into account the varying statistics of the language in which the speech is produced. Based on
major advances in statistical modeling of speech in the 1980s, automatic speech recognition
systems today find widespread application in tasks that require a human-machine interface,
such as automatic call processing in the telephone network and query-based information
systems that do things like provide updated travel information, stock price quotations,
weather reports, etc.

            Speech is the primary means of communication between people. For reasons
ranging from technological curiosity about the mechanisms for mechanical realization of
human speech capabilities, to the desire to automate simple tasks inherently requiring human-
machine interactions, research in automatic speech recognition (and speech synthesis) by
machine has attracted a great deal of attention over the past five decades.

            The desire for automation of simple tasks is not a modern phenomenon, but one
that goes back more than one hundred years in history. By way of example, in 1881
Alexander Graham Bell, his cousin Chichester Bell and Charles Sumner Tainter invented a
recording device that used a rotating cylinder with a wax coating on which up-and-down
grooves could be cut by a stylus, which responded to incoming sound pressure (in much the
same way as a microphone that Bell invented earlier for use with the telephone). Based on
this invention, Bell and Tainter formed the Volta Graphophone Co. in 1888 in order to
manufacture machines for the recording and reproduction of sound in office environments.
The American Graphophone Co., which later became the Columbia Graphophone Co.,
acquired the patent in 1907 and trademarked the term “Dictaphone.” Just about the same



time, Thomas Edison invented the phonograph using a tinfoil based cylinder, which was
subsequently adapted to wax, and developed the “Ediphone” to compete directly with
Columbia. The purpose of these products was to record dictation of notes and letters for a
secretary (likely in a large pool that offered the service) who would later type them out
(offline), thereby circumventing the need for costly stenographers.

            This turn-of-the-century concept of “office mechanization” spawned a range of
electric and electronic implements and improvements, including the electric typewriter,
which changed the face of office automation in the mid-part of the twentieth century. It does
not take much imagination to envision the obvious interest in creating an “automatic
typewriter” that could directly respond to and transcribe a human's voice without having to
deal with the annoyance of recording and handling the speech on wax cylinders or other
recording media.

A similar kind of automation took place a century later, in the 1990s, in the area of “call centers.” A call center is a concentration of agents or associates that handle telephone calls from customers requesting assistance. Among the tasks of such call centers are routing the incoming calls to the proper department, where specific help is provided or where transactions are carried out. One example of such a service was the AT&T Operator line, which helped a caller place calls, arrange payment methods, and conduct credit card transactions. The number of agent positions (or stations) in a large call center could reach several thousand, making such centers natural candidates for automatic speech recognition.




    From Speech Production Models to Spectral Representations

Attempts to develop machines to mimic a human's speech communication capability appear to have started in the second half of the 18th century. The early interest was not in recognizing and understanding speech but instead in creating a speaking machine, perhaps due to the readily available knowledge of acoustic resonance tubes, which were used to approximate the human vocal tract. In 1773, the Russian scientist Christian Kratzenstein, a professor of physiology in Copenhagen, succeeded in producing vowel sounds using resonance tubes connected to organ pipes. Later, Wolfgang von Kempelen in Vienna constructed an “Acoustic-Mechanical Speech Machine” (1791), and in the mid-1800s Charles Wheatstone [6] built a version of von Kempelen's speaking machine using resonators made of



leather, the configuration of which could be altered or controlled with a hand to produce
different speech-like sounds.

            During the first half of the 20th century, work by Fletcher [8] and others at Bell
Laboratories documented the relationship between a given speech spectrum (which is the
distribution of power of a speech sound across frequency), and its sound characteristics as
well as its intelligibility, as perceived by a human listener. In the 1930s Homer Dudley, influenced greatly by Fletcher's research, developed a speech synthesizer called the VODER (Voice Operating Demonstrator), an electrical equivalent (with mechanical control) of Wheatstone's mechanical speaking machine. Dudley's VODER consisted of a wrist bar for selecting either a relaxation-oscillator output or noise as the driving signal, and a foot pedal to control the oscillator frequency (the pitch of the synthesized voice). The driving signal was passed through ten band-pass filters whose output levels were controlled by the operator's fingers; these filters altered the power distribution of the source signal across a frequency range, thereby determining the characteristics of the speech-like sound at the loudspeaker. Thus, to synthesize a sentence, the VODER operator had to learn how to control and “play” the VODER so that the appropriate sounds of the sentence were produced. The VODER was demonstrated at the World's Fair in New York City in 1939 and was considered an important milestone in the evolution of speaking machines.

Speech pioneers like Harvey Fletcher and Homer Dudley firmly established the
importance of the signal spectrum for reliable identification of the phonetic nature of a speech
sound. Following the convention established by these two outstanding scientists, most
modern systems and algorithms for speech recognition are based on the concept of
measurement of the (time-varying) speech power spectrum (or its variants such as the
cepstrum), in part due to the fact that measurement of the power spectrum from a signal is
relatively easy to accomplish with modern digital signal processing techniques.



    Early Automatic Speech Recognizers

            Early attempts to design systems for automatic speech recognition were mostly
guided by the theory of acoustic-phonetics, which describes the phonetic elements of speech
(the basic sounds of the language) and tries to explain how they are acoustically realized in a
spoken utterance. These elements include the phonemes and the corresponding place and
manner of articulation used to produce the sound in various phonetic contexts. For example,


in order to produce a steady vowel sound, the vocal cords need to vibrate (to excite the vocal
tract), and the air that propagates through the vocal tract results in sound with natural modes
of resonance similar to what occurs in an acoustic tube. These natural modes of resonance,
called the formants or formant frequencies, are manifested as major regions of energy
concentration in the speech power spectrum. In 1952, Davis, Biddulph, and Balashek of Bell
Laboratories built a system for isolated digit recognition for a single speaker, using the
formant frequencies measured (or estimated) during vowel regions of each digit. These
trajectories served as the “reference pattern” for determining the identity of an unknown digit
utterance as the best matching digit.

In other early recognition systems of the 1950s, Olson and Belar of RCA Laboratories built a system to recognize 10 syllables of a single talker, and at MIT Lincoln Lab, Forgie and Forgie built a speaker-independent 10-vowel recognizer. In the 1960s, several Japanese laboratories demonstrated their capability of building special-purpose hardware to perform a speech recognition task. Most notable were the vowel recognizer of Suzuki and Nakata at the Radio Research Lab in Tokyo, the phoneme recognizer of Sakai and Doshita at Kyoto University, and the digit recognizer of NEC Laboratories [14]. The work of Sakai and Doshita involved the first use of a speech segmenter for analysis and recognition of speech in different portions of the input utterance. In contrast, an isolated digit recognizer implicitly assumed that the unknown utterance contained a complete digit (and no other speech sounds or words) and thus did not need an explicit “segmenter.” Kyoto University's work could be considered a precursor to continuous speech recognition systems.

In another early recognition system, Fry and Denes at University College in England built a phoneme recognizer to recognize 4 vowels and 9 consonants. By incorporating statistical information about allowable phoneme sequences in English, they increased the overall phoneme recognition accuracy for words consisting of two or more phonemes. This work marked the first use of statistical syntax (at the phoneme level) in automatic speech recognition. An alternative to the use of a speech segmenter was the concept of adopting a non-uniform time scale for aligning speech patterns. This concept started to gain acceptance in the 1960s through the work of Tom Martin at RCA Laboratories and Vintsyuk in the Soviet Union. Martin recognized the need to deal with the temporal non-uniformity of repeated speech events and suggested a range of solutions, including detection of utterance endpoints, which greatly enhanced the reliability of recognizer performance. Vintsyuk proposed the use of dynamic programming for time


alignment between two utterances in order to derive a meaningful assessment of their similarity. His work, though largely unknown in the West, appears to have preceded that of Sakoe and Chiba as well as others who proposed more formal methods, generally known as dynamic time warping, for speech pattern matching. Since the late 1970s, mainly due to the publication by Sakoe and Chiba, dynamic programming in its numerous variant forms (including the Viterbi algorithm [19], which came from the communication-theory community) has become an indispensable technique in automatic speech recognition.




    Advancement in Technology

The progress of speech recognition and understanding technology over the past several decades can be traced decade by decade. In the 1960s we were able to recognize small vocabularies (on the order of 10-100 words) of isolated words, based on simple acoustic-phonetic properties of speech sounds. The key technologies developed during this time frame were filter-bank analyses, simple time-normalization methods, and the beginnings of sophisticated dynamic programming methodologies. In the 1970s we were able to recognize medium vocabularies (on the order of 100-1000 words) using simple template-based pattern recognition methods. The key technologies developed during this period were the pattern recognition models, the introduction of LPC methods for spectral representation, pattern clustering methods for speaker-independent recognizers, and the introduction of dynamic programming methods for solving connected-word recognition problems. In the 1980s we started to tackle large-vocabulary (from 1000 words to an unlimited number) speech recognition problems based on statistical methods, with a wide range of networks for handling language structures. The key technologies introduced during this period were the hidden Markov model (HMM) and the stochastic language model, which together enabled powerful new methods for handling virtually any continuous speech recognition problem efficiently and with high performance. In the 1990s we were able to build large-vocabulary systems with unconstrained language models, and constrained task-syntax models for continuous speech recognition and understanding. The key technologies developed during this period were the methods for stochastic language understanding, statistical learning of acoustic and language models, and the introduction of the finite-state transducer framework (and the FSM Library) together with methods for its determinization and minimization for efficient implementation of large-vocabulary speech understanding systems.



Finally, in the last few years, we have seen the introduction of very large vocabulary systems
with full semantic models, integrated with text-to-speech (TTS) synthesis systems, and multi-
modal inputs (pointing, keyboards, mice, etc.). These systems enable spoken dialog systems
with a range of input and output modalities for ease-of-use and flexibility in handling adverse
environments where speech might not be as suitable as other input-output modalities. During
this period we have seen the emergence of highly natural concatenative speech synthesis
systems, the use of machine learning to improve both speech understanding and speech
dialogs, and the introduction of mixed-initiative dialog systems to enable user control when
necessary.

After nearly five decades of research, speech recognition technologies have finally entered the marketplace, benefiting users in a variety of ways. Throughout the course of development of such systems, knowledge of speech production and perception was used in establishing the technological foundation for the resulting speech recognizers. Major advances, however, were brought about in the 1960s and 1970s via the introduction of advanced speech representations based on LPC analysis and cepstral analysis methods, and in the 1980s through the introduction of rigorous statistical methods based on hidden Markov models. All of this came about because of significant research contributions from academia, private industry and government. As the technology continues to mature, it is clear that many new applications will emerge and become part of our way of life, thereby taking full advantage of machines that are partially able to mimic human speech capabilities.




         3. METHODOLOGY OF THE PROJECT


The methodology of the project involves the following steps:

   1. Database collection

   2. Decomposition of the speech signal

   3. Feature vectors extraction

   4. Developing a classifier

   5. Training the classifier

   6. Testing the classifier

Each of these steps is discussed in detail below.




    3.1 Database collection

Database collection is the most important step in speech recognition; only a good database can yield a good speech recognition system. Different people say the same word differently, owing to differences in pitch, accent, and pronunciation. In this step, the same word is therefore recorded by several speakers, with all words recorded at the same sampling frequency of 16 kHz. Collecting too many samples does not necessarily benefit recognition and can sometimes even hurt it, so the right number of samples should be taken. The same procedure is repeated for the other words.
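As a rough sketch of this step (the file name below is hypothetical, and the use of SciPy is an assumption; the report does not name a toolchain), each recording can be loaded and peak-normalized before further processing:

```python
import numpy as np
from scipy.io import wavfile

def load_sample(path):
    """Load one 16 kHz recording and normalize it by its maximum value."""
    rate, x = wavfile.read(path)        # e.g. "onnu_speaker01.wav" (hypothetical)
    assert rate == 16000, "all words are recorded at 16 kHz"
    x = x.astype(np.float64)
    return x / np.max(np.abs(x))        # peak normalization, as described in Section 4
```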




    3.2 Decomposition of speech signal

The next step is decomposition of the speech signal. Different techniques can be used for this, such as LPC, MFCC, STFT, and the wavelet transform. Over the past decade, the wavelet transform has come to be widely used in speech recognition. Speech recognition systems generally carry


out some kind of classification/recognition based upon speech features which are usually
obtained via time-frequency representations such as Short Time Fourier Transforms (STFTs)
or Linear Predictive Coding (LPC) techniques. In some respects, these methods may not be
suitable for representing speech; they assume signal stationarity within a given time frame
and may therefore lack the ability to analyze localized events accurately. Furthermore, the
LPC approach assumes a particular linear (all-pole) model of speech production which
strictly speaking is not the case.

Other approaches based on Cohen's general class of time-frequency distributions, such as the Cone-Kernel and Choi-Williams methods, have also found use in speech recognition applications but have the drawback of introducing unwanted cross-terms into the representation. The Wavelet Transform overcomes some of these limitations; it can provide a constant-Q analysis of a given signal by projection onto a set of basis functions that are scale variant with frequency. Each wavelet is a shifted and scaled version of an original, or mother, wavelet. These families are usually orthogonal to one another, which is important since orthogonality yields computational efficiency and ease of numerical implementation. Other factors influencing the choice of Wavelet Transforms over conventional methods include their ability to capture localized features.




                  Figure: Tiling of the time-frequency plane via the wavelet transform




       Wavelet Transform

The wavelet transform provides a time-frequency representation of a signal. (Other transforms, such as the short-time Fourier transform and Wigner distributions, also give this kind of information.)

Oftentimes a particular spectral component occurring at a particular instant is of special interest. In these cases it may be very beneficial to know the time intervals in which these spectral components occur. For example, in EEGs the latency of an event-related potential is of particular interest (an event-related potential is the response of the brain to a specific stimulus, such as a flash of light; its latency is the amount of time elapsed between the onset of the stimulus and the response).




The wavelet transform is capable of providing time and frequency information simultaneously. It can be applied to non-stationary signals: it concentrates on small portions of the signal that can be considered stationary, using a variable-size window instead of the constant-size window of the STFT. The WT thus tells us what band of frequencies is present in a given interval of time.



There are two methodologies for speech decomposition using wavelets: the Discrete Wavelet Transform (DWT) and Wavelet Packet Decomposition (WPD). Of the two, the DWT is used in our project.
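To make the distinction concrete, here is a minimal sketch using the PyWavelets library (an assumption; the report does not name a wavelet toolbox):

```python
import numpy as np
import pywt

x = np.random.randn(1024)                        # placeholder signal

# DWT (Mallat tree): only the approximation branch is split at each level,
# giving one approximation band plus one detail band per level.
dwt_coeffs = pywt.wavedec(x, "db4", level=3)     # [cA3, cD3, cD2, cD1]

# WPD: both approximation and detail branches are split at every level,
# giving a full binary tree of 2**level leaf bands.
wp = pywt.WaveletPacket(data=x, wavelet="db4", maxlevel=3)
print(len(dwt_coeffs), len(wp.get_level(3)))     # 4 vs. 8 sub-bands
```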




      Discrete Wavelet Transform

The transform of a signal is just another form of representing the signal; it does not change the information content. For many signals, the low-frequency part contains the most important information and gives the signal its identity. Consider the human voice: if we remove the high-frequency components, the voice sounds different, but we can still tell what's being said. In wavelet analysis we often speak of approximations and details. The approximations are the high-scale, low-frequency components of the signal; the details are the low-scale, high-frequency components. The DWT is defined by the following equation:

$$ W(j,k) = \sum_{n} x(n)\, 2^{-j/2}\, \psi\!\left(2^{-j} n - k\right) $$

where ψ(t) is a time function with finite energy and fast decay, called the mother wavelet.
The DWT analysis can be performed using a fast, pyramidal algorithm related to multi-rate
filter-banks. As a multi-rate filter-bank the DWT can be viewed as a constant Q filter-bank
with octave spacing between the centers of the filters. Each sub-band contains half the
samples of the neighboring higher frequency sub-band. In the pyramidal algorithm the signal
is analyzed at different frequency bands with different resolution by decomposing the signal
into a coarse approximation and detail information. The coarse approximation is then further
decomposed using the same wavelet decomposition step. This is achieved by successive
high-pass and low-pass filtering of the time domain signal and is defined by the following
equations:

$$ y_{\mathrm{high}}[k] = \sum_{n} x[n]\, g[2k - n], \qquad y_{\mathrm{low}}[k] = \sum_{n} x[n]\, h[2k - n] $$

where g[n] and h[n] are the impulse responses of the high-pass and low-pass filters, respectively, and the index 2k reflects the downsampling by two.

         Figure 1: The signal x[n] is passed through low-pass and high-pass filters and downsampled by 2




In the DWT, each level is calculated by passing the previous level's approximation coefficients through a high-pass and a low-pass filter. In the WPD, however, both the detail and approximation coefficients are decomposed.




                               Figure 2: Decomposition Tree

The DWT is computed by successive low-pass and high-pass filtering of the discrete time-domain signal, as shown in figures 1 and 2. This is called the Mallat algorithm or Mallat-tree decomposition.

The mother wavelet used is the Daubechies-4 (db4) wavelet, which has a comparatively large number of filter coefficients. Daubechies wavelets are the most popular wavelets; they represent the foundations of wavelet signal processing and are used in numerous applications. They are also called Maxflat wavelets, as their frequency responses have maximum flatness at frequencies 0 and π.
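A hedged sketch of the eight-level db4 decomposition used in this project (again assuming PyWavelets; a one-second 16 kHz recording is long enough for eight levels):

```python
import pywt

def decompose_speech(signal, wavelet="db4", levels=8):
    """Eight-level Mallat decomposition: returns one approximation array
    (cA8) followed by eight detail arrays (cD8 .. cD1)."""
    return pywt.wavedec(signal, wavelet, level=levels)
```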





                                  Daubechies wavelet of order 4



    3.3 Feature vector extraction

Feature extraction is key for ASR; it is arguably the most important component of designing an intelligent system based on speech/speaker recognition, since even the best classifier will perform poorly if the features are not chosen well. A feature extractor should reduce the pattern vector (i.e., the original waveform) to a lower dimension that retains most of the useful information of the original vector.

The extracted wavelet coefficients provide a compact representation that shows the energy distribution of the signal in time and frequency. To further reduce the dimensionality of the extracted feature vectors, statistics over the set of wavelet coefficients are used. In that way the statistical characteristics of the “texture” or “surface” of the sound can be represented; for example, the distribution of energy in time and frequency for music is different from that of speech.

The following features are used in our system:

- The mean of the absolute value of the coefficients in each sub-band. These features provide information about the frequency distribution of the audio signal.

- The standard deviation of the coefficients in each sub-band. These features provide information about the amount of change of the frequency distribution.

- The energy of each sub-band of the signal. These features provide information about the energy distribution among the sub-bands.

- The kurtosis of each sub-band of the signal. These features measure whether the data are peaked or flat relative to a normal distribution.




- The skewness of each sub-band of the signal. These features measure the symmetry, or lack of symmetry, of the data.

These features are then combined into a hybrid feature vector and fed to a classifier. The features are combined using a matrix in which all the features of one sample form one column.
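A minimal sketch of this feature computation, assuming the coefficient list returned by an eight-level DWT (function and variable names are ours, not the report's):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def subband_features(coeffs):
    """One feature column per sample: mean |c|, standard deviation, energy,
    kurtosis, and skewness of each wavelet sub-band."""
    feats = []
    for c in coeffs:                      # cA8, cD8, ..., cD1 -> 9 sub-bands
        feats += [np.mean(np.abs(c)), np.std(c), np.sum(c ** 2),
                  kurtosis(c), skew(c)]
    return np.array(feats)                # 9 sub-bands x 5 statistics = 45 features
```

Stacking one such column per sample yields the feature matrix described above.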




    3.4 Developing a classifier

Generally, three methods are commonly used in speech recognition: Dynamic Time Warping (DTW), Hidden Markov Models (HMMs), and Artificial Neural Networks (ANNs).

Dynamic time warping (DTW) is a technique that finds the optimal alignment between two time series when one may be warped non-linearly by stretching or shrinking it along its time axis. This warping can then be used to find corresponding regions between the two series or to determine their similarity.

In speech recognition, dynamic time warping is often used to determine whether two waveforms represent the same spoken phrase. The method performs time alignment of two words and estimates their difference. In a speech waveform, the duration of each spoken sound and the interval between sounds are permitted to vary, but the overall speech waveforms must be similar. The main problems with such systems are the small vocabulary they can learn, the high computational cost, and the large memory requirement.
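For illustration only (the project ultimately uses an ANN, not DTW), the alignment idea can be sketched in a few lines for scalar sequences; real systems align feature vectors:

```python
import numpy as np

def dtw_distance(a, b):
    """Cost of the optimal non-linear alignment of sequences a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```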

Hidden Markov Models are finite automata with a given number of states; passing from one state to another happens instantaneously at equally spaced time moments. At every transition the system generates observations, so two processes take place: a transparent one, represented by the observation string (the feature sequence), and a hidden one, which cannot be observed, represented by the state string. The main points of this method are the modeling of the timing sequence and the comparison of observation likelihoods.

Nowadays, ANNs are used in a wide range of applications because of their parallel distributed processing, distributed memory, error tolerance, and pattern-learning ability. The complexity of all these systems increases as their generality rises. The biggest



restriction of the first two methods is their low speed in searching and comparing models, whereas ANNs are faster because the output results simply from multiplying the trained weights by the present input. At present, the TDNN (Time-Delay Neural Network) is widely used in speech recognition.




      Neural Networks


A neural network (NN) is a massively parallel processing system that consists of many processing entities connected through links that represent the relationships between them. A Multilayer Perceptron (MLP) network consists of an input layer, one or more hidden layers, and an output layer, each made up of multiple neurons. The artificial neuron is the smallest unit of the artificial neural network; the actual computation and processing of the network happen inside the neurons. In this work, we use an MLP architecture, namely a feed-forward network trained with the back-propagation algorithm (FFBP). In this type of network, the input is presented to the network and moves through the weights and nonlinear activation functions toward the output layer, and the error is corrected in a backward direction using the well-known error back-propagation algorithm. The FFBP is well suited to structural pattern recognition. In structural pattern-recognition tasks, there are N training examples, each consisting of a pattern and a target class (x, y). These examples are assumed to be generated independently according to the joint distribution P(x, y). A structural classifier is then a function h that performs a static mapping from patterns to target classes, y = h(x). The function h is usually produced by searching through a space of candidate classifiers during a learning process and returning the function that performs well on the training examples. A neural network returns the function h in the form of a matrix of weights.





                                       An Artificial Neuron


The number of neurons in each hidden layer has a direct impact on the performance of the network during training as well as during operation. Having more neurons than the problem needs runs the network into overfitting, a situation in which the network memorizes the training examples: such networks perform well on training examples and poorly on unseen ones. Having fewer neurons than needed causes underfitting, which happens when the network architecture cannot cope with the complexity of the problem at hand; underfitting results in inadequate modeling and therefore poor performance of the network.




                                   MLP Neural network architecture




 The Backpropagation Algorithm


               The backpropagation algorithm (Rumelhart and McClelland, 1986) is used in
layered feed-forward ANNs. This means that the artificial neurons are organized in layers,
and send their signals “forward”, and then the errors are propagated backwards. The network
receives inputs through the neurons in the input layer, and the output of the network is given by the neurons in the output layer. There may be one or more intermediate hidden layers. The
backpropagation algorithm uses supervised learning, which means that we provide the
algorithm with examples of the inputs and outputs we want the network to compute, and then
the error (difference between actual and expected results) is calculated. The idea of the
backpropagation algorithm is to reduce this error, until the ANN learns the training data. The
training begins with random weights, and the goal is to adjust them so that the error will be
minimal.
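In standard textbook notation (not quoted from this report), with targets t_k, outputs o_k and learning rate η, the error being minimized and the gradient-descent weight update are:

$$ E = \tfrac{1}{2} \sum_k (t_k - o_k)^2, \qquad w_{ij} \leftarrow w_{ij} - \eta\, \frac{\partial E}{\partial w_{ij}} $$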




    3.5 Training the classifier

After development, the classifier goes through two phases: training and testing. In the training phase, the features of the samples are fed as input to the ANN and the target is set; the network is then trained, adjusting its weights so that the target is achieved for the given input. In this project we used the 'tansig' and 'logsig' activation functions, so the output is bounded between 0 and 1. The target is set to .9 .1 .1 .1 .1 for the 1st word, .1 .9 .1 .1 .1 for the 2nd word, and so on. The position of the maximum value corresponds to the recognized word.
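As an illustration only: the project itself used MATLAB's tansig/logsig network, which the sketch below only approximates with scikit-learn (layer sizes, feature dimension, and the random stand-in data are assumptions; the .9/.1 target encoding follows the scheme above):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

n_words, n_train, n_feats = 5, 16, 45              # assumed dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(n_words * n_train, n_feats))  # stand-in for real feature columns

# Targets: .9 at the word's position, .1 elsewhere.
T = np.full((n_words * n_train, n_words), 0.1)
for w in range(n_words):
    T[w * n_train:(w + 1) * n_train, w] = 0.9

# A tanh hidden layer loosely mirrors MATLAB's tansig.
net = MLPRegressor(hidden_layer_sizes=(15,), activation="tanh",
                   max_iter=2000, random_state=0)
net.fit(X, T)

# Recognition: the position of the maximum output gives the word.
words = ["onnu", "randu", "naalu", "anju", "aaru"]
print(words[int(np.argmax(net.predict(X[:1])))])
```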



    3.6 Testing the classifier

The next phase is testing. The samples set aside for testing are given to the classifier and the output is noted. If we do not get the desired output, we adjust the number of neurons until the required output is reached.





                              4. OBSERVATION



We recorded five Malayalam words: “onnu”, “randu”, “naalu”, “anju” and “aaru”. These correspond to the Malayalam words for the numerals 1, 2, 4, 5 and 6, respectively. These words were selected because the project was intended to implement a password system based on numerals.




            Malayalam Word                 Numeral

            onnu                               1
            randu                              2
            naalu                              4
            anju                               5
            aaru                               6


Twenty samples of each word were recorded from different people, and these samples were normalized by dividing by their maximum values. They were then decomposed using the wavelet transform up to eight levels, since the majority of the information about the signal is present in the low-frequency region.




In order to classify the signals, an ANN is developed and trained with the outputs fixed such that:

If the word is 'onnu', the output will be    .9 .1 .1 .1 .1

If the word is 'randu', the output will be   .1 .9 .1 .1 .1

If the word is 'naalu', the output will be   .1 .1 .9 .1 .1

If the word is 'anju', the output will be    .1 .1 .1 .9 .1

If the word is 'aaru', the output will be    .1 .1 .1 .1 .9

Of the 20 samples recorded for each word, 16 are used to train the ANN and the remaining 4 are used for testing.



 Plots




                                          Plot for word ‘onnu’



                                  Plot for word ‘randu’




                                  Plot for word ‘naalu’






                                   Plot for word ‘anju’




                                   Plot for word ‘aaru’






    DWT Tree




The 8-level DWT decomposition tree for a signal is shown in the figure; it produces one set of approximation coefficients and eight sets of detail coefficients.





                    Decomposed waveforms for word ‘randu’






                     Decomposed waveforms for word ‘aaru’






                            5. TESTING AND RESULTS




    Testing with pre-recorded samples

Of the 20 samples recorded for each word, 16 were used for training. We tested our program's accuracy with the 4 unused samples per word. A total of 20 samples were tested (4 samples for each of the 5 words), and the program yielded the right result for all 20. Thus, we obtained 100% accuracy with pre-recorded samples.

    Real-time testing:

For real-time testing, we took a sample using a microphone and executed the program directly on this sample. A total of 30 samples were tested, out of which 20 gave the right result, an accuracy of about 66% with real-time samples.





Changes in efficiency obtained by varying the parameters of the ANN were observed and are plotted below.




Plot 1: Accuracy with a 2-layer feed-forward network; number of neurons in the first layer = 15






Plot 2: Accuracy with a 2-layer feed-forward network; number of neurons in the first layer = 20






Plot 3: Accuracy with a 3-layer feed-forward network; number of neurons in the first layer N1 = 15 and in the second layer N2 = 5






                                   6. CONCLUSION

Speech recognition is one of the most active research areas, and much work continues on new and enhanced approaches. During the experiment, we observed the effectiveness of the Daubechies-4 mother wavelet for feature extraction. In this experiment we used only a limited number of samples; increasing the number of samples may yield better features and a good recognition result for Malayalam word utterances. The performance of the neural network with wavelet features is appreciable. The software we used has some limitations; with more samples and more training iterations, it could produce a better recognition result.

We also observed that a neural network is an effective tool that can be combined successfully with wavelets. Wavelet-based feature extraction could also be paired with other classification methods, such as neuro-fuzzy and genetic-algorithm techniques, to perform the same task.

From this study, we could understand and experience the effectiveness of the discrete wavelet transform in feature extraction. Our recognition results under different kinds of noise and noisy conditions show that choosing dyadic bandwidths performs better than choosing equal bandwidths in sub-band recombination. This result matches the way the human ear recognizes speech and shows a useful benefit of the dyadic nature of the multi-level wavelet transform for sub-band speech recognition.

In our experiments, the wavelet transform proved a more powerful technique for speech processing than the earlier techniques discussed, and the ANN proved a successful classifier for this small-vocabulary task.





                                         7. REFERENCES

[1] Vimal Krishnan V.R., Athulya Jayakumar, Babu Anto P., "Speech Recognition of Isolated Malayalam Words Using Wavelet Features and Artificial Neural Network", 4th IEEE International Symposium on Electronic Design, Test & Applications.

[2] Lawrence Rabiner, Biing-Hwang Juang, "Fundamentals of Speech Recognition", Englewood Cliffs, NJ: Prentice Hall, 1993.

[3] Stéphane Mallat, "A Wavelet Tour of Signal Processing", San Diego: Academic Press, 1999, ISBN 0-12-466606-X.

[4] S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, 1989.

[5] K.P. Soman, K.I. Ramachandran, "Insight into Wavelets: From Theory to Practice", Second Edition, PHI, 2005.

[6] S. Kadambe, P. Srinivasan, "Application of Adaptive Wavelets for Speech", Optical Engineering 33(7), pp. 2204-2211, July 1994.

[7] Stuart Russell, Peter Norvig, "Artificial Intelligence: A Modern Approach", New Delhi: Prentice Hall of India, 2005.

[8] S.N. Sivanandam, S. Sumathi, S.N. Deepa, "Introduction to Neural Networks Using Matlab 6.0", New Delhi: Tata McGraw-Hill, 2006.

[9] James A. Freeman, David M. Skapura, "Neural Networks: Algorithms, Applications, and Programming Techniques", Pearson Education, 2006.




E0ad silent sound technology
 
Asr
AsrAsr
Asr
 
NASAM_CPAF_MODULE 1.pptx
NASAM_CPAF_MODULE 1.pptxNASAM_CPAF_MODULE 1.pptx
NASAM_CPAF_MODULE 1.pptx
 
Silent-Sound-Technology-PPT.pptx
Silent-Sound-Technology-PPT.pptxSilent-Sound-Technology-PPT.pptx
Silent-Sound-Technology-PPT.pptx
 
FINAL report
FINAL reportFINAL report
FINAL report
 
Speech Analysis
Speech AnalysisSpeech Analysis
Speech Analysis
 
10
1010
10
 
SILENT SOUND TECHNOLOGY
SILENT SOUND TECHNOLOGYSILENT SOUND TECHNOLOGY
SILENT SOUND TECHNOLOGY
 
Voice controlled robot
Voice controlled robotVoice controlled robot
Voice controlled robot
 
Voice controlled robot
Voice controlled robotVoice controlled robot
Voice controlled robot
 
Wreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognitionWreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognition
 

Recently uploaded

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

Speech recognition-using-wavelet-transform

The desire for automation of simple tasks is not a modern phenomenon, but one that goes back more than one hundred years. By way of example, in 1881 Alexander Graham Bell, his cousin Chichester Bell, and Charles Sumner Tainter invented a recording device that used a rotating cylinder with a wax coating, on which up-and-down grooves could be cut by a stylus that responded to incoming sound pressure (in much the same way as the microphone Bell had invented earlier for use with the telephone). Based on this invention, Bell and Tainter formed the Volta Graphophone Co. in 1888 to manufacture machines for the recording and reproduction of sound in office environments. The American Graphophone Co., which later became the Columbia Graphophone Co., acquired the patent in 1907 and trademarked the term "Dictaphone."
At about the same time, Thomas Edison invented the phonograph using a tinfoil-based cylinder, which was subsequently adapted to wax, and developed the "Ediphone" to compete directly with Columbia. The purpose of these products was to record dictation of notes and letters for a secretary (likely in a large pool that offered the service) who would later type them out offline, thereby circumventing the need for costly stenographers. This turn-of-the-century concept of "office mechanization" spawned a range of electric and electronic implements and improvements, including the electric typewriter, which changed the face of office automation in the mid-part of the twentieth century. It does not take much imagination to envision the obvious interest in creating an "automatic typewriter" that could directly respond to and transcribe a human's voice without having to deal with the annoyance of recording and handling the speech on wax cylinders or other recording media.

A similar kind of automation took place a century later, in the 1990s, in the area of "call centers." A call center is a concentration of agents or associates who handle telephone calls from customers requesting assistance. Among the tasks of such call centers is routing the incoming calls to the proper department, where specific help is provided or where transactions are carried out. One example of such a service was the AT&T Operator line, which helped a caller place calls, arrange payment methods, and conduct credit card transactions. The number of agent positions (or stations) in a large call center could reach several thousand.

From Speech Production Models to Spectral Representations

Attempts to develop machines to mimic a human's speech communication capability appear to have started in the second half of the 18th century. The early interest was not in recognizing and understanding speech but in creating a speaking machine, perhaps due to the readily available knowledge of acoustic resonance tubes, which were used to approximate the human vocal tract. In 1773 the Russian scientist Christian Kratzenstein, a professor of physiology in Copenhagen, succeeded in producing vowel sounds using resonance tubes connected to organ pipes. Later, Wolfgang von Kempelen in Vienna constructed an "Acoustic-Mechanical Speech Machine" (1791), and in the mid-1800s Charles Wheatstone [6] built a version of von Kempelen's speaking machine using resonators made of
leather, the configuration of which could be altered or controlled by hand to produce different speech-like sounds.

During the first half of the 20th century, work by Fletcher [8] and others at Bell Laboratories documented the relationship between a given speech spectrum (the distribution of power of a speech sound across frequency) and its sound characteristics, as well as its intelligibility as perceived by a human listener. In the 1930s Homer Dudley, influenced greatly by Fletcher's research, developed a speech synthesizer called the VODER (Voice Operating Demonstrator), an electrical equivalent (with mechanical control) of Wheatstone's mechanical speaking machine. Dudley's VODER consisted of a wrist bar for selecting either a relaxation-oscillator output or noise as the driving signal, and a foot pedal to control the oscillator frequency (the pitch of the synthesized voice). The driving signal was passed through ten band-pass filters whose output levels were controlled by the operator's fingers; these filters altered the power distribution of the source signal across frequency, thereby determining the characteristics of the speech-like sound at the loudspeaker. To synthesize a sentence, the VODER operator had to learn how to control and "play" the VODER so that the appropriate sounds of the sentence were produced. The VODER was demonstrated at the World's Fair in New York City in 1939 and was considered an important milestone in the evolution of speaking machines.

Speech pioneers like Harvey Fletcher and Homer Dudley firmly established the importance of the signal spectrum for reliable identification of the phonetic nature of a speech sound. Following the convention established by these two outstanding scientists, most modern systems and algorithms for speech recognition are based on measurement of the (time-varying) speech power spectrum (or its variants, such as the cepstrum), in part because measurement of the power spectrum of a signal is relatively easy to accomplish with modern digital signal processing techniques.

Early Automatic Speech Recognizers

Early attempts to design systems for automatic speech recognition were mostly guided by the theory of acoustic-phonetics, which describes the phonetic elements of speech (the basic sounds of the language) and tries to explain how they are acoustically realized in a spoken utterance. These elements include the phonemes and the corresponding place and manner of articulation used to produce the sound in various phonetic contexts. For example,
in order to produce a steady vowel sound, the vocal cords need to vibrate (to excite the vocal tract), and the air that propagates through the vocal tract results in sound with natural modes of resonance, similar to what occurs in an acoustic tube. These natural modes of resonance, called the formants or formant frequencies, are manifested as major regions of energy concentration in the speech power spectrum. In 1952, Davis, Biddulph, and Balashek of Bell Laboratories built a system for isolated-digit recognition for a single speaker, using the formant frequencies measured (or estimated) during vowel regions of each digit. These trajectories served as the "reference pattern" for determining the identity of an unknown digit utterance as the best-matching digit. In other early recognition systems of the 1950s, Olson and Belar of RCA Laboratories built a system to recognize 10 syllables of a single talker, and at MIT Lincoln Lab, Forgie and Forgie built a speaker-independent 10-vowel recognizer.

In the 1960s, several Japanese laboratories demonstrated their capability of building special-purpose hardware to perform a speech recognition task. Most notable were the vowel recognizer of Suzuki and Nakata at the Radio Research Lab in Tokyo, the phoneme recognizer of Sakai and Doshita at Kyoto University, and the digit recognizer of NEC Laboratories [14]. The work of Sakai and Doshita involved the first use of a speech segmenter for analysis and recognition of speech in different portions of the input utterance. In contrast, an isolated-digit recognizer implicitly assumed that the unknown utterance contained a complete digit (and no other speech sounds or words) and thus did not need an explicit "segmenter." Kyoto University's work could be considered a precursor to continuous speech recognition systems.

In another early recognition system, Fry and Denes, at University College in England, built a phoneme recognizer to recognize 4 vowels and 9 consonants. By incorporating statistical information about allowable phoneme sequences in English, they increased the overall phoneme recognition accuracy for words consisting of two or more phonemes. This work marked the first use of statistical syntax (at the phoneme level) in automatic speech recognition.

An alternative to the use of a speech segmenter was the concept of adopting a non-uniform time scale for aligning speech patterns. This concept started to gain acceptance in the 1960s through the work of Tom Martin at RCA Laboratories and Vintsyuk in the Soviet Union. Martin recognized the need to deal with the temporal non-uniformity of repeated speech events and suggested a range of solutions, including detection of utterance endpoints, which greatly enhanced the reliability of recognizer performance. Vintsyuk proposed the use of dynamic programming for time
alignment between two utterances in order to derive a meaningful assessment of their similarity. His work, though largely unknown in the West, appears to have preceded that of Sakoe and Chiba, as well as others who proposed more formal methods, generally known as dynamic time warping, for speech pattern matching. Since the late 1970s, mainly due to the publication by Sakoe and Chiba, dynamic programming, in numerous variant forms (including the Viterbi algorithm [19], which came from the communication-theory community), has become an indispensable technique in automatic speech recognition.

Advancement in technology

The figure shows a timeline of progress in speech recognition and understanding technology over the past several decades. In the 1960s we were able to recognize small vocabularies (on the order of 10-100 words) of isolated words, based on simple acoustic-phonetic properties of speech sounds. The key technologies developed during this time frame were filter-bank analyses, simple time-normalization methods, and the beginnings of sophisticated dynamic programming methodologies. In the 1970s we were able to recognize medium vocabularies (on the order of 100-1000 words) using simple template-based pattern recognition methods. The key technologies developed during this period were pattern recognition models, the introduction of LPC methods for spectral representation, pattern clustering methods for speaker-independent recognizers, and the introduction of dynamic programming methods for solving connected-word recognition problems. In the 1980s we started to tackle large-vocabulary (1000 words to an unlimited number of words) speech recognition problems based on statistical methods, with a wide range of networks for handling language structures. The key technologies introduced during this period were the hidden Markov model (HMM) and the stochastic language model, which together enabled powerful new methods for handling virtually any continuous speech recognition problem efficiently and with high performance. In the 1990s we were able to build large-vocabulary systems with unconstrained language models, and constrained task-syntax models, for continuous speech recognition and understanding. The key technologies developed during this period were methods for stochastic language understanding, statistical learning of acoustic and language models, and the introduction of the finite state transducer framework (and the FSM Library), together with methods for its determinization and minimization for efficient implementation of large-vocabulary speech understanding systems.
Finally, in the last few years, we have seen the introduction of very-large-vocabulary systems with full semantic models, integrated with text-to-speech (TTS) synthesis systems and multi-modal inputs (pointing, keyboards, mice, etc.). These systems enable spoken dialog systems with a range of input and output modalities for ease of use and flexibility in handling adverse environments where speech might not be as suitable as other input-output modalities. During this period we have also seen the emergence of highly natural concatenative speech synthesis systems, the use of machine learning to improve both speech understanding and speech dialogs, and the introduction of mixed-initiative dialog systems that give the user control when necessary.

After nearly five decades of research, speech recognition technologies have finally entered the marketplace, benefiting users in a variety of ways. Throughout the course of development of such systems, knowledge of speech production and perception was used in establishing the technological foundation of the resulting speech recognizers. Major advances, however, were brought about in the 1960s and 1970s via the introduction of advanced speech representations based on LPC analysis and cepstral analysis methods, and in the 1980s through the introduction of rigorous statistical methods based on hidden Markov models. All of this came about because of significant research contributions from academia, private industry, and government. As the technology continues to mature, it is clear that many new applications will emerge and become part of our way of life, thereby taking full advantage of machines that are partially able to mimic human speech capabilities.
3. METHODOLOGY OF THE PROJECT

The methodology of the project involves the following steps:

1. Database collection
2. Decomposition of the speech signal
3. Feature vector extraction
4. Developing a classifier
5. Training the classifier
6. Testing the classifier

Each of these steps is discussed in detail below.

3.1 Database collection

Database collection is the most important step in speech recognition: only a good database can yield a good speech recognition system. Different people say the same word differently, owing to differences in pitch, accent, and pronunciation. In this step the same word is therefore recorded by several persons, with all words recorded at the same sampling frequency of 16 kHz. Collecting too many samples does not necessarily benefit recognition and can sometimes affect it adversely, so the right number of samples should be chosen. The same procedure is repeated for the other words. An illustrative recording sketch follows.
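As a rough illustration of this step, the following MATLAB sketch records and normalizes one utterance. It assumes MATLAB with the basic audio support of the project's era; the capture duration and the file name are invented for the example and are not taken from the report.

    fs = 16000;                            % sampling frequency used for all words
    rec = audiorecorder(fs, 16, 1);        % 16-bit, single channel
    disp('Speak the word now...');
    recordblocking(rec, 1.5);              % assumed 1.5 s capture window
    x = getaudiodata(rec);
    x = x / max(abs(x));                   % normalize by the maximum value (see section 4)
    wavwrite(x, fs, 'onnu_sample01.wav');  % hypothetical database file name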
3.2 Decomposition of the speech signal

The next step is decomposition of the speech signal. Different techniques can be used for this, such as LPC, MFCC, STFT, and the wavelet transform; over the past ten years the wavelet transform has become the most widely used in speech recognition. Speech recognition systems generally carry out some kind of classification/recognition based upon speech features which are usually obtained via time-frequency representations such as Short Time Fourier Transforms (STFTs) or Linear Predictive Coding (LPC) techniques. In some respects these methods may not be suitable for representing speech: they assume signal stationarity within a given time frame and may therefore lack the ability to analyze localized events accurately. Furthermore, the LPC approach assumes a particular linear (all-pole) model of speech production, which strictly speaking does not hold. Other approaches based on Cohen's general class of time-frequency distributions, such as the Cone-Kernel and Choi-Williams methods, have also found use in speech recognition applications but have the drawback of introducing unwanted cross-terms into the representation.

The wavelet transform overcomes some of these limitations. It provides a constant-Q analysis of a given signal by projection onto a set of basis functions that are scale-variant with frequency; each wavelet is a shifted, scaled version of an original or mother wavelet. These families are usually orthogonal to one another, which is important because it yields computational efficiency and ease of numerical implementation. Other factors influencing the choice of wavelet transforms over conventional methods include their ability to capture localized features.

[Figure: Tiling of the time-frequency plane by the wavelet transform]
Wavelet Transform

The wavelet transform provides a time-frequency representation of a signal. (Other transforms give this information too, such as the short time Fourier transform and Wigner distributions.) Often a particular spectral component occurring at a particular instant is of special interest, and in such cases it is very useful to know the time intervals in which these spectral components occur. For example, in EEGs the latency of an event-related potential is of particular interest (an event-related potential is the response of the brain to a specific stimulus such as a flash of light; the latency of this response is the amount of time elapsed between the onset of the stimulus and the response). The wavelet transform provides time and frequency information simultaneously.

The wavelet transform can be applied to non-stationary signals: it concentrates on small portions of the signal, which can be considered stationary, and it has a variable-size window, unlike the constant-size window of the STFT. The WT tells us what band of frequencies is present in a given interval of time.
There are two methodologies for speech decomposition using wavelets: the Discrete Wavelet Transform (DWT) and Wavelet Packet Decomposition (WPD). Of the two, the DWT is used in our project.

Discrete Wavelet Transform

The transform of a signal is just another form of representing the signal; it does not change the information content of the signal. For many signals the low-frequency part carries the most important information and gives the signal its identity. Consider the human voice: if we remove the high-frequency components the voice sounds different, but we can still tell what is being said. In wavelet analysis we therefore speak of approximations and details. The approximations are the high-scale, low-frequency components of the signal; the details are the low-scale, high-frequency components.

The DWT is defined by the following equation:

    \mathrm{DWT}(j,k) = \frac{1}{\sqrt{2^{j}}} \sum_{n} x(n)\, \psi\!\left(\frac{n - k\,2^{j}}{2^{j}}\right)

where ψ(t) is a time function with finite energy and fast decay called the mother wavelet. The DWT analysis can be performed using a fast, pyramidal algorithm related to multi-rate filter banks. As a multi-rate filter bank the DWT can be viewed as a constant-Q filter bank with octave spacing between the centers of the filters; each sub-band contains half the samples of the neighboring higher-frequency sub-band. In the pyramidal algorithm the signal is analyzed at different frequency bands with different resolution by decomposing it into a coarse approximation and detail information, and the coarse approximation is then further decomposed using the same wavelet decomposition step. This is achieved by successive high-pass and low-pass filtering of the time-domain signal, defined by the following equations:

    y_{\mathrm{high}}[k] = \sum_{n} x[n]\, g[2k - n]
    y_{\mathrm{low}}[k]  = \sum_{n} x[n]\, h[2k - n]

where h[n] and g[n] are the low-pass and high-pass decomposition filters. A one-level sketch of this filtering step follows.
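As a concrete sketch of these two equations (Wavelet Toolbox assumed, with the db4 wavelet used later in this project; the boundary handling inside dwt differs slightly from plain convolution):

    [cA, cD] = dwt(x, 'db4');          % cA: approximation, cD: detail
    % Roughly the same computation with explicit filters:
    [LoD, HiD] = wfilters('db4', 'd'); % decomposition filters h[n] and g[n]
    a = conv(x, LoD); a = a(2:2:end);  % y_low[k]  = sum_n x[n] h[2k-n]
    d = conv(x, HiD); d = d(2:2:end);  % y_high[k] = sum_n x[n] g[2k-n]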
[Figure 1: The signal x[n] is passed through low-pass and high-pass filters and downsampled by 2]

In the DWT, each level is calculated by passing the previous level's approximation coefficients through a high-pass and a low-pass filter. In the WPD, by contrast, both the detail and the approximation coefficients are decomposed.

[Figure 2: Decomposition tree]

The DWT is computed by successive low-pass and high-pass filtering of the discrete time-domain signal, as shown in Figures 1 and 2. This is called the Mallat algorithm or Mallat-tree decomposition. The mother wavelet used here is the Daubechies 4 (db4) wavelet, which has a comparatively large number of filter coefficients. Daubechies wavelets are the most popular wavelets; they represent the foundations of wavelet signal processing and are used in numerous applications. They are also called Maxflat wavelets, as their frequency responses have maximum flatness at frequencies 0 and π. A multi-level decomposition sketch is given below.
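A minimal sketch of the eight-level db4 decomposition used in this project (Wavelet Toolbox assumed), producing one approximation and eight detail coefficient sets:

    N = 8;                          % decomposition depth used in section 4
    [C, L] = wavedec(x, N, 'db4');  % C: stacked coefficients, L: bookkeeping vector
    A8 = appcoef(C, L, 'db4', N);   % level-8 approximation coefficients
    D = cell(1, N);
    for k = 1:N
        D{k} = detcoef(C, L, k);    % detail coefficients at level k
    end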
[Figure: Daubechies wavelet of order 4]

3.3 Feature vector extraction

Feature extraction is the key to ASR; it is arguably the most important component of an intelligent speech/speaker recognition system, since the best classifier will perform poorly if the features are not chosen well. A feature extractor should reduce the pattern vector (i.e., the original waveform) to a lower dimension that retains most of the useful information of the original vector. The extracted wavelet coefficients provide a compact representation that shows the energy distribution of the signal in time and frequency. To further reduce the dimensionality of the extracted feature vectors, statistics over the set of wavelet coefficients are used. In this way the statistical character of the signal's "texture" can be represented; the distribution of energy in time and frequency for music, for example, differs from that of speech. The following features are used in our system:

- The mean of the absolute value of the coefficients in each sub-band. These features provide information about the frequency distribution of the audio signal.
- The standard deviation of the coefficients in each sub-band. These features provide information about how much the frequency distribution varies.
- The energy of each sub-band of the signal.
- The kurtosis of each sub-band of the signal. These features measure whether the data are peaked or flat relative to a normal distribution.
- The skewness of each sub-band of the signal. These features measure symmetry, or the lack of it.

These features are then combined into a hybrid feature vector and fed to the classifier. The features are combined using a matrix in which all the features of one sample form one column; a sketch of this step follows.
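A sketch of the statistics computed over each sub-band (Statistics Toolbox assumed for kurtosis and skewness; A8 and D are the outputs of the decomposition sketch above):

    bands = [{A8}, D];        % nine sub-bands: one approximation, eight details
    f = [];
    for k = 1:numel(bands)
        b = bands{k};
        f = [f; mean(abs(b)); std(b); sum(b.^2); kurtosis(b); skewness(b)];
    end
    % f is one 45-element feature column (9 sub-bands x 5 statistics); the
    % full feature matrix P has one such column per recorded sample.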
3.4 Developing a classifier

There are three usual methods in speech recognition: Dynamic Time Warping (DTW), Hidden Markov Models (HMMs), and Artificial Neural Networks (ANNs).

Dynamic time warping finds the optimal alignment between two time series when one series may be warped non-linearly by stretching or shrinking it along its time axis. The warping can then be used to find corresponding regions of the two series or to measure their similarity. In speech recognition, DTW is often used to decide whether two waveforms represent the same spoken phrase: the duration of each spoken sound and the intervals between sounds are allowed to vary, but the overall waveforms must be similar. The main problems with such systems are the small number of words they can learn, the high computational cost, and the large memory requirement. A minimal DTW sketch is given below.
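DTW is not the classifier used in this project, but the dynamic-programming recursion behind it is easy to sketch. In this illustrative MATLAB fragment, p and q are two feature sequences of lengths n and m:

    n = numel(p); m = numel(q);
    D = inf(n + 1, m + 1);
    D(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            c = abs(p(i) - q(j));              % local distance between samples
            D(i+1, j+1) = c + min([D(i, j+1), D(i+1, j), D(i, j)]);  % best warp step
        end
    end
    dtwDist = D(n+1, m+1);   % small distance suggests the same spoken word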
Hidden Markov Models are finite automata with a given number of states; the model passes from one state to another instantaneously at equally spaced time instants. At every transition the system generates observations, so two processes take place: a visible one, represented by the observation string (the feature sequence), and a hidden one, which cannot be observed, represented by the state string. The essence of this method lies in its temporal sequencing and comparison procedures.

ANNs are now used in a wide range of applications for their parallel distributed processing, distributed memory, error tolerance, and pattern-learning ability. The complexity of all these systems increases as their generality rises. The biggest restriction of the first two methods is their low speed when searching and comparing models; ANNs are faster, because the output results from multiplying the adjusted weights by the current input. At present the Time-Delay Neural Network (TDNN) is also widely used in speech recognition.

Neural Networks

A neural network (NN) is a massively parallel processing system that consists of many processing entities connected through links that represent the relationships between them. A Multilayer Perceptron (MLP) network consists of an input layer, one or more hidden layers, and an output layer, each layer consisting of multiple neurons. An artificial neuron is the smallest unit of the artificial neural network; the actual computation and processing of the network happens inside the neurons. In this work we use an MLP architecture, namely a feed-forward network with the back-propagation training algorithm (FFBP). In this type of network the input is presented to the network and moves through the weights and nonlinear activation functions toward the output layer, and the error is corrected in a backward direction using the well-known error back-propagation algorithm. A sketch of what one layer computes appears below.

The FFBP is well suited to structural pattern recognition. In structural pattern recognition tasks there are N training examples, each consisting of a pattern and a target class (x, y); these examples are assumed to be generated independently according to a joint distribution P(x, y). A structural classifier is a function h that performs a static mapping from patterns to target classes, y = h(x). The function h is usually produced by searching through a space of candidate classifiers and returning the one that performs well on the training examples during a learning process. A neural network returns the function h in the form of a matrix of weights.
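For illustration, one MLP layer computes a weighted sum followed by a nonlinear activation. The sizes below match the 45-element features and the 15-neuron hidden layer reported later; the weight values are random placeholders, not the trained network:

    W = randn(15, 45);       % 15 hidden neurons, 45-dimensional feature input
    b = randn(15, 1);        % biases
    y = tansig(W * f + b);   % f: one feature column from section 3.3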
[Figure: An artificial neuron]

The number of neurons in each hidden layer has a direct impact on the performance of the network, both during training and in operation. Having more neurons than the problem needs runs the network into overfitting, a situation in which the network memorizes the training examples; networks that overfit perform well on the training examples and poorly on unseen ones. Having fewer neurons than the problem needs causes underfitting, which happens when the network architecture cannot cope with the complexity of the problem at hand; underfitting results in inadequate modeling and therefore poor performance of the network.

[Figure: MLP neural network architecture]
The Backpropagation Algorithm

The backpropagation algorithm (Rumelhart and McClelland, 1986) is used in layered feed-forward ANNs: the artificial neurons are organized in layers and send their signals "forward", and the errors are then propagated backwards. The network receives inputs through the neurons of the input layer, and its output is given by the neurons of the output layer; there may be one or more intermediate hidden layers. Backpropagation uses supervised learning, which means that we provide the algorithm with examples of the inputs and the outputs we want the network to compute, and the error (the difference between the actual and expected results) is calculated. The idea of the algorithm is to reduce this error until the ANN learns the training data. Training begins with random weights, and the goal is to adjust them so that the error is minimal.

3.5 Training the classifier

After development, using the classifier involves two steps: training and testing. In the training phase the features of the samples are fed as input to the ANN and the targets are set; the network is then trained, adjusting its weights until the target is achieved for the given input. In this project we used the 'tansig' and 'logsig' transfer functions, so the outputs are bounded between 0 and 1. The target output is set to .9 .1 .1 ... .1 for the first word, .1 .9 .1 ... .1 for the second word, and so on; the position of the maximum value indicates the recognized word. A sketch of this step is given after section 3.6.

3.6 Testing the classifier

The next phase is testing. The samples set aside for testing are given to the classifier and the output is noted. If we do not get the desired output, we work toward it by adjusting the number of neurons.
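A sketch of building and training such a network with the legacy MATLAB Neural Network Toolbox syntax of the project's era. Here P is the 45-by-80 feature matrix (5 words x 16 training samples) and T the matching 5-by-80 target matrix of .9/.1 values; the layer sizes and training settings are illustrative assumptions, not the report's exact configuration:

    net = newff(minmax(P), [15 5], {'tansig', 'logsig'});  % 2-layer feed-forward net
    net.trainParam.epochs = 5000;   % assumed number of training iterations
    net.trainParam.goal = 1e-3;     % assumed error goal
    net = train(net, P, T);         % back-propagation training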
4. OBSERVATION

We recorded five Malayalam words: "onnu", "randu", "naalu", "anju" and "aaru". The words correspond to the Malayalam words for the numerals 1, 2, 4, 5 and 6 respectively. These particular words were selected because the project was intended to implement a password system based on numerals.

    Malayalam word    Numeral
    onnu              1
    randu             2
    naalu             4
    anju              5
    aaru              6
Twenty samples of each word were recorded from different people, and these samples were normalized by dividing by their maximum values. They were then decomposed with the wavelet transform up to eight levels, since the majority of the information in the signal lies in the low-frequency region. To classify the signals, an ANN was developed and trained with the outputs fixed as follows:

- If the word is 'onnu', the output is .9 .1 .1 .1 .1
- If the word is 'randu', the output is .1 .9 .1 .1 .1
- If the word is 'naalu', the output is .1 .1 .9 .1 .1
- If the word is 'anju', the output is .1 .1 .1 .9 .1
- If the word is 'aaru', the output is .1 .1 .1 .1 .9

Of the 20 samples recorded per word, 16 are used to train the ANN and the remaining 4 are used for testing. A sketch of the target coding and decision rule follows.
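The target coding above and the winner-take-all decision can be sketched as follows (columns assumed to be grouped by word in the order onnu, randu, naalu, anju, aaru; f_test is one test feature column):

    T1 = 0.1 * ones(5) + 0.8 * eye(5);  % column k: .9 in row k, .1 elsewhere
    T  = kron(T1, ones(1, 16));         % repeated for the 16 training samples per word
    y  = sim(net, f_test);              % network response to a test sample
    [~, idx] = max(y);                  % position of the maximum value...
    words = {'onnu', 'randu', 'naalu', 'anju', 'aaru'};
    recognized = words{idx};            % ...selects the recognized word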
Plots

[Plot for word 'onnu']

[Plot for word 'randu']

[Plot for word 'naalu']
[Plot for word 'anju']

[Plot for word 'aaru']
DWT Tree

The eight-level DWT decomposition tree for a signal is shown in the figure; it produces one set of approximation coefficients and eight sets of detail coefficients.

[Figure: Eight-level DWT decomposition tree]
[Decomposed waveforms for word 'randu']
[Decomposed waveforms for word 'aaru']
5. TESTING AND RESULT

Testing with pre-recorded samples

Of the 20 samples recorded for each word, 16 were used for training. We tested the program's accuracy with the 4 unused samples per word, for a total of 20 test samples (4 samples for each of the 5 words), and the program yielded the right result for all 20. We thus obtained 100% accuracy on pre-recorded samples.

Real-time testing

For real-time testing, we captured a sample through the microphone and ran the program directly on it. Of 30 samples tested in this way, 20 gave the right result, an accuracy of about 66% on real-time samples.
The change in efficiency obtained by varying the parameters of the ANN was also observed and is plotted below.

[Plot 1: Accuracy with a 2-layer feed-forward network, number of neurons in the first layer = 15]
[Plot 2: Accuracy with a 2-layer feed-forward network, number of neurons in the first layer = 20]
[Plot 3: Accuracy with a 3-layer feed-forward network, number of neurons in the first layer N1 = 15 and in the second layer N2 = 5]
7. CONCLUSION

Speech recognition is an advanced research area, and much work is under way to develop new and enhanced approaches. During the experiment we saw the effectiveness of the Daubechies 4 mother wavelet for feature extraction. We used only a limited number of samples; increasing the number of samples may give better features and a better recognition result for Malayalam word utterances. The performance of the neural network with wavelet features is appreciable, and although the software we used has some limitations, increasing the number of samples as well as the number of training iterations should produce a better recognition result. We also observed that a neural network is an effective tool that can be combined successfully with wavelets; the effectiveness of wavelet-based feature extraction with other classification methods, such as neuro-fuzzy and genetic-algorithm techniques, could be explored for the same task.

From this study we could understand and experience the effectiveness of the discrete wavelet transform in feature extraction. Our recognition results under different kinds of noise and noisy conditions show that choosing dyadic bandwidths performs better than choosing equal bandwidths in sub-band recombination. This matches the way the human ear recognizes speech and shows a useful benefit of the dyadic nature of the multi-level wavelet transform for sub-band speech recognition. The wavelet transform proved a more powerful technique for this speech-processing task than the earlier techniques considered, and the ANN proved a more successful classifier than the HMM for this task.
8. REFERENCES

[1] Vimal Krishnan V.R., Athulya Jayakumar, Babu Anto P., "Speech Recognition of Isolated Malayalam Words Using Wavelet Features and Artificial Neural Network", 4th IEEE International Symposium on Electronic Design, Test & Applications.
[2] Lawrence Rabiner, Biing-Hwang Juang, "Fundamentals of Speech Recognition", Englewood Cliffs, NJ: Prentice Hall, 1993.
[3] Stephane Mallat, "A Wavelet Tour of Signal Processing", San Diego: Academic Press, 1999.
[4] S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, pp. 674-693, 1989.
[5] K.P. Soman, K.I. Ramachandran, "Insight into Wavelets: From Theory to Practice", Second Edition, PHI, 2005.
[6] S. Kadambe, P. Srinivasan, "Application of Adaptive Wavelets for Speech", Optical Engineering 33(7), pp. 2204-2211, July 1994.
[7] Stuart Russell, Peter Norvig, "Artificial Intelligence: A Modern Approach", New Delhi: Prentice Hall of India, 2005.
[8] S.N. Sivanandam, S. Sumathi, S.N. Deepa, "Introduction to Neural Networks Using Matlab 6.0", New Delhi: Tata McGraw Hill, 2006.
[9] James A. Freeman, David M. Skapura, "Neural Networks: Algorithms, Applications, and Programming Techniques", Pearson Education, 2006.