Intelligent Character
Recognition
By
Suhas Pillai
Advisor
Ray Ptucha
Motivation
• Even in this digital age, many documents are still handwritten and need to be
scanned, e.g. legal documents, forms, receipts, bond papers, etc.
• Harder than OCR (Optical Character Recognition), because typed letters can be
recognized easily with a fixed set of rules.
• A single word can be written in any number of ways, even by the same person, so
there are many variations, making this a difficult problem to crack.
• High value for companies that design and develop scanners and printers, e.g.
Kodak, Hewlett-Packard, Canon, etc.
• Different ways of writing ‘of’ by the same person
Why not use OCR tool?
● Tesseract is a well-known open-source OCR engine.
● Widely used for OCR and maintained by Google.
● Works for around 30-40 different languages.
● Good for OCR, but not good for intelligent character (handwriting) recognition.
● For example: [Figure: an original handwritten image and the corresponding OCR output.]
Previous approaches
• Sophisticated preprocessing techniques
• Extraction of handcrafted features
• A combination of a classifier and a sequential model,
i.e. hybrid ANN/DNN-Hidden Markov models
• Sequential models like HMMs were good at providing transcriptions.
Deep Learning
Recurrent Neural Networks
• RNNs help to model local as well as global context.
• Do not require alphabet-specific preprocessing
or handcrafted features.
• Can be used for other languages and have shown
promising results (e.g. machine translation,
NLP, speech & handwriting recognition).
• Work on raw inputs (pixels).
• Globally trainable model.
• Good at handling long-term dependencies (using the LSTM cells described next).
Why LSTM cell?
General idea about LSTMs
• Solves the vanishing gradient problem and thus handles
long-term dependencies better.
• Activation is controlled by three multiplicative gates:
o Input gate
o Forget gate
o Output gate
• The gates allow the cell to store or retrieve information over time
(a minimal sketch follows below).
• Has shown state-of-the-art results for speech recognition.
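A minimal sketch of this gating mechanism for a plain one-dimensional LSTM cell in numpy; the weight names are hypothetical and peephole connections are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a plain 1-D LSTM cell. W and b hold one weight matrix and
    bias vector per gate (hypothetical names; peepholes omitted)."""
    xh = np.concatenate([x_t, h_prev])      # current input and previous hidden state
    i = sigmoid(W["i"] @ xh + b["i"])       # input gate
    f = sigmoid(W["f"] @ xh + b["f"])       # forget gate
    o = sigmoid(W["o"] @ xh + b["o"])       # output gate
    g = np.tanh(W["g"] @ xh + b["g"])       # candidate cell input
    c = f * c_prev + i * g                  # forget old / store new information
    h = o * np.tanh(c)                      # expose (part of) the cell state
    return h, c
```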
Bidirectional and Multidimensional
RNN(LSTM)
• Standard recurrent neural networks can only look back at previous time steps (left
context) to gather contextual information.
• This works well, but context from the right has also been shown to help, which gives
us BLSTMs (bidirectional LSTMs).
• RNNs are usually structured for 1D sequences, so the input is always converted to
a 1D sequence and fed to the RNN.
• So, any d-dimensional data needs to be flattened to 1D before it can be
processed by an RNN.
• To overcome this shortcoming, [1] suggested multidimensional RNNs.
[1] Graves, Alex, and Jürgen Schmidhuber. "Offline handwriting recognition with multidimensional recurrent neural networks."
Multidimensional RNNs
• Standard LSTMs are explicitly one-dimensional, with one recurrent
connection, and whether to use the information from that recurrent connection
is controlled by just one forget gate.
• In the multidimensional case, we extend this idea to n dimensions, with n recurrent
connections and the information controlled by n forget gates.
• The network starts scanning from the top left (a scan-order sketch follows after this list).
1. The thick lines show the connections to the current
point (i, j).
2. The connections within the hidden plane are
recurrent.
3. The dashed lines are previous points already
scanned by the network.
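As an illustration of this scan order, a sketch for the two-dimensional case; `mdlstm_cell` is a placeholder for the cell update defined by the equations on the following slides, not an actual implementation:

```python
import numpy as np

def scan_2d(inputs, mdlstm_cell, hidden_size):
    """Scan a 2-D input from the top-left corner. At point (i, j) the cell sees
    the local input plus the hidden and cell states of its left neighbour
    (i, j-1) and its top neighbour (i-1, j): the two recurrent connections of
    a 2-D MDLSTM."""
    H, W = inputs.shape[:2]
    h = np.zeros((H, W, hidden_size))
    c = np.zeros((H, W, hidden_size))
    zero = np.zeros(hidden_size)
    for i in range(H):                                  # top to bottom
        for j in range(W):                              # left to right
            h_prev = [h[i, j - 1] if j > 0 else zero,   # left neighbour
                      h[i - 1, j] if i > 0 else zero]   # top neighbour
            c_prev = [c[i, j - 1] if j > 0 else zero,
                      c[i - 1, j] if i > 0 else zero]
            h[i, j], c[i, j] = mdlstm_cell(inputs[i, j], h_prev, c_prev)
    return h
```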
Multidimensional Recurrent Neural Networks
• Mathematically the network is modeled using the following equations.
Calculation for input gate

i_t = \sigma( W_{xi} x_t + \sum_{d=1}^{D} ( W_{hi}^{d} h_{t-1}^{d} + W_{ci}^{d} c_{t-1}^{d} ) + b_i )

• i_t - input gate at the current time step
• \sigma - sigmoid activation
• W_{xi} - weights from the input to the hidden layer
• x_t - input to the LSTM block
• W_{hi}^{d} - weights from the previous hidden layer to the current
hidden layer's input gate
• h_{t-1}^{d} - previous hidden layer output across the D dimensions
• W_{ci}^{d} - peephole weights
• c_{t-1}^{d} - previous cell state output across the D dimensions
• b_i - input gate bias
Calculation for forget gate

f_t^{d} = \sigma( W_{xf} x_t + \sum_{d'=1}^{D} W_{hf}^{d'} h_{t-1}^{d'} + W_{cf}^{d} c_{t-1}^{d} + b_f^{d} )

• The forget gate value is calculated separately for every dimension, because it helps to
store or forget previous information based on which dimension is useful.
• \sigma - sigmoid activation
• W_{xf} - weights from the input to the forget gate
• x_t - input to the forget gate from the input layer
• W_{hf}^{d'} - weights from the previous hidden layer, across the D
dimensions, to the current hidden layer
• h_{t-1}^{d'} - hidden layer output from the previous step along each dimension
• W_{cf}^{d} - peephole weights
• c_{t-1}^{d} - cell state output of the previous step along dimension d
• b_f^{d} - forget gate bias, one per dimension
Calculation for input

g_t = \tanh( W_{xg} x_t + \sum_{d=1}^{D} W_{hg}^{d} h_{t-1}^{d} + b_g )

• g_t - output after the tanh activation. This is not the output of any of the three
gates; it is the same as the output of a fully connected layer with tanh activation.
• \tanh - tanh activation
• W_{xg} - weights from the input layer to the hidden layer
• x_t - input from the input layer to the hidden layer
• W_{hg}^{d} - weights from the previous hidden layer, across the D
dimensions, to the current hidden layer
• h_{t-1}^{d} - hidden layer output from the previous steps across the D dimensions
• b_g - bias for the input
Calculation for cell state

c_t = i_t \odot g_t + \sum_{d=1}^{D} f_t^{d} \odot c_{t-1}^{d}

• c_t - cell state of this particular LSTM block. A single LSTM block can have
multiple cell states, but one cell state usually works well in practice.
• i_t \odot g_t - decides whether the input from this particular time step is useful:
if not, the input gate value will be close to zero, otherwise close to one.
• \sum_d f_t^{d} \odot c_{t-1}^{d} - decides which dimensions are useful: if information
from dimension X is not useful, the forget gate value calculated for that dimension
will be close to zero, and multiplying it with the previous cell state of dimension X
means no information is carried forward from X.
Calculation for output gate

o_t = \sigma( W_{xo} x_t + \sum_{d=1}^{D} W_{ho}^{d} h_{t-1}^{d} + W_{co} c_t + b_o )

• o_t - output gate value at the current time step
• \sigma - sigmoid activation
• W_{xo} - weights from the input layer to the output gate
• x_t - input from the input layer to the hidden layer
• W_{ho}^{d} - weights from the previous hidden layer to the current hidden
layer, across the D dimensions
• h_{t-1}^{d} - hidden layer output of the previous steps across the D dimensions
• W_{co} - peephole weights; note that they peep at the cell state of the
current time step, c_t
• b_o - output gate bias
Calculation for output of hidden neuron

h_t = o_t \odot \tanh(c_t)

• h_t - output of the LSTM block (i.e. the hidden neuron) at the current
time step.
• o_t - output gate value; it decides whether this neuron's output should be
given as an input to the hidden layer of future time steps. If not, the value
will be close to zero, otherwise close to one.
• \tanh(c_t) - the cell state of the neuron passed through a tanh activation.
(A numpy sketch of the full cell step follows below.)
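Putting the equations above together, a minimal numpy sketch of one MDLSTM cell step. The parameter names are assumptions made for this sketch, with element-wise peephole weights and D incoming directions as in the equations; it is not the project's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mdlstm_step(x_t, h_prev, c_prev, p, D=2):
    """One MDLSTM cell step at a single scan point.

    x_t    : input vector at the current point
    h_prev : list of D hidden-state vectors from the previous step along each dimension
    c_prev : list of D cell-state vectors, one per dimension
    p      : dict of parameters (hypothetical names), e.g. p["Whi"][d] is the
             recurrent input-gate matrix for dimension d
    """
    # Input gate: one gate, summing recurrent and peephole terms over the D dimensions.
    i = sigmoid(p["Wxi"] @ x_t
                + sum(p["Whi"][d] @ h_prev[d] + p["Wci"][d] * c_prev[d] for d in range(D))
                + p["bi"])

    # Forget gates: one per dimension, each peeping only at that dimension's
    # previous cell state.
    f = [sigmoid(p["Wxf"] @ x_t
                 + sum(p["Whf"][k] @ h_prev[k] for k in range(D))
                 + p["Wcf"][d] * c_prev[d]
                 + p["bf"][d])
         for d in range(D)]

    # Cell input (tanh-squashed candidate).
    g = np.tanh(p["Wxg"] @ x_t
                + sum(p["Whg"][d] @ h_prev[d] for d in range(D))
                + p["bg"])

    # Cell state: gated candidate plus the per-dimension gated carry-over.
    c = i * g + sum(f[d] * c_prev[d] for d in range(D))

    # Output gate peeps at the *current* cell state.
    o = sigmoid(p["Wxo"] @ x_t
                + sum(p["Who"][d] @ h_prev[d] for d in range(D))
                + p["Wco"] * c
                + p["bo"])

    return o * np.tanh(c), c    # hidden output h_t and new cell state c_t
```

With the parameters bound (e.g. functools.partial(mdlstm_step, p=params)), this function can stand in for the `mdlstm_cell` placeholder of the earlier scan sketch.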
CTC (Connectionist Temporal Classification)
• Previous approaches to training end-to-end handwriting recognition systems
involved segmenting the input and aligning it with the ground truth.
• As a result, forced alignment was required, which is prone to errors,
and those errors propagate into the training.
• To overcome this issue, we use Connectionist Temporal
Classification (CTC).
• It provides two advantages:
1. We can have variable-length input, with no need for forced alignment.
2. The CTC loss function is differentiable, hence the system is end-to-end trainable.
Difference between Forced Alignment & CTC
[Figure: Forced Alignment vs. CTC Alignment of the same input.]
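As a small illustrative sketch (not from the slides) of what a CTC alignment means: the network outputs one symbol (or blank) per frame, and the transcription is obtained by merging repeats and dropping blanks, so no frame-level forced alignment is ever needed.

```python
def ctc_collapse(path, blank="-"):
    """Collapse a frame-by-frame CTC output path into a transcription:
    merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Many different frame-level paths map to the same transcription.
print(ctc_collapse("--hh-ee--ll-ll--oo--"))   # -> "hello"
print(ctc_collapse("h--e--l-l---o-------"))   # -> "hello"
```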
CTC Cost Function and Intuition
• The objective function is the negative log likelihood of correctly labelling
the entire training set.
• x - training sample
S - training set
z - target label sequence
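In the standard CTC formulation this objective is written as

O = -\sum_{(x, z) \in S} \ln p(z \mid x)

i.e. the sum, over the training set, of the negative log probability of the correct label sequence given the input.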
CTC Cost Function and Intuition
• Differentiating the objective with respect to the network output gives \partial O / \partial y_k^t,
where y_k^t is the output at time step 't' for the k-th label.
• p(z|x) is the probability of all the paths that can be formed for a particular input. A speech
recognition analogy: suppose you are saying the word 'Robocop'. There are many ways of
saying it, like Roooooooooobooocop, Robocoooooop, or Robo <pause> cop. p(z|x) is the total
probability of all the sequences that can be formed for a given word. Since the number of
possible paths (words/sequences) can be exponential, we use the forward-backward algorithm
to find the total probability of the paths. More on this in the next slides.
Forward Backward Algorithm
[Figure: CTC lattice for the forward-backward algorithm. Rows are the blank-interleaved
labels, columns are the time steps t = 0 ... 4; the left panel shows the forward path
counts (alphas), the right panel the backward path counts (betas).]
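A small sketch of how such forward and backward path counts can be computed; the target labels and the number of time steps below are placeholders, not the slide's exact lattice:

```python
import numpy as np

def extend_with_blanks(target, blank=0):
    """Interleave blanks: [a, b] -> [blank, a, blank, b, blank]."""
    ext = [blank]
    for label in target:
        ext += [label, blank]
    return ext

def count_paths(target, T, blank=0):
    """For every node (t, s) of the CTC lattice, count the valid alignments of
    length T that reach it from the start (alphas) and that reach a valid end
    point from it (betas)."""
    ext = extend_with_blanks(target, blank)
    S = len(ext)
    alpha = np.zeros((T, S), dtype=int)
    beta = np.zeros((T, S), dtype=int)

    # Forward pass: a path may start at the first blank or the first label.
    alpha[0, 0] = 1
    alpha[0, 1] = 1
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]                                # stay on the same node
            if s > 0:
                total += alpha[t - 1, s - 1]                       # advance by one node
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]                       # skip the blank between two different labels
            alpha[t, s] = total

    # Backward pass: a path may end at the last blank or the last label.
    beta[T - 1, S - 1] = 1
    beta[T - 1, S - 2] = 1
    for t in range(T - 2, -1, -1):
        for s in range(S - 1, -1, -1):
            total = beta[t + 1, s]
            if s < S - 1:
                total += beta[t + 1, s + 1]
            if s < S - 2 and ext[s] != blank and ext[s] != ext[s + 2]:
                total += beta[t + 1, s + 2]
            beta[t, s] = total

    return alpha, beta

# Toy example: a two-label word over 5 time steps.
alpha, beta = count_paths(target=[1, 2], T=5)
print(alpha * beta)   # number of complete paths through each node at each time step
```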
CTC Cost Function and Intuition
• The alpha*beta products give the total number of paths that pass through
each node at a given time step. For example, at time step t = 2,
alpha for C = 3 and beta for C = 1, so the total number of paths going through C
at t = 2 is 3. This is how we count all the paths going through every
node at each time step.
• y_k^t is the output at time step 't' for label 'k'.
• We sum across occurrences of the same label: if
your ground-truth word is KITKAT,
then K appears twice, so you sum across both occurrences of label 'K'.
• alpha_t(s) * beta_t(s) is the probability of all the paths
that pass through label 's' at time t. So, if you have 10 labels
and your softmax outputs equal probabilities, i.e. 0.1 at each
time step, then the probability of all paths through 'A' is 16 * (0.1^5), for 5 time steps.
CTC Cost Function and Intuition
• To backpropagate the gradients, we need the gradient with respect to the output
before the softmax activation is applied.
• The softmax derivative \partial y_{k'}^t / \partial a_k^t = y_{k'}^t (\delta_{kk'} - y_k^t)
is used here, where k' ranges over all the labels and k is the k-th label.
• Finally, we arrive at the following expression for the gradient with respect to the output
before the activation is applied:

\partial O / \partial a_k^t = y_k^t - (1 / p(z|x)) \sum_{s \in lab(z,k)} \alpha_t(s) \beta_t(s)

• For example, take the gradient with respect to the activation at 'A', considering that
there are 10 labels and all initially output probability 0.1. Then, for label 'A':
• p(z|x) = 0 + (3 * (0.1^5)) + (3 * (0.1^5)) + (16 * (0.1^5)) + (3 * (0.1^5)) + (3 * (0.1^5)) + 0,
alpha * beta for 'A' = (16 * (0.1^5)), and the output activation of label 'A' at time t is 0.1,
so the gradient value is -0.4714.
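The arithmetic of this example can be checked in a few lines of Python; the path counts 3, 3, 16, 3, 3 are the alpha*beta products read off the example lattice:

```python
# Numbers from the slide's example: 10 labels, uniform softmax outputs of 0.1,
# 5 time steps, and alpha*beta path counts of 3, 3, 16, 3, 3 across the nodes.
y_A = 0.1
p_z_given_x = (3 + 3 + 16 + 3 + 3) * 0.1 ** 5    # total path probability = 2.8e-4
alpha_beta_A = 16 * 0.1 ** 5                     # probability of paths through 'A' = 1.6e-4

# CTC gradient w.r.t. the pre-softmax output for 'A' at time t:
grad_A = y_A - alpha_beta_A / p_z_given_x
print(round(grad_A, 4))                          # -0.4714, as on the slide
```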
Stacking MDLSTMs
Architecture
Louradour, Jérôme, and Christopher Kermorvant. "Curriculum learning for handwritten text line recognition." Document Analysis Systems (DAS), 2014 11th IAPR International
Workshop on. IEEE, 2014.
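For orientation, a hedged outline of the kind of stacked pipeline used in this line of work; the layer ordering follows the cited papers in spirit, but the tile sizes and layer counts below are illustrative assumptions, not the exact configuration of this project:

```python
# Illustrative stacked-MDLSTM pipeline (sizes and counts are assumptions):
#
#   text-line image
#     -> tile into small input blocks (e.g. a few pixels each)
#     -> MDLSTM layer scanning the image in all four diagonal directions
#     -> feedforward (tanh) layer that also subsamples the feature map
#     -> repeat MDLSTM + feedforward blocks with growing hidden sizes
#     -> collapse the vertical dimension (sum over the image height)
#     -> softmax over the character set plus the CTC blank
#     -> CTC loss for training / CTC decoding for transcription
```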
Results
• Trained and tested on the IAM Handwriting Database using
(a) Python code written from scratch
(b) the RNNlib library
1. Training data: 80K
2. Validation data: 20K
3. Testing data: 15K
• NCER % (Normalized Character Error Rate)
1. Training NCER: 15.5 %
2. Testing NCER: 15 %
3. Testing NCER with lexicon: 12.60 %
• Some examples from the database
Errors made by the network
Thank You
Questions?