SlideShare une entreprise Scribd logo
1  sur  16
Télécharger pour lire hors ligne
3 Training Tesseract
An Introduction to the Training
Process
Ray Smith, Google Inc.
Tesseract Tutorial: DAS 2014 Tours France
The Big Picture
Web Crawl
Repository
Language ID
Map-Reduce
Eng
Dirty Language
Corpora
Cleaned Language
Corpora
Text Filtration
Eng
Language Model
Generation
Realistic Text
Rendering
OCR Engine
Training
Eng
Eng
OCR Shape Files
Language Model Files
Eng
Manually generated Files
Tesseract Tutorial: DAS 2014 Tours France
Language Model
Generation
Realistic Text
Rendering
OCR Engine
Training
Eng
Eng
OCR Shape Files
Language Model Files
Bigram list
Word list
Realistic Training text
Punctuation patterns
combine_tessdata
wordlist2dawg
mftraining
cntraining
shapetraining
unicharset_extractor
set_unicharset_properties
Eng
Manually generated Files
text2image
<lang>.traineddata
The Open Source Parts
Tesseract Tutorial: DAS 2014 Tours France
Training Fundamentals
● Character samples must be segregated by font
=> Trained on synthetic (rendered, distorted) data
● Few samples required (4-10 of each combination is good. 1 is OK)
● Not many fonts required. (32 used for Latin)
● Not many fonts allowed. (MAX_NUM_CONFIGS=64: Long story.)
● Number of different “characters” now limited only by memory.
Tesseract Tutorial: DAS 2014 Tours France
What Data Needs to be Created by Training? (1)
Name Type Status Creator Description
config Text Optional Manual Lang-specific engine settings if needed
unicharset Text Mandatory unicharset_extractor The set of recognizable units
unicharambigs Text Optional Manual* Intrinsic ambiguities for the language
inttemp Binary Mandatory mftraining Classifier shape data
pffmtable Text Mandatory mftraining Extra classifier data (num expected features)
normproto Text Mandatory cntraining Classifier baseline position info
cube-unicharset Text Optional unicharset_extractor Cube’s set of recognizable units
shapetable Binary Optional shapetraining Indirection between classifier and unicharset
params-model Text Optional Google tool Alternative method for combining LM & classifier
Tesseract Tutorial: DAS 2014 Tours France
Inside tesstrain.sh
Input Program Output
Realistic Training Text text2image *.tif, *.box
*.box unicharset_extractor unicharset
unicharset, <script>.unicharset set_unicharset_properties unicharset
Word List wordlist2dawg word-dawg
Frequent Word List wordlist2dawg freq-dawg
*.tif, *.box tesseract *.tr
*.tr cntraining normproto
*.tr mftraining inttemp, pffmtable
unicharset, dawgs, normproto, inttemp, pffmtable, config combine_tessdata traineddata
Tesseract Tutorial: DAS 2014 Tours France
Training Text:
Real [frequent]
Words covering
whole character
set
Rendering +
Image
Degradation
Training
Images +
Box Files
Fonts
Tesseract
box.train
Unicharset
Extractor
UniversalUnicharsetProperties Script
Unicharsets
Set Unicharset
Properties
Initial
UnicharsetExtracted Features (.tr files)
All from correctly segmented characters
Text Corpus
Training Data
Text Filtration
Filtered
Corpus
Text To
Wordlist
cn Training mf Training
inttemppffmtablenormproto
Words Sorted
by Frequency
Wordlist 2
DAWG
Combine Tessdata Traineddata
Overview of Tesseract Training Process
NOT open source
Ambigs
Training
Unichar
Ambigs
punc
dawg
word
dawg
number
dawg
freq
dawg
bigram
dawg
Tesseract Tutorial: DAS 2014 Tours France
Language-Specific Data
● Training text: defines the character set.
● Wordlists: define the language model. (Including bigrams when
present.)
● Pango layout: defines the grapheme clusters (recognition units).
Eg: 0xca6 + 0xccd + 0xca6 + 0xcc7 ->
● ICU: determines what is right-to-left.
● Script.unicharset: Stores typical font metrics for unicode chars.
● Config files: in training/langdata/<lang>/<lang>.config.
● Vertical rendering: Determined manually.
Tesseract Tutorial: DAS 2014 Tours France
MFtraining
Input: font.tr files (Stored Features with utf-8 labels)
Output: inttemp (knn classifier data)
pffmtable (helper information for class pruner)
Operation:
1. Independently cluster features in each font/char class combination.
2. Combine similar cluster means across fonts (single char class).
3. Define each font/char class as a combination of cluster means (a
font config).
4. Build class pruner and main knn classifier.
Tesseract Tutorial: DAS 2014 Tours France
Clustering Result
Protos of Arial ‘A’ Protos of Times Italic ‘A’
Tesseract Tutorial: DAS 2014 Tours France
CNtraining
Input: font.tr files (Stored Features with utf-8 labels)
Output: normproto (GMM means of the CN feature)
Operation:
1. Independently cluster the CN feature of all fonts for each char
class.
2. Write cluster means (a Gaussian Mixture Model) to normproto file.
Tesseract Tutorial: DAS 2014 Tours France
Shapetraining
Input: font.tr files (Stored Features with utf-8 labels)
Output: shapetable (Mapping from an index to a collection of unichar-
ids, fonts)
Operation:
1. Cluster across all fonts and all char classes.
2. Merge ambiguous classes into shapes.
3. Works OK for Indic, but not so good for others.
Tesseract Tutorial: DAS 2014 Tours France
Training Text:
Real [frequent]
Words covering
whole character
set
Rendering +
Image
Degradation
Training
Images +
Box Files
Fonts
Tesseract
box.train
Shape
Clustering
Unicharset
Extractor
UniversalUnicharsetProperties Script
Unicharsets
Set Unicharset
Properties
Initial
UnicharsetChopped
Fragments
Natural
Fragments
Correctly
Segmented
Naturally
touching
Extracted Features (.tr files)
Text Corpus
Training Data
Text filtration
Filtered
Corpus
Text To
Wordlist
Correctly
Segmented
Validation
Set
Junk
Master Trainer FileShape
Unicharset
Shape Table
cn Training mf Training
inttemppffmtablenormproto
Words Sorted
by Frequency
Wordlist 2
DAWG
punc
dawg
word
dawg
number
dawg
freq
dawg
Combine Tessdata Traineddata
Overview of Tesseract Training Process with Shapes
NOT open source
bigram
dawg
Tesseract Tutorial: DAS 2014 Tours France
DAWGs (Directed Acyclic Word Graph)
From “The world’s fastest Scrabble program” A. W. Appel, G.J. Jacobson,
CACM 31(5) May 1988, pp572-585:
“The lexicon represented as a raw word list takes about 780 Kbytes, while our dawg can be represented in 175 Kbytes. The
relatively small size of this data structure allows us to keep it entirely in core, even on a fairly modest computer.”
Trie: Dawg:
Tesseract Tutorial: DAS 2014 Tours France
What Data Needs to be Created by Training? (2)
Name Type Status Creator Description
punc-dawg Binary Optional wordlist2dawg Patterns of punctuation around words
word-dawg Binary Optional wordlist2dawg Main word-list/dictionary language model
number-dawg Binary Optional wordlist2dawg Acceptable number patterns (with units?)
freq-dawg Binary Optional wordlist2dawg Shorter dictionary of frequent words
fixed-length-dawgs Binary Deprecated wordlist2dawg Was used for CJK
cube-word-dawg Binary Optional wordlist2dawg Main word-list/dictionary language model for cube
bigram-dawg Binary Optional wordlist2dawg Word bigram language model
unambig-dawg Binary Optional wordlist2dawg List of unambiguous words (not used?)
Tesseract Tutorial: DAS 2014 Tours France
Thanks for Listening!
Questions?

Contenu connexe

Similaire à 3 training

8 modernization efforts
8 modernization efforts8 modernization efforts
8 modernization effortsSolin TEM
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Pythontswr
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...AI Frontiers
 
Text classification with fast text elena_meetup_milano_27_june
Text classification with fast text elena_meetup_milano_27_juneText classification with fast text elena_meetup_milano_27_june
Text classification with fast text elena_meetup_milano_27_juneDeep Learning Italia
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonRalf Gommers
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLPBill Liu
 
Tag Extraction Final Presentation - CS185CSpring2014
Tag Extraction Final Presentation - CS185CSpring2014Tag Extraction Final Presentation - CS185CSpring2014
Tag Extraction Final Presentation - CS185CSpring2014Naoki Nakatani
 
1 intro history
1 intro history1 intro history
1 intro historySolin TEM
 
Reproducibility with R
Reproducibility with RReproducibility with R
Reproducibility with RMartin Jung
 
Language translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlowLanguage translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlowS N
 
Natural language processing open seminar For Tensorflow usage
Natural language processing open seminar For Tensorflow usageNatural language processing open seminar For Tensorflow usage
Natural language processing open seminar For Tensorflow usagehyunyoung Lee
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...Victor Giannakouris
 
Separation of Concerns in Language Definition
Separation of Concerns in Language DefinitionSeparation of Concerns in Language Definition
Separation of Concerns in Language DefinitionEelco Visser
 
Text Mining Project: Identification of Age and Gender in Social Networks
Text Mining Project: Identification of Age and Gender in Social NetworksText Mining Project: Identification of Age and Gender in Social Networks
Text Mining Project: Identification of Age and Gender in Social NetworksLucas M
 
PhD Presentation
PhD PresentationPhD Presentation
PhD Presentationmskayed
 

Similaire à 3 training (20)

8 modernization efforts
8 modernization efforts8 modernization efforts
8 modernization efforts
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
 
Text classification with fast text elena_meetup_milano_27_june
Text classification with fast text elena_meetup_milano_27_juneText classification with fast text elena_meetup_milano_27_june
Text classification with fast text elena_meetup_milano_27_june
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Tamil OCR using Tesseract OCR Engine
Tamil OCR using Tesseract OCR EngineTamil OCR using Tesseract OCR Engine
Tamil OCR using Tesseract OCR Engine
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Feature Engineering in NLP.pdf
Feature Engineering in NLP.pdfFeature Engineering in NLP.pdf
Feature Engineering in NLP.pdf
 
C3 w2
C3 w2C3 w2
C3 w2
 
Tag Extraction Final Presentation - CS185CSpring2014
Tag Extraction Final Presentation - CS185CSpring2014Tag Extraction Final Presentation - CS185CSpring2014
Tag Extraction Final Presentation - CS185CSpring2014
 
1 intro history
1 intro history1 intro history
1 intro history
 
Reproducibility with R
Reproducibility with RReproducibility with R
Reproducibility with R
 
Language translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlowLanguage translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlow
 
NLP
NLPNLP
NLP
 
Natural language processing open seminar For Tensorflow usage
Natural language processing open seminar For Tensorflow usageNatural language processing open seminar For Tensorflow usage
Natural language processing open seminar For Tensorflow usage
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
 
Separation of Concerns in Language Definition
Separation of Concerns in Language DefinitionSeparation of Concerns in Language Definition
Separation of Concerns in Language Definition
 
Text Mining Project: Identification of Age and Gender in Social Networks
Text Mining Project: Identification of Age and Gender in Social NetworksText Mining Project: Identification of Age and Gender in Social Networks
Text Mining Project: Identification of Age and Gender in Social Networks
 
PhD Presentation
PhD PresentationPhD Presentation
PhD Presentation
 

Plus de Solin TEM

CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1Solin TEM
 
cs15 slide_v2
cs15 slide_v2cs15 slide_v2
cs15 slide_v2Solin TEM
 
4 downloading
4 downloading4 downloading
4 downloadingSolin TEM
 
2 architecture anddatastructures
2 architecture anddatastructures2 architecture anddatastructures
2 architecture anddatastructuresSolin TEM
 
0 tutorial contents
0 tutorial contents0 tutorial contents
0 tutorial contentsSolin TEM
 
Khmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_dayKhmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_daySolin TEM
 
Khmer ocr using gfd
Khmer ocr using gfdKhmer ocr using gfd
Khmer ocr using gfdSolin TEM
 
Khmer ocr scientificday_itc
Khmer ocr scientificday_itcKhmer ocr scientificday_itc
Khmer ocr scientificday_itcSolin TEM
 
Khmer ocr itc
Khmer ocr itcKhmer ocr itc
Khmer ocr itcSolin TEM
 

Plus de Solin TEM (9)

CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1
 
cs15 slide_v2
cs15 slide_v2cs15 slide_v2
cs15 slide_v2
 
4 downloading
4 downloading4 downloading
4 downloading
 
2 architecture anddatastructures
2 architecture anddatastructures2 architecture anddatastructures
2 architecture anddatastructures
 
0 tutorial contents
0 tutorial contents0 tutorial contents
0 tutorial contents
 
Khmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_dayKhmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_day
 
Khmer ocr using gfd
Khmer ocr using gfdKhmer ocr using gfd
Khmer ocr using gfd
 
Khmer ocr scientificday_itc
Khmer ocr scientificday_itcKhmer ocr scientificday_itc
Khmer ocr scientificday_itc
 
Khmer ocr itc
Khmer ocr itcKhmer ocr itc
Khmer ocr itc
 

Dernier

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Dernier (20)

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

3 training

  • 1. 3 Training Tesseract An Introduction to the Training Process Ray Smith, Google Inc.
  • 2. Tesseract Tutorial: DAS 2014 Tours France The Big Picture Web Crawl Repository Language ID Map-Reduce Eng Dirty Language Corpora Cleaned Language Corpora Text Filtration Eng Language Model Generation Realistic Text Rendering OCR Engine Training Eng Eng OCR Shape Files Language Model Files Eng Manually generated Files
  • 3. Tesseract Tutorial: DAS 2014 Tours France Language Model Generation Realistic Text Rendering OCR Engine Training Eng Eng OCR Shape Files Language Model Files Bigram list Word list Realistic Training text Punctuation patterns combine_tessdata wordlist2dawg mftraining cntraining shapetraining unicharset_extractor set_unicharset_properties Eng Manually generated Files text2image <lang>.traineddata The Open Source Parts
  • 4. Tesseract Tutorial: DAS 2014 Tours France Training Fundamentals ● Character samples must be segregated by font => Trained on synthetic (rendered, distorted) data ● Few samples required (4-10 of each combination is good. 1 is OK) ● Not many fonts required. (32 used for Latin) ● Not many fonts allowed. (MAX_NUM_CONFIGS=64: Long story.) ● Number of different “characters” now limited only by memory.
  • 5. Tesseract Tutorial: DAS 2014 Tours France What Data Needs to be Created by Training? (1) Name Type Status Creator Description config Text Optional Manual Lang-specific engine settings if needed unicharset Text Mandatory unicharset_extractor The set of recognizable units unicharambigs Text Optional Manual* Intrinsic ambiguities for the language inttemp Binary Mandatory mftraining Classifier shape data pffmtable Text Mandatory mftraining Extra classifier data (num expected features) normproto Text Mandatory cntraining Classifier baseline position info cube-unicharset Text Optional unicharset_extractor Cube’s set of recognizable units shapetable Binary Optional shapetraining Indirection between classifier and unicharset params-model Text Optional Google tool Alternative method for combining LM & classifier
  • 6. Tesseract Tutorial: DAS 2014 Tours France Inside tesstrain.sh Input Program Output Realistic Training Text text2image *.tif, *.box *.box unicharset_extractor unicharset unicharset, <script>.unicharset set_unicharset_properties unicharset Word List wordlist2dawg word-dawg Frequent Word List wordlist2dawg freq-dawg *.tif, *.box tesseract *.tr *.tr cntraining normproto *.tr mftraining inttemp, pffmtable unicharset, dawgs, normproto, inttemp, pffmtable, config combine_tessdata traineddata
  • 7. Tesseract Tutorial: DAS 2014 Tours France Training Text: Real [frequent] Words covering whole character set Rendering + Image Degradation Training Images + Box Files Fonts Tesseract box.train Unicharset Extractor UniversalUnicharsetProperties Script Unicharsets Set Unicharset Properties Initial UnicharsetExtracted Features (.tr files) All from correctly segmented characters Text Corpus Training Data Text Filtration Filtered Corpus Text To Wordlist cn Training mf Training inttemppffmtablenormproto Words Sorted by Frequency Wordlist 2 DAWG Combine Tessdata Traineddata Overview of Tesseract Training Process NOT open source Ambigs Training Unichar Ambigs punc dawg word dawg number dawg freq dawg bigram dawg
  • 8. Tesseract Tutorial: DAS 2014 Tours France Language-Specific Data ● Training text: defines the character set. ● Wordlists: define the language model. (Including bigrams when present.) ● Pango layout: defines the grapheme clusters (recognition units). Eg: 0xca6 + 0xccd + 0xca6 + 0xcc7 -> ● ICU: determines what is right-to-left. ● Script.unicharset: Stores typical font metrics for unicode chars. ● Config files: in training/langdata/<lang>/<lang>.config. ● Vertical rendering: Determined manually.
  • 9. Tesseract Tutorial: DAS 2014 Tours France MFtraining Input: font.tr files (Stored Features with utf-8 labels) Output: inttemp (knn classifier data) pffmtable (helper information for class pruner) Operation: 1. Independently cluster features in each font/char class combination. 2. Combine similar cluster means across fonts (single char class). 3. Define each font/char class as a combination of cluster means (a font config). 4. Build class pruner and main knn classifier.
  • 10. Tesseract Tutorial: DAS 2014 Tours France Clustering Result Protos of Arial ‘A’ Protos of Times Italic ‘A’
  • 11. Tesseract Tutorial: DAS 2014 Tours France CNtraining Input: font.tr files (Stored Features with utf-8 labels) Output: normproto (GMM means of the CN feature) Operation: 1. Independently cluster the CN feature of all fonts for each char class. 2. Write cluster means (a Gaussian Mixture Model) to normproto file.
  • 12. Tesseract Tutorial: DAS 2014 Tours France Shapetraining Input: font.tr files (Stored Features with utf-8 labels) Output: shapetable (Mapping from an index to a collection of unichar- ids, fonts) Operation: 1. Cluster across all fonts and all char classes. 2. Merge ambiguous classes into shapes. 3. Works OK for Indic, but not so good for others.
  • 13. Tesseract Tutorial: DAS 2014 Tours France Training Text: Real [frequent] Words covering whole character set Rendering + Image Degradation Training Images + Box Files Fonts Tesseract box.train Shape Clustering Unicharset Extractor UniversalUnicharsetProperties Script Unicharsets Set Unicharset Properties Initial UnicharsetChopped Fragments Natural Fragments Correctly Segmented Naturally touching Extracted Features (.tr files) Text Corpus Training Data Text filtration Filtered Corpus Text To Wordlist Correctly Segmented Validation Set Junk Master Trainer FileShape Unicharset Shape Table cn Training mf Training inttemppffmtablenormproto Words Sorted by Frequency Wordlist 2 DAWG punc dawg word dawg number dawg freq dawg Combine Tessdata Traineddata Overview of Tesseract Training Process with Shapes NOT open source bigram dawg
  • 14. Tesseract Tutorial: DAS 2014 Tours France DAWGs (Directed Acyclic Word Graph) From “The world’s fastest Scrabble program” A. W. Appel, G.J. Jacobson, CACM 31(5) May 1988, pp572-585: “The lexicon represented as a raw word list takes about 780 Kbytes, while our dawg can be represented in 175 Kbytes. The relatively small size of this data structure allows us to keep it entirely in core, even on a fairly modest computer.” Trie: Dawg:
  • 15. Tesseract Tutorial: DAS 2014 Tours France What Data Needs to be Created by Training? (2) Name Type Status Creator Description punc-dawg Binary Optional wordlist2dawg Patterns of punctuation around words word-dawg Binary Optional wordlist2dawg Main word-list/dictionary language model number-dawg Binary Optional wordlist2dawg Acceptable number patterns (with units?) freq-dawg Binary Optional wordlist2dawg Shorter dictionary of frequent words fixed-length-dawgs Binary Deprecated wordlist2dawg Was used for CJK cube-word-dawg Binary Optional wordlist2dawg Main word-list/dictionary language model for cube bigram-dawg Binary Optional wordlist2dawg Word bigram language model unambig-dawg Binary Optional wordlist2dawg List of unambiguous words (not used?)
  • 16. Tesseract Tutorial: DAS 2014 Tours France Thanks for Listening! Questions?