SlideShare une entreprise Scribd logo
1  sur  34
Télécharger pour lire hors ligne
Tesseract Tutorial: DAS 2014 Tours France
1. Motivation and History of
the Tesseract OCR Engine
Lessons Learned from What Worked
and What Didn't
Ray Smith, Google Inc.
Tesseract Tutorial: DAS 2014 Tours France
What is Tesseract?
Tesseract Tutorial: DAS 2014 Tours France
Tesseract Timeline
4MB 16MB 64MB 256MB 3GB 24GB
10MHz 25MHz 100MHz 166MHz 3.5GHz 12 x 3.5GHz
Development machine Memory/Speed:
3.02
3.01
3.00
2.01-2
2.03
2.04
1.01-4
Tesseract Tutorial: DAS 2014 Tours France
The Stealth Era: 1984-1994
Tesseract Tutorial: DAS 2014 Tours France
Dest has a scanner with built-in OCR with >99.5%
accuracy.
Office equipment trade-show:
Earl's Court, London late 1984
Tesseract Tutorial: DAS 2014 Tours France
Dest has a scanner with built-in OCR with >99.5%
accuracy.
Caveat: It only recognizes 8 fonts from IBM Selectric
"golf-ball" typewriters.
Office equipment trade-show:
Earl's Court, London late 1984
Tesseract Tutorial: DAS 2014 Tours France
Office equipment trade-show:
Earl's Court, London late 1984
For documents
conventional OCR can't
handle, Kurzweil offers a
better approach: ICR.
For almost two decades, optical
character recognition systems have been
widely used to provide automated text
entry into computerised systems. Yet in
all this time, conventional OCR systems
have never overcome their in- ability to
read more than a handful of type fonts
and page formats. Propor- tionally
spaced type (which includes vir- tually all
typeset copy), laser printer fonts, and
even many non-proportional typewriter
fonts, have remained beyond the reach of
these systems. And as a result,
conventional OCR has never achieved
more than a marginal impact on the total
number of documents needing
conversion into digital form.
Tesseract Tutorial: DAS 2014 Tours France
Early Lessons:
Tesseract Tutorial: DAS 2014 Tours France
Early Lessons:
● Be able to OCR your own sales literature!
Tesseract Tutorial: DAS 2014 Tours France
Early Lessons:
● Be able to OCR your own sales literature!
● OCR should distinguish text from non-text.
Tesseract Tutorial: DAS 2014 Tours France
Early Lessons:
● Be able to OCR your own sales literature!
● OCR should distinguish text from non-text.
● OCR should read white-on gray and black-on-gray as easily as
black-on-white.
Tesseract Tutorial: DAS 2014 Tours France
Early Lessons:
● Be able to OCR your own sales literature!
● OCR should distinguish text from non-text.
● OCR should read white-on gray and black-on-gray as easily as
black-on-white.
Decision:
Extract outlines from grayscale images and features from the
outlines.
● Profound impact.
● Key differentiator.
Tesseract Tutorial: DAS 2014 Tours France
Connected component outlines vs bounding boxes:
Ninja often
● Bounding boxes of characters often overlap without the
characters actually touching.
Tesseract Tutorial: DAS 2014 Tours France
How to extract features from Outlines?
Outline Skeleton Topological Features
Tesseract Tutorial: DAS 2014 Tours France
Skeletonization is Unreliable
Outline: Serifed Skeleton: Decorated
Arrrrh!
Lesson: If there are a lot of papers on a topic, there is most likely no
good solution, at least not yet, so try to use something else.
Tesseract Tutorial: DAS 2014 Tours France
Topological features are Brittle
Damage to 'o' produces vastly different feature sets:
Standard 'o'
Broken 'o'
Filled 'o'
Lesson: Features must be as invariant as possible to as many as
possible of the expected degradations.
Tesseract Tutorial: DAS 2014 Tours France
Shrinking features and inappropriate statistics
Segments of the polygonal Even smaller features
Approximation
Statistical:
Geometric:
Lesson: Statistical Independence is difficult to dodge.
Tesseract Tutorial: DAS 2014 Tours France
Now it's Too Slow!
~2000 characters per page x
~100 character classes (English) x
32 fonts x
~20 prototype features x
~100 unknown features x
3 feature dimensions
= 38bn distance calculations per page...
Tesseract Tutorial: DAS 2014 Tours France
Now it's Too Slow!
~2000 characters per page x
~100 character classes (English) x
32 fonts x
~20 prototype features x
~100 unknown features x
3 feature dimensions
= 38bn distance calculations per page...
... on a 25MHz machine.
Tesseract Tutorial: DAS 2014 Tours France
What Were Other OCR Systems doing in 1988-1990?
Calera Wordscan: Required a $2000 "real computer" add-on board to
run a fixed dimension feature space on a PC (that used a 12.5MHz
80286 in those days):
Recognita: Decision tree using topological-like features. Very fast in
software. Accuracy degraded very fast on poor quality.
Neural Networks: Were just becoming popular. Would there be a
startup threat?
Tesseract Tutorial: DAS 2014 Tours France
Testing
Summary:
● Small test sets are meaningless -> Get lots of compute power and
data.
● Test on very different data to the training data.
● Test every change.
● What you measure improves -> Measure lots of dimensions.
● If it can break it will.
● Make your code faster while waiting for tests to run.
● Think/test out of the box...
Tesseract Tutorial: DAS 2014 Tours France
Creative testing
Christmas 1992: Company shutdown for 10 days. What
should we do with all that compute power that would
otherwise be wasted?
Tesseract Tutorial: DAS 2014 Tours France
Creative testing
Christmas 1992: Company shutdown for 10 days. What
should we do with all that compute power that would
otherwise be wasted?
● Tune the hand-tuned parameters with a genetic
algorithm.
● Process the 400 image test set thresholded at each
possible threshold in the range 32-224
Tesseract Tutorial: DAS 2014 Tours France
Thresholding Experiment Results
● Exposed several bugs
● Some fascinating accuracy results
0 Image Threshold 255
0Errorrate
-- Expected Results
-- Observed Results
Tesseract Tutorial: DAS 2014 Tours France
The Dark Ages: 1995-2005: Gathering Dust
Tesseract Tutorial: DAS 2014 Tours France
The Dark Ages
Tesseract:
● Ported to Windows.
● Moore's law:
○ 20x Memory Capacity.
○ 20x Speed.
Commercial Engines:
● 1994: Caere buys Calera.
● 1995: First use of character n-
grams in commercial system.
(Wordscan 3.1).
● 1996: Caere buys Recognita.
● 2000: Scansoft buys Caere.
● Voting improves accuracy, but
engine is much slower.
Tesseract Tutorial: DAS 2014 Tours France
2005: HP Open Sources Tesseract
Tesseract Tutorial: DAS 2014 Tours France
The Open Source Era: 2006 - Present
Tesseract source is available from:
http://code.google.com/p/tesseract-ocr
and for direct install on Linux via:
apt-get install tesseract-ocr
60+ Languages available.
Tesseract Tutorial: DAS 2014 Tours France
Open Source OCR is Used for Various Applications
From my inbox:
Tesseract Tutorial: DAS 2014 Tours France
Open Source OCR is Used for Various Applications
My reply:
Tesseract Tutorial: DAS 2014 Tours France
Open Source OCR is Used for Various Applications
He doesn't give up that easily:
Tesseract Tutorial: DAS 2014 Tours France
Recent Improvements
● Internationalize to 60+ Languages.
● Add Full Layout Analysis.
● Table Detection.
● Equation Detection.
● Better Language Models.
● Improved Segmentation Search.
● Word Bigrams.
Tesseract Tutorial: DAS 2014 Tours France
Results
Language Ground Truth
Chars (million)
Ground Truth
words (million)
Char Err Rate% Word Err
Rate%
English 271 44 0.47 6.4
Italian 59 10 0.54 5.41
Russian 23 3.5 0.67 5.57
Simplified
Chinese
0.25 0.17 2.52 6.29
Hebrew 0.16 0.03 3.2 10.58
Japanese 10 4.1 4.26 18.72
Vietnamese 0.41 0.09 5.06 19.39
Hindi 2.1 0.41 6.43 28.62
Thai 0.19 0.01 21.31 80.53
Tesseract Tutorial: DAS 2014 Tours France
Thanks for Listening!
Questions?

Contenu connexe

En vedette

Khmer ocr itc
Khmer ocr itcKhmer ocr itc
Khmer ocr itcSolin TEM
 
Khmer ocr scientificday_itc
Khmer ocr scientificday_itcKhmer ocr scientificday_itc
Khmer ocr scientificday_itcSolin TEM
 
0 tutorial contents
0 tutorial contents0 tutorial contents
0 tutorial contentsSolin TEM
 
cs15 slide_v2
cs15 slide_v2cs15 slide_v2
cs15 slide_v2Solin TEM
 
Khmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_dayKhmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_daySolin TEM
 
Khmer ocr using gfd
Khmer ocr using gfdKhmer ocr using gfd
Khmer ocr using gfdSolin TEM
 
7 layout analysis
7 layout analysis7 layout analysis
7 layout analysisSolin TEM
 
CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1Solin TEM
 
Project planning forms_0210revised
Project planning forms_0210revisedProject planning forms_0210revised
Project planning forms_0210revisedJennyhpn
 
一公分鉛筆的故事
一公分鉛筆的故事一公分鉛筆的故事
一公分鉛筆的故事t067
 
Կայքի հաջողության մի քանի քայլեր
Կայքի հաջողության մի քանի քայլերԿայքի հաջողության մի քանի քայլեր
Կայքի հաջողության մի քանի քայլերSona Asryan
 
How Online Education Used to reduce illiteracy in remote areas
How Online Education Used to reduce illiteracy in remote areasHow Online Education Used to reduce illiteracy in remote areas
How Online Education Used to reduce illiteracy in remote areasshuvo510
 
Presentación sobre curso de asignatura
Presentación sobre curso de asignaturaPresentación sobre curso de asignatura
Presentación sobre curso de asignaturapimela12
 
Efficient Query Answering against Dynamic RDF Databases
Efficient Query Answering against Dynamic RDF DatabasesEfficient Query Answering against Dynamic RDF Databases
Efficient Query Answering against Dynamic RDF DatabasesAlexandra Roatiș
 
Universidad nacional de chimborazo katty
Universidad nacional de chimborazo kattyUniversidad nacional de chimborazo katty
Universidad nacional de chimborazo kattyVivitarojas
 
Yazılımcı Gözüyle Scrum
Yazılımcı Gözüyle ScrumYazılımcı Gözüyle Scrum
Yazılımcı Gözüyle Scrumnedirtv
 
[Infographic] Cisco Visual Networking Index (VNI): Mobile-Connected Devices p...
[Infographic] Cisco Visual Networking Index (VNI): Mobile-Connected Devices p...[Infographic] Cisco Visual Networking Index (VNI): Mobile-Connected Devices p...
[Infographic] Cisco Visual Networking Index (VNI): Mobile-Connected Devices p...Cisco Service Provider
 

En vedette (20)

Khmer ocr itc
Khmer ocr itcKhmer ocr itc
Khmer ocr itc
 
Khmer ocr scientificday_itc
Khmer ocr scientificday_itcKhmer ocr scientificday_itc
Khmer ocr scientificday_itc
 
0 tutorial contents
0 tutorial contents0 tutorial contents
0 tutorial contents
 
cs15 slide_v2
cs15 slide_v2cs15 slide_v2
cs15 slide_v2
 
Khmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_dayKhmer ocr using gfd_seminar_day
Khmer ocr using gfd_seminar_day
 
Khmer ocr using gfd
Khmer ocr using gfdKhmer ocr using gfd
Khmer ocr using gfd
 
7 layout analysis
7 layout analysis7 layout analysis
7 layout analysis
 
CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1
 
Project planning forms_0210revised
Project planning forms_0210revisedProject planning forms_0210revised
Project planning forms_0210revised
 
motivation & knowledge management
motivation & knowledge management motivation & knowledge management
motivation & knowledge management
 
一公分鉛筆的故事
一公分鉛筆的故事一公分鉛筆的故事
一公分鉛筆的故事
 
Project: Food map
Project: Food map Project: Food map
Project: Food map
 
Կայքի հաջողության մի քանի քայլեր
Կայքի հաջողության մի քանի քայլերԿայքի հաջողության մի քանի քայլեր
Կայքի հաջողության մի քանի քայլեր
 
How Online Education Used to reduce illiteracy in remote areas
How Online Education Used to reduce illiteracy in remote areasHow Online Education Used to reduce illiteracy in remote areas
How Online Education Used to reduce illiteracy in remote areas
 
Presentación sobre curso de asignatura
Presentación sobre curso de asignaturaPresentación sobre curso de asignatura
Presentación sobre curso de asignatura
 
Efficient Query Answering against Dynamic RDF Databases
Efficient Query Answering against Dynamic RDF DatabasesEfficient Query Answering against Dynamic RDF Databases
Efficient Query Answering against Dynamic RDF Databases
 
Universidad nacional de chimborazo katty
Universidad nacional de chimborazo kattyUniversidad nacional de chimborazo katty
Universidad nacional de chimborazo katty
 
Yazılımcı Gözüyle Scrum
Yazılımcı Gözüyle ScrumYazılımcı Gözüyle Scrum
Yazılımcı Gözüyle Scrum
 
[Infographic] Cisco Visual Networking Index (VNI): Mobile-Connected Devices p...
[Infographic] Cisco Visual Networking Index (VNI): Mobile-Connected Devices p...[Infographic] Cisco Visual Networking Index (VNI): Mobile-Connected Devices p...
[Infographic] Cisco Visual Networking Index (VNI): Mobile-Connected Devices p...
 
Recorte Web - DuocUC - escom
Recorte Web - DuocUC - escomRecorte Web - DuocUC - escom
Recorte Web - DuocUC - escom
 

Similaire à Tesseract Tutorial: DAS 2014 Tours France

Lessons from Indic OCR Development
Lessons from Indic OCR DevelopmentLessons from Indic OCR Development
Lessons from Indic OCR DevelopmentNishad Thalhath
 
Deep Turnover Forecast - meetup Lille
Deep Turnover Forecast - meetup LilleDeep Turnover Forecast - meetup Lille
Deep Turnover Forecast - meetup LilleCarta Alfonso
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Codemotion
 
DTrace - Miracle Scotland Database Forum
DTrace - Miracle Scotland Database ForumDTrace - Miracle Scotland Database Forum
DTrace - Miracle Scotland Database ForumDoug Burns
 
CPLEX Optimization Studio, Modeling, Theory, Best Practices and Case Studies
CPLEX Optimization Studio, Modeling, Theory, Best Practices and Case StudiesCPLEX Optimization Studio, Modeling, Theory, Best Practices and Case Studies
CPLEX Optimization Studio, Modeling, Theory, Best Practices and Case Studiesoptimizatiodirectdirect
 
Using F# to change the way we work
Using F# to change the way we workUsing F# to change the way we work
Using F# to change the way we workJon Harrop
 
infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...
infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...
infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...Infoshare
 
Machine learning the next revolution or just another hype
Machine learning   the next revolution or just another hypeMachine learning   the next revolution or just another hype
Machine learning the next revolution or just another hypeJorge Ferrer
 
Deep Learning Applications (dadada2017)
Deep Learning Applications (dadada2017)Deep Learning Applications (dadada2017)
Deep Learning Applications (dadada2017)Abhishek Thakur
 
Pharo VM Performance
Pharo VM PerformancePharo VM Performance
Pharo VM PerformancePharo
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...Andy Petrella
 
Day1 - TDD (Lecture SS 2015)
Day1 - TDD (Lecture SS 2015)Day1 - TDD (Lecture SS 2015)
Day1 - TDD (Lecture SS 2015)wolframkriesing
 
QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...Scality
 
PRADS presentation (English) @ University of Oslo by Ebf0 and kwy
PRADS presentation (English) @ University of Oslo by Ebf0 and kwyPRADS presentation (English) @ University of Oslo by Ebf0 and kwy
PRADS presentation (English) @ University of Oslo by Ebf0 and kwyRubén Romero
 
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization StudioSolving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization Studiooptimizatiodirectdirect
 
Complicating Complexity: Performance in a New Machine Age
Complicating Complexity: Performance in a New Machine AgeComplicating Complexity: Performance in a New Machine Age
Complicating Complexity: Performance in a New Machine AgeMaurice Naftalin
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningSergey Karayev
 

Similaire à Tesseract Tutorial: DAS 2014 Tours France (20)

Lessons from Indic OCR Development
Lessons from Indic OCR DevelopmentLessons from Indic OCR Development
Lessons from Indic OCR Development
 
Deep Turnover Forecast - meetup Lille
Deep Turnover Forecast - meetup LilleDeep Turnover Forecast - meetup Lille
Deep Turnover Forecast - meetup Lille
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
 
DTrace - Miracle Scotland Database Forum
DTrace - Miracle Scotland Database ForumDTrace - Miracle Scotland Database Forum
DTrace - Miracle Scotland Database Forum
 
3 training
3 training3 training
3 training
 
CPLEX Optimization Studio, Modeling, Theory, Best Practices and Case Studies
CPLEX Optimization Studio, Modeling, Theory, Best Practices and Case StudiesCPLEX Optimization Studio, Modeling, Theory, Best Practices and Case Studies
CPLEX Optimization Studio, Modeling, Theory, Best Practices and Case Studies
 
Using F# to change the way we work
Using F# to change the way we workUsing F# to change the way we work
Using F# to change the way we work
 
Large scalecplex
Large scalecplexLarge scalecplex
Large scalecplex
 
infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...
infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...
infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...
 
Machine learning the next revolution or just another hype
Machine learning   the next revolution or just another hypeMachine learning   the next revolution or just another hype
Machine learning the next revolution or just another hype
 
Deep Learning Applications (dadada2017)
Deep Learning Applications (dadada2017)Deep Learning Applications (dadada2017)
Deep Learning Applications (dadada2017)
 
Pharo VM Performance
Pharo VM PerformancePharo VM Performance
Pharo VM Performance
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
Day1 - TDD (Lecture SS 2015)
Day1 - TDD (Lecture SS 2015)Day1 - TDD (Lecture SS 2015)
Day1 - TDD (Lecture SS 2015)
 
QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...
 
PRADS presentation (English) @ University of Oslo by Ebf0 and kwy
PRADS presentation (English) @ University of Oslo by Ebf0 and kwyPRADS presentation (English) @ University of Oslo by Ebf0 and kwy
PRADS presentation (English) @ University of Oslo by Ebf0 and kwy
 
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization StudioSolving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
 
Complicating Complexity: Performance in a New Machine Age
Complicating Complexity: Performance in a New Machine AgeComplicating Complexity: Performance in a New Machine Age
Complicating Complexity: Performance in a New Machine Age
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 

Dernier

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Dernier (20)

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 

Tesseract Tutorial: DAS 2014 Tours France

  • 1. Tesseract Tutorial: DAS 2014 Tours France 1. Motivation and History of the Tesseract OCR Engine Lessons Learned from What Worked and What Didn't Ray Smith, Google Inc.
  • 2. Tesseract Tutorial: DAS 2014 Tours France What is Tesseract?
  • 3. Tesseract Tutorial: DAS 2014 Tours France Tesseract Timeline 4MB 16MB 64MB 256MB 3GB 24GB 10MHz 25MHz 100MHz 166MHz 3.5GHz 12 x 3.5GHz Development machine Memory/Speed: 3.02 3.01 3.00 2.01-2 2.03 2.04 1.01-4
  • 4. Tesseract Tutorial: DAS 2014 Tours France The Stealth Era: 1984-1994
  • 5. Tesseract Tutorial: DAS 2014 Tours France Dest has a scanner with built-in OCR with >99.5% accuracy. Office equipment trade-show: Earl's Court, London late 1984
  • 6. Tesseract Tutorial: DAS 2014 Tours France Dest has a scanner with built-in OCR with >99.5% accuracy. Caveat: It only recognizes 8 fonts from IBM Selectric "golf-ball" typewriters. Office equipment trade-show: Earl's Court, London late 1984
  • 7. Tesseract Tutorial: DAS 2014 Tours France Office equipment trade-show: Earl's Court, London late 1984 For documents conventional OCR can't handle, Kurzweil offers a better approach: ICR. For almost two decades, optical character recognition systems have been widely used to provide automated text entry into computerised systems. Yet in all this time, conventional OCR systems have never overcome their in- ability to read more than a handful of type fonts and page formats. Propor- tionally spaced type (which includes vir- tually all typeset copy), laser printer fonts, and even many non-proportional typewriter fonts, have remained beyond the reach of these systems. And as a result, conventional OCR has never achieved more than a marginal impact on the total number of documents needing conversion into digital form.
  • 8. Tesseract Tutorial: DAS 2014 Tours France Early Lessons:
  • 9. Tesseract Tutorial: DAS 2014 Tours France Early Lessons: ● Be able to OCR your own sales literature!
  • 10. Tesseract Tutorial: DAS 2014 Tours France Early Lessons: ● Be able to OCR your own sales literature! ● OCR should distinguish text from non-text.
  • 11. Tesseract Tutorial: DAS 2014 Tours France Early Lessons: ● Be able to OCR your own sales literature! ● OCR should distinguish text from non-text. ● OCR should read white-on gray and black-on-gray as easily as black-on-white.
  • 12. Tesseract Tutorial: DAS 2014 Tours France Early Lessons: ● Be able to OCR your own sales literature! ● OCR should distinguish text from non-text. ● OCR should read white-on gray and black-on-gray as easily as black-on-white. Decision: Extract outlines from grayscale images and features from the outlines. ● Profound impact. ● Key differentiator.
  • 13. Tesseract Tutorial: DAS 2014 Tours France Connected component outlines vs bounding boxes: Ninja often ● Bounding boxes of characters often overlap without the characters actually touching.
  • 14. Tesseract Tutorial: DAS 2014 Tours France How to extract features from Outlines? Outline Skeleton Topological Features
  • 15. Tesseract Tutorial: DAS 2014 Tours France Skeletonization is Unreliable Outline: Serifed Skeleton: Decorated Arrrrh! Lesson: If there are a lot of papers on a topic, there is most likely no good solution, at least not yet, so try to use something else.
  • 16. Tesseract Tutorial: DAS 2014 Tours France Topological features are Brittle Damage to 'o' produces vastly different feature sets: Standard 'o' Broken 'o' Filled 'o' Lesson: Features must be as invariant as possible to as many as possible of the expected degradations.
  • 17. Tesseract Tutorial: DAS 2014 Tours France Shrinking features and inappropriate statistics Segments of the polygonal Even smaller features Approximation Statistical: Geometric: Lesson: Statistical Independence is difficult to dodge.
  • 18. Tesseract Tutorial: DAS 2014 Tours France Now it's Too Slow! ~2000 characters per page x ~100 character classes (English) x 32 fonts x ~20 prototype features x ~100 unknown features x 3 feature dimensions = 38bn distance calculations per page...
  • 19. Tesseract Tutorial: DAS 2014 Tours France Now it's Too Slow! ~2000 characters per page x ~100 character classes (English) x 32 fonts x ~20 prototype features x ~100 unknown features x 3 feature dimensions = 38bn distance calculations per page... ... on a 25MHz machine.
  • 20. Tesseract Tutorial: DAS 2014 Tours France What Were Other OCR Systems doing in 1988-1990? Calera Wordscan: Required a $2000 "real computer" add-on board to run a fixed dimension feature space on a PC (that used a 12.5MHz 80286 in those days): Recognita: Decision tree using topological-like features. Very fast in software. Accuracy degraded very fast on poor quality. Neural Networks: Were just becoming popular. Would there be a startup threat?
  • 21. Tesseract Tutorial: DAS 2014 Tours France Testing Summary: ● Small test sets are meaningless -> Get lots of compute power and data. ● Test on very different data to the training data. ● Test every change. ● What you measure improves -> Measure lots of dimensions. ● If it can break it will. ● Make your code faster while waiting for tests to run. ● Think/test out of the box...
  • 22. Tesseract Tutorial: DAS 2014 Tours France Creative testing Christmas 1992: Company shutdown for 10 days. What should we do with all that compute power that would otherwise be wasted?
  • 23. Tesseract Tutorial: DAS 2014 Tours France Creative testing Christmas 1992: Company shutdown for 10 days. What should we do with all that compute power that would otherwise be wasted? ● Tune the hand-tuned parameters with a genetic algorithm. ● Process the 400 image test set thresholded at each possible threshold in the range 32-224
  • 24. Tesseract Tutorial: DAS 2014 Tours France Thresholding Experiment Results ● Exposed several bugs ● Some fascinating accuracy results 0 Image Threshold 255 0Errorrate -- Expected Results -- Observed Results
  • 25. Tesseract Tutorial: DAS 2014 Tours France The Dark Ages: 1995-2005: Gathering Dust
  • 26. Tesseract Tutorial: DAS 2014 Tours France The Dark Ages Tesseract: ● Ported to Windows. ● Moore's law: ○ 20x Memory Capacity. ○ 20x Speed. Commercial Engines: ● 1994: Caere buys Calera. ● 1995: First use of character n- grams in commercial system. (Wordscan 3.1). ● 1996: Caere buys Recognita. ● 2000: Scansoft buys Caere. ● Voting improves accuracy, but engine is much slower.
  • 27. Tesseract Tutorial: DAS 2014 Tours France 2005: HP Open Sources Tesseract
  • 28. Tesseract Tutorial: DAS 2014 Tours France The Open Source Era: 2006 - Present Tesseract source is available from: http://code.google.com/p/tesseract-ocr and for direct install on Linux via: apt-get install tesseract-ocr 60+ Languages available.
  • 29. Tesseract Tutorial: DAS 2014 Tours France Open Source OCR is Used for Various Applications From my inbox:
  • 30. Tesseract Tutorial: DAS 2014 Tours France Open Source OCR is Used for Various Applications My reply:
  • 31. Tesseract Tutorial: DAS 2014 Tours France Open Source OCR is Used for Various Applications He doesn't give up that easily:
  • 32. Tesseract Tutorial: DAS 2014 Tours France Recent Improvements ● Internationalize to 60+ Languages. ● Add Full Layout Analysis. ● Table Detection. ● Equation Detection. ● Better Language Models. ● Improved Segmentation Search. ● Word Bigrams.
  • 33. Tesseract Tutorial: DAS 2014 Tours France Results Language Ground Truth Chars (million) Ground Truth words (million) Char Err Rate% Word Err Rate% English 271 44 0.47 6.4 Italian 59 10 0.54 5.41 Russian 23 3.5 0.67 5.57 Simplified Chinese 0.25 0.17 2.52 6.29 Hebrew 0.16 0.03 3.2 10.58 Japanese 10 4.1 4.26 18.72 Vietnamese 0.41 0.09 5.06 19.39 Hindi 2.1 0.41 6.43 28.62 Thai 0.19 0.01 21.31 80.53
  • 34. Tesseract Tutorial: DAS 2014 Tours France Thanks for Listening! Questions?