IMPACT Final Conference - Michael Fuchs
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
ABBYY & OCR Improvements for IMPACT
Michael Fuchs
Senior Product Marketing Manager
ABBYY Europe
fuchs@abbyy.com
2. Agenda
Who is ABBYY?
Company Overview
(Short) Product Overview
ABBYY Technology in the IMPACT project
OCR & Processing – IMPACT improvements
Binarisation, Segmentation,
Recognition
Dictionary API, Export Formats
Lessons Learned, Pricing, Pre-Announcement, Q&A
4. ABBYY Group
Overview ABBYY Group
Founded in 1989 as BIT Software
> 1000 employees in 14 offices worldwide
Headquarters/R&D in Moscow, Russia
ABBYY & OCR for IMPACT 4
5. ABBYY OCR Products – Usage View
OCR & Document Conversion products by usage:
Desktop/Workgroup (user-driven processing, ready to use)
Products: FineReader (Professional, Corporate, Site Licence Edition), PDF Transformer, FotoReader, ScreenshotReader
Note: No Gothic/Fraktur OCR!
Users: End Users, Companies, (Libraries)
Server/Backend (automated processing, ready to use)
Products: Recognition Server (Professional, Extended Edition)
Gothic/Fraktur OCR & XML Export Support!
Users: Companies, Scan Service Provider, Libraries
SDK/Integration (automated processing, development needed)
Products: FineReader Engines (Windows, Linux, Mac OS X, FreeBSD, Embedded Systems), Mobile OCR Engine (Android, Symbian, Linux, Windows, Windows Mobile, iOS)
Users: Developers, Scan Service Provider, IMPACT Research
6. What (ABBYY) OCR can read...
Recognition Languages
Almost 200 OCR languages
34 languages with dictionary support and spell check
Alphabets: Cyrillic, Latin, Greek, Armenian, Hebrew, Thai
Chinese, Japanese, Korean (CJK): 4 character sets
(Traditional Chinese, Simplified Chinese, Japanese, Korean)
Arabic (Technical Preview in the SDK)
Font Types
Recognition of mixed font types
(dot-matrix printer, typewriter, Gothic, etc.)
OCR-A
OCR-B
MICR (E13B)
CMC-7
7. IMPACT & ABBYY
ABBYY is the OCR technology provider for IMPACT members
ABBYY also improved the core technologies for recognising old documents in IMPACT. Focus areas are/were:
Image pre-processing
Segmentation
Character recognition
Export
IMPACT members work with the Software Development Kit (SDK)
FineReader Engine – not the desktop application
The IMPACT focus is/was on research, not on setting up a production system ;o)
Improved technologies are/will be added to current/future products
9. Why ABBYY? - OCR …
[Figure panels: original image ("perfect quality :o)"), standard OCR result*, ABBYY Fraktur OCR result*]
*Recognition Server 3.0 R1, with Gothic/Fraktur disabled and enabled
10. ABBYY “History” and Old Fonts Recognition
FineReader XIX (V7 Technology) 2003
(METAe result 2000-2003)
FineReader Engine 9.0 (Release 1) 2008
(Pre-IMPACT – “State of the Art”)
FineReader Engine 10 2010
IMPACT Project Optimizations
11. ABBYY and Old European Fonts
Accuracy Comparison (chart for 2003, 2008 and 2010): up to 98.2% on good quality images
ABBYY Technology Version 10 recognition of old European fonts:
25% more accurate than FRE 9.0
38% more accurate than FR XIX
13. Processing Steps
Step 1. Scanning, Image Loading, Pre-Processing and Modification
Compensating image defects and making the document suited for automatic OCR
Step 2. Document Layout Analysis
Layout analysis, detection of document sections like text, images and barcodes
Step 3. (Optical) Character Recognition
Automatic recognition of characters, applying the selected recognition languages & dictionaries
Step 4. (optional) Verification - by operators or automated post-correction
Manual validation of suspicious characters and words
Step 5. Document Synthesis and Export
Generating an output document in the selected format
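The five steps can be read as a simple pipeline in which only the verification step is optional. The sketch below illustrates that flow in Python. All names (`Document`, `preprocess`, `run_pipeline` and so on) are hypothetical placeholders, not ABBYY FineReader Engine APIs, and each step is a stub that only records what it would do.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical container handed from one step to the next."""
    source: str
    history: List[str] = field(default_factory=list)


def preprocess(doc: Document) -> Document:
    doc.history.append("Step 1: image loaded and pre-processed")
    return doc


def analyse_layout(doc: Document) -> Document:
    doc.history.append("Step 2: text, image and barcode regions detected")
    return doc


def recognise(doc: Document) -> Document:
    doc.history.append("Step 3: characters recognised with languages and dictionaries")
    return doc


def verify(doc: Document) -> Document:
    doc.history.append("Step 4: suspicious characters and words checked")
    return doc


def export(doc: Document) -> Document:
    doc.history.append("Step 5: output document generated")
    return doc


def run_pipeline(source: str, verify_results: bool = False) -> Document:
    """Run the steps in order; Step 4 is only included on request."""
    steps: List[Callable[[Document], Document]] = [preprocess, analyse_layout, recognise]
    if verify_results:
        steps.append(verify)
    steps.append(export)
    doc = Document(source)
    for step in steps:
        doc = step(doc)
    return doc


if __name__ == "__main__":
    for entry in run_pipeline("scan.tif", verify_results=True).history:
        print(entry)
```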
15. Step 1: Image pre-processing
Image Loading, Pre-Processing and Modification
Intelligent background filtering
Adaptive Binarisation
Global binarisation at image level (one threshold for the whole page) cannot deliver good results for OCR
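To illustrate why a single image-wide threshold struggles on unevenly lit or textured pages, the sketch below compares a global threshold with a simple local-mean adaptive threshold. This is a generic textbook technique written with NumPy, not ABBYY's algorithm. The function names, the window size and the offset are assumptions chosen for this example, and the test page is synthetic.

```python
import numpy as np


def global_binarise(gray: np.ndarray) -> np.ndarray:
    """One threshold for the whole image (here: the image mean). True = ink."""
    return gray < gray.mean()


def adaptive_binarise(gray: np.ndarray, window: int = 15, offset: float = 10.0) -> np.ndarray:
    """Mark a pixel as ink if it is clearly darker than its local neighbourhood.

    `window` must be odd; `offset` is the margin below the local mean.
    """
    pad = window // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # Integral image with a leading row/column of zeros: any window sum
    # is then four lookups instead of a loop over the window.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    window_sum = (integral[window:window + h, window:window + w]
                  - integral[:h, window:window + w]
                  - integral[window:window + h, :w]
                  + integral[:h, :w])
    local_mean = window_sum / (window * window)
    return gray < local_mean - offset


# Synthetic page: the background darkens from left (220) to right (100),
# and thin strokes of "ink" are 60 grey levels darker than the background.
h, w = 40, 200
background = np.tile(np.linspace(220.0, 100.0, w), (h, 1))
ink = np.zeros((h, w), dtype=bool)
ink[18:22, ::8] = True
gray = background - 60.0 * ink

global_errors = int((global_binarise(gray) != ink).sum())
adaptive_errors = int((adaptive_binarise(gray) != ink).sum())
print("misclassified pixels, global threshold:  ", global_errors)
print("misclassified pixels, adaptive threshold:", adaptive_errors)
```

On this synthetic page the global threshold labels the darker half of the background as ink, while the local threshold follows the background gradient. Real scans would need a tuned window size and offset.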
16. Step 1: Image pre-processing
New V10: Binarisation, Textured Background optimisations
[Figure panels: original scan, V9 binarisation, new V10 binarisation]
17. Step 1: Image pre-processing
New V10: Binarisation, Textured Background optimisations
[Figure panels: original scan, V9 binarisation, V10 binarisation]
18. Step 1: Image pre-processing
New V10: Binarisation for the IMPACT project
[Figure panels: original, state of the art (V9), new (V10)]
No text from the other page!
20. Step 2: Document Layout Analysis
Analyze layout and find text, images, tables and barcodes
21. Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Image/Text detection – Example 1/3
[Figure panels: V9 Technology, V10 Technology]
Part of the column was detected as an image
22. Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Word Order Detection – Example 2/3
[Figure panels: V9 Technology, V10 Technology]
Fewer linear word order errors
23. Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Lost text (no Detection) – Example 3/3
[Figure panels: V9 Technology, V10 Technology]
Less lost text
24. Step 2: Document Layout Analysis
Segmentation Improvements: IMPACT Results over time
Before IMPACT:
Overall segmentation improvements
● Better picture detection
● Better separators
● Better page layout reconstruction
Only a random set of old newspapers available
After IMPACT:
IMPACT Segmentation Ground Truth available
New (internal) DA model for historic newspapers
New segmentation evaluation methodology
Evaluation results on newspapers
● 40% fewer split/merge errors
● 25% less garbage and lost text
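Split and merge errors can be counted by comparing detected regions with ground-truth regions. The sketch below is only an assumption about how such a count might look, not the IMPACT evaluation methodology. Regions are simplified to axis-aligned boxes `(x0, y0, x1, y1)`, and the overlap threshold `min_frac` and all function names are invented for this example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with x1 > x0 and y1 > y0


def area(b: Box) -> int:
    return (b[2] - b[0]) * (b[3] - b[1])


def intersection(a: Box, b: Box) -> int:
    """Area shared by two boxes (0 if they do not overlap)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def count_split_merge(truth: List[Box], detected: List[Box],
                      min_frac: float = 0.5) -> Tuple[int, int]:
    """Return (splits, merges) under the simplified box model."""
    # Split: several detected regions lie mostly inside one ground-truth region.
    splits = sum(
        1 for t in truth
        if sum(intersection(t, d) >= min_frac * area(d) for d in detected) > 1
    )
    # Merge: one detected region covers most of several ground-truth regions.
    merges = sum(
        1 for d in detected
        if sum(intersection(t, d) >= min_frac * area(t) for t in truth) > 1
    )
    return splits, merges


# Two newspaper columns as ground truth.
columns = [(0, 0, 100, 300), (120, 0, 220, 300)]

# Left column detected as two blocks: one split error.
print(count_split_merge(columns, [(0, 0, 100, 140), (0, 160, 100, 300), (120, 0, 220, 300)]))  # (1, 0)

# Both columns detected as one block: one merge error.
print(count_split_merge(columns, [(0, 0, 220, 300)]))  # (0, 1)
```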
26. Step 3: Text/Character Recognition
Samples for Classifiers used in ABBYY technologies
After line detection, character recognition is applied with different classifiers
[Figure panels: raster classifier, contour classifier, structure classifier, feature differentiating classifier]
27. Step 3: Text/Character Recognition
Optimization and new Developments
Improved Gothic Classifiers
A significant amount of time was invested in Gothic classifier training
Ground truth material selected by the libraries (for historical relevance) was used
New Gothic graphemes were added
Results
Good quality images: 2.8% (total) error rate on the test set used, about a 20% improvement over the “state of the art” (V9) and almost comparable to modern documents
Bad quality images: 7% (total) error rate on the test set used, about a 30% improvement over the “state of the art” (V9)
Most of the improvements are available in current ABBYY products:
ABBYY FineReader Engine 10 (SDK) & Recognition Server 3.0
Quality optimization will continue in future releases and technology cycles
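The results combine an absolute error rate with a relative improvement. Assuming "improvement" means a relative reduction of the error rate (the slide does not state this explicitly), the implied V9 error rates can be back-calculated as in the following sketch. The helper name is made up for this example.

```python
def implied_baseline(error_rate: float, relative_improvement: float) -> float:
    """Error rate before the improvement, assuming new = old * (1 - improvement)."""
    return error_rate / (1.0 - relative_improvement)


good_v9 = implied_baseline(0.028, 0.20)  # good quality images
bad_v9 = implied_baseline(0.07, 0.30)    # bad quality images

print(f"good quality: V9 about {good_v9:.1%}, V10 2.8%")
print(f"bad quality:  V9 about {bad_v9:.1%}, V10 7.0%")
```

Under this reading, V9 would have been at roughly 3.5% on good quality images and roughly 10% on bad quality images for the same test set.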
28. Step 3: Text/Character Recognition
Optimization and new Developments
Old Slavonic as new OCR Language
New Development
[Figure panels: before, now]
32. Step 3 – 5: Other Optimizations
External Dictionary API Tuning
External Dictionary API was available in the FineReader Engine (SDK)
Support for any language, any time period
The API was/is heavily used by IMPACT language partners to run quality tests
New ALTO XML Export Formats
FineReader Engine 10 R2, December 2010
Recognition Server 3.0, July 2011
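ALTO stores recognised words as `String` elements with a `CONTENT` attribute, grouped into `TextLine` and `TextBlock` elements, and can carry a word confidence in `WC`. As a rough sketch of how a downstream tool might read such an export, the following uses only Python's standard library. The sample document is hand-written for this example (it is not real FineReader output), the ALTO v2 namespace is an assumption, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumed schema version; an actual export may declare a different namespace.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v2#"}

SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Historic" WC="0.97"/>
            <SP/>
            <String CONTENT="newspaper" WC="0.62"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_lines(alto_xml: str) -> List[str]:
    """Return the text of every TextLine, words joined by single spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iterfind(".//a:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.iterfind("a:String", ALTO_NS)]
        lines.append(" ".join(words))
    return lines


def low_confidence_words(alto_xml: str, threshold: float = 0.8) -> List[str]:
    """Return words whose WC (word confidence, 0 to 1) is below the threshold."""
    root = ET.fromstring(alto_xml)
    return [
        s.get("CONTENT", "")
        for s in root.iterfind(".//a:String", ALTO_NS)
        if float(s.get("WC", "1")) < threshold
    ]


print(read_lines(SAMPLE))            # ['Historic newspaper']
print(low_confidence_words(SAMPLE))  # ['newspaper']
```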
34. Further Information & Trial Versions
The ABBYY Gothic/Fraktur OCR Portal:
www.frakturschrift.com
35. What IMPACT taught ABBYY about Libraries & Mass Digitalization projects…
The Reality
Masses of books/documents are available & already scanned
It is unclear whether Antiqua and/or Gothic/Fraktur fonts are used in the documents
Pre-sorting is not feasible: it would cost too much time and money
ABBYY Europe's Answer
Reduced the pricing for mixed “Old” + “Modern” font OCR projects
The pricing is now ready for “mass processing”
Examples: Recognition Server 3.0 with “Gothic” enabled
10,000 pages: 299 Euro, available online
500,000 pages*: 5,000 Euro = 1 Euro cent per page = approx. 2,000 books of 250 pages each
Over 3 million pages*: approx. 0.52 Euro cent per page = 12,000 books at 1.25 € each (250 pages)
Over 10 million pages*: approx. 40,000 books = approx. 0.5 € per book
... No more excuses for not OCRing :o)
* page size is A4, bigger formats are counted as multiple pages
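The per-page and per-book figures follow from simple division. The sketch below redoes the arithmetic for the 500,000-page example and for the 10,000-page online package, assuming 250 A4 pages per book as on the slide. The function name is illustrative.

```python
from typing import Tuple


def unit_costs(total_euro: float, pages: int, pages_per_book: int = 250) -> Tuple[float, int, float]:
    """Return (Euro cents per page, number of books, Euro per book)."""
    cents_per_page = total_euro * 100 / pages
    books = pages // pages_per_book
    euro_per_book = total_euro / books
    return cents_per_page, books, euro_per_book


print(unit_costs(5000, 500_000))  # (1.0, 2000, 2.5)

cents, books, _ = unit_costs(299, 10_000)
print(f"{cents:.2f} Euro cents per page for {books} books")  # 2.99 Euro cents per page for 40 books
```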
36. Pre-Announcement
ABBYY Online OCR Services with Gothic/Fraktur
The ABBYY Gothic/Fraktur OCR Portal:
finereader.abbyyonline.com
Historic OCR added just last week
Web GUI to upload documents and
get results
Simple to use
Low Volume, ad hoc Usage
Instant results, quality evaluation
Pay as you go
ABBYY Online OCR SDK
OCR Service with API and XML Output
Runs on Windows Azure
Currently Closed Beta Test
Public Beta Test Q1/2012