This thesis proposes a complete system that classifies and recognizes machine-printed
Arabic text. The input to the system is a clean, high-resolution Tagged Image File Format
(.TIFF) image containing the Arabic text to be recognized; the output is the recognized
Arabic text saved in a Microsoft Word Document (.DOC) file. The technique is based on
describing the text in terms of shape primitives derived from Freeman chain codes. A
rule-based data enhancement technique is used to improve the extracted features as much
as possible. The features are processed by a Prolog feature-matching engine, which
classifies character classes as well as diacritic information into three separate streams
(a character class stream, a diacritic stream and a corners information stream). In
addition to these three streams, the estimated font size is provided as a fourth input.
Characters are finally determined by processing a permutation of the three streams using
a Definite Clause Grammar (DCG).
Offline Omni Font Arabic Optical Text Recognition System using Prolog Classification Technique
1. Offline Omni Font Arabic Optical Text Recognition System using Prolog Classification Technique Rami Al-Sahhar Ideas for today and tomorrow
2. Agenda: OCR Overview; The Arabic OCR Problem; OCR Challenges; Proposed Solution; Detailed System Stages; Sample Run; Future Work; Demo
3. OCR Overview. Optical Character Recognition (OCR) is the process of converting an image of text, such as a scanned paper document, into computer-editable text. The ultimate goal of OCR is to simulate the human ability to read both machine-printed and hand-written texts. Most of the work on OCR has been on Latin and Chinese characters. Arabic character recognition started recently and has advanced relatively slowly due to the complexity of recognizing Arabic text, whose characters are cursive in nature. Arabic character recognition is still an open and challenging field of research.
4. The Arabic OCR Problem. The goal is to propose a complete system that classifies and recognizes machine-printed Arabic text. The input to the system is a TIFF image file. The Arabic font size varies from 8 up to 36 points. The font type is Simplified Arabic or Traditional Arabic. The image is scanned at 300 dpi resolution. The output is editable text in a word processor program (MS Word).
5. OCR Challenges. Understanding the TIFF image format and pixel representation. Programmatically reading the TIFF image pixel by pixel from right to left. Feature extraction. Segmentation-free isolation of spaces, words, letters and lines. Noise reduction (dots and holes). Overlapped characters.
6. OCR Challenges: Arabic Character Characteristics. Written right to left. Always cursive. A character's shape changes according to its location in the word, giving four different shapes. There are 28 basic characters: 15 with dots, 13 without. There is no fixed character width and no fixed size.
7. OCR Challenges Arabic Character Characteristics Group of Arabic character shapes A sample of written Arabic showing some of its characteristics
8. The Proposed Solution. The proposed system starts from the document image acquisition stage and ends with recognized Arabic text in a standard Simplified TrueType font in MS Word 2007. We started designing our system by experimenting with prior researchers' techniques, adopting or modifying some of them when they met our requirements, but otherwise developing our own. Consequently, the components of our system are either due to the work of others, the result of our improvement of others' work, or our own completely new techniques.
9. The Proposed Solution: the ATR (Arabic Text Recognition) system model. Pipeline: TIFF image file → preprocessing → feature extraction → post enhancement (C-based) → classification and recognition (Prolog-based) → recognized text.
10. The Proposed Solution: Preprocessing Phase. Digitalization, scaling, word-level segmentation, noise removal and elimination of redundant information as far as possible. Image information retrieval: load and read the input (TIFF) image file as binary; retrieve the image properties (size, width, height, pixel resolution, image channels and image alignment); and create memory storage for the system's intermediate processing. Image digitalization: digitizes the TIFF image in order to apply fixed-level thresholding and to convert the gray-scale or bitmapped image to a binary (0's and 1's) image.
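The fixed-level thresholding step described above can be sketched in C as follows. This is a minimal illustration, not the thesis's implementation; the function names and the threshold value used in the usage note are assumptions.

```c
#include <stddef.h>

/* Fixed-level thresholding: map an 8-bit grayscale pixel to a binary
 * value. Pixels darker than the threshold become 1 (ink); lighter
 * pixels become 0 (background). The threshold is a parameter; the
 * thesis does not state its exact value, so any value here is an
 * illustrative assumption. */
int binarize_pixel(unsigned char gray, unsigned char threshold) {
    return gray < threshold ? 1 : 0;
}

/* Binarize a whole grayscale buffer: writes 0/1 values into out. */
void binarize_image(const unsigned char *gray, size_t n,
                    unsigned char threshold, unsigned char *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)binarize_pixel(gray[i], threshold);
}
```

For example, with an illustrative threshold of 128, a dark pixel value such as 30 maps to 1 (ink) and a light value such as 200 maps to 0 (background).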
11. The Proposed Solution. The preprocessing phase computes the vertical and horizontal histograms to retrieve the number of lines per page and the number of components (words) per line. We calculate the font baseline and size by finding the maximum of the horizontal histogram of each line on the page. This enables the dots and other special marks, such as Shadda, Madda and Tanween, to be classified as upper or lower components relative to this baseline. Text line detection. Word segmentation.
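A minimal C sketch of the horizontal (row) histogram and the baseline-as-maximum rule described above. The image dimensions and helper names are illustrative assumptions, not the thesis's code.

```c
#define W 8   /* toy image width  (illustrative) */
#define H 10  /* toy image height (illustrative) */

/* Row histogram: count black pixels (value 1) in each row of a
 * binary image. A run of rows with nonzero counts is a text line. */
void row_histogram(const unsigned char img[H][W], int hist[H]) {
    for (int y = 0; y < H; y++) {
        hist[y] = 0;
        for (int x = 0; x < W; x++)
            hist[y] += img[y][x];
    }
}

/* The row with the maximum histogram value within a line's bounds
 * (rows y0..y1 inclusive) approximates that line's font baseline,
 * as described above. */
int baseline_row(const int hist[H], int y0, int y1) {
    int best = y0;
    for (int y = y0 + 1; y <= y1; y++)
        if (hist[y] > hist[best]) best = y;
    return best;
}
```

Dots and marks above the baseline row are then treated as upper components, and those below it as lower components.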
12. Preprocessing Phase: Text Line Detection (sample output).
B&W image is found in file name: [test1.tif]
Processing a [1615x2160] image with [1] channel(s)
Image Origin: [Top-left Origin], Align: [4]
Data Order: [Interleaved Color Channels]
Number of Line(s) found: [6]
Line #0, Y = 78, Height = [67]
Line #1, Y = 185, Height = [67]
Line #2, Y = 292, Height = [67]
Line #3, Y = 399, Height = [67]
Line #4, Y = 506, Height = [67]
Line #5, Y = 613, Height = [67]
Font Baseline = [38 pixels]
Number of Components found at Image Line #0: [9]
Number of Components found at Image Line #1: [14]
Number of Components found at Image Line #2: [16]
Number of Components found at Image Line #3: [18]
Number of Components found at Image Line #4: [10]
Number of Components found at Image Line #5: [6]
13. The Proposed Solution: Preprocessing Phase, Word Segmentation (sample output).
Number of retrieved contours: [2]
Bounding Rectangle (1,22)-(72,65)
Component [1]: Origin Y = 22, Height = 43, Area = 429.000000
Component [2]: Origin Y = 47, Height = 7, Area = 19.000000
Max Component Area = 429.000000, Y = 22, H = 43
14. The Proposed Solution: Feature Extraction Phase. This is the most challenging part of character or text recognition. The choice of good features significantly improves the recognition rate and minimizes the error in the presence of noise. The main selected features are: outer contours described as Freeman chain codes; contours' corners; dot information; estimated font size. All of these features are extracted for all detected components during the page scan.
15. The Proposed Solution: Freeman chain code. The chain code was introduced by Freeman as a means of representing lines or boundaries of shapes by a connected sequence of straight-line segments of specified length and direction. An example of the 8-connectivity chain code. Chain code numbering schemes.
16. The Proposed Solution: Contour extraction process. This is the core process that extracts the main word-level features of the Arabic text in Freeman chain code format. After extracting the Freeman codes, we aggregate those codes into pairs (X, Y), where X is the direction (i.e. from 0 to 7) and Y is the length in pixels.
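The aggregation of raw Freeman codes into (direction, length) pairs described above is essentially run-length encoding. A minimal C sketch follows; the function name and sample chain code are invented for illustration.

```c
/* Aggregate a raw Freeman chain code (directions 0..7) into
 * (direction, length) pairs: consecutive identical directions
 * collapse into one pair whose length is the run length in pixels.
 * dir[] and len[] must each have room for n entries; returns the
 * number of pairs written. */
int aggregate_chain(const int *code, int n, int dir[], int len[]) {
    int pairs = 0;
    for (int i = 0; i < n; ) {
        int j = i;
        while (j < n && code[j] == code[i]) j++;  /* end of this run */
        dir[pairs] = code[i];
        len[pairs] = j - i;
        pairs++;
        i = j;
    }
    return pairs;
}
```

For instance, the raw code 0,0,0,6,6,4,4,4,4,2 aggregates to the pairs (0,3), (6,2), (4,4), (2,1): three pixels east, two pixels in direction 6, and so on.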
18. The Proposed Solution: Corner Detection. This phase detects and extracts the corners of the component's contour for the text under processing. It is based on an implementation of contour detection and curve representation by circular local histograms of the contour chain code, presented by Arrebola, Camacho, Bandera and Sandoval (1999). The corner detection phase is very important for the subsequent classification and recognition phase: it helps our Prolog engine determine the unique shape of a character's feature regardless of the character's orientation. The output of this phase is a stream of corner information that is input to the next phase.
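For illustration only, here is a much simpler corner heuristic than the circular-local-histogram method of Arrebola et al. that the thesis uses: flag contour points where the Freeman chain-code direction turns sharply. This is an assumption-laden sketch of the general idea of reading corners off a chain code, not the thesis's algorithm.

```c
#include <stdlib.h>

/* Smallest angular difference between two Freeman directions 0..7,
 * measured in 45-degree steps (result 0..4). */
int turn(int a, int b) {
    int d = abs(a - b) % 8;
    return d > 4 ? 8 - d : d;
}

/* Mark corners[i] = 1 where the direction change between successive
 * chain codes is at least min_turn (e.g. 2 means a turn of 90
 * degrees or more). Returns the number of corners found. */
int mark_corners(const int *code, int n, int min_turn, int *corners) {
    int count = 0;
    for (int i = 0; i < n; i++) corners[i] = 0;
    for (int i = 1; i < n; i++) {
        if (turn(code[i - 1], code[i]) >= min_turn) {
            corners[i] = 1;
            count++;
        }
    }
    return count;
}
```

The circular-local-histogram method is more robust because it considers a neighborhood of codes around each point rather than a single adjacent pair, which makes it far less sensitive to pixel-level noise.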
20. We introduced an algorithm to remove noisy pixels that occur within any straight line, and to convert Arabic character strokes to approximately straight lines.
24. The Proposed Solution: Definite Clause Grammar (DCG). DCG provides a mechanism for defining the grammar rules of a language. These rules are automatically translated into a Prolog program that defines a parser for the language being defined. Grammar rules are a feature of many Prolog systems and are designed to facilitate the parsing of natural language. Using this notation, a grammar is represented as a set of logical rules. When the DCG rules are consulted (or compiled), they are translated into Prolog clauses.
25. The Proposed Solution: Word-level Classification and Recognition Phase. This is the most critical phase of our proposed ATR system. It is written in Prolog using Prolog matching, backtracking and DCG techniques. The input for this phase is data on two features. The first input stream is the corner sequence of the word-level outer contours for each component, which represents the elevation information of the input (the upper part, which holds most of the features). The second input stream is the dot information found in the same component.
26. The Proposed Solution The Prolog matching and backtracking techniques also use the corner sequence stream to classify the unknown inputs into character classes, while the Prolog DCG technique uses the dot information stream to recognize the actual Arabic letters of a particular character class
27. The Proposed Solution: DCG implementation. The DCG grammar structure and some of the character classes are described below:

% DCG part for Arabic text recognition based on two input streams
% usage: phrase(s(R), [m, h_c, d1, m, dc]).
s([H|T]) --> cc(H), subs(T).  % every string is a character class followed by a sub-string
s(R) --> cc(R).               % or a string can simply be a character class
subs(R) --> s(R).             % a substring is nothing but a string (recursively)
cc(R) --> ch(R).              % a character class can be a simple character
                              % or character classes can belong to any of the following classes
cc(R) --> bc(R).              % Ba class (ba, ta, tha, ya_md)
cc(R) --> h_c(R).             % H_ class (h_, jeem, kha)
cc(R) --> dc(R).              % Dal class (dal, thal)
cc(R) --> rc(R).              % Ra' class (ra, zay)
cc(R) --> sc(R).              % Seen class (seen, sheen)
28. The Proposed Solution: Microsoft Word Document Integration. This is the final phase of our optical text recognition system. It is written in Prolog to interface with the Microsoft Word program. It uses the Microsoft Word Document API to write the recognized characters into a new Word document. It writes the output text at the recognized font size in a predefined font type, and it writes the white space and new lines needed to preserve the original text alignment and format.
29. The Proposed Solution (system flow). Document image (TIFF file) → Image information retrieval (height/width, pixel resolution) → Image digitalization (font baseline, font size) → Word-level segmentation (lines per page, words per line, component coordinates (X, Y)) → Word-level contour extraction (Freeman chain codes; component area, height and width) → Contour enhancement → Corner detection and dots detection (character shape stream, dot information stream) → Character-shape Prolog matching (against the character reference database) → DCG engine word-level recognition → MS Word document integration → Recognized text (Word document).
30. Sample Run Original TIFF image with Arabic text The recognized Arabic text in MS Word 2007
31. Future Work. Support more Arabic font types. Support more image types (GIF, BMP, JPEG, etc.). Support different font sizes on the same page. Support Arabic and English fonts together, plus numeric and special characters. Support a spellchecker and word suggestions. Implement the system as an Arabic business card reader. Add a capture-and-recognize feature for iPhone.
33. OCR Applications. Industries and institutions in which control of large amounts of paperwork is critical: banking, credit card and insurance industries. The medical community: to capture, store and transmit radiology images. Libraries and archives: for conservation and preservation of vulnerable documents and for providing access to source documents.
"Omnifont" is a commonly used term in conjunction with OCR software. Omnifont recognition refers to the capability of computer software, usually OCR software, to read (or recognize) virtually any font that maintains fairly standard character shapes.
The field of Optical Character Recognition (OCR) is a branch of technology that deals with the automatic reading of text. The ultimate goal of OCR is to simulate the human ability to read both machine-printed and hand-written texts. Currently, available systems can read faster than a human, but they cannot reliably read as wide a variety of texts. Humans are also better at reading highly distorted text or noisy (unclear) media. Therefore, a great deal of intensive research is still needed to narrow the gap between human and machine reading capabilities. OCR has been used in many practical areas that are independent of the language to which it is applied. One of the earliest and most successful applications was sorting checks in banks, where the volume of checks circulating daily proved too enormous to be handled by manual entry. Reading of handwritten and printed postal codes, text archiving and retrieval, reading of customers' handwritten forms and aiding visually impaired people to read are a few other examples. Most of the work on OCR has been on Latin and Chinese characters. Work on Arabic character recognition has only started recently and has advanced relatively slowly due to the complexity of recognizing Arabic text, which has characters that are cursive in nature. Arabic character recognition is still an open and challenging field of research.
The extant literature includes a considerable number of studies focused on the recognition of writing in languages such as Latin, Chinese and Hebrew. Unfortunately, limited research has been done on the recognition of Arabic writing. This might be attributed to the peculiar aspects of written Arabic. For example, Arabic is written from right to left. This may not seem a major technical point, but it becomes an issue when an Arabic text contains some foreign text, such as Latin or French, which is written from left to right, and vice versa. Arabic writing is cursive: Arabic characters are generally joined to each other. This requires a process of segmentation of Arabic words and characters before any recognition step can be taken. In fact, segmentation and character isolation is the most difficult problem in character recognition schemes for the Arabic language. An Arabic word might comprise both cursive and separated characters. Twenty-two of the 29 sets of Arabic characters assume different shapes and sizes depending on their positions within words. Some characters, such as "ع", take four shapes, which are all different from each other: "ع، عـ، ـعـ، ـع". The character "ج" takes three shapes: "ج، جـ، ـجـ". There are only seven characters which, regardless of their positions within words, have only one shape: "و، ذ، د، ظ، ط، ز، ر". In general, characters, even of the same group, are of different sizes, and they require different rectangular boxes to enclose them.
Having conducted a thorough literature review, we started designing our system by experimenting with prior researchers' techniques, adopting or modifying some of them when they met our requirements, but otherwise developing our own techniques. Consequently, the components of our system are either due to the work of others, the result of our improvement of others' work, or our own completely new techniques. During the development phase of the system, many of the investigated techniques were rejected due to their ineffectiveness, whether from inherent deficiency or from the incomplete descriptions given in their source texts.
Preprocessing Phase. This phase is implemented in the C language to perform image processing with the help of open-source image processing libraries such as LibTiff and OpenCV. In our proposed system, Arabic text images have been obtained by optical scanning of the character images on plain paper. The input data obtained by scanning printed text is almost always contaminated with noise and contains redundant information. Preprocessing includes digitalization, scaling, word-level segmentation, noise removal and elimination of redundant information as far as possible.

Image information retrieval. In this process, we load and read the input Tagged Image File Format (TIFF) image file as binary; retrieve the image properties (size, width, height, pixel resolution, image channels and image alignment); and create memory storage for the system's intermediate processing.

Image digitalization. The image digitalization process digitizes the TIFF image in order to apply fixed-level thresholding and to convert the gray-scale or bitmapped image to a binary (0's and 1's) image. It computes the vertical and horizontal histograms to retrieve the number of lines per page and the number of components (words) per line. In this phase, we calculate the font baseline and size. The font baseline can be calculated by finding the maximum of the horizontal histogram of each line on the page. This enables the dots and other special marks, such as Shadda, Madda and Tanween, to be classified as upper or lower components relative to this baseline.

Text line detection. Text line detection is performed by scanning the input page image horizontally. The frequency of black pixels in each row is counted in order to construct the row histogram. A position between two consecutive lines where the number of pixels in a row is zero denotes a boundary between the lines.
Here it is assumed that the text block contains only a single column of text.

Word segmentation. After a line has been detected, it is scanned vertically. To find the column histogram, the number of black pixels in each column is calculated. If there exist n consecutive scans that find no black pixel, we take this as a marker between two words. The value of n is determined experimentally. Figure 5 and Figure 6 show the preprocessing phase.
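The n-consecutive-empty-columns rule described above can be sketched in C as follows. The line dimensions, the gap threshold used in the usage note, and the function name are illustrative assumptions; the thesis determines n experimentally.

```c
#define W 12  /* toy line width   (illustrative) */
#define LH 4  /* toy line height  (illustrative) */

/* Column histogram over one detected text line: count black pixels
 * per column, then treat any run of at least n_gap consecutive
 * empty columns as a separator between two words. Returns the
 * number of words found. */
int count_words(const unsigned char line[LH][W], int n_gap) {
    int col[W];
    for (int x = 0; x < W; x++) {
        col[x] = 0;
        for (int y = 0; y < LH; y++) col[x] += line[y][x];
    }
    int words = 0, in_word = 0, gap = n_gap; /* start in a "gap" */
    for (int x = 0; x < W; x++) {
        if (col[x] > 0) {
            /* entering ink after a wide-enough gap starts a word */
            if (!in_word && gap >= n_gap) words++;
            in_word = 1;
            gap = 0;
        } else {
            gap++;
            if (gap >= n_gap) in_word = 0;
        }
    }
    return words;
}
```

Note that a narrow gap (shorter than n_gap columns) does not split a word; this is what lets the small intra-word gaps of cursive Arabic survive segmentation while true inter-word spaces are detected.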
Our practical experimentation suggested applying some rules to the retrieved Freeman chain contours. We introduced an algorithm to remove noisy pixels that occur within any straight line, and to convert Arabic character strokes to approximately straight lines. These enhancement rules, which were derived from testing Arabic characters multiple times, reduce the time required for character recognition. The result of the algorithm is illustrated in Figure 10.
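One plausible rule of the kind described can be sketched as follows; the thesis's exact rule set is not reproduced here, so this single rule is an assumption. A lone chain code that interrupts a run of identical codes on both sides is treated as a noisy pixel and replaced by the surrounding code, which straightens the line.

```c
/* Smooth a Freeman chain code in place: a single code that differs
 * from identical neighbors on both sides is treated as noise and
 * replaced by the neighboring code. One pass over the code; a real
 * rule set would likely apply several such rules. */
void smooth_chain(int *code, int n) {
    for (int i = 1; i + 1 < n; i++) {
        if (code[i] != code[i - 1] && code[i - 1] == code[i + 1])
            code[i] = code[i - 1];
    }
}
```

For example, the code sequence 0,0,1,0,0 (a straight horizontal run with one stray diagonal step) becomes 0,0,0,0,0, which then collapses into a single long (direction, length) pair during aggregation, shortening the feature stream the recognizer must match.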