QITCOM 2011
May 24 | Day 1 | INNOVATE
Session 2: Digitizing Arabic Content - Lead the Way
Speaker: Dr. Kareem Darwish, Arabic Language Technology Senior Scientist - Qatar Computing Research Institute, Qatar Foundation
Topic: E-Learning: The Future of Arabic Digital Content
For more information visit www.qitcom.com.qa
7. Result of Scanning http://www.colophon.com Courtesy of the Library of Alexandria
8. Magic: Optical Character Recognition من ناحيتى المراقبة والنيران على السهل الساحلى. وطرق الاقتراب التى تسلكها أى قوات عربية من ناحية الشرتى تنح ر دى طرق خمسة أهمها الثلاثة التالية: ا- الطريق ا لأول وهو ا لأقصر من بغداد- هـ 2233 " المفرتى.، أو ا لانحراف إلى الرطبة قبل 3أ3 دمشق الأردن، محاور ا لارابى من العرا إلى سوريا وا لأردد 2- الطريق الاثانى من بغداد " أبو كمال- بالميرا- دمشق- ا لأردد. 3- الطريق الثالسث وهو الأطول من بغ داد- الموصل- دير الزور- حملرو- دمشق- ا لأردن " 68 Courtesy of the Library of Alexandria OCR output (Sakhr)
9. Arabic OCR is Hard. Letters change shape depending on their position in a word, with dots distinguishing them from each other: تـ ، ـتـ ، ـت and قـ ، ـقـ ، ـق ، ق. Diacritics are optional: ق ، قَ ، قِ ، قُ ، قَّ ، قْ. Some letter combinations have special shapes (ligatures): ل + ا = لا. Letter elongations (kashida) are often used: قبل versus قبـــــــــــــــــــــــــــل. Letters are connected.
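A minimal sketch of one common way to neutralize two of the variations listed above, kashida and optional diacritics, before comparing words. The slides do not prescribe this step; the function name `normalize` and the choice of the Unicode range U+064B to U+0652 (fathatan through sukun) are assumptions made for illustration.

```python
import re

TATWEEL = "\u0640"  # Arabic kashida (elongation) character
# Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def normalize(text: str) -> str:
    """Drop kashida and diacritics so spelling variants compare equal."""
    return DIACRITICS.sub("", text.replace(TATWEEL, ""))


# The elongated form and the plain form of the same word now match.
print(normalize("قبـــــل") == normalize("قبل"))
```

This only addresses variation in clean text; it does nothing for dots or diacritics that an OCR engine has misread.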
10. Arabic OCR is Hard. Diacritics and dots are easily confused with one another. If the manuscript is old, they can also be confused with speckle on the page. Word error rate is typically greater than 20%.
11. Arabic OCR is Hard Typical OCR output وتامسوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدنالذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب
12. Arabic Morphology Challenges. Arabic uses complex derivational morphology: a root (e.g., ktb); a stem, which is a root in a template (e.g., mkAtbp); and a word, which is a stem with an optional determiner, preposition, coordinating conjunctions, plural suffix, etc. (e.g., w+Al+mkAtbp+At → wAlmkAtbAt). Estimated number of possible words: 60 billion. Morphology dictates diacritics, which change meaning, e.g., Elm (Eelm, Ealam, Eolem: knowledge, flag, acknowledge). No specific writing standard is prevalent, e.g., the trailing letters in Ely (Ali) and ElY (on) are often interchanged.
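Because the trailing letters ي (Buckwalter y) and ى (Buckwalter Y) are often interchanged, a search system may fold them together at index and query time. The sketch below shows that idea under the assumption that folding a word-final ى to ي is acceptable; the function name is hypothetical, and the slides do not say which direction, if any, the speaker normalizes.

```python
ALEF_MAQSURA = "\u0649"  # ى (Buckwalter Y)
YEH = "\u064A"           # ي (Buckwalter y)


def normalize_final_yeh(word: str) -> str:
    """Fold a word-final alef maqsura to yeh so both spellings match."""
    if word.endswith(ALEF_MAQSURA):
        return word[:-1] + YEH
    return word


# Both spellings from the slide (Ely and ElY) map to the same string.
print(normalize_final_yeh("على") == normalize_final_yeh("علي"))
```

The trade-off is that distinct words such as Ali and "on" become indistinguishable, which raises recall at some cost in precision.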
13. Arabic Morphology. For regular Arabic search, morphological analysis is typically used: full morphological analysis (Sebawai, Buckwalter, IBM Lee, AMIRA) or light stemming, which removes common prefixes and suffixes (Al-Stem or Light-10). On OCR output, these approaches fail.
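To make light stemming concrete, here is a rough sketch in Buckwalter transliteration. The affix lists are illustrative assumptions, not the actual tables of Al-Stem or Light-10, and `light_stem` and `min_len` are hypothetical names. The last call hints at why such stemmers break on OCR output: a single misread leading character (w read as r) hides the prefix.

```python
# Illustrative affix lists (assumed; not the real Al-Stem or Light-10 tables).
# Longer prefixes come first so that "wAl" is tried before "w".
PREFIXES = ["wAl", "bAl", "kAl", "fAl", "Al", "ll", "w"]
SUFFIXES = ["hA", "An", "At", "wn", "yn", "yp", "h", "p", "y"]


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len chars."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


print(light_stem("wAlmkAtbAt"))  # clean word
print(light_stem("mkAtbp"))      # its stem form
print(light_stem("rAlmkAtbAt"))  # same word with a simulated OCR error
```

The first two calls reduce to the same string, so they would match in an index; the third does not, because the corrupted prefix is never removed.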
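14. OCR Error Handling. Error correction, word-level techniques: dictionary lookup (Jurafsky & Martin, 2000); a character-level model that uses a confusion matrix (typically font dependent); a character n-gram model (some character sequences are more common than others, so the presence of a rare character sequence indicates the position of an error). The models combine as $\arg\max_{\mathrm{Word}_{Org}} P(\mathrm{Word}_{Org} \mid \mathrm{Word}_{OCR}) = \arg\max_{\mathrm{Word}_{Org}} P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{Org})\, P(\mathrm{Word}_{Org})$, where $P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{Org})$ is the character-level model and $P(\mathrm{Word}_{Org})$ is the word-level model.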
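The argmax above can be sketched as a tiny noisy-channel corrector. Everything numeric here is made up for illustration: `WORD_PROB` stands in for a word-level model, `CONFUSION` for a font-specific confusion matrix, and the channel is simplified to one-for-one character substitutions, which a real system would not assume.

```python
import math

# Hypothetical word-level model P(WordOrg), e.g. from a clean corpus.
WORD_PROB = {"cement": 0.2, "center": 0.7, "cannot": 0.1}

# Hypothetical character-level model P(ocr_char | org_char).
CONFUSION = {("m", "n"): 0.1, ("e", "o"): 0.05}
CORRECT = 0.9   # assumed probability that a character is read correctly
UNSEEN = 1e-4   # assumed floor for confusions not in the matrix


def channel_log_prob(ocr: str, org: str) -> float:
    """log P(WordOCR | WordOrg) under a substitution-only channel."""
    if len(ocr) != len(org):
        return float("-inf")
    total = 0.0
    for org_ch, ocr_ch in zip(org, ocr):
        p = CORRECT if org_ch == ocr_ch else CONFUSION.get((org_ch, ocr_ch), UNSEEN)
        total += math.log(p)
    return total


def correct(ocr: str) -> str:
    """Return the dictionary word maximizing P(WordOCR|WordOrg) * P(WordOrg)."""
    return max(
        WORD_PROB,
        key=lambda w: channel_log_prob(ocr, w) + math.log(WORD_PROB[w]),
    )


print(correct("cenent"))
```

Working in log space avoids underflow on long words; the more frequent word "center" loses here because its channel probability is far lower.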
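15. OCR Error Handling. Error correction, passage-level (context-sensitive) techniques: language modeling with a bigram or trigram LM, $P(\mathrm{Word}_{Org} \mid \mathrm{Word}_{OCR}) \propto P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{Org})\, P(\mathrm{Word}_{Org})\, P(\mathrm{Word}_{Org} \mid \mathrm{Word}_{Org-1})$, where $\mathrm{Word}_{Org-1}$ is the preceding word; and clustering words in a passage, which assumes salient terms appear more than once (e.g., Kennedy; Kemedy; Kennody; etc.).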
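The clustering idea can be sketched as grouping tokens that lie within a small edit distance of a more frequent token in the same passage. This is an assumed, simplified realization: the slides do not specify the similarity measure, and the threshold `max_dist=2` and the greedy assignment are illustrative choices.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cluster_variants(tokens, max_dist=2):
    """Greedily attach each token to the first more frequent token nearby."""
    clusters = {}
    for word, _count in Counter(tokens).most_common():
        for representative in clusters:
            if edit_distance(word, representative) <= max_dist:
                clusters[representative].append(word)
                break
        else:
            clusters[word] = [word]
    return clusters


passage = ["Kennedy", "Kemedy", "Kennedy", "Kennody", "speech", "Kennedy"]
print(cluster_variants(passage))
```

Each cluster's key is its most frequent member, which can then replace the rarer variants. A fixed threshold over-merges short words, so a length-relative threshold may be preferable in practice.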
16. OCR Error Handling. Multi-source fusion: uses language modeling to fuse the output of multiple OCR systems. Query garbling: uses a character-level model to generate multiple degraded versions of a query (e.g., cement => cement, cornent, cernont, etc.) and sets the degraded versions of a term as synonyms.
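Query garbling can be illustrated by expanding a term through a table of likely OCR confusions. The table below is a hypothetical stand-in for a trained character-level model, chosen so that it reproduces the slide's examples; `garble` and `max_variants` are illustrative names.

```python
from itertools import product

# Hypothetical confusions: original character -> possible OCR renderings.
CONFUSIONS = {"m": ["rn"], "e": ["o"]}


def garble(term: str, max_variants: int = 10) -> list:
    """Enumerate degraded versions of a term, starting with the term itself."""
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in term]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) == max_variants:
            break
    return variants


# The variants would be issued to the search engine as synonyms of the query term.
print(garble("cement"))
```

A real system would weight each variant by its probability under the character model and keep only the most likely ones, since the number of combinations grows quickly with term length.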
17. Arabic OCR Text Retrieval. Without error handling: use character n-grams (3- and 4-grams). Sample OCR output: وتام سوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدن الذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب. The misrecognized word رالفجار and the intended word والفجار still share most of their character 3-grams: رال ، الف ، فجا ، جار versus وال ، الف ، فجا ، جار.
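A short sketch of why character n-grams tolerate OCR errors: the misrecognized and intended words from the slide differ in one character, so most of their trigrams still coincide. The Jaccard overlap used here is an assumed similarity measure for illustration; a retrieval system would more likely index the n-grams as terms. A full sliding window also yields the trigram لفج, which the slide's list omits.

```python
def char_ngrams(word: str, n: int = 3) -> set:
    """All overlapping character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


ocr_word = "رالفجار"       # as it appears in the OCR output
intended_word = "والفجار"  # the intended word
print(sorted(char_ngrams(ocr_word) & char_ngrams(intended_word)))
print(ngram_overlap(ocr_word, intended_word))
```

Only the n-grams that touch the misread character are lost, so a query for the intended word still matches most of the n-grams indexed for the corrupted one.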
18. Presenting Results. Presenting OCR output to users is not an option. What would a ranked list of images look like? How would we generate image snippets? How do we highlight salient terms in these images?
20. Concluding Remarks. Scanning is a fairly mature technology. Arabic OCR has a long way to go. Quality of search is tied to the quality of OCR. Presentation issues persist.