Digitizing and Retrieving Printed Arabic Documents
Kareem Darwish, Senior Scientist, Qatar Computing Research Institute
Overview
  • Scanning
  • Some Magic
  • Search results
Scanning http://en.wikipedia.org/wiki/Book_scanning
Scanning http://www.kirtas.com
Result of Scanning http://www.colophon.com Courtesy of the Library of Alexandria
Magic: Optical Character Recognition
OCR output (Sakhr), courtesy of the Library of Alexandria:
من ناحيتى المراقبة والنيران على السهل الساحلى.  وطرق الاقتراب التى تسلكها أى قوات عربية من ناحية الشرتى تنح ر دى طرق  خمسة أهمها الثلاثة التالية:  ا- الطريق ا لأول وهو ا لأقصر من بغداد- هـ 2233 " المفرتى.، أو ا لانحراف إلى  الرطبة قبل 3أ3 دمشق الأردن، محاور ا لارابى من العرا إلى سوريا وا لأردد 2- الطريق الاثانى من بغداد " أبو كمال- بالميرا- دمشق- ا لأردد.  3- الطريق الثالسث وهو الأطول من بغ داد- الموصل- دير الزور- حملرو- دمشق- ا لأردن " 68
Arabic OCR is Hard
  • Letters change shape depending on their position in the word, with dots distinguishing them from each other: تـ ، ـتـ ، ـت and قـ ، ـقـ ، ـق ، ق
  • Diacritics are optional: ق ، قَ ، قِ ، قُ ، قَّ ، قْ
  • Some letter combinations have special shapes (ligatures): ل + ا = لا
  • Letter elongations (Kashida) are often used: قبل vs. قبـــــــــــــــــــــــــــل
  • Letters are connected
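
The shape variation listed above also shows up in text encodings, which matters once OCR output has to be compared against clean text. The following is a minimal sketch, assuming Python 3 and Unicode input; the function name `fold_shapes` is hypothetical and not from the talk. It folds positional presentation forms and the lam-alef ligature back to base letters, and removes Kashida (tatweel) and the optional diacritics.

```python
import re
import unicodedata

TATWEEL = "\u0640"                          # Kashida elongation character
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan .. sukun (short vowels, shadda)

def fold_shapes(text: str) -> str:
    """Reduce Arabic text to base letters (illustrative helper, an assumption)."""
    # NFKC maps positional presentation forms and the lam-alef ligature
    # to the plain letters they stand for.
    text = unicodedata.normalize("NFKC", text)
    # Elongation carries no lexical information, so drop it.
    text = text.replace(TATWEEL, "")
    # Diacritics are optional in print, so drop them for matching purposes.
    return DIACRITICS.sub("", text)

if __name__ == "__main__":
    elongated = "\u0642\u0628\u0640\u0640\u0640\u0644"  # qaf, beh, tatweel x3, lam
    print(fold_shapes(elongated) == "\u0642\u0628\u0644")
```

This only normalizes text that has already been recognized; it does not address the recognition problem itself, where the shapes have to be told apart in the image.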
Arabic OCR is Hard
  • Diacritics and dots are easily confused. If the manuscript is old, they can also be confused with speckle on the page
  • Word error rate is typically greater than 20%!
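
For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between the reference text and the OCR output, divided by the number of reference words. The sketch below assumes whitespace tokenization and that definition; the talk does not say how its 20% figure was measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # reference word deleted
                cur[j - 1] + 1,                       # extra word inserted
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)

if __name__ == "__main__":
    # Two words merged into one count as a substitution plus a deletion.
    print(word_error_rate("the cat sat", "thecat sat"))
```

Under this definition, merged or split words are penalized more than once, which is one reason word-level error rates on connected scripts can be high.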
Arabic OCR is Hard
Typical OCR output:
وتامسوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدنالذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب
Arabic Morphology Challenges
  • Arabic uses complex derivational morphology:
      • Root (ex. ktb)
      • Stem: root in a template (ex. mkAtbp)
      • Word: stem with optional determiner, preposition, coordinating conjunctions, plural suffix, etc. (ex. w+Al+mkAtbp+At → wAlmkAtbAt)
  • Estimated number of possible words: 60 billion
  • Morphology dictates diacritics, which change meaning. Ex. Elm (Eelm, Ealam, Eolem: knowledge, flag, acknowledge)
  • No specific writing standard is prevalent. Ex. the trailing letters in Ely (Ali) and ElY (on) are often interchanged
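
One common response in Arabic retrieval to the interchangeable trailing letters is to map both spellings to a single form before indexing. The sketch below is an illustration of that idea, not a method stated in the talk; the function name `normalize_final_yeh` is hypothetical.

```python
ALEF_MAKSURA = "\u0649"  # written Y in the transliteration used on the slide
YEH = "\u064A"           # written y in the transliteration used on the slide

def normalize_final_yeh(word: str) -> str:
    """Map a trailing alef maksura to yeh so both spellings share one index key."""
    if word.endswith(ALEF_MAKSURA):
        return word[:-1] + YEH
    return word

if __name__ == "__main__":
    ely = "\u0639\u0644\u064A"  # Ely
    elY = "\u0639\u0644\u0649"  # ElY
    print(normalize_final_yeh(ely) == normalize_final_yeh(elY))
```

The trade-off is that genuinely different words, such as the two in the slide's example, become indistinguishable after normalization.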
Arabic Morphology
  • For regular Arabic search, morphological analysis is typically used:
      • Full morphological analysis: Sebawai, Buckwalter, IBM Lee, AMIRA
      • Light stemming (remove common prefixes and suffixes): Al-Stem or Light-10
  • On OCR output, they fail
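
To make the failure concrete, here is a toy light stemmer over the transliteration used on the previous slide. The affix lists and the minimum stem length are illustrative assumptions, far smaller than (and not taken from) Al-Stem or Light-10.

```python
# Illustrative affix lists only; real light stemmers use longer lists.
PREFIXES = ("wAl", "Al", "w")  # checked longest first
SUFFIXES = ("At", "p")
MIN_STEM = 3                   # assumed lower bound on what must remain

def light_stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word

if __name__ == "__main__":
    print(light_stem("wAlmkAtbAt"))  # clean word
    print(light_stem("rAlmkAtbAt"))  # same word with the first letter misrecognized
```

A single misrecognized character in the prefix is enough to stop the prefix from being stripped, so the clean and the degraded word end up with different stems and no longer match at search time.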
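OCR Error Handling
  • Error correction, word level techniques:
      • Dictionary lookup (Jurafsky & Martin, 2000)
      • Character level model: uses a confusion matrix; typically font dependent
      • Character n-gram model: some character sequences are more common than others, so the presence of a rare character sequence indicates the position of an error
  • $$\arg\max_{\mathrm{Word}_{Org}} P(\mathrm{Word}_{Org} \mid \mathrm{Word}_{OCR}) = \arg\max_{\mathrm{Word}_{Org}} P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{Org}) \, P(\mathrm{Word}_{Org})$$
    where $P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{Org})$ is the char level model and $P(\mathrm{Word}_{Org})$ is the word level model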
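
The product of a character level model and a word level model can be sketched as follows. This is a simplified illustration under stated assumptions: only one-for-one character substitutions are modeled (real OCR errors also merge and split characters), and the probabilities, the `p_same` and `p_other` defaults, and the function names are all made up for the example.

```python
import math

def channel_logprob(ocr, org, confusion, p_same=0.9, p_other=1e-4):
    """log P(ocr | org) under a substitution-only character confusion model."""
    if len(ocr) != len(org):
        return float("-inf")  # this toy model cannot explain length changes
    total = 0.0
    for org_ch, ocr_ch in zip(org, ocr):
        if org_ch == ocr_ch:
            p = p_same
        else:
            # confusion[(original char, OCR char)] = P(OCR char | original char)
            p = confusion.get((org_ch, ocr_ch), p_other)
        total += math.log(p)
    return total

def correct(ocr, word_prob, confusion):
    """Pick the dictionary word maximizing P(ocr | word) * P(word)."""
    best_word, best_score = ocr, float("-inf")
    for word, prob in word_prob.items():
        score = channel_logprob(ocr, word, confusion) + math.log(prob)
        if score > best_score:
            best_word, best_score = word, score
    return best_word  # falls back to the OCR token if nothing scores

if __name__ == "__main__":
    word_prob = {"wAlfjAr": 0.002, "wAlfxAr": 0.001}  # invented unigram values
    confusion = {("w", "r"): 0.08}                    # invented: w read as r
    print(correct("rAlfjAr", word_prob, confusion))
```

Because the confusion table is estimated from a particular OCR engine and typeface, it reflects the slide's point that the character level model is typically font dependent.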
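OCR Error Handling
  • Error correction, passage level/context sensitive techniques:
      • Using language modeling (bi- or trigram LM):
        $$\arg\max_{\mathrm{Word}_{Org}} P(\mathrm{Word}_{Org} \mid \mathrm{Word}_{OCR}) = \arg\max_{\mathrm{Word}_{Org}} P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{Org}) \, P(\mathrm{Word}_{Org}) \, P(\mathrm{Word}_{Org} \mid \mathrm{Word}_{Org-1})$$
        where $\mathrm{Word}_{Org-1}$ is the preceding word
      • Clustering words in a passage: assumes salient terms appear more than once. Ex. Kennedy; Kemedy; Kennody; etc.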
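
A minimal sketch of the context sensitive score, multiplying the three factors from the slide for each candidate. All probability values, the `floor` used for unseen events, and the function name are assumptions for illustration; a real system would estimate them from data and use proper smoothing.

```python
import math

def best_in_context(prev_word, channel, unigram, bigram, floor=1e-6):
    """Choose the candidate maximizing
    P(ocr | cand) * P(cand) * P(cand | prev_word).

    channel: candidate -> P(observed OCR token | candidate)
    unigram: candidate -> P(candidate)
    bigram:  (previous word, candidate) -> P(candidate | previous word)
    """
    def score(cand):
        return (math.log(channel[cand])
                + math.log(unigram.get(cand, floor))
                + math.log(bigram.get((prev_word, cand), floor)))
    return max(channel, key=score)

if __name__ == "__main__":
    # Invented numbers for the OCR token "Kemedy".
    channel = {"Kennedy": 0.2, "remedy": 0.3}
    unigram = {"Kennedy": 0.001, "remedy": 0.002}
    bigram = {("President", "Kennedy"): 0.05, ("herbal", "remedy"): 0.1}
    print(best_in_context("President", channel, unigram, bigram))
    print(best_in_context("herbal", channel, unigram, bigram))
```

The same OCR token resolves to different words depending on the preceding word, which is what the context term adds over the word level model alone.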
OCR Error Handling
  • Multi-source fusion: uses language modeling to fuse the output of multiple OCR systems
  • Query garbling: use a character level model to generate multiple degraded versions of a query
      • Ex.: cement => cement, cornent, cernont, etc.
      • Set degraded versions of a term as synonyms
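
Query garbling can be sketched as enumerating the ways a small confusion table might corrupt a term. The table below (m read as rn, e read as o) is inferred from the slide's example only, and the `max_edits` cap and function name are assumptions; a real character level model would also attach a probability to each variant.

```python
from itertools import product

def garble(term, confusions, max_edits=2):
    """Return the term plus degraded versions with at most max_edits confusions."""
    # For every character, the choices are: keep it, or apply a known confusion.
    options = [[ch] + confusions.get(ch, []) for ch in term]
    variants = set()
    for combo in product(*options):
        edits = sum(1 for ch, out in zip(term, combo) if out != ch)
        if edits <= max_edits:
            variants.add("".join(combo))
    return variants

if __name__ == "__main__":
    confusions = {"m": ["rn"], "e": ["o"]}  # inferred from the slide's example
    for variant in sorted(garble("cement", confusions)):
        print(variant)
```

The resulting set would then be issued to the search engine as synonyms of the original term, so that documents containing any degraded form can still match. The enumeration grows exponentially with term length, so the cap on edits (or a probability threshold) is what keeps it practical.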
Arabic OCR Text Retrieval
  • Without error handling: use character n-grams (3- and 4-grams)
  • Sample OCR output: وتام سوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدن الذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب
  • Ex.: the OCR output رالفجار and the intended word والفجار share most of their character n-grams:
      • رالفجار: رال ، الف ، فجا ، جار
      • والفجار: وال ، الف ، فجا ، جار
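
The n-gram idea can be sketched in a few lines: index overlapping character n-grams instead of whole words, so a word with one misrecognized letter still shares most of its index terms with the clean form. The function name is hypothetical, and counting shared n-grams is only an illustration of the overlap, not the ranking function used in the talk.

```python
def char_ngrams(word, n):
    """All overlapping character n-grams of a word, left to right in string order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

if __name__ == "__main__":
    ocr_word = "رالفجار"    # OCR output from the sample above
    clean_word = "والفجار"  # the intended word
    ocr_grams = set(char_ngrams(ocr_word, 3))
    clean_grams = set(char_ngrams(clean_word, 3))
    shared = ocr_grams & clean_grams
    print(len(shared), "of", len(clean_grams), "trigrams shared")
```

Only the n-grams that cover the damaged character differ; the rest still match, which is why this works without any explicit error correction step.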
Presenting Results
  • Presenting OCR output to users is not an option
  • What would a ranked list of images look like?
  • How would we generate image snippets?
  • How do we highlight salient terms in these images?
Presenting Results
  • What is the unit of search? Is it the book, the chapter, or the page?
Concluding Remarks
  • Scanning is a fairly mature technology
  • Arabic OCR has quite a ways to go
  • Quality of search is tied to the quality of OCR
  • Presentation issues persist

Dr. Kareem Darwish's presentation at QITCOM 2011
