Arabic-SOS
Segmenter, Stemmer, and Orthography Standardizer for the Arabic
Cultural Heritage
Emad Mohamed & Zeeshan Sayyed
May 2019
Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 1 / 47
Background
Emad Mohamed
Senior Lecturer, Research Group in Computational Linguistics,
University of Wolverhampton
Morphological Analysis, Syntactic Analysis, Computational Corpus
Linguistics, Language Resources
Zeeshan Ali Sayyed
PhD Candidate in Computer Science, Indiana University
Machine Learning, NLP, Parsing Morphologically-rich Languages
Research associate in the Arabic Cultural Analytics project at Doha
Institute for Graduate Studies, Qatar.
Roadmap
Segmentation
What is it, and why do we need it?
Data & Methods
Experiments & Results
Substandard Orthography
The Problem
Data & Methods
Experiments & Results
Effect of Substandard Orthography on Segmentation
Stemming as a by-product of segmentation
What is segmentation?
And why do we need it?
AlktAb: Al (DEF) + ktAb (NOUN)
wAlktAb: w (CONJ) + Al (DEF) + ktAb (NOUN)
wllktAb: w (CONJ) + l (PREP) + Al (DEF) + ktAb (NOUN)
wlktAbhm: w (CONJ) + l (PREP) + ktAb (NOUN) + hm (POSS)
Why is segmentation important?
Word2vec learns word vectors that can be used to find related words.
word2vec(book): book, books, novels, novel, manuscript, author,
fiction, essay, poem, poems
word2vec(ktAb): ktAb, AlktAb, wAlktAb, ktb, Alktb, llktAb, ktAby,
bAlktAb, fAlktAb, ktAbnA
Why is segmentation important?
Segmentation is a required step for, or significantly improves:
POS tagging
Parsing
Named Entity Recognition
Machine Translation
Lexical Analysis
...
Previous Work
Segmentation is a hot topic in Arabic NLP.
Many systems exist: MADA, AMIRA, MADAMIRA, FARASA
These systems handle Modern Standard Arabic or Colloquial Arabic
These systems fail on the Arabic cultural heritage. This is really
worrying given that Arabic is a continuum.
We focus on pre-MSA Arabic in this talk, but we are working
on a universal model for the Arabic language.
Previous Work
System MSA Accuracy CA Accuracy
MADAMIRA 98.3 94.3
FARASA 98.5 86
Table: Performance on Modern Standard Arabic and on Classical Arabic
Our approach to segmentation
Data & Annotation
Experiments
Results
Problems and Solutions
Annotation
Randomly select a corpus from the Qur’an, Hadith, Islamic Law,
Islamic Philosophy, and the Al-Manar Magazine (1898-1935).
Segment it using a model built on the Arabic Treebank (ATB).
Select the most frequent n-grams and pass these to an annotator.
Repeat this iteratively to build a test set, training set, and dev set.
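The n-gram selection step can be sketched roughly as follows. This is only an illustration, assuming whitespace-tokenized text and word n-grams (the talk does not say whether the n-grams are over words or characters). The function name `most_frequent_ngrams`, the values of `n` and `top_k`, and the toy sentence are hypothetical.

```python
from collections import Counter

def most_frequent_ngrams(tokens, n=2, top_k=3):
    """Return the top_k most frequent word n-grams as (ngram, count) pairs."""
    # zip over n shifted views of the token list yields consecutive n-grams
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(top_k)

# Toy, hypothetical corpus in Buckwalter transliteration
tokens = "qAl AlktAb w qAl AlktAb w qAl AlktAb".split()
top = most_frequent_ngrams(tokens, n=2, top_k=1)
```

The selected n-grams would then be handed to the annotator for correction before the next iteration.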
Set      Source                #Words
train 1  Al Manar              85,312
train 2  Al Manar + Classical  141,766
dev      Al Manar              23,786
test 1   Al Manar              24,005
test 2   Classical             5,299
Table: Statistics of the datasets used for the experiments
Gradient Boosting Machines
Sequential Ensemble Method.
Uses Regression Decision Trees.
Multiple Iterations
Each subsequent iteration focuses on those parts of the problem that
the previous iterations got wrong.
"There are only two Machine Learning algorithms: Gradient Boosting
& Neural Networks"
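The idea that each iteration concentrates on what earlier iterations got wrong can be illustrated with a toy sketch. It assumes squared-error loss, a single numeric feature, and one-split regression trees (stumps). It is not CatBoost and not the configuration used in the talk; all names and data below are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree: pick the threshold with least squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Sequentially add stumps, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # Residuals are what the ensemble so far still gets wrong
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy data, hypothetical
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 9]
model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

For squared error the residuals are the negative gradient of the loss, which is where the name comes from. Production libraries add regularization, categorical-feature handling, and classification losses on top of this loop.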
Features for Segmentation
Experiments: Algorithms
Compare SVMs, CRFs, and GBMs
Gradient Boosting Machines produce the best results
XGBoost
CatBoost (Yandex)
LightGBM (Microsoft)
Settled on CatBoost: best results
System                   Al Manar  Classical
CRF-Baseline             92.7      94.96
SOS (Manar)              97.18     97.17
SOS (Manar + Classical)  97.45     98.47
Table: Segmentation accuracy (%) of the CRF baseline and SOS on the Al Manar and Classical test sets
System Accuracy
SOS-Manar 97.17
SOS-Manar + Classical 98.47
Mohamed (2018) 96.8
MADAMIRA 94.7
SAPA 86.47
Table: Comparison with other segmenters on classical test set
Feature Ranking
Feature                Value
focus                  15.6501
next2letters           11.857
prev2letters           8.8664
focus word prefix      7.8821
plus1                  7.3599
focus word suffix      6.9752
plus3                  6.7646
plus2                  5.5329
minus1                 4.7142
prev word suffix       4.2443
chr position           3.2133
minus2                 3.1651
minus3                 2.7478
plus4                  2.5857
plus5                  2.566
following word prefix  2.5203
minus4                 2.1905
minus5                 1.1644
Table: Feature importances ranked by the model
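A possible reading of these feature names, sketched as a per-character feature extractor. The names come from the table, but their definitions here are assumptions: that classification is per character, that minusK/plusK are the characters K positions before and after the focus character within the same word, and that prefixes and suffixes are two characters long. The function `char_features` and the padding symbol are hypothetical.

```python
def char_features(words, w_idx, c_idx, window=5, affix_len=2, pad="_"):
    """Build a feature dict for one character of one word in a sentence."""
    word = words[w_idx]
    prev_word = words[w_idx - 1] if w_idx > 0 else ""
    next_word = words[w_idx + 1] if w_idx + 1 < len(words) else ""

    def char_at(i):
        # Positions outside the word are padded (assumption: no cross-word window)
        return word[i] if 0 <= i < len(word) else pad

    feats = {
        "focus": word[c_idx],
        "chr_position": c_idx,
        "prev2letters": char_at(c_idx - 2) + char_at(c_idx - 1),
        "next2letters": char_at(c_idx + 1) + char_at(c_idx + 2),
        "focus_word_prefix": word[:affix_len],
        "focus_word_suffix": word[-affix_len:],
        "prev_word_suffix": prev_word[-affix_len:],
        "following_word_prefix": next_word[:affix_len],
    }
    for k in range(1, window + 1):
        feats[f"minus{k}"] = char_at(c_idx - k)
        feats[f"plus{k}"] = char_at(c_idx + k)
    return feats

# Hypothetical sentence in Buckwalter transliteration; focus on the "A" of "wAlktAb"
feats = char_features(["qAl", "wAlktAb", "jdyd"], w_idx=1, c_idx=1)
```

Under this reading each character yields 18 features, matching the 18 rows of the table; only three of them look beyond the focus word.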
Takeaways
We have a good segmenter that achieves almost the same accuracy as the
MSA segmenters with less than one quarter of the training data, some of
which is noisy.
The context does not seem to help much as the most important
features are local.
Error analysis shows that ambiguity is the main culprit: most of the
ill-segmented words are ambiguous.
When we tried our segmenter on data available online, the
results were obviously worse. The reason: Substandard
Orthography
Substandard Orthography
The different forms of hamza
Part of the stem
Question word: Do, Does, Is, Has, etc.
The different forms of hamza
Part of the stem. Cannot be segmented.
The different forms of hamza
Part of the stem. Cannot be segmented.
Assimilated question word. Must be segmented.
The different forms of hamza
Part of the stem. Cannot be segmented.
Accusative marker. May be segmented.
Figure: Three forms of hamza forming minimal pairs
t vs. h
Part of the stem: Cannot be segmented
3rd Person singular pronoun. Must be segmented.
3rd Person singular possessive pronoun. Must be segmented.
t vs. h
Almost always a singular feminine marker. May be segmented
y vs. a
Part of the stem. Cannot be segmented
First person possessive pronoun. Must be segmented.
First person pronoun. Must be segmented.
Imperfective prefix. May be segmented
17 different functions in the Arabic Treebank.
y vs. a
Part of the stem. Cannot be segmented.
Standardizing the Orthography
Data
Methods
Evaluation
Effect of Orthography Standardization on Segmentation
Standardizing the Orthography: Data
Most available data are sub-standard
Formal publications: serious newspapers, magazines and books are
usually standard
The IslamWeb Library has published over 1,000 books, all of which are
rigorously checked and proofread.
We substandardize this data
We select a sub-corpus of 35,666,914 words
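One way to picture substandardization is as a character mapping that deliberately collapses the confusable letters, keeping the original character as the gold label. The sketch below is an assumption, not the authors' procedure: it uses Buckwalter transliteration, and both the mapping table and its direction (for example, alif maqsura written as yaa) are hypothetical choices based on the three confusion classes named in the talk (hamza forms, t vs. h, y vs. a).

```python
# Buckwalter symbols: > < | are alif with hamza or madda, A is bare alif,
# p is taa marbuta, h is haa, Y is alif maqsura, y is yaa.
# Hypothetical mapping from standard to substandard spelling.
SUBSTANDARD_MAP = {">": "A", "<": "A", "|": "A", "p": "h", "Y": "y"}

def substandardize(word):
    """Replace each standard character with its assumed substandard variant."""
    return "".join(SUBSTANDARD_MAP.get(ch, ch) for ch in word)

def training_pairs(word):
    """Pair each degraded character with the original character as its label."""
    return list(zip(substandardize(word), word))

# "<lY" ("to") and "mdrsp" ("school") in Buckwalter transliteration
degraded = [substandardize(w) for w in ["<lY", "mdrsp"]]
pairs = training_pairs("<lY")
```

Because the mapping is one character to one character, input and label stay aligned, so standardization can be framed as per-character classification in the same way as segmentation.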
Representing substandard orthography
Handling Substandard Orthography
We use the same set of features as we do in segmentation
Orthography Standardization Results
Effect of Standardization on Segmentation
Stemming
Stemming can be derived from segmentation through a rule-based
system
Remove all the affixes. Whatever remains is the stem
Theoretically, you need POS tagging to resolve some rare cases
of ambiguity
In practice, those cases are so rare that the numbers are never affected
Stemming is at least as accurate as segmentation
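The "remove all the affixes" rule can be sketched on top of segmenter output as follows. This is a minimal illustration: the clitic inventories are hypothetical and incomplete, the function name `stem_from_segments` is invented for this example, and the rare ambiguous cases that would need POS tagging are ignored.

```python
PROCLITICS = {"w", "f", "b", "l", "k", "Al"}   # hypothetical, incomplete
ENCLITICS = {"h", "hA", "hm", "nA", "y", "k"}  # hypothetical, incomplete

def stem_from_segments(segments):
    """Strip leading proclitics and trailing enclitics; return what remains."""
    start, end = 0, len(segments)
    # Always leave at least one segment so a bare clitic-like word is not emptied
    while start < end - 1 and segments[start] in PROCLITICS:
        start += 1
    while end - 1 > start and segments[end - 1] in ENCLITICS:
        end -= 1
    return "".join(segments[start:end])

# Segmentation of "wlktAbhm" as shown earlier in the talk
stem = stem_from_segments(["w", "l", "ktAb", "hm"])  # "ktAb"
```

Since the stem is read directly off the segmentation, any word that is segmented correctly is also stemmed correctly under these rules.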
Long-standing Problems & Possible Solutions
Problem: Most of the errors in segmentation are ambiguous words
Solution: Widen the context to include n previous/following words; add more data (hard); synthetic data
Problem: Data imbalance in orthography standardization
Solution: Try several methods of undersampling for the dominant class(es)
Problem: Substandard sometimes beats standardization
Solution: Still examining the problem
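As an illustration of the undersampling idea, here is one simple variant: random undersampling with a per-class cap. It is a sketch only; the talk does not say which methods were tried, and the function name, the cap, and the "same"/"changed" labels are hypothetical.

```python
import random
from collections import defaultdict

def undersample(examples, labels, max_per_class, seed=0):
    """Randomly keep at most max_per_class examples of each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    sampled = []
    for y, xs in by_label.items():
        # Only classes above the cap are down-sampled; rare classes stay whole
        keep = xs if len(xs) <= max_per_class else rng.sample(xs, max_per_class)
        sampled.extend((x, y) for x in keep)
    rng.shuffle(sampled)
    return sampled

# Hypothetical imbalance: most characters need no change
examples = list(range(100))
labels = ["same"] * 95 + ["changed"] * 5
balanced = undersample(examples, labels, max_per_class=10)
```

In orthography standardization the dominant class would presumably be "character is already correct", which is why capping it changes the class balance the model sees.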
Ongoing & Future Research
Treat both segmentation and standardization as sequence to sequence
problems
Joint standardization and segmentation
Create artificial data for segmentation (successful first attempts)
Transfer learning with contextual embeddings
Thank You
GitHub repo: https://github.com/zeeshansayyed/ArabicSOS
Email:
Zeeshan: zasayyed@indiana.edu
Emad: e.mohamed2@wlv.ac.uk
Thank you all for coming, and we hope you found this useful.
Please feel free to ask, suggest, or criticise.
"There is no such thing as a stupid question, only
stupid answers"

Session2 01.emad mohamed

  • 1. Arabic-SOS Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural Heritage Emad Mohamed & Zeeshan Sayyed May 2019 Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 1 / 47
  • 2. Background Emad Mohamed Senior Lecturer, Research Group in Computational Lingustics, University of Wolverhampton Morphological Analysis, Syntactic Analysis, Computational Corpus Linguistics, Language Resources Zeeshan Ali Sayyed PhD Candidate in Computer Science, Indiana University Machine Learning, NLP, Parsing Morphologically-rich Languages Research associate in the Arabic Cultural Analytics project at Doha Institute for Graduate Studies, Qatar. Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 2 / 47
  • 3. Roadmap Segmentation What is it, and why do we need it? Data & Methods Experiments & Results Substandard Orthography The Problem Data & Methods Experiments & Results Effect of Substandard Orthography on Segmentation Stemming as a by-product of segmentation Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 3 / 47
  • 4. whatissegmentation? andwhydoweneedit? Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 4 / 47
  • 5. AlktAb Al DEF ktAb NOUN Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 5 / 47
  • 6. wAlktAb w CONJ Al DEF ktAb NOUN Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 6 / 47
  • 7. wllktAb w CONJ l PREP Al DEF ktAb NOUN Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 7 / 47
  • 8. wlktAbhm w CONJ l PREP ktAb NOUN hm POSS Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 8 / 47
  • 9. Why is segmentation important Word2vec is an algorithm for finding related words. word2vec(book): book, books, novels, novel, manuscript, author, fiction, essay, poem, poems word2vec(ktAb): ktAb, AlktAb, wAlktAb, ktb, Alktb, llktAb, ktAby, bAlktAb, fAlktAb, ktAbnA Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 9 / 47
  • 10. Why is segmentation important Segmentation is a required step for, or significantly improves: POS tagging Parsing Named Entity Recognition Machine Translation Lexical Analysis ... Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 10 / 47
  • 11. Previous Work Segmentation is a hot topic in Arabic NLP. Many systems exist: MADA, AMIRA, MADAMIRA, FARASA These systems handle Modern Standard Arabic or Colloquial Arabic These systems fail on the Arabic cultural heritage. This is really worrying given that Arabic is a continuum. We focus on pre-MSA Arabic in this talk, but we are working on a universal model for the Arabic language. Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 11 / 47
  • 12. Previous Work System MSA Accuracy CA Accuracy MADAMIRA 98.3 94.3 FARASA 98.5 86 Table: Performance on Modern Standard Arabic and on Classical Arabic Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 12 / 47
  • 13. Our approach to segmentation: Data & Annotation; Experiments; Results; Problems and Solutions
  • 14. Annotation
      - Randomly select a corpus from the Qur’an, Hadith, Islamic Law, Islamic Philosophy, and the Al-Manar Magazine (1898-1935).
      - Segment it using a model built on the ATB (Arabic Treebank).
      - Select the most frequent n-grams and pass these to an annotator.
      - Repeat this iteratively to build a test set, a training set, and a dev set.
  • 15. Datasets
      Set       Source                 #Words
      train 1   Al Manar               85,312
      train 2   Al Manar + Classical   141,766
      dev       Al Manar               23,786
      test 1    Al Manar               24,005
      test 2    Classical              5,299
      Table: Statistics of the datasets used for the experiments
  • 16. Gradient Boosting Machines
      - Sequential ensemble method.
      - Uses regression decision trees.
      - Multiple iterations: each subsequent iteration focuses on those parts of the problem that the previous iterations got wrong.
  • 17. Gradient Boosting Machines
      - Sequential ensemble method.
      - Uses regression decision trees.
      - Multiple iterations: each subsequent iteration focuses on those parts of the problem that the previous iterations got wrong.
      - "There are only two Machine Learning algorithms: Gradient Boosting & Neural Networks"
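The idea of "each iteration focuses on what the previous ones got wrong" can be sketched in a few lines. This is a toy version for squared-error regression on a single numeric feature, using one-split trees (stumps); it is not the CatBoost setup used in the talk, and all function names are illustrative. For squared error, the residuals that each new stump is fitted to are the negative gradient of the loss.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree: pick the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds (needs >= 2 distinct x values)
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        l_mean = sum(left) / len(left)
        r_mean = sum(right) / len(right)
        sse = (sum((r - l_mean) ** 2 for r in left)
               + sum((r - r_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    _, t, l_mean, r_mean = best
    return t, l_mean, r_mean


def predict_stump(stump, x):
    t, l_mean, r_mean = stump
    return l_mean if x <= t else r_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.3):
    """Sequentially add stumps, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still wrong
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
    model = fit_gbm(xs, ys)
    print([round(predict_gbm(model, x), 3) for x in xs])
```

Production libraries such as XGBoost, CatBoost, and LightGBM add deeper trees, other loss functions, regularisation, and categorical-feature handling on top of this basic loop.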
  • 18. Features for Segmentation
  • 19. Experiments: Algorithms
      - Compared SVMs, CRFs, and GBMs.
      - Gradient Boosting Machines produce the best results. Implementations tried: XGBoost, CatBoost (Yandex), LightGBM (Microsoft).
      - Settled on CatBoost: best results.
  • 20. Segmentation accuracy (%)
      System                    Al Manar   Classical
      CRF-Baseline              92.7       94.96
      SOS (Manar)               97.18      97.17
      SOS (Manar + Classical)   97.45      98.47
      Table: Baselines segmentation accuracy using CRF
  • 21. Comparison with other segmenters
      System                  Accuracy
      SOS-Manar               97.17
      SOS-Manar + Classical   98.47
      Mohamed (2018)          96.8
      MADAMIRA                94.7
      SAPA                    86.47
      Table: Comparison with other segmenters on classical test set
  • 22. Feature Ranking
      Feature                 Value
      focus                   15.6501
      next2letters            11.857
      prev2letters            8.8664
      focus word prefix       7.8821
      plus1                   7.3599
      focus word suffix       6.9752
      plus3                   6.7646
      plus2                   5.5329
      minus1                  4.7142
      prev word suffix        4.2443
      chr position            3.2133
      minus2                  3.1651
      minus3                  2.7478
      plus4                   2.5857
      plus5                   2.566
      following word prefix   2.5203
      minus4                  2.1905
      minus5                  1.1644
      Table: Feature importances ranked by the model
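The feature names in the ranking suggest a per-character classifier over a window of neighbouring characters plus affixes of the neighbouring words. The slides do not define the features, so the sketch below is a guess at what they might look like: the padding symbol, the affix length of 3, and the choice to keep the character window inside the focus word are all assumptions, and `char_features` is a hypothetical name.

```python
PAD = "_"  # assumed padding symbol for positions outside the word


def char_features(words, w_idx, c_idx, affix_len=3):
    """Build a feature dict for character c_idx of words[w_idx].

    Feature names follow the ranking table; their definitions are assumed.
    """
    word = words[w_idx]
    prev_word = words[w_idx - 1] if w_idx > 0 else ""
    next_word = words[w_idx + 1] if w_idx + 1 < len(words) else ""

    def at(i):
        # Character at position i of the focus word, or padding if out of range.
        return word[i] if 0 <= i < len(word) else PAD

    feats = {
        "focus": word[c_idx],
        "chr_position": c_idx,
        "prev2letters": at(c_idx - 2) + at(c_idx - 1),
        "next2letters": at(c_idx + 1) + at(c_idx + 2),
        "focus_word_prefix": word[:affix_len],
        "focus_word_suffix": word[-affix_len:],
        "prev_word_suffix": prev_word[-affix_len:] or PAD,
        "following_word_prefix": next_word[:affix_len] or PAD,
    }
    # Single characters up to five positions to the left and right.
    for k in range(1, 6):
        feats[f"minus{k}"] = at(c_idx - k)
        feats[f"plus{k}"] = at(c_idx + k)
    return feats


if __name__ == "__main__":
    sentence = ["wAlktAb", "jdyd"]
    for name, value in char_features(sentence, 0, 1).items():
        print(f"{name:22} {value}")
```

Dictionaries like this can be passed to a gradient boosting library as categorical features, with one training row per character and a label saying whether a segment boundary follows it.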
  • 23. Takeaways
      - Our segmenter achieves almost the same accuracy as the MSA segmenters with less than one quarter of the training data, some of which is noisy.
      - Wider context does not seem to help much: the most important features are local.
      - Error analysis shows that ambiguity is the main culprit: most of the ill-segmented words are ambiguous.
      - When we tried our segmenter on data available online, the results were noticeably worse. The reason: substandard orthography.
  • 24. Substandard Orthography
  • 25. The different forms of hamza
  • 26. The different forms of hamza
      - Part of the stem.
      - Question word: Do, Does, Is, Has, etc.
  • 27. The different forms of hamza
      - Part of the stem. Cannot be segmented.
  • 28. The different forms of hamza
      - Part of the stem. Cannot be segmented.
      - Assimilated question word. Must be segmented.
  • 29. The different forms of hamza
      - Part of the stem. Cannot be segmented.
      - Accusative marker. May be segmented.
  • 30. Figure: Three forms of hamza forming minimal pairs
  • 31. t vs. h
  • 32. t vs. h
      - Part of the stem. Cannot be segmented.
      - 3rd person singular pronoun. Must be segmented.
      - 3rd person singular possessive pronoun. Must be segmented.
  • 33. t vs. h
      - Almost always a singular feminine marker. May be segmented.
  • 35. y vs. a
  • 36. y vs. a
      - Part of the stem. Cannot be segmented.
      - First person possessive pronoun. Must be segmented.
      - First person pronoun. Must be segmented.
      - Imperfective prefix. May be segmented.
      - 17 different functions in the Arabic Treebank.
  • 37. y vs. a
      - Part of the stem. Cannot be segmented.
  • 38. Standardizing the Orthography: Data; Methods; Evaluation; Effect of Orthography Standardization on Segmentation
  • 39. Standardizing the Orthography: Data
      - Most available data are substandard.
      - Formal publications (serious newspapers, magazines, and books) are usually standard.
      - The IslamWeb Library has published over 1,000 books, all of which are rigorously checked and proofread.
      - We substandardize this data.
      - We select a sub-corpus of 35,666,914 words.
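One way to read "we substandardize this data" is: take carefully proofread text, deliberately collapse the confusable letters, and keep the original letter as the label to recover. The sketch below assumes Buckwalter transliteration and assumes the collapses run in the direction hamza-bearing alif to bare alif, ta marbuta to ha, and alif maqsura to ya. The slides do not state the exact mapping, so `COLLAPSE` is a hypothetical table, not the one used by Arabic-SOS.

```python
# Assumed Buckwalter-level collapses (hypothetical, for illustration):
#   > < |  (alif with hamza above/below, alif madda) -> A (bare alif)
#   p      (ta marbuta)                              -> h (ha)
#   Y      (alif maqsura)                            -> y (ya)
COLLAPSE = {">": "A", "<": "A", "|": "A", "p": "h", "Y": "y"}


def substandardize(word):
    """Return the degraded spelling of a correctly spelled word."""
    return "".join(COLLAPSE.get(ch, ch) for ch in word)


def training_pairs(word):
    """Return (substandard character, standard character) pairs for one word.

    The first element is what a classifier would see, the second is the
    label it should predict.
    """
    return [(COLLAPSE.get(ch, ch), ch) for ch in word]


if __name__ == "__main__":
    for w in ["mdrsp", ">Hmd", "ElY"]:
        print(w, "->", substandardize(w), training_pairs(w))
```

Most characters map to themselves, which is consistent with the class imbalance mentioned later in the talk: the "unchanged" label dominates.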
  • 40. Representing substandard orthography
  • 41. Handling Substandard Orthography: We use the same set of features as we do in segmentation.
  • 42. Orthography Standardization Results
  • 43. Effect of Standardization on Segmentation
  • 44. Stemming
      - Stemming can be derived from segmentation through a rule-based system.
      - Remove all the affixes. Whatever remains is the stem.
      - Theoretically, you need POS tagging to resolve some rare cases of ambiguity.
      - In practice, those cases are so rare that the numbers are never affected.
      - Stemming is at least as accurate as segmentation.
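The rule "remove all the affixes; whatever remains is the stem" is short enough to sketch directly. The example assumes the segmenter returns (form, tag) pairs and uses only the four clitic tags that appear on the earlier slides (CONJ, PREP, DEF, POSS); the real tag inventory is certainly larger, and `AFFIX_TAGS` and `stem` are illustrative names.

```python
# Clitic/affix tags seen on slides 5-8; an assumed, incomplete inventory.
AFFIX_TAGS = {"CONJ", "PREP", "DEF", "POSS"}


def stem(segments):
    """Drop every segment tagged as an affix and join what remains."""
    return "".join(form for form, tag in segments if tag not in AFFIX_TAGS)


if __name__ == "__main__":
    segmented = [("w", "CONJ"), ("l", "PREP"), ("ktAb", "NOUN"), ("hm", "POSS")]
    print(stem(segmented))
```

Because the stem is read off the segmentation, a correctly segmented word always yields a correct stem, which is why stemming is at least as accurate as segmentation.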
  • 45. Long-standing Problems & Possible Solutions
      - Problem: Most of the errors in segmentation are ambiguous words. Solution: Widen the context to include n previous/following words; add more data (hard); synthetic data.
      - Problem: Data imbalance in orthography standardization. Solution: Try several methods of under-sampling for the dominant class(es).
      - Problem: Substandard sometimes beats standardization. Solution: Still examining the problem.
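The slide only says that several under-sampling methods will be tried. As a point of reference, the simplest of them, random under-sampling, might look like the sketch below. `undersample`, its `max_ratio` parameter, and the label names in the demo are hypothetical and not taken from the project.

```python
import random
from collections import Counter, defaultdict


def undersample(examples, max_ratio=1.0, seed=0):
    """Randomly cap every class at max_ratio times the size of the smallest class.

    examples is a list of (features, label) pairs.
    """
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    smallest = min(len(items) for items in by_label.values())
    cap = max(1, int(max_ratio * smallest))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = []
    for items in by_label.values():
        sampled.extend(items if len(items) <= cap else rng.sample(items, cap))
    rng.shuffle(sampled)
    return sampled


if __name__ == "__main__":
    data = [("x", "unchanged")] * 90 + [("y", "changed")] * 10
    balanced = undersample(data)
    print(Counter(label for _, label in balanced))
```

Raising `max_ratio` above 1.0 keeps more of the dominant class, which trades balance against the amount of training data retained.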
  • 46. Ongoing & Future Research
      - Treat both segmentation and standardization as sequence-to-sequence problems.
      - Joint standardization and segmentation.
      - Create artificial data for segmentation (successful first attempts).
      - Transfer learning with contextual embeddings.
  • 47. Thank You
      - GitHub repo: https://github.com/zeeshansayyed/ArabicSOS
      - Email: Zeeshan: zasayyed@indiana.edu; Emad: e.mohamed2@wlv.ac.uk
      - Thank you all for coming, and hope you found this useful. Please feel free to ask, suggest, or criticise.
      - "There is no such thing as a stupid question, only stupid answers."