The Arabic language poses several challenges for Natural Language Processing (NLP), largely because it has a very rich and sophisticated morphological system. This talk covers some of these challenges and how to solve them with Solr, and presents the challenges that were handled in OpenSooq’s use case.
3. Ramzi Alqrainy
• MSc in Computer Science, University of Jordan, Amman, Jordan
• Senior Enterprise Search / Data Engineer @ OpenSooq.com
• Technical Reviewer for “Scaling Apache Solr” and “Apache Solr Search Patterns” (books)
• Co-founder of Solr.ar group
• Built 8 search engines for different models in the last 2 years
• Active blogger and presenter on Information Retrieval
4. Agenda
• Why is Arabic Language Important ?
• Arabic Language is Complex
• How we use Apache Solr @ OpenSooq ?
• Localization Concept with SolrCloud
• Ranking and Relevancy
• Apache Solr Implementations @ OpenSooq
6. Why is Arabic Language Important ?
Sample Arabic document without dots
7. Why is Arabic Language Important ?
Sample Arabic document with dots
8. Why is Arabic Language Important ?
• The Arabic language is ranked as the fourth top language on the web
• The number of Arab Internet users grew from 65 million in 2011 to 135 million in 2013
9. Arabic Language is Complex
• Arabic Orthography and Print
§ Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words.
• Arabic Diacritics
§ Diacritics help disambiguate the meaning of words.
§ For example, the two words عَلَم (Alam - meaning “flag”) and عِلم (Eilm - meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.
10. Arabic Language is Complex
• Arabic Morphology
§ Arabic words are divided into three main types: nouns, verbs, and particles.
§ Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.
11. Arabic Language is Complex
• Arabic Dialects
§ There are 6 dominant dialects, with many variations of them, and dozens more less-spoken dialects.
§ E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine.
• Arabizi (Transliteration)
§ Arabic is sometimes written using Latin characters in transliterated form.
§ Arabizi uses numerals to represent Arabic letters.
§ E.g., “2” and “3” represent the letters أ (which sounds like “a” as in apple) and ع (E) (which is a guttural “aa”) respectively.
12. How we use Apache Solr @ OpenSooq ?
• A leading classifieds ads website in the Middle East and North Africa.
• Right now: more than 7K concurrent users on average.
• Activities per second: 240 APS, including:
§ Adding/editing/deleting posts
§ Adding comments
§ Sending messages to buyers/sellers, etc.
• More than 40K hits on Apache Solr per minute.
13. How we use Apache Solr @ OpenSooq ?
• Arabic Search Engine
14. Arabic Normalization
• There are common spelling mistakes that are widely accepted. For example, the verb ادرس (Adrs) in imperative mood (meaning “study”, in command form) would turn into أدرس.
• Arabic content would be normalized according to the following steps:
§ Remove punctuation
§ Remove diacritics (primarily weak vowels)
§ Remove non-letters
§ Replace آ, إ, and أ with ا (A - alef) when they are the first letter of a word
§ Replace final ى with ي (Ya)
§ Replace final ة with ه (Ha)
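The normalization steps above can be sketched in a few lines of Python. This is a minimal illustration, not OpenSooq's actual analyzer: the function names are hypothetical, the alef variants are assumed to be آ, أ, and إ, and removing the tatweel (elongation) character is an extra assumption that is not on the slide.

```python
import re

# Short vowels, tanween, shadda, sukun (U+064B..U+0652), plus tatweel
# (U+0640). Dropping tatweel is an assumption, not a step from the slide.
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Anything that is neither a basic Arabic letter nor whitespace
# (covers punctuation, digits, and Latin characters).
NON_LETTERS = re.compile("[^\u0621-\u063A\u0641-\u064A\\s]")
ALEF_VARIANTS = "\u0622\u0623\u0625"  # آ أ إ (assumed set)


def normalize_word(word):
    """Apply the word-level replacements to a single token."""
    if word and word[0] in ALEF_VARIANTS:
        word = "\u0627" + word[1:]        # leading alef variant -> ا
    if word.endswith("\u0649"):           # final ى -> ي
        word = word[:-1] + "\u064A"
    elif word.endswith("\u0629"):         # final ة -> ه
        word = word[:-1] + "\u0647"
    return word


def normalize(text):
    """Normalize a whole string and return space-separated tokens."""
    text = DIACRITICS.sub("", text)       # delete marks without splitting words
    text = NON_LETTERS.sub(" ", text)     # punctuation and non-letters become spaces
    return " ".join(normalize_word(w) for w in text.split())
```

Diacritics are deleted before non-letters are replaced with spaces, so that a vowel mark in the middle of a word does not split it into two tokens.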
15. Arabic Light Stemmer
• A light stemmer is not dictionary driven.
• This algorithm follows a rule-based prefix-removal mechanism.
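As a rough illustration of rule-based prefix removal, the sketch below strips a small set of definite-article prefixes without consulting any dictionary. The prefix list, the minimum stem length, and the function name are assumptions chosen for the example; they are not taken from the talk and this is not a full light stemmer.

```python
# Illustrative prefix list (definite article and common combinations).
# Assumed for this example; longer prefixes are listed first.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")


def light_stem(word, min_stem_len=2):
    """Strip at most one known prefix if enough of the word remains."""
    for prefix in PREFIXES:
        remainder = word[len(prefix):]
        if word.startswith(prefix) and len(remainder) >= min_stem_len:
            return remainder
    return word  # no rule applied


# Example: both forms reduce to the same stem, so they match at query time.
stems = [light_stem(w) for w in ["الكتاب", "والكتاب", "كتاب"]]
```

The minimum-length check keeps very short words from being reduced to a single letter.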
16. Arabic Light Stemmer
• The light stemmer, light10, outperformed the other approaches. It is becoming widely used in Arabic information retrieval.
17. Arabic Light Stemmer
• Sometimes a stemmer might not do what you want out of the box.
• A protected-words list keeps specific words from being modified by stemmers.
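The idea of protecting words can be sketched independently of any particular stemmer: tokens found in a protected set bypass stemming entirely. The helper names and the toy stemmer below are hypothetical and only serve the example; in Solr the same effect is usually configured with a protected-words file rather than written by hand.

```python
def strip_definite_article(word):
    """Toy stemmer for the example: drop a leading 'ال' if present."""
    return word[2:] if word.startswith("ال") and len(word) > 3 else word


def stem_tokens(tokens, stem, protected):
    """Stem every token except those listed in the protected set."""
    return [t if t in protected else stem(t) for t in tokens]


# Without protection "الله" would be over-stemmed to "له".
protected_words = {"الله"}
result = stem_tokens(["الكتاب", "الله"], strip_definite_article, protected_words)
```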
Stop words and Synonyms
• Removing stop words is important to ensure high performance and improve recall
https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt
• Matching strings of tokens and replacing them with other strings of tokens will improve precision and recall.
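A simplified view of these two steps on an already tokenized query: drop stop words, then map each remaining token to its canonical synonym. The stop-word set and the synonym pair below are small illustrative samples, not the contents of the linked file, and the sketch only handles single-token synonyms rather than multi-token strings.

```python
# Illustrative samples only (assumed for the example).
STOP_WORDS = {"في", "من", "على"}        # "in", "from", "on"
SYNONYMS = {"موبايل": "جوال"}            # two common words for "mobile phone"


def analyze(tokens):
    """Remove stop words, then replace tokens with their canonical synonym."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in kept]


# "موبايل في عمان" ("mobile in Amman") -> ["جوال", "عمان"]
query_terms = analyze(["موبايل", "في", "عمان"])
```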
20. Ranking and Relevancy: Boost documents by age
• Just do a descending sort by age = done?
• Boost more recent documents and penalize older documents just for being old
• Recency Boosting
bf=recip(ms(sub(NOW,post_inserted_date)),3.16e-11,0.08,0.05)^5
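To see what this boost does numerically, recall that Solr's recip(x, m, a, b) evaluates to a / (m*x + b), where x here is the post's age in milliseconds. The sketch below reproduces that arithmetic outside Solr. Treating ^5 as a plain multiplicative weight is a simplification of how Solr folds the boost into the final score, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recip(x, m, a, b):
    """Reciprocal function as defined by Solr: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(inserted, now, weight=5.0):
    """Approximate boost for a post, using the constants from the slide."""
    age_ms = max((now - inserted).total_seconds() * 1000.0, 0.0)
    return weight * recip(age_ms, 3.16e-11, 0.08, 0.05)


now = datetime(2015, 1, 1, tzinfo=timezone.utc)
fresh = recency_boost(now, now)                           # brand-new post
year_old = recency_boost(now - timedelta(days=365), now)  # one-year-old post
```

Because 3.16e-11 is roughly one divided by the number of milliseconds in a year, m*x is close to 1 for a one-year-old post, so the boost decays smoothly with age instead of imposing a hard sort by date.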