3. Ramzi Alqrainy
• MSc in Computer Science, University of
Jordan, Amman, Jordan
• Senior Enterprise Search / Data Engineer @
OpenSooq.com
• Technical Reviewer for “Scaling Apache Solr”
and “Apache Solr Search Patterns” (Books)
• Co-founder of Solr.ar group
• Built 8 search engines for different models in
the last 2 years
• Active blogger and presenter on
information retrieval
4. Agenda
• Why is Arabic Language Important?
• Arabic Language is Complex
• How we use Apache Solr @ OpenSooq?
• Localization Concept with SolrCloud
• Ranking and Relevancy
• Apache Solr Implementations @ OpenSooq
6. Why is Arabic Language Important?
Sample Arabic document without dots
7. Why is Arabic Language Important?
Sample Arabic document with dots
8. Why is Arabic Language Important?
• The Arabic Language is ranked as the fourth
top language on the web
• The number of Arab Internet users grew from
65 million in 2011 to 135 million in 2013
9. Arabic Language is Complex
• Arabic Orthography and Print
§ Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words.
• Arabic Diacritics
§ Diacritics help disambiguate the meaning of words.
§ For example, the two words عَلَم (Alam, meaning “flag”) and عِلم (Eilm, meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.
10. Arabic Language is Complex
• Arabic Morphology
§ Arabic words are divided into three main types: nouns, verbs, and particles.
§ Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.
11. Arabic Language is Complex
• Arabic Dialects
§ There are 6 dominant dialects, with many more variations of them, and dozens more less widely spoken dialects.
§ E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine.
• Arabizi (Transliteration)
§ Arabic is sometimes written using Latin characters in transliterated form.
§ Arabizi uses numerals to represent Arabic letters.
§ E.g., “2” and “3” represent the letters أ (which sounds like “a” as in apple) and ع (E) (a guttural “aa”), respectively.
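As a small illustration of the numeral convention above, the following Python sketch maps only the two digits mentioned on the slide to their Arabic letters. The name `map_arabizi_digits` and the decision to leave Latin letters untouched are assumptions made for this example; a real Arabizi converter would need far more rules.

```python
# Only the two digit mappings given on the slide; a full table would be larger.
ARABIZI_DIGITS = {
    "2": "\u0623",  # أ (alef with hamza above)
    "3": "\u0639",  # ع (ain)
}

def map_arabizi_digits(text):
    """Replace Arabizi digits with Arabic letters; other characters are kept as-is."""
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)
```

Calling `map_arabizi_digits("3")` returns the single letter ع, while Latin letters in the input pass through unchanged.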
12. How we use Apache Solr @ OpenSooq?
• A leading classified ads website in the Middle East and North Africa.
• Right now: more than 7K concurrent users on average.
• Activity per second: 240 APS.
§ Adding/editing/deleting posts
§ Adding comments
§ Sending messages to buyers/sellers, etc.
• More than 40K hits on Apache Solr per minute.
13. How we use Apache Solr @ OpenSooq?
• Arabic Search Engine
14. Arabic Normalization
• There are common spelling mistakes that are widely accepted. For example, the verb ادرس (Adrs) in the imperative mood (meaning “study” as a command) is often written as أدرس.
• Arabic content would be normalized according to the following steps:
§ Remove punctuation
§ Remove diacritics (primarily weak vowels)
§ Remove non-letters
§ Replace آ, إ, and أ with ا (A, alef) when it is the first letter of a word
§ Replace final ى with ي (Ya)
§ Replace final ة with ه (Ha)
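The normalization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the production analyzer: the function names are hypothetical, the diacritic range U+064B to U+0652 is an assumption about which marks to strip, and "non-letters" is approximated with `str.isalpha()`.

```python
import re

# Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")
ALEF_VARIANTS = "\u0622\u0623\u0625"  # آ أ إ
ALEF = "\u0627"                       # ا
ALEF_MAKSURA = "\u0649"               # ى
YA = "\u064A"                         # ي
TA_MARBUTA = "\u0629"                 # ة
HA = "\u0647"                         # ه

def normalize_word(word):
    """Apply the slide's normalization steps to a single word."""
    word = DIACRITICS.sub("", word)                    # remove diacritics
    word = "".join(ch for ch in word if ch.isalpha())  # drop punctuation and other non-letters
    if not word:
        return word
    if word[0] in ALEF_VARIANTS:                       # unify a leading alef variant
        word = ALEF + word[1:]
    if word[-1] == ALEF_MAKSURA:                       # final ى becomes ي
        word = word[:-1] + YA
    elif word[-1] == TA_MARBUTA:                       # final ة becomes ه
        word = word[:-1] + HA
    return word

def normalize(text):
    """Normalize whitespace-separated text word by word, dropping empty results."""
    words = (normalize_word(w) for w in text.split())
    return " ".join(w for w in words if w)
```

With this sketch, the misspelling أدرس and the form ادرس normalize to the same string, and the diacritized words عَلَم and عِلم both reduce to علم.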
15. Arabic Light Stemmer
• A light stemmer is not dictionary driven.
• This algorithm follows a rule-based prefix-removal mechanism.
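A rule-based prefix-removal step of the kind described above could look roughly like the following Python sketch. The prefix list is the one commonly cited for the light10 stemmer and is an assumption here, as are the minimum-length guards and the name `strip_prefix`; the slides do not list the actual rules, and suffix removal is left out entirely.

```python
# Assumed prefix list (commonly cited for light10), longest prefixes first
# so that "وال" is tried before "ال" and "و".
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]

def strip_prefix(token):
    """Remove at most one known prefix, if enough letters remain afterwards."""
    for prefix in PREFIXES:
        # Assumed guard: a one-letter prefix needs 3 remaining letters, others need 2.
        min_rest = 3 if len(prefix) == 1 else 2
        if token.startswith(prefix) and len(token) - len(prefix) >= min_rest:
            return token[len(prefix):]
    return token
```

No dictionary lookup is involved: the function only compares the start of the token against a fixed list, which is what makes the stemmer "light". The length guard keeps short words such as ولد ("boy") from losing their first letter.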
16. Arabic Light Stemmer
• The light stemmer, light10, outperformed the other approaches. It is becoming
widely used in Arabic information retrieval.
17. Arabic Light Stemmer
• Sometimes a stemmer might not do what you want out of the box.
• A protected-words list keeps specific words from being modified by the stemmer.
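The idea of protecting words from the stemmer can be sketched as a simple lookup before stemming. Everything here is hypothetical: the toy stemmer only strips a leading ال, and the protected set contains one example word. In Solr itself this role is played by a keyword-marker filter configured with a protected-words file placed before the stemmer in the analysis chain.

```python
def strip_definite_article(token):
    """Toy stemmer (assumption): drop a leading 'ال' if at least two letters remain."""
    if token.startswith("ال") and len(token) >= 4:
        return token[2:]
    return token

# Hypothetical protected-words list: the city name "الرياض" (Riyadh)
# should not be reduced to "رياض".
PROTECTED = {"الرياض"}

def stem(token, protected=PROTECTED):
    """Return protected tokens unchanged; stem everything else."""
    if token in protected:
        return token
    return strip_definite_article(token)
```

With the protected set in place, `stem("الرياض")` returns the word unchanged, while an ordinary word such as الكتاب is still stemmed.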
Stop words and Synonyms
• Removing stop words is important to ensure high performance and improve recall
https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt
• Matching sequences of tokens and replacing them with other sequences of tokens
improves precision and recall.
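Both ideas on this slide, dropping stop words and replacing token sequences with synonyms, can be sketched on a plain list of tokens. The stop-word set below is a tiny illustrative subset of common Arabic prepositions (not the contents of the linked file), and the synonym rules are hypothetical examples.

```python
# Tiny illustrative subset; the real list is much longer.
STOP_WORDS = {"في", "من", "على", "عن"}

# Hypothetical rules: a sequence of tokens maps to a replacement sequence.
SYNONYMS = {
    ("جوال",): ("موبايل",),
    ("هاتف", "محمول"): ("موبايل",),
}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

def apply_synonyms(tokens):
    """Replace the longest matching token sequence at each position."""
    max_len = max(len(key) for key in SYNONYMS)
    out = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                out.extend(SYNONYMS[key])
                i += len(key)
                break
        else:
            # No rule matched at this position: keep the token as-is.
            out.append(tokens[i])
            i += 1
    return out
```

Trying the longest sequence first means that a two-token phrase such as هاتف محمول is replaced as a unit rather than token by token.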
20. Ranking and Relevancy: Boost documents by age
• Just do a descending sort by age = done?
• Boost more recent documents and penalize older documents just for being old
• Recency Boosting
bf=recip(ms(sub(NOW,post_inserted_date)),3.16e-11,0.08,0.05)^5
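To see how the boost function above behaves numerically, here is a minimal Python sketch. It relies on Solr's documented definition recip(x, m, a, b) = a / (m*x + b), with x taken as the document age in milliseconds. Treating the trailing ^5 as a multiplicative weight is an assumption of this sketch, and the names `recency_boost` and `MS_PER_YEAR` are introduced only for illustration.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.16e10, so m = 3.16e-11 is roughly 1 / year

def recip(x, m, a, b):
    """Reciprocal function as defined by Solr: a / (m * x + b)."""
    return a / (m * x + b)

def recency_boost(age_ms, m=3.16e-11, a=0.08, b=0.05, weight=5):
    """Boost for a document of the given age; weight models the ^5 (assumption)."""
    return weight * recip(age_ms, m, a, b)

if __name__ == "__main__":
    for years in (0, 0.25, 1, 2, 5):
        age_ms = years * MS_PER_YEAR
        print(f"{years:>4} years old: boost = {recency_boost(age_ms):.3f}")
```

Before weighting, a brand-new document gets 0.08 / 0.05 = 1.6, and a one-year-old document gets roughly 0.076. The curve falls quickly at first and then flattens, so older documents are penalized without being pushed out of the results entirely, unlike a hard sort by date.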