Arabic Content with Apache Solr 
Ramzi Alqrainy
Ramzi Alqrainy 
• MSc in Computer Science, University of Jordan, Amman, Jordan
• Senior Enterprise Search / Data Engineer @ OpenSooq.com
• Technical reviewer for the books “Scaling Apache Solr” and “Apache Solr Search Patterns”
• Co-founder of the Solr.ar group
• Built 8 search engines for different models in the last 2 years
• Active blogger and presenter on information retrieval
Agenda 
• Why is Arabic Language Important ? 
• Arabic Language is Complex 
• How we use Apache Solr @ OpenSooq ? 
• Localization Concept with SolrCloud 
• Ranking and Relevancy 
• Apache Solr Implementations @ OpenSooq
Why is Arabic Language Important ?
Why is Arabic Language Important ? 
Sample Arabic document without dots
Why is Arabic Language Important ? 
Sample Arabic document with dots
Why is Arabic Language Important ? 
• Arabic is ranked fourth among the most-used languages on the web.
• The number of Arab Internet users grew from 65 million in 2011 to 135 million in 2013.
Arabic Language is Complex 
• Arabic Orthography and Print
§ Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words.
• Arabic Diacritics
§ Diacritics help disambiguate the meaning of words.
§ For example, the two words عَلَم (Alam, meaning “flag”) and عِلم (Eilm, meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.
Arabic Language is Complex 
• Arabic Morphology
§ Arabic words are divided into three main types: nouns, verbs, and particles.
§ Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.
Arabic Language is Complex 
• Arabic Dialects
§ There are 6 dominant dialects, with many more variations of them and dozens of less widely spoken dialects.
§ E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine.
• Arabizi (Transliteration)
§ Arabic is sometimes written using Latin characters in transliterated form.
§ Arabizi uses numerals to represent Arabic letters.
§ E.g., “2” and “3” represent the letters أ (which sounds like “a” as in apple) and ع (E) (a guttural “aa”) respectively.
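The digit convention above can be sketched in a few lines of Python. This is only an illustration: the mapping holds just the two digits named on the slide, the function name `digits_to_arabic` is hypothetical, and Latin letters are left untouched because full Arabizi transliteration is outside what the slide describes.

```python
# Digit-to-letter pairs taken from the slide; real Arabizi uses more digits.
ARABIZI_DIGITS = {
    "2": "\u0623",  # alef with hamza above
    "3": "\u0639",  # ain
}


def digits_to_arabic(token: str) -> str:
    """Replace Arabizi digits with the Arabic letters they stand for.

    Every other character is passed through unchanged.
    """
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in token)


if __name__ == "__main__":
    # The digit "3" is swapped for the letter ain; the Latin letters stay.
    print(digits_to_arabic("3arabi"))
```

A search front end could use a step like this to detect that a Latin-script query is probably Arabizi before attempting any fuller transliteration.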
How we use Apache Solr @ OpenSooq ? 
• A leading classified ads website in the Middle East and North Africa.
• Right now: on average more than 7K concurrent users.
• Activity per second: 240 APS, including:
§ Adding/editing/deleting posts
§ Adding comments
§ Sending messages to buyers/sellers, etc.
• More than 40K hits on Apache Solr per minute.
How we use Apache Solr @ OpenSooq ? 
• Arabic Search Engine
Arabic Normalization 
• There are common spelling mistakes that are widely accepted. For example, the verb ادرس (Adrs) in the imperative mood (meaning “study”, as a command) is often written as أدرس.
• Arabic content would be normalized according to the following steps:
§ Remove punctuation.
§ Remove diacritics (primarily weak vowels).
§ Remove non-letters.
§ Replace ا, إ, and أ with ا (A, alef) when they are the first letter of a word.
§ Replace final ى with ي (Ya).
§ Replace final ة with ه (Ha).
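The normalization steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code used at OpenSooq or inside Solr. The function names are hypothetical, and the Unicode ranges are assumptions: U+064B to U+0652 for diacritics and U+0621 to U+064A for letters (a range that also admits the tatweel character).

```python
import re

# Assumed ranges: Arabic diacritics (fathatan .. sukun) and basic Arabic letters.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
NON_LETTERS = re.compile(r"[^\u0621-\u064A\s]")

ALEF = "\u0627"                         # bare alef
HAMZA_ALEFS = ("\u0623", "\u0625")      # alef with hamza above / below
ALEF_MAKSURA, YA = "\u0649", "\u064A"
TA_MARBUTA, HA = "\u0629", "\u0647"


def normalize_word(word: str) -> str:
    """Apply the word-level replacements: initial alef, final ya, final ha."""
    if word[:1] in HAMZA_ALEFS:
        word = ALEF + word[1:]
    if word.endswith(ALEF_MAKSURA):
        word = word[:-1] + YA
    elif word.endswith(TA_MARBUTA):
        word = word[:-1] + HA
    return word


def normalize(text: str) -> str:
    """Normalize a string of Arabic text, word by word."""
    # Diacritics go first so that stripping non-letters does not split words.
    text = DIACRITICS.sub("", text)
    # Punctuation, digits and other non-letters become word boundaries.
    text = NON_LETTERS.sub(" ", text)
    return " ".join(normalize_word(w) for w in text.split())


if __name__ == "__main__":
    print(normalize("\u0623\u062F\u0631\u0633!"))
```

Applying the same function to both indexed text and queries is what makes the accepted misspellings match: both spellings of the imperative “study” collapse to the bare-alef form.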
Arabic Light Stemmer 
• A light stemmer is not dictionary driven. 
• This algorithm follows a rule-based prefix-removal mechanism.
Arabic Light Stemmer 
• The light stemmer, light10, outperformed the other approaches. It is becoming 
widely used in Arabic information retrieval.
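To make the rule-based idea concrete, here is a rough Python sketch of a light stemmer. It is not the light10 reference implementation: the affix lists and minimum-length rules are illustrative assumptions loosely modeled on light10-style stemmers, the sketch strips a few common suffixes as well as prefixes, and the `protected` parameter is a hypothetical way to exempt words from stemming.

```python
# Illustrative affix lists (assumptions, not an authoritative light10 table).
# "wa-al" is listed before the single-letter "wa" so the longer prefix wins.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word: str, protected: frozenset = frozenset()) -> str:
    """Strip at most one prefix, then any matching suffixes, using fixed rules."""
    if word in protected:
        return word  # protected words are returned untouched

    for prefix in PREFIXES:
        # Keep at least two letters after a prefix (three after a one-letter one).
        min_len = len(prefix) + (3 if len(prefix) == 1 else 2)
        if word.startswith(prefix) and len(word) >= min_len:
            word = word[len(prefix):]
            break

    for suffix in SUFFIXES:
        # Keep at least two letters in front of a removed suffix.
        if word.endswith(suffix) and len(word) >= len(suffix) + 2:
            word = word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["والكتاب", "المدرسة", "المعلمون"]:
        print(w, "->", light_stem(w))
```

No dictionary is consulted at any point, which is what makes the approach “light”: it is fast and predictable, at the cost of occasionally over- or under-stemming.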
Arabic Light Stemmer 
• Sometimes a stemmer might not do what you want out of the box.
• A protected-words list keeps selected words from being modified by stemmers.
Stop words and Synonyms
• Removing stop words is important to ensure high performance and improve recall: https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt
• Matching strings of tokens and replacing them with other strings of tokens will improve precision and recall.
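A toy Python version of these two steps might look like the following. The stop-word set is a three-word sample of common Arabic prepositions chosen for illustration (not the contents of the linked stopwords-ar.txt), and the synonym table maps the dialect forms of “I want” mentioned earlier in the talk to the Modern Standard Arabic form أريد, which is an assumed canonical target. In Solr itself, filters such as `StopFilterFactory` and `SynonymGraphFilterFactory` play these roles.

```python
# Sample stop words (assumption): "in", "from", "on".
STOPWORDS = {"في", "من", "على"}

# Dialect variants of "I want" mapped to one assumed canonical form.
SYNONYMS = {
    "عاوز": "أريد",  # Egyptian
    "أبغى": "أريد",  # Gulf
    "أبي": "أريد",   # Iraqi
    "بدي": "أريد",   # Levantine
}


def filter_tokens(tokens: list[str]) -> list[str]:
    """Drop stop words, then replace each token that has a synonym entry."""
    kept = [t for t in tokens if t not in STOPWORDS]
    return [SYNONYMS.get(t, t) for t in kept]


if __name__ == "__main__":
    # "I want a car in Amman", in Levantine dialect
    print(filter_tokens(["بدي", "سيارة", "في", "عمان"]))
```

Mapping every dialect variant to one form means a query in any of the four dialects retrieves the same posts, which is where the gain in recall comes from.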
Apache Solr Schema.xml 
• A text field that is appropriate for Arabic
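The field definition shown on the slide is not preserved in this transcript. As a rough stand-in, the sketch below builds a JSON payload in the shape accepted by Solr's Schema API `add-field-type` command, chaining the analysis steps discussed in this talk. The field type name `text_ar_sketch` and the file names `lang/stopwords_ar.txt` and `protwords.txt` are assumptions, and the filter order is one plausible arrangement rather than the author's actual schema.

```python
import json

# Hypothetical Arabic text field type, expressed as a Schema API command.
ARABIC_FIELD_TYPE = {
    "add-field-type": {
        "name": "text_ar_sketch",  # assumed name
        "class": "solr.TextField",
        "positionIncrementGap": "100",
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                # Remove stop words (file name is an assumption).
                {"class": "solr.StopFilterFactory",
                 "ignoreCase": "true",
                 "words": "lang/stopwords_ar.txt"},
                # Normalize alef variants, ya, ta marbuta, and diacritics.
                {"class": "solr.ArabicNormalizationFilterFactory"},
                # Mark protected words so the stemmer leaves them alone.
                {"class": "solr.KeywordMarkerFilterFactory",
                 "protected": "protwords.txt"},
                # Light stemming.
                {"class": "solr.ArabicStemFilterFactory"},
            ],
        },
    }
}

if __name__ == "__main__":
    # The printed JSON could be POSTed to a collection's /schema endpoint.
    print(json.dumps(ARABIC_FIELD_TYPE, indent=2))
```

The keyword marker has to sit before the stem filter, otherwise the protected words would already have been stemmed by the time they are marked.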
Localization Concept with SolrCloud
Ranking and Relevancy: Boost documents by age 
• Just do a descending sort by age = done? 
• Boost more recent documents and penalize older documents just for being old 
• Recency Boosting 
bf=recip(ms(NOW,post_inserted_date),3.16e-11,0.08,0.05)^5
Tune Solr Recip Function
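To see how the constants shape the boost, it helps to evaluate the function offline. Solr's `recip(x,m,a,b)` computes `a / (m*x + b)`. The sketch below plugs in the values from the recency-boosting slide, with `x` being the document age in milliseconds. The helper names and the sample ages are illustrative assumptions.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # roughly 3.16e10 ms


def recip(x: float, m: float, a: float, b: float) -> float:
    """Same formula as Solr's recip function: a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms: float, m: float = 3.16e-11,
                  a: float = 0.08, b: float = 0.05) -> float:
    """Boost for a document that is age_ms milliseconds old."""
    return recip(age_ms, m, a, b)


if __name__ == "__main__":
    # With m close to 1 / MS_PER_YEAR, m * x is roughly the age in years.
    for years in (0, 0.5, 1, 2, 5):
        boost = recency_boost(years * MS_PER_YEAR)
        print(f"{years:>4} years old -> boost {boost:.4f}")
```

With these constants a brand-new document gets `a / b = 1.6`, and the value falls off quickly during the first year. Raising `a` scales the whole curve, while raising `b` flattens the advantage of very recent documents.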
Solr Implementations @ OpenSooq ? 
§ Anti-Spam
§ Checking Relevancy
§ Tag Generation
§ Recommendation System
Thank You 
@RamziAlqrainy 
https://github.com/Ramzi-Alqrainy 
http://solr-enterprise-search-server.blogspot.com/
