Arabic Content with Apache Solr
Ramzi Alqrainy
Ramzi Alqrainy
•  MSc in Computer Science, University of
Jordan, Amman, Jordan
•  Senior Enterprise Search / Data Engineer @
OpenSooq.com
•  Technical Reviewer for “Scaling Apache Solr”
and “Apache Solr Search Patterns” (Books)
•  Co-founder of Solr.ar group
•  Built 8 search engines for different models in
the last 2 years
•  Active blogger and presenter on
Information Retrieval
Agenda
•  Why is Arabic Language Important ?
•  Arabic Language is Complex
•  How we use Apache Solr @ OpenSooq ?
•  Localization Concept with SolrCloud
•  Ranking and Relevancy
•  Apache Solr Implementations @ OpenSooq
Why is Arabic Language Important ?
Why is Arabic Language Important ?
Sample Arabic document without dots
Why is Arabic Language Important ?
Sample Arabic document with dots
Why is Arabic Language Important ?
•  The Arabic Language is ranked as the fourth
top language on the web
•  The number of Arab Internet users grew from
65 million in 2011 to 135 million in 2013
Arabic Language is Complex
•  Arabic Orthography and Print
§  Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words.
•  Arabic Diacritics
§  Diacritics help disambiguate the meaning of words.
§  For example, the two words عَلَم (Alam, meaning “flag”) and عِلم (Eilm, meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.
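The point about diacritics can be illustrated with a minimal Python sketch. It assumes that removing Unicode combining marks (category `Mn`) is an acceptable stand-in for diacritic removal; the function name `strip_diacritics` is hypothetical and not from the slides.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop Unicode combining marks (category Mn), which include Arabic short-vowel marks."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

flag = "\u0639\u064E\u0644\u064E\u0645"   # ain + fatha + lam + fatha + meem (Alam, "flag")
knowledge = "\u0639\u0650\u0644\u0645"    # ain + kasra + lam + meem (Eilm, "knowledge")

# Once the marks are gone, both words collapse to the same bare letters (Elm).
print(strip_diacritics(flag) == strip_diacritics(knowledge))
```

This is why a search index that drops diacritics matches both words for either query, at the cost of losing the distinction between them.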
  
Arabic Language is Complex
•  Arabic Morphology
§  Arabic words are divided into three main types: nouns, verbs, and particles.
§  Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.
Arabic Language is Complex
•  Arabic Dialects
§  There are 6 dominant dialects, with many more variations of them and dozens of less widely spoken dialects.
§  E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine.
•  Arabizi (Transliteration)
§  Arabic is sometimes written using Latin characters in transliterated form.
§  Arabizi uses numerals to represent Arabic letters.
§  E.g., “2” and “3” represent the letters أ (which sounds like “a” as in apple) and ع (E) (a guttural “aa”), respectively.
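As a rough illustration of the Arabizi digit convention, the Python sketch below maps only the two digits named above. The heuristic in `looks_like_arabizi` and both function names are assumptions made for this example; real Arabizi handling needs a much larger mapping and a transliteration model.

```python
# Only the two digit mappings mentioned on the slide; real Arabizi uses more.
ARABIZI_DIGITS = {"2": "\u0623", "3": "\u0639"}  # 2 -> alef with hamza, 3 -> ain

def looks_like_arabizi(token: str) -> bool:
    """Heuristic (an assumption): Latin letters mixed with an Arabizi digit."""
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    has_digit = any(ch in ARABIZI_DIGITS for ch in token)
    return has_latin and has_digit

def expand_arabizi_digits(token: str) -> str:
    """Replace Arabizi digits with their Arabic letters; leave other tokens alone."""
    if not looks_like_arabizi(token):
        return token
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in token)

print(looks_like_arabizi("3arabi"))   # mixed letters and an Arabizi digit
print(looks_like_arabizi("2024"))     # plain number, left untouched
```

The check on Latin letters keeps ordinary numbers such as prices and years from being rewritten.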
  	
  
How we use Apache Solr @ OpenSooq ?
•  A leading classifieds ads website in the Middle East and North Africa.
•  Right now : Average > 7K Concurrent Users.
•  Activity-Per-Second : 240 APS.
•  Adding/Editing/Deleting Post
•  Adding Comments
•  Sending Message to Buyer/Seller, etc.
•  More than 40k hits on Apache Solr Per Minute.
How we use Apache Solr @ OpenSooq ?
•  Arabic Search Engine
Arabic Normalization
•  There are common spelling mistakes that are widely accepted.
For example, the verb ادرس (Adrs) in the imperative mood (meaning “study”, as a command) would turn into أدرس.
•  Arabic content would be normalized according to the following steps:
§  Remove punctuation
§  Remove diacritics (primarily weak vowels)
§  Remove non-letters
§  Replace ا, إ, and أ with ا as the first letter of each word (A, alef)
§  Replace final ى with ي (Ya)
§  Replace final ة with ه (Ha)
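The normalization steps above can be sketched in Python as follows. This is a minimal sketch and not the code used at OpenSooq: the function names are hypothetical, diacritics are approximated by Unicode combining marks, and anything that is not a letter is treated as a word separator.

```python
import unicodedata

ALEF = "\u0627"                       # bare alef
HAMZA_ALEFS = "\u0623\u0625"          # alef with hamza above / below
ALEF_MAKSURA, YA = "\u0649", "\u064A"
TA_MARBUTA, HA = "\u0629", "\u0647"

def normalize_word(word: str) -> str:
    """Apply the per-word letter replacements from the list above."""
    if word and word[0] in HAMZA_ALEFS:
        word = ALEF + word[1:]
    if word.endswith(ALEF_MAKSURA):
        word = word[:-1] + YA
    elif word.endswith(TA_MARBUTA):
        word = word[:-1] + HA
    return word

def normalize(text: str) -> str:
    # Remove diacritics (combining marks).
    no_marks = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Remove punctuation and other non-letters by turning them into separators.
    letters_only = "".join(ch if ch.isalpha() else " " for ch in no_marks)
    return " ".join(normalize_word(w) for w in letters_only.split())

# Both spellings of the imperative "study" end up as the same indexed form.
print(normalize("\u0623\u062F\u0631\u0633") == normalize("\u0627\u062F\u0631\u0633"))
```

Applying the same function at index time and at query time is what makes the two spellings match.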
  
	
  
Arabic Light Stemmer
•  A light stemmer is not dictionary driven.
•  This algorithm follows a rule-based affix-removal mechanism: it strips common prefixes and suffixes.
Arabic Light Stemmer
•  The light stemmer, light10, outperformed the other approaches. It is becoming
widely used in Arabic information retrieval.
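To make the rule-based idea concrete, here is a simplified Python sketch in the spirit of a light stemmer. The affix lists follow the commonly cited description of light10 but are reproduced here as an assumption and should be checked against a reference implementation; the sketch strips at most one prefix and one suffix, and the name `light_stem` and the `MIN_STEM` threshold are hypothetical.

```python
# Prefixes, longest first: wal, bal, kal, fal, al, ll, w (assumed list).
PREFIXES = [
    "\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0643\u0627\u0644",
    "\u0641\u0627\u0644", "\u0627\u0644", "\u0644\u0644", "\u0648",
]
# Suffixes, longest first: ha, an, at, wn, yn, yh, yp, h, p, y (assumed list).
SUFFIXES = [
    "\u0647\u0627", "\u0627\u0646", "\u0627\u062A", "\u0648\u0646",
    "\u064A\u0646", "\u064A\u0647", "\u064A\u0629",
    "\u0647", "\u0629", "\u064A",
]
MIN_STEM = 2  # hypothetical threshold: never leave fewer than 2 letters

def light_stem(word: str) -> str:
    """Strip one matching prefix, then one matching suffix, using fixed rules only."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word

book = "\u0643\u062A\u0627\u0628"                 # "book"
the_book = "\u0627\u0644" + book                   # "the book"
and_the_book = "\u0648\u0627\u0644" + book         # "and the book"
print(light_stem(the_book) == light_stem(and_the_book) == book)
```

No dictionary lookup is involved, which is what keeps this kind of stemmer fast and easy to run inside an analysis chain.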
Arabic Light Stemmer
•  Sometimes a stemmer might not do what you want out of the box.
•  A protected-words list protects chosen words from being modified by stemmers.
Stop words and Synonyms
•  Removing stop words is important to ensure high performance and improve recall
https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt
•  Matching strings of tokens and replacing them with other strings of tokens will
improve precision and recall.
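A minimal Python sketch of the two filters follows. It is simplified to single-token synonyms, whereas the bullet above describes strings of tokens; the stop-word and synonym entries are hypothetical examples, not taken from the linked file.

```python
# Hypothetical entries for illustration only.
STOP_WORDS = {"\u0641\u064A", "\u0645\u0646"}  # "in", "from"
SYNONYMS = {
    # "mobile" (loanword) -> "jawwal" (another word for mobile phone)
    "\u0645\u0648\u0628\u0627\u064A\u0644": "\u062C\u0648\u0627\u0644",
}

def filter_tokens(tokens, stop_words=STOP_WORDS, synonyms=SYNONYMS):
    """Drop stop words, then map each remaining token to its canonical synonym."""
    result = []
    for token in tokens:
        if token in stop_words:
            continue
        result.append(synonyms.get(token, token))
    return result

query = ["\u0645\u0648\u0628\u0627\u064A\u0644", "\u0641\u064A", "\u0639\u0645\u0627\u0646"]
print(filter_tokens(query))  # the stop word is dropped and the synonym is rewritten
```

Running the same filter on documents and queries lets either spelling of a term find the same ads.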
Apache Solr Schema.xml
•  A text field that is appropriate for Arabic
Localization Concept with SolrCloud
Ranking and Relevancy: Boost documents by age
•  Just do a descending sort by age = done?
•  Boost more recent documents and penalize older documents just for being old
•  Recency Boosting
bf=recip(ms(sub(NOW,post_inserted_date)),3.16e-11,0.08,0.05)^5
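Solr's `recip(x,m,a,b)` function evaluates to `a / (m*x + b)`. The Python sketch below plugs in the constants from the boost above to show how the value decays with document age. The `^5` weight is left out, and the helper names are hypothetical.

```python
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000

def recip(x: float, m: float, a: float, b: float) -> float:
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)

def recency_boost(age_ms: float) -> float:
    # m = 3.16e-11 is roughly 1 / (milliseconds in a year),
    # so m * age_ms is approximately the document age in years.
    return recip(age_ms, 3.16e-11, 0.08, 0.05)

for years in (0, 1, 5):
    print(years, round(recency_boost(years * MS_PER_YEAR), 4))
```

With these constants a brand-new post scores a/b = 1.6, and the value falls to roughly 0.076 after one year, so tuning `a` and `b` controls how sharply older posts are penalized.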
  
Tune Solr Recip Function
Solr Implementations @ OpenSooq ?
§  Anti Spam
§  Checking Relevancy
§  Tag Generation
§  Recommendation System
Thank You
@RamziAlqrainy
https://github.com/Ramzi-Alqrainy
http://solr-enterprise-search-server.blogspot.com/
