SPELLCHECKING IN TROVIT: IMPLEMENTING A CONTEXTUAL
MULTI-LANGUAGE SPELLCHECKER FOR CLASSIFIED ADS
Xavier Sanchez Loro
R&D Engineer
xavier@trovit.com
Outline
•  Introduction
•  Our approach: Contextual Spellchecking
•  Nature and characteristics of our document corpus
•  Spellcheckers in Solr
•  White-listing and purging: controlling dictionary data
•  Spellchecker configuration
•  Customizing Solr's SpellcheckComponent
•  Conclusions and Future Work
Supporting text for this talk
•  Trovit Engineering blog post on spellchecking
   http://tech.trovit.com/index.php/spellchecking-in-trovit/
INTRODUCTION
Introduction
•  Trovit: a search engine for classified ads
Introduction: spellchecking in Trovit
•  Multi-language spellchecking system using Solr and Lucene
•  Objectives:
   –  help our users better find the desired ads
   –  avoid the dreaded 0 results as much as possible
   –  not only pure orthographic correction, but also suggesting correct
      searches for a given site.
OUR APPROACH: CONTEXTUAL
SPELLCHECKING
Contextual Spellchecking: approach
•  The key element in the spellchecking process is choosing the right
   dictionary
   –  one with a relevant vocabulary
      •  according to the type of information included in each site.
•  Approach
   –  specializing the dictionaries based on the user's search context.
•  Search contexts are composed of:
   –  country (with a default language)
   –  vertical (determining the type of ads and vocabulary).
Contextual Spellchecking: vocabularies
•  Each site's document corpus has a limited vocabulary
   –  reduced to the type of information, language and terms included in each
      site's ads.
•  A more generalized approach is not suitable for our needs
   –  one vocabulary per language is less precise than a specialized
      vocabulary per site.
   –  drastic differences in:
      •  the type of terms
      •  the semantics of each vertical.
   –  terms that are relevant in one context are meaningless in another.
•  Different vocabularies for each site, even when supporting the same
   language.
   –  the vocabulary is tailored to the context of searches.
NATURE AND CHARACTERISTICS
OF OUR DOCUMENT CORPUS
Challenges: inconsistencies in our corpus
•  The document corpus is fed by different third-party sources
   –  providing the ads for the different sites.
•  We can detect incorrect documents and reconcile certain inconsistencies
   –  but we cannot control or modify the content of the ads themselves.
•  Inconsistencies
   –  hinder any language detection process
   –  pose challenges to the development of the spellchecking system
Inconsistencies example
•  Spanish homes vertical
   –  not fully written in Spanish
   –  ads in several languages:
      •  native languages: Spanish, Catalan, Basque and Galician.
      •  foreign languages: English, German, French, Italian, Russian… even
         Asian languages like Chinese!
      •  multi-language ads
   –  badly written and misspelled words
      •  Spanish words badly translated from regional languages
      •  overtly misspelled words
         –  e.g. "picina" yields 1,197 docs vs. 1,048,434 for "piscina"
            (≈0.1%)
   –  "noisy" content
      •  numbers, postal codes, references, etc.
Characteristics of our ads
•  Summarizing:
   –  segmented corpus in different indexes, one per country plus vertical
      (site)
   –  3rd-party generated
   –  ads in the national language plus other languages (regional and
      foreign)
   –  multi-language content in ads
   –  noisy content (numbers, references, postal codes, etc.)
   –  small texts (around 3,000 characters long)
   –  misspellings and incorrect words
•  The corpus is too unreliable to use as the knowledge base for building a
   spellchecking dictionary.
What/Where search segmentation
•  With what/where segmentation: geolocation data is not mixed with vertical
   data
   –  the spelling dictionary holds only vertical data (no geodata)
   –  narrower dictionary, fewer collisions, more controllable
•  Without segmentation: geolocation data is interleaved with vertical data
   –  the dictionary must cover all geodata
   –  wider dictionary, more collisions, less controllable
SPELLCHECKERS IN SOLR
IndexBasedSpellchecker
•  Creates a parallel index for the spelling dictionary, based on an existing
   Lucene index.
   –  depends on index data correctness (misspellings)
   –  creates an additional index from the current index (small, MBs)
   –  supports term frequency parameters
   –  must be (re)built
•  Even though this component behaves as expected
   –  it was of no use for Trovit's use case.
IndexBasedSpellchecker
•  It depends on index data
   –  not an accurate and reliable source for the spellchecking dictionary.
•  Continuous builds are required
   –  to keep index data and spelling-index data in sync.
   –  otherwise:
      •  frequency information and hit counting are neither reliable nor
         accurate
      •  false positives/negatives
      •  suggestions of words with a different number of hits, even 0.
•  We cannot risk suffering this situation.
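For reference, registering this checker in solrconfig.xml looks roughly as follows; this is a minimal sketch, and the "content" field and index directory are illustrative names, not Trovit's actual configuration:

```xml
<!-- Sketch of an IndexBasedSpellChecker entry in solrconfig.xml.
     "content" and the spellcheckIndexDir value are placeholders. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <!-- source field for the parallel spelling index -->
    <str name="field">content</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- rebuild the parallel spelling index on every commit,
         keeping it in sync with the main index -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```

The buildOnCommit option is what makes the "continuous builds" above automatic, at the cost of extra work on every commit.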
FileBasedSpellChecker
•  Uses a flat file to generate a spelling dictionary in the form of a Lucene
   spellchecking index.
   –  requires a dictionary file
   –  creates an additional index from the dictionary file (small, MBs)
   –  does not depend on index data (controlled data)
   –  build once
      •  rebuild only if the dictionary is updated
   –  no frequency information used when calculating spelling suggestions
FileBasedSpellChecker
•  Also requires rebuilds
   –  albeit less frequently
•  No frequency-related data
   –  pure orthographic correction is not our main goal
   –  we cannot risk suggesting corrections without results.
•  But it provided
   –  insight on how to approach the final solution we are implementing
   –  the highest degree of control over dictionary contents
      •  an essential feature for spelling dictionaries.
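A minimal FileBasedSpellChecker entry is sketched below; the dictionary is a plain-text file with one term per line, and the file and directory names are illustrative:

```xml
<!-- Sketch of a FileBasedSpellChecker entry in solrconfig.xml.
     spellings.txt is a controlled, one-term-per-line dictionary file. -->
<lst name="spellchecker">
  <str name="name">file</str>
  <str name="classname">solr.FileBasedSpellChecker</str>
  <str name="sourceLocation">spellings.txt</str>
  <str name="characterEncoding">UTF-8</str>
  <str name="spellcheckIndexDir">./spellcheckerFile</str>
</lst>
```

The spelling index is built from the file once (spellcheck.build=true on a request), and only needs rebuilding when the dictionary file changes.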
DirectSpellChecker
•  Experimental spellchecker that uses the main Solr index directly
   –  build/rebuild is not required
   –  depends on index data correctness (misspellings)
   –  uses the existing index
      •  field: source of the spelling dictionary.
   –  supports term frequency parameters.
•  Several promising features
   –  no build, and continuously in sync with index data.
   –  provides accurate frequency information.
DirectSpellChecker
•  The real drawback
   –  lack of control over the index data sourcing the spelling dictionary.
•  If we can overcome that, this type makes an ideal candidate for our use
   case.
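A DirectSolrSpellChecker entry along these lines reads suggestions straight from the main index; the "spell" field name and the threshold values are illustrative choices, not the talk's exact settings:

```xml
<!-- Sketch of a DirectSolrSpellChecker entry in solrconfig.xml.
     "spell" is a placeholder field name; tune thresholds per site. -->
<lst name="spellchecker">
  <str name="name">direct</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- field sourcing the spelling dictionary -->
  <str name="field">spell</str>
  <str name="distanceMeasure">internal</str>
  <!-- minimum similarity a suggestion must reach -->
  <float name="accuracy">0.5</float>
  <int name="maxEdits">2</int>
  <int name="minPrefix">1</int>
  <int name="minQueryLength">4</int>
  <!-- skip suggesting for terms that appear in more than 1% of docs -->
  <float name="maxQueryFrequency">0.01</float>
</lst>
```

Because there is no parallel index, frequency information always reflects the live index, which is the "continuously in sync" property noted above.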
WordBreakSpellChecker
•  Generates suggestions by combining adjacent words and/or breaking words
   into multiples.
   –  can be configured alongside a traditional checker (e.g.
      DirectSolrSpellChecker).
   –  the results are combined, and collations can contain a mix of
      corrections from both spellcheckers.
   –  uses the existing index; no build.
WordBreakSpellChecker
•  A good complement to the other spellcheckers
•  Works really well with well-written concatenated words
   –  able to break them up with great accuracy.
•  Combining split words is not as accurate
•  Drawback: it is based on index data.
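A WordBreakSolrSpellChecker entry can sit next to the traditional checker in the same SpellCheckComponent; again a sketch with an illustrative field name:

```xml
<!-- Sketch of a WordBreakSolrSpellChecker entry, registered alongside
     a traditional checker inside the same SpellCheckComponent. -->
<lst name="spellchecker">
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  <str name="field">spell</str>
  <!-- combine adjacent words: "garaje jardin" -> "garajejardin" -->
  <str name="combineWords">true</str>
  <!-- break concatenations: "piscinagaraje" -> "piscina garaje" -->
  <str name="breakWords">true</str>
  <int name="maxChanges">10</int>
</lst>
```

At query time both dictionaries can be requested together (e.g. spellcheck.dictionary=direct&spellcheck.dictionary=wordbreak), so collations may mix corrections from both.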
WHITE-LISTING AND PURGING:
CONTROLLING DICTIONARY
DATA
White-listing
•  Any spelling system is only as good as its knowledge base (its dictionary)
   is accurate.
•  We need to control the data indexed as the dictionary.
•  White-listing approach
   –  we only index spelling data contained in a controlled dictionary list.
   –  processes to build a base dictionary specialized for a given site.
White-list building process
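One way to apply such a whitelist at index time is to purge the spelling field through its analyzer; this is a sketch under the assumption of a KeepWordFilterFactory-based chain (the field type and whitelist file names are hypothetical):

```xml
<!-- Sketch: a field type whose analyzer keeps only whitelisted tokens.
     "spell_text" and "whitelist_es_homes.txt" are placeholder names. -->
<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop every token not present in the per-site whitelist,
         so noisy or misspelled terms never reach the dictionary -->
    <filter class="solr.KeepWordFilterFactory"
            words="whitelist_es_homes.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Indexing the spelling field through such a chain guarantees the dictionary only ever contains controlled terms, whatever the ads say.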
SPELLCHECKER CONFIGURATION
Initial spellchecker configuration
•  DirectSpellChecker using a purged spell field
   –  spell field filled with purged content
      •  purging according to a whitelist
      •  whitelist generated by matching the dictionary against index words,
         after the purge process
•  Benefits:
   –  build is no longer required.
   –  the spell field is automatically updated via the pipeline.
   –  we can work with term frequencies.
   –  no additional index, just an additional field.
   –  better relevance and suggestions.
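In schema terms, "just an additional field" means a dedicated spell field fed by copyField rules; a sketch, with illustrative source-field names:

```xml
<!-- Sketch of the schema side: a dedicated, purged spelling field.
     "spell_text" and the source fields are placeholder names. -->
<field name="spell" type="spell_text" indexed="true" stored="false"
       multiValued="true"/>
<!-- feed the spell field from the searchable ad content -->
<copyField source="title" dest="spell"/>
<copyField source="description" dest="spell"/>
```

Since the spell field is populated on every document update, the dictionary stays in sync with the index with no separate build step.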
Initial spellchecker configuration
•  Cons:
   –  whitelist maintenance, and creation for new sites.
•  Features:
   –  accurate detection of misspelled words.
   –  good detection of concatenated words.
      •  piscinagarajejardin → piscina garaje jardin
      •  picina garajejardin → piscina (garaje jardin)
   –  able to detect several misspelled words.
   –  evolution based on whitelist fine-tuning.
Initial spellchecker configuration
•  Issues:
   –  false positives: suggesting corrections when words are correctly
      spelled.
   –  suggestions for all the words in the query, not just the misspelled
      ones.
   –  misleading "correctlySpelled" parameter.
      •  the parameter depends on frequency information, making it unreliable
         for our purposes.
      •  it returns true/false according to thresholds,
         –  not really depending on word distance, but on
         –  results found and the "alternativeTermCount" and
            "maxResultsForSuggest" thresholds.
   –  minor discrepancies if we only index boosted terms (i.e. qf)
      •  # hits in spell field < # docs in index
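The thresholds in question are request parameters; a sketch of how they are typically wired into a request handler's defaults (handler name and values are illustrative):

```xml
<!-- Sketch: spellcheck defaults on a search handler. The parameter
     names are Solr's; the specific values are illustrative. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">direct</str>
    <str name="spellcheck.count">5</str>
    <!-- also suggest alternatives for terms already in the index -->
    <str name="spellcheck.alternativeTermCount">2</str>
    <!-- treat queries with more hits than this as "correctly spelled" -->
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

With these semantics, "correctlySpelled" is driven by hit counts against the thresholds rather than by edit distance, which is exactly the unreliability described above.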
CUSTOMIZING SOLR
SPELLCHECKCOMPONENT
Hacking SpellcheckComponent
•  Lack of reliability of the "correctlySpelled" parameter
   –  difficult to know when to give a suggestion or not.
   –  first policy based on document hits
      •  a sliding window
         –  based on the number of queried terms
      •  the longer the tail, the smaller the threshold
      •  inaccurate and prone to collisions.
   –  difficult to set thresholds to a good level of accuracy.
•  We needed a more reliable way.
Hacking SpellcheckComponent: correctlySpelled parameter behaviour
•  Binary approach to deciding whether a word is correctly spelled.
•  Simpler approach
   –  any term that appears in our spelling field is a correctly spelled word
      •  regardless of the value of its frequency info or the configured
         thresholds.
   –  this way the parameter can be used to control when to start querying
      the spellchecking index.
Hacking SpellcheckComponent
•  Other changes to the SpellcheckComponent:
   –  no suggestions when words are correctly spelled.
   –  suggestions only for the misspelled words, not for all words
      •  e.g. piscina garage → piscina garaje
•  Spanish-friendly ASCIIFoldingFilter
   –  modified so as not to fold the "ñ" (for Spanish) and "ç" (for Catalan
      names) characters.
      •  avoids collisions with similar words containing "n" and "c"
         –  e.g. "pena" and "peña"
   –  still folds accented vowels
      •  usually omitted by users.
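A similar effect can be obtained without patching the filter, by folding through an explicit mapping that simply has no rules for "ñ" and "ç"; this is a sketch of that alternative, not the talk's actual implementation, and "mapping-es.txt" is a hypothetical file name:

```xml
<!-- Sketch: selective folding via MappingCharFilterFactory instead of
     a modified ASCIIFoldingFilter. mapping-es.txt would contain rules
     like:  "á" => "a"   "é" => "e"   "í" => "i"   "ó" => "o"
            "ú" => "u"   "ü" => "u"
     and deliberately no rules for "ñ" or "ç", which pass through. -->
<analyzer>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-es.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Either approach keeps "pena" and "peña" as distinct dictionary entries while still matching users who type unaccented vowels.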
CONCLUSIONS AND FUTURE
WORK
Conclusion & Future Work
•  Base code
   –  expand the spellchecking process to other sites
   –  design the final policy for deciding when to give suggestions.
•  Geodata in homes verticals
   –  find ways to avoid collisions in large dictionary sets.
•  Scoring system for the spelling dictionary
   –  control suggestions based on user input
      •  feedback on the relevance and quality of our spellchecking
         suggestions.
      •  make the system more accurate and reliable
      •  expand whitelists to cover large amounts of geodata
         –  with acceptable levels of precision.
Conclusion & Future Work
•  Plural suggester
   –  suggest alternative searches and corrections using plural or singular
      variants of the terms in the query.
   –  use frequency and scoring information to choose the most suitable
      suggestions.
THANKS FOR YOUR ATTENTION!
ANY QUESTIONS?
References
[1] Lucene/Solr Revolution EU 2013. Dublin, 6-7 November 2013.
    http://www.lucenerevolution.org/
[2] Trovit – a search engine for classified ads for real estate, jobs, cars
    and vacation rentals. http://www.trovit.com
[3] Apache Software Foundation. "Apache Solr". https://lucene.apache.org/solr/
[4] Apache Software Foundation. "Apache Lucene". https://lucene.apache.org
[5] Apache Software Foundation. "Spell Checking – Apache Solr Reference
    Guide". https://cwiki.apache.org/confluence/display/solr/Spell+Checking

From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Dernier (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecker for Classified Ads

  • 2. xavier@trovit.com
SPELLCHECKING IN TROVIT: IMPLEMENTING A CONTEXTUAL MULTI-LANGUAGE SPELLCHECKER FOR CLASSIFIED ADS
Xavier Sanchez Loro, R&D Engineer
  • 3. Outline
– Introduction
– Our approach: Contextual Spellchecking
– Nature and characteristics of our document corpus
– Spellcheckers in Solr
– White-listing and purging: controlling dictionary data
– Spellchecker configuration
– Customizing Solr’s SpellcheckComponent
– Conclusions and Future Work
  • 4. Supporting text for this talk: Trovit Engineering Blog post on spellchecking — http://tech.trovit.com/index.php/spellchecking-in-trovit/
  • 6. Introduction
Trovit: a search engine for classified ads
  • 8. Introduction: spellchecking in Trovit
• Multi-language spellchecking system using Solr and Lucene
• Objectives:
– help our users better find the desired ads
– avoid the dreaded 0 results as much as possible
– the goal is not only pure orthographic correction, but also suggesting correct searches for a given site
  • 10. Contextual Spellchecking: approach
• The key element in the spellchecking process is choosing the right dictionary
– one with a relevant vocabulary, according to the type of information included in each site
• Approach: specializing the dictionaries based on the user’s search context
• Search contexts are composed of:
– country (with a default language)
– vertical (determining the type of ads and vocabulary)
  • 11. Contextual Spellchecking: vocabularies
• Each site’s document corpus has a limited vocabulary
– reduced to the type of information, language and terms included in each site’s ads
• A more generalized approach is not suitable for our needs
– one vocabulary per language is less precise than a specialized vocabulary per site
– drastic differences in the type of terms and in the semantics of each vertical
– terms that are relevant in one context are meaningless in another
• Different vocabularies for each site, even when supporting the same language
– vocabulary is tailored to the context of searches
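The contextual approach boils down to a lookup: each (country, vertical) pair selects its own specialized dictionary. A minimal, language-agnostic Python sketch of the idea (the dictionary contents and the helper name are invented for illustration, not Trovit’s actual data or code):

```python
# One specialized spelling dictionary per search context (country, vertical).
# Contents are illustrative toy examples.
DICTIONARIES = {
    ("es", "homes"): {"piscina", "garaje", "jardin", "atico"},
    ("es", "cars"):  {"diesel", "furgoneta", "descapotable"},
    ("uk", "homes"): {"garden", "garage", "detached"},
}

def dictionary_for(country, vertical):
    """Pick the spelling dictionary matching the user's search context."""
    return DICTIONARIES[(country, vertical)]

# A term relevant in one context is absent in another:
"piscina" in dictionary_for("es", "homes")  # -> True
"piscina" in dictionary_for("es", "cars")   # -> False
```

This is why a single per-language vocabulary would be less precise: "piscina" belongs in the Spanish homes dictionary but is noise in the Spanish cars one.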
  • 12. NATURE AND CHARACTERISTICS OF OUR DOCUMENT CORPUS
  • 13. Challenges: inconsistencies in our corpus
• The document corpus is fed by different third-party sources
– providing the ads for the different sites
• We can detect incorrect documents and reconcile certain inconsistencies
– but we cannot control or modify the content of the ads themselves
• Inconsistencies
– hinder any language detection process
– pose challenges to the development of the spellchecking system
  • 14. Inconsistencies example
• Spanish homes vertical
– not fully written in Spanish
– ads in several languages:
• native languages: Spanish, Catalan, Basque and Galician
• foreign languages: English, German, French, Italian, Russian… even Asian languages like Chinese!
• Multi-language ads
– badly written and misspelled words
• Spanish words badly translated from regional languages
• overtly misspelled words
– e.g. “picina” appears in 1,197 docs vs 1,048,434 for “piscina” (roughly 0.1%)
– “noisy” content
• numbers, postal codes, references, etc.
  • 15. Characteristics of our ads
• Summarizing:
– segmented corpus in different indexes, one per country plus vertical (site)
– 3rd-party generated
– ads in the national language plus other languages (regional and foreign)
– multi-language content in ads
– noisy content (numbers, references, postal codes, etc.)
– small texts (around 3,000 characters long)
– misspellings and incorrect words
• The corpus is too unreliable to use as the knowledge base for building a spellchecking dictionary.
  • 16. What/Where search segmentation
– Geolocation data not mixed with vertical data: the dictionary holds only vertical data (no geodata) → narrower dictionary, fewer collisions, more controllable
– Geolocation data interleaved with vertical data: the dictionary must cover all geodata → wider dictionary, more collisions, less controllable
  • 18. IndexBasedSpellChecker
• Creates a parallel index for the spelling dictionary, based on an existing Lucene index
– depends on index data correctness (misspellings)
– creates an additional index from the current index (small, MBs)
– supports term frequency parameters
– must be (re)built
• Even though this component behaves as expected, it was of no use for Trovit’s use case.
  • 19. IndexBasedSpellChecker
• It depends on index data
– not an accurate and reliable source for the spellchecking dictionary
• Continuous builds are needed to keep the spelling index in sync with the index data; otherwise
– frequency information and hit counting are neither reliable nor accurate
• false positives/negatives
• suggestions of words with a different number of hits, even 0
• We cannot risk suffering this situation.
  • 20. FileBasedSpellChecker
• Uses a flat file to generate a spelling dictionary in the form of a Lucene spellchecking index
– requires a dictionary file
– creates an additional index from the dictionary file (small, MBs)
– does not depend on index data (controlled data)
– built once; rebuilt only if the dictionary is updated
– no frequency information used when calculating spelling suggestions
  • 21. FileBasedSpellChecker
• Also requires rebuilds, albeit less frequently
• No frequency-related data
– pure orthographic correction is not our main goal
– we cannot risk suggesting corrections without results
• But it gave us
– insight on how to approach the final solution we are implementing
– the highest degree of control over dictionary contents
• an essential feature for spelling dictionaries
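To make the file-based idea concrete, here is a hedged Python sketch: a controlled word list plus classic Levenshtein edit distance, suggesting the closest in-dictionary terms. This only mimics the concept; the real FileBasedSpellChecker builds a Lucene spellchecking index and uses n-gram candidate lookup rather than scanning the whole dictionary.

```python
# Sketch of dictionary-file spellchecking: the vocabulary is a controlled
# list (as if loaded from a flat file), not (possibly misspelled) index data.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_distance=2):
    """Return in-dictionary words within max_distance of `word`, closest first."""
    candidates = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in candidates if 0 < d <= max_distance]

dictionary = {"piscina", "garaje", "jardin", "terraza"}
suggest("picina", dictionary)   # -> ["piscina"]
suggest("piscina", dictionary)  # -> [] (already correct, nothing to suggest)
```

Note there is no frequency information anywhere in this sketch, which is exactly the limitation the slide points out: without term frequencies we cannot tell whether a suggestion would actually return results.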
  • 22. DirectSpellChecker
• Experimental spellchecker that uses the main Solr index directly
– no build/rebuild required
– depends on index data correctness (misspellings)
– uses the existing index; a field acts as the source of the spelling dictionary
– supports term frequency parameters
• Several promising features
– no build, plus continuously in sync with the index data
– provides accurate frequency information
  • 23. DirectSpellChecker
• The real drawback: lack of control over the index data sourcing the spelling dictionary
• If we can overcome it, this type makes an ideal candidate for our use case.
  • 24. WordBreakSpellChecker
• Generates suggestions by combining adjacent words and/or breaking words into multiple ones
– can be configured alongside a traditional checker (e.g. DirectSolrSpellChecker)
– the results are combined, and collations can contain a mix of corrections from both spellcheckers
– uses the existing index; no build
  • 25. WordBreakSpellChecker
• A good complement to the other spellcheckers
• Works really well with well-written concatenated words
– able to break them up with great accuracy
• Combining split words is not as accurate
• Drawback: it is based on index data
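The word-break behaviour can be sketched as a simple dictionary-driven segmentation (illustrative only; the real WordBreakSpellChecker works against the Lucene index, also combines adjacent words, and bounds the number of splits it attempts):

```python
# Sketch of breaking a concatenated query term into known dictionary words,
# as in the deck's example "piscinagarajejardin" -> "piscina garaje jardin".

def word_breaks(term, vocabulary):
    """Return a list of dictionary words covering `term`, or None if no split works."""
    if term in vocabulary:
        return [term]
    for i in range(1, len(term)):
        head, tail = term[:i], term[i:]
        if head in vocabulary:
            rest = word_breaks(tail, vocabulary)  # recurse on the remainder
            if rest is not None:
                return [head] + rest
    return None

vocab = {"piscina", "garaje", "jardin"}
word_breaks("piscinagarajejardin", vocab)  # -> ["piscina", "garaje", "jardin"]
word_breaks("xyzzy", vocab)                # -> None (nothing recoverable)
```

As the slide notes, this direction (breaking well-written concatenations) is the accurate one; the inverse direction, joining split words, is harder because many adjacent pairs happen to form valid index terms.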
  • 27. White-listing
• Any spelling system is only as good as its knowledge base (its dictionary) is accurate
• We need to control the data indexed as the dictionary
• White-listing approach
– only index spelling data contained in a controlled dictionary list
– processes to build a base dictionary specialized for a given site
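A minimal sketch of the purge step, assuming the whitelist is a plain set of terms (the term lists below are invented examples): only terms present in the controlled dictionary survive into the spelling field, so noise such as references, postal codes and misspellings never reaches the spellchecker.

```python
# Sketch of white-list purging: filter raw index terms against a controlled
# dictionary before they are used as spellchecking data.

def purge(index_terms, whitelist):
    """Keep only the index terms that appear in the controlled whitelist."""
    return sorted(t for t in set(index_terms) if t in whitelist)

# Raw index terms include noise and misspellings:
index_terms = ["piscina", "picina", "ref1234", "garaje", "08021"]
whitelist = {"piscina", "garaje", "jardin"}
purge(index_terms, whitelist)  # -> ["garaje", "piscina"]
```

The combination described later in the deck follows from this: a DirectSpellChecker pointed at a purged field gets both the control of a file-based dictionary and the live frequency information of an index-based one.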
  • 30. Initial spellchecker configuration
• DirectSpellChecker using a purged spell field
– spell field filled with purged content
• purging according to the whitelist
• whitelist generated by matching the dictionary against index words, after the purge process
• Benefits:
– build is no longer required
– the spell field is automatically updated via the pipeline
– we can work with term frequencies
– no additional index, just an additional field
– better relevance and suggestions
  • 31. Initial spellchecker configuration
• Cons:
– whitelist maintenance, and whitelist creation for new sites
• Features:
– accurate detection of misspelled words
– good detection of concatenated words
• piscinagarajejardin → piscina garaje jardin
• picina garajejardin → piscina (garaje jardin)
– able to detect several misspelled words in one query
– evolution based on whitelist fine-tuning
  • 32. Initial spellchecker configuration
• Issues:
– false positives: suggesting corrections when words are correctly spelled
– suggestions for all the words in the query, not just the misspelled ones
– misleading “correctlySpelled” parameter
• the parameter depends on frequency information, making it unreliable for our purposes
• it returns true/false according to thresholds: not word distance, but results found and the “alternativeTermCount” and “maxResultsForSuggest” thresholds
– minor discrepancies if we only index boosted terms (i.e. qf)
• number of hits in the spell field < number of docs in the index
  • 34. Hacking the SpellcheckComponent
• Lack of reliability of the “correctlySpelled” parameter
– difficult to know when to give a suggestion or not
– first policy based on document hits
• a sliding window based on the number of queried terms: the longer the tail, the smaller the threshold
• inaccurate and prone to collisions
– difficult to tune thresholds to a good level of accuracy
• We needed a more reliable way.
  • 35. Hacking the SpellcheckComponent: “correctlySpelled” parameter behaviour
• Binary approach to deciding whether a word is correctly spelled
• Simpler approach
– any term that appears in our spelling field is a correctly spelled word, regardless of its frequency or the configured thresholds
– this way the parameter can be used to control when to start querying the spellchecking index
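The modified policy can be sketched as pure set membership (illustrative Python, not the actual SpellcheckComponent patch): a query counts as correctly spelled iff every term exists in the spelling field, independently of frequencies or thresholds.

```python
# Sketch of the binary "correctlySpelled" policy: membership in the
# (white-listed) spelling field decides correctness; frequency thresholds
# play no role.

def correctly_spelled(query_terms, spell_field_terms):
    """True iff every queried term exists in the spelling dictionary."""
    return all(t in spell_field_terms for t in query_terms)

spell_terms = {"piscina", "garaje", "jardin"}
correctly_spelled(["piscina", "garaje"], spell_terms)  # -> True: no suggestions needed
correctly_spelled(["picina", "garaje"], spell_terms)   # -> False: query the spellcheck index
```

This only works because the spelling field is white-listed: with an uncontrolled field, membership would also validate misspellings that happen to appear in the ads.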
  • 36. Hacking the SpellcheckComponent
• Other changes to the SpellcheckComponent:
– no suggestions when words are correctly spelled
– suggestions only for the misspelled words, not for all words
• e.g. piscina garage → piscina garaje
• Spanish-friendly ASCIIFoldingFilter
– modified so as not to fold the “ñ” (Spanish) and “ç” (Catalan names) characters
• avoids collisions with similar words containing “n” and “c”, e.g. “pena” vs “peña”
– accented vowels are still folded, as users usually omit accents
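The Spanish-friendly folding can be approximated in a few lines using Unicode decomposition (a sketch of the behaviour only, not Lucene’s ASCIIFoldingFilter): accents on vowels are stripped, while “ñ” and “ç” are protected so that e.g. “peña” never collides with “pena”.

```python
# Sketch of accent folding that preserves "ñ" and "ç" while stripping
# accents from vowels, mimicking the modified ASCIIFoldingFilter behaviour.
import unicodedata

PROTECTED = {"ñ", "Ñ", "ç", "Ç"}

def fold(text):
    """Strip combining accents from every character except the protected ones."""
    out = []
    for ch in text:
        if ch in PROTECTED:
            out.append(ch)  # keep "ñ"/"ç" intact
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)

fold("peña")   # -> "peña"  (ñ preserved, no collision with "pena")
fold("ático")  # -> "atico" (accented vowels still folded)
```

Lucene’s real filter works on token streams inside an analyzer chain rather than on raw strings, but the protection idea is the same: exclude the characters whose folding would merge distinct words of the target language.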
  • 38. Conclusion & Future Work
• Base code
– expand the spellchecking process to other sites
– design the final policy for deciding when to give suggestions or not
• Geodata in homes verticals
– find ways to avoid collisions in large dictionary sets
• Scoring system for the spelling dictionary
– control suggestions based on user input
• feedback on the relevance and quality of our spellchecking suggestions
• makes the system more accurate and reliable
• expand whitelists to cover large amounts of geodata with acceptable levels of precision
  • 39. Conclusion & Future Work
• Plural suggester
– suggest alternative searches and corrections using plural or singular variants of the query terms
– use frequency and scoring information to choose the most suitable suggestions
  • 40. THANKS FOR YOUR ATTENTION! ANY QUESTIONS?
  • 41. References
[1] Lucene/Solr Revolution EU 2013. Dublin, 6-7 November 2013. http://www.lucenerevolution.org/
[2] Trovit – A search engine for classified ads of real estate, jobs, cars and vacation rentals. http://www.trovit.com
[3] Apache Software Foundation. “Apache Solr”. https://lucene.apache.org/solr/
[4] Apache Software Foundation. “Apache Lucene”. https://lucene.apache.org
[5] Apache Software Foundation. “Spell Checking – Apache Solr Reference Guide”. https://cwiki.apache.org/confluence/display/solr/Spell+Checking