Presented by Xavier Sanchez Loro, Ph.D, Trovit Search SL
This session explains the implementation and use case of spellchecking in the Trovit search engine. Trovit is a classified-ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several million indexed ads. Those indexes are segmented into several different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using Solr and Lucene to help our users better find the desired ads and avoid the dreaded 0 results as much as possible. As such, our goal is not pure orthographic correction, but also the suggestion of correct searches for a given site.
3. Outline
• Introduction
• Our approach: Contextual Spellchecking
• Nature and characteristics of our document corpus
• Spellcheckers in Solr
• White-listing and purging: controlling dictionary data
• Spellchecker configuration
• Customizing Solr’s SpellcheckComponent
• Conclusions and Future Work
4. Supporting material for this talk
Trovit Engineering Blog post on spellchecking
http://tech.trovit.com/index.php/spellchecking-in-trovit/
8. Introduction: spellchecking in Trovit
• Multi-language spellchecking system using SOLR and Lucene
• Objectives:
– help our users to better find the desired ads
– avoid the dreaded 0 results as much as possible
– not only pure orthographic correction, but also suggesting correct searches for a certain site.
10. Contextual Spellchecking: approach
• The key element in the spellchecking process is choosing the right dictionary
  – one with a relevant vocabulary according to the type of information included in each site.
• Approach: specializing the dictionaries based on the user’s search context.
• Search contexts are composed of:
  – country (with a default language)
  – vertical (determining the type of ads and vocabulary).
11. Contextual Spellchecking: vocabularies
• Each site’s document corpus has a limited vocabulary
  – reduced to the type of information, language and terms included in each site’s ads.
• A more generalized approach is not suitable for our needs
  – One vocabulary per language is less precise than a specialized vocabulary for each site.
  – Drastic differences in the type of terms and the semantics of each vertical.
  – Terms that are relevant in one context are meaningless in another.
• Different vocabularies for each site, even when supporting the same language.
  – Vocabulary is tailored to the context of searches.
13. Challenges: Inconsistencies in our corpus
• The document corpus is fed by different third-party sources providing the ads for the different sites.
• We can detect incorrect documents and reconcile certain inconsistencies
  – but we cannot control or modify the content of the ads themselves.
• Inconsistencies
  – hinder any language detection process
  – pose challenges to the development of the spellchecking system
14. Inconsistencies example
• Spanish homes vertical
  – not fully written in Spanish: ads in several languages.
    • native languages: Spanish, Catalan, Basque and Galician.
    • foreign languages: English, German, French, Italian, Russian… even Asian languages like Chinese!
  – multi-language ads
  – badly written and misspelled words
    • Spanish words badly translated from regional languages
    • overtly misspelled words
      – e.g. “picina” appears in 1,197 docs vs. 1,048,434 for “piscina” (roughly 0.1% of the correct spelling’s frequency)
  – “noisy” content: numbers, postal codes, references, etc.
15. Characteristics of our ads
• Summarizing:
  – Corpus segmented into different indexes, one per country plus vertical (site)
  – 3rd-party generated
  – Ads in the national language + other languages (regional and foreign)
  – Multi-language content within ads
  – Noisy content (numbers, references, postal codes, etc.)
  – Small texts (around 3,000 characters long)
  – Misspellings and incorrect words
• The corpus is unreliable for use as the knowledge base to build any spellchecking dictionary.
16. What/Where search segmentation
• Geolocation data not mixed with vertical data: the dictionary holds only vertical data (no geodata)
  – Narrower dictionary, fewer collisions, more controllable
• Geolocation data interleaved with vertical data: the dictionary must cover all geodata
  – Wider dictionary, more collisions, less controllable
18. IndexBasedSpellchecker
• Creates a parallel index for the spelling dictionary based on an existing Lucene index.
  – Depends on index data correctness (misspellings)
  – Creates an additional index from the current index (small, MBs)
  – Supports term frequency parameters
  – Must be (re)built
• Even though this component behaves as expected, it was of no use for Trovit’s use case.
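As a reference, an IndexBasedSpellChecker is registered in solrconfig.xml roughly as follows; the component name, source field ("spell") and index directory are illustrative, not Trovit’s actual configuration:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <!-- field in the main index that sources the spelling dictionary -->
    <str name="field">spell</str>
    <!-- where the parallel spelling index is stored on disk -->
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- rebuild the spelling index on every commit to keep it in sync -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```

Note that `buildOnCommit` is exactly the “must (re)build” cost mentioned above: without it, an explicit `spellcheck.build=true` request is needed after index updates.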
19. IndexBasedSpellchecker
• It depends on index data
  – not an accurate and reliable source for the spellchecking dictionary.
• Continuous builds are required to keep the spelling index in sync with the main index.
  – If they fall out of sync:
    • frequency information and hit counting are neither reliable nor accurate
    • false positives/negatives
    • suggestions of words with a different number of hits, even 0.
• We cannot risk suffering this situation.
20. FileBasedSpellChecker
• Uses a flat file to generate a spelling dictionary in the form of a Lucene spellchecking index.
  – Requires a dictionary file
  – Creates an additional index from the dictionary file (small, MBs)
  – Does not depend on index data (controlled data)
  – Build once; rebuild only if the dictionary is updated
  – No frequency information used when calculating spelling suggestions
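A FileBasedSpellChecker declaration looks roughly like this; the dictionary filename and spellchecker name are placeholders:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <!-- flat file, one term per line: this is the controlled dictionary -->
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
</searchComponent>
```

The key property for our use case is that `sourceLocation` is fully under our control, unlike the main index content.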
21. FileBasedSpellChecker
• Also requires rebuilds, albeit less frequently.
• No frequency-related data
  – Pure orthographic correction is not our main goal
  – We cannot risk suggesting corrections without results.
• But it gave us
  – insight on how to approach the final solution we are implementing
  – the highest degree of control over dictionary contents
    • an essential feature for spelling dictionaries.
22. DirectSpellChecker
• Experimental spellchecker that uses the main Solr index directly
  – Build/rebuild is not required.
  – Depends on index data correctness (misspellings)
  – Uses the existing index; a configured field is the source of the spelling dictionary.
  – Supports term frequency parameters.
• Several promising features
  – No build + continuously in sync with index data.
  – Provides accurate frequency information.
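A typical DirectSolrSpellChecker setup is sketched below; the source field and threshold values are illustrative, to be tuned per site:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">direct</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- field of the main index that sources the dictionary: no separate build -->
    <str name="field">spell</str>
    <!-- ignore terms appearing in less than 1% of documents (noise filter) -->
    <float name="thresholdTokenFrequency">.01</float>
    <!-- edit-distance settings -->
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
  </lst>
</searchComponent>
```

`thresholdTokenFrequency` partially mitigates misspellings in the index (rare terms are excluded from the dictionary), but it does not replace the control a curated dictionary gives.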
23. DirectSpellChecker
• The real drawback: lack of control over the index data sourcing the spelling dictionary.
• If we can overcome it, this type would make an ideal candidate for our use case.
24. WordBreakSpellChecker
• Generates suggestions by combining adjacent words and/or breaking words into multiple parts.
  – Can be configured alongside a traditional checker (i.e. DirectSolrSpellChecker).
  – The results are combined, and collations can contain a mix of corrections from both spellcheckers.
  – Uses the existing index. No build.
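Pairing WordBreakSolrSpellChecker with a traditional checker means declaring both and listing both dictionaries in the handler defaults; names below are placeholders:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell</str>
    <!-- "picina garajejardin" -> try joining and splitting adjacent terms -->
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">2</int>
  </lst>
</searchComponent>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <!-- both dictionaries are consulted; collations mix their corrections -->
    <str name="spellcheck.dictionary">direct</str>
    <str name="spellcheck.dictionary">wordbreak</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```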
25. WordBreakSpellChecker
• Good complement to the other spellcheckers.
• Works really well with well-written concatenated words
  – it is able to break them up with great accuracy.
• Combining split words is not as accurate.
• Drawback: it is based on index data.
27. White-listing
• Any spelling system can only be as good as its knowledge base or dictionary is accurate.
• We need to control the data indexed as the dictionary.
• White-listing approach
  – we only index spelling data contained in a controlled dictionary list.
  – processes build a base dictionary specialized for a given site.
30. Initial spellchecker configuration
• DirectSpellChecker using a purged spell field
  – Spell field filled with purged content
    • Purging according to a whitelist
    • Whitelist generated by matching the dictionary against index words, after the purge process
• Benefits:
  – Build is no longer required.
  – Spell field is automatically updated via the pipeline.
  – We can work with term frequencies.
  – No additional index, just an additional field.
  – Better relevance and suggestions.
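The “additional field” side of this setup can be sketched in schema.xml as follows; the field and type names are illustrative, and the whitelist purge itself happens in the indexing pipeline before the document reaches Solr:

```xml
<!-- Dedicated spelling field: populated by the pipeline with
     white-listed terms only, never with raw ad text -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
```

Since DirectSolrSpellChecker reads this field directly, every pipeline update to `spell` is immediately reflected in the suggestions, with no separate spelling index to rebuild.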
31. Initial spellchecker configuration
• Cons:
  – Whitelist maintenance and creation for new sites.
• Features:
  – Accurate detection of misspelled words.
  – Good detection of concatenated words.
    • piscinagarajejardin → piscina garaje jardin
    • picina garajejardin → piscina (garaje jardin)
  – Able to detect several misspelled words.
  – Evolution based on whitelist fine-tuning.
32. Initial spellchecker configuration
• Issues:
  – False positives: suggesting corrections when words are correctly spelled.
  – Suggestions for all the words in the query, not just the misspelled ones.
  – Misleading “correctlySpelled” parameter.
    • Dependent on frequency information, making it unreliable for our purposes.
    • Returns true/false according to thresholds,
      – not really depending on word distance but on
      – results found and the “alternativeTermCount” and “maxResultsForSuggest” thresholds.
  – Minor discrepancies if we only index boosted terms (i.e. qf)
    • number of hits in the spell field < number of docs in the index
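The two thresholds that drive “correctlySpelled” are request parameters; a sketch of handler defaults showing them (values illustrative):

```xml
<lst name="defaults">
  <str name="spellcheck">on</str>
  <!-- also fetch suggestions for terms that DO exist in the index -->
  <str name="spellcheck.alternativeTermCount">5</str>
  <!-- queries returning more hits than this get no suggestions at all;
       combined with the above, this is what flips correctlySpelled -->
  <str name="spellcheck.maxResultsForSuggest">5</str>
  <str name="spellcheck.collate">true</str>
</lst>
```

Because both thresholds are hit-count based rather than distance based, low-frequency but perfectly valid terms can still be reported as misspelled, which is the unreliability described above.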
34. Hacking SpellcheckComponent
• Lack of reliability of the “correctlySpelled” parameter
  – Difficult to know when to give a suggestion or not.
  – First policy based on document hits
    • a sliding window based on the number of queried terms
    • the longer the tail, the smaller the threshold
    • inaccurate and prone to collisions.
  – Difficult to set thresholds to a good level of accuracy.
• We needed a more reliable way.
35. Hacking SpellcheckComponent: correctlySpelled
parameter behaviour
• Binary approach to deciding whether a word is correctly spelled or not.
• Simpler approach
  – any term that appears in our spelling field is a correctly spelled word
    • regardless of its frequency info or the configured thresholds.
  – this way the parameter can be used to control when to start querying the spellchecking index.
36. Hacking SpellcheckComponent
• Other changes to the SpellcheckComponent:
  – No suggestions when words are correctly spelled.
  – Only makes suggestions for misspelled words, not for all words
    • e.g. piscina garage → piscina garaje
• Spanish-friendly ASCIIFoldingFilter
  – modified so as not to fold the “ñ” (for Spanish) and “ç” (for Catalan names) characters.
    • Avoids collisions between similar words with “n” and “c”
      – e.g. “pena” and “peña”
  – Still folds accented vowels
    • usually omitted by users.
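An alternative sketch that achieves the same selective folding without patching ASCIIFoldingFilter: a MappingCharFilterFactory with a custom mapping file that folds accented vowels but has no entries for “ñ” or “ç”. Field type and file names are illustrative:

```xml
<!-- schema.xml: fold accents via a mapping file instead of ASCIIFoldingFilter.
     mapping-es.txt would contain lines such as:
       "á" => "a"
       "é" => "e"
       "í" => "i"
       "ó" => "o"
       "ú" => "u"
       "ü" => "u"
     and deliberately no rules for "ñ" or "ç", so those survive intact. -->
<fieldType name="textSpellEs" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-es.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is maintaining the mapping file per language instead of maintaining a filter fork, which may be preferable across Solr upgrades.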
38. Conclusion & Future Work
• Base code
  – expand the spellchecking process to other sites
  – design the final policy to decide when to give suggestions or not.
• Geodata in homes verticals
  – find ways to avoid collisions in large dictionary sets.
• Scoring system for the spelling dictionary
  – Control suggestions based on user input
    • Feedback on the relevance and quality of our spellchecking suggestions.
    • Makes the system more accurate and reliable.
    • Expand whitelists to cover large amounts of geodata with acceptable levels of precision.
39. Conclusion & Future Work
• Plural suggester
  – suggest alternative searches and corrections using plural or singular variants of the terms in the query.
  – use frequency and scoring information to choose the most suitable suggestions.