1. Usage of Solr at Trovit
A Search Engine For Classified Ads
Marc Sturlese
Trovit
marc@trovit.com
Apache Lucene Eurocon 2010, Prague, 20 May 2010
2. Agenda
● Trovit, a Solr use case
● Types of index
● Architecture overview
● Relevance tuning
● Out of the box features
● Custom features
● Sharding
● Future directions
● Questions
Apache Lucene EuroCon 05/16/10
3. What is Trovit? A Search Engine For Classified Ads
4. Types of index
There are 3 different types of index
● Organic ads index
● Sponsored ads index
● Recommended searches index
There is an index per country and per business category for
every type, which means a total of 180 indexes
Some of them are sharded. All of them have replicas.
5. Types of index
[Screenshot showing the three types of index]
6. Architecture overview
[Diagram: crawling/parsing feeds a warehouse; a Solr indexer (back end) builds the index, which is replicated to the Solr slaves; user requests reach the front end through a load balancer, and the front end queries the slaves through another load balancer]
7. Architecture overview
Masters – Indexing
● 4 servers, continuously updating indexes sequentially
● 1 server indexes organic ads for all countries/categories
● 1 server indexes sponsored ads for all countries/categories
● 1 server indexes recommended searches for all countries/categories
Slaves – Serving search requests
● Indexes with high traffic have 4 replicas
● Indexes with less traffic have 3 replicas
8. Architecture overview
● Indexes are replicated using modified collection
distribution scripts to allow multi-core setups
● Snapshooter and snappuller are executed sequentially
● Snapinstaller is executed at the same time on each slave
to keep exactly the same content on all slaves at all times
● We started load balancing with Perlbal, but it was producing
high CPU load
9. Life of a user search request
For every user search:
● A request is made to the organic and sponsored indexes
● For each result of the organic search, a request to the
recommended searches index is made
● 13 Solr requests per user search! And once this is done...
The user search request is batch processed to decide
whether it should be indexed in the similar-user-searches index
10. Life of a user search request
11. Relevance tuning
● Basic searches use the dismax query type, built on top of
Lucene's DisjunctionMaxQuery
● Boosting queries make the latest ads more relevant
● Some ads are boosted at document level at indexing time to
make them more important than others
● Ads are boosted at field level at query time to make a match
in some fields count more than in others
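Put together, such a request might look like the following sketch (field names and boost weights here are made-up examples, not Trovit's actual configuration):

```
http://localhost:8983/solr/select
  ?defType=dismax
  &q=home tennessee
  &qf=title^2.0 description^0.5
  &bq=created:[NOW/DAY-7DAYS TO NOW]^1.5
```

qf applies the query-time field-level boosts; bq is a boosting query that favors recently created ads. Document-level boosts would instead be set on each document at indexing time.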
12. Relevance tuning
User search: home tennessee
● Higher quality ad
● Lower quality ad
13. Out of the box Solr features
● Synonyms for USA states
● Per country and per business category stopwords
● MoreLikeThis request handler
● TrieFields to index housing latitude and longitude
● Facet fields, queries and dates.
● Warming queries from a specific file using an EventListener.
Issue SOLR-784
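A schema.xml sketch of two of these features (type, field, and file names are assumptions, not Trovit's real schema): a Trie-based type for housing latitude/longitude, and a synonyms filter fed from a states file:

```xml
<!-- Trie-encoded doubles allow fast numeric range queries on coordinates -->
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
<field name="latitude"  type="tdouble" indexed="true" stored="true"/>
<field name="longitude" type="tdouble" indexed="true" stored="true"/>

<!-- Synonyms for USA states, e.g. a "TN, tennessee" line in the file -->
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="us_states_synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```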
14. Out of the box Solr features: MoreLikeThis
15. Out of the box Solr features: Usage of TrieFields
16. Custom features
● Duplicates detection
● Coming from the same source: indexing time
● Coming from different sources: indexing and search time
● Pseudo field collapsing
● Custom ranking for sponsored ads
● Custom Data Import Handler for full indexing and updates
17. Custom features – Near duplicates detection
● Ads coming from the same source
● The most recently received ad is the one kept in the index
● Deduplication is done with SignatureUpdateProcessor
● A small hack customizes the TextProfileSignature
● Ads coming from different sources
● Give the user the chance to decide which source to visit
● Based on the field collapsing issue (SOLR-236) and the
SignatureUpdateProcessor used in deduplication
● Done in 2 steps, one at index time and one at search time
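For the same-source case, the standard SignatureUpdateProcessor wiring in solrconfig.xml looks roughly like this (the field names are hypothetical; overwriteDupes=true is what makes the last ad received win):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- overwrite older docs that produce the same signature -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title,description</str>
    <!-- fuzzy signature class, suited to near (not exact) duplicates -->
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```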
18. Near duplicates detection
Ads coming from different sources
19. Custom features – Near duplicates detection
Ads coming from different sources
● Why calculate them at index time?
● It avoids loading a FieldCache for a "big field" at search time,
which is very memory consuming!
20. Custom features – Pseudo field collapsing
● We don't want the first result pages to show all ads from the
same sources
● "Bad" results are pushed to the later pages
● SOLR-236 makes a double trip, which is not so good in
performance terms
● A core hack avoids the double trip... SOLR-1311
● Does not support proper distributed search at the moment
21. Custom features – Special ranking for Sponsored Ads
● Relevance is not the only thing that matters; external
factors are important too
● Implemented using a Solr SearchComponent
● External factors are loaded from a resource and used
in a Lucene FieldComparatorSource to alter the
score of the documents
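A minimal sketch of the idea in plain Java (not the actual Lucene FieldComparatorSource wiring; the sources and weights are made up): load external per-source quality factors and multiply them into the relevance score.

```java
import java.util.HashMap;
import java.util.Map;

public class SponsoredRanker {

    // External quality factors per ad source; in practice these would be
    // loaded from a resource file (sources and weights here are made up)
    private final Map<String, Double> factors = new HashMap<>();

    public SponsoredRanker() {
        factors.put("sourceA", 1.5);
        factors.put("sourceB", 0.75);
    }

    /** Scales the relevance score by the external factor of the ad's
     *  source; unknown sources keep their original score. */
    public double adjustedScore(String source, double relevanceScore) {
        return relevanceScore * factors.getOrDefault(source, 1.0);
    }

    public static void main(String[] args) {
        SponsoredRanker ranker = new SponsoredRanker();
        System.out.println(ranker.adjustedScore("sourceA", 2.0));
        System.out.println(ranker.adjustedScore("unknown", 2.0));
    }
}
```

In the real SearchComponent the same multiplication would happen inside the comparator while Lucene sorts the hits.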
22. Custom features – Hacked DataImportHandler
● DIH is a tool to index data into Solr from different sources
(XML, txt, databases...)
● Extended transformers alter data before it is indexed
● Delta imports are not meant for updating huge numbers of
rows; doing that can end in memory problems
● If something crashes we have to reindex, which can
take a long time; we want to resume from the last indexed
doc
● Hacks allow us to use it as a distributed indexer
23. Sharding
First strategy
● With no distributed IDFs at the moment, it is better to choose
randomly the shard where a doc is indexed:
shardNumber = uniqueField.hashCode() % numberOfShards
● Once we started keeping track of near duplicates among
ads from different sources this was not good anymore.
Why? The dups system is based on SOLR-236: duplicated
documents must be indexed on the same shard to
be detected!
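The two routing rules can be sketched in Java (class and method names are hypothetical):

```java
public class ShardRouter {

    /** First strategy: route by the ad's unique field. */
    static int shardForUniqueId(String uniqueId, int numShards) {
        // Math.floorMod keeps the result in [0, numShards) even when
        // hashCode() is negative, which the plain % operator would not
        return Math.floorMod(uniqueId.hashCode(), numShards);
    }

    /** Second strategy: route by the signature field, so that
     *  near-duplicates (same signature) land on the same shard. */
    static int shardForSignature(String signature, int numShards) {
        return Math.floorMod(signature.hashCode(), numShards);
    }

    public static void main(String[] args) {
        System.out.println(shardForUniqueId("ad-123", 4));
        System.out.println(shardForSignature("some-signature", 4));
    }
}
```

Because the unique-id route scatters duplicates across shards, only the signature-based route is compatible with the SOLR-236-based dups detection.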
24. Sharding
Second strategy
● The hashCode of the signature field decides the shard number
● This forces the signature field to be calculated in the
warehouse, so when the indexing process starts we
already have it
Third and future strategy
● Calculate duplicates in the warehouse
● There will be no need for the dups to be in the same shard
anymore
25. Future directions
Proper distributed IDFs
● Allows absolute relevance among shards, and
more accurate results
● Issue SOLR-1632
● Still some bugs, especially when using boosting functions
● Allows improved sharding strategies: no need to choose the
shard number randomly anymore
26. Future directions
Load balancing with ZooKeeper (SolrCloud)
● Use SolrCloud to manage sharding
● Currently being committed to trunk
● Replace the load balancer with ZooKeeper
● Let ZooKeeper handle distributed configuration
28. Thank you
for your attention
Marc Sturlese
Trovit
marc@trovit.com
Apache Lucene Eurocon 2010, Prague, 20 May 2010