1. Usage of Solr at Trovit
A Search Engine For Classified Ads
Marc Sturlese
Trovit
marc@trovit.com
Apache Lucene Eurocon 2010, Prague, 20 May 2010
2. Agenda
● Trovit, a Solr use case
● Types of index
● Architecture overview
● Relevance tuning
● Out of the box features
● Custom features
● Sharding
● Future directions
● Questions
Apache Lucene EuroCon 05/16/10
3. What is Trovit? A Search Engine For Classified Ads
4. Types of index
There are 3 different types of index
● Organic ads index
● Sponsored ads index
● Recommended searches index
There is an index per country and per business category for
every type, which means a total of 180 indexes
Some of them are sharded. All of them have replicas.
5. Types of index
[Screenshot showing the three types of index]
6. Architecture overview
[Diagram: crawling/parsing feeds a warehouse; a Solr indexer (back end) builds the index, which is replicated to the Solr slaves; user requests reach the front end through a load balancer, and the front end queries the slaves through another load balancer]
7. Architecture overview
Masters – Indexing
● 4 servers, continuously updating indexes sequentially
● 1 server indexes organic ads for all countries/categories
● 1 server indexes sponsored ads for all countries/categories
● 1 server indexes recommended searches for all countries/categories
Slaves – Serving search requests
● Indexes with high traffic have 4 replicas
● Indexes with less traffic have 3 replicas
8. Architecture overview
● Indexes are replicated using modified collection
distribution scripts to allow multi-core setups
● Snapshooter and snappuller are executed sequentially
● Snapinstaller is executed at the same time on each slave
to keep exactly the same content on all slaves at all times
● We started load balancing with Perlbal, but it was producing
high CPU load
9. Life of a user search request
For every user search:
● A request is made to the organic and sponsored indexes
● For each result of the organic search, a request to the
recommended searches index is made
● 13 Solr requests per user search! And once this is done...
The user search request is batch processed to decide
whether it should be indexed in the similar-user-searches index
10. Life of a user search request
11. Relevance tuning
● Basic searches use the dismax query type, built on top of
Lucene's DisjunctionMaxQuery
● Boosting queries make the latest ads more relevant
● Some ads are boosted at document level at indexing time to
make them more important than others
● Ads are boosted at field level at query time to make a match
in some fields count more than in others
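Put together, such a request might look like the following sketch (field names and boost weights here are made-up examples, not Trovit's actual configuration):

```
http://localhost:8983/solr/select
  ?defType=dismax
  &q=home tennessee
  &qf=title^2.0 description^0.5
  &bq=created:[NOW/DAY-7DAYS TO NOW]^1.5
```

qf applies the query-time field-level boosts; bq is a boosting query that favors recently created ads. Document-level boosts would instead be set on each document at indexing time.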
12. Relevance tuning
User search: home tennessee
● Higher quality ad
● Lower quality ad
13. Out of the box Solr features
● Synonyms for USA states
● Per country and per business category stopwords
● MoreLikeThis request handler
● TrieFields to index housing latitude and longitude
● Facet fields, queries and dates.
● Warming queries from a specific file using an EventListener.
Issue SOLR-784
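A schema.xml sketch of two of these features (type, field, and file names are assumptions, not Trovit's real schema): a Trie-based type for housing latitude/longitude, and a synonyms filter fed from a states file:

```xml
<!-- Trie-encoded doubles allow fast numeric range queries on coordinates -->
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
<field name="latitude"  type="tdouble" indexed="true" stored="true"/>
<field name="longitude" type="tdouble" indexed="true" stored="true"/>

<!-- Synonyms for USA states, e.g. a "TN, tennessee" line in the file -->
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="us_states_synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```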
14. Out of the box Solr features: MoreLikeThis
15. Out of the box Solr features: Usage of TrieFields
16. Custom features
● Duplicates detection
● Coming from the same source: indexing time
● Coming from different sources: indexing and search time
● Pseudo field collapsing
● Custom ranking for sponsored ads
● Custom Data Import Handler for full indexing and updates
17. Custom features – Near duplicates detection
● Ads coming from the same source
● The most recently received ad is the one kept in the index
● Deduplication is done with SignatureUpdateProcessor
● A small hack customizes the TextProfileSignature
● Ads coming from different sources
● Give the user the chance to decide which source to visit
● Based on the field collapsing issue (SOLR-236) and the
SignatureUpdateProcessor used in deduplication
● Done in 2 steps, one at index time and one at search time
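For the same-source case, the standard SignatureUpdateProcessor wiring in solrconfig.xml looks roughly like this (the field names are hypothetical; overwriteDupes=true is what makes the last ad received win):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- overwrite older docs that produce the same signature -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title,description</str>
    <!-- fuzzy signature class, suited to near (not exact) duplicates -->
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```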
18. Near duplicates detection
Ads coming from different sources
19. Custom features – Near duplicates detection
Ads coming from different sources
● Why calculate them at index time?
● It avoids loading a FieldCache for a "big field" at search time,
which is very memory consuming!
20. Custom features – Pseudo field collapsing
● We don't want the first result pages to show all ads from the
same sources
● "Bad" results are pushed to the later pages
● SOLR-236 makes a double trip, which is not so good in
performance terms
● A core hack avoids the double trip... SOLR-1311
● Does not support proper distributed search at the moment
21. Custom features – Special ranking for Sponsored Ads
● Relevance is not the only thing that matters; external
factors are important too
● Implemented using a Solr SearchComponent
● External factors are loaded from a resource and used
in a Lucene FieldComparatorSource to alter the
score of the documents
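A minimal sketch of the idea in plain Java (not the actual Lucene FieldComparatorSource wiring; the sources and weights are made up): load external per-source quality factors and multiply them into the relevance score.

```java
import java.util.HashMap;
import java.util.Map;

public class SponsoredRanker {

    // External quality factors per ad source; in practice these would be
    // loaded from a resource file (sources and weights here are made up)
    private final Map<String, Double> factors = new HashMap<>();

    public SponsoredRanker() {
        factors.put("sourceA", 1.5);
        factors.put("sourceB", 0.75);
    }

    /** Scales the relevance score by the external factor of the ad's
     *  source; unknown sources keep their original score. */
    public double adjustedScore(String source, double relevanceScore) {
        return relevanceScore * factors.getOrDefault(source, 1.0);
    }

    public static void main(String[] args) {
        SponsoredRanker ranker = new SponsoredRanker();
        System.out.println(ranker.adjustedScore("sourceA", 2.0));
        System.out.println(ranker.adjustedScore("unknown", 2.0));
    }
}
```

In the real SearchComponent the same multiplication would happen inside the comparator while Lucene sorts the hits.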
22. Custom features – Hacked DataImportHandler
● DIH is a tool to index data into Solr from different sources
(XML, txt, databases...)
● Extended transformers alter data before it is indexed
● Delta imports are not meant for updating huge numbers of
rows; doing that can end in memory problems
● If something crashes we have to reindex, which can
take a long time; we want to resume from the last indexed
doc
● Hacks allow us to use it as a distributed indexer
23. Sharding
First strategy
● With no distributed IDFs at the moment, it is better to choose
randomly the shard where a doc is indexed:
shardNumber = uniqueField.hashCode() % numberOfShards
● Once we started keeping track of near duplicates among
ads from different sources this was not good anymore.
Why? The dups system is based on SOLR-236: duplicated
documents must be indexed on the same shard to
be detected!
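The two routing rules can be sketched in Java (class and method names are hypothetical):

```java
public class ShardRouter {

    /** First strategy: route by the ad's unique field. */
    static int shardForUniqueId(String uniqueId, int numShards) {
        // Math.floorMod keeps the result in [0, numShards) even when
        // hashCode() is negative, which the plain % operator would not
        return Math.floorMod(uniqueId.hashCode(), numShards);
    }

    /** Second strategy: route by the signature field, so that
     *  near-duplicates (same signature) land on the same shard. */
    static int shardForSignature(String signature, int numShards) {
        return Math.floorMod(signature.hashCode(), numShards);
    }

    public static void main(String[] args) {
        System.out.println(shardForUniqueId("ad-123", 4));
        System.out.println(shardForSignature("some-signature", 4));
    }
}
```

Because the unique-id route scatters duplicates across shards, only the signature-based route is compatible with the SOLR-236-based dups detection.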
24. Sharding
Second strategy
● The hashCode of the signature field decides the shard number
● This forces the signature field to be calculated in the
warehouse, so when the indexing process starts we
already have it
Third and future strategy
● Calculate duplicates in the warehouse
● There will be no need for the dups to be in the same shard
anymore
25. Future directions
Proper distributed IDFs
● Allows absolute relevance among shards, and
more accurate results
● Issue SOLR-1632
● Still some bugs, especially when using boosting functions
● Allows improved sharding strategies: no need to choose the
shard number randomly anymore
26. Future directions
Load balancing with ZooKeeper (SolrCloud)
● Use SolrCloud to manage sharding
● Currently being committed to trunk
● Replace the load balancer with ZooKeeper
● Let ZooKeeper handle distributed configuration
28. Thank you
for your attention
Marc Sturlese
Trovit
marc@trovit.com
Apache Lucene Eurocon 2010, Prague, 20 May 2010