5. What is Solr
[Diagram: Solr architecture. Labels: HTTP Request Servlet, Update Servlet, Admin, XML, Different Request Handler, Update, schema, config, caching, concurrency, Replication, Solr Core, Lucene]
6. What is Solr
● Unstructured rows
● Denormalization of data
● Dynamic fields
● Schema → Tokenizer, Filters, etc.
● Tons of XML
7. What is Solr
[Diagram: analysis pipelines for indexing and querying. Indexing: Strings, Tokenizer, Token Filter, Index. Query: Query, Tokenizer, Filter, Results]
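The indexing and query pipelines on this slide can be sketched in a few lines of Python. This is only an illustration of the idea (the same tokenizer and filter chain is applied to indexed strings and to the query), not Solr's implementation; all function names and the stopword list are assumptions made for this example.

```python
def whitespace_tokenizer(text):
    """Split raw text into tokens on runs of whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]

def stopword_filter(tokens, stopwords=frozenset({"the", "a", "in"})):
    """Token filter: drop very common words (hypothetical stopword list)."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, tokenizer, filters):
    """Run the tokenizer, then each token filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

def build_index(docs, tokenizer, filters):
    """Indexing side: map each analyzed token to the ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text, tokenizer, filters):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query, tokenizer, filters):
    """Query side: analyze the query the same way, then intersect the postings."""
    tokens = analyze(query, tokenizer, filters)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Because both sides share the analysis chain, a query such as `galeria KAUFHOF` finds an entry indexed as `Boutique in Galeria Kaufhof`.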
8. What is Solr
● GET requests
hl.fragsize=0
&spellcheck=true
&spellcheck.extendedResults=true
&qf=everything_phonetic_wa^1+display_name_phonetic_wa^2+comment_en_wa^4+review_en_wa^8+everything_en_wa^16+everything_wa^32+display_name_en_wa^64+display_name_wa^128
&spellcheck.collate=true
&wt=ruby
&hl=true
&rows=100
&fl=pk_i,score
&start=0
&q=chipotle+bbq
&spellcheck.dictionary=spell_en
&bf=linear(en_rating_points_i,100,0)
&spellcheck.count=1
&qt=dismax
&fq=closed_b:false+AND+domain_id_s:uki*+AND+(type_s:Place)
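A request like the one on this slide can be assembled programmatically instead of by string concatenation. The following is a minimal sketch using only Python's standard library; the base URL is an assumption (a local Solr on its default port), the helper name is hypothetical, and only a subset of the slide's parameters is shown.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration only; the real host and path are not given on the slide.
BASE_URL = "http://localhost:8983/solr/select"

def build_select_url(base_url, params):
    """Return a GET URL with the parameters percent-encoded in the given order."""
    return base_url + "?" + urlencode(params)

# A subset of the parameters from the slide, written unencoded.
PARAMS = [
    ("q", "chipotle bbq"),
    ("qt", "dismax"),
    ("qf", "everything_wa^32 display_name_en_wa^64 display_name_wa^128"),
    ("bf", "linear(en_rating_points_i,100,0)"),
    ("fq", "closed_b:false AND domain_id_s:uki* AND (type_s:Place)"),
    ("fl", "pk_i,score"),
    ("start", 0),
    ("rows", 100),
    ("wt", "ruby"),
    ("spellcheck", "true"),
    ("spellcheck.collate", "true"),
    ("spellcheck.dictionary", "spell_en"),
]

if __name__ == "__main__":
    print(build_select_url(BASE_URL, PARAMS))
```

`urlencode` turns spaces into `+` and escapes characters such as `^`, `:` and `*`, so the field boosts and filter query survive the trip intact.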
9. What is Solr
● Response type
● XML
● Ruby
● JSON
● XML + XSLT
● etc.
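As an illustration of consuming one of these response types, the sketch below parses a JSON response (`wt=json`) and pulls out the primary keys. The sample payload is invented for this example and only mirrors the usual `responseHeader` / `response.docs` layout; the field name `pk_i` is taken from the request on the previous slide, and the helper name is hypothetical.

```python
import json

# Invented sample payload, shaped like a Solr JSON response.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 3},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"pk_i": 17, "score": 4.2},
      {"pk_i": 42, "score": 1.3}
    ]
  }
}
"""

def extract_ids(raw_json, id_field="pk_i"):
    """Return the ids of all returned documents, in ranking order."""
    body = json.loads(raw_json)
    return [doc[id_field] for doc in body["response"]["docs"]]
```

With ids in hand, the application can load the full records from its own database, which is why the request only asks for `pk_i` and `score`.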
10. Solr integration into Rails
● Sunspot
● acts_as_solr
● Qype → acts_as_solr
● Optimized queries for Solr
● Monkey patching
● Defined queries without dynamic fields
● Names of search fields differ from AR names
11. Solr integration into Rails
● Data consistency
● Synchronous
– AR stores in MySQL and Solr
– Longer response times
– Not truly synchronous once replication is involved
● Asynchronous
– AR stores in MySQL
– Solr master imports data via MySQL queries
– Out of sync for a few minutes
– Deletion by flag first, physical deletion later
– JavaScript preprocessing of data possible
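The asynchronous variant can be sketched as a periodic delta import with soft deletes. This is a simplified in-memory model, not the actual import setup: the row layout (`id`, `name`, `updated_at`, `deleted`) and both function names are assumptions for illustration.

```python
def delta_import(rows, index, last_import):
    """Apply all rows changed since last_import to the index.

    Rows flagged as deleted are removed from the index; all others are
    (re)indexed. Returns the new high-water mark for the next run.
    """
    newest = last_import
    for row in rows:
        if row["updated_at"] <= last_import:
            continue  # unchanged since the previous import
        if row["deleted"]:
            index.pop(row["id"], None)
        else:
            index[row["id"]] = row["name"]
        newest = max(newest, row["updated_at"])
    return newest

def purge_deleted(rows, last_import):
    """Physically drop flagged rows, but only those the index has already seen."""
    return [
        row for row in rows
        if not (row["deleted"] and row["updated_at"] <= last_import)
    ]
```

Between two import runs the index lags behind the database, which corresponds to the "out of sync for some minutes" trade-off. Deleting by flag first means the importer still sees the row and can remove it from the index before the row disappears from the database.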
12. Challenges - Spellchecking
● Pool of words for spellchecking
● Words from real data?
● Beeeeeeer
● 9 languages
● New → Spellchecker for different kinds of data
● Suggestion → Locator → Facet → best match?
● Similar word → fuzzy search vs. spellchecking
13. Challenges - Spellchecking
Chipotle BBQ
Chinese Baby !
shingles
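Shingles are word-level n-grams: adjacent tokens joined into a single term, so that a multi-word query such as "chipotle bbq" can be treated as one unit rather than corrected word by word. A minimal sketch, with a hypothetical function name and parameters (this is not Solr's shingle filter itself):

```python
def shingles(tokens, min_size=2, max_size=2):
    """Return all runs of min_size..max_size adjacent tokens, joined by a space."""
    result = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            result.append(" ".join(tokens[start:start + size]))
    return result
```

For the tokens `chipotle`, `bbq`, `sauce` this yields the two-word shingles `chipotle bbq` and `bbq sauce`.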
15. Challenges – Synonyms
● 9 languages
● OpenOffice rules!
● Not all languages available → Dutch (NL) is missing
16. Challenges – NGrams
● Huge index
● Tee matches Steeb
● EdgeNGrams
● Bar → Sofabar → Barmbek
● The unmatched remainder of the string should itself be a word → performance cost
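The difference between plain n-grams and edge n-grams can be shown with the slide's own examples. This is a sketch with hypothetical helper names, assuming front-edge grams (prefixes) and lowercase input:

```python
def ngrams(word, n):
    """All substrings of length n, taken from anywhere in the word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def edge_ngrams(word, min_n, max_n):
    """Prefixes of the word with lengths min_n..max_n (front edge only)."""
    return {word[:n] for n in range(min_n, min(max_n, len(word)) + 1)}
```

With plain 3-grams, `tee` is a gram of `steeb` and `bar` is a gram of `sofabar`, so both produce matches and the index grows quickly. With front-edge grams, `bar` is still a gram of `barmbek` but no longer of `sofabar`.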
17. Challenges – Phrases
● Boost matching of phrases → whole entry
● 'Europa Passage'
● Boost matching of phrases → left sided
● 'Galeria Kaufhof in Hamburg'
● 'Boutique in Galeria Kaufhof'
● JavaScript preprocessing
● Boost matching of phrase somewhere in entry
● How to handle matches of only some words of a given phrase?
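The boosting tiers on this slide can be made concrete with a toy scoring function. The boost values and the function name are assumptions chosen for illustration; in Solr such boosts would be expressed through field configuration and query parameters rather than application code.

```python
def phrase_boost(query, entry):
    """Score how well the query phrase matches an entry name.

    Tiers (hypothetical values): whole entry > left-sided > somewhere in
    the entry > only some words of the phrase match.
    """
    q = query.lower().split()
    e = entry.lower().split()
    if not q:
        return 0.0
    if q == e:
        return 8.0  # phrase is the whole entry
    if e[:len(q)] == q:
        return 4.0  # phrase matches at the left edge
    for start in range(1, len(e) - len(q) + 1):
        if e[start:start + len(q)] == q:
            return 2.0  # phrase appears somewhere in the entry
    # Partial match: fraction of query words found anywhere in the entry.
    return len(set(q) & set(e)) / len(q)
```

Under these assumptions `Galeria Kaufhof` ranks `Galeria Kaufhof in Hamburg` above `Boutique in Galeria Kaufhof`. The last tier is one possible answer to the open question of partial matches: it scores by the share of query words found.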
18. Challenges – Whitespace in index
● Index: 'Ping Pong'
● Search word: 'Pingpong'
● JavaScript preprocessing
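One way to make 'Pingpong' find 'Ping Pong' is to index an additional, whitespace-free form of the name and normalize the search word the same way. The slides mention JavaScript preprocessing for this; the sketch below shows the same idea in Python, with a hypothetical function name.

```python
def collapse_whitespace(text):
    """Lowercase the text and remove all whitespace between words."""
    return "".join(text.lower().split())
```

Both `Ping Pong` and `Pingpong` normalize to `pingpong`, so they match when the collapsed form is stored alongside the original value.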
19. Experiences – server setup
[Diagram: server setup for Live, Staging, and Dev. Labels: Loadbalancer, Solr queries, Master, Slave (×3), Replication (Solr & MySQL), DB Slave, Import, iMac]
20. Experiences – size of indices
● Staging system → Sunday evening
● Places in simple format: 712 MB
● Previews in simple format: 5.519 GB
● Places, previews, comments extended: 3.5 GB
● Big spellchecker: 16 GB
● New combined index: 15 GB
– Index: 14 GB
– Spellchecker: 1 GB
21. Experiences – server setup
● Live Servers
● 2 x 8 Cores, 2 x 16 Cores
● 32 GB RAM
● Max. CPU usage: up to 500%
● Solr loves RAM → 32 GB filled with cache
22. Experiences – Solr loves RAM
● Dev → 1 Gig
● Staging → 4.5 Gig (no load)
● Import → 11 Gig and more
● Production → 14 Gig