Seeley yonik solr performance key innovations

Solr Performance & Key Innovations

Yonik Seeley, Lucid Imagination
yonik@lucidimagination.com, May 26 2011

Solr 3.1 Highlights
§  Numeric range facets (similar to date faceting).
§  New spatial search, including spatial filtering,
boosting and sorting capabilities.
§  Example Velocity driven search UI at
http://localhost:8983/solr/browse
§  A new faster termvector-based highlighter.
§  Extended dismax (edismax) query parser with
support for fielded queries, enhanced relevancy, and
full lucene syntax support.
§  Distributed search support for the Spell check and
Terms components.

Solr 3.1 Highlights (continued)
§  Suggester, a fast trie-based autocomplete
component.
§  Sort results by any function query.
§  JSON document indexing.
§  CSV response format
§  Apache UIMA integration for metadata
extraction.
§  Tons of optimizations, bugfixes, and new
analysis capabilities via Apache Lucene 3.1.


What’s not in 3.1?
§  Result Grouping (AKA Field Collapsing)
§  Pivot Faceting
§  SolrCloud
§  Pseudo-fields
§  Pseudo-join
§  Relevancy function queries
§  Per-segment faceting
§  *Tons* of new Lucene performance/efficiency
goodness

Recent Lucene Performance
§  TieredMergePolicy – the new default
•  Much better for incremental indexing / NRT
•  Ignores segment order when selecting best merge
•  Takes deletes into account
•  Does not over-merge (no cascading merges)
§  Finite State Transducer (FST) based terms index


DocumentsWriterPerThread (DWPT)
§  Flushing new segment is now concurrent w/ indexing
§  Use multiple indexing threads/connections
§  When max mem is hit, biggest DWPT is concurrently flushed

[Diagram: Indexing thread -> Index Writer -> DWPT, DWPT, DWPT (in-memory) -> Flush segment to disk]
     _1_0.tiv   _2_0.tiv   _3_0.tiv
     _1_0.prx   _2_0.prx   _3_0.prx
     _1_0.frq   _2_0.frq   _3_0.frq
     …          …          …

Solr Cloud
http://.../solr/collection1?distrib=true

[Diagram: the request is sent as load-balanced sub-requests to shard1 and shard2, each with replica1, replica2 and replica3. The cluster layout is kept in a ZooKeeper quorum of ZK nodes:]

/livenodes
     server1:8983/solr
     server2:8983/solr
/collections
     /collection1  configName=myconf
          /shards
               /shard1
                    server1:8983/solr
                    server2:8983/solr
               /shard2
                    server3:8983/solr
                    server4:8983/solr
/configs
     /myconf
          solrconfig.xml
          schema.xml

Solr Cloud: Getting Started
http://wiki.apache.org/solr/SolrCloud

     java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar

§  -Dbootstrap_confdir, -Dcollection.configName: upload /solr/conf to ZK and call it “myconf”
§  -DzkRun: run an internal ZK server

http://localhost:8983/solr/collection1/admin/zookeeper.jsp

Distributed Requests
§  Explicitly specify node addresses to load-balance across
     shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
•  A list of equivalent nodes is separated by “|”
•  Different phases of the same distributed request use the same node
§  Specify logical shard ids to search across
     shards=NY_shard,NJ_shard
§  Query across all shards in the collection
     http://localhost:8983/solr/collection1/select?distrib=true
§  SolrJ Java client that load-balances across all nodes in cluster
     public CloudSolrServer(String zkHost)
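The explicit shards syntax above can be sketched in a few lines of Python. This is only an illustration of the string format ("|" between equivalent nodes, "," between shards); the helper name build_shards_param and the q=foo query are assumptions, and nothing is sent to a server.

```python
from urllib.parse import urlencode


def build_shards_param(shards):
    """Join equivalent nodes of one shard with '|' and separate shards with ','."""
    return ",".join("|".join(replicas) for replicas in shards)


# Two shards, each with two equivalent nodes (addresses taken from the slide).
shards = [
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
]

shards_value = build_shards_param(shards)
# q=foo is a placeholder query, not part of the slide.
query_string = urlencode({"q": "foo", "shards": shards_value})
print(shards_value)
```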

Extended Dismax Parser
§  Superset of dismax
§  Designed to directly handle user queries w/o exceptions
     &defType=edismax&q=foo&qf=body
•  Fixes edge cases where dismax could still throw exceptions, e.g.
     OR AND NOT -
§  Full Lucene syntax support
•  Tries Lucene syntax first
•  Smart escaping is done if there are syntax errors
•  Optionally supports treating lowercase “and” / “or” as AND / OR in Lucene syntax
•  Fielded queries (e.g. myfield:foo) even in degraded mode
•  uf parameter controls what field names may be directly specified in q
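As a rough sketch of how a client might assemble these request parameters, the Python below builds an edismax query string from raw user input. The helper name edismax_params is hypothetical, and passing uf as a space-separated field list is an assumption about the parameter format; no request is sent.

```python
from urllib.parse import parse_qs, urlencode


def edismax_params(user_query, query_fields, user_fields=None):
    """Build request parameters for the edismax parser from raw user input."""
    params = {
        "defType": "edismax",
        "q": user_query,               # passed through as typed; the parser handles bad syntax
        "qf": " ".join(query_fields),
    }
    if user_fields:
        # Assumption: uf takes a space-separated list of field names.
        params["uf"] = " ".join(user_fields)
    return params


# A query made only of operators, the edge case mentioned on the slide.
params = edismax_params("OR AND NOT -", ["body"], user_fields=["myfield"])
query_string = urlencode(params)
decoded = parse_qs(query_string)
print(query_string)
```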

Extended Dismax Parser (continued)
§  boost parameter for multiplicative boost-by-function
§  Pure negative query clauses
     Example: solr OR (-solr)
§  Enhanced term proximity boosting
•  pf2=myfield – results in term bigrams in sloppy phrase queries
     myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
§  Enhanced stopword handling
•  Stopwords omitted in main query, but added in optional proximity boosting part
     Example: q=solr is awesome & qf=myfield & pf2=myfield
     -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
•  Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
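The bigram and stopword behaviour described above can be imitated with a small Python sketch. It is not the parser's implementation, only an illustration of the resulting query shape; the function names and the one-word stopword set are assumptions.

```python
def bigram_phrases(field, terms):
    """Return one quoted phrase clause per adjacent pair of terms (pf2-style)."""
    return [f'{field}:"{a} {b}"' for a, b in zip(terms, terms[1:])]


def drop_stopwords(terms, stopwords):
    """Keep only the terms that are not stopwords."""
    return [t for t in terms if t not in stopwords]


terms = "solr is awesome".split()
stopwords = {"is"}  # assumed stopword list for this example

# Main query: stopwords removed, remaining terms required.
main_clause = "+myfield:(" + " ".join(drop_stopwords(terms, stopwords)) + ")"
# Optional proximity boost: bigrams built from ALL terms, stopwords included.
boost_clause = "(" + " ".join(bigram_phrases("myfield", terms)) + ")"

print(main_clause, boost_clause)
```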

Faceting Performance Improvements

§  For facet.method=enum, faster initial population of the filterCache (i.e. the first time a field is faceted): improvements range from 30% to 32x
§  Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
§  Optimized deep facet paging – up to 10x faster with really large facet.offset values
§  Less memory consumed by field cache entries
§  Per-segment faceting with facet.method=fcs
•  Only faster when re-opening index frequently (many times a second)
•  Only works for single-valued fields

Pivot Faceting
§  Other names that could have made sense:
•  Grid Faceting, Cross-Product Faceting, Matrix Faceting
§  Syntax: facet.pivot=field1,field2,field3,…

facet.pivot=cat,inStock

                     #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics      14      10                      4
cat:memory           3       3                       0
cat:connector        2       0                       2
cat:graphics card    2       0                       2
cat:hard drive       2       2                       0

Pivot Faceting (continued)
http://...&facet=true&facet.pivot=cat,popularity

"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,                  <- 14 docs w/ cat==electronics
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},              <- 5 docs w/ cat==electronics && popularity==6
          {
            "field":"popularity",
            "value":"7",
            "count":4},
          {
            "field":"popularity",
            "value":"1",
            "count":2}]},
      {
        "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]
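A client usually needs to walk the nested pivot structure. The Python sketch below flattens a facet_pivot list into (path, count) rows; the sample data is a shortened copy of the response above, and the function name flatten_pivot is an assumption for illustration.

```python
def flatten_pivot(entries, path=()):
    """Yield (path, count) for every node of a nested facet_pivot list."""
    for entry in entries:
        here = path + ((entry["field"], entry["value"]),)
        yield here, entry["count"]
        # Child pivots narrow the parent constraint by one more field.
        yield from flatten_pivot(entry.get("pivot", []), here)


# Shortened copy of the "cat,popularity" pivot from the slide.
sample = [
    {"field": "cat", "value": "electronics", "count": 14, "pivot": [
        {"field": "popularity", "value": "6", "count": 5},
        {"field": "popularity", "value": "7", "count": 4},
    ]},
    {"field": "cat", "value": "memory", "count": 3, "pivot": []},
]

rows = list(flatten_pivot(sample))
for path, count in rows:
    print(" && ".join(f"{field}=={value}" for field, value in path), count)
```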

Range Faceting
§  Like Date faceting, but more generic

http://...&facet=true
     &facet.range=price
     &facet.range.start=0
     &facet.range.end=500
     &facet.range.gap=50

"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
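To make the start/end/gap semantics concrete, here is a small Python sketch that buckets values the same way the response above is keyed, by the lower bound of each range. It assumes each bucket includes its lower bound and excludes its upper bound; the price list is invented for the example.

```python
import math


def range_facet(values, start, end, gap):
    """Count values into [lower, lower + gap) buckets between start and end."""
    n_buckets = math.ceil((end - start) / gap)
    counts = {float(start + i * gap): 0 for i in range(n_buckets)}
    for value in values:
        if start <= value < end:
            lower = float(start + int((value - start) // gap) * gap)
            counts[lower] += 1
    return counts


# Hypothetical prices, not the data behind the slide.
prices = [11.5, 19.95, 74.99, 179.99, 399.0]
counts = range_facet(prices, start=0, end=500, gap=50)
print(counts)
```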

Spatial Search
Step 1: Index some locations!
     <field name="name">The Alpine Shop</field>
     <field name="store">44.013617,-73.168264</field>

Step 2: Decide where you are
     &pt=44.0153371,-73.16734
     &d=1
     &sfield=store

Step 3: Profit!
§  Spatial Filter: &fq={!geofilt}
§  Bounding Box: &fq={!bbox}
§  Distance Function: &sort=geodist() asc
     Note: You can now sort by any arbitrary function query!
§  Returning the distance: &fl=geodist()
     Pseudo-fields!
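Putting the three steps together, a client request might be assembled as in the Python sketch below. The helper name spatial_params and the q=*:* main query are assumptions; the other parameter names and values come from the slide, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode


def spatial_params(lat, lon, distance, sfield, use_bbox=False):
    """Build filter + sort parameters for a spatial query around a point."""
    parser = "bbox" if use_bbox else "geofilt"
    return {
        "q": "*:*",                   # assumed match-all main query
        "fq": "{!%s}" % parser,       # spatial filter or bounding box
        "pt": f"{lat},{lon}",         # where you are
        "d": str(distance),           # search distance
        "sfield": sfield,             # field holding the indexed location
        "sort": "geodist() asc",      # nearest first
    }


params = spatial_params(44.0153371, -73.16734, 1, "store")
query_string = urlencode(params)
decoded = parse_qs(query_string)
print(query_string)
```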

Pseudo-Fields
Returns other info along with document stored fields
§  Function queries
fl=name,location,geodist(),add(myfield,10)

§  Fieldname globs
fl=id,attr_*

§  Multiple “fl” (field list) values
&fl=id,attr_*&fl=geodist()&fl=termfreq(text,'solr')

§  Aliasing
fl=id,location:loc,_dist_:geodist()

§  Future: inlined highlighting, “explain”, sort-values,
group-value


Result Grouping / Field Collapsing
§  Goal
•  Limit the number of results per category
•  Category normally defined by unique values in a field
§  Uses
•  Web Search – collapse by web site
•  Email threads – collapse by thread id
•  Ecommerce/retail – show the top 5 items for each store category (music, movies, etc.)

Result Grouping by Category
Field Collapse on Product Type

Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
"manu_exact":{
"matches":3,
"groups":[{
"groupValue":"Belkin",
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"}]
}},
{
"groupValue":"Apple Computer Inc.",
{

Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]
&group.query=price:[100 TO *]&group.limit=5
"grouped":{
"price:[0 TO 99.99]":{
"matches":3,
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"},
{
"id":"F8V7067-APL-KIT",
"name":"Belkin Mobile Power Cord for iPod"}]
}},
"price:[100 TO *]":{
"matches":3,

Grouping Params

parameter                        meaning                                             default
group.field=<field>              Like facet.field – group by unique field values
group.query=<query>              Like facet.query – top docs that also match
group.function=<function query>  Group by unique values produced by the function
                                 query
group.limit=<n>                  How many docs per group                             1
group.sort=<sort spec>           How to sort documents within a group                Same as sort
rows=<n>                         How many groups to return                           10
sort=<sort spec>                 How to sort the groups relative to each other
                                 (based on top doc)
group.format=<format>            grouped/simple – if simple, a single flat list is   grouped
                                 used and rows units are “docs”
group.main=true/false            If true, the first field grouping command is used   false
                                 as main result set
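The effect of group.field, group.limit and rows can be mimicked in plain Python. This sketch only illustrates the semantics in the table, not how Solr computes groups; it assumes the input list is already in sort order, and the third document's id is a made-up placeholder.

```python
def group_by_field(docs, field, group_limit=1, rows=10):
    """Group already-sorted docs by a field, keeping group_limit docs per group."""
    groups = {}
    for doc in docs:
        # dict keeps insertion order, so groups are ranked by their top doc.
        groups.setdefault(doc.get(field), []).append(doc)
    return [
        {"groupValue": value,
         "doclist": {"numFound": len(members), "docs": members[:group_limit]}}
        for value, members in list(groups.items())[:rows]
    ]


docs = [
    {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "name": "Belkin Mobile Power Cord for iPod", "manu_exact": "Belkin"},
    # Hypothetical id; only the name appears in the slides.
    {"id": "example-apple-doc", "name": "Apple 60 GB iPod with Video Playback Black",
     "manu_exact": "Apple Computer Inc."},
]

groups = group_by_field(docs, "manu_exact")
for group in groups:
    print(group["groupValue"], group["doclist"]["numFound"])
```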

Pseudo-Join

Blogs:
     id: blog1
     name: Solr ‘n Stuff
     owner: Yonik Seeley
     started: 2007-10-26

     id: blog2
     name: lifehacker
     owner: Gawker Media
     started: 2005-1-31

Posts:
     id: post1
     blog_id: blog1
     author: Yonik Seeley
     title: Solr relevancy function queries
     body: Lucene’s default ranking […]

     id: post2
     blog_id: blog1
     author: Yonik Seeley
     title: Solr result grouping
     body: Result Grouping, also called […]

     id: post3
     blog_id: blog2
     author: Whitson Gordon
     title: How to Install Netflix on Almost Any Android Device

Restrict to blogs mentioning netflix
     fq={!join from=blog_id to=id}body:netflix

-  Finds all documents matching “netflix”
-  Maps to different docs by following blog_id to id


Pseudo-Join Examples
§  Only show posts from blogs started after 2010
q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]

§  If any post in a blog mentions “obama”, then search
all posts in that blog for “bomb” (self-join)
q=bomb&fq={!join from=blog_id to=blog_id}obama

§  If any blog post mentions “obama”, then search all
websites with the same blog owner for “bomb”
q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama
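The from/to mapping behind these filters can be illustrated with an in-memory Python sketch. It is not Solr's implementation, just the two steps of a join: collect the "from" values of matching docs, then keep docs whose "to" field holds one of them. The function name pseudo_join and the body text of post3 are assumptions.

```python
def pseudo_join(docs, from_field, to_field, matches):
    """Keep docs whose to_field equals a from_field value of any matching doc."""
    keys = set()
    for doc in docs:
        if matches(doc) and from_field in doc:
            value = doc[from_field]
            keys.update(value if isinstance(value, list) else [value])

    def is_joined(doc):
        value = doc.get(to_field)
        values = value if isinstance(value, list) else [value]
        return any(v in keys for v in values)

    return [doc for doc in docs if is_joined(doc)]


docs = [
    {"id": "blog1", "name": "Solr 'n Stuff", "owner": "Yonik Seeley"},
    {"id": "blog2", "name": "lifehacker", "owner": "Gawker Media"},
    {"id": "post1", "blog_id": "blog1", "body": "Lucene's default ranking"},
    {"id": "post2", "blog_id": "blog1", "body": "Result Grouping, also called"},
    # Hypothetical body text; the slide shows only the title of post3.
    {"id": "post3", "blog_id": "blog2", "body": "install netflix on android"},
]

# Equivalent in spirit to fq={!join from=blog_id to=id}body:netflix
restricted = pseudo_join(docs, "blog_id", "id",
                         lambda d: "netflix" in d.get("body", "").lower())
print([d["id"] for d in restricted])
```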


Cross-Core Join

collection1:
     id: doc1
     security: managers
     title: doc for managers only
     body: …

     id: doc1
     security: managers, employees
     title: doc for everyone
     body: …

sec1:
     id: mary
     security_groups: managers, employees

     id: john
     security_groups: employees

Single Solr Server

http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john


Pseudo-Join vs Grouping

Pseudo-Join                                      Result Grouping / Field Collapsing

O(n_terms_in_join_fields)                        O(n_docs_in_result)

Single or multi-valued fields                    Single-valued fields only

Filters only (no info currently passed from      Can order docs within a group and groups
the “from” docs to the “to” docs)                by top doc within that group using normal
                                                 sort criteria

Chainable (one join can be the input to          Not currently chainable – can only group
another)                                         one field deep

Affects which documents match a request,         Grouping does not currently affect the set
so naturally affects facet numbers (e.g.         of documents matching the query, so
you can search posts and get numbers of          faceting is unaffected
blogs)


Auto-Suggest
§  Many people previously used terms component
•  Can be slow for a large corpus
§  New auto-suggest builds off SpellCheck component
•  TST implementation: compact memory based trie
•  FST implementation: slower to build, but smaller & faster lookup
•  Based on a field in the main index, or on a dictionary file
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

"spellcheck":{
"suggestions":[
"ult",{
"numFound":1,
"startOffset":0,
"endOffset":3,
"suggestion":["ultrasharp"]},
"collation","ultrasharp"]}}
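To show why a trie suits prefix lookup, here is a minimal Python sketch. It is a plain dictionary-based trie, not the TST or FST structures named above, and every term except "ultrasharp" is invented for the example.

```python
class Trie:
    """Plain character trie supporting prefix completion."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def suggest(self, prefix, limit=5):
        # Walk down to the node for the prefix, if it exists.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Depth-first walk in alphabetical order below that node.
        results, stack = [], [(node, prefix)]
        while stack and len(results) < limit:
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], word + ch))
        return results


trie = Trie()
for term in ["ultrasharp", "ultra", "usb", "ipod"]:  # mostly hypothetical terms
    trie.insert(term)

suggestions = trie.suggest("ult")
print(suggestions)
```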

Index with JSON

     $ URL=http://localhost:8983/solr/update/json
     $ curl $URL -H 'Content-type:application/json' -d '
     [
       {
         "id" : "978-0641723445",
         "cat" : ["book","hardcover"],
         "title" : "The Lightning Thief",
         "author" : "Rick Riordan",
         "series_t" : "Percy Jackson and the Olympians",
         "sequence_i" : 1,
         "genre_s" : "fantasy",
         "inStock" : true,
         "price" : 12.50,
         "pages_i" : 384
       }
     ]'
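The same update can be prepared from Python with only the standard library. This sketch builds the POST request but does not send it; the helper name build_update_request is an assumption, and the URL and document are the ones from the curl example.

```python
import json
import urllib.request


def build_update_request(docs, url="http://localhost:8983/solr/update/json"):
    """Build (but do not send) a JSON update request carrying a list of docs."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


book = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}

request = build_update_request([book])
# With a Solr instance running, the request could be sent with:
# urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```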

Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

§  Can handle multi-valued fields (see cat field in example)
§  Completely compatible with the CSV update handler (can round-trip)
§  Results are streamed – good for dumping entire parts of the index
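On the client side, a CSV response like the one above can be read with Python's csv module. The sketch below parses the sample rows from the slide and splits the multi-valued cat field on commas; treating cat as the only multi-valued field is an assumption of the example.

```python
import csv
import io

# Sample response copied from the slide.
SAMPLE = """name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
"""


def parse_csv_response(text, multi_valued=("cat",)):
    """Parse CSV text into dicts, splitting multi-valued fields into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",") if row[field] else []
        rows.append(row)
    return rows


rows = parse_csv_response(SAMPLE)
for row in rows:
    print(row["name"], row["cat"])
```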

http://localhost:8983/solr/browse
