2. Solr 3.1 Highlights
§ Numeric range facets (similar to date faceting).
§ New spatial search, including spatial filtering,
boosting and sorting capabilities.
§ Example Velocity driven search UI at
http://localhost:8983/solr/browse
§ A new faster termvector-based highlighter.
§ Extended dismax (edismax) query parser with
support for fielded queries, enhanced relevancy, and
full lucene syntax support.
§ Distributed search support for the Spell check and
Terms components.
3. Solr 3.1 Highlights (continued)
§ Suggester, a fast trie-based autocomplete
component.
§ Sort results by any function query.
§ JSON document indexing.
§ CSV response format
§ Apache UIMA integration for metadata
extraction.
§ Tons of optimizations, bugfixes, and new
analysis capabilities via Apache Lucene 3.1.
4. What’s not in 3.1?
§ Result Grouping (AKA Field Collapsing)
§ Pivot Faceting
§ SolrCloud
§ Pseudo-fields
§ Pseudo-join
§ Relevancy function queries
§ Per-segment faceting
§ *Tons* of new Lucene performance/efficiency
goodness
5. Recent Lucene Performance
§ TieredMergePolicy – the new default
• Much better for incremental indexing / NRT
• Ignores segment order when selecting best merge
• Takes deletes into account
• Does not over-merge (no cascading merges)
§ Finite State Transducer (FST) based terms index
6. DocumentsWriterPerThread (DWPT)
§ Flushing new segment is now concurrent w/ indexing
§ Use multiple indexing threads/connections
§ When max mem is hit, biggest DWPT is concurrently flushed
[Diagram: indexing threads feed the IndexWriter, which holds several in-memory DWPTs; each DWPT flushes its own segment to disk (_1_0.tiv, _1_0.prx, _1_0.frq; _2_0.tiv, _2_0.prx, _2_0.frq; _3_0.tiv, _3_0.prx, _3_0.frq; …)]
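The "biggest DWPT is flushed when max mem is hit" rule can be illustrated with a toy model. This is only a single-threaded sketch of the selection policy, not Lucene's implementation: the class `ToyIndexWriter`, its byte accounting, and the thread names are all hypothetical, and real flushes run concurrently with indexing.

```python
class ToyIndexWriter:
    """Toy model: one in-memory buffer (DWPT) per indexing thread."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.buffers = {}   # thread name -> bytes buffered in that thread's DWPT
        self.flushed = []   # one (thread name, bytes) entry per flushed segment

    def add_document(self, thread, doc_bytes):
        # Each thread only ever writes into its own buffer.
        self.buffers[thread] = self.buffers.get(thread, 0) + doc_bytes
        # When the total across all buffers hits the limit, flush the biggest one.
        if sum(self.buffers.values()) >= self.max_bytes:
            self._flush_biggest()

    def _flush_biggest(self):
        biggest = max(self.buffers, key=self.buffers.get)
        self.flushed.append((biggest, self.buffers.pop(biggest)))


writer = ToyIndexWriter(max_bytes=100)
writer.add_document("t1", 40)
writer.add_document("t2", 30)
writer.add_document("t3", 20)
writer.add_document("t1", 20)   # total reaches 110, so t1 (60 bytes) is flushed
```

Only the largest buffer is written out; the other threads keep their buffered documents and continue indexing.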
8. Solr Cloud: Getting Started
http://wiki.apache.org/solr/SolrCloud
java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
§ -Dbootstrap_confdir and -Dcollection.configName: upload ./solr/conf to ZK and call it “myconf”
§ -DzkRun: run an internal ZK server
http://localhost:8983/solr/collection1/admin/zookeeper.jsp
9. Distributed Requests
§ Explicitly specify node addresses to load-balance across
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
• A list of equivalent nodes is separated by “|”
• Different phases of the same distributed request use the same node
§ Specify logical shard ids to search across
shards=NY_shard,NJ_shard
§ Query across all shards in the collection
http://localhost:8983/solr/collection1/select?distrib=true
§ public CloudSolrServer(String zkHost)
• SolrJ Java client that load-balances across all nodes in cluster
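The explicit `shards` syntax above can be assembled from a nested list, one inner list of equivalent nodes per shard. A minimal sketch; the helper name `shards_param` is hypothetical and the addresses are the ones from the slide.

```python
def shards_param(shards):
    """Join equivalent nodes with '|' and distinct shards with ','."""
    return ",".join("|".join(replicas) for replicas in shards)


value = shards_param([
    ["localhost:8983/solr", "localhost:8900/solr"],   # shard 1 and its equivalent node
    ["localhost:7574/solr", "localhost:7500/solr"],   # shard 2 and its equivalent node
])
```

The result is the raw parameter value; it still needs URL-encoding before being appended to a request.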
10. Extended Dismax Parser
§ Superset of dismax
§ Designed to directly handle user queries w/o exceptions
&defType=edismax&q=foo&qf=body
• Fixes edge cases where dismax could still throw exceptions, e.g. OR, AND, NOT, -
§ Full lucene syntax support
• Tries lucene syntax first
• Smart escaping is done if there are syntax errors
• Optionally supports treating and / or as AND/OR in lucene syntax
§ Fielded queries (e.g. myfield:foo) even in degraded mode
• uf parameter controls what field names may be directly specified in q
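The `defType=edismax&q=...&qf=...` request above can be built programmatically so that raw user input is always URL-encoded. A minimal sketch, assuming the `http://localhost:8983/solr/select` endpoint used elsewhere in this deck; the helper name `edismax_url` is hypothetical.

```python
from urllib.parse import urlencode


def edismax_url(base, q, qf, **extra):
    """Build a select URL that routes the query through the edismax parser."""
    params = {"defType": "edismax", "q": q, "qf": qf}
    params.update(extra)          # e.g. pf2="myfield" or boost="..."
    return base + "?" + urlencode(params)


BASE = "http://localhost:8983/solr/select"   # assumed endpoint
simple = edismax_url(BASE, "foo", "body")
tricky = edismax_url(BASE, "solr OR (-solr)", "body")
```

`urlencode` escapes spaces and parentheses, so user input such as `solr OR (-solr)` reaches the parser intact.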
11. Extended Dismax Parser (continued)
§ boost parameter for multiplicative boost-by-function
§ Pure negative query clauses
Example: solr OR (-solr)
§ Enhanced term proximity boosting
• pf2=myfield – results in term bigrams in sloppy phrase queries
myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
§ Enhanced stopword handling
• stopwords omitted in main query, but added in optional proximity boosting part
Example: q=solr is awesome & qf=myfield & pf2=myfield ->
+myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
• Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
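The bigram expansion behind `pf2` can be sketched in a few lines. This only illustrates how the phrase clauses are derived from the query terms; it is not the parser's code, and the function name `pf2_bigrams` is hypothetical.

```python
def pf2_bigrams(field, text):
    """Return one phrase clause per pair of adjacent whitespace-separated terms."""
    terms = text.split()
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


clauses = pf2_bigrams("myfield", "aa bb cc")
stopword_clauses = pf2_bigrams("myfield", "solr is awesome")
```

In the second call the stopword "is" still appears in the proximity clauses, matching the stopword example on the slide.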
12. Faceting Performance Improvements
§ For facet.method=enum, speed up initial population of the
filterCache (i.e. first-time faceting): from 30% to 32x
improvement
§ Optimized facet.method=fc for multi-valued fields and large
facet.limit – up to 3x faster
§ Optimized deep facet paging – up to 10x faster with really
large facet.offsets
§ Less memory consumed by field cache entries
§ Per-segment faceting with facet.method=fcs
• Only faster when re-opening index frequently (many times a second)
• Only works for single-valued fields
15. Range Faceting
§ Like Date faceting, but more generic
http://...&facet=true
&facet.range=price
&facet.range.start=0
&facet.range.end=500
&facet.range.gap=50
"facet_counts":{
"facet_ranges":{
"price":{
"counts":{
"0.0":5,
"50.0":2,
"100.0":0,
"150.0":2,
"200.0":0,
"250.0":1,
"300.0":2,
"350.0":2,
"400.0":0,
"450.0":1},
"gap":50.0,
"start":0.0,
"end":500.0}}}}
16. Spatial Search
Step 1: Index some locations!
<field name="name">The Alpine Shop</field>
<field name="store">44.013617,-73.168264</field>
Step 2: Decide where you are
&pt=44.0153371,-73.16734
&d=1
&sfield=store
Step 3: Profit!
Spatial Filter: &fq={!geofilt}
Bounding Box: &fq={!bbox}
Distance Function: &sort=geodist() asc
Returning the distance: &fl=geodist() (Pseudo-fields!)
Note: You can now sort by any arbitrary function query!
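To get a feel for what a distance filter with `d=1` means for the two points on this slide, here is a great-circle (haversine) calculation. It is a sketch under stated assumptions: distances are in kilometres, the Earth is a sphere with a mean radius of 6371 km, and the function name `haversine_km` is hypothetical. It is not Solr's own code.

```python
import math

EARTH_RADIUS_KM = 6371.0   # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)    # "store" field of The Alpine Shop
here = (44.0153371, -73.16734)     # the pt parameter
distance = haversine_km(here[0], here[1], store[0], store[1])
within_filter = distance <= 1      # would pass a radius of d=1 under the km assumption
```

The store is roughly 0.2 km from `pt`, so it falls well inside a 1 km radius.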
17. Pseudo-Fields
Returns other info along with document stored fields
§ Function queries
fl=name,location,geodist(),add(myfield,10)
§ Fieldname globs
fl=id,attr_*
§ Multiple “fl” (field list) values
&fl=id,attr_*&fl=geodist()&fl=termfreq(text,'solr')
§ Aliasing
fl=id,location:loc,_dist_:geodist()
§ Future: inlined highlighting, “explain”, sort-values,
group-value
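Repeating `fl` is easiest from client code when the parameters are passed as a list of pairs, since a dict cannot hold duplicate keys. A minimal sketch; the query `q=*:*` is an assumption added only to make the request complete.

```python
from urllib.parse import urlencode, parse_qs

params = [
    ("q", "*:*"),                          # assumed match-all query
    ("fl", "id,attr_*"),                   # stored fields plus a fieldname glob
    ("fl", "geodist()"),                   # function query as a pseudo-field
    ("fl", "termfreq(text,'solr')"),
]
query_string = urlencode(params)

# Decoding again shows that all three fl values survive the round trip.
decoded = parse_qs(query_string)
```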
18. Result Grouping / Field Collapsing
§ Goal
• Limit the number of results per category
• category normally defined by unique values in a field
§ Uses
• Web Search – collapse by web site
• Email threads – collapse by thread id
• Ecommerce/retail – show the top 5 items for each store category (music, movies, etc.)
21. Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
"manu_exact":{
"matches":3,
"groups":[{
"groupValue":"Belkin",
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"}]
}},
{
"groupValue":"Apple Computer Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
22. Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]
&group.query=price:[100 TO *]&group.limit=5
"grouped":{
"price:[0 TO 99.99]":{
"matches":3,
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"},
{
"id":"F8V7067-APL-KIT",
"name":"Belkin Mobile Power Cord for iPod"}]
}},
"price:[100 TO *]":{
"matches":3,
"doclist":{"numFound":1,"start":0,"docs":[
23. Grouping Params
parameter | meaning | default
group.field=<field> | Like facet.field – group by unique field values |
group.query=<query> | Like facet.query – top docs that also match |
group.function=<function query> | Group by unique values produced by the function query |
group.limit=<n> | How many docs per group | 1
group.sort=<sort spec> | How to sort documents within a group | Same as sort
rows=<n> | How many groups to return | 10
sort=<sort spec> | How to sort the groups relative to each other (based on top doc) |
group.format=<format> | grouped/simple – if simple, a single flat list is used and rows units are “docs” | grouped
group.main=true/false | If true, the first field grouping command is used as main result set | false
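The interplay of `group.field`, `group.limit` and `rows` can be sketched on an in-memory result list. This is a toy model of the semantics only, assuming the input is already sorted by the main `sort` (best first) and that `group.sort` is left at its default; the function name and the document id `APPLE-DOC` are hypothetical.

```python
def group_by_field(sorted_docs, field, group_limit=1, rows=10):
    """Group docs by field value; groups are ordered by their top document."""
    groups = {}   # dicts keep insertion order, so the first doc seen ranks its group
    for doc in sorted_docs:
        groups.setdefault(doc.get(field), []).append(doc)
    result = []
    for value, members in list(groups.items())[:rows]:
        result.append({
            "groupValue": value,
            "numFound": len(members),          # all docs in the group
            "docs": members[:group_limit],     # only the top group_limit are returned
        })
    return result


docs = [
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},
    {"id": "APPLE-DOC", "manu_exact": "Apple Computer Inc."},   # hypothetical id
]
grouped = group_by_field(docs, "manu_exact")
```

With the default `group.limit` of 1, the Belkin group reports `numFound` 2 but returns a single document, as in the "Group by Field" response.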
24. Pseudo-Join
Blog documents:
id: blog1; name: Solr ‘n Stuff; owner: Yonik Seeley; started: 2007-10-26
id: blog2; name: lifehacker; owner: Gawker Media; started: 2005-1-31
Post documents:
id: post1; blog_id: blog1; author: Yonik Seeley; title: Solr relevancy function queries; body: Lucene’s default ranking […]
id: post2; blog_id: blog1; author: Yonik Seeley; title: Solr result grouping; body: Result Grouping, also called […]
id: post3; blog_id: blog2; author: Whitson Gordon; title: How to Install Netflix on Almost Any Android Device
Restrict to blogs mentioning netflix
fq={!join from=blog_id to=id}body:netflix
- Finds all documents matching “netflix”
- Maps to different docs by following blog_id to id
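The two steps of the join (find the matching "from" documents, then follow `from` to `to`) can be modelled over plain dictionaries. A sketch of the semantics only, not of how Solr executes it; the function name `join` is hypothetical, and the body text of post3 is an assumption because the slide does not show it.

```python
def join(index, matches, from_field, to_field):
    """Return docs in index whose to_field equals a from_field value of any match."""
    keys = set()
    for doc in matches:
        value = doc.get(from_field)
        if value is not None:
            keys.update(value if isinstance(value, list) else [value])
    joined = []
    for doc in index:
        value = doc.get(to_field)
        values = value if isinstance(value, list) else [value]
        if any(v in keys for v in values):
            joined.append(doc)
    return joined


index = [
    {"id": "blog1", "name": "Solr 'n Stuff"},
    {"id": "blog2", "name": "lifehacker"},
    {"id": "post1", "blog_id": "blog1", "body": "Lucene's default ranking"},
    {"id": "post2", "blog_id": "blog1", "body": "Result Grouping, also called"},
    {"id": "post3", "blog_id": "blog2", "body": "install netflix"},   # assumed body
]

# fq={!join from=blog_id to=id}body:netflix
netflix_posts = [d for d in index if "netflix" in d.get("body", "")]
blogs = join(index, netflix_posts, from_field="blog_id", to_field="id")
```

Swapping the two field names runs the join in the other direction, from blogs to their posts.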
25. Pseudo-Join Examples
§ Only show posts from blogs started after 2010
q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]
§ If any post in a blog mentions “obama”, then search
all posts in that blog for “bomb” (self-join)
q=bomb&fq={!join from=blog_id to=blog_id}obama
§ If any blog post mentions “obama”, then search all
websites with the same blog owner for “bomb”
q=bomb&fq={!join from=owner to=website_owner}{!join
from=blog_id to=id}obama
26. Cross-Core Join
Single Solr Server with two cores:
collection1
id: doc1; security: managers; title: doc for managers only; body: …
id: doc2; security: managers, employees; title: doc for everyone; body: …
sec1
id: mary; security_groups: managers, employees
id: john; security_groups: employees
http://localhost:8983/solr/collection1/select?q=foo&fq={!join
fromIndex=sec1 from=security_groups to=security}user:john
27. Pseudo-Join vs Grouping
Pseudo-Join | Result Grouping / Field Collapsing
O(n_terms_in_join_fields) | O(n_docs_in_result)
Single or multi-valued fields | Single-valued fields only
Filters only (no info currently passed from the “from” docs to the “to” docs). | Can order docs within a group and groups by top doc within that group using normal sort criteria.
Chainable (one join can be the input to another) | Not currently chainable – can only group one field deep
Affects which documents match a request, so naturally affects facet numbers (e.g. you can search posts and get numbers of blogs) | Grouping does not currently affect the set of documents matching the query, so faceting is unaffected.
28. Auto-Suggest
§ Many people previously used terms component
• Can be slow for a large corpus
§ New auto-suggest builds off SpellCheck component
• TST implementation: compact memory based trie
• FST implementation: slower to build, but smaller & faster lookup
• Based on a field in the main index, or on a dictionary file
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
"suggestions":[
"ult",{
"numFound":1,
"startOffset":0,
"endOffset":3,
"suggestion":["ultrasharp"]},
"collation","ultrasharp"]}}
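The prefix lookup behind the `q=ult` suggestion can be illustrated with a sorted term list and binary search. This is a deliberately simple stand-in, not the TST or FST structures named on the Auto-Suggest slide; the function name `suggest` and all terms other than "ultrasharp" are hypothetical.

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to limit terms starting with prefix from a sorted list."""
    start = bisect.bisect_left(sorted_terms, prefix)   # first term >= prefix
    results = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(results) >= limit:
            break
        results.append(term)
    return results


terms = sorted(["ultrasharp", "usb", "video", "ipod", "belkin"])
suggestions = suggest(terms, "ult")
```

Because the list is sorted, all terms sharing a prefix are adjacent, so the scan stops at the first non-matching term.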
29. Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
[
  {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
]'
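The same update can be prepared from client code. The sketch below only builds the HTTP request object with the standard library and does not send it; the helper name `build_update_request` is hypothetical, and the URL and header are the ones from the curl command.

```python
import json
from urllib.request import Request

UPDATE_URL = "http://localhost:8983/solr/update/json"


def build_update_request(docs, url=UPDATE_URL):
    """Serialize docs as a JSON array and wrap them in a POST request."""
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-type": "application/json"},
                   method="POST")


book = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "price": 12.50,
    "inStock": True,
}
request = build_update_request([book])
# Passing request to urllib.request.urlopen would send it to a running Solr.
```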
30. Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
§ Can handle multi-valued fields (see cat field in example)
§ Completely compatible with the CSV update handler (can round-trip)
§ Results are streamed – good for dumping entire parts of the index
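On the client side, the response above parses with any standard CSV reader. A minimal sketch using the three rows from the slide, assuming the multi-valued `cat` field is a quoted, comma-separated list as shown; the function name `parse_results` is hypothetical.

```python
import csv
import io

CSV_TEXT = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''


def parse_results(text, multi_valued=("cat",)):
    """Parse a CSV response into dicts, splitting multi-valued fields into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",")
        row["price"] = float(row["price"])
        rows.append(row)
    return rows


results = parse_results(CSV_TEXT)
```

Because the response is streamed, a real client could hand the open HTTP response to `csv.DictReader` line by line instead of buffering the whole text first.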