Solr Black Belt Pre-conference

Solr Black Belt
code4lib conference 2010 - Asheville, NC
Erik Hatcher, Lucid Imagination
Naomi Dushay, Stanford

1

What’s new in Solr 1.4?
• Java-based replication
• TermsComponent
• VelocityResponseWriter (Solritas)
• Rich document indexing, via Tika (Solr Cell)
• Logging switched to SLF4J
• Greatly improved faceting performance
• Rollback, since last commit
• StatsComponent
• Exact/near duplicate document handling
• TermVectorComponent
• Support added for Lucene's omitTf
• Configurable Directory provider
• "trie" range query support
• CharFilter

2

Performance Improvements
• Caching
• Concurrent file access
• Per-segment index updates
• Faceting
• DocSet generation, avoids scoring
• Streaming updates for SolrJ
3

Lucene 2.9
• IndexReader#reopen()

• Faster filter performance, by 300% in some cases

• Per-segment FieldCache

• Reusable token streams

• Faster numeric/date range queries, thanks to trie

• and tons more, see Lucene 2.9's CHANGES.txt

4

Deployment Architectures

5

JVM
• -server
• -XmxNNNNm
• Java 1.6 (latest point release)
• garbage collector
• 64-bit?
• Tools: JVM GC logging, jconsole
6

Useful JVM switches
• -Xloggc:gc.out: Writes GC information to a file named "gc.out".

• -XX:+PrintGC: Outputs basic information at every garbage
collection.

• -XX:+PrintGCDetails: Outputs more detailed information at every
garbage collection.

• -XX:+PrintGCTimeStamps: Outputs a time stamp at the start of
each garbage collection event. Used with -XX:+PrintGC or
-XX:+PrintGCDetails to show when each garbage collection begins.

• -XX:+HeapDumpOnOutOfMemoryError: Dumps the heap to a file when
java.lang.OutOfMemoryError is thrown.
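
As a rough sketch, the switches from this slide and the previous one could be assembled into a launch command like the one below. The heap size, the log file name, and the `start.jar` launcher (as used by the Solr example setup) are assumptions for illustration; adjust them for your own container.

```python
def solr_java_command(heap_mb, gc_log="gc.out", jar="start.jar"):
    """Build a java command line with GC logging and heap dump on OOM enabled."""
    return [
        "java",
        "-server",
        f"-Xmx{heap_mb}m",                  # maximum heap size in megabytes
        f"-Xloggc:{gc_log}",                # write GC information to this file
        "-XX:+PrintGCDetails",
        "-XX:+PrintGCTimeStamps",
        "-XX:+HeapDumpOnOutOfMemoryError",  # '+' enables the option, '-' disables it
        "-jar", jar,                        # hypothetical launcher jar
    ]

print(" ".join(solr_java_command(1024)))
```

Keeping the switches in one place makes it easier to compare GC logs between runs with different heap sizes.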

7

Indexing Performance
• Tricks of the trade:
• multithread/multiprocess
• batch documents
• separate Solr server and indexers
• Indexing master + replicants
• StreamingUpdateSolrServer + javabin
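
The "batch documents" tip can be sketched as follows: group records into batches and serialize each batch as one Solr XML `<add>` message instead of sending one request per document. The field names and the batch size are hypothetical; each message would then be posted to the update handler, ideally from several threads or processes, with a single commit at the end.

```python
import xml.etree.ElementTree as ET

def batches(docs, size):
    """Yield consecutive slices of docs, each holding at most size documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]

def add_message(batch):
    """Serialize one batch of dict-like documents as a Solr XML <add> message."""
    add = ET.Element("add")
    for doc in batch:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# Hypothetical records; a real indexer would read these from MARC, a database, etc.
docs = [{"id": str(i), "title": f"Record {i}"} for i in range(5)]
messages = [add_message(batch) for batch in batches(docs, 2)]
print(len(messages))
print(messages[0])
```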
8

MARC indexing strategies

• SolrMarc
• Future? DataImportHandler hooks

9

Index Settings
• useCompoundFile: set to false

• mergeFactor: 10 or lower, generally

• ramBufferSizeMB: buffer used for added documents before flushing to
directory; more predictable than using maxBufferedDocs.
Benchmarking shows <= 128 is best.

• maxMergeDocs: maximum number of documents for a single segment

• maxFieldLength: generally the maximum int value, 2147483647, is desired

• maxWarmingSearchers: 1 is best

10

Searching Performance
• javabin - binary protocol for Java clients
• caches: filterCache most relevant here
• autowarm
• FastLRUCache
• warming queries: firstSearcher, newSearcher
• sorting, faceting

11

debugQuery=true

• parsed queries
• scoring explanations
• search component timings

12

Query Parsing

• defType
• applies to main query only
• fq parsed as "lucene" unless individually
overridden
• {!parser local=params}query string

13

Solr Query Parser
(lucene)
• http://lucene.apache.org/java/2_9_1/queryparsersyntax.html + Solr extensions

• Kitchen sink parser, includes advanced user-unfriendly syntax

• Syntax errors throw parse exceptions back to client

• Example: title:ipod* AND price:[0 TO 100]

• http://wiki.apache.org/solr/SolrQuerySyntax

14

SolrQueryParser
• Default query parser

• schema.xml

• <defaultSearchField>text</defaultSearchField>

• <solrQueryParser defaultOperator="OR"/>

• Adds _query_:"..." and _val_:"..." hooks

• Supports leading wildcards with
ReversedWildcardFilterFactory

15

Dismax Query Parser
(dismax)
• Simplified syntax:
loose text "quoted phrases" -prohibited +required
• Spreads query terms across query fields
(qf) with dynamic boosting per field, phrase
construction (pf), and boosting query and
function capabilities (bq and bf)

16

dismax: q and q.alt

• odd number of quotes is parsed as if there
were no quotes
• wildcards, fuzzy, etc not supported
• q.alt: alternate query; "lucene" parsed, used
when q is omitted; useful as *:* to get
collection-wide facet counts

17

dismax: qf and pf
• query fields / phrase fields
• syntax: field[^boost]...
• example: title^2 body
• pf for boosting where terms in q are in
close proximity; entire q string is used as
phrase implicitly
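
A minimal sketch of how a client might assemble qf and pf for a dismax request. The host, the `/select` path, and the field names are assumptions taken from the examples on these slides, not a prescribed setup.

```python
from urllib.parse import urlencode, parse_qs

def dismax_params(user_query, qf, pf):
    """Build request parameters for a dismax query over boosted fields."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(qf),   # fields searched term by term, e.g. "title^2 body"
        "pf": " ".join(pf),   # fields where the whole q string is tried as a phrase
    }

params = dismax_params("black belt", qf=["title^2", "body"], pf=["title^2", "body"])
query_string = urlencode(params)
print("http://localhost:8983/solr/select?" + query_string)
```

Letting `urlencode` handle the `^` and the spaces avoids hand-escaping mistakes in the request URL.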

18

dismax: qs and ps

• qs: query slop; used for explicit "phrase
queries"
• ps: phrase slop; used for implicit phrase
query added for pf fields

19

dismax: mm
• minimum match, for optional clauses

• default = 100% (pure AND)

• Examples:

• pure OR: mm=0 or mm=0%

• at least two should match: mm=2

• at least 75% should match: mm=75%

• 1-3 clauses: all must match; 4 or more: 90% must match: mm=3<90%

• 1-2 clauses: all required; 3-9 clauses: all but 25% must match; 10 or more: all
but 3 are required: mm=2<-25% 9<-3

• 1-2 clauses: all must match; 3-5 clauses: one less than the number of clauses
must match; 6 or more clauses: 80% must match, rounded down:
mm=2<-1 5<80%

http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
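
To make the mm arithmetic concrete, here is a small sketch that evaluates an mm specification for a given number of optional clauses. It follows the rules described in the min-should-match documentation as understood here (percentages round down, negative values mean "all but"); the function name is hypothetical and this is not Solr's own code.

```python
def min_should_match(clause_count, spec):
    """Return how many of clause_count optional clauses must match under spec."""
    spec = spec.strip()
    result = clause_count
    if "<" in spec:
        # Conditional form such as "2<-25% 9<-3": the last rule whose bound is
        # smaller than the clause count applies; if none applies, all are required.
        for part in spec.split():
            bound, sub_spec = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, sub_spec)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        calc = clause_count * abs(percent) // 100   # percentage, rounded down
        result = clause_count - calc if percent < 0 else calc
    else:
        calc = int(spec)
        result = clause_count + calc if calc < 0 else calc
    # Never require more clauses than exist, and never fewer than zero.
    return max(0, min(clause_count, result))

for n in (2, 4, 10):
    print(n, min_should_match(n, "2<-25% 9<-3"))
```

With `mm=2<-25% 9<-3` this prints 2, 3, and 7 required clauses for queries of 2, 4, and 10 clauses.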

20

dismax: tie
• tiebreaker

• more than one field may match, each scored based on term
frequency

• controls how much the final score of the query is influenced by the
scores of the lower scoring fields compared to the highest scoring
field

• A value of "0.0" makes the query a pure "disjunction max query":
only the maximum scoring sub query contributes to the final score.
A value of "1.0" makes the query a pure "disjunction sum query"
where it doesn't matter what the maximum scoring sub query is;
the final score is the sum of the sub scores. Typically a low value (e.g.
0.1) is useful.

21

dismax: tie
• The “tie” (tie breaker) parameter is very important, but not easy to understand. It may
be useful to visualize it as a “slider” control between 0 and 1, with a value of 0 being a
“pure disjunction max” query, and a value of “1” being a “pure disjunction sum” query.
So the “max” score is added to the sum of all other scores multiplied by the tie
breaker:

• If the max score is 2.12 and the other scores are 1.7 and 0.5, and the tie breaker is
0:

• score = 2.12 + ((1.7 + 0.5) * 0 ) = 2.12

• If the max score is 2.12 and the other scores are 1.7 and 0.5, and the tie breaker is
1:

• score = 2.12 + ((1.7 + 0.5) * 1) = 4.32
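
The same arithmetic as a tiny function, using the scores from the examples above. The function name is hypothetical; it only restates the formula on this slide.

```python
def dismax_score(field_scores, tie=0.0):
    """Max field score plus the tie breaker times the sum of the other scores."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others

scores = [2.12, 1.7, 0.5]
print(round(dismax_score(scores, tie=0.0), 2))   # 2.12
print(round(dismax_score(scores, tie=1.0), 2))   # 4.32
print(round(dismax_score(scores, tie=0.1), 2))   # 2.34
```

A low tie such as 0.1 keeps the best-matching field dominant while still rewarding documents that match in several fields.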

22

dismax: bq
• boosting query

• "lucene" query parsed, by default

• combined (optionally) with user's query to boost
matching documents

• warning: a boolean query with a boost of 1.0 has its clauses
added as-is, which can be problematic when that adds required/prohibited
clauses; could be caused by multiple bq
parameters

• Example: bq=library:music^2

23

dismax: bf
• boost function

• same as using _val_:"function(...)" in bq
parameter

• example: bf=recip(ms(NOW,mydatefield),3.16e-11,1,1)

• but be careful with adding versus multiplying
scores: bf is additive; see the "boost" query
parser for multiplying

24

local params
• {!parser p=param}expression

• OR {!parser p=param v=expression}

• Indirect parameter values with $syntax:

• {!parser p=$p}expression&p=param

• Real example:

• _query_:"{!dismax qf=$qf_author pf=$pf_author}[advanced
author search box field value]", where qf_author and
pf_author are defined in the request handler mapping, combined
with other clauses or similar _query_'s for other groups

25

Raw query parser
• {!raw f=field}Foo Bar
• exact TermQuery, no analysis or
transformations
• ideal for typical fq usage
• fq={!raw f=format}Musical Score
• avoids query parsing escaping madness
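
A short sketch of building such a filter from a clicked facet value. The helper name is hypothetical, and the field and value come from the example above. The value is passed through untouched, so only URL encoding is needed, not Lucene query escaping.

```python
from urllib.parse import urlencode, parse_qs

def raw_fq(field, value):
    """Build a filter query that matches the exact indexed term, unanalyzed."""
    return "{!raw f=%s}%s" % (field, value)

params = [("q", "*:*"), ("fq", raw_fq("format", "Musical Score"))]
query_string = urlencode(params)
print(query_string)
```

Because the raw parser skips analysis, this suits string fields used for faceting, where the facet value is exactly the indexed term.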
26

request handler ninjitsu
http://localhost:8983/solr/document?id=...

<requestHandler class="solr.SearchHandler" name="/document">

<lst name="invariants">

<str name="q">{!raw f=id v=$id}</str>

<str name="rows">1</str>

<str name="fl">*</str>

</lst>

</requestHandler>

27

Field query parser

• {!field f=field}Foo Bar
• generally equivalent to field:"Foo Bar"
• parses to term or phrase query, depending
on analysis for field

28

Prefix query parser

• {!prefix f=field}foo
• no analysis or transformation performed
• generally equivalent to field:foo*

29

Function query parser

• {!func}log(foo)
• Used for _val_ expressions in "lucene"
parser

30

Boost query parser
• {!boost b=log(popularity)}foo
• Multiplies score, rather than additive
• Example:
• ?q={!boost b=$dateboost v=$qq defType=dismax}&dateboost=recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)&qf=text&pf=text&qq=ipod

31

extended dismax
(edismax)
• Solr 1.5 (currently trunk)

• Supports full lucene query syntax in the absence of syntax
errors: AND/OR/NOT, wildcards, fuzzy...; lowercase and/or also work as operators

• When there are syntax errors: smart partial escaping of special
characters; fielded queries, +/-, and phrases are still supported

• shingles phrases specified in pf2 and pf3 parameters

• advanced stopword handling: stopwords are not required in the
mandatory part of the query but are still used (if indexed) in
the proximity boosting part. If a query consists of all stopwords
(e.g. to be or not to be) then all will be required.

32

edismax: pf2 and pf3

• shingles into two and three term phrases
• prevents problem of needing 100% of the
words in the document, as well as having all
of the words in a single field, to get any
boost

33

edismax: boost

• wraps generated query with boost query
• like the dismax bf param, but multiplies the
function query instead of adding it in

34

Nested queries
• Naomi's "A Better Advanced Search",
Wednesday, 13:00
• http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/
• Example:
• _query_:"{!dismax qf=$qf1}query1"
AND _query_:"{!dismax qf=$qf2}query2"
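
A sketch of how an advanced search form might assemble nested queries like the example above. The parameter names (`qf_author`, `qf_title`) are hypothetical and would be defined in the request handler. Quotes and backslashes in the user's text are escaped because the nested query sits inside a quoted string.

```python
def nested_dismax(qf_param, user_text):
    """Wrap one search box value as a nested dismax query clause."""
    escaped = user_text.replace("\\", "\\\\").replace('"', '\\"')
    return '_query_:"{!dismax qf=$%s}%s"' % (qf_param, escaped)

def advanced_query(clauses, operator="AND"):
    """Join (qf parameter name, user text) pairs with a boolean operator."""
    return (" %s " % operator).join(nested_dismax(p, t) for p, t in clauses)

q = advanced_query([("qf_author", "hatcher"), ("qf_title", 'lucene "in action"')])
print(q)
```

The resulting string is sent as q to the "lucene" parser, which hands each `_query_` clause to dismax with its own set of fields.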

35

Useful request handlers

• dump, ping, luke, system, plugins, threads,
properties, file

36

Dump
• http://localhost:8983/solr/debug/dump
• Echoes parameters, content streams, and
Solr web context
• Careful: with content streams enabled, a client
could retrieve the contents of any file on the
server or accessible network! [Solution:
disable dump request handler]

37

Ping

• http://localhost:8983/solr/admin/ping
• If healthcheck configured and file not
available, error is reported
• Executes single configured request and
reports failure or OK

38

Luke
• http://localhost:8983/solr/admin/luke

• Introspects Lucene index structure and schema
relationships

• See an individual document:

• ?doc=<key> or ?docId=<lucene doc #>

• Schema details: ?show=schema

• Admin schema browser uses Luke request handler

• See also: original Luke tool - http://www.getopt.org/luke/

39

System

• http://localhost:8983/solr/admin/system
• core info, Lucene version, JVM details,
uptime, operating system info

40

Plugins

• http://localhost:8983/solr/admin/plugins
• Configuration details of Solr core, available
query and update handlers, cache settings

41

Threads

• http://localhost:8983/solr/admin/threads
• JVM thread details

42

Properties

• http://localhost:8983/solr/admin/properties
• All JVM system properties, or single
property value (?name=os.arch)

43

File

• http://localhost:8983/solr/admin/file?file=/
• See fetchable directory tree
• http://localhost:8983/solr/admin/file?file=schema.xml&contentType=text/plain

44

Search components

• Standard: query, facet, mlt, highlight,
stats, debug
• Others: elevation, clustering, term,
term vector

45

Clustering

• Dynamic grouping of documents into labeled sets

• http://localhost:8983/solr/clustering?q=*:*&rows=10

• http://wiki.apache.org/solr/ClusteringComponent

• Requires additional steps to install (see
documentation) with Apache Solr distro; baked fully
into Lucid certified distro

46

Terms

• Enumerates terms from specified fields
• http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi

47

Term Vectors

• Details term vector information: term
frequency, document frequency, position
and offset information
• http://localhost:8983/solr/select/?q=*%3A*&qt=tvrh&tv=true&tv.all=true

48

stats.jsp
• Not technically a “request handler”, outputs only
XML

• http://localhost:8983/solr/admin/stats.jsp

• Index stats such as number of documents,
searcher open time

• Request handler details, number of requests and
errors, average request time, average requests
per second, number of pending docs, etc, etc

49

Analysis Tricks
• CharFilters: MappingCharFilterFactory, PatternReplaceCharFilterFactory,
HTMLStripCharFilterFactory

• ReversedWildcardFilterFactory, see example schema.xml "text_rev" field type

• a leading-wildcard query for *thing becomes a query for gniht* against the reversed tokens

• PositionFilterFactory

• "can be used with a query Analyzer to prevent expensive Phrase and
MultiPhraseQueries" or "all words and shingles to be placed at the same position,
so that all shingles to be treated as synonyms of each other."

• CommonGramsFilterFactory - Makes shingles by combining common tokens and
regular tokens

• CollationKeyFilterFactory (Solr 1.5) - locale based sorting

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
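
The idea behind the reversed wildcard trick above can be sketched in a few lines: index each token reversed, then answer a leading-wildcard query as a cheap prefix lookup. This only illustrates the concept with hypothetical helper names; the actual filter does more than this sketch.

```python
def index_reversed(tokens):
    """Map each reversed token back to its original form."""
    return {token[::-1]: token for token in tokens}

def leading_wildcard_search(pattern, reversed_index):
    """Answer a '*suffix' query as a prefix scan over the reversed tokens."""
    if not pattern.startswith("*"):
        raise ValueError("expected a leading-wildcard pattern such as '*thing'")
    prefix = pattern[1:][::-1]   # '*thing' -> 'gniht'
    return sorted(orig for rev, orig in reversed_index.items()
                  if rev.startswith(prefix))

index = index_reversed(["something", "nothing", "thinking"])
print(leading_wildcard_search("*thing", index))   # ['nothing', 'something']
```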
50

Faceting

• multi-select
• hierarchical

51

Multi-select

• &facet.field=facet_field&fq=facet_field:(value1 OR value2)
• to exclude filters from facet counts:
• &facet.field={!ex=group}facet_field&fq={!tag=group}facet_field:value2
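
A sketch of generating these parameters from the values a user has ticked. The helper name, the field name `format`, and the tag name are hypothetical. Values are quoted here so that multi-word facet values survive query parsing.

```python
from urllib.parse import urlencode

def multiselect_params(user_query, facet_field, selected_values, tag="group"):
    """Facet on a field while excluding that field's own filter from its counts."""
    params = [
        ("q", user_query),
        ("facet", "true"),
        ("facet.field", "{!ex=%s}%s" % (tag, facet_field)),
    ]
    if selected_values:
        values = " OR ".join('"%s"' % v.replace('"', '\\"') for v in selected_values)
        params.append(("fq", "{!tag=%s}%s:(%s)" % (tag, facet_field, values)))
    return params

params = multiselect_params("*:*", "format", ["Book", "Musical Score"])
print(urlencode(params))
```

Because the tagged filter is excluded when counting, the unselected values keep their counts and the user can still add them to the selection.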

52

Hierarchical

• http://wiki.apache.org/solr/HierarchicalFaceting

53

Facet paging

• Blacklight trick, requesting one more than
page size
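
A minimal sketch of the trick: ask for one facet value more than the page shows, and treat the presence of that extra value as the signal that a next page exists. `facet.offset` and `facet.limit` are the standard facet paging parameters; the helper names and the field name are hypothetical.

```python
def facet_page_params(field, page, page_size):
    """Request parameters for one zero-based page of facet values, plus a probe."""
    return {
        "facet": "true",
        "facet.field": field,
        "facet.offset": page * page_size,
        "facet.limit": page_size + 1,   # one extra value to detect a next page
    }

def split_page(facet_values, page_size):
    """Return the values to display and whether another page exists."""
    has_next = len(facet_values) > page_size
    return facet_values[:page_size], has_next

returned = [("Book", 120), ("Map", 40), ("Score", 7)]   # 3 values for a page of 2
page, has_next = split_page(returned, 2)
print(page, has_next)
```

This gives "next" and "previous" links without knowing the total number of facet values, which Solr does not report.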

54

i18n
• CJK
• SmartChineseAnalyzer
• German
• DictionaryCompoundWordTokenFilterFactory
• To watch:
• http://code.google.com/p/lucene-hunspell/
55

Testing

• Automate
• Relevancy
• Performance
• Solr log analysis: zero results queries, slow
queries

56

Questions
• One subject that's of some interest to me is paging through facets. It drives me a little crazy that Solr
lets you page through facets, yet it won't give you a total count of how many facets you are paging
through, which makes presenting a fully functional paging mechanism rather problematic. I've heard
that Bobo-browse may be helpful here but haven't dug into it too deeply. Maybe this is too narrow a
topic to be worth spending much time on, but if anybody has any thoughts or solutions, I'd love to
discuss them

• What if we wanted to implement a traditional browse with Solr? Like a call number browse to
simulate shelf browsing? Is there a way to leverage Solr for something like that? I'd think the trie
structure would make this possible, but how it could be exposed in that manner is a mystery.

• that inner query/nested query stuff that Naomi is using for advanced search would be one thing I'd
add to the list. Continues to confuse me every time I look at it.

• Another idea, approaches for figuring out how much RAM solr needs, and how big to make the
various Solr query caches. I know it depends on a lot and is different for every index, but I don't even
know how to get started figuring out what it should be for my index. Not sure if this makes sense as
an issue or not, just an idea.

57

Questions
• We're currently using 1.3, so the biggest changes/improvements in 1.4 would
be good.

• I'm also interested in fulltext indexing. We have some documents
(newspapers and dissertations) that are quite large (hundreds of MB of
plaintext). Is there a good rule-of-thumb for how much text we should
index? How large is too large? Is uncorrected OCR'd text worth indexing?

• The other topic I'm particularly interested in is update performance. Most
of our data is currently batch-loaded and batch-indexed, but we are moving
to interactive editing for some of our data, with the expectation that the
solr index be kept updated in realtime (or near-realtime). Should we use a
separate server (or core) to keep the updates from impacting read-only
performance? Do we need to optimize the index (this can take 20+
seconds for our main index) frequently?

58

Questions

• One other thing: we're using the web
service interface which seems fast and
reliable. Is the SolrJ interface significantly
faster or better?
• DidYouMean/Spellcheck

59

Questions
Does it make sense to use fixtures or
fixture scenarios like Rails? Does it make
sense to set up a separate 'testing' core
that can be dynamically dumped and rebuilt
through the APIs by your test suite?

60

Questions
1. What methods and tools can be used to determine
whether configuration or physical resource changes
might improve performance. E.g. increasing filter cache,
adding more memory, going to 64 bit architecture,
adding another disk drive to the array, etc.

2. Best procedures to make these configuration changes.
E.g. These two parameters work in conjunction with
each other, change this one then that one, this one
should be set to X percent of your physical memory,
don't touch this one unless you really know what you
are doing, etc.

61

Questions
- Scaling issues: millions of records, trying to keep data
reasonably current
- Distributed search
- Considerations for non-Roman data mixed in with Roman
data? We have CJK data, Cyrillic, Hebrew, Arabic. Is
there a sensible way to set up the analyzers?
- Any considerations for merging heterogeneous data (MARC,
OAI-DC, EAD, web spidering) that may be particular to Solr?
(I don't expect so, it's all going into one schema, but
maybe you've run into something.)

62

Questions
Indexing strategies:

* Performance tuning or configuring Solr for indexing (as opposed to a copy of Solr a search app
runs on). Which config options make a difference? What JVM options matter?
* Merging a 'build' copy of an index into a search app's copy. Is this the replication piece?
* Using multiple threads when writing to Solr. Using StreamingUpdateSolrServer effectively/safely.

Advanced features on retrieval side:

* Info about facets: can Solr retrieve the global count number for a facet in addition to the count
number within a filtered search result set? Only with 2 queries?
* Doing Google-like autosuggest against facet values for subject terms (not like facet.prefix method
in the Solr 1.4 book). Best to use a multicore setup and have an index or two dedicated to
autosuggestions?

Multiple index design:

As my colleague Eric put it: big generalized index + N extreme indexes = Righteous Discovery
Platform || High Folly?

This is a question we are dealing with. As librarians and researchers learn what we are doing on our
campus a lot of people are offering up data. Some of which is *highly* specialized. For example,
metadata based on a microscopy data standard. We expect that these researchers would like us to
create an expert search tool with advanced features tailored to their data model

63

Questions
Getting a better understanding of Solr memory use would be very helpful for us. (Or perhaps tools
and tips for understanding Solr memory use)
Right now we can watch the tomcat/Solr jvm with Jconsole and see heap use suddenly increasing and
decreasing, but we don't understand why, so our main technique is to wait until we get an
OutOfMemoryError and then increase the memory we give to the Solr/Tomcat JVM. (That and continuing
to buy more memory:)

The dismax/edismax and how folks are using them to tweak relevance ranking (based on MARC fields) is
also of great interest.

A couple of topics that may or may not be of interest to other folks and may or may not be
appropriate for the workshop. The context of these is that we are trying to understand scalability
and performance issues with very large indexes (300GB x 10) and multiple shards (5 million full-text
docs and growing.)

1) I'd like to get a bit of a better understanding of how filter queries are implemented. (and how
that relates to faceting)

2) I'd like to get a better understanding of how distributed search is implemented. In particular,
I'd like to understand the traffic that goes between the head shard and the shards it distributes
the query to. For example in the tomcat logs we can see traffic with the isShard=true and
ids="abc","def" parameters.

64

Questions

• Call number -> shelf key
• Reverse sorting fields
• termsComponent queries
• Terms -> documents
• Can we apply facets?

65

e-book now available!
print coming soon
http://www.manning.com/lucene

67

LucidWorks for Solr
• Certified Distribution

• Value-added integration

• KStemmer

• Carrot2 clustering

• LucidGaze for Solr

• installer

• Reference Manual

• Solr 1.4++ certified

68

LucidGaze for Solr

• Monitoring tool, captures, stores, and
interactively views Solr performance
metrics
• requests/second
• time/request

69

LucidFind

http://search.lucidimagination.com/?q=code4lib

71

http://www.flickr.com/photos/mikeoliveri/2036797884/

73
