These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=cc1e6803-b0ec-4832-b8df-e15ea7bd7694
2. Overview
● why Lucene/Solr?
● what are Lucene and Solr?
● how to use Lucene and Solr
○ setup
○ indexing
○ searching
● resources
● demo
● questions/answers
3. How to Make Your Data Searchable
● pay someone to do it
● use some solution someone else has written
● write some solution yourself
4. How to Search - One Approach
for each document d {
if (query is a substring of d's content) {
add d to the list of results
}
}
sort the result (or not)
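A minimal Python sketch of the loop above, assuming the documents are plain strings held in a list. The function name `naive_search` and the sample strings are illustrative only.

```python
def naive_search(documents, query):
    """Return the IDs of documents whose content contains the query as a substring."""
    results = []
    for doc_id, content in enumerate(documents):
        # Every search reads every document in full.
        if query in content:
            results.append(doc_id)
    return results


documents = ["it is what it is", "what is it", "it is a banana"]
print(naive_search(documents, "what"))
```

The cost of each call grows with the total size of the dataset, which is the problem the next slide describes.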
5. How to Search - Problems
● slow
○ reads the whole dataset for each search
● not scalable
○ if your dataset grows by 10x,
your search slows down by 10x
● how to show the most relevant documents
first?
○ list of results can be quite long
○ users have limited time and patience
6. Inverted Index - Introduction
● like the "index" at the end of books
● a map of one of the following types
○ term → document list
○ term → <document, position> list
7. documents:
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
inverted index (without positions):
"a":      {2}
"banana": {2}
"is":     {0, 1, 2}
"it":     {0, 1, 2}
"what":   {0, 1}
inverted index (with positions):
"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
"what":   {(0, 2), (1, 0)}
Credit: Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
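The two index types can be built in a few lines of Python. This is a sketch that assumes whitespace tokenization and in-memory dictionaries; the function names are illustrative, and a real index such as Lucene's is stored very differently.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def build_positional_index(documents):
    """Map each term to a list of (document ID, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, term in enumerate(text.split()):
            index[term].append((doc_id, position))
    return dict(index)


documents = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(documents)
positional = build_positional_index(documents)
print(index["what"])      # [0, 1]
print(positional["is"])   # [(0, 1), (0, 4), (1, 1), (2, 1)]
```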
8. Inverted Index - Speed
● term list
○ typically very small
○ grows slowly
● term lookup
○ O(1) to O(log(number of terms))
● for a particular term
○ document lists: very small
○ document + position lists: still small
● few terms per query
9. Inverted Index - Relevance
● information in the index enables:
○ determination (scoring) of relevance of each
document to the query
○ comparison of relevance among documents
○ sorting by (decreasing) relevance
■ i.e. the most relevant document first
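One way to picture scoring and sorting by relevance is a simplified TF-IDF ranking. The sketch below is an assumption made for illustration: it is not Lucene's actual scoring formula, and the function name `rank` is hypothetical.

```python
import math
from collections import Counter


def rank(documents, query):
    """Return (document ID, score) pairs, most relevant first, using a toy TF-IDF score."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for doc_id, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        # Rare terms (low document frequency) contribute more than common ones.
        score = sum(
            tf[term] * math.log(1 + n_docs / df[term])
            for term in query.split()
            if term in tf
        )
        if score > 0:
            scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


documents = ["it is what it is", "what is it", "it is a banana"]
for doc_id, score in rank(documents, "banana what"):
    print(doc_id, round(score, 3))
```

Here the document containing the rarer term "banana" ranks above the two documents that contain only "what".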
10. Lucene vs. Solr - Lucene
● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls
methods in the Lucene API
● provides building blocks for custom features
11. Lucene vs. Solr - Solr
● full-text search platform
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene
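12. Lucene:
[diagram: on a machine running a Java VM, your application embeds the Lucene code and libraries and works with the index]
Solr:
[diagram: on a machine running a Java VM, a servlet container (e.g. Tomcat, Jetty) hosts Solr (Solr code, Lucene code, libraries), which works with the index; the client talks to Solr over HTTP]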
14. Workflow - Setup
● servlet configuration
○ e.g. port number, max POST size
○ you can usually use the default settings
● Solr configuration
○ e.g. data directory, deduplication, language
identification, highlighting
○ you can usually use the default settings
● schema definition
○ defines fields in your documents
○ you can use the default settings if you name your
fields in a certain way
15. How Data Are Organized
● collection: contains documents
○ document: contains fields
■ field
19. Solr Field Definition
● field
○ name (e.g. "subject")
○ type (e.g. "text_general")
○ options (e.g. indexed="true" stored="true")
● field type
○ text: "string", "text_general"
○ numeric: "int", "long", "float", "double"
● options
○ indexed: content can be searched
○ stored: content can be returned at search-time
○ multivalued: multiple values per field & document
20. Solr Dynamic Field
● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued
name    type          indexed  stored  multiValued
*_i     int           true     true    false
*_l     long          true     true    false
*_f     float         true     true    false
*_d     double        true     true    false
*_s     string        true     true    false
*_ss    string        true     true    true
*_t     text_general  true     true    false
*_txt   text_general  true     true    true
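The naming convention can be modeled as pattern matching on the field name. The Python sketch below only illustrates the idea using the patterns listed on this slide. It is not how Solr implements dynamic fields, and `resolve_dynamic_field` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# (pattern, type, indexed, stored, multiValued), as listed on the slide.
DYNAMIC_FIELDS = [
    ("*_i", "int", True, True, False),
    ("*_l", "long", True, True, False),
    ("*_f", "float", True, True, False),
    ("*_d", "double", True, True, False),
    ("*_s", "string", True, True, False),
    ("*_ss", "string", True, True, True),
    ("*_t", "text_general", True, True, False),
    ("*_txt", "text_general", True, True, True),
]


def resolve_dynamic_field(name):
    """Return the settings of the first pattern that matches the field name, or None."""
    for pattern, field_type, indexed, stored, multi_valued in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return {
                "type": field_type,
                "indexed": indexed,
                "stored": stored,
                "multiValued": multi_valued,
            }
    return None


print(resolve_dynamic_field("amount_i"))
print(resolve_dynamic_field("tag_ss"))
```

With these suffix patterns a name matches at most one entry, so taking the first match is enough for the sketch.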
21. Solr Copy Field
● copy one or more fields into another field
● can be used to define a catch-all field
○ source: "title", "author", "description"
○ destination: "text"
○ searching the "text" field has the effect of searching
all the other three fields
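The effect of a catch-all field can be imitated in plain Python. This is a toy model with made-up sample values and hypothetical helper names. In Solr the copying is declared in the schema and happens at index time.

```python
def apply_copy_field(document, sources, destination):
    """Return a copy of the document with the source fields' values collected in destination."""
    result = dict(document)
    result[destination] = [document[name] for name in sources if name in document]
    return result


def field_contains(document, field, word):
    """Toy search: does any value of the field contain the word?"""
    values = document.get(field, [])
    if isinstance(values, str):
        values = [values]
    return any(word in value.lower().split() for value in values)


doc = {
    "title": "Sample Title",
    "author": "Sample Author",
    "description": "a sample description",
}
indexed = apply_copy_field(doc, ["title", "author", "description"], "text")
print(field_contains(indexed, "text", "author"))   # found via the catch-all field
print(field_contains(indexed, "title", "author"))  # not found in the title alone
```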
24. Indexing - DataImportHandler
● has its own config file (data-config.xml)
● import data from various sources
○ RDBMS (JDBC)
○ e-mail (IMAP)
○ XML data locally (file) or remotely (HTTP)
● transformers
○ extract data (RegEx, XPath)
○ manipulate data (strip HTML tags)
25. Indexing - ExtractingRequestHandler
● allows indexing of different formats
○ e.g. PDF, MS Word, XML
● uses Apache Tika to extract text and
metadata
○ Tika: a framework for different file format parsers
(e.g. PDFBox for PDF, Apache POI for MS Word)
● maps extracted text to the “content” field
● maps metadata (e.g. MIME type) to different
fields
26. Searching - Basics
● send request to http://host:port/solr/search
● parameters
○ q - main query
○ fq - filter query
○ defType - query parser (e.g. lucene, edismax)
○ fl - fields to return
○ sort - sort criteria
○ wt - response writer (e.g. xml, json)
○ indent - set to true for pretty-printing
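27. Searching - Example Request
http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json
● http://localhost:8983/solr/select - search handler's URL
● q=title:tablet - main query
● fl=title,price,inStock - fields to return
● sort=price+asc - sort criteria (field name plus direction)
● wt=json - response writer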
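A request URL like this can be assembled with Python's standard library, which also takes care of URL encoding. The helper name `build_select_url` is hypothetical, and the sort value spells out the direction (`price asc`), following the `fieldName (asc|desc)` syntax.

```python
from urllib.parse import urlencode


def build_select_url(base_url, **params):
    """Append URL-encoded query parameters to the search handler's URL."""
    return base_url + "?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr/select",
    q="title:tablet",             # main query
    fl="title,price,inStock",     # fields to return
    sort="price asc",             # sort criteria
    wt="json",                    # response writer
)
print(url)
```

The sketch only builds the URL. Sending it (for example with `urllib.request.urlopen`) requires a running Solr instance at that address.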
28. Searching - Query Syntax - Field
● search a specific field
○ field_name:value
● if field omitted, Solr uses default field:
○ df parameter in URL
○ defaultSearchField setting in schema.xml
○ "text"
29. Searching - Query Syntax - Term
● a term by itself: matches documents that
contain that term
○ e.g. tablet
30. Searching - Query Syntax - Boolean
● “conventional” boolean operators supported
○ AND &&
○ OR ||
○ NOT !
● e.g. a AND b
○ both a and b must occur
● e.g. a OR b
○ at least one of a, b must occur
● e.g. a AND NOT b
○ a must occur and b must not occur
31. Searching - Query Syntax - Boolean
● Lucene/Solr's boolean operators are not true
boolean operators
● e.g. a OR b OR c does not behave like
(a OR b) OR c
○ instead, a OR b OR c means at least one of a, b, c
must occur
● parentheses are supported
32. Searching - Query Syntax - Boolean
● "+" prefix means "must"
● "-" prefix means "must not"
● no prefix means "at least one must"
(by default)
○ e.g. a b c
■ at least one of a, b, c must occur
● operators can mix
○ e.g. +a b c d -e
■ a must occur
■ b, c, d are optional; because a is required, they
only raise the score of documents that contain them
■ e must not occur
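The prefix rules can be sketched as a small matching function. It assumes Lucene's default `BooleanQuery` behavior: unprefixed (optional) clauses have to match only when the query has no "+" clause; next to a "+" clause they influence the score but not whether a document matches. The function and parameter names are illustrative, and scoring is left out.

```python
def matches(doc_terms, must=(), should=(), must_not=()):
    """Decide whether a document (given as a list of terms) matches the clauses."""
    terms = set(doc_terms)
    # "+" clauses: every one of them must occur.
    if not all(term in terms for term in must):
        return False
    # "-" clauses: none of them may occur.
    if any(term in terms for term in must_not):
        return False
    # Unprefixed clauses: at least one must occur, but only if there is no "+" clause.
    if should and not must:
        return any(term in terms for term in should)
    return True


# a b c
print(matches(["c", "x"], should=["a", "b", "c"]))
# +a b c d -e
print(matches(["a", "b"], must=["a"], should=["b", "c", "d"], must_not=["e"]))
print(matches(["a", "e"], must=["a"], should=["b", "c", "d"], must_not=["e"]))
```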
33. Searching - Query Syntax - Phrase
● phrases are enclosed by double-quotes
● e.g. +"the phrase"
○ the phrase must occur
● e.g. -"the phrase"
○ the phrase must not occur
34. Searching - Query Syntax - Boost
● manually assign different weights to clauses
● gives more weight to a field
○ e.g. title:a^10 body:a
● gives more weight to a word
○ e.g. title:a title:b^10
● gives phrases more weight than words
○ e.g. title:(+a +b) title:"a b"^10
35. Searching - Query Syntax - Range
● matches field values within a range
○ inclusive range - denoted by square brackets
○ exclusive range - denoted by curly brackets
● e.g. age:[10 TO 20]
○ matches the field "age" with the value in 10..20
● string or numeric comparison, depending on
the field's type
● open-ended range supported
● e.g. age:[10 TO *]
○ matches the field "age" with the value 10 or larger
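The range rules can be expressed as a small predicate. In this sketch `None` stands in for `*` (an open end), and the two flags correspond to square versus curly brackets. The function name is hypothetical.

```python
def in_range(value, lower=None, upper=None, include_lower=True, include_upper=True):
    """Check a value against a range; None as a bound means open-ended ("*")."""
    if lower is not None:
        if value < lower or (value == lower and not include_lower):
            return False
    if upper is not None:
        if value > upper or (value == upper and not include_upper):
            return False
    return True


print(in_range(10, 10, 20))                        # like age:[10 TO 20]
print(in_range(10, 10, 20, include_lower=False))   # like age:{10 TO 20]
print(in_range(99, 10, None))                      # like age:[10 TO *]
# The comparison depends on the type: as strings, "100" sorts between "10" and "20".
print(in_range("100", "10", "20"), in_range(100, 10, 20))
```

The last line shows why the field's type matters: the same range gives different answers under string and numeric comparison.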
36. Searching - Query Syntax - EDisMax
● suitable for user-generated queries
○ does not complain about the syntax
○ searches for individual words across several fields
("disjunction")
○ uses max score of a word in all fields for scoring
("max")
● configurable (in solrconfig.xml)
○ what fields to search the words in
○ boosting of these fields
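The "disjunction" and "max" ideas can be shown with a toy scorer. It assumes a made-up per-field score (boost times the number of occurrences of the word) in place of Lucene's real scoring; the function name and sample document are illustrative. The `tie` argument mirrors the idea of the parser's tie-breaker: a value above 0 lets the non-best fields contribute a little.

```python
def dismax_score(document, query, field_boosts, tie=0.0):
    """Sum, over the query words, the best per-field score of each word."""
    total = 0.0
    for word in query.split():
        # Score the word in every configured field ("disjunction").
        per_field = [
            boost * document.get(field, "").split().count(word)
            for field, boost in field_boosts.items()
        ]
        # Keep the best field's score ("max"), plus an optional share of the rest.
        best = max(per_field)
        total += best + tie * (sum(per_field) - best)
    return total


doc = {"title": "solr tablet", "body": "tablet tablet review"}
boosts = {"title": 10.0, "body": 1.0}
print(dismax_score(doc, "tablet review", boosts))            # 11.0
print(dismax_score(doc, "tablet review", boosts, tie=0.5))   # 12.0
```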
37. Sorting
● default: sorting by decreasing score
● custom sorting rules: use the sort parameter
○ syntax: fieldName (asc|desc)
○ e.g. sort by ascending price (i.e. lowest price first):
price asc
○ e.g. sort by descending date (i.e. newest date first):
date desc
38. Sorting
● special field names
○ use score for score and _docid_ for document ID
○ e.g. sort by ascending score:
score asc
○ e.g. sort by descending document ID:
_docid_ desc
39. Sorting
● multiple fields and orders: separate by
commas
○ e.g. sort by descending starRating and ascending
price:
starRating desc, price asc
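What a multi-field sort specification means can be shown on an in-memory list. This sketch assumes each clause has the form `fieldName asc` or `fieldName desc` and that every document has every sort field. The helper name and sample documents are made up.

```python
def sort_documents(documents, sort_spec):
    """Sort documents by a specification such as "starRating desc, price asc"."""
    clauses = []
    for clause in sort_spec.split(","):
        field, direction = clause.split()
        clauses.append((field, direction == "desc"))
    result = list(documents)
    # Sort by the least significant field first; Python's sort is stable,
    # so earlier orderings survive among documents that tie on later keys.
    for field, descending in reversed(clauses):
        result.sort(key=lambda doc: doc[field], reverse=descending)
    return result


docs = [
    {"id": 1, "starRating": 4, "price": 30},
    {"id": 2, "starRating": 5, "price": 50},
    {"id": 3, "starRating": 5, "price": 20},
]
ordered = sort_documents(docs, "starRating desc, price asc")
print([doc["id"] for doc in ordered])   # [3, 2, 1]
```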
40. Sorting
● cannot use multivalued fields
● overrides the default sorting behavior
41. Faceted Search
● facet values: (distinct) values or (generally non-overlapping) ranges of a field
● displaying facets
○ show possible values
○ let users narrow down their searches easily
43. Faceted Search
● set facet parameter to true - enables
faceting
● other parameters
○ facet.field - use the field's values as facets
■ return <value, count> pairs
○ facet.query - use the given queries as facets
■ return <query, count> pairs
○ facet.sort - set the ordering of the facets;
■ can be "count" or "index"
○ facet.offset and facet.limit - used for
pagination of facets
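The <value, count> pairs returned for a field facet can be imitated over an in-memory list. This is a sketch only: the keyword arguments echo the facet.sort, facet.offset and facet.limit parameters, but the function, its default values, and the sample data are assumptions, not Solr's implementation.

```python
from collections import Counter


def facet_field(documents, field, sort="count", offset=0, limit=100):
    """Return (value, count) pairs for a field across the given documents."""
    counts = Counter()
    for doc in documents:
        value = doc.get(field)
        # Treat a single value like a one-element list so multivalued fields also work.
        values = value if isinstance(value, list) else [value]
        counts.update(v for v in values if v is not None)
    if sort == "count":
        # Highest count first; ties broken by value.
        pairs = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    else:
        # "index": order by the value itself.
        pairs = sorted(counts.items())
    return pairs[offset:offset + limit]


docs = [{"brand": "zeta"}, {"brand": "zeta"}, {"brand": "acme"}, {"name": "no brand"}]
print(facet_field(docs, "brand"))                     # [('zeta', 2), ('acme', 1)]
print(facet_field(docs, "brand", sort="index"))       # [('acme', 1), ('zeta', 2)]
print(facet_field(docs, "brand", offset=1, limit=1))  # [('acme', 1)]
```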
44. Resources - Books
● Lucene in Action
○ written by 3 committers and PMC members
○ somewhat outdated (2010; covers Lucene 3.0)
○ http://www.manning.com/hatcher3/
● Solr in Action
○ early access; coming out later this year
○ http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
○ common problems and useful tips
○ http://www.packtpub.com/apache-solr-4cookbook/book
45. Resources - Books
● Introduction to Information Retrieval
○ not specific to Lucene/Solr, but about IR concepts
○ free e-book
○ http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
○ indexing, compression and other topics
○ accompanied by MG4J - a full-text search software
○ http://mg4j.di.unimi.it/
46. Resources - Web
● official websites
○ Lucene Core - http://lucene.apache.org/core/
○ Solr - http://lucene.apache.org/solr/
● mailing lists
● Wiki sites
○ Lucene Core - http://wiki.apache.org/lucene-java/
○ Solr - http://wiki.apache.org/solr/
● reference guides
○ API Documentation for Lucene and Solr
○ Apache Solr Reference Guide
47. Getting Started
● download Solr
○ requires Java 6 or newer to run
● Solr comes bundled/configured with Jetty
○ <Solr directory>/example/start.jar
● "exampledocs" directory contains sample
documents
○ <Solr directory>/example/exampledocs/post.jar
○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
○ http://localhost:8983/solr/