These are the slides for the session I presented at SoCal Code Camp San Diego on June 24, 2012.
http://www.socalcodecamp.com/session.aspx?sid=f9e83f56-3c56-4aa1-9cff-154c6537ccbe
2. How to Search
• One (common) approach to searching all your
documents:
for each document d {
    if (query is a substring of d’s content) {
        add d to the list of results
    }
}
sort the results
3. How to Search
• Problems
– Slow: Reads the whole database for each search
– Not scalable: If your database grows by 10x, your
search slows down by 10x
– How to show the most relevant documents first?
4. Inverted Index
• (term -> document list) map
Documents:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
Inverted index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
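A minimal sketch of building this (term -> document list) map in plain Java, without Lucene. The class and method names are made up for illustration, and terms are simply split on whitespace, which is far cruder than a real analyzer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
final class InvertedIndexSketch {
    /** Maps each whitespace-separated term to the sorted IDs of the documents containing it. */
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).split("\\s+")) {
                // A set keeps each document ID once, however often the term occurs.
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        build(docs).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

Looking up a term is then a single map lookup, instead of a scan over every document's content.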
5. Inverted Index
• (term -> <document, position> list) map
T0 = "it is what it is"
      0  1  2    3  4
T1 = "what is it"
      0    1  2
T2 = "it is a banana"
      0  1  2 3
6. Inverted Index
• (term -> <document, position> list) map
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
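The positional variant can be sketched the same way. This is an illustration in plain Java, not Lucene's on-disk format; the class name, the `int[] {docId, position}` pair representation, and the whitespace tokenization are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene.
final class PositionalIndexSketch {
    /** Maps each term to its {docId, position} pairs, in document order. */
    static Map<String, List<int[]>> build(List<String> docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                // Positions start at zero within each document.
                index.computeIfAbsent(terms[pos], k -> new ArrayList<>()).add(new int[] {docId, pos});
            }
        }
        return index;
    }

    /** Formats postings the way the slide does, e.g. {(0, 2), (1, 0)}. */
    static String format(List<int[]> postings) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < postings.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append('(').append(postings.get(i)[0]).append(", ").append(postings.get(i)[1]).append(')');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        build(docs).forEach((term, postings) -> System.out.println(term + ": " + format(postings)));
    }
}
```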
7. Inverted Index
• Speed
– Term list
• Very small compared to documents’ content
• Tends to grow at a slower speed than documents
(after a certain level)
– Term lookup: O(1) to O(log of number of terms)
– Document lists are very small
– Document + position lists still small
8. Inverted Index
• Relevance
– Extra information in the index
• Stored in an easily accessible way
• Determine relevance of each document to the query
– Enables sorting by (decreasing) relevance
9. Determining Relevancy
• Two models used in the searching process
– Boolean model
• AND, OR, NOT, etc.
• Either a document matches a query, or not
– Vector space model
• How often a query term appears in a document vs.
how often the term appears in all documents
• Scoring and sorting by relevancy possible
10. Determining Relevancy
Lucene uses both models
all documents
filtering (Boolean Model)
some documents
(unsorted)
scoring (Vector Space Model)
some documents
(sorted by score)
12. Scoring
• Term frequency (TF)
– How many times does this term (t) appear in this
document (d)?
– Score proportional to TF
• Document frequency (DF)
– How many documents have this term (t)?
– Score proportional to the inverse of DF (IDF)
13. Scoring
• Coordination factor (coord)
– Documents that contain all or most query terms
get higher scores
• Normalizing factor (norm)
– Adjust for field length and query complexity
14. Scoring
• Boost
– “Manual override”: ask Lucene to give a higher
score to some particular thing
– Index-time
• Document
• Field (of a particular document)
– Search-time
• Query
15. Scoring
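$$\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \right)$$
• coord(q, d): coordination factor
• queryNorm(q): query normalizing factor
• tf(t in d): term frequency
• idf(t): inverse document frequency
• boost(t): term boost
• norm(t, d): document boost, field boost, length normalizing factor
http://lucene.apache.org/core/3_6_0/scoring.html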
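To make the formula concrete, here is a simplified sketch in plain Java. It leaves out all boosts and queryNorm (treated as 1), and the component formulas (square-root tf, logarithmic idf, inverse-square-root length norm) are assumptions modeled on Lucene's default similarity as I understand it; the scoring page linked above has the authoritative definitions. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; a simplified stand-in, not Lucene's Similarity.
final class ScoringSketch {
    // Assumed component formulas; see the lead-in.
    static double tf(int freq) { return Math.sqrt(freq); }
    static double idf(int docFreq, int numDocs) { return 1.0 + Math.log((double) numDocs / (docFreq + 1)); }
    static double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }

    /** Counts how many times a term occurs in a whitespace-tokenized document. */
    static int count(String doc, String term) {
        int n = 0;
        for (String t : doc.split("\\s+")) if (t.equals(term)) n++;
        return n;
    }

    /** coord * sum over matching query terms of tf * idf^2 * lengthNorm (no boosts, queryNorm = 1). */
    static double score(List<String> queryTerms, String doc, List<String> allDocs) {
        int matched = 0;
        double sum = 0.0;
        int docLength = doc.split("\\s+").length;
        for (String term : queryTerms) {
            int freq = count(doc, term);
            if (freq == 0) continue;
            matched++;
            int docFreq = 0;
            for (String other : allDocs) if (count(other, term) > 0) docFreq++;
            double termIdf = idf(docFreq, allDocs.size());
            sum += tf(freq) * termIdf * termIdf * lengthNorm(docLength);
        }
        // Coordination factor: fraction of query terms found in this document.
        double coord = (double) matched / queryTerms.size();
        return coord * sum;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        List<String> query = Arrays.asList("it", "banana");
        for (String doc : docs) System.out.println(score(query, doc, docs) + "  " + doc);
    }
}
```

With the query terms "it" and "banana", the third document comes out on top: it contains both terms, and "banana" is rare across the collection.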
16. Work Flow
• Indexing
– Index: storage of inverted index + documents
– Add fields to a document
– Add the document to the index
– Repeat for every document
• Searching
– Generate a query
– Search with this query
– Get back a sorted document list (top N docs)
17. Adding Field to Document
• Store?
• Index?
– Analyzed (split text into multiple terms)
– Not analyzed (treat the whole text as ONE term)
– Not indexed (this field will not be searchable)
– Store norms?
18. Analyzed vs. Not Analyzed
Text: “the quick brown fox”
Analyzed: 4 terms
1. the
2. quick
3. brown
4. fox
Not analyzed: 1 term
1. the quick brown fox
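The difference can be sketched in a few lines of plain Java. The whitespace split here is only a stand-in for a real analyzer, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class AnalysisSketch {
    /** Analyzed: the text is split into multiple terms (here, naively on whitespace). */
    static List<String> analyzed(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Not analyzed: the whole text is kept as ONE term. */
    static List<String> notAnalyzed(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        System.out.println(analyzed("the quick brown fox"));
        System.out.println(notAnalyzed("the quick brown fox"));
    }
}
```

A not-analyzed field only matches a query for the exact whole value, which suits identifiers and sort keys better than free text.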
19. Index-time Analysis
• Analyzer
– Determine which TokenStream classes to use
• TokenStream
– Does the actual hard work
– Tokenizer: text to tokens
– Token filter: tokens to tokens
23. Attributes
• Past versions of Lucene: Token object
• Recent versions of Lucene: attributes
– Efficiency, flexibility
– Ask for attributes you want
– Receive attribute objects
– Use these objects for information about tokens
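24.
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Create the token stream.
TokenStream tokenStream = analyzer.reusableTokenStream(fieldName, reader);
tokenStream.reset();

// Obtain each attribute you want to know about.
CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
PositionIncrementAttribute posInc = tokenStream.addAttribute(PositionIncrementAttribute.class);

// Go to the next token; incrementToken() returns false when none are left.
while (tokenStream.incrementToken()) {
    // Use information about the current token.
    doSomething(term.toString(),
                offset.startOffset(),
                offset.endOffset(),
                posInc.getPositionIncrement());
}

// Close the token stream.
tokenStream.end();
tokenStream.close();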
25. Query-time Analysis
• Text in a query is analyzed like fields
• Use the same analyzer that analyzed the
particular field
+field1:"quick brown fox" +(field2:"lazy dog" field2:"cozy cat")
Text analyzed per clause: field1: quick brown fox | field2: lazy dog | field2: cozy cat
26. Query Formation
• Query parsing
– A query parser in core code
– Additional query parsers in contributed code
• Or build query from the Lucene query classes
28. Term Range Query
• Matches documents with any of the terms in a
particular range
– Field
– Lowest term text
– Highest term text
– Include lowest term text?
– Include highest term text?
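Because the term list is sorted, a range of terms is one contiguous slice of it. The sketch below illustrates this with a `NavigableMap` standing in for the term dictionary; it is plain Java with hypothetical names, not the actual TermRangeQuery implementation.

```java
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
final class TermRangeSketch {
    /** Returns the IDs of documents containing any term in the given range. */
    static SortedSet<Integer> match(NavigableMap<String, SortedSet<Integer>> index,
                                    String lowest, boolean includeLowest,
                                    String highest, boolean includeHighest) {
        SortedSet<Integer> result = new TreeSet<>();
        // subMap selects the slice of the sorted term list that falls in the range.
        for (SortedSet<Integer> ids : index.subMap(lowest, includeLowest, highest, includeHighest).values()) {
            result.addAll(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableMap<String, SortedSet<Integer>> index = new TreeMap<>();
        index.put("a", new TreeSet<>(Arrays.asList(2)));
        index.put("banana", new TreeSet<>(Arrays.asList(2)));
        index.put("what", new TreeSet<>(Arrays.asList(0, 1)));
        System.out.println(match(index, "a", true, "banana", true));
    }
}
```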
29. Prefix Query
• Matches documents with any of the terms
with a particular prefix
– Field
– Prefix
30. Wildcard/Regex Query
• Matches documents with any of the terms
that match a particular pattern
– Field
– Pattern
• Wildcard: * for 0+ characters, ? for exactly 1 character
• Regular expression
• Pattern matching on individual terms only
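A sketch of term-level wildcard matching, done by translating the pattern into a `java.util.regex` expression. It assumes `*` stands for zero or more characters and `?` for exactly one; the class is hypothetical and says nothing about how Lucene implements WildcardQuery internally.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
final class WildcardSketch {
    /** Translates a wildcard pattern into a regex; every other character is matched literally. */
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");      // zero or more characters
            else if (c == '?') sb.append('.');  // exactly one character (assumption, see above)
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    /** The pattern must match one whole term, never a phrase spanning several terms. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("ban*", "banana"));
        System.out.println(matches("b?t", "bat"));
    }
}
```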
31. Fuzzy Query
• Matches documents with any of the terms
that are “similar” to a particular term
– Levenshtein distance (“edit distance”):
Number of character insertions, deletions or
substitutions needed to transform one string into
another
• e.g. kitten -> sitten -> sittin -> sitting (3 edits)
– Field
– Text
– Minimum similarity score
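The edit distance itself can be computed with the classic dynamic-programming algorithm. This is a textbook sketch in plain Java with a hypothetical class name, not Lucene's FuzzyQuery code.

```java
// Hypothetical illustration class; not part of Lucene.
final class LevenshteinSketch {
    /** Number of single-character insertions, deletions or substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```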
32. Phrase Query
• Matches documents with all the given words
present and being “near” each other
– Field
– Terms
– Slop
• Number of “moves of words” permitted
• Slop = 0 means exact phrase match required
33. Boolean Query
• Conceptually similar to boolean operators
(“AND”, “OR”, “NOT”), but not identical
• Why Not AND, OR, And NOT?
– http://www.lucidimagination.com/blog/2011/12/
28/why-not-and-or-and-not/
– In short, boolean operators do not handle > 2
clauses well
34. Boolean Query
• Three types of clauses
– Must
– Should
– Must not
• For a boolean query to match a document
– All “must” clauses must match
– All “must not” clauses must not match
– At least one “must” or “should” clause must
match
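The three matching rules above can be sketched as follows. For simplicity every clause is a single term here, whereas real clauses can be arbitrary queries; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
final class BooleanQuerySketch {
    /** Applies the must / should / must-not rules to one document's set of terms. */
    static boolean matches(Set<String> docTerms, Set<String> must, Set<String> should, Set<String> mustNot) {
        for (String t : mustNot) if (docTerms.contains(t)) return false; // no "must not" clause may match
        for (String t : must) if (!docTerms.contains(t)) return false;   // every "must" clause has to match
        if (!must.isEmpty()) return true;                                // at least one "must" clause matched
        for (String t : should) if (docTerms.contains(t)) return true;   // otherwise one "should" clause has to match
        return false;                                                    // only "must not" clauses: nothing matches
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("it", "is", "a", "banana"));
        Set<String> none = Collections.emptySet();
        System.out.println(matches(doc, Collections.singleton("it"), none, Collections.singleton("what")));
    }
}
```

Note the last case: a query made only of "must not" clauses matches nothing, because no "must" or "should" clause matched.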
35. Span Query
• Similar to other queries, but matches spans
• Span
– particular place/part of a particular document
– <document ID, start position, end position> tuple
36.
T0 = "it is what it is"
      0  1  2    3  4
T1 = "what is it"
      0    1  2
T2 = "it is a banana"
      0  1  2 3
<doc ID, start pos., end pos.>
"it is": <0, 0, 2>, <0, 3, 5>, <2, 0, 2>
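A sketch of how such span tuples could be computed for an exact sequence of terms, in plain Java with hypothetical names. The end position is exclusive, that is, one plus the position of the last term in the span.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class SpanSketch {
    /** Returns {docId, start, end} for every exact occurrence of the phrase; end is exclusive. */
    static List<int[]> exactSpans(List<String> docs, String phrase) {
        String[] wanted = phrase.split("\\s+");
        List<int[]> spans = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int start = 0; start + wanted.length <= terms.length; start++) {
                // Compare the window of terms starting at this position with the phrase.
                if (Arrays.equals(Arrays.copyOfRange(terms, start, start + wanted.length), wanted)) {
                    spans.add(new int[] {docId, start, start + wanted.length});
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        for (int[] span : exactSpans(docs, "it is")) System.out.println(Arrays.toString(span));
    }
}
```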
37. Span Query
• SpanTermQuery
– Same as TermQuery, except you can build other
span queries with it
• SpanOrQuery
– Matches spans that are matched by any of some
span queries
• SpanNotQuery
– Matches spans that are matched by one span
query but not the other span query
38. [Diagram: spans matched by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]), and spanNot(apple, orange) in text containing "apple" and "orange"]
39. Span Query
• SpanNearQuery
– Matches spans that are within a certain “slop” of
each other
– Slop: max number of positions between spans
– Can specify whether order matters
41. Filtering
• A Filter narrows down the search result
– Creates a set of document IDs
– Decides what documents get processed further
– Does not affect scoring, i.e. does not score/rank
documents that pass the filter
– Can be cached easily
– Useful for access control, presets, etc.
42. Notable Filter classes
• TermsFilter
– Allows documents with any of the given terms
• TermRangeFilter
– Filter version of TermRangeQuery
• PrefixFilter
– Filter version of PrefixQuery
• QueryWrapperFilter
– “Adapts” a query into a filter
• CachingWrapperFilter
– Cache the result of the wrapped filter
43. Sorting
• Score (default)
• Index order
• Field
– Requires the field be indexed & not analyzed
– Specify type (string, int, etc.)
– Normal or reverse order
– Single or multiple fields
44. Interfacing Lucene with “Outside”
• Embedding directly
• Language bridge
– E.g. PHP/Java Bridge
• Web service
– E.g. Jetty + your own request handler
• Solr
– Lucene + Jetty + lots of useful functionality
45. Books
• Lucene in Action, 2nd Edition
– Written by 3 committers and PMC members
– http://www.manning.com/hatcher3/
• Introduction to Information Retrieval
– Not specific to Lucene, but about IR concepts
– Free e-book
– http://nlp.stanford.edu/IR-book/
47. Getting Started
• Getting started
– Download lucene-3.6.0.zip (or .tgz)
– Add lucene-core-3.6.0.jar to your classpath
– Consider using an IDE (e.g. Eclipse)
– Luke (Lucene Index Toolbox)
http://code.google.com/p/luke/
I bet this is exactly how many systems are handling search right now. Many systems probably do not think about how to sort the results either; they just hand the result list back to the user without considering what should go first.
Imagine the slowdown if your website goes from "nobody besides our employees and friends uses it" to being "the next Facebook". People lose interest in your application easily if the first few things your search results present do not look like what they are trying to find.
Expand the inverted index we just saw. Positions start at zero.
There are only so many words that people commonly use. You can hash the terms, organize them as a prefix tree, sort them and use binary search, and so on. For the purpose of deciding which documents match, you only need to store document IDs (integers).
Extra info: determine how good a match a document is for a query. Put the best matches near the top of the search result list.
The highest-scored (most relevant) document is the first in the result list.
In VSM, documents and queries are represented as vectors in an n-dimensional space, where n is the total number of unique terms in the document collection, and each dimension corresponds to a separate term. A vector's value in a particular dimension is not zero if the document or the query contains that term. Document vector closer to query vector = document more relevant to the query.
The term might be a common word that appears everywhere.
Storing the field means that the original text is stored in the index, so you can retrieve it at search time. Indexing the field means that the field is made searchable.
Token = term, at index time, with start/end position information, and not tied to a document already in the index.
WhitespaceAnalyzer: whitespace as separators; punctuation is part of the tokens. StopAnalyzer: non-letters as separators; makes everything lowercase; removes common stop words like "the". StandardAnalyzer: sophisticated rules to handle punctuation, hyphens, etc.; recognizes (and avoids breaking up) e-mail addresses and internet hostnames.
Character folding: turns an "a" with an accent mark above it into an "a" without the accent mark. Stemming: the words "consistent" and "consistency" have the same stem, which is "consist". Synonyms: like "country" and "nation". Shingles: "the quick", "quick brown", "brown fox"; useful for searching text in Asian languages like Chinese and Japanese; reduces the number of unique terms in an index and reduces overhead.
Offsets: character offsets of this token from the beginning of the field's text. Position increment: position of this token relative to the previous token; usually 1.
This query has three field clauses, so you analyze three pieces of text and get back three sets of tokens. A good practice is to use the same analyzer that analyzed the particular field that you are searching.
Examples of ranges: January 1st to December 31st of 2012 (inclusive); 1 to 10 (excluding 10).
Your pattern describes a term, not a document, so you cannot put a phrase or a sentence in a pattern and expect the query to match that phrase or sentence.
Minimum similarity score is based on the edit distance.
It takes two moves to swap two words in a phrase.
Lucene does not have the standard boolean operators.
Lucene has these instead (of the “standard” boolean operators).
End position is actually one plus the position of the last term in the span.
This "slop" is different from the "slop" in Phrase Query.
Total number of positions between spans = 2 + 1 + 0 = 3. The first two queries match this document because their slops are at least 3. The third query does not match because its slop is less than 3. The fourth query does not match because, even though its slop is large enough, the query requires all the spans to be in the given order, and the spans in this document are not. The fifth query matches because the given order matches the order of the spans in the document.
CachingWrapperFilter is good for filters that don’t change a lot, e.g. access restrictions.
Index order = order in which docs are added to the index. Indexed and not analyzed = whole field as one token/term.
Embedding directly: good when the rest of your application is also in Java. In most use cases, you would be dealing with Solr rather than Lucene directly. But you would still be indirectly using Lucene, and you can still benefit from understanding many of the things discussed in this session.
Eclipse has many useful features such as setting up the classpath and compiling your code for you.
It shows you what your index looks like and what fields and terms it has. You can look at individual documents, run queries, try out different analyzers.