Numeric Range Queries in Lucene and Solr

kirilchukvadim@gmail.com

Agenda:
● What is RangeQuery
● Which field type to use for Numerics
● Range stuff under the hood (run!)
● NumericRangeQuery
● Useful links

Range Queries:
A range query is a type of query that matches
all documents where some value is between an
upper and lower boundary:
Give me:
● Jeans with price from 200 to 300$
● Car with length from 5 to 10m
● ...

Range Queries:
In Solr, a range query is as simple as:
q = field:[100 TO 200]
We will talk about Numeric Range Queries
but you can use range queries for text too:
q = field:[A TO Z]

Agenda:
● What is RangeQuery
● Which field type to use for Numerics
● Range stuff under the hood (run!)
● NumericRangeQuery
● Useful links (relax)

Which field type?
Which field type to use for “range” fields (let’s
stick with int) in schema?
● solr.IntField
● or maybe solr.SortableIntField
● or maybe solr.TrieIntField

Which field type?
Let’s assume we have:
● 11 documents, id: 1,2,3,..11
● each doc has a single-valued “int” price field
● each document’s id is the same as its price
● q = *:*
"numFound": 11,
"docs": [
{
"id": 1, "price_field": 1
},
{
"id": 2, "price_field": 2
},
...
{
"id": 11, "price_field": 11
}
]

Which field type - solr.IntField
q = price_field:[1 TO 10]
"numFound": 2,
"start": 0,
"docs": [
{
"price_field": 1
},
{
"price_field": 10
}
]

IntField stores and indexes the text value verbatim,
so it does not support range queries correctly:
lexicographic ordering is not the same as
numeric ordering.
Lexicographic order: [1, 10], 11, 2, 3, 4, 5, 6, 7, 8, 9
Interesting, but “sort by” works fine:
the comparator is clever enough to know that the
values are ints!
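
The effect is easy to reproduce outside Solr. Below is a minimal sketch, assuming values are compared as plain strings; the class and method names are made up for illustration and are not Solr code:

```java
import java.util.ArrayList;
import java.util.List;

class LexicographicRangeDemo {

    // Returns the values 1..maxValue that fall inside [lower, upper]
    // when both the values and the bounds are compared as strings.
    static List<Integer> matchAsStrings(int lower, int upper, int maxValue) {
        String lo = Integer.toString(lower);
        String hi = Integer.toString(upper);
        List<Integer> hits = new ArrayList<>();
        for (int v = 1; v <= maxValue; v++) {
            String s = Integer.toString(v);
            if (s.compareTo(lo) >= 0 && s.compareTo(hi) <= 0) {
                hits.add(v);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Same data as the slides: prices 1..11, range [1 TO 10].
        System.out.println(matchAsStrings(1, 10, 11)); // prints [1, 10]
    }
}
```

Only "1" and "10" sort between the string bounds "1" and "10", which is exactly the numFound of 2 above.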

Which field type - solr.SortableIntField
● q = price_field:[1 TO 10]
○ "numFound": 10

● “Sortable”, in fact, refers to encoding the
numbers so that they sort in the correct
order. It’s not about “sort by”!
● Values are still processed and compared as strings,
using a tricky string encoding:
NumberUtils.int2sortableStr(...)
● Deprecated and will be removed in 5.X
● What should I use then?
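
To see why an order-preserving string encoding fixes the range query, here is a minimal sketch. It is NOT the actual NumberUtils.int2sortableStr encoding; it is a simpler stand-in (flip the sign bit, then write fixed-width hex) with the same property, and all names are hypothetical:

```java
class SortableEncodingSketch {

    // Flipping the sign bit makes negative ints order before positive ones
    // when the bits are read as unsigned; fixed-width hex keeps that order
    // under plain string comparison.
    static String encode(int value) {
        return String.format("%08x", value ^ 0x80000000);
    }

    public static void main(String[] args) {
        System.out.println(encode(2));  // 80000002
        System.out.println(encode(10)); // 8000000a
        // As strings, encode(2) now sorts before encode(10).
        System.out.println(encode(2).compareTo(encode(10)) < 0); // true
    }
}
```

With such an encoding, a string range over the encoded bounds matches the same documents as the numeric range.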

Which field type - solr.TrieIntField
● q = price_field:[1 TO 10]
○ "numFound": 10

● Recommended as replacement for IntField
and SortableIntField in javadoc
● Default for primitive fields in reference
schema
● Said to be fast for range queries (actually
depends on precision step)
● Tricky and, btw wtf is precision step?

Under the hood - Index
NumericTokenStream is where half of the magic
happens!
● precision step = 1
● value = 11
00000000 00000000 00000000 00001011
● Let’s see how it will be indexed!


Field with precisionStep=1

shift   bits       value
0       00001011   11
1       00001010   10 = 5 << 1
2       00001000    8 = 2 << 2
3       00001000    8 = 1 << 3
4       00000000    0 = 0 << 4
5       00000000    0 = 0 << 5
continue…
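
The table can be reproduced with a few lines of bit shifting. This is a minimal sketch of the idea only: as far as I know, the real NumericTokenStream also encodes the shift into each term and flips the sign bit, both of which are left out here. Names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

class TriePrefixTermsSketch {

    // For each shift, drop the lowest 'shift' bits of the value.
    // (value >>> shift) << shift is the value column from the table above.
    static List<Integer> prefixTerms(int value, int precisionStep) {
        List<Integer> terms = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add((value >>> shift) << shift);
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Integer> terms = prefixTerms(11, 1);
        System.out.println(terms.size());        // 32
        System.out.println(terms.subList(0, 6)); // [11, 10, 8, 8, 0, 0]
    }
}
```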

How much for an integer?
11111111 11111111 11111111 11111111

The algorithm indexes 32/precisionStep terms
per value.
So, for “11” we have 11, 10, 8, 8, 0, 0, 0, 0, 0, …, 0

Okay! We indexed 32 tokens for the field.
(TermDictionary! Postings!) Where is the trick?

Stay tuned!

Under the hood - Query
Subclasses of FieldType can override
#getRangeQuery(...) to provide their own range
query implementation.
If they don’t, you will most likely get:
MultiTermQuery rangeQuery = TermRangeQuery.newStringRange(...)
TrieField overrides it. And here comes...

Numeric Range Query (Decimal)
● Decimal example, precisionStep = ten
● q = price:[423 TO 642]
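
The decimal case is the easiest one to check by hand. The sketch below assumes one possible decomposition of [423 TO 642] into digit-prefix terms (423..429, 43x..49x, 5xx, 60x..63x, 640..642) and verifies that these 22 terms cover exactly the 220 values of the range. The decomposition and all names are my own illustration, not taken from Lucene:

```java
import java.util.ArrayList;
import java.util.List;

class DecimalPrefixCoverageSketch {

    // A term is {prefix, droppedDigits}: {5, 2} stands for 500..599,
    // {43, 1} for 430..439, {423, 0} for the single value 423.
    static List<int[]> terms() {
        List<int[]> t = new ArrayList<>();
        for (int p = 423; p <= 429; p++) t.add(new int[] {p, 0});
        for (int p = 43; p <= 49; p++) t.add(new int[] {p, 1});
        t.add(new int[] {5, 2});
        for (int p = 60; p <= 63; p++) t.add(new int[] {p, 1});
        for (int p = 640; p <= 642; p++) t.add(new int[] {p, 0});
        return t;
    }

    // A value matches a term if dropping the same number of low digits
    // from the value yields the term's prefix.
    static boolean matches(int value, List<int[]> terms) {
        for (int[] term : terms) {
            int scale = 1;
            for (int i = 0; i < term[1]; i++) scale *= 10;
            if (value / scale == term[0]) return true;
        }
        return false;
    }

    // True if the terms match every value in [lower, upper] and nothing else in 0..999.
    static boolean coversExactly(int lower, int upper, List<int[]> terms) {
        for (int v = 0; v <= 999; v++) {
            boolean expected = v >= lower && v <= upper;
            if (matches(v, terms) != expected) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(terms().size());                    // 22
        System.out.println(coversExactly(423, 642, terms()));  // true
    }
}
```

The binary version on the next slides does the same thing with bits instead of decimal digits.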

Numeric Range Query (Binary)
● precisionStep = 1
● q = price:[3 TO 12]

[Figure: binary trie over the indexed terms. The leaf level (shift = 0) holds the values 0..13, the shift = 1 level holds nodes 0..6, the shift = 2 level holds nodes 0..3, and so on up the tree. The original slides repeat this tree several times for q = price:[3 TO 12].]
Numeric Range Query (How?)

So, the question is:
How do we create a query for this algorithm?

Let’s come back to TrieField#getRangeQuery(...)
There are several options:
● field is multiValued, has docValues, and is not indexed
○ super#getRangeQuery
● field has docValues and is not indexed
○ new ConstantScoreQuery(FieldCacheRangeFilter.newIntRange(...))
● otherwise, ta-da:
○ NumericRangeQuery.newIntRange(...)

NumericRangeQuery extends MultiTermQuery
which is:
An abstract Query that matches documents
containing a subset of terms provided by a
FilteredTermsEnum enumeration.
This query cannot be used directly (it is abstract); you
must subclass it and define getTermsEnum(Terms,
AttributeSource) to provide a FilteredTermsEnum
that iterates through the terms to be matched.

Let’s understand how #getTermsEnum works.
Returns new NumericRangeTermsEnum(...)
The main part is: NumericUtils.splitIntRange(...)

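The algorithm relies heavily on bit masks:
for (int shift = 0; ; shift += precisionStep) {
  diff = 1L << (shift + precisionStep);
  mask = ((1L << precisionStep) - 1L) << shift;
  // ... split the range at this level, break when done
}
[Figure: leaves 0, 1, 2, 3 under the shift = 1 nodes 0 and 1; diff = 2]

diff is the distance between neighbouring nodes on
the next (upper) level.
mask checks on which side of its parent the current
bound sits: a minBound of 1 or 3 has a lower sibling
(hasLower), a maxBound of 0 or 2 has an upper
sibling (hasUpper).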

hasLower = (minBound & mask) != 0L;
hasUpper = (maxBound & mask) != mask;
if (hasLower)
  addRange(builder, valSize, minBound, minBound | mask, shift);
if (hasUpper)
  addRange(builder, valSize, maxBound & ~mask, maxBound, shift);

hasLower = (minBound & mask) != 0L;
hasUpper = (maxBound & mask) != mask;
nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask;
nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;

// If we are in the lowest precision or the next precision is not available:
addRange(builder, valSize, minBound, maxBound, shift);
// exit the split recursion loop (FOR)
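
Put together, the fragments above form one loop. The following is a self-contained sketch modelled on them, not a copy of NumericUtils.splitIntRange: it works on plain non-negative values (no sign-bit flip), returns prefixes instead of calling a builder, and the wrap checks and all names are my own assumptions:

```java
import java.util.ArrayList;
import java.util.List;

class SplitRangeSketch {

    // Each result is {shift, minPrefix, maxPrefix}: the prefixes
    // (value >>> shift) that have to be matched at that trie level.
    static List<long[]> split(long minBound, long maxBound, int precisionStep, int valSize) {
        List<long[]> ranges = new ArrayList<>();
        if (minBound > maxBound) return ranges;
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            boolean hasLower = (minBound & mask) != 0L;
            boolean hasUpper = (maxBound & mask) != mask;
            long nextMin = (hasLower ? (minBound + diff) : minBound) & ~mask;
            long nextMax = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
            boolean wrapped = nextMin < minBound || nextMax > maxBound;
            if (shift + precisionStep >= valSize || nextMin > nextMax || wrapped) {
                // Lowest precision reached, or nothing is left for the next level.
                ranges.add(new long[] {shift, minBound >>> shift, maxBound >>> shift});
                break;
            }
            if (hasLower) {
                ranges.add(new long[] {shift, minBound >>> shift, (minBound | mask) >>> shift});
            }
            if (hasUpper) {
                ranges.add(new long[] {shift, (maxBound & ~mask) >>> shift, maxBound >>> shift});
            }
            minBound = nextMin;
            maxBound = nextMax;
        }
        return ranges;
    }

    // Human-readable form of the split for 32-bit values.
    static List<String> describe(long min, long max, int precisionStep) {
        List<String> out = new ArrayList<>();
        for (long[] r : split(min, max, precisionStep, 32)) {
            out.add("shift=" + r[0] + ": " + r[1] + ".." + r[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        // q = price:[3 TO 12] with precisionStep = 1
        System.out.println(describe(3, 12, 1));
        // prints [shift=0: 3..3, shift=0: 12..12, shift=2: 1..2]
    }
}
```

The three ranges are leaf 3, leaf 12, and nodes 1..2 on the shift = 2 level, which together cover 4..11.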

● shift = 0
● diff = 0b00000010 = 2
● mask = 0b00000001 = 1
● hasLower = ((3 & 1) != 0) = true
● hasUpper = ((12 & 1) != 1) = true
○ addRange 3..(3 | 1) = 3..3
○ addRange (12 & ~1)..12 = 12..12
● nextMin = (3 + 2) & ~1 = 4
● nextMax = (12 - 2) & ~1 = 10
[Figure: the trie with the leaf level 0..13 in focus; leaves 3 and 12 are the ranges added at shift = 0]

● min: 4; max: 10
● shift = 1
● diff = 0b00000100 = 4
● mask = 0b00000010 = 2
● hasLower = ((4 & 2) != 0) = false
● hasUpper = ((10 & 2) != 2) = false
● nextMin = 4 & ~2 = 4
● nextMax = 10 & ~2 = 8
[Figure: the same trie with the shift = 1 level in focus; no range is added at this level]

● min: 4; max: 8
● shift = 2
● diff = 0b00001000 = 8
● mask = 0b00000100 = 4
● hasLower = ((4 & 4) != 0) = true
● hasUpper = ((8 & 4) != 4) = true
● nextMin = (4 + 8) & ~4 = 8
● nextMax = (8 - 8) & ~4 = 0
● nextMin > nextMax => END, add range 1..2 at shift = 2
[Figure: the same trie with the shift = 2 level in focus; nodes 1 and 2 are the range added at shift = 2]

TestNumericUtils#testSplitIntRange
assertIntRangeSplit(lower, upper, precisionStep, useBitSet, expectBounds, shifts)
assertIntRangeSplit(3, 12, 1, true,
    Arrays.asList(
        -2147483645, -2147483645, // 3, 3
        -2147483636, -2147483636, // 12, 12
        536870913, 536870914),    // 1, 2 for shift == 2
    Arrays.asList(0, 0, 2)
); // Crappy unsigned int conversions are done in the asserts
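
The odd-looking expected bounds are consistent with the sign bit of each int being flipped (value ^ 0x80000000) before the shift is applied. The following sketch checks that arithmetic; the helper names are hypothetical and the sign-flip interpretation is my assumption about what the asserts convert:

```java
class SignFlipSketch {

    // Flip the sign bit so that signed ints keep their order as unsigned bit patterns.
    static int sortableBits(int value) {
        return value ^ 0x80000000;
    }

    // Prefix of the sign-flipped value after dropping 'shift' low bits.
    static int prefixAtShift(int value, int shift) {
        return sortableBits(value) >>> shift;
    }

    public static void main(String[] args) {
        System.out.println(sortableBits(3));      // -2147483645
        System.out.println(sortableBits(12));     // -2147483636
        System.out.println(prefixAtShift(4, 2));  // 536870913, node 1 at shift 2
        System.out.println(prefixAtShift(8, 2));  // 536870914, node 2 at shift 2
    }
}
```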

So, NumericRangeTermsEnum generates and remembers
all ranges to match.

Basically, a TermsEnum is an iterator to seek or step
through terms in some order.
In our case the order is:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Then (shift = 1):
0 1 2 3 4 5 6
Then (shift = 2):
0 1 2 3 ...

Actually we have a FilteredTermsEnum:

1. Only the terms inside the generated ranges (red in the original figure) are accepted by our enumerator
2. If a term is not accepted, we advance:
FilteredTermsEnum#nextSeekTerm(currentTerm)
TermsEnum#seekCeil(termToSeek)
The seek term depends on currentTerm and the
generated ranges.
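
The accept-or-seek behaviour can be imitated with a sorted set. This is only a sketch under my own assumptions: terms are modelled as longs with the shift in the high bits (so all shift = 0 terms sort first, then shift = 1, and so on), TreeSet#ceiling stands in for seekCeil and TreeSet#higher for next(). None of this is Lucene code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class FilteredSeekSketch {

    // Hypothetical term key: shift in the high bits, prefix in the low bits.
    static long key(int shift, long prefix) {
        return ((long) shift << 32) | prefix;
    }

    // Index every prefix term of the values 0..maxValue (precisionStep = 1).
    static TreeSet<Long> buildIndex(int maxValue) {
        TreeSet<Long> terms = new TreeSet<>();
        for (int v = 0; v <= maxValue; v++) {
            for (int shift = 0; shift < 32; shift++) {
                terms.add(key(shift, v >>> shift));
            }
        }
        return terms;
    }

    // For each range {lowerKey, upperKey}: seek to the first term >= lower,
    // accept terms while they are <= upper, then jump to the next range.
    static List<Long> collect(TreeSet<Long> index, List<long[]> ranges) {
        List<Long> accepted = new ArrayList<>();
        for (long[] range : ranges) {
            Long term = index.ceiling(range[0]);   // plays the role of seekCeil
            while (term != null && term <= range[1]) {
                accepted.add(term);
                term = index.higher(term);         // plays the role of next()
            }
        }
        return accepted;
    }

    // Ranges for q = price:[3 TO 12]: leaf 3, leaf 12, nodes 1..2 at shift 2.
    static List<Long> demo() {
        List<long[]> ranges = new ArrayList<>();
        ranges.add(new long[] {key(0, 3), key(0, 3)});
        ranges.add(new long[] {key(0, 12), key(0, 12)});
        ranges.add(new long[] {key(2, 1), key(2, 2)});
        return collect(buildIndex(13), ranges);
    }

    public static void main(String[] args) {
        System.out.println(demo().size()); // 4 terms visited instead of 10 leaf terms
    }
}
```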

OK, now we have a TermsEnum for MultiTermQuery,
and the enum seeks through only those terms
that fall into the generated sub-ranges.
The question is how to convert a TermsEnum into a
Query!?

The last trick is the Query#rewrite() method of
MultiTermQuery (rewrite is always called on a query
before the search is performed):
public final Query rewrite(IndexReader reader) {
return rewriteMethod.rewrite(reader, this);
}

Oh, “rewriteMethod” how interesting… It defines how
the query is rewritten.

There are plenty of different rewrite methods, but
the most interesting for us are:
● CONSTANT_SCORE_*
○ BOOLEAN_QUERY_REWRITE
○ FILTER_REWRITE
○ AUTO_REWRITE_DEFAULT

BOOLEAN_QUERY_REWRITE

1. Collect terms (TermCollector) by using #getTermsEnum(...)
2. For each term, create a TermQuery
3. Return a BooleanQuery with all the TermQuery instances as leaves

FILTER_REWRITE

1. Get a termsEnum by using #getTermsEnum(...)
2. Create a FixedBitSet
3. Get a DocsEnum for each term
4. Iterate over the docs and call bitSet.set(docid)
5. Return a ConstantScoreQuery over the filter (bitSet)
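
The five steps can be sketched without Lucene. Here java.util.BitSet stands in for FixedBitSet, a map from term to doc ids stands in for the postings, and the term labels are invented; treat every name as hypothetical:

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FilterRewriteSketch {

    // Set one bit per matching document: walk the matching terms,
    // read each term's postings, and mark every doc id.
    static BitSet collectDocs(Map<String, int[]> postings, List<String> matchingTerms, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (String term : matchingTerms) {
            int[] docs = postings.get(term);
            if (docs == null) continue; // term is not in the index
            for (int docId : docs) {
                bits.set(docId);
            }
        }
        return bits;
    }

    // The running example: 11 docs with id == price, q = price:[3 TO 12].
    static BitSet demo() {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("shift0:3", new int[] {3});
        postings.put("shift2:1", new int[] {4, 5, 6, 7});
        postings.put("shift2:2", new int[] {8, 9, 10, 11});
        // "shift0:12" is part of the query but no document has price 12.
        List<String> terms = Arrays.asList("shift0:3", "shift0:12", "shift2:1", "shift2:2");
        return collectDocs(postings, terms, 12);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {3, 4, 5, 6, 7, 8, 9, 10, 11}
    }
}
```

Three postings lists are read to mark nine documents, which is the whole point of indexing the lower-precision terms.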

AUTO_REWRITE_DEFAULT
If the number of documents to be visited in the
postings exceeds some percentage of the maxDoc()
for the index, then FILTER_REWRITE is used;
otherwise BOOLEAN_QUERY_REWRITE is used.

Agenda:
● ..
● I promised. Precision Step!
● ...

Precision step
So, what is precision step and how does it affect
performance?
● Defines how many terms to index for each value
○ Lower step values mean more precisions and
consequently more terms in the index
○ indexedTermsPerValue = bitsPerVal / pStep
○ Lower-precision terms are not unique, so the term
dictionary doesn’t grow much; however, the
postings file does
Precision step
So, what is precision step and how does it affect
performance?
● ...
○ A smaller precision step means fewer
terms to match, which speeds up queries
○ But more terms to seek in the index
○ You can index with a lower precision step value
and test search speed using a multiple of the
original step value.
○ The ideal step can only be found by testing
Precision step (Results)
According to NumericRangeQuery javadoc:
● Opteron64 machine, Java 1.5, 8 bit precision step
● 500k docs index
● TermRangeQuery in BooleanRewriteMode took
about 30-40 seconds
● TermRangeQuery in FilterRewriteMode took
about 5 seconds
● NumericRangeQuery took < 100ms

Useful links
● http://searchhub.org/2009/05/13/exploringlucene-and-solrs-trierange-capabilities/
● http://www.panfmp.org/
● http://epic.awi.de/17813/1/Sch2007br.pdf
● http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/search/NumericRangeQuery.html
● http://en.wikipedia.org/wiki/Range_tree
● me: http://plus.google.com/+VadimKirilchuk
