Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
- 1. 1
© Cloudera, Inc. All rights reserved.
Real-Time Analytics with Solr
Yonik Seeley
10/15/2015
- 2. 2
My Background
• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford
- 4. 4
Search and Hadoop
• Search is a key component of many big data problems
• Many analytics use cases start with search
• Adding analytics to full-text search has proven to be more effective than vice versa
• External integrations are challenging for "real-time" (i.e. interactive) results
- 5. 5
Solr in Hadoop
• Top Hadoop vendors who have integrated search have all chosen Apache Solr
  • For example: Cloudera, Hortonworks, MapR, IBM, ...
• Historical focus on interactive response times
• Historical focus on faceted search / guided navigation
• High performance indexes
  • originally for "full-text" search, but just as great for meta-data!
- 6. 6
Inverted Index
Documents:
  0 : Little Red Riding Hood
  1 : Robin Hood
  2 : Little Women
Term -> documents:
  aardvark -> (none shown)
  hood     -> 0, 1
  little   -> 0, 2
  red      -> 0
  riding   -> 0
  robin    -> 1
  women    -> 2
  zoo      -> (none shown)
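As a rough illustration of the structure on this slide, the following Python sketch builds a term-to-document mapping from the three example titles. It is a minimal in-memory model, not how Lucene stores its index; the function name `build_inverted_index` and the whitespace tokenization are assumptions made for the example.

```python
from collections import defaultdict

# The three example documents from the slide, keyed by document id.
docs = {
    0: "Little Red Riding Hood",
    1: "Robin Hood",
    2: "Little Women",
}

def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Keep terms in sorted order, similar to a term dictionary.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)
print(index["hood"])    # doc ids containing "hood"
print(index["little"])  # doc ids containing "little"
```

Looking up a term is then a single dictionary access, which is the property that makes search over the index fast.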
- 7. 7
Columnar Storage (DocValues)
Stored Fields (row oriented):    a1 b1 c1 | a2 b2 c2 | a3 b3 c3 | ...
DocValues (column oriented):     a1 a2 a3 a4 | b1 b2 b3 b4 | c1 c2 c3 c4
• Fast linear scan
  • Read only the data you need
• Fast random access
  • docid -> value(s)
• High degree of locality
• Compressed
  • prefix, delta, table, gcd, etc.
• Mostly "Off-Heap"
  • Memory mapped from index
• Row vs Column configurable per field!
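To make the row versus column distinction concrete, here is a small Python sketch that pivots row-oriented records into per-field columns. The field names `a`, `b`, `c` follow the slide's labels; the numeric values and the helper `to_columns` are invented for illustration and say nothing about Lucene's actual on-disk encoding.

```python
# Hypothetical documents with three fields each (a, b, c), as on the slide.
rows = [
    {"a": 1, "b": 10, "c": 100},
    {"a": 2, "b": 20, "c": 200},
    {"a": 3, "b": 30, "c": 300},
    {"a": 4, "b": 40, "c": 400},
]

def to_columns(row_store):
    """Pivot row-oriented records into one list per field, indexed by doc id."""
    columns = {}
    for doc_id, row in enumerate(row_store):
        for field, value in row.items():
            columns.setdefault(field, [None] * len(row_store))[doc_id] = value
    return columns

columns = to_columns(rows)

# Aggregating one field touches only that field's column (a linear scan).
total_b = sum(columns["b"])

# Random access by doc id: docid -> value.
value_c_for_doc2 = columns["c"][2]
```

The aggregation never reads fields `a` or `c`, which is the "read only the data you need" point from the slide.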
- 8. 8
Multi-Segment Index
  _0.fnm  _0.fdt  _0.fdx  [...]  _0_1.del
  _1.fnm  _1.fdt  _1.fdx  […]
  segments_3
• Each segment is a self-contained "index"
• Segments are never changed once written
• Per-segment caching is very effective
• Point-in-time searcher
  • getting a new view means writing & including an additional segment
  • turns a weakness into a strength
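The point-in-time behaviour can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Segment` and `Index` classes and their methods are hypothetical names for this example, and real segments hold index structures rather than raw strings.

```python
class Segment:
    """An immutable batch of documents; never modified after creation."""

    def __init__(self, docs):
        self._docs = tuple(docs)

    def search(self, term):
        return [d for d in self._docs if term in d.lower().split()]


class Index:
    def __init__(self):
        self._segments = ()  # tuple of Segment objects

    def add_segment(self, docs):
        # Writing never touches old segments; it appends a new one.
        self._segments = self._segments + (Segment(docs),)

    def open_searcher(self):
        # A searcher captures the current segment list: a point-in-time view.
        segments = self._segments

        def search(term):
            return [hit for seg in segments for hit in seg.search(term)]

        return search


index = Index()
index.add_segment(["Little Red Riding Hood", "Robin Hood"])
old_searcher = index.open_searcher()
index.add_segment(["Little Women"])
new_searcher = index.open_searcher()
```

Because existing segments never change, anything cached for them stays valid for every later searcher; only the new segment needs fresh work.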
- 9. 9
Faceted Search
• Breaks search results into buckets
• Generally provides bucket counts
• Allows user to filter / "drill into" results
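The three points above can be mimicked in memory with a short Python sketch. The result documents and the `genre` field are illustrative assumptions, and the helpers `facet_counts` and `drill_into` are hypothetical names, not Solr APIs.

```python
from collections import Counter

# Hypothetical search results; the "genre" field is illustrative.
results = [
    {"title": "A Game of Thrones", "genre": "fantasy"},
    {"title": "Snow Crash", "genre": "sci-fi"},
    {"title": "Steel Heart", "genre": "fantasy"},
]

def facet_counts(docs, field):
    """Break results into buckets by field value and count each bucket."""
    return Counter(doc[field] for doc in docs)

def drill_into(docs, field, value):
    """Filter results to a single bucket, as a user clicking a facet would."""
    return [doc for doc in docs if doc[field] == value]

counts = facet_counts(results, "genre")
fantasy_only = drill_into(results, "genre", "fantasy")
```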
- 11. 11
Facet Module Goals
New Facet Module / JSON Facet API
• Integration
• Performance
• Ease of use
[Diagram labels: Faceting, Search, Statistics, Joins, Grouping, Field Collapsing, Highlighting, Nested Documents, Geosearch]
- 12. 12
Slice and Dice with Facet commands
• Domain: A set of documents
• Facet command: create sub-domains / "facet buckets"
[Diagram: a Domain fed into Facet Commands A, B, and C, each producing several sub-Domains]
- 13. 13
Facet Functions / Statistics
• Facet function calculates something over a domain
• Can sort domains by facet functions!
[Diagram: a Domain fed into Facet Commands A, B, and C; the resulting sub-Domains are annotated with facet functions such as sum(x), unique(y), min(units), avg(price)]
- 14. 14
Facet functions
• Calculate (and Sort) by things other than document count

Function    Example                         Description
sum         sum(sales)                      Summation of numeric values
avg         avg(popularity)                 Average of numeric values
sumsq       sumsq(rent)                     Sum of squares
min         min(salary)                     Minimum value
max         max(mul(popularity,boost))      Maximum value
unique      unique(state)                   Number of unique values (calc distinct)
hll         hll(state)                      Number of unique values using HyperLogLog algorithm
percentile  percentile(salary, 25, 50, 75)  Calculates percentiles via t-digest algorithm
topdocs     topdocs("another query",5)      (in progress) Returns the top documents for another query
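To show what several of these functions compute, here is a minimal Python sketch that evaluates them over a small in-memory domain. The documents are invented, and `unique` is computed exactly with a set; the approximate `hll` and `percentile` variants are deliberately left out because they rely on HyperLogLog and t-digest, which this sketch does not implement.

```python
import statistics

# Hypothetical domain: a set of documents with numeric and string fields.
domain = [
    {"sales": 10.0, "state": "NJ"},
    {"sales": 20.0, "state": "NY"},
    {"sales": 30.0, "state": "NJ"},
]

def values(docs, field):
    """Collect the values of one field across the domain."""
    return [d[field] for d in docs if field in d]

# Plain-Python stand-ins for a few of the facet functions in the table.
facet_functions = {
    "sum":    lambda docs, f: sum(values(docs, f)),
    "avg":    lambda docs, f: statistics.mean(values(docs, f)),
    "sumsq":  lambda docs, f: sum(v * v for v in values(docs, f)),
    "min":    lambda docs, f: min(values(docs, f)),
    "max":    lambda docs, f: max(values(docs, f)),
    "unique": lambda docs, f: len(set(values(docs, f))),
}

result = {
    "sum(sales)": facet_functions["sum"](domain, "sales"),
    "avg(sales)": facet_functions["avg"](domain, "sales"),
    "sumsq(sales)": facet_functions["sumsq"](domain, "sales"),
    "unique(state)": facet_functions["unique"](domain, "state"),
}
```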
- 15. 15
Simple request and response

  curl http://localhost:8983/solr/query -d '
  q=widgets&
  json.facet={
    x : "avg(price)",
    y : "unique(brand)"
  }
  '

  […]
  "facets" : {
    "count" : 314,
    "x" : 102.5,
    "y" : 28
  }

• The root domain is defined by the docs matching the query
• "count" is the count of docs in the bucket
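The same request can be assembled from Python. The sketch below only builds and round-trips the form-encoded body and reads the sample response shown on the slide; it does not contact a server. The helper name `build_facet_request` is hypothetical, and the commented-out send step assumes a Solr instance at the slide's URL.

```python
import json
from urllib.parse import urlencode, parse_qs

def build_facet_request(query, facets):
    """Return the form-encoded body for a query plus a json.facet parameter."""
    return urlencode({"q": query, "json.facet": json.dumps(facets)})

body = build_facet_request("widgets", {"x": "avg(price)", "y": "unique(brand)"})

# Round-trip the body to confirm what a server would receive.
decoded = parse_qs(body)
facets_sent = json.loads(decoded["json.facet"][0])

# Reading the sample response from the slide.
sample_response = {"facets": {"count": 314, "x": 102.5, "y": 28}}
avg_price = sample_response["facets"]["x"]

# To actually send it (assumes Solr is running at this URL):
# from urllib.request import urlopen
# response = json.load(urlopen("http://localhost:8983/solr/query", body.encode()))
```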
- 16. 16
All-JSON request example

  $ curl http://localhost:8983/solr/query -d '
  {
    query : "widgets",        // our JSON parser accepts comments (C-style too)
    filter : "inStock:true",  // bare strings can appear unquoted
    offset : 0,
    limit : 5,
    sort : "price desc",
    fields : ["id","name","price"],  /* could have also used "id,name,price" */
    facet : {
      x : "avg(price)",
      y : "unique(brand)"
    }
  }
  '
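A client would typically build this body from a native data structure rather than by hand. The Python sketch below constructs the same request as a dict and wraps it in a `urllib.request.Request` without sending it. Setting the `Content-Type` header to `application/json` is an assumption of this sketch, not something the slide specifies; note also that `json.dumps` emits strict JSON, so the relaxed syntax (comments, unquoted strings) is not exercised here.

```python
import json
from urllib.request import Request

# The request from the slide, expressed as a Python dict.
body = {
    "query": "widgets",
    "filter": "inStock:true",
    "offset": 0,
    "limit": 5,
    "sort": "price desc",
    "fields": ["id", "name", "price"],
    "facet": {
        "x": "avg(price)",
        "y": "unique(brand)",
    },
}

# Build (but do not send) the HTTP request.
request = Request(
    "http://localhost:8983/solr/query",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},  # assumed header
    method="POST",
)

# Decode the payload again to confirm what would go over the wire.
payload = json.loads(request.data)
```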
- 17. 17
Bucketing Facet Types
• Terms Facet
  • Creates new domains (facet buckets) based on values in a field
• Range Facet
  • Creates multiple buckets based on date ranges or numeric ranges
• Query Facet
  • Creates a single bucket of documents that match any given query
• Unlimited nesting: Any facet type may have any number of sub-facets
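The three bucketing types can be modelled in memory as follows. This is a sketch under stated assumptions: the documents, the function names, and the half-open `[lower, lower + gap)` range convention are choices made for this example rather than a description of Solr's implementation.

```python
from collections import defaultdict

# Hypothetical documents; field names are illustrative only.
docs = [
    {"id": 1, "shoe_style": "Hiking",  "price": 120},
    {"id": 2, "shoe_style": "Running", "price": 80},
    {"id": 3, "shoe_style": "Hiking",  "price": 150},
    {"id": 4, "shoe_style": "Dress",   "price": 210},
]

def terms_facet(domain, field):
    """One bucket per distinct value of `field`."""
    buckets = defaultdict(list)
    for doc in domain:
        buckets[doc[field]].append(doc)
    return dict(buckets)

def range_facet(domain, field, start, end, gap):
    """One bucket per [lower, lower + gap) interval between start and end."""
    buckets = {}
    lower = start
    while lower < end:
        upper = lower + gap
        buckets[lower] = [d for d in domain if lower <= d[field] < upper]
        lower = upper
    return buckets

def query_facet(domain, predicate):
    """A single bucket holding the documents that match `predicate`."""
    return [d for d in domain if predicate(d)]

by_style = terms_facet(docs, "shoe_style")
by_price = range_facet(docs, "price", 0, 300, 100)

# Nesting: a query facet applied inside one bucket of a terms facet.
pricey_hiking = query_facet(by_style["Hiking"], lambda d: d["price"] > 130)
```

Each function takes a domain and returns one or more smaller domains, which is why the types can be nested freely.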
- 18. 18
Terms facet example

  json.facet={
    shoes : {
      type : terms,
      field : shoe_style,
      sort : {x : desc},
      facet : {
        x : "avg(price)",
        y : "unique(brand)"
      }
    }
  }

  "facets": {
    "count" : 472,
    "shoes": {
      "buckets" : [
        {
          "val" : "Hiking",
          "count" : 34,
          "x" : 135.25,
          "y" : 17
        },
        {
          "val" : "Running",
          "count" : 45,
          "x" : 110.75,
          "y" : 24
        },

• "x" and "y" are calculated per-bucket
- 19. 19
Sub-facet example

  json.facet={
    shoes : {
      type : terms,
      field : shoe_style,
      sort : {x : desc},
      facet : {
        x : "avg(price)",
        y : "unique(brand)",
        colors : {
          type : terms,
          field : color
        }
      }
    }
  }

  "facets": {
    "count" : 472,
    "shoes": {
      "buckets" : [
        {
          "val" : "Hiking",
          "count" : 34,
          "x" : 135.25,
          "y" : 17,
          "colors" : {
            "buckets" : [
              { "val" : "brown", "count" : 12 },
              { "val" : "black", "count" : 10 },
              […]
            ]
          }  // end of colors sub-facet
        },  // end of Hiking bucket
        {
          "val" : "Running",
          "count" : 45,
          "x" : 110.75,
          "y" : 24,
          "colors" : {
            "buckets" : […]
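The nesting in this example amounts to a recursive evaluation: each bucket becomes a new domain on which the sub-facets are computed. The Python sketch below imitates that over three invented documents. The `evaluate` and `terms` helpers, the use of Python callables for facet functions, and the simplified `sort` option are all assumptions of this sketch, not the JSON Facet API itself.

```python
from collections import defaultdict

# Hypothetical catalog; values are made up for illustration.
docs = [
    {"shoe_style": "Hiking",  "color": "brown", "price": 140.0, "brand": "A"},
    {"shoe_style": "Hiking",  "color": "black", "price": 130.0, "brand": "B"},
    {"shoe_style": "Running", "color": "black", "price": 110.0, "brand": "A"},
]

def evaluate(domain, facets):
    """Evaluate a dict of facets over a domain (a list of documents)."""
    result = {"count": len(domain)}
    for name, spec in facets.items():
        if callable(spec):   # a facet function over the domain
            result[name] = spec(domain)
        else:                # a terms facet that creates sub-domains
            result[name] = {"buckets": terms(domain, spec)}
    return result

def terms(domain, spec):
    """Bucket by field value, recurse into sub-facets, then sort descending."""
    groups = defaultdict(list)
    for doc in domain:
        groups[doc[spec["field"]]].append(doc)
    buckets = []
    for val, sub_domain in groups.items():
        bucket = {"val": val}
        bucket.update(evaluate(sub_domain, spec.get("facet", {})))
        buckets.append(bucket)
    sort_key = spec.get("sort", "count")
    buckets.sort(key=lambda b: b[sort_key], reverse=True)
    return buckets

request = {
    "shoes": {
        "field": "shoe_style",
        "sort": "x",  # sort buckets by the facet function named x
        "facet": {
            "x": lambda d: sum(doc["price"] for doc in d) / len(d),  # avg(price)
            "y": lambda d: len({doc["brand"] for doc in d}),         # unique(brand)
            "colors": {"field": "color"},
        },
    }
}

response = evaluate(docs, request)
```

The response has the same shape as the slide's output: a root `count`, then `buckets` whose entries carry `val`, `count`, the computed functions, and nested `colors` buckets.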
- 21. 21
Fantasy ($1045)
  Top Authors
    $423  George R.R. Martin
    $347  Brandon Sanderson
    $155  JK Rowling
  Top Books
    $252  A Game of Thrones
    $113  Emperor of Thorns
    $101  Nine Princes in Amber
    $82   Steel Heart
Sci-Fi ($898)
  Top Authors
    $321  Iain M Banks
    $218  Neal Asher
    $155  Neal Stephenson
  Top Books
    $113  Gridlinked
    $101  Use of Weapons
    $93   Snow Crash
    $82   The Skinner
Mystery ($645)
  Top Authors
    $191  James Patterson
    $145  Patricia Cornwell
    $126  John Grisham
  Top Books
    $85   One for the Money
    $77   Angels & Daemons
    $64   Shutter Island
    $35   The Firm
Filter By
  State
    $852  NJ (14 stores)
    $658  NY (11 stores)
    $421  CT (8 stores)
  Chain
    $984  Amazoon (14 stores)
    $734  Houses&Royalty (9 stores)
    $387  Books-r-us (7 stores)
  Store
    $108  Amazoon Branchburg
    $93   Books-r-us Bridgewater
    $87   H&R NYC
Number of Books
  Chain
    201K  Houses&Royalty
    183K  Amazoon
    98K   Books-r-us
  Store
    193K  H&R NYC
    77K   Books-r-us Bridgewater
    68K   Amazoon Branchburg
- 22. 22
Implementation

  date_breakout : {
    type : range,
    field : sale_date,
    start : ...,
    end : ...,
    gap : "+1MONTH",
    facet : {
      top_genres : {
        type : terms,
        field : genre,
        sort : "revenue desc",
        limit : 4,
        facet : {
          revenue : "sum(sales)"
        }
      },
      by_chain : {
        type : terms,
        field : chain,
        facet : {
          revenue : "sum(sales)"
        }
      }
    }
  }

Facet structure:
  Range Facet (sale_date)
    Terms Facet (genre): sum(sales)
    Terms Facet (chain): sum(sales)
- 23. 23
Implementation (continued)

  top_genres : {
    type : terms,
    field : genre,
    facet : {
      rev : "sum(sales)",
      top_authors : {
        type : terms,
        field : author,
        sort : "rev desc",
        limit : 3,
        facet : {
          rev : "sum(sales)"
        }
      },
      top_books : {
        type : terms,
        field : title,
        sort : "rev desc",
        limit : 4,
        facet : {
          rev : "sum(sales)"
        }
      }
    }
  }

Facet structure:
  Terms Facet (genre): sum(sales)
    Terms Facet (author): sum(sales)
    Terms Facet (title): sum(sales)
- 25. 25
Tested Facet Request

Legacy (stats component & pivot facets):
  facet=true&stats=true
  &stats.field={!tag=stat1+mean=true}field2
  &facet.pivot={!stats=stat1}field1
  &f.field1.limit=10

JSON Facet API (New Facet Module):
  json.facet={
    f : {
      type : terms,
      field : field1,
      facet : {
        mean : "avg(field2)"
      }
    }
  }
- 29. 29
Indexing Nested Documents

  id : book1
  title : The Way of Kings
  author : Brandon Sanderson
      id : book1_review1
      review_author : Yonik
      stars : 5
      comment : A great start to what ...

      id : book1_review2
      review_author : Dan
      stars : 3
      comment : This book was too long

  id : book2
  title : Snow Crash
  author : Neal Stephenson
      id : book2_review1
      review_author : Yonik
      stars : 5
      comment : Ahead of its time ...

Lucene index view (flat):
  book1_review1, book1_review2, book1, book2_review1, book2

• Group indexed as a "block"
  • atomic
  • internal document ids contiguous
  • enables quick and inexpensive joins
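The flat layout and the reason it makes joins cheap can be illustrated with a small Python sketch. It assumes the layout shown on the slide, with children placed immediately before their parent. The `children` key, the helper names, and the forward linear scan are simplifications chosen for this example; they are not how Lucene locates parents internally.

```python
# Hypothetical nested input: each book carries its reviews as children.
books = [
    {"id": "book1", "type": "book",
     "children": [{"id": "book1_review1", "type": "review"},
                  {"id": "book1_review2", "type": "review"}]},
    {"id": "book2", "type": "book",
     "children": [{"id": "book2_review1", "type": "review"}]},
]

def flatten_blocks(nested):
    """Lay out each block as children first, then the parent, contiguously."""
    flat = []
    for parent in nested:
        flat.extend(parent["children"])
        flat.append({k: v for k, v in parent.items() if k != "children"})
    return flat

def parent_of(flat, child_pos):
    """The parent is the first 'book' at or after the child's position."""
    for pos in range(child_pos, len(flat)):
        if flat[pos]["type"] == "book":
            return flat[pos]
    raise ValueError("no parent found")

flat_index = flatten_blocks(books)
order = [d["id"] for d in flat_index]
```

Because the ids inside a block are contiguous, finding a child's parent needs no lookup table, only a short walk forward in the id space.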
- 30. 30
Indexing Nested Documents (JSON format)

  {
    id : book1,
    type : book,
    title : "The Way of Kings",
    author : "Brandon Sanderson",
    genre : fantasy,
    pubyear : 2010,
    publisher : Tor,
    _childDocuments_ : [
      {
        id : book1_review1,
        type : review,
        review_dt : "2015-01-03T14:30:00Z",
        stars : 5,
        review_author : Yonik,
        comment : "A great start to what looks like an epic series!"
      },
      {
        id : book1_review2,
        type : review,
        review_dt : "2015-03-15T12:00:00Z",
        stars : 3,
        review_author : Dan,
        comment : "This book was too long."
      }
    ]
  }
- 31. 31
Block Join Queries

Find reviews mentioning "epic", limiting to reviews for books published by Tor:
  q=comment:epic
  fq={!child of="type:book"}publisher:Tor
  sort=review_dt desc

Find books published by Tor with a review mentioning "epic":
  q=publisher:Tor
  fq={!parent which="type:book"}comment:epic
  sort=pubyear desc
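The semantics of the two query directions can be checked against a toy in-memory index. In this Python sketch the documents, the publisher name "Other", and the helpers `blocks`, `child_of`, and `parent_which` are all hypothetical; they mirror what the `{!child}` and `{!parent}` filters select but are not the Solr query parsers.

```python
# Hypothetical flat index: children precede their parent within a block.
index = [
    {"id": "book1_review1", "type": "review", "comment": "an epic series"},
    {"id": "book1_review2", "type": "review", "comment": "too long"},
    {"id": "book1", "type": "book", "publisher": "Tor"},
    {"id": "book2_review1", "type": "review", "comment": "epic and clever"},
    {"id": "book2", "type": "book", "publisher": "Other"},
]

def blocks(flat):
    """Yield (parent, children) pairs from the flat block layout."""
    children = []
    for doc in flat:
        if doc["type"] == "book":
            yield doc, children
            children = []
        else:
            children.append(doc)

def child_of(flat, parent_pred):
    """Children whose parent matches, in the spirit of {!child of=...}."""
    return [c for p, cs in blocks(flat) if parent_pred(p) for c in cs]

def parent_which(flat, child_pred):
    """Parents with a matching child, in the spirit of {!parent which=...}."""
    return [p for p, cs in blocks(flat) if any(child_pred(c) for c in cs)]

is_tor = lambda d: d["publisher"] == "Tor"
mentions_epic = lambda d: "epic" in d["comment"].split()

# Reviews mentioning "epic", limited to reviews of Tor books.
reviews = [c["id"] for c in child_of(index, is_tor) if mentions_epic(c)]

# Tor books that have a review mentioning "epic".
tor_books = [p["id"] for p in parent_which(index, mentions_epic) if is_tor(p)]
```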
- 32. 32
Block Join Faceting (child to parent)
• Find the number of books I (Yonik) reviewed, broken out by genre

  q=review_author:Yonik&
  json.facet={
    genres : {
      type : terms,
      field : genre,
      domain : { blockParent : "type:book" }
    }
  }
- 33. 33
Block Join Faceting (parent to child)
• Find the top reviewers for sci-fi and fantasy books

  q=genre:(sci-fi OR fantasy)&
  json.facet={
    top_reviewers : {
      type : terms,
      field : review_author,
      domain : { blockChildren : "type:book" }
    }
  }
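Both domain switches, child to parent and parent to child, can be imitated in memory. The Python sketch below uses invented blocks and hypothetical helpers `facet_parents` and `facet_children`; it illustrates the idea of changing the facet domain before counting, not the behaviour of `blockParent` and `blockChildren` in every detail.

```python
from collections import Counter

# Hypothetical blocks: (parent book, list of child reviews).
blocks = [
    ({"id": "book1", "genre": "fantasy"},
     [{"review_author": "Yonik"}, {"review_author": "Dan"}]),
    ({"id": "book2", "genre": "sci-fi"},
     [{"review_author": "Yonik"}]),
    ({"id": "book3", "genre": "fantasy"},
     [{"review_author": "Dan"}]),
]

def facet_parents(blocks, child_pred, field):
    """Switch from matching children to their parents, then count by field."""
    parents = [p for p, kids in blocks if any(child_pred(k) for k in kids)]
    return Counter(p[field] for p in parents)

def facet_children(blocks, parent_pred, field):
    """Switch from matching parents to their children, then count by field."""
    kids = [k for p, ks in blocks if parent_pred(p) for k in ks]
    return Counter(k[field] for k in kids)

# Books reviewed by Yonik, broken out by genre (child to parent).
genres = facet_parents(blocks, lambda r: r["review_author"] == "Yonik", "genre")

# Top reviewers for sci-fi and fantasy books (parent to child).
reviewers = facet_children(
    blocks, lambda b: b["genre"] in ("sci-fi", "fantasy"), "review_author"
)
```

Each parent is counted once in the first case even if it has several matching reviews, which is the difference between counting books and counting reviews.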
- 34. 34
Thank you
yonik@cloudera.com