SolrCloud on Hadoop

Cleveland Big Data and Hadoop User Group, January 2014

Alex Moundalexis

@technmsg

Disclaimer

•  Technologies, not products
•  Cloudera builds software
    •  most donated to Apache
    •  some closed-source
•  I will likely mention “Cloudera Something”
•  Cloudera “products” I reference are open source
    •  Apache Licensed
    •  Source code is on GitHub
    •  https://github.com/cloudera

What This Talk Isn’t About

•  Deploying
    •  Puppet, Chef, Ansible, homegrown scripts, intern labor
•  Sizing & Tuning
    •  Depends heavily on data and workload
•  Coding
    •  Unless you count XML or CSV
•  Algorithms

The Apache Hadoop Ecosystem

Quick and dirty, more time for use cases.

Why “Ecosystem?”

•  In the beginning, just Hadoop
    •  HDFS
    •  MapReduce
•  Today, dozens of interrelated components
    •  I/O
    •  Processing
    •  Specialty Applications
    •  Configuration
    •  Workflow

Partial Ecosystem

[Diagram labels: Hadoop; external system; RDBMS / DWH; web server; device logs; API access; log collection; DB table import; batch processing; machine learning; user; DB table export; BI tool + JDBC/ODBC; Search; SQL]

HDFS

•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
    •  http://research.google.com/archive/gfs.html

Lots of Commodity Machines

[Image: Yahoo! Hadoop cluster, OSCON ’07]

MapReduce (MR)

•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
    •  http://research.google.com/archive/mapreduce.html

You specify map() and reduce() functions.
The framework does the rest.
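To make the division of labor concrete, here is a minimal single-process sketch of a word count. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` merely stands in for the grouping and scheduling that the real framework performs across many machines.

```python
from collections import defaultdict


def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Collapse all counts emitted for one word into a single total."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: map every line, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Each key is reduced independently, which is what makes the
    # reduce phase easy to spread across machines.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["hadoop solr hadoop", "solr search"]))
```

Only `map_fn` and `reduce_fn` carry application logic; everything inside `run_job` is the part a real cluster would handle for you.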


Apache HBase

•  Random, realtime read/write access
•  Key/value columnar store
•  (b|tr)illions of rows/columns
•  Based on Google BigTable
    •  http://research.google.com/archive/bigtable.html

Cloudera Hue

•  Hadoop User Experience
•  Hadoop is largely command line
•  Hue provides a UI for end-users
•  SDK to build your own apps on top

Apache Tika

•  Content analysis toolkit
•  Simply put, a lot of parsers
•  Detect/extract metadata/text from documents
    •  HTML
    •  XML
    •  Office
    •  PDF
    •  mbox
    •  More…

Apache ZooKeeper

•  Distributed systems are HARD
•  Everyone was trying to implement the same subsystems
•  Bugs lead to race conditions, other bad things
•  ZK: Highly reliable distributed coordination services
    •  Configuration
    •  Naming
    •  Synchronization
    •  Group Services

Cloudera Morphlines

•  In-memory transformations
•  Load, parse, transform, process
•  Records as name-value pairs w/ optional blob/pojo objects
•  Java library, embedded in your codebase
•  Used to ETL data from Flume and MR into Solr
•  Was part of CDK, now part of Kite
    •  http://kitesdk.org

Apache Lucene

•  Java-based index and search
•  ranked or sorted results
    •  hits streamed through QP
    •  mem(results) << mem(collection)
•  rich/extensible query operators
    •  bool, phrase, range, span, spatial
•  Features
    •  spellchecking
    •  hit highlighting
    •  tokenization
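One common way to keep the memory used for results far below the size of the collection is a bounded priority queue: hits stream past, and only the best k are retained. Lucene's top-hit collectors work along these lines, but the sketch below is an illustration in Python rather than Lucene's implementation, and `top_k` is a hypothetical helper.

```python
import heapq


def top_k(scored_hits, k):
    """Keep only the k best (score, doc_id) pairs while streaming over hits."""
    heap = []  # min-heap: the weakest retained hit is always heap[0]
    for hit in scored_hits:
        if len(heap) < k:
            heapq.heappush(heap, hit)
        elif hit > heap[0]:
            # The new hit beats the weakest retained one, so swap them.
            heapq.heapreplace(heap, hit)
    # Memory never exceeded k entries; return best first.
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    # A generator stands in for a stream of scored hits.
    hits = ((score, doc) for score, doc in [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")])
    print(top_k(hits, 2))
```

Because the heap never grows past k entries, memory depends on the number of results requested, not on how many documents matched.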

Apache Solr

•  Enterprise search platform
•  Based on Apache Lucene
•  Full-text search
•  Faceting
•  NRT indexing
•  UI

Apache Solr – Simple Indexing via CLI

$ java -jar post.jar solr.xml money.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solr.xml
SimplePostTool: POSTing file money.xml
SimplePostTool: COMMITting Solr index changes..

$ post.sh *.xml

Apache Solr – Document money.xml

<add>
<doc>
  <field name="id">USD</field>
  <field name="name">One Dollar</field>
  <field name="manu">Bank of America</field>
  <field name="manu_id_s">boa</field>
  <field name="cat">currency</field>
  <field name="features">Coins and notes</field>
  <field name="price_c">1,USD</field>
  <field name="inStock">true</field>
</doc>
<doc>
  <field name="id">EUR</field>
  <field name="name">One Euro</field>
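For documents generated by a program rather than written by hand, the same add/doc/field structure can be produced with a few lines of code. The sketch below uses only Python's standard library; `build_add_xml` is a hypothetical helper, and the field values are copied from the slide.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field name: value} dicts as one <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    usd = {"id": "USD", "name": "One Dollar", "cat": "currency", "inStock": "true"}
    print(build_add_xml([usd]))
```

The resulting string could be saved to a file and posted in the same way as money.xml.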

Apache Solr – More Advanced Indexing

•  From DB, using Data Import Handler (DIH)
•  Load a CSV file
•  POST JSON documents
•  Index binary documents (uses Tika)
•  SolrJ for programmatic document creation

Apache Solr – Querying

•  HTTP GET
    •  http://solr:8983/solr/collection1/select/
•  Examples
    •  ?q=timestamp:[* TO NOW]
    •  ?q=-instock:false
    •  ?q={!lucene q.op=AND df=text}myfield:foo +bar -bat
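The example queries contain spaces, brackets, and a literal `+`, all of which need URL encoding before they go into a GET request. A minimal sketch using Python's standard library follows; the host `solr` and collection `collection1` come from the slide, while `select_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://solr:8983/solr/collection1/select/"


def select_url(query, base=BASE):
    """Return a select URL with the q parameter safely percent-encoded."""
    return base + "?" + urlencode({"q": query})


if __name__ == "__main__":
    examples = [
        "timestamp:[* TO NOW]",
        "-instock:false",
        "{!lucene q.op=AND df=text}myfield:foo +bar -bat",
    ]
    for q in examples:
        url = select_url(q)
        # Decoding the URL again should give back the original query text.
        assert parse_qs(urlsplit(url).query)["q"] == [q]
        print(url)
```

Encoding matters most for the last example: an unencoded `+` in a query string is decoded as a space, which would silently drop the "required" operator in front of `bar`.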

Apache Solr – Querying

•  HTTP GET
    •  http://solr:8983/solr/collection1/select/?q=video
•  Examples
    •  fl=name,id (return only name and id fields)
    •  fl=name,id,score (return relevancy score as well)
    •  fl=*,score (return all fields + relevancy score)
    •  sort=price desc&fl=name,id,price (sort by price desc)
    •  wt=json (return response in JSON format)
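With wt=json, the matching documents arrive under `response.docs`, next to a `numFound` total. The sketch below parses a hand-written sample in that shape, roughly what `sort=price desc&fl=name,id,price&wt=json` might return. The documents and values in `SAMPLE` are invented for illustration and were not captured from a real server; `summarize` is a hypothetical helper.

```python
import json

# Illustrative response body only; the ids, names and prices are made up.
SAMPLE = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"id": "doc-1", "name": "Example video card", "price": 479.95},
      {"id": "doc-2", "name": "Example video player", "price": 399.0}
    ]
  }
}
"""


def summarize(raw):
    """Return (total hit count, [(id, name, price), ...]) from a JSON response."""
    body = json.loads(raw)["response"]
    rows = [(doc["id"], doc["name"], doc["price"]) for doc in body["docs"]]
    return body["numFound"], rows


if __name__ == "__main__":
    total, rows = summarize(SAMPLE)
    print(total, "hits")
    for row in rows:
        print(row)
```

Note that `numFound` counts every match, while `docs` holds only the page of results that was returned.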

What the Heck is Faceting?

•  Generate counts for properties or categories
•  Links allow drill-down or refine search results

What?

Facets on Amazon.com

Apache Solr – Facets at Query Time

•  HTTP GET
    •  http://solr:8983/solr/collection1/select/?q=video
•  All docs, count by category
    q=*:*&facet=true&facet.field=cat
•  All docs, count by category and in-stock status
    q=*:*&facet=true&facet.field=cat&facet.field=inStock
•  Docs matching “ipod”, count by price (above/below $100)
    q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]
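Facet requests repeat the same parameter name (`facet.field`, `facet.query`), so they are easiest to build from a list of pairs rather than a dict. The sketch below also shows one way to read field counts back, assuming Solr's default JSON layout of a flat list that alternates value and count. `facet_params` and `pair_up` are hypothetical helpers, and the counts in the usage example are invented.

```python
from urllib.parse import urlencode


def facet_params(query, fields=(), queries=()):
    """Build an encoded query string with repeated facet.* parameters."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", name) for name in fields]
    params += [("facet.query", fq) for fq in queries]
    return urlencode(params)


def pair_up(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    print(facet_params("*:*", fields=["cat", "inStock"]))
    print(facet_params("ipod", queries=["price:[0 TO 100]", "price:[100 TO *]"]))
    # Made-up counts in the assumed flat layout.
    print(pair_up(["currency", 4, "electronics", 14]))
```

The links on a faceted page are typically these counts rendered next to each value, with each link adding a filter that narrows the search.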

Apache Solr – Querying via UI

Apache SolrCloud

•  Integration of Solr + ZooKeeper
•  Provides for shard failover

Cloudera Search

•  Based on Apache Solr (incl. Lucene and SolrCloud)
•  Fault-tolerance: collections backed by HDFS or HBase
•  Integration galore:
    •  HBase/Flume/MapReduce w/ Lucene
    •  Hue w/ Solr
    •  Avro w/ Tika
    •  HDFS w/ Solr/Lucene
    •  Sentry w/ Solr

Cloudera Search + Hue

Why Search?

Apologies, I swiped some pretty slides from marketing…

Search Design Strategy

An Integrated Part of the Hadoop System

One pool of data
One security framework
One set of system resources
One management interface

[Diagram labels: Storage; Integration; Resource Management; Metadata; HDFS; HBase; text, RCFile, Parquet, Avro, etc.; records; Engines: Batch Processing (MapReduce, Hive, Pig), Interactive SQL (Cloudera Impala), Interactive Search (Cloudera Search), Machine Learning (Mahout), Math/Statistics (SAS, R), …]

Benefits of Search Integration

Improved Big Data ROI
§  An interactive experience without technical knowledge
§  Single data set for multiple computing frameworks

Faster Time to Insight
§  Exploratory analysis, esp. unstructured data
§  Broad range of indexing options to accommodate needs

Cost Efficiency
§  Single scalable platform; no incremental investment
§  No need for separate systems, storage

Solid Foundations
Reliability
§  Solr in production environments for years
§  Hadoop-powered reliability and scalability
Search Use Cases

Some quick examples.
Search Use Cases

Powerful, proven search capabilities that let organizations:

Offer easy access to non-technical resources
Explore data prior to processing and modeling
Gain immediate access and find correlations in mission-critical data

Monsanto

Scalable, efficient image search for analysis and research
Track plant characteristics throughout their lifecycle
Before: Manual attribute extraction and search queries within database
Now: Parse and index images at acquisition and on demand, index archived images in batch
Custom Aggregated Search

Cloudera: Internal Field Portal
Cloudera – Internal Field Portal

•  Single stop for field engineers
    •  Mailing lists: public, private
    •  Tickets: support, development, public ASF
    •  Customer data: accounts, clusters, KB articles
    •  Customer Clusters: configs, audits, logs, events
    •  Books and papers
    •  Discussion forums
•  Dogfooding, yes
•  Makes my life easier


Cloudera – Internal Field Portal

•  Varied fetchers/observers for web/API content
•  Content is retrieved via Flume, Sqoop
•  Search indexes and replicates into HBase
•  Each collection has collection-specific filters/fields
•  Provides title, content snippet, link to original
•  Morphlines extracts books and papers using Tika
•  Impala for analytics
•  Future: Use MapReduce to ingest logs

Summary

Parting thoughts… in no particular order.

Search Simplifies Interaction

Explore
Navigate
Correlate

Experts know MapReduce. Savvy people know SQL.
Everyone knows Search.

Summary

•  With Hadoop, it depends.
•  The tools are out there.
•  Open source software, hooray!
•  Many interconnected pieces
•  Many unexplored opportunities
•  A thriving community awaits you…
•  Data can make a difference.
•  Search allows everyone to interact with data.
•  This is a Big Deal.

What’s Next?

•  Search examples
    •  http://blog.cloudera.com/blog/category/search/
•  Cloudera provides pre-loaded VMs
    •  http://tiny.cloudera.com/quickstartvm
•  Clone our repos!
    •  https://github.com/cloudera

Questions?

Preferably related to the talk…

Thank You!

Alex Moundalexis
@technmsg

We’re hiring, kids! Well, not kids.
