This document provides an overview of SolrCloud on Hadoop. It discusses how SolrCloud allows for distributed, highly scalable search capabilities on Hadoop clusters. Key components that work with SolrCloud are also summarized, including HDFS for storage, MapReduce for processing, and ZooKeeper for coordination services. The document demonstrates how SolrCloud can index and query large datasets stored in Hadoop.
1. SolrCloud on Hadoop
Cleveland Big Data and Hadoop User Group, January 2014
Alex Moundalexis
@technmsg
2. Disclaimer
• Technologies, not products
• Cloudera builds things (software)
• most donated to Apache
• some closed-source
• I will likely mention “Cloudera Something”
• Cloudera “products” I reference are open source
• Apache Licensed
• Source code is on GitHub
• https://github.com/cloudera
3. What This Talk Isn’t About
• Deploying
• Puppet, Chef, Ansible, homegrown scripts, intern labor
• Sizing & Tuning
• Depends heavily on data and workload
• Coding
• Unless you count XML or CSV
• Algorithms
5. Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
6. Partial Ecosystem
[Diagram labels: Hadoop; external system; RDBMS / DWH; web server; device logs; API access; log collection; DB table import; batch processing; machine learning; user; DB table export; BI tool + JDBC/ODBC; Search; SQL]
7. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
9. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
11. You specify map() and reduce() functions. The framework does the rest.
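To make the division of labor concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the names map_fn, reduce_fn and run_job are made up for illustration, and run_job stands in for "the framework" by doing the shuffle (grouping values by key) between the two user-supplied functions.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """User-supplied map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """User-supplied reduce: sum all counts collected for one word."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: map every line, group by key, then reduce."""
    groups = defaultdict(list)
    # Shuffle step: collect every value emitted under the same key.
    for key, value in chain.from_iterable(map_fn(line) for line in lines):
        groups[key].append(value)
    # Reduce step: one reduce_fn call per distinct key.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))
```

Calling run_job(["big data", "Big Deal"]) groups the two occurrences of "big" under one key before reducing. In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, which is the part the framework handles for you.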
12. Apache HBase
• Random, realtime read/write access
• Key/value columnar store
• (b|tr)illions of rows/columns
• Based on Google BigTable
• http://research.google.com/archive/bigtable.html
13. Cloudera Hue
• Hadoop User Experience
• Hadoop is largely command line
• Hue provides a UI for end-users
• SDK to build your own apps on top
14. Apache Tika
• Content analysis toolkit
• Simply put, a lot of parsers
• Detect/extract metadata/text from documents
• HTML
• XML
• Office
• PDF
• mbox
• More…
15. Apache ZooKeeper
• Distributed systems are HARD
• Everyone was trying to implement the same subsystems
• Bugs lead to race conditions, other bad things
• ZK: Highly reliable distributed coordination services
• Configuration
• Naming
• Synchronization
• Group Services
16. Cloudera Morphlines
• In-memory transformations
• Load, parse, transform, process
• Records as name-value pairs w/ optional blob/pojo objects
• Java library, embedded in your codebase
• Used to ETL data from Flume and MR into Solr
• Was part of CDK, now part of Kite
• http://kitesdk.org
17. Apache Lucene
• Java-based index and search
• ranked or sorted results
• hits streamed through QP
• mem(results) mem(collection)
• rich/extensible query operators
• bool, phrase, range, span, spatial
• Features
• spellchecking
• hit highlighting
• tokenization
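The idea behind "index and search" can be sketched with a toy inverted index in Python. This is not Lucene's API or its data structures: tokenize, build_index and search_all are hypothetical names, the tokenizer is deliberately naive, and only a boolean AND query is shown (no ranking, phrases or ranges).

```python
from collections import defaultdict

PUNCTUATION = ".,!?\"'"


def tokenize(text):
    """Naive tokenization: split on whitespace, trim punctuation, lowercase."""
    tokens = (raw.strip(PUNCTUATION).lower() for raw in text.split())
    return [token for token in tokens if token]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Boolean AND: return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

With docs = {1: "One Dollar", 2: "One Euro"}, a query for "one" matches both documents and "one euro" matches only document 2. The lookup touches only the posting sets for the query terms, never the full text of the collection.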
19. Apache Solr – Simple Indexing via CLI
$ java -jar post.jar solr.xml money.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solr.xml
SimplePostTool: POSTing file money.xml
SimplePostTool: COMMITting Solr index changes..
$ post.sh *.xml
20. Apache Solr – Document
money.xml
<add>
<doc>
  <field name="id">USD</field>
  <field name="name">One Dollar</field>
  <field name="manu">Bank of America</field>
  <field name="manu_id_s">boa</field>
  <field name="cat">currency</field>
  <field name="features">Coins and notes</field>
  <field name="price_c">1,USD</field>
  <field name="inStock">true</field>
</doc>
<doc>
  <field name="id">EUR</field>
  <field name="name">One Euro</field>
  …
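A document in this add/doc/field shape can also be generated rather than written by hand. The sketch below uses only Python's standard xml.etree.ElementTree; build_add_xml is a hypothetical helper name, and values are passed as strings so that booleans stay in the lowercase form shown on the slide.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field name: value} dicts as a Solr-style <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            # One <field name="..."> element per entry; escaping is handled by ElementTree.
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")
```

For example, build_add_xml([{"id": "USD", "name": "One Dollar"}]) returns a single-line string with one doc element holding two field elements. The result could be saved to a file and posted the same way as money.xml.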
21. Apache Solr – More Advanced Indexing
• From DB, using Data Import Handler (DIH)
• Load a CSV file
• POST JSON documents
• Index binary documents (uses Tika)
• SolrJ for programmatic document creation
26. Apache Solr – Facets at Query Time
• HTTP GET
• http://solr:8983/solr/collection1/select/?q=video
• All docs, count by category
q=*:*&facet=true&facet.field=cat
• All docs, count by category and in-stock status
q=*:*&facet=true&facet.field=cat&facet.field=inStock
• Docs matching “ipod”, count by price (above/below $100)
q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]
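Facet queries like these are ordinary URL parameters, so they can be assembled with a standard URL encoder. The following Python sketch only builds and decodes URLs; it does not contact a server. The host and collection come from the slide, while facet_url and decode are hypothetical helper names. Note that the encoder percent-escapes characters such as `*`, `:`, `[` and spaces.

```python
from urllib.parse import parse_qs, urlencode

# Host and collection as shown on the slide; adjust for a real deployment.
BASE = "http://solr:8983/solr/collection1/select"


def facet_url(q, fields=(), queries=()):
    """Build a select URL with faceting on the given fields and facet queries."""
    params = [("q", q), ("facet", "true")]
    # Repeated parameters are allowed: one facet.field / facet.query per entry.
    params += [("facet.field", field) for field in fields]
    params += [("facet.query", query) for query in queries]
    return BASE + "?" + urlencode(params)


def decode(url):
    """Return the query parameters of a URL as a dict of lists (for inspection)."""
    return parse_qs(url.split("?", 1)[1])
```

For instance, facet_url("ipod", queries=["price:[0 TO 100]", "price:[100 TO *]"]) produces the encoded form of the third query above, and decode() on that URL gives back the two original facet.query strings.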
33. Search Design Strategy
An Integrated Part of the Hadoop System
• One pool of data
• One security framework
• One set of system resources
• One management interface
[Diagram labels: Storage (HDFS, HBase; TEXT, RCFILE, PARQUET, AVRO, ETC.; RECORDS); Integration; Resource Management; Metadata; Engines: Batch Processing (MAPREDUCE, HIVE, PIG, …), Interactive SQL (CLOUDERA IMPALA), Interactive Search (CLOUDERA SEARCH), Machine Learning (MAHOUT), Math/Statistics (SAS, R)]
34. Benefits of Search Integration
Improved Big Data ROI
§ An interactive experience without technical knowledge
§ Single data set for multiple computing frameworks
Faster Time to Insight
§ Exploratory analysis, esp. unstructured data
§ Broad range of indexing options to accommodate needs
Cost Efficiency
§ Single scalable platform; no incremental investment
§ No need for separate systems, storage
Solid Foundations, Reliability
§ Solr in production environments for years
§ Hadoop-powered reliability and scalability
36. Search Use Cases
Powerful, proven search capabilities that let organizations:
• Offer easy access to non-technical resources
• Explore data prior to processing and modeling
• Gain immediate access and find correlations in mission-critical data
37. Monsanto
Scalable, efficient image search for analysis and research
Track plant characteristics throughout their lifecycle
Before: Manual attribute extraction and search queries within database
Now: Parse and index images at acquisition and on demand, index archived images in batch
41. Cloudera – Internal Field Portal
• Varied fetchers/observers for web/API content
• Content is retrieved via Flume, Sqoop
• Search indexes and replicates into HBase
• Each collection has collection-specific filters/fields
• Provides title, content snippet, link to original
• Morphlines extracts books and papers using Tika
• Impala for analytics
• Future: Use MapReduce to ingest logs
44. Summary
• With Hadoop, it depends.
• The tools are out there.
• Open source software, hooray!
• Many interconnected pieces
• Many unexplored opportunities
• A thriving community awaits you…
• Data can make a difference.
• Search allows everyone to interact with data.
• This is a Big Deal.