Presented by Grant Ingersoll, Chief Scientist, Lucid Imagination - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond batch processing. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries, and other insights. Furthermore, analyzing how users interact with the content can both further enhance the quality of the system and deliver much-needed insight into the users and the content for the business. In this talk, we'll discuss a platform that enables large-scale search, discovery, and analytics over a wide variety of content using tools like Solr, Hadoop, Mahout, and others. The talk covers the architecture and capabilities of the system, along with how the capabilities of Solr 4 help drive real-time access for content discovery and analytics.
KEYNOTE: Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
1. Search. Discover. Analyze.
   Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
   Grant Ingersoll, Chief Scientist, Lucid Imagination
2. We All Know the Pain
   l ________ data growth in the next ___ days/months/years
   – Many estimate 80-90% of data is "unstructured" (multi-structured?)
   l The Age of "Data Paranoia"
   – What if I don't collect it all?
   – What if I miss something or lose something?
   – What if I can't store it long enough?
   – How do I secure it?
   – Can I afford to do any of this? Can I afford not to?
   – What if I can't make sense of it?
3. Big Data Premise and Promise

   Premise                              | Promise
   -------------------------------------|--------
   Large Scale Data Collection/Storage  | ✔
   Prevents Data Loss                   | ✔
   Long Term Storage                    | ✔
   Affordable                           | ✔
   New Science Delivering New Insights  | ?
4. Why Search, Discovery and Analytics (SDA)?
   l User Needs:
   – Real-time, ad hoc access to content
   – Aggressive Prioritization based on Importance
   – Serendipity
   l Batch processing isn't enough
   l Search is built for multi-structured data
   l Deeper analysis yields:
   – Business insight into users
   – Better Search and Discovery for users
   (Diagram: Search / Discovery / Analytics cycle)
5. What do you need for SDA?
   l Fast, efficient, scalable search
   – Bulk and Near Real Time Indexing
   l Large scale, cost-effective storage
   l Large scale processing power
   – Large scale and distributed for whole-data consumption and analysis
   – Sampling tools
   – Distributed in-memory where appropriate
   l NLP and machine learning tools that scale to enhance discovery and analysis
6. Example Use Cases
   l Dark Data – Petabytes (and beyond) of content in storage with little insight into what's in it
   – Forensics, Intelligence Gathering, Risk analysis, etc.
   l Financial – Enable a total customer view to better understand risks and opportunities
   l Medical – Extend research capabilities through deeper analysis of scientific data, publications and field usage
   l Social Media Monitoring – Understand and analyze social networks and their trends all the time, no matter the scale
   l Commerce – Drive more sales through metric-driven search and discovery without the guesswork
7. Announcing LucidWorks Big Data Beta
   An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users
9. Key Features of Beta
   l Combines the real-time, ad hoc data accessibility of LucidWorks with the compute and storage capabilities of Hadoop
   l Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
   l RESTful API supporting JSON input/output formats for easy integration
   l Full Stack – Minimizes the impact of provisioning Hadoop, LucidWorks and other components
   l Hosted in the cloud and supported by Lucid Imagination
10. APIs
   l Search and Indexing
   – Full power of LucidWorks (Solr)
   – Bulk and Near Real Time Indexing
   – Sharded via SolrCloud
   l Analytics
   – Common search analytics for better understanding of relevancy based on log analysis
   – Historical views
   l Workflows
   – Predefined workflows ease common data tasks such as bulk indexing
   l Machine Learning
   – Clustering
   – Statistically Interesting Phrases
   – Future enhancements planned
   l Administration
   – Access to key system information
   – User management
   l Proxy APIs
   – LucidWorks
   – WebHDFS
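The slides describe a RESTful API with JSON input/output for search and indexing. As an illustration only, here is a minimal sketch of building such a JSON search request in Python; the field names ("query", "start", "rows", "filters") are hypothetical assumptions for illustration, not the documented LucidWorks Big Data schema.

```python
import json

def build_search_request(query, start=0, rows=10, filters=None):
    """Serialize a search request as JSON for a hypothetical REST endpoint.

    Field names here are illustrative assumptions, not the product's schema.
    """
    body = {"query": query, "start": start, "rows": rows}
    if filters:
        body["filters"] = filters
    return json.dumps(body)

# Build a request for 5 results matching "hadoop", restricted to log content.
payload = build_search_request("hadoop", rows=5, filters=["source:logs"])
print(payload)
```

A client would POST such a payload to the service and receive JSON results back, which is what makes the API easy to integrate from any language.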
11. Under the Hood
   LucidWorks 2.1:
   l Lucene/Solr 4.0-dev
   l Sharded with SolrCloud
   – 1 second (default) soft commits for NRT updates
   – 1 minute (default) hard commits (no searcher reopen)
   – Transaction logs for recovery
   – Solr takes care of leader election, etc., so no more master/worker
   l See Mark Miller's talk on SolrCloud
   SDA Engine:
   l RESTful services built on Restlet 2.1
   l Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator
   l Authentication and authorization over SSL (optional)
   l Proxies for LucidWorks and WebHDFS API
   l Workflow engine coordinates data flow
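The commit behavior described above maps onto Solr 4's update handler configuration. As a sketch (using the slide's stated defaults, not values taken from the actual product configuration), the soft/hard commit split and the transaction log would look like this in solrconfig.xml:

```xml
<!-- Sketch of the slide's defaults in a Solr 4 solrconfig.xml -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Transaction log enables recovery -->
  <updateLog/>
  <!-- Hard commit every minute; does not reopen the searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit every second makes new documents visible (NRT) -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

The design point is that soft commits give near-real-time visibility cheaply, while the infrequent hard commits plus the transaction log provide durability without constantly reopening searchers.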
12. Under the Hood
   l Apache Hadoop
   – Map-Reduce (MR) jobs for ETL and bulk indexing into SolrCloud sharded system
   l Apache Pig
   – ETL
   – Log analysis -> HBase
   – WebHDFS
   l Apache ZooKeeper
   – Netflix Curator for service discovery and higher level ZK client
   l Apache Kafka
   – Pub-sub for collecting logs from LucidWorks into HDFS
   l Apache HBase
   – Key-value and time series of all calculated metrics
   – Leverage Pig and custom MR jobs for log processing and metric calculation
   l Apache Mahout
   – K-Means Clustering
   – Statistically Interesting Phrases
   – More to come
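Mahout's contribution here is K-Means clustering run as MapReduce passes over data in HDFS. To make the underlying algorithm concrete, here is a minimal single-machine sketch of the same assign/recompute loop (toy 2-D points; Mahout distributes exactly these two steps across the cluster):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: the same assign/recompute loop Mahout runs as MR passes."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    for _ in range(iters):
        # Assignment step (the "map"): each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step (the "reduce"): each centroid becomes its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids

# Two well-separated toy clusters; converges to the two cluster means.
pts = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.1, 9.9)]
cents = sorted(kmeans(pts, k=2))
print(cents)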
13. The Road Ahead
   l Our approach is from search and discovery outwards to analytics
   – Analytics in beta are focused around analysis of search logs
   l Analytics Themes
   – Relevance
   – Data quality
   – Discovery
   – Integration with other packages (R?)
   l Machine Learning
   – Classification
   – NLP
   l More analytics on the index itself?
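The beta's analytics center on search-log analysis for relevance. As one concrete illustration (the record shape and field names are hypothetical, not the product's log format), a basic relevancy signal mined from such logs is per-query click-through rate:

```python
from collections import defaultdict

# Toy search-log records; in the platform, logs would flow from
# LucidWorks through Kafka into HDFS. Field names are illustrative.
logs = [
    {"query": "solr", "clicked": True},
    {"query": "solr", "clicked": False},
    {"query": "mahout", "clicked": True},
    {"query": "solr", "clicked": True},
]

def click_through_rate(records):
    """Aggregate per-query CTR, a simple relevancy signal from search logs."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for r in records:
        shown[r["query"]] += 1
        if r["clicked"]:
            clicked[r["query"]] += 1
    return {q: clicked[q] / shown[q] for q in shown}

print(click_through_rate(logs))
```

At scale this aggregation would be a Pig or MapReduce job with results stored in HBase, but the metric itself is this simple.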