Presented by Grant Ingersoll, Chief Scientist, Lucid Imagination - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond batch processing. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries, and other insights. Furthermore, analyzing how users interact with the content can both further enhance the quality of the system and deliver much-needed insight into the users and the content for the business. In this talk, we'll discuss a platform that enables large-scale search, discovery, and analytics over a wide variety of content using tools like Solr, Hadoop, Mahout, and others. The talk covers the architecture and capabilities of the system, along with how the capabilities of Solr 4 help drive real-time access for content discovery and analytics.
KEYNOTE: Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
1. Search. Discover. Analyze.
   Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
   Grant Ingersoll, Chief Scientist, Lucid Imagination
2. We All Know the Pain
   l ________ data growth in the next ___ days/months/years
   – Many estimate 80-90% of data is "unstructured" (multi-structured?)
   l The Age of "Data Paranoia"
   – What if I don't collect it all?
   – What if I miss something or lose something?
   – What if I can't store it long enough?
   – How do I secure it?
   – Can I afford to do any of this? Can I afford not to?
   – What if I can't make sense of it?
3. Big Data Premise and Promise

   Premise                              | Promise
   -------------------------------------|--------
   Large Scale Data Collection/Storage  | ✔
   Prevents Data Loss                   | ✔
   Long Term Storage                    | ✔
   Affordable                           | ✔
   New Science Delivering New Insights  | ?
4. Why Search, Discovery and Analytics (SDA)?
   l User Needs:
   – Real-time, ad hoc access to content
   – Aggressive Prioritization based on Importance
   – Serendipity
   l Batch processing isn't enough
   l Search is built for multi-structured data
   l Deeper analysis yields:
   – Business insight into users
   – Better Search and Discovery for users
   (Diagram: Search / Discovery / Analytics cycle)
5. What do you need for SDA?
   l Fast, efficient, scalable search
   – Bulk and Near Real Time Indexing
   l Large scale, cost-effective storage
   l Large scale processing power
   – Large scale and distributed for whole-data consumption and analysis
   – Sampling tools
   – Distributed in-memory where appropriate
   l NLP and machine learning tools that scale to enhance discovery and analysis
6. Example Use Cases
   l Dark Data – Petabytes (and beyond) of content in storage with little insight into what's in it
   – Forensics, Intelligence Gathering, Risk analysis, etc.
   l Financial – Enable a total customer view to better understand risks and opportunities
   l Medical – Extend research capabilities through deeper analysis of scientific data, publications and field usage
   l Social Media Monitoring – Understand and analyze social networks and their trends all the time, no matter the scale
   l Commerce – Drive more sales through metric-driven search and discovery without the guesswork
7. Announcing LucidWorks Big Data Beta
   An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users
9. Key Features of Beta
   l Combines the real-time, ad hoc data accessibility of LucidWorks with the compute and storage capabilities of Hadoop
   l Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
   l RESTful API supporting JSON input/output formats for easy integration
   l Full Stack – Minimizes the impact of provisioning Hadoop, LucidWorks and other components
   l Hosted in the cloud and supported by Lucid Imagination
10. APIs
   l Search and Indexing
   – Full power of LucidWorks (Solr)
   – Bulk and Near Real Time Indexing
   – Sharded via SolrCloud
   l Analytics
   – Common search analytics for better understanding of relevancy based on log analysis
   – Historical views
   l Workflows
   – Predefined workflows ease common data tasks such as bulk indexing
   l Machine Learning
   – Clustering
   – Statistically Interesting Phrases
   – Future enhancements planned
   l Administration
   – Access to key system information
   – User management
   l Proxy APIs
   – LucidWorks
   – WebHDFS
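The slides describe a RESTful API with JSON input/output for search and indexing. As an illustration only, here is a minimal sketch of building such a JSON search request in Python; the field names ("query", "start", "rows", "filters") are hypothetical assumptions for illustration, not the documented LucidWorks Big Data schema.

```python
import json

def build_search_request(query, start=0, rows=10, filters=None):
    """Serialize a search request as JSON for a hypothetical REST endpoint.

    Field names here are illustrative assumptions, not the product's schema.
    """
    body = {"query": query, "start": start, "rows": rows}
    if filters:
        body["filters"] = filters
    return json.dumps(body)

# Build a request for 5 results matching "hadoop", restricted to log content.
payload = build_search_request("hadoop", rows=5, filters=["source:logs"])
print(payload)
```

A client would POST such a payload to the service and receive JSON results back, which is what makes the API easy to integrate from any language.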
11. Under the Hood
   LucidWorks 2.1:
   l Lucene/Solr 4.0-dev
   l Sharded with SolrCloud
   – 1 second (default) soft commits for NRT updates
   – 1 minute (default) hard commits (no searcher reopen)
   – Transaction logs for recovery
   – Solr takes care of leader election, etc., so no more master/worker
   l See Mark Miller's talk on SolrCloud
   SDA Engine:
   l RESTful services built on Restlet 2.1
   l Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator
   l Authentication and authorization over SSL (optional)
   l Proxies for LucidWorks and WebHDFS API
   l Workflow engine coordinates data flow
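The commit behavior described above maps onto Solr 4's update handler configuration. As a sketch (using the slide's stated defaults, not values taken from the actual product configuration), the soft/hard commit split and the transaction log would look like this in solrconfig.xml:

```xml
<!-- Sketch of the slide's defaults in a Solr 4 solrconfig.xml -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Transaction log enables recovery -->
  <updateLog/>
  <!-- Hard commit every minute; does not reopen the searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit every second makes new documents visible (NRT) -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

The design point is that soft commits give near-real-time visibility cheaply, while the infrequent hard commits plus the transaction log provide durability without constantly reopening searchers.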
12. Under the Hood
   l Apache Hadoop
   – Map-Reduce (MR) jobs for ETL and bulk indexing into SolrCloud sharded system
   l Apache Pig
   – ETL
   – Log analysis -> HBase
   – WebHDFS
   l Apache ZooKeeper
   – Netflix Curator for service discovery and higher level ZK client
   l Apache Kafka
   – Pub-sub for collecting logs from LucidWorks into HDFS
   l Apache HBase
   – Key-value and time series of all calculated metrics
   – Leverage Pig and custom MR jobs for log processing and metric calculation
   l Apache Mahout
   – K-Means Clustering
   – Statistically Interesting Phrases
   – More to come
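Mahout's contribution here is K-Means clustering run as MapReduce passes over data in HDFS. To make the underlying algorithm concrete, here is a minimal single-machine sketch of the same assign/recompute loop (toy 2-D points; Mahout distributes exactly these two steps across the cluster):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: the same assign/recompute loop Mahout runs as MR passes."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    for _ in range(iters):
        # Assignment step (the "map"): each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step (the "reduce"): each centroid becomes its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids

# Two well-separated toy clusters; converges to the two cluster means.
pts = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.1, 9.9)]
cents = sorted(kmeans(pts, k=2))
print(cents)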
13. The Road Ahead
   l Our approach is from search and discovery outwards to analytics
   – Analytics in beta are focused around analysis of search logs
   l Analytics Themes
   – Relevance
   – Data quality
   – Discovery
   – Integration with other packages (R?)
   l Machine Learning
   – Classification
   – NLP
   l More analytics on the index itself?
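The beta's analytics center on search-log analysis for relevance. As one concrete illustration (the record shape and field names are hypothetical, not the product's log format), a basic relevancy signal mined from such logs is per-query click-through rate:

```python
from collections import defaultdict

# Toy search-log records; in the platform, logs would flow from
# LucidWorks through Kafka into HDFS. Field names are illustrative.
logs = [
    {"query": "solr", "clicked": True},
    {"query": "solr", "clicked": False},
    {"query": "mahout", "clicked": True},
    {"query": "solr", "clicked": True},
]

def click_through_rate(records):
    """Aggregate per-query CTR, a simple relevancy signal from search logs."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for r in records:
        shown[r["query"]] += 1
        if r["clicked"]:
            clicked[r["query"]] += 1
    return {q: clicked[q] / shown[q] for q in shown}

print(click_through_rate(logs))
```

At scale this aggregation would be a Pig or MapReduce job with results stored in HBase, but the metric itself is this simple.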