Today we face new challenges in realtime analytics of BigData, such as social monitoring, M2M sensors, online advertising optimization, smart energy management, and security monitoring. Analyzing these data requires scalable machine learning technologies. Jubatus is the open source platform for online distributed machine learning on BigData streams. We explain the technologies inside Jubatus and show how Jubatus can achieve realtime analytics on a variety of problems.
The document discusses key concepts related to big data, including what data and big data are, the three V's of big data (volume, velocity, and variety), sources and types of big data, how big data differs from traditional databases, applications of big data across fields such as healthcare and social media, tools for working with big data like Hadoop and MongoDB, and challenges and solutions related to big data.
Big data refers to datasets that are too large to be managed by traditional database tools. It is characterized by volume, velocity, and variety. Hadoop is an open-source software framework that allows distributed processing of large datasets across clusters of computers. It works by distributing storage across nodes as blocks and distributing computation via a MapReduce programming paradigm where nodes process data in parallel. Common uses of big data include analyzing social media, sensor data, and using machine learning on large datasets.
Unexpected Challenges in Large Scale Machine Learning, by Charles Parker (BigMine)
Talk by Charles Parker (BigML) at the BigMine12 workshop at KDD12.
In machine learning, scale adds complexity. The most obvious consequence of scale is that data takes longer to process. At certain points, however, scale makes trivial operations costly, forcing us to re-evaluate algorithms in light of the complexity of those operations. Here, we discuss one important way a general large-scale machine learning setting may differ from the standard supervised classification setting, and show the results of some preliminary experiments highlighting this difference. The results suggest that there is potential for significant improvement beyond the obvious solutions.
Implementation of Big Data infrastructure and technology can be seen in various industries like banking, retail, insurance, healthcare, and media. Big Data management functions like storage, sorting, processing, and analysis of such colossal volumes cannot be handled by existing database systems or technologies. Frameworks come into the picture in such scenarios. Frameworks are toolsets that offer innovative, cost-effective solutions to the problems posed by Big Data processing; they help provide insights, incorporate metadata, and aid decision making aligned to business needs.
This document outlines the course content for a Big Data Analytics course. The course covers key concepts related to big data including Hadoop, MapReduce, HDFS, YARN, Pig, Hive, NoSQL databases and analytics tools. The 5 units cover introductions to big data and Hadoop, MapReduce and YARN, analyzing data with Pig and Hive, and NoSQL data management. Experiments related to big data are also listed.
A high level overview of common Cassandra use cases, adoption reasons, BigData trends, DataStax Enterprise and the future of BigData given at the 7th Advanced Computing Conference in Seoul, South Korea
This document discusses the evolution of cluster computing and resource management. It describes how:
1) Early clusters were single-purpose and used technologies like MapReduce. General purpose cluster OSes like YARN emerged to allow multiple applications on a cluster.
2) YARN improved on Hadoop by decoupling the programming model from resource management, allowing more flexibility and better performance/availability.
3) REEF aims to further improve frameworks by factoring out common functionalities around communication, configuration, and fault tolerance.
Guest Lecture: Introduction to Big Data at Indian Institute of Technology, by Nishant Gandhi
This document provides an introduction to big data, including definitions of big data and why it is important. It discusses characteristics of big data like volume, velocity, variety and veracity. It provides examples of big data applications in various industries like GE, Boeing, social media, finance, CERN, journalism, politics and more. It also introduces NoSQL and the CAP theorem, and concludes that big data is changing business and technology by enabling new insights from data to reduce costs and optimize operations.
The document provides an introduction to big data and Hadoop. It describes the concepts of big data, including the four V's of big data: volume, variety, velocity and veracity. It then explains Hadoop and how it addresses big data challenges through its core components. Finally, it describes the various components that make up the Hadoop ecosystem, such as HDFS, HBase, Sqoop, Flume, Spark, MapReduce, Pig and Hive. The key takeaways are that the reader will now be able to describe big data concepts, explain how Hadoop addresses big data challenges, and describe the components of the Hadoop ecosystem.
Introductory Big Data presentation given during one of our Sizing Servers Lab user group meetings. The presentation is targeted at an audience of about 20 SME employees. It also contains a short description of the work packages for our Big Data project proposal that was submitted in March.
Big data - what, why, where, when and how, by bobosenthil
The document discusses big data, including what it is, its characteristics, and architectural frameworks for managing it. Big data is defined as data that exceeds the processing capacity of conventional database systems due to its large size, speed of creation, and unstructured nature. The architecture for managing big data is demonstrated through Hadoop technology, which uses a MapReduce framework and open source ecosystem to process data across multiple nodes in parallel.
Big data is characterized by 3Vs - volume, velocity, and variety. Hadoop is a framework for distributed processing of large datasets across clusters of computers. It provides HDFS for storage, MapReduce for batch processing, and YARN for resource management. Additional tools like Spark, Mahout, and Zeppelin can be used for real-time processing, machine learning, and data visualization respectively on Hadoop. Benefits of Hadoop include ease of scaling to large data, high performance via parallel processing, reliability through data protection and failover.
Bigdata and data warehousing can work in synergy by applying the structure of data warehousing to the large and unstructured datasets of bigdata. While data warehousing focuses on modeling data, co-locating related information, and optimizing queries, bigdata is better suited to analyzing unstructured data at scale through distributed systems without an upfront model. The two approaches complement each other by bringing structure to bigdata through modeling and applying bigdata's ability to analyze unstructured data at massive scale.
Disclaimer:
The images, company, product, and service names used in this presentation are for illustration purposes only. All trademarks and registered trademarks are the property of their respective owners.
Data and images were collected from various sources on the Internet.
The intention was to present the big picture of Big Data & Hadoop.
The document discusses big data analytics and related topics. It provides definitions of big data, describes the increasing volume, velocity and variety of data. It also discusses challenges in data representation, storage, analytical mechanisms and other aspects of working with large datasets. Approaches for extracting value from big data are examined, along with applications in various domains.
BIGDATA - Survey on Scheduling Methods in Hadoop MapReduce, by Mahantesh Angadi
The document summarizes a technical seminar presentation on scheduling methods in the Hadoop MapReduce framework. The presentation covers the motivation for Hadoop and MapReduce, provides an introduction to big data and Hadoop, and describes HDFS and the MapReduce programming model. It then discusses challenges in MapReduce scheduling and surveys the literature on existing scheduling methods. The presentation surveys five papers on proposed MapReduce scheduling methods, summarizing the key points of each. It concludes that improving data locality can enhance performance and that future work could consider scheduling algorithms for heterogeneous clusters.
Big data refers to terabytes or larger datasets that are generated daily and stored across multiple machines in different formats. Analyzing this data is challenging due to its size, format diversity, and distributed storage. Moving the data or code during analysis can overload networks. MapReduce addresses this by bringing the code to the data instead of moving the data, significantly reducing network traffic. It uses HDFS for scalable and fault-tolerant storage across clusters.
This document provides an agenda for a presentation on big data and big data analytics using R. The presentation introduces the presenter and has sections on defining big data, discussing tools for storing and analyzing big data in R like HDFS and MongoDB, and presenting case studies analyzing social network and customer data using R and Hadoop. The presentation also covers challenges of big data analytics, existing case studies using tools like SAP Hana and Revolution Analytics, and concerns around privacy with large-scale data analysis.
Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types such as structured/unstructured and streaming/batch, and different sizes from terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process the data with low-latency. And it has one or more of the following characteristics – high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, web, and social media - much of it generated in real time and in a very large scale.
Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources independent or together with their existing enterprise data to gain new insights resulting in significantly better and faster decisions.
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME, from Structure:Data 2012, by Gigaom
The document discusses the 3 V's of big data: volume, velocity, and variety. It provides examples of how each V impacts data analysis and storage. It also discusses how text data has been a major driver of big data growth and challenges. The key challenges are processing large and diverse datasets quickly enough to keep up with real-time data streams and demands.
The document discusses big data and its applications. It defines big data as large and complex data sets that are difficult to process using traditional data management tools. It outlines the three V's of big data - volume, variety, and velocity. Various types of structured, semi-structured, and unstructured data are described. Examples are given of how big data is used in various industries like automotive, finance, manufacturing, policing, and utilities to improve products, detect fraud, perform simulations, track suspects, and monitor assets. Popular big data software like Hadoop and MongoDB are also mentioned.
In the past decade a number of technologies have revolutionized the way we do analytics in banking. In this talk we would like to summarize this journey from classical statistical offline modeling to the latest real-time streaming predictive analytical techniques.
In particular, we will look at Hadoop and how this distributed computing paradigm has evolved with the advent of in-memory computing. We will introduce Spark, an engine for large-scale data processing optimized for in-memory computing.
Finally, we will describe how to make data science actionable and how to overcome some of the limitations of current batch processing with streaming analytics.
This document summarizes a presentation about Apache Hivemall, a scalable machine learning library for Apache Hive, Spark, and Pig. Hivemall provides easy-to-use machine learning functions that can run efficiently in parallel on large datasets. It supports various classification, regression, recommendation, and clustering algorithms. The presentation outlines Hivemall's capabilities, how it works with different big data platforms like Hive and Spark, and its ongoing development including new features like XGBoost integration and generalized linear models.
This document provides an overview of bio big data and related technologies. It discusses what big data is and why bio big data is necessary given the large size of genomic data sets. It then outlines and describes Hadoop, Spark, machine learning, and streaming in the context of bio big data. For Hadoop, it explains HDFS, MapReduce, and the Hadoop ecosystem. For Spark, it covers RDDs, Spark SQL, MLlib, and Spark Streaming. The document is intended as an introduction to key concepts and tools for working with large biological data sets.
Paxcel Technologies provides data analytics services to help clients make smarter decisions using their data. Their services include improving business planning, data quality, transparency and monitoring processes. They use machine learning techniques like regression, classification and predictive modeling along with tools like Python, R and Hadoop. Their goal is to help clients gain insights and competitive advantages through data visualization. They have experience in industries like media, education and retail and provide services like proof of concept evaluations and implementation.
This document provides an overview of Hadoop, an open source framework for distributed storage and processing of large datasets across clusters of computers. It discusses that Hadoop was created to address the challenges of "Big Data" characterized by high volume, variety and velocity of data. The key components of Hadoop are HDFS for storage and MapReduce as an execution engine for distributed computation. HDFS uses a master-slave architecture with a NameNode master and DataNode slaves, and provides fault tolerance through data replication. MapReduce allows processing of large datasets in parallel through mapping and reducing functions.
Dimitri Ponomareff is an experienced coach, project manager, and facilitator. He has extensive experience coaching and training teams at many large organizations. Dimitri is passionate about sharing his knowledge of Agile methodologies like Scrum, XP, and Kanban to help teams improve. The document provides an overview of these Agile approaches including their origins and key principles.
BDaaS - BigData as a Service, by Sherya Pal from Saama. The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved with the author.
A new wave of artificial intelligence has emerged which has revolutionized industry and academia. Much like the web took advantage of existing technologies, this new wave builds on trends such as the decline in the cost of computing hardware, the emergence of the cloud, the fundamental consumerization of the enterprise and, of course, the mobile revolution.
Deep Learning has achieved remarkable breakthroughs, which have, in turn, driven performance improvements across AI components.
The document provides information about the SmartLab research group at the University of Genoa in Italy. It discusses SmartLab's work in areas like real-time analytics for fuel prediction and skid prediction in racing cars. It also mentions past projects involving traffic forecasting and bus arrival time prediction. The document outlines SmartLab's computing resources and plans to expand its IBM cluster. It discusses potential future work in areas like process mining, condition-based maintenance using NoSQL databases, and advanced data analytics.
Jubatus is an open source software framework for distributed online machine learning on big data. It focuses on performing real-time deeper analysis through online machine learning algorithms that can be run in a distributed manner by locally updating models and periodically mixing them together. This allows for fast, scalable, and memory-efficient deep learning on large, streaming datasets without requiring data storage or sharing across nodes.
This document discusses research challenges in the Internet of Things (IoT). It begins by defining IoT and describing its key components like sensing, embedded systems, cloud computing, and analytics. It then discusses several application areas like healthcare, automotive, retail, and more. The document outlines the complex IoT architecture involving various stakeholders. It also discusses technical challenges in areas like distributed computing, communication protocols, data storage, analytics, privacy and security. Finally, it provides an overview of Tata Consultancy Services' Innovation Lab in Kolkata, including its research areas, projects, publications, awards and references.
HPC traditionally handles data at rest. The acquisition of streaming data presents a different set of challenges that, at scale, can be difficult to tackle. The approach to building data ingestion infrastructure at ARC-TS involves treating every service as a swappable building block. With this pluggable design using Docker containers you are free to choose which component is best. We will use an example use case to show how data is being generated, ingested, and how each component in the stack can be replaced.
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat..., by Alexandru Iosup
Data are pouring in, and defining and providing data-processing services at massive scale, in short, Big Data services, could significantly improve the revenue of Europe's Small and Medium Enterprises (SMEs). A paradigm shift is about to occur, one in which data processing becomes a basic life utility, for both SMEs and the European people. Although the burgeoning datacenter industry, of which the Netherlands is a top player in Europe, is promising to enable Big Data services, the architectures and even infrastructure for these services are still lagging behind in performance, efficiency, and sophistication, and are built as monoliths reminding us of traditional data silos. Can we remove the performance and efficiency limitations of the current Big Data ecosystems, that is, of the complex stacks of middleware that are currently in use, for Big Data services? In this talk, I will present several use cases (workloads) of Big Data services for time-stamped [2,3] and graph data [4], evaluate or benchmark the performance of several Big Data stacks [3,4] for these use cases, and present a path (and promising early results) to providing a generic, data-agnostic, non-monolithic Big Data architecture that can efficiently and elastically use datacenter resources via cloud computing interfaces [1,5].
[1] A. L. Varbanescu and A. Iosup. On Many-Task Big Data Processing: from GPUs to Clouds. Proc. of SC|12 (MTAGS). http://www.pds.ewi.tudelft.nl/~iosup/many-tasks-big-data-vision13mtags_v100.pdf
[2] de Ruiter and Iosup. A Workload Model for MapReduce. MSc thesis at TU Delft, Jun 2012. Available online via TU Delft Library, http://library.tudelft.nl
[3] Hegeman, Ghit, Capotă, Hidders, Epema, Iosup. The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation. IEEE Big Data 2013. http://www.pds.ewi.tudelft.nl/~iosup/btworld-mapreduce-workflow13ieeebigdata.pdf
[4] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis. IEEE IPDPS 2014. http://www.pds.ewi.tudelft.nl/~iosup/perf-eval-graph-proc14ipdps.pdf
[5] B. Ghit, N. Yigitbasi, A. Iosup, and D. Epema. Balanced Resource Allocations Across Multiple Dynamic MapReduce Clusters. ACM SIGMETRICS 2014. http://pds.twi.tudelft.nl/~iosup/dynamic-mapreduce14sigmetrics.pdf
Big Data Beyond Hadoop*: Research Directions for the Future, by Odinot Stanislas
Michael Wrinn, Research Program Director, University Research Office, Intel Corporation
Jason Dai, Engineering Director and Principal Engineer, Intel Corporation
The document provides an overview of various digital technologies including AI, IoT, cloud computing, data analytics, and more. It discusses the "apples" or fundamental technologies in these areas like AR, VR, AI, IoT, and cloud computing. It then outlines several learning paths one could take to understand these technologies, beginning with foundations in areas like probability, statistics, computer science, and communications. It provides recommendations for books and courses to learn about each technology from roots to more advanced concepts. Finally, it discusses bringing all the pieces together using design thinking.
Accelerating Real-Time Analytics Insights Through Hadoop Open Source Ecosystem, by DataWorks Summit
This document discusses accelerating real-time analytics through the Hadoop open source ecosystem. It highlights Intel's contributions to open source projects like Apache Hadoop and Apache Spark to drive mainstream adoption of advanced analytics. Real-time analytics can provide insights using data as it arrives rather than after it is stored. The document explores use cases for real-time analytics in healthcare, social media, and security and how Intel is working to accelerate solutions in these domains using its data platform and open source technologies.
Tom Soderstrom, Chief Technology and Innovation Officer at NASA’s Jet Propulsion Laboratory, has demonstrated how internet-of-things (IoT) technology and cloud computing can form the backbone for monumental innovation. This combination has enabled private and public space exploration enterprises to dare greatly and, together, discover more of the solar system than ever before. Cloud computing, with its unlimited storage and compute resources, blends IoT, machine learning, intelligent assistance, and new interfaces with computers. It has the potential to allow humans to explore and colonize other areas of the solar system by enabling collaboration across millions of miles, and social networking on a planetary scale.
Real time big data analytical architecture for remote sensing application, by LeMeniz Infotech
This document discusses tools and services for data intensive research in the cloud. It describes several initiatives by the eXtreme Computing Group at Microsoft Research related to cloud computing, multicore computing, quantum computing, security and cryptography, and engaging with research partners. It notes that the nature of scientific computing is changing to be more data-driven and exploratory. Commercial clouds are important for research as they allow researchers to start work quickly without lengthy installation and setup times. The document discusses how economics has driven improvements in computing technologies and how this will continue to impact research computing infrastructure. It also summarizes several Microsoft technologies for data intensive computing including Dryad, LINQ, and Complex Event Processing.
This document discusses big data and how new data models are disrupting traditional approaches. It notes that while the new models are initially difficult to understand and threaten existing investments, they are capable of processing large volumes of data quickly. The document examines concepts like Hadoop, NoSQL, and how relational and non-relational approaches can work together in a hybrid environment. It concludes that trends point to more unified support of different data types and expanded capabilities in systems like real-time analytics and embedded search.
The document discusses using machine learning for efficient attack detection in IoT devices without feature engineering. It proposes a feature-engineering-less machine learning (FEL-ML) process that uses raw packet byte streams as input instead of engineered features. This approach is lighter weight and faster than traditional methods. The FEL-ML model is trained directly on unprocessed packet data to perform malware detection on resource-constrained IoT devices. Prior research that used engineered features or complex deep learning models is not suitable for IoT due to limitations of memory and processing power. The proposed FEL-ML approach aims to enable effective network traffic security for IoT using minimal resources.
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH..., by ijcsit
Through the generalization of deep learning, the research community has addressed critical challenges in the network security domain, like malware identification and anomaly detection. However, they have yet to discuss deploying them on Internet of Things (IoT) devices for day-to-day operations. IoT devices are often limited in memory and processing power, rendering the compute-intensive deep learning environment unusable. This research proposes a way to overcome this barrier by bypassing feature engineering in the deep learning pipeline and using raw packet data as input. We introduce a feature-engineering-less machine learning (ML) process to perform malware detection on IoT devices. Our proposed model, "Feature-engineering-less ML (FEL-ML)," is a lighter-weight detection algorithm that expends no extra computations on "engineered" features. It effectively accelerates the low-powered IoT edge. It is trained on unprocessed byte-streams of packets. Aside from providing better results, it is quicker than traditional feature-based methods. FEL-ML facilitates resource-sensitive network traffic security with the added benefit of eliminating the significant investment by subject matter experts in feature engineering.
Dynamic Semantics for the Internet of Things, by Payam Barnaghi
Ontology Summit 2015 : Track A Session - Ontology Integration in the Internet of Things - Thu 2015-02-05,
http://ontolog-02.cim3.net/wiki/ConferenceCall_2015_02_05
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights..., by Vladimir Bacvanski, PhD
This document discusses how to analyze large datasets using Hadoop and BigInsights. It describes how IBM's Watson uses Hadoop to distribute its workload and load information into memory from sources like 200 million pages of text, CRM data, POS data, and social media to provide distilled insights. The document provides two use case examples of how energy companies and global media firms could use big data analytics to analyze weather data and identify unauthorized streaming content.
How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights..., by DATAVERSITY
Do you wonder how to process huge amounts of data in a short amount of time? If yes, this session is for you! You will learn why Apache Hadoop and Streams are the core frameworks that enable storing, managing, and analyzing vast amounts of data. You will learn the idea behind Hadoop's famous map-reduce algorithm and why it is at the heart of solutions that process massive amounts of data with flexible workloads and software-based scaling. We explore how to go beyond Hadoop with both real-time and batch analytics, usability, and manageability. For practical examples, we will use IBM InfoSphere BigInsights and Streams, which build on top of open source tooling when going beyond the basics and scaling up and out is needed.
Similar to Jubatus: Realtime deep analytics for BigData @ Rakuten Technology Conference 2012 (20)
Lecture materials by Keisuke Fukuda of PFN for the University of Tokyo graduate course "Special Lectures on Interdisciplinary Informatics III" (October 19, 2022).
・Introduction to Preferred Networks
・Our developments to date
・Our research & platform
・Simulation ✕ AI
Notable features of Kubernetes 1.24 and what's next, picked with my own biases! / Kubernetes Meetup Tokyo 50
Jubatus: Realtime deep analytics for BigData @ Rakuten Technology Conference 2012
1. Oct. 20th, 2012 @ Rakuten Technology Conference 2012
Realtime deep analytics for BigData
Daisuke Okanohara
Preferred Infrastructure, Inc.
co-founder, vice president
hillbig@preferred.jp
2. Agenda
- Introduction of PFI
- Current condition of BigData analysis
- Jubatus: concept and characteristics
- Inside Jubatus: Update, Analyze, and Mix
3. Preferred Infrastructure (PFI)
- Founded: March 2006
- Location: Hongo, Tokyo
- Employees: 26
- Our mission: bring cutting-edge research advances to the real world
- Our products:
  - Sedue, a "modern search engine"
  - Bazil, "machine learning for everyone"
  - Jubatus, "realtime deep analytics for BigData"
4. Preferred Infrastructure (contd.)
- We are passionate about developing various computer science technologies:
  - machine learning
  - natural language processing
  - distributed systems
  - programming languages
  - data structures
  - algorithms, etc.
- Our team includes winners of various programming contests and red coders
- Very rapid prototyping and development of good software
5. Agenda
- Introduction of PFI
- Current condition of BigData analysis
- Jubatus: concept and characteristics
- Inside Jubatus: Update, Analyze, and Mix
6. BigData!
- We see BigData everywhere
- The 3 V's: "Volume", "Velocity", "Variety"
- We need tools for analyzing BigData
<Data Types> Text, Log, Image, Voice, Vision, Signal, Finance, Bio
<Data Sources> People, PC, Mobile, Sensors, Cars, Factories, Web, Hospitals
7. Case 1. SNS (Twitter, Facebook, etc.)
- Jubatus classifies each tweet from the stream (6,000 tps) into categories according to tweet contents, using machine learning technologies
8. Case 2. Automobiles
- Services
  - Remote maintenance / security
  - Insurance: Pay As You Drive, Pay How You Drive
- Auto-driving cars
  - Equipped sensors: radar, lidar (laser radar), GPS, cameras
  - E.g. Google driverless cars: in Aug. 2012, they completed 480,000 km of test drives
9. Case 2. Automobiles (contd.)
- Navigation system based on real-time traffic updates (waze.com)
10. Case 3. Infrastructures, factories
- Preventive maintenance for the NY City power grid
  - Learning a prioritization (supervised ranking or MTBF) of candidates using approx. 300 summary features
  - The results are accurate enough to support decision making
[Figure: OA rate (outage rate); from "Machine Learning for the New York City Power Grid", IEEE Trans. PAMI, 2012]
11. Case 3. Infrastructures, factories (contd.)
[Figure: benefit vs. cost for various replacement strategies, analyzed by machine learning; from "Machine Learning for the New York City Power Grid", IEEE Trans. PAMI, 2012]
12. Case 4. Genome analysis
- Next-generation sequencers are making big changes
  - Human genome sequencing: $3 billion / 10 years in 2001 became $7,700 / 1 day in 2012
  - GWAS (genome-wide association studies) are becoming popular
- Big impacts in many fields: healthcare, agriculture, medicine
  - 23andme analyzes users' DNA and obtains information about their ancestries, health, and genetic traits
13. Agenda
- Introduction of PFI
- Current condition of BigData analysis
- Jubatus: concept and characteristics
- Inside Jubatus: Update, Analyze, and Mix
14. Increasing demand in BigData applications: higher necessity of deeper real-time analysis
- Current: simple aggregation and pre-defined rule processing on bigger data
  - CEP, Hadoop, DSMS
- Future: deeper analysis for rapid decisions and actions
[Figure: decision speed vs. depth of analysis, positioning CEP, Hadoop, and Jubatus]
References: http://web.mit.edu/rudin/www/TPAMIPreprint.pdf, http://www.computerworlduk.com/news/networking/3302464/
15. Jubatus: OSS platform for Big Data analytics
- Joint development of PFI and NTT laboratory
- Project started in April 2011
- Released as open source software
- You can download it from: http://github.com/jubatus/
16. Key technology: machine learning
- We need rapid decisions under uncertainties
  - Anomaly detection from M2M sensor data
  - Energy demand forecast / smart grid optimization
  - Security monitoring on raw Internet traffic
- What is missing for fast & deep analytics on BigData?
  - An online/real-time machine learning platform + a scale-out distributed machine learning platform
(1. Bigger data, 2. Real-time, 3. Deeper analysis)
17. Online machine learning
- Batch machine learning
  - Scans all data before building a model
  - Analysis becomes available only after all data is prepared
- Online machine learning
  - The model is updated instantaneously by each data sample
  - Online models converge to the batch models
  - The convergence is very fast, approx. 100 times faster than batch (1 day -> 5 min.)
(A code sketch of this contrast follows below.)
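To make the contrast concrete, here is a minimal sketch in plain Python (not Jubatus code) of mistake-driven online learning with a binary perceptron; the dict-based sparse features and the toy samples are illustrative assumptions.

```python
from collections import defaultdict

def predict(w, x):
    """Score a sparse feature vector x (dict) under weights w."""
    return sum(w[f] * v for f, v in x.items())

def online_train(stream):
    """Update the model immediately after each sample (one pass)."""
    w = defaultdict(float)
    for x, y in stream:                  # y is +1 or -1
        if y * predict(w, x) <= 0:       # mistake-driven perceptron update
            for f, v in x.items():
                w[f] += y * v
    return w

# Unlike batch learning, the model is usable at any point of the stream,
# long before all data has arrived.
samples = [({"sunny": 1.0}, +1), ({"rain": 1.0}, -1),
           ({"sunny": 1.0, "warm": 0.5}, +1)]
w = online_train(samples)
print(predict(w, {"sunny": 1.0}) > 0)    # True
```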
18. Jubatus employs the latest online machine learning
- Advantages: fast and memory-efficient
  - Low latency & high throughput
  - No need for large dataset storage
- E.g. online learning for linear classification
  - Perceptron (1958)
  - Passive Aggressive (2003)
  - Very recent progress:
    - Confidence Weighted Learning (2008)
    - AROW (2009)
    - Normal HERD (2010)
    - Soft Confidence Weighted Learning (2012)
(A sketch of the Passive Aggressive update follows below.)
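As one example from this list, below is a hedged sketch of the Passive Aggressive update (the PA-I variant of Crammer et al., 2006) in plain Python, not taken from the Jubatus sources; the function and parameter names are our own.

```python
def pa_update(w, x, y, C=1.0):
    """PA-I: aggressively fix a margin violation, otherwise leave w as-is.

    w: dict of feature weights (mutated in place)
    x: sparse feature vector as a dict; y: label in {+1, -1}
    C: aggressiveness cap on the step size
    """
    score = sum(w.get(f, 0.0) * v for f, v in x.items())
    loss = max(0.0, 1.0 - y * score)                 # hinge loss at margin 1
    if loss > 0.0:
        sq_norm = sum(v * v for v in x.values()) or 1.0
        tau = min(C, loss / sq_norm)                 # PA-I closed-form step size
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + tau * y * v
    return w
```

Each sample triggers at most one closed-form weight adjustment, which is why such updates give the low latency and high throughput claimed above.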
19. Data analysis goes real-time/online and large-scale
- Jubatus combines them into a unified computation framework
[Figure: quadrant chart of real-time/online vs. batch, and small-scale stand-alone vs. large-scale distributed/parallel computing. Online ML algorithms (Structured Perceptron 2001, PA 2003, CW 2008) are online but stand-alone; SPSS (1988-) and WEKA (1993-) are batch and stand-alone; Mahout (2006-) is batch and distributed; Jubatus (2011-) is online and distributed]
20. What Jubatus currently supports
1. Classification (multi-class): Perceptron / PA / CW / AROW
2. Regression: PA-based regression
3. Nearest neighbor: LSH / MinHash / Euclid LSH
4. Recommendation: based on nearest neighbor
5. Anomaly detection: LOF based on nearest neighbor
6. Graph analysis: shortest path / centrality (PageRank)
7. Simple statistics
(We support most machine learning / data mining technologies; a MinHash sketch follows below.)
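To illustrate one of the nearest-neighbor building blocks named above, here is a toy MinHash signature in plain Python; it shows the idea only and is not the Jubatus implementation (the hash construction and signature size are illustrative assumptions).

```python
import hashlib

def minhash(features, num_hashes=64):
    """Signature whose per-position agreement rate approximates
    the Jaccard similarity of the underlying feature sets."""
    return [
        min(int(hashlib.md5(f"{i}:{f}".encode()).hexdigest(), 16)
            for f in features)
        for i in range(num_hashes)
    ]

def estimated_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

s1 = minhash({"jubatus", "online", "ml", "stream"})
s2 = minhash({"jubatus", "online", "ml", "batch"})
print(estimated_similarity(s1, s2))  # near the true Jaccard similarity, 3/5
```

Because the fixed-size signature stands in for the full feature set, similar items can be found quickly and memory-efficiently, which also underlies the recommendation and anomaly detection features listed above.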
21. Hadoop and Mahout are not good for online learning
- Hadoop
  - Advantages
    - Many extensions for a variety of applications
    - Good for distributed data storing and aggregation
  - Disadvantages
    - No direct support for machine learning and online processing
- Mahout
  - Advantages
    - Popular machine learning algorithms are implemented
  - Disadvantages
    - Some implementations are less mature
    - Still not capable of online machine learning
22. Jubatus vs. Hadoop, RDB, and Storm: advantage in online AND distributed ML
- Only Jubatus satisfies both of them at the same time

                                      Jubatus   Hadoop        RDB             Storm
  Storing BigData                     -         ✓✓ (HDFS)     ✓               -
  Batch learning                      ✓         ✓✓ (Mahout)   ✓ (SPSS, etc.)  -
  Stream processing                   ✓         -             -               ✓✓
  Distributed learning                ✓✓        ✓ (Mahout)    -               -
  Online learning (high importance)   ✓✓        -             -               -
23. Agenda
- Introduction of PFI
- Current condition of BigData analysis
- Jubatus: concept and characteristics
- Inside Jubatus: Update, Analyze, and Mix
24. Distributed online learning algorithms are not trivial
[Figure: in batch learning, many "learn" steps run between infrequent model updates, which is easy to parallelize; in online learning, every "learn" step is followed by a model update, which is hard to parallelize due to the frequent updates]
- Online learning requires frequent model updates
- A naive distributed architecture leads to too many synchronization operations
25. Solution: loose model sharing
- Jubatus shares only the local models, in a loose manner
  - Fact: model size << data size
  - It does not share data sets
  - A unique approach compared to existing frameworks
- Local models can be different on the servers
  - Different models will be gradually merged
[Figure: three servers, each holding its own local model that is periodically replaced by a mixed model]
26. Three fundamental operations on Jubatus: UPDATE, ANALYZE, and MIX
1. UPDATE
  - Receive a sample, learn from it, and update the local model
2. ANALYZE
  - Receive a sample, apply the local model, and return the result
3. MIX (automatically executed in the backend)
  - Exchange and merge the local models between servers
  - C.f. the Map-Shuffle-Reduce operations on Hadoop
- Algorithms can be implemented independently of
  - distribution logic
  - data sharing
  - failover
(A schematic sketch of these operations follows below.)
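The following schematic sketch (plain Python, reusing the pa_update sketch from earlier) shows how UPDATE and ANALYZE divide the work on a single server; the class and method names mirror the slide's vocabulary and are assumptions, not the actual Jubatus RPC interface.

```python
class JubatusLikeServer:
    """One server process holding a private local linear model."""

    def __init__(self):
        self.model = {}                    # local weights, never shared as data

    def update(self, x, y):
        """UPDATE: learn from one sample; touches only the local model."""
        pa_update(self.model, x, y)        # reuses the PA-I sketch above

    def analyze(self, x):
        """ANALYZE: apply the local (mixed) model; no communication."""
        score = sum(self.model.get(f, 0.0) * v for f, v in x.items())
        return +1 if score >= 0 else -1
```

MIX, the third operation, runs in the background between servers and is sketched after the MIX slide below.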
27. UPDATE
- Each data sample is sent to one (or two) server(s), distributed randomly or consistently
- Local models are updated based on the sample
- Data samples are NEVER shared
[Figure: incoming samples are routed to two servers; each server turns its initial model into its own local model]
28. MIX
- Each server sends its model diff (difference)
- Model diffs are merged and distributed
- Only model diffs are transmitted
[Figure: local model 1 - initial model = model diff 1; local model 2 - initial model = model diff 2; diff 1 + diff 2 = merged diff; initial model + merged diff = mixed model, on every server]
(A sketch of one MIX round follows below.)
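Below is a sketch of one MIX round over the JubatusLikeServer instances from the earlier sketch. Averaging the diffs is an assumed merge rule that suits linear models; the actual merge logic in Jubatus depends on the algorithm.

```python
def mix(servers, base_model):
    """One MIX round: diff -> merge -> redistribute. Only diffs would
    cross the network; data samples never do."""
    feats = set(base_model)
    for s in servers:
        feats |= set(s.model)
    # 1. each server computes diff = local model - shared base model
    diffs = [{f: s.model.get(f, 0.0) - base_model.get(f, 0.0) for f in feats}
             for s in servers]
    # 2. diffs are merged (here: averaged across servers)
    merged = {f: sum(d[f] for d in diffs) / len(diffs) for f in feats}
    # 3. base model + merged diff becomes the new mixed model everywhere
    mixed = {f: base_model.get(f, 0.0) + merged[f] for f in feats}
    for s in servers:
        s.model = dict(mixed)
    return mixed   # becomes the base model for the next round
```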
29. UPDATE (iteration)
- Each server starts updating from the mixed model
- The mixed model improves gradually thanks to all of the servers
[Figure: samples again distributed randomly or consistently; each server evolves the mixed model into a new local model]
30. ANALYZE
- For analysis, each sample goes to a randomly chosen server
- The server applies the current mixed model to the sample
  - It uses the model on the local server only and does not communicate
- The results are returned to the client
[Figure: samples distributed randomly; each server holds the mixed model and returns its prediction]
31. Why can Jubatus work in real time?
1. Focus on online machine learning
  - Make online machine learning algorithms distributed
2. Update locally
  - Online training without communication with other servers
3. Mix only models
  - Small communication cost, low latency, good performance
  - An advantage compared to the costly Shuffle in MapReduce
4. Analyze locally
  - Each server has the mixed model and need not communicate
  - Low latency for making predictions
5. Everything in-memory
  - Process data on-the-fly
32. Summary
- Jubatus is the first OSS platform for online distributed machine learning on BigData streams
- Download it from http://github.com/jubatus/
- We welcome your contribution and collaboration
(1. Bigger data, 2. More in real-time, 3. Deep analysis)