SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

SAMOA: A Platform for
Mining Big Data Streams
Nicolas Kourtellis
Associate Researcher
Telefonica I+D, Barcelona

What is Big Data?
Search queries
Facebook posts
Emails
Tweets
Photo shares
Clicks on ads
…

How BIG is your data?
Volume (+ Variety)
Too large for RAM of single commodity server
Velocity
Too fast for CPU of single commodity server

What is the Streaming Paradigm?
High amount of data, high speed of arrival
Updated models at “real” time
Potentially infinite sequence of data
Change over time (concept drift)

Approximation algorithms:
Single pass, one data item at a time
Sub-linear space and time per data item
Small error with high probability
A platform solution:
Support different algorithms & processing engines
Distributed
Scalable
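The three properties of approximation algorithms listed above (single pass, sub-linear space, small error with high probability) can be illustrated with a Count-Min sketch, a standard streaming frequency estimator. It is not mentioned in the talk and is not presented here as a SAMOA component; the class name, parameters, and the simple seeded hash are assumptions made for this sketch.

```java
import java.util.Random;

/** Count-Min sketch: approximate item counts in one pass with fixed memory. */
class CountMinSketch {
    private final int depth;      // number of hash rows
    private final int width;      // counters per row
    private final long[][] table; // depth x width counters, independent of stream length
    private final int[] seeds;    // one hash seed per row

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Simple seeded mixing hash (an assumption of this sketch, not the
    // pairwise-independent family that the formal error bound relies on).
    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    /** Processes one data item: constant work, no buffering of the stream. */
    void add(String item) {
        for (int r = 0; r < depth; r++) {
            table[r][bucket(item, r)]++;
        }
    }

    /** Returns an estimate that never undercounts; collisions can only inflate it. */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < depth; r++) {
            min = Math.min(min, table[r][bucket(item, r)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(4, 256, 42L);
        String[] stream = {"storm", "s4", "storm", "flink", "storm"};
        for (String s : stream) {
            cms.add(s);
        }
        System.out.println("estimated count of storm: " + cms.estimate("storm"));
    }
}
```

With proper hash functions, choosing width = ceil(e/ε) and depth = ceil(ln(1/δ)) bounds the overestimate by εN with probability at least 1 − δ, where N is the number of items seen.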

What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
Framework for developing new distributed stream mining algorithms
Framework for deploying algorithms on new distributed stream processing engines

Taxonomy
Machine Learning
 Distributed
  Batch: Hadoop, Mahout
  Stream: S4, Storm, SAMOA
 Non-Distributed
  Batch: R, WEKA, …
  Stream: MOA

SAMOA Architecture
[Figure: architecture diagram. Labels: SAMOA, Machine Learning Algorithms, Distributed Stream Processing Engines, Flink]

Why is SAMOA important?
Program once, run everywhere
Reuse existing infrastructure
Avoid deploy cycles
No system downtime
No complex backup/update process
No need to select update frequency

ML Developer API
Processing Item
Processor
Stream

ML Developer API
TopologyBuilder builder;  // assumed to be initialized elsewhere

// First source and its output stream
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);

// Second source and its output stream
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);

// The join reads streamOne with shuffle grouping and streamTwo with key grouping
Processor join = new JoinProcessor();
builder.addProcessor(join)
       .connectInputShuffle(streamOne)
       .connectInputKey(streamTwo);
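The connectInputShuffle and connectInputKey calls above differ in how events are routed to the parallel instances of the receiving processor. The following is a toy model of the usual shuffle, key, and all grouping semantics in stream processing engines; it does not use the SAMOA API, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy routing rules for the three groupings; not SAMOA code. */
class GroupingSketch {

    /** Shuffle grouping: spread events evenly, ignoring their content (round-robin here). */
    static int shuffleTarget(long eventCounter, int parallelism) {
        return (int) (eventCounter % parallelism);
    }

    /** Key grouping: events with the same key always reach the same instance. */
    static int keyTarget(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** All grouping: every instance receives a copy of the event. */
    static List<Integer> allTargets(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        String[] keys = {"user-7", "user-3", "user-7"};
        for (int i = 0; i < keys.length; i++) {
            System.out.println(keys[i]
                + " shuffle -> " + shuffleTarget(i, parallelism)
                + ", key -> " + keyTarget(keys[i], parallelism)
                + ", all -> " + allTargets(parallelism));
        }
    }
}
```

Key grouping is what lets a join or a per-attribute statistic keep its state on a single instance, while shuffle grouping only balances load.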

Deployment
SAMOA-API.jar: API. Algorithm developer depends only on this
SAMOA-S4.jar: S4 bindings; samoa-s4-deployable.s4r goes to S4 cluster
SAMOA-Storm.jar: Storm bindings; samoa-storm-deployable.jar goes to Storm cluster

Easy to test!
bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar
"PrequentialEvaluation
-d /tmp/dump.csv
-i 1000000 -f 100000
-l (classifiers.trees.VerticalHoeffdingTree -p 4 -k)
-s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
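The PrequentialEvaluation task in the command above follows the prequential (interleaved test-then-train) scheme: each instance is first used to test the current model and only then to train it. A minimal sketch of that loop, assuming a toy majority-class learner in place of the Vertical Hoeffding Tree and a hypothetical sampleFrequency parameter playing the role that -f appears to play in the command:

```java
import java.util.Random;

/** Prequential (test-then-train) evaluation over a stream of labels; not SAMOA code. */
class PrequentialSketch {

    /** Toy learner that predicts the majority class seen so far. */
    static class MajorityClass {
        private final long[] counts;

        MajorityClass(int numClasses) {
            counts = new long[numClasses];
        }

        int predict() {
            int best = 0;
            for (int c = 1; c < counts.length; c++) {
                if (counts[c] > counts[best]) {
                    best = c;
                }
            }
            return best;
        }

        void train(int label) {
            counts[label]++;
        }
    }

    /** Returns the fraction of instances predicted correctly before being trained on. */
    static double evaluate(int[] labels, int numClasses, int sampleFrequency) {
        MajorityClass learner = new MajorityClass(numClasses);
        long correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (learner.predict() == labels[i]) {   // test first
                correct++;
            }
            learner.train(labels[i]);               // then train
            if ((i + 1) % sampleFrequency == 0) {   // periodic report
                System.out.printf("%d instances, accuracy %.3f%n",
                    i + 1, (double) correct / (i + 1));
            }
        }
        return labels.length == 0 ? 0.0 : (double) correct / labels.length;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int[] labels = new int[1000];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = rnd.nextDouble() < 0.7 ? 0 : 1;  // synthetic two-class stream
        }
        System.out.println("final accuracy: " + evaluate(labels, 2, 250));
    }
}
```

Because every instance is tested before the model has seen it, no separate hold-out set is needed, which suits a potentially infinite stream.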

Case study: Decision Trees
VHT: Vertical Hoeffding Tree*
Task Parallelism
*VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.

Case study: VHT
Horizontal Parallelism
[Figure: horizontal parallelism diagram. Labels: Stream, Instances, Stats (x3), Histograms, Model, Model Updates]

Case study: VHT
Vertical Parallelism
[Figure: vertical parallelism diagram. Labels: Stream, Attributes, Stats (x3), Splits, Model]

Benefits of Vertical Parallelism
High number of attributes:
high level of parallelism (e.g., documents)
vs. task parallelism:
obvious parallelism observed
vs. horizontal parallelism:
reduced memory usage (no model replication)
parallelized split computation
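The memory argument above can be made concrete with a toy sketch of vertical partitioning: each instance is broken into per-attribute observations, and each observation is routed by attribute index to one statistics worker, so no worker holds counts for all attributes and the model is not copied. This is a simplification under stated assumptions (discrete attribute values, routing on the attribute index alone); all names are hypothetical and it is not the VHT implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy vertical partitioning of attribute statistics across workers; not SAMOA code. */
class VerticalParallelismSketch {

    /** One statistics worker: keeps class counts only for the attributes routed to it. */
    static class StatsWorker {
        // "attribute=value" -> per-class counts
        final Map<String, long[]> counts = new HashMap<>();

        void observe(int attribute, int value, int label, int numClasses) {
            counts.computeIfAbsent(attribute + "=" + value, k -> new long[numClasses])[label]++;
        }
    }

    /** Key grouping on the attribute index. */
    static int workerFor(int attribute, int parallelism) {
        return Math.floorMod(attribute, parallelism);
    }

    /** Splits one instance into per-attribute observations and routes each one. */
    static void route(int[] instance, int label, StatsWorker[] workers, int numClasses) {
        for (int a = 0; a < instance.length; a++) {
            workers[workerFor(a, workers.length)].observe(a, instance[a], label, numClasses);
        }
    }

    public static void main(String[] args) {
        StatsWorker[] workers = {new StatsWorker(), new StatsWorker()};
        route(new int[]{1, 0, 1, 1}, 1, workers, 2);
        route(new int[]{0, 0, 1, 0}, 0, workers, 2);
        for (int w = 0; w < workers.length; w++) {
            System.out.println("worker " + w + " tracks " + workers[w].counts.keySet());
        }
    }
}
```

With many attributes, as in bag-of-words documents, the per-attribute work spreads across many workers, which is where the high level of parallelism comes from.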

Vertical Hoeffding Tree
[Figure: VHT topology. Components: Source (n), Model (n), Stats (n), Evaluator (1). Streams: Instance, Control, Split, Result. Groupings: Shuffle Grouping, Key Grouping, All Grouping]
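The talk does not spell out how a split is chosen from the statistics that the Stats components send back to the Model. Hoeffding trees conventionally use the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) and split once the gap between the two best attributes exceeds ε. A minimal sketch of that test, assuming VHT applies the standard rule; the method and parameter names are hypothetical.

```java
/** Standard Hoeffding-bound split test; assumed, not taken from the SAMOA code base. */
class HoeffdingBoundSketch {

    /**
     * @param range R, the range of the split criterion (log2 of the number of classes
     *              for information gain)
     * @param delta allowed probability of choosing the wrong attribute
     * @param n     number of instances observed at the leaf
     */
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Split only when the best attribute beats the runner-up by more than the bound. */
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > hoeffdingBound(range, delta, n);
    }

    public static void main(String[] args) {
        double range = 1.0;   // two classes: log2(2) = 1
        double delta = 1e-7;
        for (long n : new long[]{200, 800, 3200}) {
            System.out.println("n=" + n
                + " bound=" + hoeffdingBound(range, delta, n)
                + " split=" + shouldSplit(0.30, 0.20, range, delta, n));
        }
    }
}
```

The bound shrinks as more instances reach a leaf, so a small but real difference between attributes eventually triggers a split without storing the instances themselves.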

Preliminary results: Tweets
Zipf skew: 1.5
Bag of words: 100, 1000, 10000 (attributes)
Size of tweet: ~15 words
Instances: 1,000,000
Class: positive or negative (Gaussian random variable)
10 runs
Local vs. Storm virtual cluster
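A synthetic stream with the parameters above can be approximated by drawing word indices from a Zipf distribution. The sketch below samples about 15 words per tweet from a vocabulary of a given size with skew 1.5; it is an assumption about how such data could be generated, not the generator used in the experiments, and it leaves out the class label because the slide does not say how the Gaussian variable is mapped to a class.

```java
import java.util.Arrays;
import java.util.Random;

/** Zipf-distributed bag-of-words generator; a hypothetical stand-in for the real one. */
class ZipfTweetSketch {

    /** Cumulative distribution over ranks 1..vocabulary with P(k) proportional to 1/k^skew. */
    static double[] zipfCdf(int vocabulary, double skew) {
        double[] cdf = new double[vocabulary];
        double sum = 0.0;
        for (int k = 1; k <= vocabulary; k++) {
            sum += 1.0 / Math.pow(k, skew);
            cdf[k - 1] = sum;
        }
        for (int i = 0; i < vocabulary; i++) {
            cdf[i] /= sum;   // normalize so the last entry is 1.0
        }
        return cdf;
    }

    /** Draws one word index (0 is the most frequent word) by inverting the CDF. */
    static int sampleWord(double[] cdf, Random rnd) {
        int idx = Arrays.binarySearch(cdf, rnd.nextDouble());
        if (idx < 0) {
            idx = -idx - 1;  // insertion point: first entry greater than the draw
        }
        return Math.min(idx, cdf.length - 1);
    }

    /** Generates one tweet as an array of word indices. */
    static int[] tweet(double[] cdf, int words, Random rnd) {
        int[] result = new int[words];
        for (int i = 0; i < words; i++) {
            result[i] = sampleWord(cdf, rnd);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] cdf = zipfCdf(1000, 1.5);
        Random rnd = new Random(7);
        System.out.println(Arrays.toString(tweet(cdf, 15, rnd)));
    }
}
```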

Results: Accuracy
[Figure: Classification Accuracy vs. Parallelism Level vs. Number of Attributes. x-axis: Parallelism Level (4, 8, 16, local); y-axis: Correct Classification % (0 to 100); series: 100 words, 1000 words, 10000 words]

Results: Speedup
[Figure: Speedup vs. Parallelism Level vs. Number of Attributes. x-axis: Parallelism Level (4, 8, 16); y-axis: Speedup (0 to 5); series: 100 words, 1000 words, 10000 words]

Is SAMOA for you?
Are you dealing with:
Big fast data?
Possibly endless streams of data?
Evolving data?
Do you need updated models at real time?
Do you want to test an algorithm on different distributed stream processing engines (DSPEs)?

SAMOA Team
Albert Bifet
Gianmarco De Francisci Morales
Nicolas Kourtellis
Matthieu Morel
Arinto Murdopo
Olivier Van Laere

Status
Apache Incubator
 Released version 0.3.0 in July
Input:
 Local FS
 HDFS
 Kafka [pending]
Parallel algorithms
 Vertical Hoeffding Tree (classification)
 CluStream (clustering)
 Adaptive Model Rules (regression)
 PARMA (frequent pattern mining) [pending]
Execution engines
 Heron?

Algorithms in SAMOA
Existing:
 Vertical Hoeffding Tree (classification)
 CluStream (clustering)
 Adaptive Model Rules (regression)
Pending:
 Distributed Naïve Bayes
 Stochastic Gradient Descent
 Adaptive + Boosting VHT
 Parallelized Gradient Boosted Decision Tree
 PARMA (frequent pattern mining)
 …
Check Samoa Roadmap for more
Looking for contributors!

SAMOA: A Platform for Mining Big Data Streams
@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa
Nicolas Kourtellis
@kourtellis
nicolas.kourtellis@telefonica.com