A general overview of the Apache SAMOA platform for mining big data streams with machine learning algorithms that run on distributed stream processing engines such as Apache Storm, Apache Flink, Apache Samza and Apache Apex.
Results are shown from experiments with VHT, the Vertical Hoeffding Tree proposed in "VHT: Vertical Hoeffding Tree." N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.
Presented at Apache Big Data Europe 2015.
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
1. SAMOA: A Platform for Mining Big Data Streams
Nicolas Kourtellis
Associate Researcher
Telefonica I+D, Barcelona
2. What is Big Data?
Search queries
Facebook posts
Emails
Tweets
Photo shares
Clicks on ads
…
3. How BIG is your data?
Volume (+ Variety)
Too large for RAM of single commodity server
Velocity
Too fast for CPU of single commodity server
4. What is the Streaming Paradigm?
High amount of data, high speed of arrival
Models updated in (near) real time
Potentially infinite sequence of data
Change over time (concept drift)
5. Mining Big Data Streams
Approximation algorithms:
Single pass, one data item at a time
Sub-linear space and time per data item
Small error with high probability
A platform solution:
Support different algorithms & processing engines
Distributed
Scalable
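As one illustration of the approximation properties listed on this slide (single pass, sub-linear space, small error with high probability), here is a minimal Count-Min sketch in Python. It is not part of SAMOA and does not appear in the slides; the class name and the `epsilon` and `delta` parameters are assumptions chosen for the example.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate frequency counter (illustrative, not SAMOA code).

    With width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)), an
    estimate exceeds the true count by at most epsilon * N (N = total
    count added) with probability at least 1 - delta. It never
    underestimates.
    """

    def __init__(self, epsilon=0.01, delta=0.01):
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, item, row):
        # One salted hash per row stands in for independent hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Single pass: each item touches `depth` counters and is then discarded.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum over rows is the least-collided counter.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )
```

Memory depends only on `epsilon` and `delta`, not on the length of the stream, which is what makes this style of algorithm usable on potentially infinite input.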
6. What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
Framework for developing new distributed stream mining algorithms
Framework for deploying algorithms on new distributed stream processing engines
9. Why is SAMOA important?
Program once, run everywhere
Reuse existing infrastructure
Avoid deploy cycles
No system downtime
No complex backup/update process
No need to select update frequency
17. Case study: Decision Trees
VHT: Vertical Hoeffding Tree*
Task Parallelism
*VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.
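For context, a Hoeffding tree decides when a leaf has seen enough instances to split by using the Hoeffding bound: after n observations of a variable with range R, the observed mean lies within sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. The sketch below shows only that test. The function names are hypothetical, the tie-breaking threshold used by practical implementations is omitted, and none of this is taken from the SAMOA code base.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon for a variable with the given range after n observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon.

    The gains are the split-criterion values (for example information gain)
    of the two best candidate attributes at a leaf that has seen n instances.
    """
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_best_gain) > epsilon
```

As n grows the bound shrinks, so a leaf that cannot yet separate its two best attributes simply waits for more instances.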
18. Case study: VHT
Horizontal Parallelism
[Figure: horizontal parallelism diagram. Labels: Stream, Instances, Stats (x3), Histograms, Model, Model Updates.]
19. Case study: VHT
Vertical Parallelism
[Figure: vertical parallelism diagram. Labels: Stream, Model, Attributes, Stats (x3), Splits.]
20. Benefits of Vertical Parallelism
High number of attributes:
high level of parallelism (e.g., documents)
vs. task parallelism:
obvious parallelism observed
vs. horizontal parallelism:
reduced memory usage (no model replication)
parallelized split computation
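The difference between the two partitioning schemes can be sketched in a few lines of Python. This is a toy, in-memory illustration under assumed data shapes (each instance is a `(features, label)` pair), not the SAMOA implementation; the function names and the modulo routing rule are hypothetical.

```python
from collections import defaultdict


def horizontal_partition(instances, num_workers):
    """Give each worker whole instances (every worker sees all attributes)."""
    workers = [[] for _ in range(num_workers)]
    for i, instance in enumerate(instances):
        workers[i % num_workers].append(instance)
    return workers


def vertical_partition(instances, num_workers):
    """Give each worker a fixed subset of attributes from every instance."""
    workers = [defaultdict(list) for _ in range(num_workers)]
    for features, label in instances:
        for attr_index, value in enumerate(features):
            # The attribute index is the routing key, so all values of one
            # attribute always land on the same worker.
            workers[attr_index % num_workers][attr_index].append((value, label))
    return workers


instances = [([1, 0, 1, 0], "pos"), ([0, 1, 1, 0], "neg")]
by_instance = horizontal_partition(instances, 2)
by_attribute = vertical_partition(instances, 2)
```

In the vertical case each worker holds statistics for its own attributes only, so candidate splits for different attributes can be evaluated in parallel without replicating the model.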
21. Vertical Hoeffding Tree
[Figure: Vertical Hoeffding Tree topology. Processors: Source (n), Model (n), Stats (n), Evaluator (1). Streams: Instance, Control, Split, Result. Groupings: Shuffle Grouping, Key Grouping, All Grouping.]
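The three grouping types named in the topology follow the usual stream-processing semantics: shuffle spreads messages evenly over tasks, key sends all messages with the same key to the same task, and all broadcasts to every task. The Python sketch below illustrates those semantics only. The function names are hypothetical, round-robin stands in for random shuffling, and which stream of the VHT topology uses which grouping is not recoverable from the slide text.

```python
import itertools
import zlib


def shuffle_grouping(num_tasks):
    """Spread messages evenly over tasks (round-robin as a simple stand-in)."""
    counter = itertools.count()

    def route(_message):
        return [next(counter) % num_tasks]

    return route


def key_grouping(num_tasks):
    """Send every message with the same key to the same task."""

    def route(key):
        # crc32 is stable across runs, unlike Python's built-in hash() on str.
        return [zlib.crc32(str(key).encode("utf-8")) % num_tasks]

    return route


def all_grouping(num_tasks):
    """Broadcast each message to every task."""

    def route(_message):
        return list(range(num_tasks))

    return route
```

Each `route` function returns the list of destination task indices for one message, which makes the three strategies interchangeable in a test harness.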
22. Preliminary results: Tweets
Zipf skew: 1.5
Bag of words: 100, 1000, 10000 (attributes)
Size of tweet: ~15 words
Instances: 1,000,000
Class: positive or negative (Gaussian random variable)
10 runs
Local vs. Storm virtual cluster
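A synthetic tweet with these parameters could be generated roughly as follows. This is a guess at the setup rather than the generator used in the experiments: the function names are hypothetical, the seed is arbitrary, and class labels are left out because the slide does not say how the Gaussian random variable maps to a class.

```python
import random


def zipf_weights(vocab_size, skew=1.5):
    """Unnormalized Zipf weights: the word at rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, vocab_size + 1)]


def generate_tweet(rng, weights, num_words=15):
    """Draw word indices (0 = most frequent word) according to the weights."""
    return rng.choices(range(len(weights)), weights=weights, k=num_words)


rng = random.Random(42)        # fixed seed so repeated runs are comparable
weights = zipf_weights(1000)   # bag of words with 1000 attributes
tweet = generate_tweet(rng, weights)
```

Changing the argument of `zipf_weights` to 100 or 10000 gives the other two attribute counts listed on the slide.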
23. Results: Accuracy
[Figure: Classification accuracy vs. parallelism level vs. number of attributes. Y-axis: correct classification % (0 to 100). X-axis: parallelism level (4, 8, 16, local). Series: 100 words, 1000 words, 10000 words.]
24. Results: Speedup
[Figure: Speedup vs. parallelism level vs. number of attributes. Y-axis: speedup (0 to 5). X-axis: parallelism level (4, 8, 16). Series: 100 words, 1000 words, 10000 words.]
25. Is SAMOA for you?
Are you dealing with:
Big fast data?
Possibly endless streams of data?
Evolving data?
Do you need updated models in real time?
Do you want to test an algorithm on different DSPEs?
27. Status
Apache Incubator
Released version 0.3.0 in July
Execution Engines
Input:
Local FS
HDFS
Kafka [pending]
Parallel algorithms
Vertical Hoeffding Tree (classification)
CluStream (clustering)
Adaptive Model Rules (regression)
PARMA (frequent pattern mining) [pending]
Execution engines
Heron?
28. Algorithms in SAMOA
Existing:
Vertical Hoeffding Tree (classification)
CluStream (clustering)
Adaptive Model Rules (regression)
Pending:
Distributed Naïve Bayes
Stochastic Gradient Descent
Adaptive + Boosting VHT
Parallelized Gradient Boosted Decision Tree
PARMA (frequent pattern mining)
…
Check Samoa Roadmap for more
Looking for contributors!
29. SAMOA: A Platform for Mining Big Data Streams
@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa
Nicolas Kourtellis
@kourtellis
nicolas.kourtellis@telefonica.com