Analyzing and applying machine learning algorithms to a possibly infinite flow of data is a challenging task. This presentation introduces the SAMOA framework, which allows machine learning algorithms to be developed on top of any distributed stream processing engine. It also demonstrates the development and use of a distributed clustering algorithm based on CluStream on the Apache S4 platform.
Scalable Distributed Real-Time Clustering for Big Data Streams
1. Scalable Distributed Real-Time
Clustering for Big Data Streams
European Masters in Distributed Computing (EMDC)
Student
Antonio Severien
severien@yahoo-inc.com
Supervisors
Albert Bifet (Yahoo! Research)
Gianmarco De Francisci Morales (Yahoo! Research)
Marta Arias (Universitat Politecnica de Catalunya)
2. 27/06/13
Contributions
¤ SAMOA (Scalable Advanced Massive Online Analysis)
¤ Stream Processing Engine (SPE) abstraction framework
¤ Machine learning libraries adapter layer
¤ API for implementing data flow topologies
¤ SAMOA Clustering Algorithm
¤ Distributed stream clustering algorithm based on CluStream*
¤ Parallelizes the clustering task and scales up with resource usage
(*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al. 2003
3. Motivation
¤ How BIG is BIG in BIG Data???
¤ 2.5 quintillion bytes generated every day
¤ 90% of today's data was generated in the last 2 years
¤ Sensors, social networks, e-business, mobile, internet logs, etc.
¤ Problems… 3 Vs
¤ Storage is unviable due to massive Volume
¤ Production rate keeps increasing in Velocity
¤ Different sources, different data, different types mean Variety
4. Where is the Big Data?
¤ Where is the food?
¤ Databases?
¤ Data warehouses?
¤ Distributed databases?
¤ Distributed file systems?
¤ It’s flowing online! It’s Streaming!
5. Crunching Big Data
¤ Map and Reduce
¤ MapReduce/GFS
¤ Hadoop/HDFS
¤ Stream Processing Engines (SPE)
¤ Apache S4
¤ Twitter Storm
6. Distributed Systems
¤ Actors Model
¤ Independent concurrent processes
¤ Communicate asynchronously by message passing
¤ MapReduce Model
¤ Mappers: filtering and sorting
¤ Reducers: summarization and aggregation
¤ Large volume of data distributed
¤ Iterative: map-reduce-map-reduce…
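The mapper and reducer roles listed above can be illustrated on a single machine with Java streams. This is only a sketch of the programming model, not Hadoop code; the class `WordCountSketch` and its method are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-machine illustration of the map and reduce roles (hypothetical example).
class WordCountSketch {
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // "Map" step: split each line into words and drop empty tokens.
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // "Reduce" step: group equal words and aggregate a count per key.
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Prints {big=2, data=2, streams=1}
        System.out.println(countWords(List.of("big data streams", "big data")));
    }
}
```

In a real MapReduce deployment the two steps run on different machines and the grouping by key happens in the shuffle between them.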
7. Streaming
¤ Streaming Model
¤ One-pass processing: discard item after use
¤ Low memory usage: store statistics and summaries
¤ Unbounded flow of data
¤ Evolving data sets
¤ Limited processing time
¤ Arrival order is not guaranteed
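A minimal sketch of the one-pass, low-memory idea above: Welford's algorithm keeps a running mean and variance of a numeric stream while storing only three numbers, so each item can be discarded after use. The class name `RunningStats` is hypothetical and not part of SAMOA.

```java
// One-pass summary of a numeric stream using Welford's algorithm (hypothetical class).
class RunningStats {
    private long count;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    // Consume one item; the item itself is not stored.
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() { return count; }

    double mean() { return mean; }

    // Population variance; 0 when no item has been seen yet.
    double variance() { return count > 0 ? m2 / count : 0.0; }
}
```

Memory use stays constant no matter how many items flow through, which is the property the streaming model relies on.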
8. Making sense
¤ Machine Learning & Data Mining
¤ Make sense, extract patterns and react accordingly
¤ Train machines to “think”
¤ Perceive behavior
¤ Relations between similar information
¤ Unsupervised Learning
¤ Clustering algorithms
9. Machine Learning Tools
¤ Mahout
¤ Machine learning framework used on top of Hadoop/HDFS
¤ Batch processing with MapReduce model
¤ Open-source and good community support
¤ Massive Online Analysis (MOA)
¤ Stream machine learning tool
¤ Many algorithms implemented; based on WEKA
¤ Single machine constraint
¤ Jubatus
¤ Distributed streaming machine learning framework
¤ No clustering algorithms yet
¤ No stream platform abstraction
10. Scalable Advanced Massive Online Analysis (SAMOA)
¤ Distributed data streaming machine learning framework
¤ Stream Processing Engine (SPE) abstraction
¤ Code once, run everywhere
¤ Focus on distributed algorithm design
¤ Fault-tolerance, communication, consistency and
availability are provided by the underlying distributed
processing platform
¤ Initial release provides integration with:
¤ Apache S4
¤ Twitter Storm
11. Scalable Advanced Massive Online Analysis (SAMOA)
[Figure: SAMOA architecture. Labels: SAMOA Algorithms & SAMOA-API; ML Adapter (MOA, other ML libraries); SAMOA; SPE Adapter (S4, Storm, other SPE)]
15. (Apache S4)
¤ Distributed, semi fault-tolerant, stream processing
platform
¤ Based on the Actors model and inspired by the
MapReduce model
¤ Flexible data flow: any topology and processing unit
can be built, beyond the mapper and reducer design
¤ Specialized in processing events from a stream and
emitting events into a stream
16. Scalable Advanced Massive Online Analysis (SAMOA)
[Figure: a SAMOA topology (Task, stream source, EPI, PIs, Streams) mapped onto an S4 App (stream source, Streams, PEs)]
17. How to use?
¤ Adding SPE using API
¤ S4ProcessingItem: processing element wrapper
¤ S4Stream: wrapper for an S4 stream
¤ S4ComponentFactory: provides components specific to Apache
S4, such as processing elements and streams
¤ S4TopologyBuilder: creates the topology instances
¤ Adding algorithm and building topology
class SimpleTask {
    ...
    // Builder used to create and wire the topology components
    TopologyBuilder topologyBuilder = new TopologyBuilder();
    // Entrance PI: wraps the processor that feeds the topology
    EntranceProcessingItem entranceProcessingItem =
        topologyBuilder.createEntrancePI(new SourceProcessor());
    // Stream emitted by the entrance PI
    Stream stream = topologyBuilder.createStream(entranceProcessingItem);
    // Downstream PI that consumes the stream
    ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
    processingItem.connectInputKey(stream);
    ...
18. Grouping the Best of All
¤ Flexible programming model
¤ Distributed stream processing engine abstraction
¤ Integrated machine learning and data mining algorithms
¤ Easy API to implement new algorithms and SPE adapters
19. SAMOA Clustering Algorithm
¤ Distributed stream clustering algorithm
¤ Validate the SAMOA implementation
¤ Validate the integration with Apache S4 using the SAMOA-S4 adapter
¤ Deploy on Apache S4
20. Stream Clustering Algorithm
¤ CluStream Framework
¤ Based on k-means
¤ Online phase (micro-clustering)
¤ Offline phase (macro-clustering)
¤ k-means: partition a set of data into k distinct clusters
according to a similarity function
¤ Minimization of squared Euclidean distance objective
function:
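The formula itself did not survive in this transcript. The standard k-means objective, which is presumably what the slide showed, is written below; the symbols $C_j$ (cluster $j$), $\mu_j$ (its centroid) and $x_i$ (a data point) are notation chosen here, not taken from the slide.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```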
21. K-means Clustering Algorithm
¤ Advantages
¤ Simple, fast and efficient
¤ Known issues with k-means
¤ Sensitive to initial seeding
¤ Minimization problem is NP-hard even for simple
configurations
¤ e.g. 2 clusters, or points in the plane
¤ Global optimum not guaranteed
¤ Good for spherical clustering, not good for arbitrary shapes
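The sensitivity to initial seeding can be seen with a minimal sketch of Lloyd's iteration, the usual k-means heuristic, restricted to 1-dimensional points for brevity. The class `KMeans1D` and its methods are hypothetical; this is not the implementation used in SAMOA or MOA.

```java
import java.util.Arrays;

// Minimal Lloyd's iteration for k-means on 1-dimensional points (hypothetical sketch).
class KMeans1D {
    // Returns the final centroids after at most maxIter assign/update rounds.
    static double[] fit(double[] points, double[] initialCentroids, int maxIter) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIter; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: each point goes to its nearest centroid.
            for (double p : points) {
                int best = nearest(centroids, p);
                sum[best] += p;
                count[best]++;
            }
            // Update step: move each non-empty centroid to the mean of its points.
            double[] next = centroids.clone();
            for (int j = 0; j < centroids.length; j++) {
                if (count[j] > 0) next[j] = sum[j] / count[j];
            }
            if (Arrays.equals(next, centroids)) break; // converged
            centroids = next;
        }
        return centroids;
    }

    // Index of the centroid closest to p (ties go to the lower index).
    static int nearest(double[] centroids, double p) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++) {
            if (Math.abs(p - centroids[j]) < Math.abs(p - centroids[best])) best = j;
        }
        return best;
    }
}
```

On the points 0, 1, 10, 11, 20, 21 with k = 3, the seeds (0, 10, 20) converge to (0.5, 10.5, 20.5), while the seeds (0, 1, 10) get stuck at (0, 1, 15.5): a local optimum with a much larger squared error.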
22. Distributed Stream Clustering
¤ Online micro-clustering
¤ Applied in a local clustering phase
¤ Cluster Feature Vectors with Timestamp (CFT)
¤ N: number of data objects
¤ LS: linear sum of data objects
¤ SS: sum of squares of data objects
¤ LST: sum of timestamps
¤ SST: sum of squares of timestamps
¤ Offline macro-clustering
¤ Use of micro-clusters as weighted pseudo-points
¤ Applied in a global clustering phase with a weighted k-means
¤ Uses probabilistic seeding depending on the weighted
micro-clusters
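The CFT summary and its use as a weighted pseudo-point can be sketched as a small Java class. The class `MicroCluster` and its method names are hypothetical, and using N as the weight of the pseudo-point is an assumption about how the macro phase weights micro-clusters.

```java
// Cluster feature vector with timestamps (CFT), as listed above (hypothetical class).
class MicroCluster {
    long n;            // N: number of data objects
    final double[] ls; // LS: per-dimension linear sum
    final double[] ss; // SS: per-dimension sum of squares
    double lst;        // LST: sum of timestamps
    double sst;        // SST: sum of squares of timestamps

    MicroCluster(int dimensions) {
        ls = new double[dimensions];
        ss = new double[dimensions];
    }

    // Absorb one point; only the summary is updated, the point is discarded.
    void insert(double[] point, double timestamp) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += point[d];
            ss[d] += point[d] * point[d];
        }
        lst += timestamp;
        sst += timestamp * timestamp;
    }

    // CFTs are additive, so two micro-clusters merge by summing components.
    void merge(MicroCluster other) {
        n += other.n;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        lst += other.lst;
        sst += other.sst;
    }

    // Centroid LS / N: the pseudo-point handed to the macro-clustering phase.
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int d = 0; d < ls.length; d++) c[d] = ls[d] / n;
        return c;
    }

    // Assumed weight of the pseudo-point: the number of points absorbed.
    long weight() { return n; }
}
```

Because every component is a plain sum, micro-clusters built on different nodes can be merged without access to the original points, which is what makes the local and global split possible.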
24. Scalable Advanced Massive Online Analysis (SAMOA)
SAMOA Clustering Task
[Figure: clustering task topology. Labels: stream source, Distribution PI, Local Clustering PI, Global Clustering PI, Sampling PI, Evaluator PI, clustering output, evaluation output]
25. Experiments, Evaluation & Results
¤ Experimental Setup
¤ Four 2.4 GHz Intel Xeon dual quad-core, 48 GB RAM
¤ Process parallelism level: 1, 8 & 16
¤ Instance dimensions: 3 & 15
¤ Source dataset: random events generator
¤ Noise: 0% & 10%
¤ Cluster movement speed: move 0.1 unit every 500 & 12000 instances
¤ Evaluations
¤ Scalability: measure throughput when adding concurrent
processes
¤ Clustering quality: measure whether the clustering algorithm is
accurate
29. Clustering Quality Metrics
¤ Internal & External evaluations
¤ Internal evaluation uses attributes available from the clustering
structure.
¤ External evaluation uses external validation structures.
¤ ex.: ground truth provided by the source generator.
¤ Metrics
¤ Cohesion coefficient (SSE): measures the intra-cluster sum of
squared errors
¤ Separation coefficient (BSS): measures the between-cluster sum of squares
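A minimal sketch of both metrics for 1-dimensional clusters follows. It assumes the common definitions: SSE sums the squared distance of each point to its own cluster centroid, and BSS sums the size-weighted squared distance of each centroid to the overall mean. The class `ClusterQuality` is hypothetical and is not the evaluator used in the experiments.

```java
// Cohesion (SSE) and separation (BSS) for 1-dimensional clusters (hypothetical helper).
class ClusterQuality {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // SSE: squared distance of every point to its own cluster centroid.
    static double sse(double[][] clusters) {
        double total = 0;
        for (double[] cluster : clusters) {
            double c = mean(cluster);
            for (double x : cluster) total += (x - c) * (x - c);
        }
        return total;
    }

    // BSS: size-weighted squared distance of each centroid to the overall mean.
    static double bss(double[][] clusters) {
        double sum = 0;
        long n = 0;
        for (double[] cluster : clusters) {
            for (double x : cluster) sum += x;
            n += cluster.length;
        }
        double overall = sum / n;
        double total = 0;
        for (double[] cluster : clusters) {
            double c = mean(cluster);
            total += cluster.length * (c - overall) * (c - overall);
        }
        return total;
    }
}
```

For the clusters {1, 2} and {4, 5}, SSE is 1 and BSS is 9; their sum, 10, equals the total sum of squares around the overall mean, so a lower SSE goes hand in hand with a higher BSS on the same data.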
34. Conclusion
¤ There is important information in the massive amount of
data being produced and discarded
¤ There is a need for tools to deal with this efficiently
¤ Efforts have been made to crunch big data
¤ Interpreting and retrieving relevant information is where
machine learning and data mining operate
¤ Real-time analysis responds faster to evolving data
¤ SAMOA abstracts the platform and maintains the
algorithms; good to implement, test and use.
35. Acknowledgements
¤ Thanks to Erasmus Mundus and all three universities
(UPC, KTH and IST) for providing this opportunity
¤ Thanks to all the EMDC students
¤ Thanks to Yahoo! Research for the great project