Scalable Distributed Real-Time Clustering for Big Data Streams

European Masters in Distributed Computing (EMDC)

Student
Antonio Severien
severien@yahoo-inc.com
Supervisors
Albert Bifet (Yahoo! Research)
Gianmarco De Francisci Morales (Yahoo! Research)
Marta Arias (Universitat Politecnica de Catalunya)

27/06/13

Contributions
¤  SAMOA (Scalable Advanced Massive Online Analysis)
¤  Stream Processing Engine (SPE) abstraction framework
¤  Machine learning libraries adapter layer
¤  API for implementing data flow topologies

¤  SAMOA Clustering Algorithm
¤  Distributed stream clustering algorithm based on CluStream*
¤  Parallelizes the clustering task and scales with the available resources

(*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al. 2003

Motivation
¤  How BIG is BIG in BIG Data???
¤  2.5 quintillion bytes generated every day
¤  90% of today's data was generated in the last 2 years
¤  Sensors, social networks, e-business, mobile, internet logs, etc.

¤  Problems… 3 Vs
¤  Storage is unviable due to massive Volume
¤  Production rate keeps increasing: Velocity
¤  Different sources, different data, different types: Variety


Where is the Big Data?
¤  Where is the food?
¤  Databases?
¤  Data warehouses?
¤  Distributed databases?
¤  Distributed file systems?
¤  It’s flowing online! It’s Streaming!


Crunching Big Data
¤  Map and Reduce
¤  MapReduce/GFS
¤  Hadoop/HDFS

¤  Stream Processing Engines (SPE)
¤  Apache S4
¤  Twitter Storm


Distributed Systems
¤  Actors Model
¤  Independent concurrent processes
¤  Communicate asynchronously by message passing

¤  MapReduce Model
¤  Mappers: filtering and sorting
¤  Reducers: summarization and aggregation
¤  Large volumes of data, distributed across machines
¤  Iterative: map-reduce-map-reduce…


Streaming
¤  Streaming Model
¤  One-pass processing: discard item after use
¤  Low memory usage: store statistics and summaries
¤  Unbounded flow of data
¤  Evolving data sets
¤  Limited processing time
¤  Arrival order is not guaranteed
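
The one-pass, low-memory constraints above can be illustrated with a running summary that folds each item into a few numbers and then discards it. This is a minimal sketch using Welford's update for mean and variance; the class and method names are hypothetical and are not part of SAMOA or MOA.

```java
// Illustrative sketch: one-pass summary of a numeric stream (Welford's method).
// RunningStats and its methods are hypothetical names, not SAMOA or MOA API.
final class RunningStats {
    private long count;   // number of items seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Folds one item into the summary; the item itself is not stored. */
    void update(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() { return count; }

    double mean() { return mean; }

    /** Population variance of everything seen so far (0 for an empty stream). */
    double variance() { return count == 0 ? 0.0 : m2 / count; }
}

final class RunningStatsDemo {
    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double x : new double[] {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0}) {
            stats.update(x); // each value is seen once, then discarded
        }
        System.out.println(stats.count() + " items, mean=" + stats.mean()
                + ", variance=" + stats.variance());
    }
}
```

Memory use stays constant no matter how many items arrive, which is the property the streaming model relies on.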


Making sense
¤  Machine Learning & Data Mining
¤  Make sense, extract patterns and react accordingly
¤  Train machines to “think”
¤  Perceive behavior
¤  Relations between similar information

¤  Unsupervised Learning
¤  Clustering algorithms


Machine Learning Tools
¤  Mahout
¤  Machine learning framework used on top of Hadoop/HDFS
¤  Batch processing with MapReduce model
¤  Open-source and good community support

¤  Massive Online Analysis (MOA)
¤  Stream machine learning tool
¤  Many algorithms implemented; based on WEKA
¤  Single machine constraint

¤  Jubatus
¤  Distributed streaming machine learning framework
¤  No clustering algorithms yet
¤  No stream platform abstraction

Scalable Advanced Massive Online Analysis (SAMOA)
¤  Distributed data streaming machine learning framework
¤  Stream Processing Engine (SPE) abstraction
¤  Code once, run everywhere
¤  Focus on distributed algorithm design
¤  Fault-tolerance, communication, consistency and
availability are provided by the underlying distributed
processing platform

¤  Initial release provides integration with:
¤  Apache S4
¤  Twitter Storm

[Figure: SAMOA architecture: SAMOA Algorithms & SAMOA-API; ML Adapter (SAMOA, MOA, other ML libraries); SPE Adapter (S4, Storm, other SPEs)]

( Apache S4 )
¤  Distributed, partially fault-tolerant stream processing
platform
¤  Based on the Actors model and inspired by the
MapReduce model
¤  Flexible data flow: any topology and processing unit
can be built, not only the mappers-and-reducers design
¤  Specialized in processing events from a stream and
emitting events into a stream


SAMOA Topology
[Figure: a SAMOA Task (stream source, entrance PI (EPI), streams, PIs) mapped onto an S4 App (stream source, streams, PEs)]

How to use?
¤  Adding an SPE using the API
¤  S4ProcessingItem: processing element wrapper
¤  S4Stream: wrapper for an S4 stream
¤  S4ComponentFactory: provides components specific to Apache
S4, such as processing elements and streams
¤  S4TopologyBuilder: creates the topology instances

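¤  Adding an algorithm and building a topology
class SimpleTask {
    ...
    // Builder that creates and wires the topology components
    TopologyBuilder topologyBuilder = new TopologyBuilder();

    // Entrance PI: entry point of the topology, wrapping the source processor
    EntranceProcessingItem entranceProcessingItem =
        topologyBuilder.createEntrancePI(new SourceProcessor());

    // Stream fed by the entrance PI
    Stream stream = topologyBuilder.createStream(entranceProcessingItem);

    // Processing item wrapping the algorithm's processor
    ProcessingItem processingItem = topologyBuilder.createPI(new Processor());

    // Connect the stream as input to the processing item
    processingItem.connectInputKey(stream);
    ...
}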


Grouping the Best of All
¤  Flexible programming model
¤  Distributed stream processing engine abstraction
¤  Integrated machine learning and data mining algorithms
¤  Easy API to implement new algorithms and SPE adapters


SAMOA Clustering Algorithm
¤  Distributed stream clustering algorithm
¤  Validates the SAMOA implementation
¤  Integration with Apache S4 using the SAMOA-S4 adapter
¤  Deployment on Apache S4


Stream Clustering Algorithm
¤  CluStream Framework
¤  Based on k-means
¤  Online phase (micro-clustering)
¤  Offline phase (macro-clustering)

¤  k-means: partition a set of data into k distinct clusters
according to a similarity function
¤  Minimization of squared Euclidean distance objective
function:
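
In its standard textbook form the objective can be written as follows. The notation is assumed here for illustration: clusters $C_1, \dots, C_k$, data points $x_i$, and centroids $\mu_j$.

```latex
\[
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
\]
```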


K-means Clustering Algorithm
¤  Advantages
¤  Simple, fast and efficient

¤  Known issues with k-means
¤  Sensitive to initial seeding
¤  Minimization problem is NP-hard even for simple
configurations
¤  e.g., points in the plane (2 dimensions) with a general k
¤  Global optimum not guaranteed
¤  Works well for spherical clusters, poorly for arbitrary shapes
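
For reference, the basic k-means iteration (Lloyd's algorithm) alternates an assignment step and a centroid update step. The sketch below is a plain single-machine version, not the SAMOA implementation; the class name, method names and sample data are hypothetical. The initial centroids are passed in explicitly because the result depends on them.

```java
import java.util.Arrays;

// Illustrative sketch of Lloyd's k-means iteration; all names are hypothetical.
final class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the centroid closest to the point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++) {
            if (squaredDistance(point, centroids[j]) < squaredDistance(point, centroids[best])) {
                best = j;
            }
        }
        return best;
    }

    /**
     * Runs Lloyd iterations from the given initial centroids until assignments
     * stop changing or maxIterations is reached. Expects a non-empty point set.
     * Returns the final centroids; the input arrays are not modified.
     */
    static double[][] fit(double[][] points, double[][] initialCentroids, int maxIterations) {
        int k = initialCentroids.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int j = 0; j < k; j++) {
            centroids[j] = initialCentroids[j].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int j = nearest(points[i], centroids);
                if (j != assignment[i]) {
                    assignment[i] = j;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: move every centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[j][d] = sums[j][d] / counts[j];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10} };
        double[][] seeds = { {0, 0}, {10, 10} };
        System.out.println(Arrays.deepToString(fit(points, seeds, 100)));
    }
}
```

Each iteration can only lower or keep the objective, so the procedure converges, but only to a local optimum that depends on the seeds.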


Distributed Stream Clustering
¤  Online micro-clustering
¤  Applied in a local clustering phase
¤  Cluster Feature Vectors with Timestamp (CFT)
¤  N: number of data objects
¤  LS: linear sum of data objects
¤  SS: sum of squares of data objects
¤  LST: sum of timestamps
¤  SST: sum of squares of timestamps

¤  Offline macro-clustering
¤  Uses micro-clusters as weighted pseudo-points
¤  Applied in a global clustering phase with a weighted k-means
¤  Uses probabilistic seeding depending on the weighted
micro-clusters
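
The five CFT components are all additive, which is what lets a micro-cluster absorb points one at a time and lets two micro-clusters be merged. The sketch below shows one way to hold such a summary; the field names follow the slide, while the class and its methods are hypothetical and are not taken from the SAMOA or MOA code.

```java
// Illustrative sketch of a cluster feature vector with timestamps (CFT).
// Fields follow the slide (N, LS, SS, LST, SST); class and methods are hypothetical.
final class MicroCluster {
    long n;            // N: number of data objects
    final double[] ls; // LS: per-dimension linear sum of data objects
    final double[] ss; // SS: per-dimension sum of squares of data objects
    double lst;        // LST: sum of timestamps
    double sst;        // SST: sum of squares of timestamps

    MicroCluster(int dimensions) {
        ls = new double[dimensions];
        ss = new double[dimensions];
    }

    /** Absorbs one point; only the summary changes, the point is not kept. */
    void insert(double[] point, double timestamp) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += point[d];
            ss[d] += point[d] * point[d];
        }
        lst += timestamp;
        sst += timestamp * timestamp;
    }

    /** Merges another summary of the same dimension; every field is additive. */
    void merge(MicroCluster other) {
        n += other.n;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        lst += other.lst;
        sst += other.sst;
    }

    /** Centroid LS / N (requires n > 0); usable as a pseudo-point of weight N. */
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int d = 0; d < ls.length; d++) {
            c[d] = ls[d] / n;
        }
        return c;
    }

    /** Root-mean-square deviation of the absorbed points from the centroid (requires n > 0). */
    double radius() {
        double variance = 0.0;
        for (int d = 0; d < ls.length; d++) {
            double mean = ls[d] / n;
            variance += ss[d] / n - mean * mean;
        }
        return Math.sqrt(Math.max(0.0, variance)); // guard against rounding below zero
    }

    /** Mean arrival time of the absorbed points (requires n > 0). */
    double meanTimestamp() {
        return lst / n;
    }
}
```

In the offline phase, `centroid()` and `n` would supply the weighted pseudo-points that the global weighted k-means works on.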

CluStream Snapshot
[Figure: snapshot panels showing micro-clusters, macro-clusters and ground truth]


SAMOA Clustering Task
[Figure: clustering task topology. Clustering: Stream Source, Distribution PI, Local Clustering PI, Global Clustering PI, output. Evaluation: Sampling PI, Evaluator PI, output.]


Experiments, Evaluation & Results
¤  Experimental Setup
¤  Four 2.4 GHz Intel Xeon dual quad-core machines, 48 GB RAM
¤  Process parallelism level: 1, 8 & 16
¤  Instance dimensions: 3 & 15
¤  Source dataset: random events generator
¤  Noise: 0% & 10%
¤  Cluster movement speed: 0.1 units every 500 & 12,000 instances

¤  Evaluations
¤  Scalability: measure throughput when adding concurrent
processes
¤  Clustering quality: measure whether the clustering algorithm is
accurate

Scalability

[Figure: Baseline Comparison. Throughput (instances/second) per evaluation step]

Scalability

[Figure: Average Throughput with Dimensions 3 and 15. Average throughput (instances/second) vs. process parallelism]

Scalability

[Figure: Parallelism Throughput with Dimension 3. Average cumulative throughput (instances/second) vs. process parallelism]

Clustering Quality Metrics
¤  Internal & External evaluations
¤  Internal evaluation uses attributes available from the clustering
structure.
¤  External evaluation uses external validation structures.
¤  e.g., ground truth provided by the source generator.

¤  Metrics
¤  Cohesion coefficient (SSE): measures the within-cluster sum of
squared errors
¤  Separation coefficient (BSS): measures the between-cluster sum of squares.
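
Both metrics can be computed directly from a clustering. The sketch below assumes the common textbook definitions, which the slides do not spell out: SSE sums the squared distances from each point to its cluster centroid, and BSS sums, over clusters, the cluster size times the squared distance from its centroid to the overall mean. The class name, method names and data layout are hypothetical.

```java
// Illustrative sketch of cohesion (SSE) and separation (BSS); all names are hypothetical.
// A clustering is given as clusters[j][i][d]: cluster j, point i, dimension d.
final class ClusterQuality {

    /** Component-wise mean of a non-empty set of points. */
    static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < m.length; d++) {
                m[d] += p[d];
            }
        }
        for (int d = 0; d < m.length; d++) {
            m[d] /= points.length;
        }
        return m;
    }

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Cohesion: squared distance of every point to its own cluster centroid, summed. */
    static double sse(double[][][] clusters) {
        double total = 0.0;
        for (double[][] cluster : clusters) {
            double[] centroid = mean(cluster);
            for (double[] p : cluster) {
                total += squaredDistance(p, centroid);
            }
        }
        return total;
    }

    /** Separation: size-weighted squared distance of each centroid to the overall mean. */
    static double bss(double[][][] clusters) {
        int dim = clusters[0][0].length;
        double[] overall = new double[dim];
        int n = 0;
        for (double[][] cluster : clusters) {
            for (double[] p : cluster) {
                for (int d = 0; d < dim; d++) {
                    overall[d] += p[d];
                }
                n++;
            }
        }
        for (int d = 0; d < dim; d++) {
            overall[d] /= n;
        }
        double total = 0.0;
        for (double[][] cluster : clusters) {
            total += cluster.length * squaredDistance(mean(cluster), overall);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][][] clusters = {
            { {0, 0}, {2, 0} },
            { {10, 0}, {12, 0} }
        };
        System.out.println("SSE=" + sse(clusters) + ", BSS=" + bss(clusters));
    }
}
```

Under these definitions a lower SSE means tighter clusters and a higher BSS means better separated ones, and SSE plus BSS equals the total sum of squares around the overall mean.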


Clustering Quality 0% Noise
Snapshot 25,000 instances



Ratio = BSS / GT




Good clustering
Poor clustering




Conclusion
¤  There is important information in the massive amount of
data being produced and discarded
¤  There is a need for tools to deal with this efficiently
¤  Efforts have been made to crunch big data
¤  Interpreting and retrieving relevant information is where
machine learning and data mining operate
¤  Real-time analysis responds faster to evolving data
¤  SAMOA abstracts the platform and maintains the
algorithms, making them easy to implement, test and use.

Acknowledgements
¤  Thanks to Erasmus Mundus and all three universities
(UPC, KTH and IST) for providing this opportunity
¤  Thanks to all the EMDC students
¤  Thanks to Yahoo! Research for the great project
