Scalable Distributed Real-Time
Clustering for Big Data Streams
European Masters in Distributed Computing (EMDC)

Student
Antonio Severien
severien@yahoo-inc.com
Supervisors
Albert Bifet (Yahoo! Research)
Gianmarco De Francisci Morales (Yahoo! Research)
Marta Arias (Universitat Politecnica de Catalunya)
27/06/13

Contributions
¤  SAMOA (Scalable Advanced Massive Online Analysis)
¤  Stream Processing Engine (SPE) abstraction framework
¤  Machine learning libraries adapter layer
¤  API for implementing data flow topologies

¤  SAMOA Clustering Algorithm
¤  Distributed stream clustering algorithm based on CluStream*
¤  Parallelize clustering task and scale-up on resource usage

(*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al. 2003

Motivation
¤  How BIG is BIG in BIG Data???
¤  2.5 quintillion bytes are generated every day
¤  90% of today's data was generated in the last 2 years
¤  Sensors, social networks, e-business, mobile, internet logs, etc.

¤  Problems… 3 Vs
¤  Storage is unviable due to massive Volume
¤  The production rate keeps increasing in Velocity
¤  Different sources, different data and different types mean Variety


Where is the Big Data?
¤  Where is the food?
¤  Databases?
¤  Data warehouses?
¤  Distributed databases?
¤  Distributed file systems?
¤  It’s flowing online! It’s Streaming!


Crunching Big Data
¤  Map and Reduce
¤  MapReduce/GFS
¤  Hadoop/HDFS

¤  Stream Processing Engines (SPE)
¤  Apache S4
¤  Twitter Storm


Distributed Systems
¤  Actors Model
¤  Independent concurrent processes
¤  Communicate asynchronously by message passing

¤  MapReduce Model
¤  Mappers: filtering and sorting
¤  Reducers: summarization and aggregation
¤  Large volume of data distributed
¤  Iterative: map-reduce-map-reduce…


Streaming
¤  Streaming Model
¤  One-pass processing: discard item after use
¤  Low memory usage: store statistics and summaries
¤  Unbounded flow of data
¤  Evolving data sets
¤  Limited processing time
¤  Arrival order is not guaranteed
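
The one-pass, low-memory idea above can be sketched in a few lines of Java. RunningStats is a hypothetical helper written for this illustration (it is not part of SAMOA or MOA); it keeps only a count, a running mean and a sum of squared deviations (Welford's update), so each item can be discarded right after it is consumed.

```java
// Illustrative sketch only: RunningStats is a hypothetical helper, not a SAMOA or MOA class.
final class RunningStats {
    private long count;   // number of items seen so far
    private double mean;  // running mean of the items
    private double m2;    // running sum of squared deviations from the mean

    // Consume one item; the item itself is never stored.
    void update(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() { return count; }

    double mean() { return mean; }

    // Population variance of everything seen so far (0 for an empty stream).
    double variance() { return count == 0 ? 0.0 : m2 / count; }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double x : new double[] {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0}) {
            stats.update(x); // one pass, constant memory
        }
        System.out.println(stats.count() + " items, mean " + stats.mean()
                + ", variance " + stats.variance());
    }
}
```

Memory use stays constant no matter how long the stream runs, which is the property the streaming model relies on.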


Making sense
¤  Machine Learning & Data Mining
¤  Make sense, extract patterns and react accordingly
¤  Train machines to “think”
¤  Perceive behavior
¤  Relations between similar information

¤  Unsupervised Learning
¤  Clustering algorithms


Machine Learning Tools
¤  Mahout
¤  Machine learning framework used on top of Hadoop/HDFS
¤  Batch processing with MapReduce model
¤  Open-source and good community support

¤  Massive Online Analysis (MOA)
¤  Stream machine learning tool
¤  Many algorithms implemented; based on WEKA
¤  Single machine constraint

¤  Jubatus
¤  Distributed streaming machine learning framework
¤  No clustering algorithms yet
¤  No stream platform abstraction

Scalable Advanced Massive Online
Analysis (SAMOA)
¤  Distributed data streaming machine learning framework
¤  Stream Platform Engine Abstraction
¤  Code once, run everywhere
¤  Focus on distributed algorithm design
¤  Fault-tolerance, communication, consistency and
availability are provided by the underlying distributed
processing platform

¤  Initial release provides integration with:
¤  Apache S4
¤  Twitter Storm

Scalable Advanced Massive Online
Analysis (SAMOA)

[Figure: SAMOA architecture. Blocks shown: SAMOA Algorithms & SAMOA-API; ML Adapter (SAMOA, MOA, other ML libraries); SPE Adapter (S4, Storm, other SPEs).]

Apache S4
¤  Distributed, semi fault-tolerant stream processing platform
¤  Based on the Actors model and inspired by the MapReduce model
¤  Flexible data flow: any topology and processing unit can be built, beyond the mapper and reducer design
¤  Specialized in processing events from a stream and emitting events into a stream


Scalable Advanced Massive Online
Analysis (SAMOA)
[Figure: SAMOA Topology. A SAMOA task (stream source, entrance processing item (EPI), streams and processing items (PIs)) is mapped onto an S4 app (stream source, streams and processing elements (PEs)).]

How to use?
¤  Adding SPE using API
¤  S4ProcessingItem: processing element wrapper
¤  S4Stream: wrapper for an S4 stream
¤  S4ComponentFactory: provides components specific to Apache S4, such as processing elements and streams
¤  S4TopologyBuilder: creates the topology instances

¤  Adding algorithm and building topology
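    class SimpleTask {
        ...
        // Builder that assembles the processing items and streams of the task
        TopologyBuilder topologyBuilder = new TopologyBuilder();

        // Entrance PI: wraps the processor that feeds the source data in
        EntranceProcessingItem entranceProcessingItem =
            topologyBuilder.createEntrancePI(new SourceProcessor());

        // Stream leaving the entrance PI
        Stream stream = topologyBuilder.createStream(entranceProcessingItem);

        // Downstream PI that consumes the stream, routed by key
        ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
        processingItem.connectInputKey(stream);
        ...
    }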


Grouping the Best of All
¤  Flexible programming model
¤  Distributed stream processing engine abstraction
¤  Integrated machine learning and data mining algorithms
¤  Easy API to implement new algorithms and SPE adapters


SAMOA Clustering Algorithm
¤  Distributed stream clustering algorithm
¤  Validate the SAMOA implementation
¤  Validate the integration with Apache S4 using the SAMOA-S4 adapter
¤  Deploy on Apache S4


Stream Clustering Algorithm
¤  CluStream Framework
¤  Based on k-means
¤  Online phase (micro-clustering)
¤  Offline phase (macro-clustering)

¤  k-means: partition a set of data into k distinct clusters
according to a similarity function
¤  Minimization of squared Euclidean distance objective
function:
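
The formula itself did not survive in this transcript. As a hedged reconstruction, the standard k-means objective, which is presumably what the slide showed, can be written as below. The symbols are assumptions of this sketch: C_i is the i-th cluster and \mu_i is its centroid.

```latex
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```

k-means searches for the partition C_1, ..., C_k that makes J as small as possible.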


K-means Clustering Algorithm
¤  Advantages
¤  Simple, fast and efficient

¤  Known issues with k-means
¤  Sensitive to initial seeding
¤  Minimization problem is NP-hard even for simple configurations
¤  e.g. two clusters in arbitrary dimension, or points in the plane
¤  Global optimum not guaranteed
¤  Good for spherical clustering, not good for arbitrary shapes


Distributed Stream Clustering
¤  Online micro-clustering
¤  Apply on a local clustering phase
¤  Cluster Feature Vectors with Timestamp (CFT)
¤  N: number of data objects
¤  LS: linear sum of data objects
¤  SS: sum of squares of data objects
¤  LST: sum of timestamps
¤  SST: sum of squares of timestamps

¤  Offline macro-clustering
¤  Use of micro-clusters as weighted pseudo-points
¤  Apply on a global clustering phase with a weighted k-means
¤  Uses probabilistic seeding depending on the weighted
micro-clusters
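
The cluster feature vector described above can be sketched as a small Java class. MicroCluster and its method names are hypothetical and chosen for this illustration; this is not the SAMOA or MOA implementation. The sketch shows why the five sums are convenient: inserting a point and merging two micro-clusters are simple additions, and the centroid LS / N can serve as a pseudo-point of weight N for the offline phase.

```java
import java.util.Arrays;

// Illustrative sketch only: MicroCluster is a hypothetical class, not SAMOA's or MOA's code.
final class MicroCluster {
    long n;             // N: number of data objects
    final double[] ls;  // LS: per-dimension linear sum of data objects
    final double[] ss;  // SS: per-dimension sum of squares of data objects
    double lst;         // LST: sum of timestamps
    double sst;         // SST: sum of squares of timestamps

    MicroCluster(int dimensions) {
        ls = new double[dimensions];
        ss = new double[dimensions];
    }

    // Absorb one point: only the sums change, the point itself is discarded.
    void insert(double[] point, double timestamp) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += point[d];
            ss[d] += point[d] * point[d];
        }
        lst += timestamp;
        sst += timestamp * timestamp;
    }

    // Additivity: merging two micro-clusters is a component-wise sum.
    void merge(MicroCluster other) {
        n += other.n;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        lst += other.lst;
        sst += other.sst;
    }

    // Centroid LS / N; assumes at least one point has been inserted.
    double[] center() {
        double[] c = new double[ls.length];
        for (int d = 0; d < c.length; d++) {
            c[d] = ls[d] / n;
        }
        return c;
    }

    public static void main(String[] args) {
        MicroCluster a = new MicroCluster(2);
        a.insert(new double[] {1.0, 2.0}, 1.0);
        a.insert(new double[] {3.0, 4.0}, 2.0);
        MicroCluster b = new MicroCluster(2);
        b.insert(new double[] {5.0, 6.0}, 3.0);
        a.merge(b);
        // The macro-clustering phase would treat this as one pseudo-point of weight n.
        System.out.println("weight " + a.n + ", center " + Arrays.toString(a.center()));
    }
}
```

Because merging is additive, local clustering nodes could ship only these sums to a global node rather than raw points. That is one plausible reading of the local/global split on this slide, not a claim about how SAMOA implements it.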

CluStream Snapshot
[Figure: CluStream snapshot showing micro-clusters, macro-clusters and ground truth.]

Scalable Advanced Massive Online
Analysis (SAMOA)
[Figure: SAMOA Clustering Task. Components shown under Clustering: stream source, Distribution PI, Local Clustering PI, Global Clustering PI, output. Components shown under Evaluation: Sampling PI, Evaluator PI, output.]

Experiments, Evaluation & Results
¤  Experimental Setup
¤  Four 2.4 GHz Intel Xeon dual quad-core, 48 GB RAM
¤  Process parallelism level: 1, 8 & 16
¤  Instance dimensions: 3 & 15
¤  Source dataset: random events generator
¤  Noise: 0% & 10%
¤  Cluster movement speed: move 0.1 unit every 500 & 12000 instances

¤  Evaluations
¤  Scalability: measure throughput when adding concurrent processes
¤  Clustering quality: measure whether the clustering algorithm is accurate

Scalability

[Figure: Baseline Comparison. Throughput (instances/second) per evaluation step.]

Scalability

[Figure: Average Throughput with Dimensions 3 and 15. Average throughput (instances/second) versus process parallelism.]

Scalability

[Figure: Parallelism Throughput with Dimension 3. Average cumulative throughput (instances/second) versus process parallelism.]

Clustering Quality Metrics
¤  Internal & External evaluations
¤  Internal evaluation uses attributes available from the clustering
structure.
¤  External evaluation uses external validation structures.
¤  ex.: ground truth provided by the source generator.

¤  Metrics
¤  Cohesion coefficient (SSE): measures the intra-cluster sum of squared errors
¤  Separation coefficient (BSS): measures the between-cluster sum of squares
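
The two metrics can be sketched as follows, assuming the usual textbook definitions: SSE sums the squared distance of every point to its own cluster centroid, and BSS sums, per cluster, the cluster size times the squared distance between the cluster centroid and the overall mean. ClusterQuality is a hypothetical helper written for this illustration and is not the evaluator used in the experiments.

```java
// Illustrative sketch only: ClusterQuality is a hypothetical helper, not SAMOA's evaluator.
final class ClusterQuality {

    // Component-wise mean of a non-empty set of points.
    static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < m.length; d++) {
                m[d] += p[d];
            }
        }
        for (int d = 0; d < m.length; d++) {
            m[d] /= points.length;
        }
        return m;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Cohesion: squared distance of every point to its own cluster centroid.
    static double sse(double[][][] clusters) {
        double sum = 0.0;
        for (double[][] cluster : clusters) {
            double[] centroid = mean(cluster);
            for (double[] p : cluster) {
                sum += squaredDistance(p, centroid);
            }
        }
        return sum;
    }

    // Separation: size-weighted squared distance of each centroid to the overall mean.
    static double bss(double[][][] clusters) {
        int dims = clusters[0][0].length;
        double[] global = new double[dims];
        long total = 0;
        for (double[][] cluster : clusters) {
            for (double[] p : cluster) {
                for (int d = 0; d < dims; d++) {
                    global[d] += p[d];
                }
                total++;
            }
        }
        for (int d = 0; d < dims; d++) {
            global[d] /= total;
        }
        double sum = 0.0;
        for (double[][] cluster : clusters) {
            sum += cluster.length * squaredDistance(mean(cluster), global);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][][] clusters = {
            {{0.0, 0.0}, {0.0, 2.0}},
            {{4.0, 0.0}, {4.0, 2.0}}
        };
        System.out.println("SSE = " + sse(clusters) + ", BSS = " + bss(clusters));
    }
}
```

Lower SSE means tighter clusters and higher BSS means better separated ones. For a fixed data set the two add up to the total sum of squares, so they are usually read together.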


Clustering Quality 0% Noise
[Figure: clustering snapshots after 25,000 and 45,000 instances.]

Clustering Quality 0% Noise
Ratio = BSS / GT


Clustering Quality 10% Noise
[Figure: clustering snapshots after 25,000 and 45,000 instances, with examples of good and poor clustering marked.]

Clustering Quality 10% Noise


Conclusion
¤  There is important information in the massive amount of data being produced and discarded
¤  There is a need for tools to deal with this efficiently
¤  Efforts have been made to crunch big data
¤  Interpreting and retrieving relevant information is where machine learning and data mining operate
¤  Real-time analysis responds faster to evolving data
¤  SAMOA abstracts the platform and keeps the algorithms unchanged across platforms; good to implement, test and use

Acknowledgements
¤  Thanks to Erasmus Mundus and all three universities (UPC, KTH and IST) for providing this opportunity
¤  Thanks to all the EMDC students
¤  Thanks to Yahoo! Research for the great project

35

Contenu connexe

Tendances

Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013MLconf
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streamsKrish_ver2
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learningStanley Wang
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learningViet-Trung TRAN
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
 
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
A TALE of DATA PATTERN DISCOVERY IN PARALLELA TALE of DATA PATTERN DISCOVERY IN PARALLEL
A TALE of DATA PATTERN DISCOVERY IN PARALLELJenny Liu
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTDavide Gallitelli
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ...
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ...Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ...
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ...Otávio Carvalho
 
Deep recurrent neutral networks for Sequence Learning in Spark
Deep recurrent neutral networks for Sequence Learning in SparkDeep recurrent neutral networks for Sequence Learning in Spark
Deep recurrent neutral networks for Sequence Learning in SparkDataWorks Summit/Hadoop Summit
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream ProcessingZbigniew Jerzak
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetJ On The Beach
 
Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Egbert Gramsbergen
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016ijcsbi
 
Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesNatalino Busa
 
Introduction to neural networks and Keras
Introduction to neural networks and KerasIntroduction to neural networks and Keras
Introduction to neural networks and KerasJie He
 

Tendances (20)

18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
CS267_Graph_Lab
CS267_Graph_LabCS267_Graph_Lab
CS267_Graph_Lab
 
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
A TALE of DATA PATTERN DISCOVERY IN PARALLELA TALE of DATA PATTERN DISCOVERY IN PARALLEL
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDT
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ...
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ...Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ...
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ...
 
Deep recurrent neutral networks for Sequence Learning in Spark
Deep recurrent neutral networks for Sequence Learning in SparkDeep recurrent neutral networks for Sequence Learning in Spark
Deep recurrent neutral networks for Sequence Learning in Spark
 
Clustering
ClusteringClustering
Clustering
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert Bifet
 
Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.Elag 2012 - Under the hood of 3TU.Datacentrum.
Elag 2012 - Under the hood of 3TU.Datacentrum.
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016
 
Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologies
 
Introduction to neural networks and Keras
Introduction to neural networks and KerasIntroduction to neural networks and Keras
Introduction to neural networks and Keras
 

Similaire à Scalable Distributed Real-Time Clustering for Big Data Streams

Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph✔ Eric David Benari, PMP
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Dataconomy Media
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
 
Stream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data MicroservicesStream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data Microservicesmarius_bogoevici
 
Scientific
Scientific Scientific
Scientific marpierc
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesIan Foster
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Anomaly Detection at Scale
Anomaly Detection at ScaleAnomaly Detection at Scale
Anomaly Detection at ScaleJeff Henrikson
 
Swisscom Network Analytics
Swisscom Network AnalyticsSwisscom Network Analytics
Swisscom Network Analyticsconfluent
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsPetr Novotný
 
Parallel machines flinkforward2017
Parallel machines flinkforward2017Parallel machines flinkforward2017
Parallel machines flinkforward2017Nisha Talagala
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Scalable AutoML for Time Series Forecasting using Ray
Scalable AutoML for Time Series Forecasting using RayScalable AutoML for Time Series Forecasting using Ray
Scalable AutoML for Time Series Forecasting using RayDatabricks
 
Galaxy
GalaxyGalaxy
Galaxybosc
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Spark Summit
 

Similaire à Scalable Distributed Real-Time Clustering for Big Data Streams (20)

Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
 
Apache edgent
Apache edgentApache edgent
Apache edgent
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
NextGenML
NextGenML NextGenML
NextGenML
 
Stream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data MicroservicesStream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data Microservices
 
Scientific
Scientific Scientific
Scientific
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Anomaly Detection at Scale
Anomaly Detection at ScaleAnomaly Detection at Scale
Anomaly Detection at Scale
 
Swisscom Network Analytics
Swisscom Network AnalyticsSwisscom Network Analytics
Swisscom Network Analytics
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big Graphs
 
Parallel machines flinkforward2017
Parallel machines flinkforward2017Parallel machines flinkforward2017
Parallel machines flinkforward2017
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Scalable AutoML for Time Series Forecasting using Ray
Scalable AutoML for Time Series Forecasting using RayScalable AutoML for Time Series Forecasting using Ray
Scalable AutoML for Time Series Forecasting using Ray
 
Galaxy
GalaxyGalaxy
Galaxy
 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
 

Dernier

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Dernier (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand

Scalable Distributed Real-Time Clustering for Big Data Streams

  • 1. Scalable Distributed Real-Time Clustering for Big Data Streams European Masters in Distributed Computing (EMDC) Student Antonio Severien severien@yahoo-inc.com Supervisors Albert Bifet (Yahoo! Research) Gianmarco De Francisci Morales (Yahoo! Research) Marta Arias (Universitat Politecnica de Catalunya)
  • 2. 27/06/13 Contributions ¤  SAMOA (Scalable Advanced Massive Online Analysis) ¤  Stream Processing Engine (SPE) abstraction framework ¤  Machine learning libraries adapter layer ¤  API for implementing data flow topologies ¤  SAMOA Clustering Algorithm ¤  Distributed stream clustering algorithm based on CluStream* ¤  Parallelizes the clustering task and scales with the available resources (*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al. 2003
  • 3. Motivation ¤  How BIG is BIG in BIG Data? ¤  2.5 quintillion bytes are generated every day ¤  90% of today's data was generated in the last 2 years ¤  Sensors, social networks, e-business, mobile, internet logs, etc. ¤  Problems: the 3 Vs ¤  Storage is unviable due to the massive Volume ¤  The production rate keeps increasing (Velocity) ¤  Different sources, different data and different types mean Variety
  • 4. Where is the Big Data? ¤  Where is the food? ¤  Databases? ¤  Data warehouses? ¤  Distributed databases? ¤  Distributed file systems? ¤  It’s flowing online! It’s Streaming!
  • 5. Crunching Big Data ¤  Map and Reduce ¤  MapReduce/GFS ¤  Hadoop/HDFS ¤  Stream Processing Engines (SPE) ¤  Apache S4 ¤  Twitter Storm
  • 6. Distributed Systems ¤  Actors Model ¤  Independent concurrent processes ¤  Communicate asynchronously by message passing ¤  MapReduce Model ¤  Mappers: filtering and sorting ¤  Reducers: summarization and aggregation ¤  Large volumes of distributed data ¤  Iterative: map-reduce-map-reduce…
  • 7. Streaming ¤  Streaming Model ¤  One-pass processing: discard item after use ¤  Low memory usage: store statistics and summaries ¤  Unbounded flow of data ¤  Evolving data sets ¤  Limited processing time ¤  Arrival order is not guaranteed
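The one-pass, low-memory constraints of the streaming model can be illustrated with a small sketch. It assumes a stream of plain numeric values and uses Welford's running mean and variance as the stored summary; the class name `RunningStats` and its methods are hypothetical and are not part of SAMOA or MOA.

```java
/**
 * One-pass running mean and variance (Welford's method).
 * Hypothetical helper for illustration only; not part of SAMOA or MOA.
 */
final class RunningStats {
    private long count;   // number of items seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Consumes one item; the item itself can be discarded afterwards. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Sample variance; defined as 0 for fewer than two items. */
    double variance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }
}
```

Memory use stays constant (three fields) regardless of how many items flow through, which is the property the streaming model asks for.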
  • 8. Making sense ¤  Machine Learning & Data Mining ¤  Make sense, extract patterns and react accordingly ¤  Train machines to “think” ¤  Perceive behavior ¤  Relations between similar information ¤  Unsupervised Learning ¤  Clustering algorithms
  • 9. Machine Learning Tools ¤  Mahout ¤  Machine learning framework used on top of Hadoop/HDFS ¤  Batch processing with MapReduce model ¤  Open-source and good community support ¤  Massive Online Analysis (MOA) ¤  Stream machine learning tool ¤  Many algorithms implemented; based on WEKA ¤  Single machine constraint ¤  Jubatus ¤  Distributed streaming machine learning framework ¤  No clustering algorithms yet ¤  No stream platform abstraction
  • 10. Scalable Advanced Massive Online Analysis (SAMOA) ¤  Distributed data streaming machine learning framework ¤  Stream Processing Engine (SPE) abstraction ¤  Code once, run everywhere ¤  Focus on distributed algorithm design ¤  Fault-tolerance, communication, consistency and availability are provided by the underlying distributed processing platform ¤  Initial release provides integration with: ¤  Apache S4 ¤  Twitter Storm
  • 11. Scalable Advanced Massive Online Analysis (SAMOA) [Architecture diagram. Labels: SAMOA Algorithms & SAMOA-API; ML Adapter (SAMOA, MOA, other ML libraries); SPE Adapter (S4, Storm, other SPE)]
  • 15. Apache S4 ¤  Distributed, semi fault-tolerant, stream processing platform ¤  Based on the Actors model and inspired by the MapReduce model ¤  Flexible data flow: any topology and processing unit can be built, not only the mappers-and-reducers design ¤  Specialized in processing events from a stream and emitting events into a stream
  • 16. Scalable Advanced Massive Online Analysis (SAMOA) [Diagram: a SAMOA topology mapped to an S4 app. SAMOA side labels: Task, Stream Source, EPI, Stream, PI. S4 side labels: Stream Source, Stream, PE]
  • 17. How to use? ¤  Adding an SPE using the API ¤  S4ProcessingItem: processing element wrapper ¤  S4Stream: wrapper for an S4 stream ¤  S4ComponentFactory: provides components specific to Apache S4, such as processing elements and streams ¤  S4TopologyBuilder: creates the topology instances ¤  Adding an algorithm and building a topology:
        class SimpleTask {
          ...
          TopologyBuilder topologyBuilder = new TopologyBuilder();
          // The entrance processing item feeds the topology from the source.
          EntranceProcessingItem entranceProcessingItem =
              topologyBuilder.createEntrancePI(new SourceProcessor());
          Stream stream = topologyBuilder.createStream(entranceProcessingItem);
          // A regular processing item consumes the stream.
          ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
          processingItem.connectInputKey(stream);
          ...
        }
  • 18. Grouping the Best of All ¤  Flexible programming model ¤  Distributed stream processing engine abstraction ¤  Integrated machine learning and data mining algorithms ¤  Easy API to implement new algorithms and SPE adapters
  • 19. SAMOA Clustering Algorithm ¤  Distributed stream clustering algorithm ¤  Validates the SAMOA implementation ¤  Validates the integration with Apache S4 using the SAMOA-S4 adapter ¤  Deployed on Apache S4
  • 20. Stream Clustering Algorithm ¤  CluStream Framework ¤  Based on k-means ¤  Online phase (micro-clustering) ¤  Offline phase (macro-clustering) ¤  k-means: partition a set of data into k distinct clusters according to a similarity function ¤  The objective is to minimize the sum of squared Euclidean distances between the points and their cluster centers
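The equation shown on the slide did not survive in the transcript. For reference, the standard k-means objective is usually written as follows; the symbols are the conventional textbook ones and are an assumption here, not a transcription of the slide: $C_j$ is the j-th cluster, $\mu_j$ its centroid, and $x_i$ a data point.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```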
  • 21. K-means Clustering Algorithm ¤  Advantages ¤  Simple, fast and efficient ¤  Known issues with k-means ¤  Sensitive to initial seeding ¤  Minimization problem is NP-hard even for simple configurations ¤  e.g. two clusters in general dimension, or points in the plane ¤  Global optimum not guaranteed ¤  Good for spherical clusters, not good for arbitrary shapes
  • 22. Distributed Stream Clustering ¤  Online micro-clustering ¤  Applied in the local clustering phase ¤  Cluster Feature Vectors with Timestamp (CFT) ¤  N: number of data objects ¤  LS: linear sum of data objects ¤  SS: sum of squares of data objects ¤  LST: sum of timestamps ¤  SST: sum of squares of timestamps ¤  Offline macro-clustering ¤  Uses micro-clusters as weighted pseudo-points ¤  Applied in the global clustering phase with a weighted k-means ¤  Uses probabilistic seeding depending on the weighted micro-clusters
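The CFT summary listed on this slide (N, LS, SS, LST, SST) can be sketched as a small Java class. This is a minimal illustration under assumptions, not the SAMOA or MOA implementation: the class name `MicroCluster`, its method names, and the choice of per-dimension arrays for LS and SS are all hypothetical. It shows the two properties the slide relies on: each point is absorbed in one pass, and two summaries can be merged by adding their fields, so a micro-cluster can later act as a weighted pseudo-point.

```java
/**
 * Sketch of a cluster feature vector with timestamps (CFT).
 * All names are hypothetical; this is not the SAMOA/MOA class.
 */
final class MicroCluster {
    private final double[] ls; // LS: per-dimension linear sum of the points
    private final double[] ss; // SS: per-dimension sum of squares of the points
    private long n;            // N: number of points absorbed
    private double lst;        // LST: sum of timestamps
    private double sst;        // SST: sum of squares of timestamps

    MicroCluster(int dimensions) {
        ls = new double[dimensions];
        ss = new double[dimensions];
    }

    /** Absorbs one point; the point can be discarded afterwards. */
    void insert(double[] point, long timestamp) {
        for (int d = 0; d < ls.length; d++) {
            ls[d] += point[d];
            ss[d] += point[d] * point[d];
        }
        n++;
        lst += timestamp;
        sst += (double) timestamp * timestamp;
    }

    /** Merges another summary of the same dimensionality (all fields are additive). */
    void merge(MicroCluster other) {
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        n += other.n;
        lst += other.lst;
        sst += other.sst;
    }

    /** Center LS / N, usable as a pseudo-point; NaN entries if the cluster is empty. */
    double[] center() {
        double[] c = new double[ls.length];
        for (int d = 0; d < c.length; d++) {
            c[d] = ls[d] / n;
        }
        return c;
    }

    /** Weight of the pseudo-point in a weighted k-means. */
    long weight() {
        return n;
    }

    /** Mean arrival timestamp LST / N. */
    double meanTimestamp() {
        return lst / n;
    }
}
```

In a macro-clustering step, `center()` and `weight()` would be the inputs to the weighted k-means, so the global phase never needs the raw points.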
  • 24. Scalable Advanced Massive Online Analysis (SAMOA) [Diagram: SAMOA clustering task. Clustering labels: Stream Source, Distribution PI, Local Clustering PI, Global Clustering PI, Output. Evaluation labels: Sampling PI, Evaluator PI, Output]
  • 25. Experiments, Evaluation & Results ¤  Experimental Setup ¤  Four 2.4 GHz Intel Xeon dual quad-core machines, 48 GB RAM ¤  Process parallelism level: 1, 8 & 16 ¤  Instance dimensions: 3 & 15 ¤  Source dataset: random events generator ¤  Noise: 0% & 10% ¤  Cluster movement speed: move 0.1 units every 500 & 12,000 instances ¤  Evaluations ¤  Scalability: measure throughput when adding concurrent processes ¤  Clustering quality: measure whether the clustering algorithm is accurate
  • 27. Scalability [Plot: Average Throughput with Dimensions 3 and 15. Axes: Process Parallelism; Average Throughput (instances/second)]
  • 28. Scalability [Plot: Parallelism Throughput with Dimension 3. Axes: Process Parallelism; Avg. Cumulative Throughput (instances/sec)]
  • 29. Clustering Quality Metrics ¤  Internal & External evaluations ¤  Internal evaluation uses attributes available from the clustering structure ¤  External evaluation uses external validation structures ¤  e.g. ground truth provided by the source generator ¤  Metrics ¤  Cohesion coefficient (SSE): measures the intra-cluster sum of squared errors ¤  Separation coefficient (BSS): measures the inter-cluster (between) sum of squares
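The two metrics named on this slide can be sketched in Java using their usual textbook definitions, which are assumed here rather than taken from the deck: SSE sums the squared distance of every point to its own cluster centroid, and BSS sums, over clusters, the cluster size times the squared distance between the cluster centroid and the overall mean. The class `ClusterQuality` and its methods are hypothetical and are not SAMOA's evaluator; clusters are assumed non-empty and of equal dimensionality.

```java
/**
 * Cohesion (SSE) and separation (BSS) for a hard clustering.
 * Hypothetical helper for illustration; not SAMOA's evaluator.
 * Input layout: clusters[i][j] is the j-th point of cluster i.
 */
final class ClusterQuality {

    /** Mean of a non-empty set of points. */
    static double[] centroid(double[][] points) {
        double[] c = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < c.length; d++) {
                c[d] += p[d];
            }
        }
        for (int d = 0; d < c.length; d++) {
            c[d] /= points.length;
        }
        return c;
    }

    /** Squared Euclidean distance between two points. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Cohesion: squared distance of each point to its own centroid, summed. */
    static double sse(double[][][] clusters) {
        double total = 0.0;
        for (double[][] cluster : clusters) {
            double[] c = centroid(cluster);
            for (double[] p : cluster) {
                total += squaredDistance(p, c);
            }
        }
        return total;
    }

    /** Separation: size-weighted squared distance of each centroid to the overall mean. */
    static double bss(double[][][] clusters) {
        double[] overall = new double[clusters[0][0].length];
        long count = 0;
        for (double[][] cluster : clusters) {
            for (double[] p : cluster) {
                for (int d = 0; d < overall.length; d++) {
                    overall[d] += p[d];
                }
                count++;
            }
        }
        for (int d = 0; d < overall.length; d++) {
            overall[d] /= count;
        }
        double total = 0.0;
        for (double[][] cluster : clusters) {
            total += cluster.length * squaredDistance(centroid(cluster), overall);
        }
        return total;
    }
}
```

With these definitions SSE + BSS equals the total sum of squares around the overall mean, so for a fixed data set a lower SSE goes together with a higher BSS.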
  • 30. Clustering Quality, 0% Noise [Figures: snapshot at 25,000 instances; snapshot at 45,000 instances]
  • 31. Clustering Quality, 0% Noise [Plot: Ratio = BSS / GT]
  • 32. Clustering Quality, 10% Noise [Figures: snapshot at 25,000 instances; snapshot at 45,000 instances. Annotations: good clustering, poor clustering]
  • 34. Conclusion ¤  There is important information in the massive amount of data being produced and discarded ¤  There is a need for tools to deal with this efficiently ¤  Efforts have been made to crunch big data ¤  Interpreting and retrieving relevant information is where machine learning and data mining operate ¤  Real-time analysis responds faster to evolving data ¤  SAMOA abstracts the platform and maintains the algorithms; good to implement, test and use
  • 35. Acknowledgements ¤  Thanks to Erasmus Mundus and all three universities (UPC, KTH and IST) for providing this opportunity ¤  Thanks to all the EMDC students ¤  Thanks to Yahoo! Research for the great project