A TALE OF DATA PATTERN DISCOVERY
IN PARALLEL
Yan Liu
Yan.iu@concordia.ca
Concordia University
Montreal, Quebec, Canada
Is Parallelism Necessary?
“I want to democratize AI.”
AI experts & HPC experts work
side by side
“… democratizing the processes
underlying the creation of AI
systems…”
Data Parallelism vs Model Parallelism
 Data Parallelism – when the data is too large
 Partition the workload over multiple devices.
 Assume there are n workers (devices).
 Each worker receives a copy of the complete model
 and processes the model on 1/n of the data.
 Model Parallelism – when the model is too large
 Each worker/device holds only part of the model
 E.g. LSTM recurrent neural networks: each LSTM layer is assigned to one GPU
 No contention to update a shared model at the end of each iteration
 Most of the communication happens when passing intermediate results between GPUs.
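The data-parallel case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-parameter model y = w * x, the learning rate, and the use of threads in place of real devices are all hypothetical choices, not something the slides specify.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(data, n):
    """Split data into n contiguous, near-equal partitions."""
    size, rem = divmod(len(data), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        parts.append(data[start:end])
        start = end
    return parts


def gradient_sum(w, part):
    """Sum of squared-error gradients of the toy model y = w * x on one partition."""
    return sum(2 * (w * x - y) * x for x, y in part)


def data_parallel_step(w, data, n_workers, lr=0.01):
    """One training step: same model copy on every worker, 1/n of the data each."""
    parts = shard(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(lambda p: gradient_sum(w, p), parts))
    # Combine the partial gradients before updating the shared model.
    grad = sum(partial) / len(data)
    return w - lr * grad


# Toy data generated from y = 3x (assumption for illustration only).
data = [(x, 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, n_workers=4)
```

The only synchronization point is the gradient combination at the end of each step, which is where contention on the shared model would appear in a real deployment.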
Key Factors of Parallel Analysis Pipelines
Algorithmic factors
 Discovery models
 Model parallelism
 Distance metrics
Data Parallelism factors
 Partition
 Load balancing
 Data locality and shuffling
Architecture factors
 Batch vs. Streaming
 Microservices
 DevOps
Quality factors
 Accuracy (comparing between algorithms
and ground truth)
 Scalability (throughput, latency, data
intensity)
 Stability
Parallel Programming Models
and Frameworks
Programming Model
MapReduce
 MapReduce
 Map: <k1, v1> → list<k2, v2>
 Shuffle
 Reduce: <k2, list(v2)> → list<k3, v3>
 Iterative algorithms
 In Hadoop, large overheads are incurred
by reading and writing data to stable
storage between iterations
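The three phases can be illustrated with an in-memory word count. This is a single-process sketch of the programming model only, not Hadoop code; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: <k1, v1> -> list<k2, v2>; here <file, text> -> [(word, 1), ...]."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same intermediate key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: <k2, list(v2)> -> list<k3, v3>; here the total count per word."""
    return [(word, sum(counts))]


def map_reduce(docs):
    mapped = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    grouped = shuffle(mapped)
    return dict(kv for key, values in grouped.items() for kv in reduce_phase(key, values))


docs = {"a.txt": "to be or not to be", "b.txt": "to do"}
counts = map_reduce(docs)
```

An iterative algorithm would call `map_reduce` once per iteration; in Hadoop each call writes its output to stable storage, which is the overhead noted above.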
Spark – Resilient Distributed Datasets
 A data parallel programming model
for fault tolerant distributed datasets
 Partitioned collections with caching
 Transformations (define new RDDs),
actions (compute results)
 Restricted shared variables (broadcast,
accumulators)
Distributed Stream Processing
 S4 and Storm
Count the occurrences of each
word in these files.
Ensemble of Models in Parallel
Ensemble of Models in Learning
One dataset can have a number
of algorithms to analyze it.
Each algorithm has a number of
hyperparameters to configure.
The training data can be
organized as different input
structures.
Multi-class
Classification
Regression
Real Time Model
Selection
A Microservice Architecture of Multiple ML Models
CASE 1 : Select A Better Deep Learner in Classification
 Fit with multiple data scenarios
 Devices were deployed long ago and run as legacy equipment.
Their types can be unknown due to a lack of
documentation.
 From measurement data collected in the field,
identify the device type for each segment
of device connection along a full-length
physical link.
 Accurate classification is essential for
further network configuration and capacity
optimization.
• Real world data collected by an industry
vendor
• 600,000 data samples as inputs
• Each sample contains 6241 features
• 6 types of classification as output
Problem Set Data Set
CASE 1 : Select A Better Deep Learner in Classification
Solution
 Device Ensemble of 2 neural network
models running on two GPU nodes
▪ Convolutional neural network (CNN)
▪ Residual network (ResNet)
 Embed cross validation for hyperparameter
tuning in each model
 Find a more accurate model (e.g. 33%
accuracy vs 92% accuracy).
 Find an accurate model faster (e.g. 100
epochs vs. 5 epochs, 20x faster).
Result
Solution: Data → CNN on GPU Node 1, ResNet on GPU Node 2
Result:
• App scenario A: 5 epochs produce 100% validation accuracy vs. 100 epochs produce 100% validation accuracy
• App scenario B: 10,000 training data samples produce 33% validation accuracy vs. 600,000 training samples produce 92% validation accuracy
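The selection step of such an ensemble can be sketched as follows. This is a hedged illustration, not the deployed system: the two lambda "models" are hypothetical stand-ins for the trained CNN and ResNet, threads stand in for the two GPU nodes, and the validation set is synthetic.

```python
from concurrent.futures import ThreadPoolExecutor


def accuracy(predict, samples):
    """Fraction of validation samples whose predicted label matches the true label."""
    return sum(predict(x) == label for x, label in samples) / len(samples)


def evaluate(candidate, validation):
    name, predict = candidate
    return name, accuracy(predict, validation)


def select_best(candidates, validation):
    """Evaluate every candidate concurrently and keep the most accurate one."""
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        scores = list(pool.map(lambda c: evaluate(c, validation), candidates))
    return max(scores, key=lambda s: s[1]), scores


# Synthetic validation data: label is 1 when x >= 5 (assumption for illustration).
validation = [(x, int(x >= 5)) for x in range(10)]
candidates = [
    ("model_a", lambda x: int(x >= 5)),  # stand-in for one trained network
    ("model_b", lambda x: int(x >= 8)),  # stand-in for the other
]
best, scores = select_best(candidates, validation)
```

In the case study the candidates would also differ in hyperparameters and training-set size, so the same selection loop applies per data scenario.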
Deep Learning and Customer Access to Credit
Article from Forbes.com on Feb 20, 2017
“We noticed a couple of years ago,” says Peter Maynard, Senior Vice President of Global
Analytics at Equifax, “that we were not getting enough statistical lift from our traditional
credit scoring methodology.”
“We spend a lot of time creating segments to build a model on. Determining the optimal
segment could take sometimes 20% of the time that it takes to build a model. In the context
of neural nets, those segments are the hidden layers—the neural net does it all for you. The
machine is figuring out what are the segments and what are the weights in a segment instead
of having an analyst do that. I find it really powerful.”
Instead of being hypotheses developed by data scientists, now the attributes are created by
the deep learning process, on the basis of a much larger set of historical or “trended” data.
https://www.forbes.com/sites/gilpress/2017/02/20/equifax-and-sas-leverage-ai-and-deep-learning-to-improve-consumer-access-to-credit/
Data Parallelism
Data Parallelism – Control the Degree of Parallelism
Partition size = Math.max(minSize, Math.min(goalSize, blockSize))
minSize: Hadoop parameter mapreduce.input.fileinputformat.split.minsize (1 byte)
blockSize: dfs.block.size (128M) or fs.local.block.size (32M); according to Hadoop 2.0, the default partition size
goalSize: totalInputSize / numPartitions
Raising numPartitions decreases the partition size (increases the number of partitions).
Raising mapreduce.input.fileinputformat.split.minsize increases the partition size (decreases the number of partitions).
Control the Degree of Parallelism through Partition
Example partitions
Input Files:
1987.csv - Size = 124183 KB
1988.csv - Size = 489297 KB
Scenario 1: Default
Partition size = 32MB, Num of Partitions= 19
1987.csv [Partition 0, 1, 2] size = 32MB;
[Partition 3] size = 25 MB;
1988.csv [Partition 4,5, …17] size = 32MB;
[Partition 18] size = 29 MB;
Scenario 2: Decrease partition size
Partition size = 19MB, Num of Partitions= 30
1987.csv [Partition 0, 1, …5] size = 19MB;
[Partition 6] size = 21 MB;
1988.csv [Partition 7, 8, …28] size = 19MB;
[Partition 29] size = 18 MB;
* Partition 6 contains data from both
1987.csv and 1988.csv
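The partition-size rule and the Scenario 1 numbers can be checked with a short sketch. Assumptions are labeled in the comments: the default numPartitions of 2 is hypothetical (the slides do not state it), and the split counter uses plain ceiling division rather than any framework-specific handling of small trailing splits.

```python
KIB = 1024
MIB = 1024 * 1024


def split_size(min_size, goal_size, block_size):
    """Partition size = max(minSize, min(goalSize, blockSize)), all in bytes."""
    return max(min_size, min(goal_size, block_size))


def count_splits(file_size, size):
    """Number of partitions for one file, using simple ceiling division."""
    return -(-file_size // size)


files = {"1987.csv": 124183 * KIB, "1988.csv": 489297 * KIB}
total = sum(files.values())

# Scenario 1: local block size of 32 MiB, minSize of 1 byte,
# and an assumed default of numPartitions = 2.
default_size = split_size(1, total // 2, 32 * MIB)
default_counts = {name: count_splits(size, default_size) for name, size in files.items()}

# Scenario 2: asking for 30 partitions lowers goalSize below the block size.
smaller_size = split_size(1, total // 30, 32 * MIB)
```

With these inputs the block size wins in Scenario 1 and goalSize wins in Scenario 2, which is why raising numPartitions shrinks the partitions.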
Data Skew
 Unbalanced workloads tend to
dominate the overall delay
 The default hash partitioning does not
always distribute the data uniformly
 Some partitions contain more elements
on the reducer side than others
 Optimization technique: further break
down the skewed partitions into
sub-partitions
Load Balancing
 Proposed solution:
1. Add a random number x in [0,
n], where n is the number of
partitions, as a prefix to the
key, such that k_new = x_k_old
2. Hash partition with k_new
3. Process each partition
4. Remove the prefix
5. Perform further operations
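The five steps can be sketched as a two-stage count. This is an illustrative sketch, not the authors' implementation: the CRC32-based hash partitioner, the seeded random generator, and drawing the prefix from 0 to n-1 are assumptions made here for a reproducible example.

```python
import random
import zlib
from collections import Counter


def partition_of(key, n):
    """Deterministic hash partitioner (CRC32 is an assumption for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % n


def add_salt(key, n, rng):
    """Step 1: prefix the key with a random number, k_new = x_k_old."""
    return f"{rng.randrange(n)}_{key}"


def remove_salt(salted_key):
    """Step 4: strip the random prefix to recover the original key."""
    return salted_key.split("_", 1)[1]


def salted_count(keys, n, seed=0):
    rng = random.Random(seed)
    partitions = [Counter() for _ in range(n)]
    for key in keys:
        salted = add_salt(key, n, rng)
        # Steps 2-3: hash partition on the salted key and count within the partition.
        partitions[partition_of(salted, n)][salted] += 1
    total = Counter()
    for partial in partitions:
        for salted, count in partial.items():
            # Steps 4-5: remove the prefix and merge the partial results.
            total[remove_salt(salted)] += count
    loads = [sum(p.values()) for p in partitions]
    return total, loads


# One hot key dominates the input (synthetic data).
keys = ["hot"] * 1000 + ["a", "b", "c"] * 10
counts, loads = salted_count(keys, n=8)
```

The final merge is a second, much smaller aggregation, which is the price paid for spreading the hot key over several partitions.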
Data Aggregation Methods
 Approach: Minimize data shuffling by tuning the join operation
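Reduce-side join (shuffle join): mappers (M) read Dataset A and Dataset B, and the records are shuffled before the join produces the output.
Map-side join: the smaller dataset is collected and broadcast through the distributed cache; the mappers (M) over Dataset A perform a local join in each worker node and produce the output.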
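The difference between the two joins can be sketched in plain Python. This is a single-process illustration with hypothetical function names and toy data; the dictionary in `map_side_join` plays the role of the broadcast copy of the smaller dataset.

```python
from collections import defaultdict


def reduce_side_join(a, b):
    """Shuffle join: both datasets are grouped by key before joining."""
    groups = defaultdict(lambda: ([], []))
    for key, value in a:
        groups[key][0].append(value)
    for key, value in b:
        groups[key][1].append(value)
    return sorted((k, va, vb) for k, (vas, vbs) in groups.items() for va in vas for vb in vbs)


def map_side_join(large_partitions, small):
    """Broadcast join: each partition of the large dataset joins locally."""
    lookup = defaultdict(list)  # stands in for the broadcast copy on every worker
    for key, value in small:
        lookup[key].append(value)
    out = []
    for part in large_partitions:
        # No shuffle of the large dataset: the join happens where the data lives.
        out.extend((k, va, vb) for k, va in part for vb in lookup.get(k, []))
    return sorted(out)


a = [(1, "x"), (2, "y"), (1, "z"), (3, "w")]
b = [(1, "one"), (2, "two")]
joined_shuffle = reduce_side_join(a, b)
joined_broadcast = map_side_join([a[:2], a[2:]], b)
```

Both functions return the same rows; the map-side variant only pays off when the smaller dataset fits in each worker's memory.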
Case 2 : Network Analytics
Feature Selection and Clustering Analysis
 Part of a supervised machine learning analysis to classify anomalies in a dataset from a
telecommunication network
 Data sets
- 305 078 rows represent devices
- 275 columns/features represent device ports
- Each cell measures the device port reading at a timestamp
 Preprocessing before feeding into the neural network classifier
 Feed the classifier only what is truly necessary rather than blindly throwing in all features
 Clustering is unsupervised learning and does not require prior knowledge of the data
 Increase robustness against noise or bad-quality data
 Avoid scalability bottlenecks with bigger datasets
 Accelerate iterations of the model-building process
Two Analysis Pipelines
 Identify which features are strongly correlated at various aggregation levels
Pipeline I : Using the agglomerative neighbor joining clustering algorithm
 NJ algorithm was applied to the feature correlation matrix of size 275*275
Pipeline II : Using PCA and DBSCAN as another clustering algorithm
 Using Dynamic Time Warping (DTW) to measure the distance between any two time series of a
feature
 Feed the DTW distances into PCA + DBSCAN clustering
Visualization of Neighbor Join Clustering
Clustering of the correlations of 275 features in a cladogram. The length of each branch in the
tree represents the distance (d = 1 - |correlation|) between nodes
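The distance d = 1 - |correlation| can be computed per feature pair as sketched below. The use of the Pearson coefficient is an assumption (the slides only say "correlation"), and the feature names and readings are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_distances(features):
    """Upper-triangle distance tuples ((i, j), d) with d = 1 - |correlation|."""
    names = sorted(features)
    return {
        (a, b): 1 - abs(pearson(features[a], features[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical port readings over four timestamps.
features = {
    "p1": [1, 2, 3, 4],
    "p2": [2, 4, 6, 8],
    "p3": [4, 3, 2, 1],
    "p4": [1, -1, -1, 1],
}
distances = correlation_distances(features)
```

Strongly correlated and strongly anti-correlated features both end up close together, while uncorrelated ones sit at distance 1. A constant feature would need special handling, since its standard deviation is zero.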
Neighbor Joining
Clustering method for the creation of phylogenetic trees
- Takes distance matrix as input
- Initialize all nodes of the tree
- Calculate Q Matrix based on specific formula
- Find Smallest Q value
- Join Pair of nodes corresponding to smallest Q
- Update Original Distance matrix with new Joined Node
- Repeat until Tree completed
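One iteration of these steps can be sketched as follows. The slides do not spell out the Q formula; the sketch assumes the standard neighbor-joining criterion Q(i, j) = (n - 2) d(i, j) - Σ_k d(i, k) - Σ_k d(j, k) and the usual update d(u, k) = (d(i, k) + d(j, k) - d(i, j)) / 2. The five-node distance matrix is a textbook-style example, not data from the talk.

```python
from itertools import combinations


def dist(d, a, b):
    """Look up a distance stored only once, as an upper-triangle tuple."""
    return d[(a, b)] if (a, b) in d else d[(b, a)]


def q_matrix(d, nodes):
    """Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)."""
    n = len(nodes)
    row_sum = {a: sum(dist(d, a, b) for b in nodes if b != a) for a in nodes}
    return {
        (a, b): (n - 2) * dist(d, a, b) - row_sum[a] - row_sum[b]
        for a, b in combinations(nodes, 2)
    }


def join_step(d, nodes, new_name):
    """Join the pair with the smallest Q and rebuild the distance tuples."""
    q = q_matrix(d, nodes)
    a, b = min(q, key=q.get)
    rest = [k for k in nodes if k not in (a, b)]
    new_d = {(x, y): dist(d, x, y) for x, y in combinations(rest, 2)}
    for k in rest:
        # Distance from the joined node to every remaining node.
        new_d[(new_name, k)] = (dist(d, a, k) + dist(d, b, k) - dist(d, a, b)) / 2
    return (a, b), new_d, [new_name] + rest


nodes = ["a", "b", "c", "d", "e"]
d = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
pair, d2, nodes2 = join_step(d, nodes, "u")
```

Repeating `join_step` until two nodes remain yields the tree; branch lengths are omitted here to keep the sketch short.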
NJ Example
Neighbor Joining- Distance Matrix Generation
In-memory key-value tuples, e.g. ((index_i, index_j), correlation coefficient)
[Figure: standard 6x6 distance matrix data structure, where each cell contains a distance corresponding to an (i, j) pair.]
[Figure: upper-triangular 6x6 matrix.] The upper triangle holds all the information we need. Hence we can discard the tuples below the diagonal to reduce unnecessary computation.
Neighbor Joining - Distance Matrix Generation
[Figure: the upper-triangular distance matrix at each step of one iteration.]
1. Min Q is found at indexes 1 and 6.
2. Remove any cell with either i or j equal to 1 or 6.
3. Nodes 1 and 6 are “joined” as node “U”. Place this node “U” at the smallest index position, 1 in this case.
4. The grey boxes are the new distance values between node U and the remaining nodes 2 to 5.
5. If any cells have an index bigger than the index removed by the min Q join, decrement their indexes by 1. None in this case.
Neighbor Joining in Spark
[Figure: description, visualization and Spark transformations for each step.]
Step i: Compute distance matrix
Step ii: Calculate Q and find min Q
Step v-vi: Update distance matrix for next iteration (the distance matrix is updated recursively)
Step vii: Set up other variables for next iteration
Spark operations shown: collectAsMap, map, filter, subtractByKey, union, lookup, reduceByKey
Iterative algorithm and data dependencies between stages
Optimize each iteration of the NJ algorithm
Evaluation on a Cluster
[Figure: Run time vs. number of cores. x-axis: number of cores (0 to 60); y-axis: runtime in minutes (0 to 9); the optimal point is marked.]
Execution Time Decomposition
Time distribution across Spark's metrics – 10-core cluster
Values as labeled on the chart: 42.62%, 23.99%, 25.17%, 7.60%, 0.63%
Legend: Executor Computing Time, Scheduler Delay, Task Deserialization Time, Shuffle R/W Time, Result Serialization Time
Time distribution across Spark's metrics – 20-core cluster
Values as labeled on the chart: 38.97%, 24.56%, 18.51%, 17.18%, 0.78%
Legend: Executor Computing Time, Scheduler Delay, Task Deserialization Time, Shuffle R/W Time, Result Serialization Time
Run-time Repartition
[Figure: Partition method effects on run time at different intervals of parallel NJ. x-axis: number of iterations before repartition (1 to 6); y-axis: run time in seconds (0 to 3.5); series: Coalesce (shuffle = false) and Repartition (shuffle = true); the previous optimal and new optimal points are marked.]
DTW & Dimension Reduction + DBSCAN
Sequence alignment algorithm to measure
the similarity of time series as their distance
 Dynamic Time Warping
 Fast DTW: an approximation of standard
DTW with O(n) complexity
Input format: [label string, vector]
Label Vector
X (1) (2) … (n)
Y (1) (2) … (n)
Z (1) (2) … (n)
… … … … …
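For reference, standard DTW can be written as a small dynamic program. This sketch is the exact quadratic-time algorithm, not the FastDTW approximation used in the pipeline, and the absolute difference as the point-wise cost is an assumption.

```python
def dtw_distance(xs, ys):
    """Standard dynamic time warping distance between two numeric series."""
    inf = float("inf")
    n, m = len(xs), len(ys)
    # cost[i][j] is the best alignment cost of xs[:i] against ys[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(xs[i - 1] - ys[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # xs advances, ys repeats a point
                cost[i][j - 1],      # ys advances, xs repeats a point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


same = dtw_distance([1, 2, 3], [1, 2, 3])
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

The stretched series still has distance zero, which is the property that makes DTW attractive for comparing port readings that drift in time. FastDTW reaches a similar result in roughly linear time by refining a coarse alignment.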
Fast Dynamic Time Warping in Spark
[Figure: description, visualization and Spark transformations for each step.]
1. Generate tuples of indexed pairs of sequences, e.g. ((i, j), (X, Y)), where i and j are the pairwise indexes of the upper triangle and X, Y are the time series
2. Distance calculation
3. Filling the distance matrix
4. Formatting the RDD for input to PCA
5. PCA
Spark operations shown: count, zipWithIndex, map, parallelize, groupByKey, sortByKey, mapValues
Evaluation on a Cluster
[Figure: FastDTW-PCA-DBSCAN: run time vs. number of cores. x-axis: number of cores (0 to 80); y-axis: runtime in minutes (0 to 7).]
Observing Distance Metrics Effect on Run-Time
CASE 3 : Trajectory Grouping Pattern Discovery in Parallel
Stream processing on a real-time data set
Problem Set
 Design scalable, distributed processing
analytics that discover patterns of trajectories
moving together over a large and continuous
trajectory data stream.
Data Set
 GeoLife, a real-world public GPS trajectory
dataset collected by Microsoft, containing
178 real users’ outdoor activities from April
2007 to October 2011.
 17,621 trajectories and over 20 million
location records.
Solution
 Parallel system architecture of
Apache Spark Streaming clusters
running on up to 20 AWS nodes
 Run an ensemble of algorithms
▪ Snapshot model
▪ Slot model with two distance
measures for clustering
Result
▪ Process up to 30,000 updates per
second of moving objects within 14
seconds on an AWS cluster.
CASE 3 : Trajectory Grouping Pattern Discovery in Parallel
Snapshot Model : Gathering
Each snapshot consists of moving objects
from all trajectories which have the same
timestamped-location points.
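Building snapshots amounts to regrouping trajectory points by timestamp, as the sketch below shows. The record layout (timestamp, latitude, longitude) and the sample coordinates are hypothetical; clustering within each snapshot is left out.

```python
from collections import defaultdict


def build_snapshots(trajectories):
    """Group the points of all trajectories by timestamp.

    Returns {timestamp: {object_id: (lat, lon)}} ordered by timestamp.
    """
    snapshots = defaultdict(dict)
    for object_id, points in trajectories.items():
        for timestamp, lat, lon in points:
            snapshots[timestamp][object_id] = (lat, lon)
    return dict(sorted(snapshots.items()))


# Two made-up trajectories sampled at integer timestamps.
trajectories = {
    "u1": [(0, 39.90, 116.40), (1, 39.91, 116.41)],
    "u2": [(0, 39.95, 116.30), (2, 39.96, 116.31)],
}
snapshots = build_snapshots(trajectories)
```

Each snapshot can then be clustered independently, which is what makes the snapshot model easy to partition across workers.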
Three phases: I. Snapshot clustering, II. Crowd detection, III. Gatherings generation
[Figure: Stream data analytics workflow for trajectory data.]
Batch model (archived data), stages shown: partition method, find clusters, find crowds, merge, discover gatherings, results
Streaming model (streaming data), stages shown: window-based partition, find clusters, incremental finding of crowds, merge, discover gatherings, results
[Figure: Throughputs and end-to-end delay.]
CASE 3 : Trajectory Grouping Pattern Discovery in Parallel
Slot Model : Trajectory Companion
Each trajectory slot consists of a range of timestamped location points of moving objects within a time period T.
Stream Data Analytics Workflow
Accuracy - the best performer is more accurate in finding a pattern with
comparable throughput and end-to-end delay
Looking Forward
 Parallelism is compounded by several factors
 Data partition and communication methods
 Metrics and algorithms
 Transformations and actions on data
 Dependencies of data and model
 Batch or streaming modes
 A discovery pipeline should be self-adaptive and elastic
 Run-time adaptive to the changes of (intermediate) data size and workload
 Autoscaling to virtualized computing nodes
 An ensemble approach to select the best-performing algorithm, metric and pipeline
Before I came here I was confused about
the subject. Having listened to your lecture I
am still confused. But on a higher level.
Enrico Fermi (1901-1954)

Modeling Uncertainty For Middleware-based Streaming Power Grid ApplicationsModeling Uncertainty For Middleware-based Streaming Power Grid Applications
Modeling Uncertainty For Middleware-based Streaming Power Grid Applications
 
SE4SG 2013 : Residential Electrical Demand Forecasting in Very Small Scale
SE4SG 2013 : Residential Electrical Demand Forecasting in  Very Small ScaleSE4SG 2013 : Residential Electrical Demand Forecasting in  Very Small Scale
SE4SG 2013 : Residential Electrical Demand Forecasting in Very Small Scale
 
SE4SG 2013 : Towards a Bottom-up Development of Reference Architectures for S...
SE4SG 2013 : Towards a Bottom-up Development of Reference Architectures for S...SE4SG 2013 : Towards a Bottom-up Development of Reference Architectures for S...
SE4SG 2013 : Towards a Bottom-up Development of Reference Architectures for S...
 
SE4SG 2013 : Towards a Constraint Based Approach for Self-Healing Smart Grids
SE4SG 2013 :  Towards a Constraint Based Approach for Self-Healing Smart GridsSE4SG 2013 :  Towards a Constraint Based Approach for Self-Healing Smart Grids
SE4SG 2013 : Towards a Constraint Based Approach for Self-Healing Smart Grids
 
SE4SG 2013 : MODAM: A MODular Agent-Based Modelling Framework
SE4SG 2013 : MODAM: A MODular Agent-Based Modelling Framework SE4SG 2013 : MODAM: A MODular Agent-Based Modelling Framework
SE4SG 2013 : MODAM: A MODular Agent-Based Modelling Framework
 
SE4SG 2013 : A Run-Time Verification Framework for Smart Grid Applications Im...
SE4SG 2013 : A Run-Time Verification Framework for Smart Grid Applications Im...SE4SG 2013 : A Run-Time Verification Framework for Smart Grid Applications Im...
SE4SG 2013 : A Run-Time Verification Framework for Smart Grid Applications Im...
 

Dernier

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...masabamasaba
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburgmasabamasaba
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfVishalKumarJha10
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024Mind IT Systems
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...Nitya salvi
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Hararemasabamasaba
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 

Dernier (20)

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 

A TALE of DATA PATTERN DISCOVERY IN PARALLEL

  • 1. A TALE OF DATA PATTERN DISCOVERY IN PARALLEL Yan Liu Yan.iu@concordia.ca Concordia University Montreal, Quebec, Canada
  • 2.
  • 3. Is Parallelism Necessary? “I want to democratize AI.” AI experts & HPC experts work side by side. “… democratizing the processes underlying the creation of AI systems…”
  • 4. Data Parallelism vs Model Parallelism. Data parallelism (when the data is too large): partition the workload over multiple devices. Assume there are n workers (devices); each worker receives a copy of the complete model and processes 1/n of the data with it. Model parallelism (when the model is too large): each worker/device holds only part of the model, e.g. an LSTM recurrent neural network in which each LSTM layer is assigned to one GPU. There is no contention to update a shared model at the end of each iteration; most of the communication happens when passing intermediate results between GPUs.
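To make the data-parallel scheme on slide 4 concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the presenter's code: the model is a hypothetical one-parameter linear fit, the "workers" are simulated sequentially in one process, and the helper names (`partition`, `local_gradient`, `train`) are invented for this example.

```python
# Data-parallel gradient descent for a toy model y = w * x (illustrative only).

def partition(data, n_workers):
    """Split the data into n_workers shards of roughly equal size (round-robin)."""
    return [data[i::n_workers] for i in range(n_workers)]

def local_gradient(w, shard):
    """Gradient of the mean squared error computed on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(data, n_workers=4, lr=0.01, steps=200):
    shards = partition(data, n_workers)
    w = 0.0  # every worker starts from the same copy of the complete model
    for _ in range(steps):
        # Each (simulated) worker computes a gradient on its own 1/n of the data.
        grads = [local_gradient(w, shard) for shard in shards]
        # Synchronization point: average the gradients and update the shared model.
        w -= lr * sum(grads) / len(grads)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # toy data, true w = 3
w = train(data)
print(round(w, 6))
```

The averaging step is where the contention mentioned on the slide arises in a real system: all workers must exchange gradients before the next iteration can start.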
  • 5. Key Factors of Parallel Analysis Pipelines. Algorithmic factors: discovery models, model parallelism, distance metrics. Data parallelism factors: partition, load balancing, data locality and shuffling. Architecture factors: batch vs. streaming, microservices, DevOps. Quality factors: accuracy (comparing between algorithms and ground truth), scalability (throughput, latency, data intensity), stability.
  • 6. Parallel Programming Models and Frameworks
  • 7. Programming Model: MapReduce. Map: <k1, v1> → list<k2, v2>; Shuffle; Reduce: <k2, list{v2}> → list<k3, v3>. Iterative algorithms: in Hadoop, large overheads are incurred by reading and writing data to stable storage between iterations.
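The map, shuffle and reduce signatures on slide 7 can be sketched as three small functions. This is a single-process illustration of the programming model using word counting; it does not use Hadoop, and the function names and the two input documents are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: <k1, v1> -> list<k2, v2>; here (file name, text) -> [(word, 1), ...]."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same intermediate key k2."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: <k2, list{v2}> -> list<k3, v3>; here (word, [1, 1, ...]) -> [(word, total)]."""
    return [(word, sum(counts))]

def word_count(docs):
    mapped = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    grouped = shuffle(mapped)
    return dict(pair for word, counts in grouped.items() for pair in reduce_phase(word, counts))

docs = {"a.txt": "to be or not to be", "b.txt": "to do"}  # hypothetical inputs
counts = word_count(docs)
print(counts)
```

In an iterative algorithm, this whole pipeline would run once per iteration, which is why writing intermediate results to stable storage between rounds becomes expensive.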
  • 8. Spark – Resilient Distributed Datasets. A data-parallel programming model for fault-tolerant distributed datasets: partitioned collections with caching; transformations (define new RDDs) and actions (compute results); restricted shared variables (broadcast, accumulators).
  • 9. Distributed Stream Processing: S4 and Storm. Count the occurrences of each word in these files.
  • 10. Ensemble of Models in Parallel
  • 11. Ensemble of Models in Learning. One dataset can be analyzed by a number of algorithms. Each algorithm has a number of hyperparameters to configure. The training data can be organized as different input structures. (Figure labels: Multi-class Classification; Regression; Real Time Model Selection.)
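The idea on slide 11, several algorithms and hyperparameter settings competing on the same dataset, can be sketched as a parallel evaluation followed by a selection step. Everything below is a toy assumption for illustration: the two "algorithms" (a mean predictor and a 1-D k-nearest-neighbour regressor), the data, and the use of a thread pool in place of separate compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy regression data (hypothetical): y = x^2, validated at the half-integer points.
train = [(x, x * x) for x in range(0, 10)]
valid = [(x + 0.5, (x + 0.5) ** 2) for x in range(0, 9)]

def mean_model(_params):
    """Baseline: always predict the mean of the training targets."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def knn_model(params):
    """k-nearest-neighbour regression on one feature; k is the hyperparameter."""
    k = params["k"]
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

# Each candidate is (algorithm name, model builder, hyperparameters).
CANDIDATES = [("mean", mean_model, {}), ("knn", knn_model, {"k": 2}), ("knn", knn_model, {"k": 5})]

def evaluate(candidate):
    """Build one candidate and return its validation mean squared error."""
    name, build, params = candidate
    predict = build(params)
    mse = sum((predict(x) - y) ** 2 for x, y in valid) / len(valid)
    return mse, name, params

# Evaluate all candidates concurrently, then keep the one with the lowest error.
with ThreadPoolExecutor(max_workers=len(CANDIDATES)) as pool:
    results = list(pool.map(evaluate, CANDIDATES))

best_mse, best_name, best_params = min(results, key=lambda r: r[0])
print(best_name, best_params, best_mse)
```

In a deployment like the one the slides describe, each candidate would presumably run on its own node or GPU; the selection step stays the same.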
  • 12. A Microservice Architecture of Multiple ML Models
  • 13. CASE 1: Select a Better Deep Learner in Classification. Problem set: fit with multiple data scenarios. Devices were deployed and run as legacy; their types can be unknown due to lack of documentation. The task is to identify, from measurement data collected in the field, the device type for each segment of device connection in a full, lengthy physical link. Accurate classification is essential for further network configuration and capacity optimization. Data set: real-world data collected by an industry vendor; 600,000 data samples as inputs; each sample contains 6241 features; 6 classification types as output.
  • 14. CASE 1: Select a Better Deep Learner in Classification. Solution: an ensemble of 2 neural network models running on two GPU nodes, a convolutional neural network (CNN) on GPU node 1 and a residual network (ResNet) on GPU node 2, with cross-validation for hyperparameter tuning embedded in each model. Result: find a more accurate model (e.g. 33% accuracy vs. 92% accuracy) and find an accurate model faster (e.g. 100 epochs vs. 5 epochs, 20x faster). App scenario A: 5 epochs produce 100% validation accuracy vs. 100 epochs produce 100% validation accuracy. App scenario B: 10,000 training data samples produce 33% validation accuracy vs. 600,000 training samples produce 92% validation accuracy.
  • 15. Deep Learning and Customer Access to Credit. Article from Forbes.com on Feb 20, 2017. “We noticed a couple of years ago,” says Peter Maynard, Senior Vice President of Global Analytics at Equifax, “that we were not getting enough statistical lift from our traditional credit scoring methodology.” “We spend a lot of time creating segments to build a model on. Determining the optimal segment could take sometimes 20% of the time that it takes to build a model. In the context of neural nets, those segments are the hidden layers: the neural net does it all for you. The machine is figuring out what are the segments and what are the weights in a segment instead of having an analyst do that. I find it really powerful.” Instead of being hypotheses developed by data scientists, now the attributes are created by the deep learning process, on the basis of a much larger set of historical or “trended data.” https://www.forbes.com/sites/gilpress/2017/02/20/equifax-and-sas-leverage-ai-and-deep-learning-to-improve-consumer-access-to-credit/
  • 17. Data Parallelism – Control the Degree of Parallelism. Partition size = Math.max(minSize, Math.min(goalSize, blockSize)). minSize: the Hadoop parameter mapreduce.input.fileinputformat.split.minsize (1 byte); raising it increases the partition size (decreases the number of partitions). blockSize: dfs.block.size (128M) or fs.local.block.size (32M), the default partition size according to Hadoop 2.0. goalSize: totalInputSize/numPartitions; raising numPartitions decreases the partition size (increases the number of partitions).
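The partition-size formula on slide 17 is easy to check with a few lines of Python. The sketch below only evaluates the formula as written on the slide; the 600 MB input size and the helper names are hypothetical, and real Hadoop or Spark input formats apply further rules that are not modeled here.

```python
MB = 1024 * 1024

def goal_size(total_input_size, num_partitions):
    """goalSize = totalInputSize / numPartitions (integer bytes)."""
    return total_input_size // max(num_partitions, 1)

def split_size(min_size, goal, block_size):
    """Partition size = max(minSize, min(goalSize, blockSize))."""
    return max(min_size, min(goal, block_size))

total = 600 * MB   # hypothetical total input size
block = 32 * MB    # local block size quoted on the slide

# Few requested partitions: goalSize (300 MB) exceeds the block size, so the block size wins.
default = split_size(1, goal_size(total, 2), block)
# Many requested partitions: goalSize (20 MB) is below the block size, so partitions shrink.
smaller = split_size(1, goal_size(total, 30), block)
# Raising minSize above the block size forces larger partitions.
larger = split_size(64 * MB, goal_size(total, 2), block)

print(default // MB, smaller // MB, larger // MB)
```

The two tuning knobs therefore pull in opposite directions: numPartitions can only shrink partitions below the block size, and minSize can only grow them above it.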
  • 18. Control the Degree of Parallelism through Partition: example partitions. Input files: 1987.csv (size = 124183 KB) and 1988.csv (size = 489297 KB). Scenario 1, default: partition size = 32 MB, number of partitions = 19. 1987.csv: partitions 0, 1, 2 have size 32 MB and partition 3 has size 25 MB; 1988.csv: partitions 4, 5, …, 17 have size 32 MB and partition 18 has size 29 MB. Scenario 2, decreased partition size: partition size = 19 MB, number of partitions = 30. 1987.csv: partitions 0, 1, …, 5 have size 19 MB and partition 6 has size 21 MB (*); 1988.csv: partitions 7, 8, …, 28 have size 19 MB and partition 29 has size 18 MB. (*) Partition 6 contains data from both 1987.csv and 1988.csv.
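Scenario 1 on slide 18 can be reproduced with simple arithmetic: cut each file into 32 MB chunks and keep the remainder as a final, smaller partition. The sketch below assumes exactly that rule (full chunks plus one remainder per file); it does not model scenario 2, where one partition spans two files, because the slide does not describe that packing logic. The function name is an assumption.

```python
KB_PER_MB = 1024

def split_file(size_kb, partition_kb):
    """Return the partition sizes (in KB) for one file: full chunks plus a remainder."""
    full, rest = divmod(size_kb, partition_kb)
    return [partition_kb] * full + ([rest] if rest else [])

files = {"1987.csv": 124183, "1988.csv": 489297}  # sizes in KB, from the slide
partition_kb = 32 * KB_PER_MB

layout = {name: split_file(size, partition_kb) for name, size in files.items()}
total_partitions = sum(len(parts) for parts in layout.values())

for name, parts in layout.items():
    # Print the number of partitions and the size of the last one, truncated to whole MB.
    print(name, len(parts), parts[-1] // KB_PER_MB)
print(total_partitions)
```

Under this rule 1987.csv yields 4 partitions with a 25 MB tail and 1988.csv yields 15 partitions with a 29 MB tail, which matches the 19 partitions quoted for scenario 1.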
  • 19. Data Skew. Unbalanced workloads tend to dominate the overall delays. Default hash partitioning does not distribute skewed data uniformly: some partitions contain more elements on the reducer side than others. Optimization technique: further break down the skewed partitions into sub-partitions. (Figure label: slow task.)
  • 20. Load Balancing. Proposed solution: 1. Add a random number x in [0, n], where n is the number of partitions, as a prefix to the key, such that k_new = x_k_old. 2. Hash partition with k_new. 3. Process each partition. 4. Remove the prefix. 5. Perform further operations.
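The five steps on slide 20 (often called key salting) can be sketched in plain Python for a simple per-key sum. This is an illustration under assumptions: partitions are simulated with a dictionary, `zlib.crc32` stands in for the framework's hash partitioner, the salt is drawn from 0 to n-1, and the record data is invented.

```python
import random
import zlib
from collections import Counter, defaultdict

def salted_count(records, n_partitions, seed=0):
    rng = random.Random(seed)
    # Step 1: prefix every key with a random salt, k_new = x_k_old.
    salted = [(f"{rng.randrange(n_partitions)}_{key}", value) for key, value in records]
    # Step 2: hash partition on the salted key (crc32 is a deterministic stand-in).
    partitions = defaultdict(list)
    for k_new, value in salted:
        partitions[zlib.crc32(k_new.encode()) % n_partitions].append((k_new, value))
    # Step 3: process each partition independently (partial sums per salted key).
    partials = []
    for part in partitions.values():
        local = Counter()
        for k_new, value in part:
            local[k_new] += value
        partials.extend(local.items())
    # Steps 4 and 5: remove the prefix, then combine the partial results per original key.
    totals = Counter()
    for k_new, subtotal in partials:
        _, k_old = k_new.split("_", 1)
        totals[k_old] += subtotal
    return dict(totals), salted

# A skewed workload (hypothetical): one "hot" key dominates.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals, salted = salted_count(records, n_partitions=4)
print(totals)
```

The trick only works for operations whose partial results can be merged afterwards (sums, counts, maxima); the final merge in steps 4 and 5 is small because it sees at most n partial results per key.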
  • 21. Data Aggregation Methods. Approach: minimize data shuffling by tuning the join operation. Reduce-side join: Dataset A and Dataset B pass through the mappers (M) and are combined by a shuffle join to produce the output. Map-side join: the smaller dataset is collected and broadcast through the distributed cache, and each worker node joins it locally with Dataset A in its mappers (M) to produce the output.
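The map-side join on slide 21 avoids the shuffle by giving every worker a full copy of the smaller dataset. A minimal single-process sketch, assuming the small dataset has unique keys and fits in memory; the partitions, the datasets and the function name are hypothetical.

```python
def map_side_join(large_partitions, small_dataset):
    """Join each partition of the large dataset against a broadcast lookup table."""
    lookup = dict(small_dataset)  # stands in for the broadcast / distributed-cache copy
    joined = []
    for partition in large_partitions:   # each partition represents one worker's local data
        for key, value in partition:
            if key in lookup:            # local join: no data leaves the worker
                joined.append((key, value, lookup[key]))
    return joined

# Hypothetical data: flight delays keyed by carrier code, plus a small carrier table.
flights = [[("AA", 10), ("UA", 25)], [("AA", 5), ("ZZ", 1)]]
carriers = [("AA", "American"), ("UA", "United")]

result = map_side_join(flights, carriers)
print(result)
```

A reduce-side join would instead repartition both datasets by key so that matching records meet on the same reducer, which is the shuffle cost the slide aims to minimize.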
  • 22. Case 2: Network Analytics
  • 23. Feature Selection and Clustering Analysis. Part of a supervised machine learning analysis to classify anomalies in a dataset from a telecommunication network. Data set: 305,078 rows represent devices; 275 columns/features represent device ports; each cell measures the device port reading at a timestamp. Preprocessing before feeding into the neural network classifier: give the classifier only what is truly necessary instead of blindly throwing in all features. Clustering is unsupervised learning and does not require prior knowledge of the data. Goals: increase robustness against noise or bad-quality data; avoid scalability bottlenecks with bigger datasets; accelerate iterations of the model-building process.
  • 24. Two Analysis Pipelines. Goal: identify which features are strongly correlated at various aggregation levels. Pipeline I uses the agglomerative neighbor joining (NJ) clustering algorithm: the NJ algorithm was applied to the feature correlation matrix of size 275*275. Pipeline II uses PCA and DBSCAN as another clustering algorithm: Dynamic Time Warping (DTW) measures the distance between any two time series of a feature, and the DTW distances are the input to PCA + DBSCAN clustering.
  • 25. Visualization of Neighbor Join Clustering. Clustering of the correlations of 275 features in a cladogram. The length of each branch in the tree represents the distance (d = 1 - |correlation|) between nodes.
  • 26. Neighbor Joining: a clustering method for the creation of phylogenetic trees. Steps: take a distance matrix as input; initialize all nodes of the tree; calculate the Q matrix based on a specific formula; find the smallest Q value; join the pair of nodes corresponding to the smallest Q; update the original distance matrix with the new joined node; repeat until the tree is completed. (Figure: NJ example.)
  • 27. Neighbor Joining – Distance Matrix Generation. The matrix is held in memory as key-value tuples, e.g. ((index_i, index_j), correlation coefficient). A standard 6x6 distance matrix data structure has one cell per (i, j) pair, each containing a distance. The upper triangle holds all the information we need, so the diagonal and lower-triangle tuples can be discarded to reduce unnecessary computation.
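The upper-triangle key-value layout on slide 27, combined with the distance d = 1 - |correlation| from slide 25, can be sketched as follows. This is a plain-Python illustration, not the Spark implementation: indexes are 0-based here, the Pearson correlation is computed by hand, the four toy feature series are invented, and a constant series would need special handling because its standard deviation is zero.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def upper_triangle_distances(features):
    """Return {(i, j): 1 - |corr(i, j)|} for i < j only (no diagonal, no lower triangle)."""
    return {
        (i, j): 1 - abs(pearson(features[i], features[j]))
        for i, j in combinations(range(len(features)), 2)
    }

# Hypothetical feature series: 1 is a scaled copy of 0, 2 is 0 reversed, 3 oscillates.
features = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, -1, 1, -1]]
dist = upper_triangle_distances(features)
print(len(dist), sorted(dist))
```

For m features this stores m(m-1)/2 tuples instead of m*m cells, which is the saving the slide refers to.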
  • 28. Neighbor Joining – Distance Matrix Generation (worked example on the 6x6 matrix). The min Q is found at indexes 1 and 6. Remove any cell with either i or j equal to 1 or 6. Nodes 1 and 6 are “joined” as node “U”; place this node “U” at the smallest index position, 1 in this case. The grey boxes are the new distance values between node U and the remaining nodes 2 to 5. If any cells have an index bigger than the removed index, decrement their indexes by 1; there are none in this case.
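One iteration of the procedure on slides 26 to 28 can be sketched on an upper-triangle dictionary. The sketch assumes the standard neighbor-joining formulas, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and d(u, k) = (d(i, k) + d(j, k) - d(i, j)) / 2, since the slides only refer to "a specific formula". Nodes are named rather than re-indexed, and the five-node distance table is toy data.

```python
def get(dist, a, b):
    """Look up a distance stored under either key order."""
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

def nj_step(dist, nodes, new_node):
    """Run one neighbor-joining iteration; return the joined pair, Q, new nodes, new distances."""
    n = len(nodes)
    row_sum = {a: sum(get(dist, a, b) for b in nodes if b != a) for a in nodes}
    # Q matrix, computed only for the stored (upper-triangle) pairs.
    q = {(a, b): (n - 2) * dist[(a, b)] - row_sum[a] - row_sum[b] for (a, b) in dist}
    i, j = min(q, key=q.get)  # pair with the smallest Q value
    rest = [k for k in nodes if k not in (i, j)]
    # Drop every cell that involves i or j, then add the distances to the joined node.
    new_dist = {pair: v for pair, v in dist.items() if i not in pair and j not in pair}
    for k in rest:
        new_dist[(new_node, k)] = (get(dist, i, k) + get(dist, j, k) - dist[(i, j)]) / 2
    return (i, j), q, [new_node] + rest, new_dist

nodes = ["a", "b", "c", "d", "e"]
dist = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
joined, q, nodes2, dist2 = nj_step(dist, nodes, "u")
print(joined, q[joined], dist2)
```

Each iteration removes two nodes and adds one, so repeating the step shrinks the matrix until the tree is complete. The dependency of every iteration on the previous one is what makes the algorithm hard to parallelize across iterations.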
  • 29. Neighbor Joining in Spark (columns: Description, Visualization, Spark Transformations). Step i: compute the distance matrix. Step ii: calculate Q and find the min Q. Steps v-vi: update the distance matrix for the next iteration. Step vii: set up other variables for the next iteration. Spark operations used: collectAsMap, map, filter, subtractByKey, union, map, lookup, reduceByKey, map. The distance matrix is updated recursively. NJ is an iterative algorithm with data dependencies between stages, so the optimization targets each iteration of the NJ algorithm.
  • 30. Evaluation on a Cluster. (Figure: run time vs. number of cores; x-axis: number of cores, 0 to 60; y-axis: runtime in minutes, 0 to 9; the optimal point is marked.)
  • 31. Execution Time Decomposition. Time distribution across Spark's metrics, 10-core cluster: 42.62%, 23.99%, 25.17%, 7.60%, 0.63%. Time distribution across Spark's metrics, 20-core cluster: 38.97%, 24.56%, 18.51%, 17.18%, 0.78%. Legend for both charts: executor computing time, scheduler delay, task deserialization time, shuffle read/write time, result serialization time.
  • 32. Run-time Repartition. (Figure: effect of the partition method on run time at different intervals of parallel NJ; x-axis: number of iterations before repartition, 1 to 6; y-axis: run time in seconds, 0 to 3.5; series: coalesce (shuffle = false) and repartition (shuffle = true); the new optimal and the previous optimal are marked.)
  • 33. DTW & Dimension Reduction + DBSCAN. Dynamic Time Warping is a sequence alignment algorithm that measures the similarity of time series as their distance. Fast DTW is an approximation of standard DTW with O(n) complexity. Input format: [label string, vector], e.g. label X with vector (1), (2), …, (n); label Y with vector (1), (2), …, (n); label Z with vector (1), (2), …, (n); and so on.
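For reference, standard Dynamic Time Warping (slide 33) can be written as a short dynamic program. The sketch below is the quadratic-time textbook version with an absolute-difference step cost; it is not the FastDTW approximation used in the pipeline, and the function name is an assumption.

```python
def dtw_distance(xs, ys):
    """Standard DTW distance between two numeric sequences, O(len(xs) * len(ys))."""
    inf = float("inf")
    n, m = len(xs), len(ys)
    # cost[i][j] = cheapest alignment of the first i points of xs with the first j of ys.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(xs[i - 1] - ys[j - 1])
            # Extend the best of the three allowed moves: insertion, deletion, match.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # the repeated 2 is absorbed by warping
print(dtw_distance([0, 0], [1, 1]))
```

Because the table has one cell per pair of time points, computing all pairwise distances between 275 features is the expensive part of Pipeline II, which motivates both the linear-time approximation and the parallelization.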
  • 34. Fast Dynamic Time Warping in Spark (columns: Description, Visualization, Spark Transformations). Steps: generate tuples of indexed pairs of sequences, e.g. ((i, j), (X, Y)), where i and j are the pairwise indexes of the upper triangle and X, Y are the time series; distance calculation; filling the distance matrix; formatting the RDD for input to PCA; PCA. Spark operations used: count, zipWithIndex, map, parallelize, map, groupByKey, sortByKey, map, mapValues.
  • 35. Evaluation on a Cluster. (Figure: FastDTW-PCA-DBSCAN run time vs. number of cores; x-axis: number of cores, 0 to 80; y-axis: runtime in minutes, 0 to 7.)
  • 36. Observing Distance Metrics Effect on Run-Time
  • 37. CASE 3: Trajectory Grouping Pattern Discovery in Parallel. Stream processing on a real-time data set. Problem set: design scalable, distributed processing analytics that discover the moving-together pattern of trajectories over a large and continuous trajectory data stream. Data set: GeoLife, a real-world public GPS trajectory dataset collected by Microsoft, containing 178 real users' outdoor activities from April 2007 to October 2011; 17,621 trajectories and over 20 million location records. Solution: a parallel system architecture of Apache Spark Streaming clusters running on up to 20 AWS nodes; run an ensemble of algorithms, namely the snapshot model and the slot model with two distance measures for clustering. Result: process up to 30,000 updates per second of moving objects within 14 seconds on an AWS cluster.
  • 38. CASE 3: Trajectory Grouping Pattern Discovery in Parallel. Snapshot model: Gathering. Each snapshot consists of the moving objects from all trajectories that have the same timestamped location points. Phases: I. snapshot clustering; II. crowd detection; III. gatherings generation. Stream data analytics workflow: the batch model (archived data) contains the stages partition, find clusters, find crowds, merge, discover gatherings, results; the streaming model (streaming data) contains the stages window-based partition, find clusters, incremental finding of crowds, merge, discover gatherings, results. (Figure: throughputs and end-to-end delay.)
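The snapshot definition on slide 38 amounts to regrouping trajectory points by timestamp before any clustering happens. A minimal sketch of that first step, assuming each trajectory is a list of (timestamp, x, y) points; the object identifiers and coordinates are hypothetical, and the clustering, crowd detection and gathering phases are not shown.

```python
from collections import defaultdict

def build_snapshots(trajectories):
    """Group points from all trajectories by timestamp: {timestamp: [(object_id, x, y), ...]}."""
    snapshots = defaultdict(list)
    for object_id, points in trajectories.items():
        for timestamp, x, y in points:
            snapshots[timestamp].append((object_id, x, y))
    return dict(sorted(snapshots.items()))  # snapshots in time order

trajectories = {
    "obj1": [(0, 0.0, 0.0), (1, 1.0, 0.0)],
    "obj2": [(0, 0.1, 0.0), (1, 5.0, 5.0)],
    "obj3": [(1, 1.1, 0.1)],
}
snapshots = build_snapshots(trajectories)
for timestamp, objects in snapshots.items():
    print(timestamp, objects)
```

Each snapshot can then be clustered independently of the others, which is what makes phase I a natural candidate for data parallelism.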
  • 39. CASE 3: Trajectory Grouping Pattern Discovery in Parallel. Slot model: Trajectory Companion. Each trajectory slot consists of a range of timestamped location points of moving objects within the time period T. (Figures: stream data analytics workflow; accuracy.) The best performer is more accurate in finding a pattern, with comparable throughput and end-to-end delay.
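In contrast to a snapshot, a slot on slide 39 collects every point that falls inside a time period T. The sketch below assumes consecutive, non-overlapping slots aligned at time 0 and integer timestamps; the slides do not state how slots are aligned, so this layout, the function name and the sample data are illustrative assumptions.

```python
from collections import defaultdict

def build_slots(trajectories, period):
    """Bucket points into slots of length `period`: {slot_index: {object_id: [points]}}."""
    slots = defaultdict(lambda: defaultdict(list))
    for object_id, points in trajectories.items():
        for timestamp, x, y in points:
            slots[timestamp // period][object_id].append((timestamp, x, y))
    return {slot: dict(objects) for slot, objects in sorted(slots.items())}

trajectories = {
    "obj1": [(0, 0.0, 0.0), (3, 1.0, 0.0), (7, 2.0, 0.0)],
    "obj2": [(1, 0.1, 0.0), (6, 2.1, 0.1)],
}
slots = build_slots(trajectories, period=5)
for slot, objects in slots.items():
    print(slot, objects)
```

Because a slot holds a short sub-trajectory per object rather than a single point, comparing objects inside a slot needs a trajectory distance measure, which is where the two distance measures mentioned for the slot model come in.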
  • 40. Looking Forward. Parallelism is compounded by several factors: data partition and communication methods; metrics and algorithms; transformations and actions on data; dependencies of data and model; batch or streaming modes. A discovery pipeline should be self-adaptive and elastic: adaptive at run time to changes in (intermediate) data size and workload; autoscaling to virtualized computing nodes; using an ensemble approach to select the best-performing algorithm, metric and pipeline.
  • 41. Before I came here I was confused about the subject. Having listened to your lecture I am still confused. But on a higher level. Enrico Fermi (1901-1954)

Editor's notes

  1. R1: What are the appropriate models of trajectory pattern discovery? R2: What are the algorithmic approaches enabling parallel processing of trajectory data? R3: What are the distance metrics for comparing trajectories? R4: What are the parallelization design factors for efficient analysis?
  2. Why we choose ensemble models in learning
  3. The results are displayed as a radial cladogram on Figure 4. The cladogram was built
  4. Analytics experience in industries similar to GeoTab