In the era of IoT and AI, distributed and parallel computing is embracing big-data-driven and algorithm-focused applications and services. Despite rapid progress in parallel frameworks, algorithms, and accelerated computing capacity, it remains challenging to deliver an efficient and scalable data analysis solution. This talk shares research experience on data pattern discovery in domain applications. In particular, the research scrutinizes key factors in analysis workflow design and data parallelism improvement on the cloud.
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
1. A TALE OF DATA PATTERN DISCOVERY IN PARALLEL
Yan Liu
Yan.liu@concordia.ca
Concordia University
Montreal, Quebec, Canada
3. Is Parallelism Necessary?
“I want to democratize AI.”
AI experts & HPC experts work side by side.
“… democratizing the processes underlying the creation of AI systems …”
4. Data Parallelism vs Model Parallelism
Data Parallelism – when the data is too large
Partition the workload over multiple devices.
Assume there are n workers (devices).
Each worker receives a copy of the complete model
and processes the model on 1/n of the data.
Model Parallelism – when the model is too large
Each worker/device holds only part of the model.
E.g., LSTM recurrent neural networks: each layer of the LSTM is assigned to one GPU.
There is no contention to update the shared model at the end of each iteration;
most of the communication happens when passing intermediate results between GPUs.
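A minimal sketch of the two schemes, assuming a PyTorch-style setup (the framework, layer sizes, and device count are illustrative, not from the talk):

```python
import torch
import torch.nn as nn

# Data parallelism: every GPU holds a full replica of the model;
# nn.DataParallel splits each batch across the visible devices.
model = nn.Sequential(nn.Linear(6241, 256), nn.ReLU(), nn.Linear(256, 6))
if torch.cuda.device_count() > 1:
    dp_model = nn.DataParallel(model.cuda())

# Model parallelism: each GPU holds only part of the model;
# intermediate activations are passed between GPUs.
class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(6241, 256).to("cuda:0")
        self.stage2 = nn.Linear(256, 6).to("cuda:1")

    def forward(self, x):
        h = torch.relu(self.stage1(x.to("cuda:0")))
        return self.stage2(h.to("cuda:1"))  # GPU0 -> GPU1 hand-off
```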
5. Key Factors of Parallel Analysis Pipelines
Algorithmic factors
Discovery models
Model parallelism
Distance metrics
Data Parallelism factors
Partition
Load balancing
Data locality and shuffling
Architecture factors
Batch vs. Streaming
Microservices
DevOps
Quality factors
Accuracy (comparing algorithm output against ground truth)
Scalability (throughput, latency, data intensity)
Stability
7. Programming Model
MapReduce
Map: <k1, v1> → list<k2, v2>
Shuffle
Reduce: <k2, list(v2)> → list<k3, v3>
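A plain-Python sketch of this signature, with word count as the usual example (the shuffle phase is simulated with a dictionary; this is not the Hadoop API):

```python
from collections import defaultdict

def map_fn(k1, v1):                      # <k1, v1> -> list<k2, v2>
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):               # <k2, list(v2)> -> list<k3, v3>
    return [(k2, sum(values))]

records = [(0, "data pattern discovery"), (1, "data in parallel")]
groups = defaultdict(list)
for k1, v1 in records:
    for k2, v2 in map_fn(k1, v1):
        groups[k2].append(v2)            # shuffle: group values by key
result = [kv for k2, vs in groups.items() for kv in reduce_fn(k2, vs)]
# [('data', 2), ('pattern', 1), ('discovery', 1), ('in', 1), ('parallel', 1)]
```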
Iterative algorithms
In Hadoop, large overheads are incurred due to reading/writing data to stable storage between iterations.
8. Spark – Resilient Distributed Datasets
A data-parallel programming model for fault-tolerant distributed datasets
Partitioned collections with caching
Transformations (define new RDDs), actions (compute results)
Restricted shared variables (broadcast, accumulators)
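A short PySpark sketch of these pieces (the names and numbers are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")
rdd = sc.parallelize(range(1, 1001), numSlices=8).cache()  # partitioned + cached

squares = rdd.map(lambda x: x * x)            # transformation: defines a new RDD
total = squares.reduce(lambda a, b: a + b)    # action: computes a result

factor = sc.broadcast(10)                     # read-only shared variable
counter = sc.accumulator(0)                   # workers may only add to it
rdd.foreach(lambda x: counter.add(1))
scaled = rdd.map(lambda x: x * factor.value).take(5)
```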
11. Ensemble of Models in Learning
One dataset can be analyzed by a number of algorithms.
Each algorithm has a number of hyperparameters to configure.
The training data can be organized as different input structures.
(Diagram: multi-class classification, regression, and real-time model selection.)
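As a sketch of this ensemble idea, the snippet below fits two candidate algorithms on the same dataset and keeps the better validation score; scikit-learn and the synthetic data are my stand-ins, not tools named in the talk:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset: 6 classes, many features.
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, n_classes=6)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_val, y_val)
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)   # select the best-performing model
```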
13. CASE 1 : Select A Better Deep Learner in Classification
Problem Set
Fit with multiple data scenarios.
Devices were deployed and run as legacy; their types can be unknown due to lack of documentation.
From collected field measurement data, identify the device type for each segment of device connection in a full-length physical link.
Accurate classification is essential for further network configuration and capacity optimization.
Data Set
• Real-world data collected by an industry vendor
• 600,000 data samples as inputs
• Each sample contains 6,241 features
• 6 device-type classes as output
14. CASE 1 : Select A Better Deep Learner in Classification
Solution
Ensemble of 2 neural network models running on two GPU nodes:
▪ Convolutional neural network (CNN)
▪ Residual network (ResNet)
Embed cross-validation for hyperparameter tuning in each model.
Result
Find a more accurate model (e.g., 33% accuracy vs. 92% accuracy).
Find an accurate model faster (e.g., 100 epochs vs. 5 epochs, 20x faster).
(Diagram: the same data feeds a CNN on GPU node 1 and a ResNet on GPU node 2. For app scenario A, 5 epochs vs. 100 epochs produce 100% validation accuracy; for app scenario B, 10,000 training samples produce 33% validation accuracy vs. 92% with 600,000 samples.)
15. Deep Learning and Customer Access to Credit
Article from Forbes.com on Feb 20, 2017
“We noticed a couple of years ago,” says Peter Maynard, Senior Vice President of Global Analytics at Equifax, “that we were not getting enough statistical lift from our traditional credit scoring methodology.”
“We spend a lot of time creating segments to build a model on. Determining the optimal segment could take sometimes 20% of the time that it takes to build a model. In the context of neural nets, those segments are the hidden layers—the neural net does it all for you. The machine is figuring out what are the segments and what are the weights in a segment instead of having an analyst do that. I find it really powerful.”
Instead of being hypotheses developed by data scientists, now the attributes are created by the deep learning process, on the basis of a much larger set of historical or “trended” data.
https://www.forbes.com/sites/gilpress/2017/02/20/equifax-and-sas-leverage-ai-and-deep-learning-to-improve-consumer-access-to-credit/
17. Data Parallelism – Control the Degree of Parallelism
Partition Size = Math.max(minSize, Math.min(goalSize, blockSize))
minSize: Hadoop parameter mapreduce.input.fileinputformat.split.minsize (default 1 byte); increasing it increases the partition size (decreases the number of partitions)
blockSize: dfs.block.size (128 MB) or fs.local.block.size (32 MB); the default partition size according to Hadoop 2.0
goalSize: totalInputSize / numPartitions; increasing numPartitions decreases the partition size (increases the number of partitions)
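A small sketch of this rule, plus how numPartitions is passed in PySpark (byte sizes are illustrative):

```python
def split_size(total_input_size, num_partitions, block_size, min_size=1):
    # Hadoop FileInputFormat rule: max(minSize, min(goalSize, blockSize))
    goal_size = total_input_size // num_partitions
    return max(min_size, min(goal_size, block_size))

# 600 MB of input, 30 requested partitions, 32 MB local block size:
print(split_size(600 * 2**20, 30, 32 * 2**20) // 2**20)   # -> 20 (MB per split)

# In PySpark, the requested partition count feeds into goalSize:
# rdd = sc.textFile("1987.csv,1988.csv", 30)
```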
18. Control the Degree of Parallelism through Partition
Example partitions
Input Files:
1987.csv - Size = 124,183 KB
1988.csv - Size = 489,297 KB
Scenario 1: Default
Partition size = 32 MB, Num of Partitions = 19
1987.csv [Partitions 0, 1, 2] size = 32 MB;
[Partition 3] size = 25 MB;
1988.csv [Partitions 4, 5, … 17] size = 32 MB;
[Partition 18] size = 29 MB;
Scenario 2: Decrease partition size
Partition size = 19 MB, Num of Partitions = 30
1987.csv [Partitions 0, 1, … 5] size = 19 MB;
[Partition 6] size = 21 MB;
1988.csv [Partitions 7, 8, … 28] size = 19 MB;
[Partition 29] size = 18 MB;
* Partition 6 contains data from both 1987.csv and 1988.csv
19. Data Skew
Unbalanced workloads tend to dominate the overall delay.
The default hash partitioner is not going to distribute the data uniformly;
some partitions contain more elements on the reducer side than others.
Optimization technique: further break down the skewed partitions into sub-partitions.
(Diagram: a skewed partition produces a slow task.)
20. Load Balancing
Proposed solution:
1. Add a random number x : [0, n], where n is the number of partitions, as a prefix to the key, such that k_new = x_k_old
2. Hash-partition with k_new
3. Process each partition
4. Remove the prefix
5. Perform further operations
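A PySpark sketch of this salting recipe for an additive reduce; the pair RDD, the salt-bucket count N, and the SparkContext `sc` are illustrative assumptions:

```python
import random

N = 10   # number of salt buckets (assumed)
pairs = sc.parallelize([("k1", 1), ("k1", 1), ("k2", 1)] * 1000)  # skewed toward k1

salted = pairs.map(lambda kv: (f"{random.randint(0, N - 1)}_{kv[0]}", kv[1]))
partial = salted.reduceByKey(lambda a, b: a + b)            # hash-partitioned on k_new
unsalted = partial.map(lambda kv: (kv[0].split("_", 1)[1], kv[1]))  # strip the prefix
final = unsalted.reduceByKey(lambda a, b: a + b)            # further operations
```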
21. Data Aggregation Methods
Approach: Minimize data shuffling by tuning the join operation
(Diagram: in a reduce-side (shuffle) join, map tasks read both Dataset A and Dataset B and shuffle records by key to produce the output. In a map-side join, the smaller dataset is collected, broadcast to a distributed cache, and joined locally in each worker node's map task.)
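A map-side (broadcast) join sketch in PySpark; the two RDDs are illustrative, and the smaller one is assumed to fit in driver memory:

```python
small_rdd = sc.parallelize([("a", "meta1"), ("b", "meta2")])
large_rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)] * 1000)

# Collect the smaller dataset once and ship it to every worker.
small = dict(small_rdd.collect())
small_bc = sc.broadcast(small)

def local_join(kv):
    k, v = kv
    if k in small_bc.value:              # join performed locally in the map task
        yield (k, (v, small_bc.value[k]))

joined = large_rdd.flatMap(local_join)   # no shuffle of the large dataset
```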
22. Case 2 : Network Analytics
23. Feature Selection and Clustering Analysis
Part of a supervised machine learning analysis to classify anomalies in a telecommunication network dataset
Data sets
- 305,078 rows represent devices
- 275 columns/features represent device ports
- Each cell measures the device port reading at a timestamp
Preprocessing before feeding into the neural network classifier:
Input to the classifier only what is truly necessary, rather than blindly throwing in all features
Clustering is unsupervised learning and does not require prior knowledge of the data
Increase robustness against noise or bad-quality data
Avoid scalability bottlenecks with bigger datasets
Accelerate iterations of the model-building process
24. Two Analysis Pipelines
Identify which features are strongly correlated at various aggregation levels
Pipeline I: Using the agglomerative neighbor-joining (NJ) clustering algorithm
The NJ algorithm was applied to the feature correlation matrix of size 275×275
Pipeline II: Using PCA and DBSCAN as another clustering approach
Using Dynamic Time Warping (DTW) to measure the distance between any two time series of a feature
Input the DTW distances to PCA + DBSCAN clustering
25. Visualization of Neighbor Join Clustering
Clustering of correlations of 275 features in a cladogram. The length of each branch in the tree represents the distance (d = 1 − |correlation|) between nodes.
26. Neighbor Joining
A clustering method for the creation of phylogenetic trees
- Takes a distance matrix as input
- Initialize all nodes of the tree
- Calculate the Q matrix: Q(i, j) = (n − 2)·d(i, j) − Σk d(i, k) − Σk d(j, k)
- Find the smallest Q value
- Join the pair of nodes corresponding to the smallest Q
- Update the original distance matrix with the new joined node
- Repeat until the tree is complete
NJ Example
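A minimal NumPy sketch of one iteration, using the standard NJ formulas above (the Spark version on a later slide follows the same steps):

```python
import numpy as np

def nj_step(D):
    """One neighbor-joining iteration on a symmetric distance matrix D (n >= 3)."""
    n = D.shape[0]
    row = D.sum(axis=1)
    Q = (n - 2) * D - row[:, None] - row[None, :]     # Q matrix
    np.fill_diagonal(Q, np.inf)
    i, j = np.unravel_index(np.argmin(Q), Q.shape)    # pair with the smallest Q
    d_u = 0.5 * (D[i] + D[j] - D[i, j])               # distances to the joined node U
    keep = [k for k in range(n) if k not in (i, j)]
    D_new = np.zeros((n - 1, n - 1))                  # U placed at the smallest index
    D_new[0, 1:] = D_new[1:, 0] = d_u[keep]
    D_new[1:, 1:] = D[np.ix_(keep, keep)]
    return D_new, (i, j)
```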
27. Neighbor Joining- Distance Matrix Generation
In-memory key-value tuples, e.g. ((index_i, index_j), correlation coefficient)
(Diagram: a standard 6×6 distance matrix data structure, where each cell contains a distance corresponding to an (i, j) pair. The upper triangle holds all the information we need, so the lower-diagonal tuples can be discarded to reduce unnecessary computation.)
28. Neighbor Joining - Distance Matrix Generation
(Diagram: successive states of the distance matrix as one join is applied.)
The minimum Q is found at indexes 1 and 6; remove any cell whose i or j equals 1 or 6.
Nodes 1 and 6 are “joined” as node “U”; place node “U” at the smallest index position, 1 in this case.
The grey boxes are the new distance values between node U and the remaining nodes 2 to 5.
If any cells have indexes bigger than the min-Q indexes, decrement those indexes by 1. None in this case.
29. Neighbor Joining in Spark
Iterative algorithm with data dependencies between stages; optimize each iteration of the NJ algorithm.
(Diagram: description, visualization, and Spark transformations per step.)
Step i: Compute the distance matrix
Step ii: Calculate Q and find min Q
Steps v–vi: Update the distance matrix for the next iteration
Step vii: Set up other variables for the next iteration (recursive)
Spark transformations used: collectAsMap, map, filter, subtractByKey, union, lookup, reduceByKey
30. Evaluation on a Cluster
(Chart: run time in minutes vs. number of cores, from 0 to 60; the optimal point is marked.)
31. Execution Time Decomposition
Time distribution across Spark's metrics, 10-core vs. 20-core cluster:

Metric                      10-core cluster   20-core cluster
Executor computing time     42.62%            38.97%
Scheduler delay             23.99%            24.56%
Task deserialization time   25.17%            18.51%
Shuffle R/W time            7.60%             17.18%
Result serialization time   0.63%             0.78%
32. Run-time Repartition
(Chart: run time in seconds vs. number of iterations before repartition, comparing coalesce (shuffle = false) against repartition (shuffle = true); the new and previous optimal points are marked.)
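For reference, the two options in PySpark (the RDD and partition count are illustrative):

```python
rdd = sc.parallelize(range(10**6), 100)

fewer = rdd.coalesce(8)        # shuffle = false: merges partitions in place, cheap
balanced = rdd.repartition(8)  # shuffle = true: full shuffle, evenly rebalanced
```

Coalesce avoids a shuffle but can leave partitions uneven; repartition pays for a full shuffle to rebalance them, which is why the optimal repartition interval shifts in the chart.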
33. DTW & Dimension Reduction + DBSCAN
Dynamic Time Warping (DTW): a sequence alignment algorithm to measure the similarity of time series as their distance
FastDTW: an approximation of standard DTW with O(n) complexity
(Table: input rows of [label string, vector], e.g. X, Y, Z, each with components (1), (2), …, (n).)
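A sketch using the third-party fastdtw Python package (an assumption on my part; any O(n) DTW approximation would fit here):

```python
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

x = np.random.rand(100, 1)    # two time series of a feature
y = np.random.rand(100, 1)
distance, path = fastdtw(x, y, dist=euclidean)   # approximate DTW distance
```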
34. Fast Dynamic Time Warping in Spark
Generate tuples of indexed pairs of sequences, e.g. ((i, j), (X, Y)), where i & j are the pairwise indexes of the upper triangle and X, Y are the time series.
Stages: distance calculation → filling the distance matrix → formatting the RDD for input to PCA → PCA
Spark transformations: count, zipWithIndex, map, parallelize, map, groupByKey, sortByKey, map, mapValues
(Diagram: description, visualization, and Spark transformations.)
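A sketch of the indexed-pair generation in PySpark, reusing the SparkContext `sc` and the fastdtw package from the earlier sketches; the stand-in time series are illustrative:

```python
import numpy as np
from fastdtw import fastdtw

series = [np.random.rand(100) for _ in range(6)]   # stand-in time series
n = len(series)

# Upper-triangle indexed pairs ((i, j), (X, Y)), mirroring the slide.
pairs = sc.parallelize([((i, j), (series[i], series[j]))
                        for i in range(n) for j in range(i + 1, n)])
dists = pairs.mapValues(lambda xy: fastdtw(xy[0], xy[1])[0])   # distance calculation
rows = (dists.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
             .groupByKey()
             .sortByKey())              # fill the distance matrix row by row
```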
35. Evaluation on a Cluster
(Chart: FastDTW-PCA-DBSCAN run time in minutes vs. number of cores, from 0 to 80.)
37. CASE 3 : Trajectory Grouping Pattern Discovery in Parallel
Problem Set
Streaming processing on a real-time data set.
Design scalable, distributed processing analytics that discover the moving-together pattern over a large and continuous trajectory data stream.
Data Set
GeoLife, a real-world public GPS trajectory dataset collected by Microsoft, containing 178 real users' outdoor activities from April 2007 to October 2011.
17,621 trajectories and over 20 million location records.
Solution
Parallel system architecture of Apache Spark Streaming clusters running up to 20 AWS nodes.
Run an ensemble of algorithms:
▪ Snapshot model
▪ Slot model with two distance measures for clustering
Result
▪ Process up to 30,000 updates per second of moving objects within 14 seconds on an AWS cluster.
38. CASE 3 : Trajectory Grouping Pattern Discovery in Parallel
Snapshot Model: Gathering
Each snapshot consists of moving objects from all trajectories which have the same timestamped location points.
(Workflow diagram, stream data analytics: I. snapshot clustering, II. crowd detection, III. gathering generation. Batch model: archived trajectory data is partitioned, clusters are found per partition and merged, crowds are found, and gatherings are discovered. Streaming model: streaming data is window-based partitioned, clusters are found and merged, crowds are found incrementally, and gatherings are discovered.)
(Charts: throughputs and end-to-end delay.)
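A Spark Streaming sketch of the window-based partition step; the source, record format, and the `parse_point` / `find_clusters` helpers are hypothetical stand-ins for the talk's parsing and snapshot-clustering logic:

```python
from pyspark.streaming import StreamingContext

def parse_point(line):
    ts, oid, x, y = line.split(",")          # hypothetical record format
    return (ts, (oid, float(x), float(y)))

def find_clusters(points):
    return list(points)                      # stand-in for snapshot clustering

ssc = StreamingContext(sc, batchDuration=5)            # 5-second micro-batches
updates = ssc.socketTextStream("localhost", 9999)      # assumed stream of GPS updates

snapshots = (updates.map(parse_point)                  # -> (timestamp, (obj_id, x, y))
                    .window(windowDuration=15, slideDuration=5)
                    .groupByKey())                     # one snapshot per timestamp
snapshots.foreachRDD(lambda rdd: rdd.mapValues(find_clusters).count())

ssc.start()
ssc.awaitTermination()
```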
39. CASE 3 : Trajectory Grouping Pattern Discovery in Parallel
Slot Model: Trajectory Companion
Each trajectory slot consists of a range of timestamped location points of moving objects within a time period T.
(Diagram: stream data analytics workflow.)
Accuracy: the best performer is more accurate in finding a pattern, with comparable throughput and end-to-end delay.
40. Looking Forward
Parallelism is compounded by multiple factors:
Data partition and communication methods
Metrics and algorithms
Transformations and actions on data
Dependencies of data and model
Batch or streaming modes
A discovery pipeline should be self-adaptive and elastic:
Run-time adaptation to changes in (intermediate) data size and workload
Autoscaling of virtualized computing nodes
An ensemble approach to select the best-performing algorithm, metric, and pipeline
41. “Before I came here I was confused about the subject. Having listened to your lecture I am still confused. But on a higher level.”
Enrico Fermi (1901–1954)
Editor's notes
R1: What are the appropriate models of trajectory pattern discovery?
R2: What are the algorithmic approaches enabling parallel processing of trajectory data?
R3: What are the distance metrics for comparing trajectories?
R4: What are the parallelization design factors for efficient analysis?
Why we chose ensemble models in learning
The results are displayed as a radial cladogram in Figure 4. The cladogram was built