In the era of IoT and AI, distributed and parallel computing is embracing big-data-driven and algorithm-focused applications and services. Despite rapid progress in parallel frameworks, algorithms, and accelerated computing capacity, it remains challenging to deliver an efficient and scalable data analysis solution. This talk shares research experience on data pattern discovery in domain applications. In particular, the research scrutinizes key factors in analysis workflow design and data parallelism improvement on the cloud.
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
1. A TALE OF DATA PATTERN DISCOVERY IN PARALLEL
Yan Liu
Yan.liu@concordia.ca
Concordia University
Montreal, Quebec, Canada
3. Is Parallelism Necessary?
“I want to democratize AI.”
AI experts & HPC experts work side by side.
“… democratizing the processes underlying the creation of AI systems …”
4. Data Parallelism vs Model Parallelism
Data Parallelism – when the data is too large
Partition the workload over multiple devices.
Assume there are n workers (devices).
Each worker receives a copy of the complete model
and processes the model on 1/n of the data.
Model Parallelism – when the model is too large
Each worker/device holds only part of the model.
E.g., LSTM recurrent neural networks: each layer of the LSTM is assigned to one GPU.
There is no contention to update the shared model at the end of each iteration;
most of the communication happens when passing intermediate results between GPUs.
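A minimal sketch of the two schemes, assuming a PyTorch-style setup (the framework, layer sizes, and device count are illustrative, not from the talk):

```python
import torch
import torch.nn as nn

# Data parallelism: every GPU holds a full replica of the model;
# nn.DataParallel splits each batch across the visible devices.
model = nn.Sequential(nn.Linear(6241, 256), nn.ReLU(), nn.Linear(256, 6))
if torch.cuda.device_count() > 1:
    dp_model = nn.DataParallel(model.cuda())

# Model parallelism: each GPU holds only part of the model;
# intermediate activations are passed between GPUs.
class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(6241, 256).to("cuda:0")
        self.stage2 = nn.Linear(256, 6).to("cuda:1")

    def forward(self, x):
        h = torch.relu(self.stage1(x.to("cuda:0")))
        return self.stage2(h.to("cuda:1"))  # GPU0 -> GPU1 hand-off
```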
5. Key Factors of Parallel Analysis Pipelines
Algorithmic factors
Discovery models
Model parallelism
Distance metrics
Data Parallelism factors
Partition
Load balancing
Data locality and shuffling
Architecture factors
Batch vs. Streaming
Microservices
DevOps
Quality factors
Accuracy (comparing algorithm output against ground truth)
Scalability (throughput, latency, data intensity)
Stability
7. Programming Model
MapReduce
Map: <k1, v1> → list<k2, v2>
Shuffle
Reduce: <k2, list(v2)> → list<k3, v3>
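A plain-Python sketch of this signature, with word count as the usual example (the shuffle phase is simulated with a dictionary; this is not the Hadoop API):

```python
from collections import defaultdict

def map_fn(k1, v1):                      # <k1, v1> -> list<k2, v2>
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):               # <k2, list(v2)> -> list<k3, v3>
    return [(k2, sum(values))]

records = [(0, "data pattern discovery"), (1, "data in parallel")]
groups = defaultdict(list)
for k1, v1 in records:
    for k2, v2 in map_fn(k1, v1):
        groups[k2].append(v2)            # shuffle: group values by key
result = [kv for k2, vs in groups.items() for kv in reduce_fn(k2, vs)]
# [('data', 2), ('pattern', 1), ('discovery', 1), ('in', 1), ('parallel', 1)]
```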
Iterative algorithms
In Hadoop, large overheads are incurred due to reading/writing data to stable storage between iterations.
8. Spark – Resilient Distributed Datasets
A data-parallel programming model for fault-tolerant distributed datasets
Partitioned collections with caching
Transformations (define new RDDs), actions (compute results)
Restricted shared variables (broadcast, accumulators)
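A short PySpark sketch of these pieces (the names and numbers are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")
rdd = sc.parallelize(range(1, 1001), numSlices=8).cache()  # partitioned + cached

squares = rdd.map(lambda x: x * x)            # transformation: defines a new RDD
total = squares.reduce(lambda a, b: a + b)    # action: computes a result

factor = sc.broadcast(10)                     # read-only shared variable
counter = sc.accumulator(0)                   # workers may only add to it
rdd.foreach(lambda x: counter.add(1))
scaled = rdd.map(lambda x: x * factor.value).take(5)
```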
11. Ensemble of Models in Learning
One dataset can be analyzed by a number of algorithms.
Each algorithm has a number of hyperparameters to configure.
The training data can be organized as different input structures.
(Diagram: multi-class classification, regression, and real-time model selection.)
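As a sketch of this ensemble idea, the snippet below fits two candidate algorithms on the same dataset and keeps the better validation score; scikit-learn and the synthetic data are my stand-ins, not tools named in the talk:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset: 6 classes, many features.
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, n_classes=6)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_val, y_val)
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)   # select the best-performing model
```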
13. CASE 1 : Select A Better Deep Learner in Classification
Problem Set
Fit with multiple data scenarios.
Devices were deployed and run as legacy; their types can be unknown due to lack of documentation.
From collected field measurement data, identify the device type for each segment of device connection in a full-length physical link.
Accurate classification is essential for further network configuration and capacity optimization.
Data Set
• Real-world data collected by an industry vendor
• 600,000 data samples as inputs
• Each sample contains 6,241 features
• 6 device-type classes as output
14. CASE 1 : Select A Better Deep Learner in Classification
Solution
Ensemble of 2 neural network models running on two GPU nodes:
▪ Convolutional neural network (CNN)
▪ Residual network (ResNet)
Embed cross-validation for hyperparameter tuning in each model.
Result
Find a more accurate model (e.g., 33% accuracy vs. 92% accuracy).
Find an accurate model faster (e.g., 100 epochs vs. 5 epochs, 20x faster).
(Diagram: the same data feeds a CNN on GPU node 1 and a ResNet on GPU node 2. For app scenario A, 5 epochs vs. 100 epochs produce 100% validation accuracy; for app scenario B, 10,000 training samples produce 33% validation accuracy vs. 92% with 600,000 samples.)
15. Deep Learning and Customer Access to Credit
Article from Forbes.com on Feb 20, 2017
“We noticed a couple of years ago,” says Peter Maynard, Senior Vice President of Global Analytics at Equifax, “that we were not getting enough statistical lift from our traditional credit scoring methodology.”
“We spend a lot of time creating segments to build a model on. Determining the optimal segment could take sometimes 20% of the time that it takes to build a model. In the context of neural nets, those segments are the hidden layers—the neural net does it all for you. The machine is figuring out what are the segments and what are the weights in a segment instead of having an analyst do that. I find it really powerful.”
Instead of being hypotheses developed by data scientists, now the attributes are created by the deep learning process, on the basis of a much larger set of historical or “trended” data.
https://www.forbes.com/sites/gilpress/2017/02/20/equifax-and-sas-leverage-ai-and-deep-learning-to-improve-consumer-access-to-credit/
17. Data Parallelism – Control the Degree of Parallelism
Partition Size = Math.max(minSize, Math.min(goalSize, blockSize))
minSize: Hadoop parameter mapreduce.input.fileinputformat.split.minsize (default 1 byte); increasing it increases the partition size (decreases the number of partitions)
blockSize: dfs.block.size (128 MB) or fs.local.block.size (32 MB); the default partition size according to Hadoop 2.0
goalSize: totalInputSize / numPartitions; increasing numPartitions decreases the partition size (increases the number of partitions)
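A small sketch of this rule, plus how numPartitions is passed in PySpark (byte sizes are illustrative):

```python
def split_size(total_input_size, num_partitions, block_size, min_size=1):
    # Hadoop FileInputFormat rule: max(minSize, min(goalSize, blockSize))
    goal_size = total_input_size // num_partitions
    return max(min_size, min(goal_size, block_size))

# 600 MB of input, 30 requested partitions, 32 MB local block size:
print(split_size(600 * 2**20, 30, 32 * 2**20) // 2**20)   # -> 20 (MB per split)

# In PySpark, the requested partition count feeds into goalSize:
# rdd = sc.textFile("1987.csv,1988.csv", 30)
```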
18. Control the Degree of Parallelism through Partition
Example partitions
Input Files:
1987.csv - Size = 124,183 KB
1988.csv - Size = 489,297 KB
Scenario 1: Default
Partition size = 32 MB, Num of Partitions = 19
1987.csv [Partitions 0, 1, 2] size = 32 MB;
[Partition 3] size = 25 MB;
1988.csv [Partitions 4, 5, … 17] size = 32 MB;
[Partition 18] size = 29 MB;
Scenario 2: Decrease partition size
Partition size = 19 MB, Num of Partitions = 30
1987.csv [Partitions 0, 1, … 5] size = 19 MB;
[Partition 6] size = 21 MB;
1988.csv [Partitions 7, 8, … 28] size = 19 MB;
[Partition 29] size = 18 MB;
* Partition 6 contains data from both 1987.csv and 1988.csv
19. Data Skew
Unbalanced workloads tend to dominate the overall delay.
The default hash partitioner is not going to distribute the data uniformly;
some partitions contain more elements on the reducer side than others.
Optimization technique: further break down the skewed partitions into sub-partitions.
(Diagram: a skewed partition produces a slow task.)
20. Load Balancing
Proposed solution:
1. Add a random number x : [0, n], where n is the number of partitions, as a prefix to the key, such that k_new = x_k_old
2. Hash-partition with k_new
3. Process each partition
4. Remove the prefix
5. Perform further operations
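A PySpark sketch of this salting recipe for an additive reduce; the pair RDD, the salt-bucket count N, and the SparkContext `sc` are illustrative assumptions:

```python
import random

N = 10   # number of salt buckets (assumed)
pairs = sc.parallelize([("k1", 1), ("k1", 1), ("k2", 1)] * 1000)  # skewed toward k1

salted = pairs.map(lambda kv: (f"{random.randint(0, N - 1)}_{kv[0]}", kv[1]))
partial = salted.reduceByKey(lambda a, b: a + b)            # hash-partitioned on k_new
unsalted = partial.map(lambda kv: (kv[0].split("_", 1)[1], kv[1]))  # strip the prefix
final = unsalted.reduceByKey(lambda a, b: a + b)            # further operations
```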
21. Data Aggregation Methods
Approach: Minimize data shuffling by tuning the join operation
(Diagram: in a reduce-side (shuffle) join, map tasks read both Dataset A and Dataset B and shuffle records by key to produce the output. In a map-side join, the smaller dataset is collected, broadcast to a distributed cache, and joined locally in each worker node's map task.)
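A map-side (broadcast) join sketch in PySpark; the two RDDs are illustrative, and the smaller one is assumed to fit in driver memory:

```python
small_rdd = sc.parallelize([("a", "meta1"), ("b", "meta2")])
large_rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)] * 1000)

# Collect the smaller dataset once and ship it to every worker.
small = dict(small_rdd.collect())
small_bc = sc.broadcast(small)

def local_join(kv):
    k, v = kv
    if k in small_bc.value:              # join performed locally in the map task
        yield (k, (v, small_bc.value[k]))

joined = large_rdd.flatMap(local_join)   # no shuffle of the large dataset
```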
22. Case 2 : Network Analytics
23. Feature Selection and Clustering Analysis
Part of a supervised machine learning analysis to classify anomalies in a telecommunication network dataset
Data sets
- 305,078 rows represent devices
- 275 columns/features represent device ports
- Each cell measures the device port reading at a timestamp
Preprocessing before feeding into the neural network classifier:
Input to the classifier only what is truly necessary, rather than blindly throwing in all features
Clustering is unsupervised learning and does not require prior knowledge of the data
Increase robustness against noise or bad-quality data
Avoid scalability bottlenecks with bigger datasets
Accelerate iterations of the model-building process
24. Two Analysis Pipelines
Identify which features are strongly correlated at various aggregation levels
Pipeline I: Using the agglomerative neighbor-joining (NJ) clustering algorithm
The NJ algorithm was applied to the feature correlation matrix of size 275×275
Pipeline II: Using PCA and DBSCAN as another clustering approach
Using Dynamic Time Warping (DTW) to measure the distance between any two time series of a feature
Input the DTW distances to PCA + DBSCAN clustering
25. Visualization of Neighbor Join Clustering
Clustering of correlations of 275 features in a cladogram. The length of each branch in the tree represents the distance (d = 1 − |correlation|) between nodes.
26. Neighbor Joining
A clustering method for the creation of phylogenetic trees
- Takes a distance matrix as input
- Initialize all nodes of the tree
- Calculate the Q matrix: Q(i, j) = (n − 2)·d(i, j) − Σk d(i, k) − Σk d(j, k)
- Find the smallest Q value
- Join the pair of nodes corresponding to the smallest Q
- Update the original distance matrix with the new joined node
- Repeat until the tree is complete
NJ Example
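A minimal NumPy sketch of one iteration, using the standard NJ formulas above (the Spark version on a later slide follows the same steps):

```python
import numpy as np

def nj_step(D):
    """One neighbor-joining iteration on a symmetric distance matrix D (n >= 3)."""
    n = D.shape[0]
    row = D.sum(axis=1)
    Q = (n - 2) * D - row[:, None] - row[None, :]     # Q matrix
    np.fill_diagonal(Q, np.inf)
    i, j = np.unravel_index(np.argmin(Q), Q.shape)    # pair with the smallest Q
    d_u = 0.5 * (D[i] + D[j] - D[i, j])               # distances to the joined node U
    keep = [k for k in range(n) if k not in (i, j)]
    D_new = np.zeros((n - 1, n - 1))                  # U placed at the smallest index
    D_new[0, 1:] = D_new[1:, 0] = d_u[keep]
    D_new[1:, 1:] = D[np.ix_(keep, keep)]
    return D_new, (i, j)
```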
27. Neighbor Joining- Distance Matrix Generation
In-memory key-value tuples, e.g. ((index_i, index_j), correlation coefficient)
(Diagram: a standard 6×6 distance matrix data structure, where each cell contains a distance corresponding to an (i, j) pair. The upper triangle holds all the information we need, so the lower-diagonal tuples can be discarded to reduce unnecessary computation.)
28. Neighbor Joining - Distance Matrix Generation
(Diagram: successive states of the distance matrix as one join is applied.)
The minimum Q is found at indexes 1 and 6; remove any cell whose i or j equals 1 or 6.
Nodes 1 and 6 are “joined” as node “U”; place node “U” at the smallest index position, 1 in this case.
The grey boxes are the new distance values between node U and the remaining nodes 2 to 5.
If any cells have indexes bigger than the min-Q indexes, decrement those indexes by 1. None in this case.
29. Neighbor Joining in Spark
Iterative algorithm with data dependencies between stages; optimize each iteration of the NJ algorithm.
(Diagram: description, visualization, and Spark transformations per step.)
Step i: Compute the distance matrix
Step ii: Calculate Q and find min Q
Steps v–vi: Update the distance matrix for the next iteration
Step vii: Set up other variables for the next iteration (recursive)
Spark transformations used: collectAsMap, map, filter, subtractByKey, union, lookup, reduceByKey
30. Evaluation on a Cluster
(Chart: run time in minutes vs. number of cores, from 0 to 60; the optimal point is marked.)
31. Execution Time Decomposition
Time distribution across Spark's metrics, 10-core vs. 20-core cluster:

Metric                      10-core cluster   20-core cluster
Executor computing time     42.62%            38.97%
Scheduler delay             23.99%            24.56%
Task deserialization time   25.17%            18.51%
Shuffle R/W time            7.60%             17.18%
Result serialization time   0.63%             0.78%
32. Run-time Repartition
(Chart: run time in seconds vs. number of iterations before repartition, comparing coalesce (shuffle = false) against repartition (shuffle = true); the new and previous optimal points are marked.)
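For reference, the two options in PySpark (the RDD and partition count are illustrative):

```python
rdd = sc.parallelize(range(10**6), 100)

fewer = rdd.coalesce(8)        # shuffle = false: merges partitions in place, cheap
balanced = rdd.repartition(8)  # shuffle = true: full shuffle, evenly rebalanced
```

Coalesce avoids a shuffle but can leave partitions uneven; repartition pays for a full shuffle to rebalance them, which is why the optimal repartition interval shifts in the chart.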
33. DTW & Dimension Reduction + DBSCAN
Dynamic Time Warping (DTW): a sequence alignment algorithm to measure the similarity of time series as their distance
FastDTW: an approximation of standard DTW with O(n) complexity
(Table: input rows of [label string, vector], e.g. X, Y, Z, each with components (1), (2), …, (n).)
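A sketch using the third-party fastdtw Python package (an assumption on my part; any O(n) DTW approximation would fit here):

```python
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

x = np.random.rand(100, 1)    # two time series of a feature
y = np.random.rand(100, 1)
distance, path = fastdtw(x, y, dist=euclidean)   # approximate DTW distance
```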
34. Fast Dynamic Time Warping in Spark
Generate tuples of indexed pairs of sequences, e.g. ((i, j), (X, Y)), where i & j are the pairwise indexes of the upper triangle and X, Y are the time series.
Stages: distance calculation → filling the distance matrix → formatting the RDD for input to PCA → PCA
Spark transformations: count, zipWithIndex, map, parallelize, map, groupByKey, sortByKey, map, mapValues
(Diagram: description, visualization, and Spark transformations.)
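A sketch of the indexed-pair generation in PySpark, reusing the SparkContext `sc` and the fastdtw package from the earlier sketches; the stand-in time series are illustrative:

```python
import numpy as np
from fastdtw import fastdtw

series = [np.random.rand(100) for _ in range(6)]   # stand-in time series
n = len(series)

# Upper-triangle indexed pairs ((i, j), (X, Y)), mirroring the slide.
pairs = sc.parallelize([((i, j), (series[i], series[j]))
                        for i in range(n) for j in range(i + 1, n)])
dists = pairs.mapValues(lambda xy: fastdtw(xy[0], xy[1])[0])   # distance calculation
rows = (dists.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
             .groupByKey()
             .sortByKey())              # fill the distance matrix row by row
```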
35. Evaluation on a Cluster
(Chart: FastDTW-PCA-DBSCAN run time in minutes vs. number of cores, from 0 to 80.)
37. CASE 3 : Trajectory Grouping Pattern Discovery in Parallel
Problem Set
Streaming processing on a real-time data set.
Design scalable, distributed processing analytics that discover the moving-together pattern over a large and continuous trajectory data stream.
Data Set
GeoLife, a real-world public GPS trajectory dataset collected by Microsoft, containing 178 real users' outdoor activities from April 2007 to October 2011.
17,621 trajectories and over 20 million location records.
Solution
Parallel system architecture of Apache Spark Streaming clusters running up to 20 AWS nodes.
Run an ensemble of algorithms:
▪ Snapshot model
▪ Slot model with two distance measures for clustering
Result
▪ Process up to 30,000 updates per second of moving objects within 14 seconds on an AWS cluster.
38. CASE 3 : Trajectory Grouping Pattern Discovery in Parallel
Snapshot Model: Gathering
Each snapshot consists of moving objects from all trajectories which have the same timestamped location points.
(Workflow diagram, stream data analytics: I. snapshot clustering, II. crowd detection, III. gathering generation. Batch model: archived trajectory data is partitioned, clusters are found per partition and merged, crowds are found, and gatherings are discovered. Streaming model: streaming data is window-based partitioned, clusters are found and merged, crowds are found incrementally, and gatherings are discovered.)
(Charts: throughputs and end-to-end delay.)
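A Spark Streaming sketch of the window-based partition step; the source, record format, and the `parse_point` / `find_clusters` helpers are hypothetical stand-ins for the talk's parsing and snapshot-clustering logic:

```python
from pyspark.streaming import StreamingContext

def parse_point(line):
    ts, oid, x, y = line.split(",")          # hypothetical record format
    return (ts, (oid, float(x), float(y)))

def find_clusters(points):
    return list(points)                      # stand-in for snapshot clustering

ssc = StreamingContext(sc, batchDuration=5)            # 5-second micro-batches
updates = ssc.socketTextStream("localhost", 9999)      # assumed stream of GPS updates

snapshots = (updates.map(parse_point)                  # -> (timestamp, (obj_id, x, y))
                    .window(windowDuration=15, slideDuration=5)
                    .groupByKey())                     # one snapshot per timestamp
snapshots.foreachRDD(lambda rdd: rdd.mapValues(find_clusters).count())

ssc.start()
ssc.awaitTermination()
```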
39. CASE 3 : Trajectory Grouping Pattern Discovery in Parallel
Slot Model: Trajectory Companion
Each trajectory slot consists of a range of timestamped location points of moving objects within a time period T.
(Diagram: stream data analytics workflow.)
Accuracy: the best performer is more accurate in finding a pattern, with comparable throughput and end-to-end delay.
40. Looking Forward
Parallelism is compounded by multiple factors:
Data partition and communication methods
Metrics and algorithms
Transformations and actions on data
Dependencies of data and model
Batch or streaming modes
A discovery pipeline should be self-adaptive and elastic:
Run-time adaptation to changes in (intermediate) data size and workload
Autoscaling of virtualized computing nodes
An ensemble approach to select the best-performing algorithm, metric, and pipeline
41. “Before I came here I was confused about the subject. Having listened to your lecture I am still confused. But on a higher level.”
Enrico Fermi (1901–1954)
Editor's notes
R1: What are the appropriate models of trajectory pattern discovery?
R2: What are the algorithmic approaches enabling parallel processing of trajectory data?
R3: What are the distance metrics for comparing trajectories?
R4: What are the parallelization design factors for efficient analysis?
Why we chose ensemble models in learning
The results are displayed as a radial cladogram in Figure 4. The cladogram was built