Many time series data mining problems require
subsequence similarity search as a subroutine. While this can
be performed with any distance measure, and dozens of
distance measures have been proposed in the last decade, there
is increasing evidence that Dynamic Time Warping (DTW) is
the best measure across a wide range of domains. Given
DTW’s usefulness and ubiquity, there has been a large
community-wide effort to mitigate its relative lethargy.
Proposed speedup techniques include early abandoning
strategies, lower-bound based pruning, indexing and
embedding. In this work we argue that we are now close to
exhausting all possible speedup from software, and that we
must turn to hardware-based solutions if we are to tackle the
many problems that are currently untenable even with
state-of-the-art algorithms running on high-end desktops. With this
motivation, we investigate both GPU (Graphics Processing
Unit) and FPGA (Field Programmable Gate Array) based
acceleration of subsequence similarity search under the DTW
measure. As we shall show, our novel algorithms allow GPUs,
which are typically bundled with standard desktops, to achieve
two orders of magnitude speedup. For problem domains which
require even greater scale up, we show that FPGAs costing just
a few thousand dollars can be used to produce four orders of
magnitude speedup. We conduct detailed case studies on the
classification of astronomical observations and similarity
search in commercial agriculture, and demonstrate that our
ideas allow us to tackle problems that would be simply
untenable otherwise.
1. University of Naples “Parthenope”
Accelerating Dynamic Time Warping
Subsequence Search with GPU
Davide Nardone
0120/131
Academic Year 2015/16
2. Summary
1. Introduction to Time Series
2. Time Series contexts and tasks
3. Basic idea: Subsequence search
4. DTW: Definition and Background
5. Parallelizing DTW Subsequence search
6. Evaluation
7. Experimental Case Studies
8. Conclusions and Future Work
3. What are Time Series ?
Time series are collections of observations made sequentially in time:
T = t1, t2, t3, …, tn
where each ti is a real number.
(Figure: a sample time series, with Time on the x-axis and Value on the y-axis.)
5. Data Mining and Machine Learning tasks
Classification
Control
Clustering
Motif Discovery
Anomaly Detection
(Figure: example time series for three classes A, B and C.)
6. Basic idea: Subsequence search
Given a query series Q:
find the occurrences of Q in a time series T that are most similar in terms of a distance measure.
The smallest computed distance identifies the best occurrence of Q found in the time series T; occurrences can also be reported whenever the distance falls below a threshold.
7. Finding a Similarity Measure
Euclidean Distance (ED) often produces pessimistic similarity measures
when it encounters distortion in the time axis. If two signals represent
the same pattern but are at different relative phases, they must first be
synchronized.
In this case, some sort of synchronization is needed, and the way to do it
is to use Dynamic Time Warping (DTW).
(Figure: point-to-point alignment under Euclidean Distance vs. warped alignment under Dynamic Time Warping.)
8. DTW: Definition and Background
DTW is a distance measure for comparing two time series, say C and Q.
It is defined as follows:
D(C, Q) = DTW(n, m)
DTW(i, j) = d(i, j) + min{ DTW(i-1, j), DTW(i, j-1), DTW(i-1, j-1) }
where d(i, j) is a distance such as |ci - qj| or (ci - qj)².
As the initial condition we set DTW(1, 1) = (c1 - q1)², and undefined terms
are assumed to be +∞.
9. The DTW distance: an example
Consider the following two time series.
On the left is the matrix d of pointwise distances (ED); on the right, the DTW matrix.
10. What DTW computes
Considering the distance matrix d, in order to align the two time series we must
find a warping path:
• Start from (1, 1) and end at (n, m);
• Take one step at a time;
• At each step, move only by increasing i, j, or both;
• Sum all the distances found along the warping path.
In this random example the cost is 21.
How do we find the optimal warping path?
11. What DTW computes (cont.)
Each warping path is a way to "align" (match) two time series, such that every sample
is matched with at least one sample of the other series.
The DTW distance is the (square root of the) cost of the optimal warping path (√15 here).
• The "Euclidean path" moves only along the main diagonal, and costs √29 in this example.
The recursive definition allows DTW to be computed in O(n×m) time, even though the
number of warping paths is exponential.
• This is a classic example of a dynamic programming algorithm.
Once the DTW matrix has been filled, the optimal warping path can be recovered by
backtracking from DTW(n, m).
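The recurrence above can be sketched in a few lines of Python (a sequential sketch of the dynamic program, not the deck's CUDA kernel; the function name and the two-row storage trick are illustrative):

```python
def dtw(c, q):
    """DTW distance between series c and q, filling the DP matrix of the
    recurrence DTW(i, j) = d(i, j) + min of the three neighbors, with
    d(i, j) = (c[i] - q[j])**2.  Only two rows are kept, the O(2*m)
    space reduction mentioned later in the deck."""
    INF = float("inf")
    prev = [0.0] + [INF] * len(q)      # row 0: only DTW(0, 0) is defined
    for ci in c:
        cur = [INF] * (len(q) + 1)
        for j, qj in enumerate(q, 1):
            cur[j] = (ci - qj) ** 2 + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1] ** 0.5             # square root of the optimal path cost

# Identical series align perfectly along the diagonal:
print(dtw([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # 0.0
```

Note that the +∞ sentinels play exactly the role of the "undefined terms" in the definition: they forbid paths that leave the matrix.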
13. Definition of the Problem
Given a time series T = t1, t2, …, tn and a query Q = q1, q2, …, qm:
Find the subsequence Cs,m of T (any contiguous run of m samples
starting at s, i.e. ts, ts+1, …, ts+m-1) such that DTW(Cs,m, Q) is minimum.
Computational time: O(n×m²)
Computational space: O(n×m), reducible to O(2×m)
Parallelizable
14. Why normalize ?
(Figure: a query and its distance profile along a long trace; matches are reported where the distance drops below a threshold of 30.)
Wandering baseline problem: because of drift introduced by the machinery's
cables moving during the recording, or other problems associated with it,
identical patterns may not be recognized as the same.
Every subsequence Cs,m of T, as well as the query Q, must therefore be
normalized before they can be compared.
Z-normalization: x → (x − μ)/σ
• Shift and scale invariance
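The normalization step can be sketched as follows (a plain-Python illustration; the function name is ours):

```python
def znorm(x):
    """Z-normalize a subsequence: subtract the mean and divide by the
    standard deviation, making the comparison shift and scale invariant."""
    mu = sum(x) / len(x)
    sigma = (sum((v - mu) ** 2 for v in x) / len(x)) ** 0.5
    if sigma == 0:                 # constant subsequence: nothing to scale
        return [0.0] * len(x)
    return [(v - mu) / sigma for v in x]
```

A shifted and rescaled copy of a pattern normalizes to exactly the same values, which is what the wandering-baseline problem requires.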
15. Parallelizing DTW Subsequence Search
(Figure: a query Q slid along a long time series T.)
Slide a window of fixed size;
Compute the DTW measure between the query Q and the z-normalized
subsequence Cs,m of the time series T;
Update the minimum where necessary.
Assign each DTW computation to a single thread!
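The sliding-window search can be sketched sequentially as follows (in the CUDA version each loop iteration below is one thread's work item; all names are illustrative):

```python
def dtw(c, q):
    """O(n*m) DTW with squared pointwise distances (see slide 8)."""
    INF = float("inf")
    prev = [0.0] + [INF] * len(q)
    for ci in c:
        cur = [INF] * (len(q) + 1)
        for j, qj in enumerate(q, 1):
            cur[j] = (ci - qj) ** 2 + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1] ** 0.5

def znorm(x):
    """Z-normalize a subsequence (constant windows map to all zeros)."""
    mu = sum(x) / len(x)
    sigma = (sum((v - mu) ** 2 for v in x) / len(x)) ** 0.5 or 1.0
    return [(v - mu) / sigma for v in x]

def best_match(t, q):
    """Slide a window of len(q) over t; z-normalize each window and keep
    the one with the smallest DTW distance to the (normalized) query."""
    qz, m = znorm(q), len(q)
    best_s, best_d = -1, float("inf")
    for s in range(len(t) - m + 1):        # the n - m + 1 subsequences
        d = dtw(znorm(t[s:s + m]), qz)
        if d < best_d:                     # update the minimum if necessary
            best_s, best_d = s, d
    return best_s, best_d
```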
16. CUDA Thread Organization
Because of the nature of the problem, the CUDA threads are distributed
(within each block) only along the x-axis, and likewise the blocks are
allocated only along the x-axis of the grid.
The number of blocks to allocate depends on the lengths of T and Q:
Grid_size = number_of_subsequences / block_size
where number_of_subsequences = n − m + 1 and block_size is the number of
threads per block.
These are key parameters which influence the computational time of the
GP-GPU DTW!
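The grid sizing works out to a ceiling division, so that the last partial block is still launched (a sketch; the helper name is ours):

```python
def grid_size(n, m, block_size):
    """Blocks along the grid's x-axis: the n - m + 1 subsequences divided
    by the threads per block, rounded up so a final partial block of
    threads is still launched."""
    number_of_subsequences = n - m + 1
    return -(-number_of_subsequences // block_size)   # ceiling division

print(grid_size(1_080_000, 1_000, 256))   # 4215 blocks
```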
17. Main stages
1. The CPU copies the whole time series T to the global memory of the GPU.
• Since the query Q is fixed, we first copy it into global memory; then,
depending on the DTW algorithm version, we either keep it there or store
it in shared memory.
2. The CPU calls the GPU kernel.
• Every kernel thread operates on a specific sliding window in two steps:
1. Accessing the sliding window to compute its mean and variance;
2. Computing the normalized DTW distance to the query.
3. The CPU copies the output back from the GPU.
• The algorithm computes the minimum distance to obtain the subsequence
of T most similar to the query Q.
18. DTW: Global Memory vs Shared Memory
The only difference between the two kernel functions concerns the way the
time series Q and the warping matrix are accessed and stored.
Global Memory:
float warping_mat[WS][2];
float *query;
Shared Memory:
__shared__ float warping_mat[WS][2];
extern __shared__ float query_sm[];
Note: in both versions warping_mat is allocated statically (WS is the
window size) because, in doing so, the CUDA compiler is likely to place
the array in the register file, which is much faster than allocating it
dynamically (e.g. with cudaMalloc()).
Because Q is a fixed time series and by the problem definition its size
does not change during execution (plus it is much smaller than T), it can
fit in the shared memory of the GPU device, where read/write operations
are up to 150x faster than global memory.
19. Evaluation
The processing units used for the purpose are:
1. CPU: Intel Core i7-860 CPU at 2.80GHz;
2. GPU: NVIDIA Quadro K5000 with 1,536 cores.
In order to get significant results, the execution-time analysis was
performed by varying three critical parameters:
1. The length of the time series T (2,500, 15,000, …, 1,080,000);
2. The length of the query Q (100, 200, …, 1,000);
3. The number of threads per block (64, 128, …, 1,024).
Since all the combinations of these parameters yield many graphs, only the
most meaningful ones are shown here, together with graphs illustrating
particular circumstances where necessary.
20. Consideration on the GPU execution time (1)
Recalling the equation
Grid_size = number_of_subsequences / block_size
it follows that as the number of subsequences increases and the block size
decreases, the grid size increases.
How does this influence the execution time?
• A growing number of blocks in the grid corresponds to a higher effort by the GPU
to access all the shared data structures (DTW-shared version).
• Intuitively, this effect should show up more frequently as the block_size gets
smaller and the length of the query Q gets larger!
21. Consideration on the GPU execution time (2)
(Figure: execution-time curves for Block_size = 128, 64 and 32.)
This phenomenon for the DTW-shared algorithm begins to appear as the
block_size decreases and the length of the Q time series increases.
The problem is due to the high overhead incurred by the CUDA kernel when
accessing the shared data structures, which makes the DTW-global
algorithm's performance quite similar to the DTW-shared version, or even
better.
22. Problem cases
Based upon the three parameters on which the algorithm's performance depends:
• T series length;
• Q series length;
• Block_size;
we make some considerations about the speed-up and the parallel scalability of the
DTW algorithm.
Hereafter we will only consider the DTW-shared algorithm.
In particular, we will look at the problem in two different ways:
1. By fixing the T time series length and varying the Q time series length;
2. By fixing the Q time series length and varying the T time series length.
23. Case 1: Speed up
In this first case, we look at the problem by fixing the T time series length
and varying the Q time series length.
Average speed-up (over the Q time series lengths), by T series length and
number of threads per block:
T series length |   64    |   128   |   256   |   512   |  1024
2,500           |  67.170 |  67.012 |  66.674 |  65.774 |  58.747
15,000          | 128.593 | 187.926 | 230.083 | 244.225 | 272.303
90,000          | 138.777 | 227.739 | 294.817 | 319.392 | 314.872
540,000         | 142.175 | 236.758 | 308.555 | 334.440 | 327.428
1,080,000       | 142.336 | 237.772 | 309.799 | 333.535 | 328.111
It can be observed that, down each column, the speed-up almost always
grows, indicating fair parallel scalability for the problem.
24. Case 1: Parallel scalability
This measurement indicates how efficient an application is when using an
increasing number of parallel processing elements (threads).
The longer T is, the flatter its speed-up curve (in any configuration).
25. Case 2: Speed up
In this second case, we look at the problem by fixing the Q time series
length and varying the T time series length.
Average speed-up (over the T time series lengths), by Q series length and
number of threads per block:
Q series length |   64   |  128   |  256   |  512   |  1024
100             | 208.57 | 271.85 | 268.98 | 265.88 | 266.41
200             | 208.05 | 269.29 | 266.38 | 262.89 | 264.2
300             | 163.31 | 247.27 | 261.89 | 258.74 | 259.77
400             | 147.33 | 232.49 | 264.5  | 261.05 | 262.07
500             | 119.85 | 206.31 | 262.28 | 257.45 | 258.43
600             |  94.33 | 162.41 | 238.87 | 256.11 | 257.61
700             |  80.931| 143.85 | 223.26 | 258.08 | 257.4
800             |  80.818| 144.5  | 224.4  | 259.84 | 259.14
900             |  67.828| 118.7  | 205.25 | 260.03 | 259.59
1,000           |  67.079| 117.75 | 204.04 | 258.67 | 258.3
VARIANCE        | 3035.4 | 3731.4 | 677.5  | 8.0547 | 8.9986
26. Case 2: Parallel scalability
This measurement indicates how efficient an application is when using an
increasing number of parallel processing elements (threads).
The CUDA-thread configurations with 512 and 1,024 threads appear constant,
showing neither a benefit nor a loss of speed-up.
27. Best CUDA-threads Configuration
In order to assess the best thread-per-block configuration (for any task of
the problem), we clustered and compared all the results by varying the Q
time series length.
28. Best CUDA-threads Configuration (cont.)
By taking the minimum value in each cluster (over the 10 runs), it was
possible to visualize the 3D trend in algorithm performance and to better
understand which CUDA-thread configuration best fits a particular problem.
29. Experimental Case Studies
Three case studies are considered here, involving some of the previous tasks:
1. Case Study in Entomology (Subsequence search);
2. Case Study in Cardiology (Anomaly Detection);
3. Case Study in Astronomy (Classification).
DTW subsequence similarity search is a key problem in many higher-level Data
Mining and Machine Learning tasks such as motif discovery, anomaly detection,
association discovery and classification.
Many research projects use DTW subsequence similarity search as a subroutine,
and could greatly benefit from significantly improved performance.
30. Case Study in Entomology
Many species of insect feed by inserting their stylet into a plant and sucking out
sap. While this behavior in itself is generally not harmful to the plant, if one
plant has a disease, the insect will transmit it from plant to plant.
As shown in a study conducted in [1], as soon as the insect's stylet penetrates
the plant, an Electrical Penetration Graph (EPG) signal occurs and can then be
amplified and recorded.
One critical task for researchers is to search for patterns in long traces that
exhibit variability in the timing of transitions, a process for which DTW is
well suited.
31. Case Study in Entomology (cont.)
Since these time series are significant in length, to test GPU scalability in
this domain we searched for the query shown below (500 samples in length) in an
EPG trace of length 1,499,000.
The GPU solution took ≈12.30 seconds, while the CPU took ≈68 minutes, which is
too slow for a real application in the entomology field. The GPU solution thus
turned out to be ≈331x faster than the CPU one.
In addition, these times are for a single query-by-content; if we needed to
perform 1,000 such searches (possibly of different sizes), the CPU version
would take approximately 34 days! [4]
32. Case Study in Cardiology
Congestive heart failure is a complex clinical syndrome that occurs when the heart
is unable to pump sufficiently to maintain the blood flow the body needs.
By constantly monitoring a patient's Electrocardiogram (ECG), it is possible to
recognize irregular patterns and prevent a possible or imminent heart failure.
In Data Mining this task is called Anomaly Detection (or Outlier Detection), and
it is illustrated here by means of DTW.
33. Case Study in Cardiology (cont.)
To test the validity of the method, we considered the control and sample ECG
signals of a patient affected by this pathology [2][3], and ran our GPU and
CPU algorithms.
ALGORITHM STEPS
1. Perform the subsequence similarity search between the control signal and
all the subsequences of the sample signal (m-n+1 of them).
2. Perform a thresholding step so as to preserve all DTW measures under a
certain value (∂=3).
3. Overlap on the sample signal only the curves whose DTW value passed the
threshold.
(Figure: control and sample signals, the DTW distance graph, and the detected outlier.)
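Steps 1 and 2 above can be sketched sequentially as follows (an illustration, not the deck's GPU code; function names are ours, and for simplicity the subsequences are compared un-normalized):

```python
def dtw(c, q):
    """O(n*m) DTW with squared pointwise distances (see slide 8)."""
    INF = float("inf")
    prev = [0.0] + [INF] * len(q)
    for ci in c:
        cur = [INF] * (len(q) + 1)
        for j, qj in enumerate(q, 1):
            cur[j] = (ci - qj) ** 2 + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1] ** 0.5

def below_threshold(sample, control, delta=3.0):
    """DTW of the control pattern against every subsequence of the sample,
    keeping only (start, distance) pairs under the threshold delta; regions
    of the sample left uncovered are the anomaly (outlier) candidates."""
    m = len(control)
    keep = []
    for s in range(len(sample) - m + 1):
        d = dtw(sample[s:s + m], control)
        if d < delta:
            keep.append((s, d))
    return keep
```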
34. Case Study in Cardiology (cont.)
Finally, to test scalability on long ECG sequences, the GPU version was used
for both the DTW subsequence similarity search and the thresholding algorithm.
For this purpose, sequences of around 1, 6 and 12 hours were used.
Subsequence similarity search:
Time length | CPU     | GPU
1 hr.       | 23 min. | 4.32 sec.
6 hr.       | 2.3 hr. | 25 sec.
12 hr.      | 4.7 hr. | 49 sec.
Thresholding:
Time length | CPU      | GPU
1 hr.       | 0.588 ms | 1.03e-2 ms
6 hr.       | 3.573 ms | 4.83e-2 ms
12 hr.      | 7.149 ms | 9.37e-2 ms
These execution times are for a single query-by-content.
35. Case Study in Astronomy
A star light curve is a graph which shows the brightness of a stellar object
over a period of time.
The reasons why stars change their brightness include planetary transits and
cataclysmic or explosive events (novae or supernovae).
From 1855 to the present day many star light curves have been collected, and
an obvious thing to do is to classify them.
Astronomers have an algorithm called universal phasing to produce a canonical
alignment for the light curves, but it has problems when applied to large
datasets, and it does not work as well as they believe.
However, by using the idea of subsequence similarity search it is possible to
solve a univariate, supervised classification problem.
While it is possible to extract a single light curve cycle, there is no
well-defined starting point.
36. Case Study in Astronomy (cont.)
The basic idea is to compare each curve of the testing set against every curve
of the training set using the DTW measure, and to assign to the test curve the
label of the training curve whose DTW distance is minimum (i.e. the most
similar one).
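The scheme above is 1-nearest-neighbor classification under DTW, which can be sketched as (illustrative names; dtw as defined on slide 8):

```python
def dtw(c, q):
    """O(n*m) DTW with squared pointwise distances (see slide 8)."""
    INF = float("inf")
    prev = [0.0] + [INF] * len(q)
    for ci in c:
        cur = [INF] * (len(q) + 1)
        for j, qj in enumerate(q, 1):
            cur[j] = (ci - qj) ** 2 + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1] ** 0.5

def classify(test_curve, training_set):
    """Assign to the test curve the label of the training curve whose DTW
    distance is minimum (training_set: list of (curve, label) pairs)."""
    curve, label = min(training_set, key=lambda cl: dtw(test_curve, cl[0]))
    return label
```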
37. Case Study in Astronomy (cont.)
Following the work in [4], we used a three-class star light curve dataset that
had been universally phased at the Time Series Center at Harvard University.
In order to compare CPU and GPU performance, a testing set of just 128 objects
and a training set of 1,024 objects were created.
38. Case Study in Astronomy (cont.)
While it is possible to extract a single light curve cycle, there is no
well-defined starting point; therefore, we also tested the so-called Universal
Phasing Assumption.
Rotation-Invariant DTW, O(n³):
• Try all possible rotations to find the minimum possible distance;
• Compute the DTW between the Q series and each cyclic shift of the C time series.
(Figure: two light curves C and Q that differ by a rotation; DTW distance 53.49, rDTW distance 0.)
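Rotation-invariant DTW can be sketched by brute force over all cyclic shifts (a sequential O(n³) illustration; names ours, dtw as on slide 8):

```python
def dtw(c, q):
    """O(n*m) DTW with squared pointwise distances (see slide 8)."""
    INF = float("inf")
    prev = [0.0] + [INF] * len(q)
    for ci in c:
        cur = [INF] * (len(q) + 1)
        for j, qj in enumerate(q, 1):
            cur[j] = (ci - qj) ** 2 + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1] ** 0.5

def rdtw(c, q):
    """Minimum DTW over every cyclic shift (rotation) of c, making the
    comparison independent of the light curve's starting point."""
    return min(dtw(c[r:] + c[:r], q) for r in range(len(c)))
```

As in the figure, two curves that are merely rotations of each other have rDTW distance 0 even when their plain DTW distance is large.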
39. Case Study in Astronomy (cont.)
This problem had never been tested before, presumably because the rotation-
invariant version of DTW (rDTW) is O(n³), which is quite untenable for a
normal CPU. It was therefore interesting to test this task on a GPU.
Also in this case, a testing set of just 128 objects and a training set of
1,024 objects were created.
40. Case Study in Astronomy (cont.)
As shown in the table below, the results are quite interesting:
Method | Accuracy | Time GPU   | Time CPU
ED     | 80.47%   | <1 sec.    | 2.5 sec.
rED    | 81.25%   | 14.6 sec.  | 43.6 min.
DTW    | 88.28%   | 1.8 min.   | 35.4 min.
rDTW   | 91.4%    | 3.37 hours | 42 days
It is important to point out that the norm-2 was used to compute the distance
matrix d. Using different distance measures, such as norm-1, we obtained
different results (i.e. DTW 86.7% and rDTW 84.4%).
41. Conclusions and Future Work
Subsequence similarity search is an important problem that has attracted
great interest.
CPU solutions cannot provide adequate speed for these problems, while GPU
solutions have proven to be a good tool for handling problems that were
previously computationally untenable.
In addition, three different case studies have shown that the GP-GPU DTW
version leads to very significant results, both in execution time and in
accuracy.
Future work includes revisiting current algorithms that use DTW as a
subroutine, and the implementation of a GP-GPU version of Multi-Dimensional
Dynamic Time Warping (MD-DTW).
42. References
[1] McLean, D. L., & Kinsey, M. G. (1964). A technique for electronically
recording aphid feeding and salivation. Nature, 202, 1358-1359.
[2] Goldberger, A. L., Amaral, L. A., Glass, L., Hausdorff, J. M., Ivanov,
P. C., Mark, R. G., & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and
PhysioNet: components of a new research resource for complex physiologic
signals. Circulation, 101(23), e215-e220.
[3] Baim, D. S., Colucci, W. S., Monrad, E. S., Smith, H. S., Wright, R. F.,
Lanoue, A., & Braunwald, E. (1986). Survival of patients with severe
congestive heart failure treated with oral milrinone. Journal of the American
College of Cardiology, 7(3), 661-670.
[4] Sart, D., et al. (2010). Accelerating dynamic time warping subsequence
search with GPUs and FPGAs. In Data Mining (ICDM), 2010 IEEE 10th
International Conference on. IEEE.