1. IT'S ABOUT TIME !!
Presented By-
P.SHANMUKHA SREENIVAS
M.MGT 1
2. AN OVERVIEW ON TIME SERIES DATA MINING
OUTLINE
1. Introduction
2. Similarity Search in Time Series Data
3. Feature-based Dimensionality Reduction
4. Discretization
5. Other Time Series Data Mining Tasks
6. Conclusions
3. Introduction
A time series is a collection of observations made sequentially in time.
Example series (CNX IT returns): 6145.45, 6128.75, 6142.7, 6201.2, 6151.9, 6050.95, 5917.75, 5855.95, 5984, 5993.9, 5934.8, 5920.05, 5950, 5950.7, 5963.8, 6141.15, ..., 6471.4, 6511.7, 6563.25, 6558.45, 6492.7, 6546.75
Examples: Financial time series, scientific time series
4. TIME SERIES SIMILARITY SEARCH
Some examples:
- Identifying companies with similar patterns of growth.
- Determining products with similar selling patterns.
- Discovering stocks with similar movement in stock prices.
- Finding out whether a musical score is similar to one of a set
of copyrighted scores.
5. Major Time Series Data Mining Tasks
• Indexing
• Clustering
• Classification
• Prediction
• Anomaly Detection
Indexing and clustering make explicit use of a distance measure.
The other tasks make implicit use of a distance measure.
6. TIME SERIES SIMILARITY SEARCH
DISTANCE MEASURES
Euclidean distance
Dynamic Time Warping
Other distance measures
o Threshold query based similarity search (TQuEST)
o Minkowski Distance
7. Euclidean Distance Metric
Given two time series $Q = q_1 \dots q_n$ and $C = c_1 \dots c_n$, their Euclidean distance is defined as:
$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
[Figure: series Q and C compared point by point, giving D(Q,C)]
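A minimal sketch of the Euclidean distance between two equal-length series, using only the Python standard library. The function name and the toy values are assumptions made for illustration; they do not come from the slides.

```python
import math


def euclidean_distance(q, c):
    """Euclidean distance between two series Q and C of the same length n."""
    if len(q) != len(c):
        raise ValueError("Q and C must have the same length n")
    # Sum the squared point-by-point differences, then take the square root.
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))


# Toy values chosen for illustration only.
Q = [0.0, 3.0, 1.0]
C = [0.0, 0.0, 5.0]
print(euclidean_distance(Q, C))  # sqrt(0 + 9 + 16) = 5.0
```

Because the points are compared strictly one to one, the two series must have the same length.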
8. What’s wrong with Euclidean Distance?
Problem: similar sequences may be shifted and have different scales.
Fix: normalize each time series before measuring the distance between them (Goldin and Kanellakis, 1995):
$x_i' = \frac{x_i - \mu}{\sigma}$
where $\mu$ is the mean and $\sigma$ the standard deviation of the series.
Remaining problem: what if a sequence is stretched or compressed along the time axis?
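The normalization step can be sketched as follows. This assumes the population standard deviation is used and maps a constant series to all zeros; neither choice is stated on the slide, and the function name is hypothetical.

```python
import statistics


def z_normalize(series):
    """Shift a series to mean 0 and rescale it to standard deviation 1."""
    mu = statistics.fmean(series)
    # Assumption: population standard deviation (the slide does not say which).
    sigma = statistics.pstdev(series, mu)
    if sigma == 0:
        # Assumption: a constant series has no shape, so return all zeros.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


# Toy values: b has the same shape as a, but is shifted and scaled.
a = [1.0, 2.0, 3.0]
b = [110.0, 120.0, 130.0]
print(z_normalize(a))
print(z_normalize(b))
```

After normalization the two toy series coincide (up to floating-point rounding), so their Euclidean distance becomes zero.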
9. Dynamic Time Warping (Berndt et al.)
Dynamic Time Warping is a technique that finds the optimal
alignment between two time series when one of them may be
“warped” non-linearly by stretching or shrinking it along its time
axis.
This warping between the two time series can then be used to determine
the similarity between them.
Fixed Time Axis
Sequences are aligned “one to one”.
“Warped” Time Axis
Nonlinear alignments are possible.
10. DYNAMIC TIME WARPING
[BERNDT, CLIFFORD, 1994]
Allows acceleration-deceleration of signals along the time
dimension
Basic idea
$X = (x_1, x_2, \dots, x_N)$, $N \in \mathbb{N}$, and $Y = (y_1, y_2, \dots, y_M)$, $M \in \mathbb{N}$
*Data sequences should be sampled at equidistant points in time
The algorithm starts by building the distance matrix $C \in \mathbb{R}^{N \times M}$
representing all pairwise distances between X and Y
This distance matrix is also called the local cost matrix:
$c(i, j) = \lVert x_i - y_j \rVert$, $i \in [1 : N]$, $j \in [1 : M]$
Once the local cost matrix is built, the algorithm finds the
alignment path that runs through the low-cost areas ("valleys")
of the augmented cost matrix
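Building the local cost matrix can be sketched as below. The sketch assumes scalar samples, so the norm reduces to an absolute difference, and it uses 0-based Python indices rather than the 1-based indices of the slide. The function name and toy values are illustrative assumptions.

```python
def local_cost_matrix(x, y):
    """Return the N-by-M matrix of pairwise distances |x_i - y_j|."""
    return [[abs(xi - yj) for yj in y] for xi in x]


# Toy values chosen for illustration only (N = 3, M = 2).
X = [1.0, 2.0, 4.0]
Y = [1.0, 4.0]
for row in local_cost_matrix(X, Y):
    print(row)
```

Low values in the matrix mark pairs of points that are close to each other; these are the "valleys" an alignment path tries to follow.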
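11. HOW IS DTW CALCULATED?
$\gamma(i, j) = d(q_i, c_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\}$
where $\gamma(i, j)$ is the cumulative distance up to cell (i, j)
[Figure: sequences C and Q with the warping path w]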
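The recurrence on this slide can be sketched in two steps: fill a table of cumulative costs, then walk back from the last cell to recover a warping path. The name gamma for the cumulative cost, the infinite border row and column, and the tie-breaking rule in the backtracking are assumptions of this sketch. Indices in the returned path are 1-based, as on the slides.

```python
import math


def accumulated_cost(q, c):
    """Cumulative cost table; gamma[i][j] covers q[:i] and c[:j]."""
    n, m = len(q), len(c)
    # Row 0 and column 0 form an infinite border; gamma[0][0] = 0 seeds the table.
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(q[i - 1] - c[j - 1])  # local distance d(q_i, c_j)
            gamma[i][j] = d + min(gamma[i - 1][j - 1],
                                  gamma[i - 1][j],
                                  gamma[i][j - 1])
    return gamma


def warping_path(gamma):
    """Walk back from the last cell to (1, 1), always taking the cheapest neighbour."""
    i, j = len(gamma) - 1, len(gamma[0]) - 1
    path = [(i, j)]
    while (i, j) != (1, 1):
        candidates = [(gamma[i - 1][j - 1], i - 1, j - 1),
                      (gamma[i - 1][j], i - 1, j),
                      (gamma[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)  # ties are broken by the smaller indices
        path.append((i, j))
    path.reverse()
    return path


# Toy values: Q is C with its middle value held for one extra step.
Q = [0.0, 1.0, 1.0, 2.0]
C = [0.0, 1.0, 2.0]
gamma = accumulated_cost(Q, C)
print(gamma[len(Q)][len(C)])  # total cost of the best alignment: 0.0
print(warping_path(gamma))    # [(1, 1), (2, 2), (3, 2), (4, 3)]
```

In the toy run, the two middle points of Q are both matched to the second point of C, which is exactly the kind of non-linear alignment a one-to-one comparison cannot express.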
12. CONSTRAINTS
Boundary condition
Shanmukha Sreenivas P , DoMS
The starting and ending points of the warping path must be the first and the
last points of the aligned sequences, i.e. $w_1 = (1, 1)$ and $w_K = (N, M)$
Monotonicity condition
$n_1 \le n_2 \le \dots \le n_K$ and $m_1 \le m_2 \le \dots \le m_K$, where $w_k = (n_k, m_k)$.
This condition preserves the time-ordering of points.
Step size condition
This criterion prevents the warping path from making long jumps (shifts in time) while
aligning sequences,
i.e. from cell (i, j) we only look at the neighbours (i-1, j-1), (i-1, j) and (i, j-1)
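The three conditions can be checked mechanically for a candidate path. The sketch below is an illustration only: the function name, the representation of a path as a non-empty list of 1-based (i, j) pairs, and the violation labels are all assumptions.

```python
def check_warping_path(path, n, m):
    """Return the names of the conditions a path violates (empty list if none)."""
    violations = []
    # Boundary: the path must start at (1, 1) and end at (n, m).
    if path[0] != (1, 1) or path[-1] != (n, m):
        violations.append("boundary")
    # Differences between consecutive path points.
    steps = [(i2 - i1, j2 - j1) for (i1, j1), (i2, j2) in zip(path, path[1:])]
    # Monotonicity: neither index may ever decrease.
    if any(di < 0 or dj < 0 for di, dj in steps):
        violations.append("monotonicity")
    # Step size: only the moves (1, 1), (1, 0) and (0, 1) are allowed.
    if any(step not in {(1, 1), (1, 0), (0, 1)} for step in steps):
        violations.append("step size")
    return violations


print(check_warping_path([(1, 1), (2, 2), (3, 2), (4, 3)], 4, 3))          # []
print(check_warping_path([(1, 2), (2, 2), (3, 3)], 3, 3))                  # ['boundary']
print(check_warping_path([(1, 1), (3, 3)], 3, 3))                          # ['step size']
print(check_warping_path([(1, 1), (2, 2), (1, 3), (2, 3), (3, 3)], 3, 3))  # ['monotonicity', 'step size']
```

With this step set, any backward move also breaks the step size condition, which is why the last example reports both.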
13. CONSTRAINT VISUALIZATION
a) Admissible path satisfying constraints
b) Violation of boundary condition
c) Violation of monotonicity
d) Violation of step size
14. STEP SIZE CONDITION
A global constraint restricts the indices of the warping path $w_k = (i, j)_k$ such that
$j - r \le i \le j + r$
where r is a term defining the allowed range of warping for a given point in a
sequence.
r =
[Figure: two global constraints, the Sakoe-Chiba Band and the Itakura Parallelogram]
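For a constant r, the inequality above describes a band of fixed width around the diagonal, which is the Sakoe-Chiba case. The sketch below lists and draws the allowed cells; the function names and the text rendering are illustrative assumptions, and the Itakura Parallelogram (where the allowed range varies along the sequence) is not covered.

```python
def sakoe_chiba_cells(n, m, r):
    """1-based cells (i, j) of an n-by-m grid with j - r <= i <= j + r."""
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(1, m + 1)
            if j - r <= i <= j + r]


def render_band(n, m, r):
    """Draw the grid as text: '#' for an allowed cell, '.' for an excluded one."""
    allowed = set(sakoe_chiba_cells(n, m, r))
    return ["".join("#" if (i, j) in allowed else "." for j in range(1, m + 1))
            for i in range(1, n + 1)]


for line in render_band(5, 5, 1):
    print(line)
# ##...
# ###..
# .###.
# ..###
# ...##
```

A warping path is only allowed to pass through the cells marked '#'.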
18. FORMULATION
Let D(i, j) refer to the dynamic time warping
distance between the subsequences
x1, x2, …, xi
y1, y2, …, yj
D(i, j) = | xi – yj | + min{ D(i – 1, j), D(i – 1, j – 1), D(i, j – 1) }
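The recurrence needs boundary values before it can be evaluated. One common convention, given here as an assumption since the slide does not state it, is:

```latex
D(0, 0) = 0, \qquad D(i, 0) = D(0, j) = \infty \quad \text{for } i, j \ge 1
```

With these values the first cell reduces to $D(1, 1) = |x_1 - y_1|$, and the distance between the two full sequences is the value of D in the last cell.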
19. SOLUTION BY DYNAMIC PROGRAMMING
Basic implementation: $O(n^2)$, where n is the length of the sequences
  The problem has to be solved for each (i, j) pair
If a warping window w is specified: $O(nw)$
  Only solve for the (i, j) pairs where $|i - j| \le w$
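A sketch of the dynamic-programming solution with an optional warping window. It assumes scalar samples and an infinite border around the table. The function name is hypothetical, and widening the window to at least the length difference (so that the last cell stays reachable) is a design choice of this sketch, not something the slides specify.

```python
import math


def dtw_distance(x, y, w=None):
    """DTW distance between x and y; cells with |i - j| > w are never visited."""
    n, m = len(x), len(y)
    if w is None:
        w = max(n, m)       # no window: every (i, j) pair is solved
    w = max(w, abs(n - m))  # assumption: keep the last cell (n, m) reachable
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only columns inside the window are solved for this row.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]


# Toy values: the same ramp, starting at different times.
x = [0.0, 0.0, 0.0, 1.0, 2.0]
y = [0.0, 1.0, 2.0, 2.0, 2.0]
print(dtw_distance(x, y))       # unconstrained: 0.0
print(dtw_distance(x, y, w=0))  # diagonal only: 4.0
```

The inner loop visits only the cells inside the window, which is where the saving in time comes from; the table itself is still allocated in full here to keep the sketch short. A narrower window can only keep the distance the same or increase it, since it removes candidate paths.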
20. FEATURE-BASED DIMENSIONALITY
REDUCTION
• Time series databases are often extremely large.
Searching directly on the raw data is
complex and inefficient.
• To overcome this problem, we can use
transformation methods that reduce the size of
the time series.
• These transformation methods are called
dimensionality reduction techniques.
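As one concrete illustration, Piecewise Aggregate Approximation (PAA), which the deck relies on later when building SAX, replaces a series by the means of equal-length frames. The slides do not say which technique is meant at this point, so PAA is a swapped-in example; the function name and the requirement that the length be divisible by the number of segments are assumptions of this sketch.

```python
def paa(series, segments):
    """Reduce a series to `segments` values: the mean of each equal-length frame."""
    n = len(series)
    if n % segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by segments")
    frame = n // segments
    return [sum(series[k * frame:(k + 1) * frame]) / frame
            for k in range(segments)]


# Toy values chosen for illustration only: 8 points reduced to 4.
raw = [1.0, 3.0, 2.0, 2.0, 5.0, 7.0, 0.0, 4.0]
print(paa(raw, 4))  # [2.0, 2.0, 6.0, 2.0]
```

Searching on the shorter representation touches fewer values per series, at the price of losing detail inside each frame.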
21. An Example of a Dimensionality Reduction Technique I
[Figure: time series C plotted over a time axis running from 0 to 140, n = 128]
The graphic shows a time series with 128 points.
The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown).
Raw data: 0.4995, 0.5264, 0.5523, 0.5761, 0.5973, 0.6153, 0.6301, 0.6420, 0.6515, 0.6596, 0.6672, 0.6751, 0.6843, 0.6954, 0.7086, 0.7240, 0.7412, 0.7595, 0.7780, 0.7956, 0.8115, 0.8247, 0.8345, 0.8407, 0.8431, 0.8423, 0.8387, …
27. DISCRETIZATION
• Discretization of a time series means transforming it into a
symbolic string.
• The main benefit of discretization is that there is an
enormous wealth of existing algorithms and data structures
that allow the efficient manipulation of symbolic
representations.
• Lin, Keogh et al. (2003) proposed a method called
Symbolic Aggregate Approximation (SAX), which allows
the discretization of the original time series into symbolic
strings.
28. SYMBOLIC AGGREGATE
APPROXIMATION (SAX) [LIN ET AL. 2003]
Example SAX word: baabccbc
The first symbolic representation of time series that allows
discretization of time series into symbolic strings
29. HOW DO WE OBTAIN SAX
[Figure: time series C, its PAA approximation, and the resulting SAX word baabccbc]
First convert the time series to a Piecewise Aggregate Approximation (PAA)
representation, then convert the PAA to symbols
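The two steps can be sketched as follows. Beyond what the slide states, this sketch follows Lin et al. in z-normalizing the series first and in choosing breakpoints that cut the standard normal distribution into equally probable regions. The function name, the use of the population standard deviation, and the divisibility requirement are assumptions.

```python
from statistics import NormalDist, fmean, pstdev


def sax(series, word_size, alphabet_size):
    """Convert a series to a SAX word (alphabet_size between 2 and 26)."""
    n = len(series)
    if n % word_size != 0:
        raise ValueError("this sketch assumes len(series) is divisible by word_size")
    mu, sigma = fmean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    z = [(x - mu) / sigma for x in series]
    # Step 1: PAA, the mean of each of word_size equal-length frames.
    frame = n // word_size
    means = [sum(z[k * frame:(k + 1) * frame]) / frame for k in range(word_size)]
    # Step 2: breakpoints splitting N(0, 1) into alphabet_size equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]
    letters = "abcdefghijklmnopqrstuvwxyz"
    # A frame's symbol is determined by how many breakpoints its mean exceeds.
    return "".join(letters[sum(mean > b for b in breakpoints)] for mean in means)


# Toy values chosen for illustration only: low, middle, high, middle.
print(sax([1, 1, 5, 5, 9, 9, 5, 5], word_size=4, alphabet_size=3))  # abcb
```

Here the frame means fall below, between, above, and again between the two breakpoints, which gives the letters a, b, c and b.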
30. TWO PARAMETER CHOICES
[Figure: the SAX example for time series C, annotated with its two parameters]
The word size, in this case 8
The alphabet size (cardinality), in this case 3
31. Structural representations help in understanding time series through data analysis and visualization
SAX is claimed to be a landmark representation
of time series
It is symbolic and therefore allows the use of discrete data
structures and their corresponding algorithms for
analysis
It also helps with visualization
32. THANK YOU
Datasets and code used in this presentation can be found at:
www.cs.ucr.edu/~eamonn/TSDMA/index.html