Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Prochain SlideShare
Chargement dans…5
×

# Clustering Random Walk Time Series

Talk given at Geometric Science of Information GSI 2015

• Full Name
Comment goes here.

Are you sure you want to Yes No
Your message goes here
• Soyez le premier à commenter

### Clustering Random Walk Time Series

1. 1. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Clustering Random Walk Time Series GSI 2015 - Geometric Science of Information Gautier Marti, Frank Nielsen, Philippe Very, Philippe Donnat 29 October 2015 Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
2. 2. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion 1 Introduction 2 Geometry of Random Walk Time Series 3 The Hierarchical Block Model 4 Conclusion Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
3. 3. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Context (data from www.datagrapple.com) Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
4. 4. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion What is a clustering program? Deﬁnition Clustering is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than those in diﬀerent groups. Example of a clustering program We aim at ﬁnding k groups by positioning k group centers {c1, . . . , ck} such that data points {x1, . . . , xn} minimize minc1,...,ck n i=1 mink j=1 d(xi , cj )2 But, what is the distance d between two random walk time series? Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
5. 5. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion What are clusters of Random Walk Time Series? French banks and building materials CDS over 2006-2015 Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
6. 6. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion What are clusters of Random Walk Time Series? French banks and building materials CDS over 2006-2015 Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
7. 7. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion 1 Introduction 2 Geometry of Random Walk Time Series 3 The Hierarchical Block Model 4 Conclusion Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
8. 8. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Geometry of RW TS ≡ Geometry of Random Variables i.i.d. observations: X1 : X1 1 , X2 1 , . . . , XT 1 X2 : X1 2 , X2 2 , . . . , XT 2 . . . , . . . , . . . , . . . , . . . XN : X1 N, X2 N, . . . , XT N Which distances d(Xi , Xj ) between dependent random variables? Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
9. 9. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Pitfalls of a basic distance Let (X, Y ) be a bivariate Gaussian vector, with X ∼ N(µX , σ2 X ), Y ∼ N(µY , σ2 Y ) and whose correlation is ρ(X, Y ) ∈ [−1, 1]. E[(X − Y )2 ] = (µX − µY )2 + (σX − σY )2 + 2σX σY (1 − ρ(X, Y )) Now, consider the following values for correlation: ρ(X, Y ) = 0, so E[(X − Y )2] = (µX − µY )2 + σ2 X + σ2 Y . Assume µX = µY and σX = σY . For σX = σY 1, we obtain E[(X − Y )2] 1 instead of the distance 0, expected from comparing two equal Gaussians. ρ(X, Y ) = 1, so E[(X − Y )2] = (µX − µY )2 + (σX − σY )2. Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
10. 10. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Pitfalls of a basic distance Let (X, Y ) be a bivariate Gaussian vector, with X ∼ N (µX , σ2 X ), Y ∼ N (µY , σ2 Y ) and whose correlation is ρ(X, Y ) ∈ [−1, 1]. E[(X − Y ) 2 ] = (µX − µY ) 2 + (σX − σY ) 2 + 2σX σY (1 − ρ(X, Y )) Now, consider the following values for correlation: ρ(X, Y ) = 0, so E[(X − Y )2 ] = (µX − µY )2 + σ2 X + σ2 Y . Assume µX = µY and σX = σY . For σX = σY 1, we obtain E[(X − Y )2 ] 1 instead of the distance 0, expected from comparing two equal Gaussians. ρ(X, Y ) = 1, so E[(X − Y )2 ] = (µX − µY )2 + (σX − σY )2 . 30 20 10 0 10 20 30 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 Probability density functions of Gaus- sians N(−5, 1) and N(5, 1), Gaus- sians N(−5, 3) and N(5, 3), and Gaussians N(−5, 10) and N(5, 10). Green, red and blue Gaussians are equidistant using L2 geometry on the parameter space (µ, σ). Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
11. 11. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Sklar’s Theorem Theorem (Sklar’s Theorem (1959)) For any random vector X = (X1, . . . , XN) having continuous marginal cdfs Pi , 1 ≤ i ≤ N, its joint cumulative distribution P is uniquely expressed as P(X1, . . . , XN) = C(P1(X1), . . . , PN(XN)), where C, the multivariate distribution of uniform marginals, is known as the copula of X. Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
12. 12. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Sklar’s Theorem Theorem (Sklar’s Theorem (1959)) For any random vector X = (X1, . . . , XN ) having continuous marginal cdfs Pi , 1 ≤ i ≤ N, its joint cumulative distribution P is uniquely expressed as P(X1, . . . , XN ) = C(P1(X1), . . . , PN (XN )), where C, the multivariate distribution of uniform marginals, is known as the copula of X. Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
13. 13. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion The Copula Transform Deﬁnition (The Copula Transform) Let X = (X1, . . . , XN) be a random vector with continuous marginal cumulative distribution functions (cdfs) Pi , 1 ≤ i ≤ N. The random vector U = (U1, . . . , UN) := P(X) = (P1(X1), . . . , PN(XN)) is known as the copula transform. Ui , 1 ≤ i ≤ N, are uniformly distributed on [0, 1] (the probability integral transform): for Pi the cdf of Xi , we have x = Pi (Pi −1 (x)) = Pr(Xi ≤ Pi −1 (x)) = Pr(Pi (Xi ) ≤ x), thus Pi (Xi ) ∼ U[0, 1]. Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
14. 14. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion The Copula Transform Deﬁnition (The Copula Transform) Let X = (X1, . . . , XN ) be a random vector with continuous marginal cumulative distribution functions (cdfs) Pi , 1 ≤ i ≤ N. The random vector U = (U1, . . . , UN ) := P(X) = (P1(X1), . . . , PN (XN )) is known as the copula transform. 0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 X∼U[0,1] 10 8 6 4 2 0 2 Y∼ln(X) ρ≈0.84 0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 PX (X) 0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 PY(Y) ρ=1 The Copula Transform invariance to strictly increasing transformation Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
15. 15. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Deheuvels’ Empirical Copula Transform Let (Xt 1 , . . . , Xt N ), 1 ≤ t ≤ T, be T observations from a random vector (X1, . . . , XN ) with continuous margins. Since one cannot directly obtain the corresponding copula observations (Ut 1, . . . , Ut N ) = (P1(Xt 1 ), . . . , PN (Xt N )), where t = 1, . . . , T, without knowing a priori (P1, . . . , PN ), one can instead Deﬁnition (The Empirical Copula Transform) estimate the N empirical margins PT i (x) = 1 T T t=1 1(Xt i ≤ x), 1 ≤ i ≤ N, to obtain the T empirical observations ( ˜Ut 1, . . . , ˜Ut N ) = (PT 1 (Xt 1 ), . . . , PT N (Xt N )). Equivalently, since ˜Ut i = Rt i /T, Rt i being the rank of observation Xt i , the empirical copula transform can be considered as the normalized rank transform. In practice x_transform = rankdata(x)/len(x) Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
16. 16. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Generic Non-Parametric Distance d2 θ (Xi , Xj ) = θ3E |Pi (Xi ) − Pj (Xj )|2 + (1 − θ) 1 2 R dPi dλ − dPj dλ 2 dλ (i) 0 ≤ dθ ≤ 1, (ii) 0 < θ < 1, dθ metric, (iii) dθ is invariant under diﬀeomorphism Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
17. 17. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Generic Non-Parametric Distance d2 0 : 1 2 R dPi dλ − dPj dλ 2 dλ = Hellinger2 d2 1 : 3E |Pi (Xi ) − Pj (Xj )|2 = 1 − ρS 2 = 2−6 1 0 1 0 C(u, v)dudv Remark: If f (x, θ) = cΦ(u1, . . . , uN; Σ) N i=1 fi (xi ; νi ) then ds2 = ds2 GaussCopula + N i=1 ds2 margins Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
18. 18. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion 1 Introduction 2 Geometry of Random Walk Time Series 3 The Hierarchical Block Model 4 Conclusion Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
19. 19. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion The Hierarchical Block Model A model of nested partitions The nested partitions deﬁned by the model can be seen on the distance matrix for a proper distance and the right permutation of the data points In practice, one observe and work with the above distance matrix which is identitical to the left one up to a permutation of the data Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
20. 20. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Results: Data from Hierarchical Block Model Adjusted Rand Index Algo. Distance Distrib Correl Correl+Distrib HC-AL (1 − ρ)/2 0.00 ±0.01 0.99 ±0.01 0.56 ±0.01 E[(X − Y )2 ] 0.00 ±0.00 0.09 ±0.12 0.55 ±0.05 GPR θ = 0 0.34 ±0.01 0.01 ±0.01 0.06 ±0.02 GPR θ = 1 0.00 ±0.01 0.99 ±0.01 0.56 ±0.01 GPR θ = .5 0.34 ±0.01 0.59 ±0.12 0.57 ±0.01 GNPR θ = 0 1 0.00 ±0.00 0.17 ±0.00 GNPR θ = 1 0.00 ±0.00 1 0.57 ±0.00 GNPR θ = .5 0.99 ±0.01 0.25 ±0.20 0.95 ±0.08 AP (1 − ρ)/2 0.00 ±0.00 0.99 ±0.07 0.48 ±0.02 E[(X − Y )2 ] 0.14 ±0.03 0.94 ±0.02 0.59 ±0.00 GPR θ = 0 0.25 ±0.08 0.01 ±0.01 0.05 ±0.02 GPR θ = 1 0.00 ±0.01 0.99 ±0.01 0.48 ±0.02 GPR θ = .5 0.06 ±0.00 0.80 ±0.10 0.52 ±0.02 GNPR θ = 0 1 0.00 ±0.00 0.18 ±0.01 GNPR θ = 1 0.00 ±0.01 1 0.59 ±0.00 GNPR θ = .5 0.39 ±0.02 0.39 ±0.11 1 Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
21. 21. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Results: Application to Credit Default Swap Time Series Distance matrices computed on CDS time series exhibit a hierarchical block structure Marti, Very, Donnat, Nielsen IEEE ICMLA 2015 (un)Stability of clusters with L2 distance Stability of clusters with the proposed distance Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
22. 22. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Consistency Deﬁnition (Consistency of a clustering algorithm) A clustering algorithm A is consistent with respect to the Hierarchical Block Model deﬁning a set of nested partitions P if the probability that the algorithm A recovers all the partitions in P converges to 1 when T → ∞. Deﬁnition (Space-conserving algorithm) A space-conserving algorithm does not distort the space, i.e. the distance Dij between two clusters Ci and Cj is such that Dij ∈ min x∈Ci ,y∈Cj d(x, y), max x∈Ci ,y∈Cj d(x, y) . Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
23. 23. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Consistency Theorem (Consistency of space-conserving algorithms (Andler, Marti, Nielsen, Donnat, 2015)) Space-conserving algorithms (e.g., Single, Average, Complete Linkage) are consistent with respect to the Hierarchical Block Model. T = 100 T = 1000 T = 10000 Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
24. 24. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion 1 Introduction 2 Geometry of Random Walk Time Series 3 The Hierarchical Block Model 4 Conclusion Gautier Marti, Frank Nielsen Clustering Random Walk Time Series
25. 25. Introduction Geometry of Random Walk Time Series The Hierarchical Block Model Conclusion Discussion and questions? Avenue for research: distances on (copula,margins) clustering using multivariate dependence information clustering using multi-wise dependence information Optimal Copula Transport for Clustering Multivariate Time Series, Marti, Nielsen, Donnat, 2015 Gautier Marti, Frank Nielsen Clustering Random Walk Time Series