We investigate the dimensionality properties of the Internet delay space, i.e., the matrix of measured round-trip latencies between Internet hosts. Previous work on network coordinates has indicated that this matrix can be embedded,
with reasonably low distortion, into a 4- to 9-dimensional Euclidean space. The application of Principal Component Analysis (PCA) reveals the same dimensionality values. Our work addresses the question: to what extent is the dimensionality an intrinsic property of the delay space, defined without reference to a host metric such as Euclidean space? Is the intrinsic dimensionality of the Internet delay space
approximately equal to the dimension determined using embedding techniques or PCA? If not, what explains the discrepancy? What properties of the network contribute to
its overall dimensionality? Using datasets obtained via the King [14] method, we study different measures of dimensionality to establish the following conclusions. First, owing to its power-law behavior, the structure of the delay space is better characterized by fractal measures. Second, the intrinsic dimension is significantly smaller than the value
predicted by the previous studies; in fact, by our measures it is less than 2. Third, we demonstrate a particular way in which the AS topology is reflected in the delay space: subnetworks composed of hosts that share a common upstream Tier-1 autonomous system possess lower dimensionality than the combined delay space. Finally, we observe that
fractal measures, due to their sensitivity to non-linear structures, display higher precision for measuring the influence of subtle features of the delay space geometry.
In this work, we view the Internet as a metric space, where the metric is the round-trip time to send a packet between two hosts. One of the most important properties characterizing a metric space is its dimensionality. Properly estimating the dimensionality of a metric space is crucial for characterizing its structure and for designing algorithms based on that space: if one's estimate of the dimension is too low, it is impossible to find an embedding that preserves distances; if the estimate is too high, the algorithms become inefficient as we run into the so-called curse of dimensionality: there are too many degrees of freedom to explore and the algorithms get lost. A major application of this abstraction is a coordinate-based positioning system. Such systems aim to map the network into a metric space in such a way that the geometric distances estimate the real latencies with a low degree of error. For such systems, the dimensionality of the target space is a tunable parameter. In addition, the dimensionality is known to affect the accuracy of the predictions, the stability of coordinates over time, and the time to converge to stable coordinates. Our work uses a latency matrix obtained via the King method, a convenient way of measuring the latency between nameservers without having login access to them, to answer two fundamental questions: What is the dimensionality of the Internet delay space? And what forces contribute to its geometric properties? Apart from its implications for the performance of coordinate systems, this characterization is by itself a topic of practical interest, as it uncovers properties of and opens new questions about the nature and complexity of the network.
Measurement-based positioning systems implement the same functionality as their coordinate-based counterparts by performing measurements in a carefully chosen way.
Dimensionality notions used in prior work often reflect the assumption that the data can be approximately embedded in a low-dimensional Euclidean space, or that the distance matrix can be accurately approximated by a low-rank matrix. Thus, one could define the embedding dimension of a space by finding the lowest-dimensional Euclidean space that admits an embedding with adequate percentiles of relative error, or one could define the dimension by using Principal Component Analysis (PCA) to identify the smallest value of k for which the distance matrix has a rank-k approximation with adequate relative error. In these two plots we see the outcome of applying this process to the Internet measurements mentioned earlier. The vertical bars in the graph on the left show the relative error obtained when embedding the data into a Euclidean space of dimension 1, 2, 3, and so on. The graph on the right shows the percentage of variance explained by the first k principal components of the distance matrix for varying values of k. What are the problems with these methods? 1) The problem with the first approach is that the embedding algorithm might fail to recover the true value of the dimensionality, since the curse of dimensionality comes into play: beyond 7 dimensions we observe higher percentiles of relative error. 2) The problems with PCA are that, 2.1) as a method grounded in linear algebra, it is oblivious to non-linear relationships between the dimensions. For instance, if we use it to estimate the dimension of the surface of a sphere, it will indicate a 3-dimensional object, whereas the surface is actually 2-dimensional. 2.2) As in the case of this plot, it is not always clear where to establish the cutoff point beyond which the subsequent components explain only a negligible variance. To the extent that these methods estimate the dimensionality of the delay space, they indicate a value between 4 and 7.
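As a concrete illustration of the PCA-style rank estimate, here is a minimal sketch on synthetic data. The function name, the 95% variance cutoff, and the double-centering step (classical MDS) are illustrative choices, not taken from the study:

```python
import numpy as np

def pca_dimension(dist_matrix, variance_cutoff=0.95):
    """Estimate dimension as the smallest k whose top-k principal
    components explain at least `variance_cutoff` of the variance."""
    # Classical MDS-style double centering of squared distances
    # recovers a Gram matrix when the data are Euclidean.
    n = dist_matrix.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist_matrix ** 2) @ J
    eigvals = np.linalg.eigvalsh(B)[::-1]   # descending order
    eigvals = np.clip(eigvals, 0, None)     # drop negative numerical noise
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, variance_cutoff) + 1)

# Points sampled from a 3-D Gaussian: the estimate recovers 3.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(pca_dimension(D))  # 3
```

Note that this sketch inherits exactly the weaknesses described above: applied to points sampled from the surface of a sphere, it would report 3, not the intrinsic value 2.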
If the measured distances reflect a metric other than Euclidean distance (e.g., travel times over hilly terrain), the embedding algorithm fails to produce a low-distortion embedding in any dimension!
In our work, we explore the structural and statistical properties of the Internet delay space in order to better characterize its dimensionality. For instance, metric spaces that exhibit power-law behavior can be measured using fractal dimensions, which are intrinsic measures that work without reference to an external metric space. One way in which power-law behavior arises in the delay space is when we plot, in log scale, the number of nodes (y-axis) that are within a given distance (x-axis). 1) The first striking feature of this plot is a power law that persists over two orders of magnitude. Datasets that display such a property are said to behave like a fractal. 2) Another interesting observation is that this range of distances includes all RTTs between 2 and 100 ms, which in turn includes all non-oceanic distances. 3) When we observe this behavior, we can measure the intrinsic dimensionality of the dataset as the power-law exponent. The other surprising finding is that the intrinsic dimensionality measured by this method is much lower than what was estimated by the previous methods!
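The power-law exponent can be estimated as a correlation dimension: the slope of log(number of pairs within distance r) versus log(r) over the scaling range. A minimal sketch on synthetic data follows; the function name and the chosen scaling range are illustrative, not the study's parameters:

```python
import numpy as np

def correlation_dimension(points, r_min, r_max, n_scales=10):
    """Fractal dimension estimate: slope of log(#pairs within r)
    versus log(r) over the power-law scaling range."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    d = d[np.triu_indices_from(d, k=1)]          # unique pairs only
    radii = np.geomspace(r_min, r_max, n_scales)  # log-spaced scales
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

# Points on a 1-D line embedded in 3-D space: the estimate is
# close to the intrinsic dimension 1, not the ambient dimension 3.
rng = np.random.default_rng(1)
t = rng.uniform(0, 1, 1000)
line = np.stack([t, 2 * t, -t], axis=1)
print(correlation_dimension(line, 0.01, 0.2))  # close to 1
```

Unlike PCA, this estimator sees only pairwise distances, so it captures non-linear (e.g., curved or fractal) structure directly.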
Is the dimensionality value of less than 2 due to the fact that the Internet lives on the surface of the Earth? Is the surface of the Earth 0.9-dimensional?
Another way in which our work illuminates the structure of the delay space is by revealing that the delay space is not homogeneously 1.8-dimensional but is made up of a small number of low-dimensional pieces. What is the geometric effect of analyzing each Tier-1 network, together with its downstream customers, in isolation? One might expect the subnetworks to be simpler because the decomposition eliminates inefficient routes that go from a network up to its Tier-1 provider, over to another Tier-1 provider, and down to its customer. Or perhaps the decomposition merely splits the network into subnetworks of equal complexity. I'll now show that that's not the case. Upon decomposing the delay space into overlapping pieces, each corresponding to a Tier-1 AS and its downstream customers, we observe that each individual piece has dimensionality 10% lower than the combined delay space. This dimensionality shift cannot be achieved by other kinds of decompositions, namely by decomposing into pieces of low diameter, pieces clustered geographically, or randomly selected subsets of lower cardinality. Another interesting finding is that this dimensionality shift can only be detected by fractal measures; the embedding dimension and PCA are oblivious to it. This also demonstrates the power and applicability of fractal measures.
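A toy analogue of this finding can be sketched with synthetic data (none of this is the paper's dataset or decomposition): the union of two 1-dimensional pieces measures as slightly higher-dimensional than either piece alone at finite scales, because cross-piece pair counts grow faster with radius than within-piece counts.

```python
import numpy as np

def correlation_dimension(points, r_min, r_max, n_scales=10):
    """Fractal dimension: slope of log(#pairs within r) vs log(r)."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    d = d[np.triu_indices_from(d, k=1)]
    radii = np.geomspace(r_min, r_max, n_scales)
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

# Two 1-D segments meeting at the origin, playing the role of two
# "subnetworks"; the combined set is their union.
rng = np.random.default_rng(2)
seg_x = np.stack([rng.uniform(0, 1, 800), np.zeros(800)], axis=1)
seg_y = np.stack([np.zeros(800), rng.uniform(0, 1, 800)], axis=1)
union = np.concatenate([seg_x, seg_y])

dim_piece = correlation_dimension(seg_x, 0.01, 0.2)
dim_union = correlation_dimension(union, 0.01, 0.2)
print(dim_piece < dim_union)  # each piece measures lower than the union
```

The gap here is small, which echoes the point above: a distance-based fractal estimator registers the shift, while linear methods such as PCA would not.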
So far, this presentation has offered no evidence that this study may lead to better network embeddings; in fact, the evidence so far suggests the opposite. Now, however, we'll see that it does lead to better non-linear embeddings.