SlideShare une entreprise Scribd logo
1  sur  10
Télécharger pour lire hors ligne
Challenges and Opportunities in Internet Data Mining

                             David G. Andersen                            Nick Feamster
                         Carnegie Mellon University               Georgia Institute of Technology

                                                    CMU-PDL-06-102
                                                            Jan 2006




                                                Parallel Data Laboratory
                                                Carnegie Mellon University
                                                Pittsburgh, PA 15213-3890




                                                           Abstract

Internet measurement data provides the foundation for the operation and planning of the networks that comprise the Internet, and
is a necessary component in research for analysis, simulation, and emulation. Despite its critical role, however, the management of
this data—from collection and transmission to storage and its use within applications—remains primarily ad hoc, using techniques
created and re-created by each corporation or researcher that uses the data. This paper examines several of the challenges faced
when attempting to collect and archive large volumes of network measurement data, and outlines an architecture for an Internet
data repository—the datapository—designed to create a framework for collaboratively addressing these challenges.
Keywords: network monitoring, databases, data management, data mining
1    Introduction
Communications networks produce vast amounts of data that both help network operators manage and plan
their networks and enable researchers to study a variety of network characteristics. The data is rich, and it
varies from the minute (e.g., kilobits per hour of periodic average link utilization data) to the massive (e.g.,
gigabits per second of live traffic captures). Its uses are equally diverse: On short timescales, the data can
facilitate network monitoring, troubleshooting, and reactive routing [9]. Over minutes to days, the data is
useful for network traffic engineering, in which operators attempt to shift the flow of traffic away from over-
utilized links onto less busy paths. Over months, the data is useful for capacity planning; on even longer
timescales, the data helps researchers perform longitudinal studies.
      Network data is essential for the operation and planning of the networks that comprise the Internet; it
is also vital for analysis, simulation, and emulation. Despite its critical role, the management of this data—
from collection and transmission to storage and its use within applications—remains disconcertingly ad hoc,
using techniques created and re-created by each corporation or researcher. As one example, consider the
storage formats used for node-to-node loss and latency (“ping”) data. A number of different projects collect
such data, in formats ranging from:

    • A hierarchy of directories, one per day, with an N-line text file containing a matrix of ping data
      between a set of N nodes, with errors recorded in a separate file. (PlanetLab all-pairs pings data [21].)
    • A generalized binary network data storage format (Skitter’s ARTS++ format [5]).
    • Stored internally using a ROOT-based file [4] (RIPE’s Test Traffic Machines [11]).
    • A MySQL database, with one row per pair of probes and a non-standard text format for export upon
      demand (The RON testbed all-pairs probe data [2]).

      This paper examines several challenges in collecting and archiving large volumes of network mea-
surement data, and outlines an architecture for an Internet data repository—the datapository—designed to
create a framework for collaboratively addressing these challenges. Rather than attempt to solve all of these
problems, we hope that the datapository will catalyze sharing of tools, formats, and data among researchers
and operators. We envision that the datapository will be useful both for storage and analysis, and as a tool
for real-time network monitoring.
      The heterogeneity of data formats and the lack of an infrastructure to analyze it efficiently harm both
network operations and the science of network measurement. The data’s unwieldy nature prevents network
operators from harnessing the vast amounts of data that could save them time and money (e.g., via network
provisioning and traffic engineering) and help defend their networks from attack (e.g., by alerting operators
to routing or traffic anomalies).
      From a scientific perspective, the lack of an analysis facility means that every network measurement
researcher must develop “home brew” analysis tools and encounter the same set of traps and pitfalls as
previous researchers. Worse yet, after a study is published, the authors’ analysis tools may go unused for
long periods of time before another researcher tries to study the same published results years later—at which
point the data may have moved to a different location or been re-organized entirely. Even if a researcher
successfully locates the data and tools from a particular study, he is left with the challenge of re-discovering
how to reproduce the results (“Which parameters did they use to generate that graph?”, “What subset of this
data was used?”, etc.). Indeed, for Internet measurement, repeatable experimentation—a major cornerstone
of science—requires significant advances in the storage and management of data.
      The remainder of this paper discusses the challenges that arise in constructing a network data analysis
facility (Section 2), presents the initial design of the datapository (Section 3), and presents our vision for
the opportunities that we hope the datapository will ultimately provide (Section 4). We warn the reader in


                                                       1
advance that this paper raises more questions than it answers, though we hope it shines light on a possible
path through the measurement minefield.


2     Challenges
This section briefly surveys the challenges in realizing a network data collection and storage facility.

2.1 Robust, distributed data collection
Collecting network measurement data is by nature distributed; this collection must be resilient to network
and server outages, providing the datapository with an eventually complete data set. 1 In our experience,
measurement nodes may be offline for days to months before coming back online and may be partitioned
from the network for hours to days while still collecting measurements.
     While work such as Paxson’s Strategies for Sound Internet Measurement [17] provides valuable guide-
lines for how to measure and analyze data about the network, less attention has been paid to the collection
of this data for subsequent analysis. Unfortunately, when large number of nodes or large volumes of data
are involved, the difficulties involved in collecting and organizing can be as large as the difficulties of how
to measure the network in the first place.

2.2 Multi-timescale, heterogeneous data
The combination of performing real-time analysis and data mining years of archived data presents challenges
to both conventional and streaming database systems:
     Non-standard, irregular data. Network data is information-dense, and we cannot predict how users
will query the data, which makes indexing a challenge. Some data is best served by custom search tech-
niques. For example, each routing update contains a network prefix and a prefix length. The prefix length
specifies the number of significant bits in the prefix. A common operation in prefix queries is to find a
longest-prefix match (LPM): Given an IP address, find the prefix with the longest prefix length whose sig-
nificant bits are equal to those bits in the IP address. Some data structures (e.g., search tries) can efficiently
perform LPM, but the best way to express this query in SQL is as a set of 32 OR’d conditions for each prefix
length.

    SELECT * from table
    WHERE (prefix = x AND mask=32)
      OR (prefix = x & 0xfffffffe
           AND mask=31)
      OR ...

     Additionally, network monitoring data is often not amenable to the regular form of standard databases.
Packet traces, for instance, have nearly arbitrary content, and a search may examine any of these fields,
extract the TCP stream from a series of packets, or examine application-layer headers. RBDMSes do not
handle such queries, and existing tools such as tcpdump or ethereal cannot efficiently process terabytes
of packet traces. Other data, such as the popular NetFlow format for flow summarization, has similar
limitations for data mining.
    1 The measurement techniques used must also be robust to these events. This lesson is explained well by Paxson, whose
original Internet measurements were triggered from a centralized node, and therefore under-counted some failures [16], or whose
measurements depended on the DNS [18]. The authors of this paper have seen these mistakes repeated several times since.




                                                              2
Raw Data                       Collection and Storage             Analysis, Queries, and
                                                                                         External Interfaces
                   Probe & Traceroute
                                                                                           Compute Engine
                     BGP Updates

                         Netflow                                                            Compute Engine
                                              Compute Engines
                                              (Input processing)      Storage & DB
                                                                                            (Public services)
                     Packet Traces

                       SNMP Data                                     Archival Storage
                                                                                            Compute Engine
                                                                                            (SQL processing)
                     Router Configs




                                        Figure 1: The datapository architecture.


      To process non-standard data sources, the datapository may be able to leverage PADS, a system that
takes as input a data format description and parses raw data files of arbitrary formats [10].
      Large archival data mining. Aurora [3] provides a promising way to process real-time streams of
network data, and AT&T’s Gigascope [6] supports real-time monitoring of gigabit networks. These systems
are excellent building blocks for processing raw data, but they are typically designed for either low-speed
streams (e.g., financial transactions) or modest-size “windows” of data. Longitudinal studies and joint anal-
ysis of network data may require much large windows over high-speed streams.
      Most conventional solutions are unsuited to our application, and are unable to take advantage of many
of the simplifying advantages of a data archive: the data is read-only, is easily cached, follows a strict time
ordering, and is amenable to both intra- and inter-query parallelism.
      Data-mining solutions such as Netezza [14] offer attractive solutions for data mining. Network mea-
surement data offers many opportunities for optimization: the data is read-only, is easily cached, and follows
a strict time ordering. Unfortunately, current data mining tools do not cope well with the unique data types
present in network monitoring data. We believe that the long-term development of the datapository will spur
the development of log-processing database management systems.

2.3 Workflow metadata and privacy
Metadata about data’s origins, errors, and modifications—its provenance—must be maintained and commu-
nicated with the data as it progresses through the analysis workflow [13, 23]. Data producers or the research
community must agree upon common metadata formats for data [1] and analysis tools in order to facilitate
data sharing.
     An important part of the workflow is addressing privacy requirements. The privacy requirements for
measurement data range from nearly none (e.g., many BGP routing traces, which are publicly available)
to extreme (e.g., an un-anonymized packet trace). In order to gain the efficiencies of scale that we hope to
realize with the Datapository, it must be able to host and analyze data at the ends and in the middle of the pri-
vacy spectrum. In fact, the steps of “declassifying” this data (e.g., using anonymization “scrubbers” [12] or
summarization techniques) will be an important part of the analysis workflow that the datapository enables.


3    The datapository
The datapository consists of a distributed set of data sources and a central storage and analysis infrastruc-
ture, shown in Figure 1. The data is processed by a set of compute engines dedicated to input processing,


                                                               3
which insert the data into the central storage and database nodes. Once inserted, the data may be analyzed
by tools on other compute engines.

3.1 Data Collection
The datapository collects data from many sources, including the RON testbed path monitoring nodes and
BGP feeds, Abilene data, and the RouteViews BGP archive. Internally, the datapository understands an
orthogonal set of data types (which abstract a set of data formats), and access methods:
     Data types are a canonical version of a particular type of measurement (e.g.traceroutes, end-to-end
probes, and BGP routing updates). The datapository is currently archiving and processing BGP routing
tables and updates from RON and Abilene, and spam logs from two spam honeypots. We have immediate
plans to archive RouteViews BGP feeds and end-to-end probes from the RON testbed.
     Data formats are different storage formats for a particular data type. Each of the formats listed in the
Introduction would be one such data format; other examples include the MRT format commonly used to
store BGP routing updates, or ASCII files recording traceroutes.
     Access methods are implementations of techniques that the datapository obtains data from the data
sources. Examples include the code that obtains data (e.g., RouteViews) via FTP, and that which securely
copies data from the RON testbed nodes to the datapository via SSH.
     Central to the datapository is the notion of a feed. A feed represents an (access method, data format,
data type) tuple, along with the ancillary data used by the access method or data format (e.g., usernames
and passwords, or state corresponding to which objects have already been retrieved). Each feed defines
a workflow that uses an access method to obtain data in a particular format, converts that format into the
canonical representation for that data type, and inserts the data into storage.

3.2 Data Storage
We wish to store data in a cluster of databases that offer an SQL-like interface. Our prior experience and
that of others suggests that basing analysis out of a database (we choose SQL out of familiarity) improves
consistency, facilitates reuse, and reduces researcher effort [20, 17]. To this end, the present datapository
architecture uses a set of MySQL compute engines, each of which has read-only access to the historical data
via a distributed filesystem.
     This approach has the advantage of simplicity and working with off-the-shelf components, but fails to
take advantage of the intra-query parallelism that is available in many of the data-mining queries we issue
against the datapository. In addition, the current architecture is limited in its ability to simultaneously query
the historical data and newly arrived data, which is maintained read-write by a single database node. We
view the development of more suitable storage and mining techniques as a prime challenge for the further
development of the datapository.

3.3 Analysis and Queries
We envision three “users” for the datapository: Network operators, researchers performing analysis, and
simulation/emulation systems that require traces as input.
     Network operators will be able to use the datapository for real-time debugging and network monitoring.
By aggregating data from many vantage points into a logically centralized database, operators can efficiently
get information from a geographically and topologically diverse set of network locations. Our previous
implementation of a public “BGP monitor” was useful to operators in identifying network-wide failures and
debugging reachability problems. Such utilities are particularly useful because they grant operators a view
of the network from vantage points outside their own network.


                                                       4
Our instance of the datapository holds datasets that researchers and operators choose to make publicly
available. Operators may wish to build private datapositories to analyze confidential data and query that data
either in isolation or in combination with the publicly available datapository. Analyzing traffic and routing
data can help operators with traffic engineering and capacity planning.
      The datapository can help network researchers perform long-term longitudinal studies. Network data
often exists in two forms: (1) short-term data collected for a particular experiment and (2) long-term data
that is collected, but never analyzed in toto due to the massive overhead in running a query For example, the
RouteViews BGP routing updates comprise BGP data from about 41 routers in 35 different ISPs [15]. An
hour of compressed BGP updates from RouteViews requires one megabyte; as a result, longitudinal queries
across multiple years and multiple BGP datasets can become prohibitively expensive. As a shared facility,
the datapository can amortize storage, processing, and management costs, such as building and maintaining
a database and query infrastructure for longitudinal datasets. We hope that lowering the cost of running
queries will foster serendipity by making it easier for researchers to issue exploratory queries.
      The datapository aggregates multiple data feeds into a single archive, which facilitates performing
joint analysis across multiple datasets. Among many examples, our past research has shown the benefits of
performing joint analysis across traceroutes, active probes, and BGP routing updates to study the correlations
between end-to-end path performance and Internet routing stability [9]. More recently, we have studied the
interaction between BGP route hijacking and spam activity [8]. It is our hope that, by providing a common
storage and query infrastructure for diverse datasets, Internet researchers will find new ways of finding
patterns and correlations across datasets.

3.4 External Interfaces
The datapository makes raw and aggregated data available through standard interfaces. The datapository cur-
rently supports an XMLRPC export mechanism, and we are implementing bulk raw data export. The XML-
RPC interface can export data in various formats, including human-readable output and Matlab-friendly
matrices. This data export mechanism allows researchers to focus on analysis techniques without being
intimately familiar with the myriad formats for network data. In addition to these export formats, we have
created several Web interfaces for browsing data and for network troubleshooting.
     To facilitate experimentation, we are integrating the datapository with Emulab [22]. Emulab provides
hardware resources (nodes and links) that can be configured into a desired network topology. Researchers
use this emulated network to evaluate real code and distributed systems. Emulab provides a valuable mid-
dle ground between simulation and untamed (and un-repeatable) live networks. However, the accuracy of
many emulation experiments—much like simulations—lives and dies by the model and input parameters.
Coupling the datapository to Emulab can significantly improve this aspect.
     Our goal is to foster a seamless flow of data from network traces to Emulab experiments and back.
Emulab users will have access to realistic topologies, background traffic patterns, configurations, and even
prior experimental and analytical data. The data produced by these experiments can then be inserted into
the datapository for archiving and further processing.
     This integration presents a classical workflow problem, managing the flow of data from the datapository,
into emulation (and perhaps the real world, via Emulab’s wide-area nodes), and back. Internet research
involves a continual cycle of measuring a system to better understand and evaluate its properties, modeling
these properties in a way that makes them easy to integrate into a controlled experiment (e.g., in simulation
or emulation), and making changes to the system that is actually deployed. The datapository acts as a critical
component in this process by facilitating data collection and management, as well as acting as a driver for
trace-based emulation. Section 4.1 describes this process in more detail, as well as its relation to Emulab’s
scientific workflow management [7].



                                                      5
4    Opportunities
This section explores possible future uses for the datapository. We focus first on how the datapository might
help scientific experimentation; then, we explore how it might be employed as a cornerstone of network
management—even to the point of controlling network operation.

4.1 Repeatable Scientific Experimentation
Two significant challenges in experimentation are repeating a past experiment (under the same or different
conditions) and performing new analysis of old data. Without standard techniques for packaging old data
and analysis scripts, measurement data is often shuffled about and forgotten or becomes a subset of a con-
tinuously collected data stream. At a later date, researchers must recall which subsets of each data stream
and which processing scripts they used to produce each set of results.
     The datapository can save measurement data from these fates by allowing a researcher to store the
details of their analysis workflow. This archived workflow names all of the machinery a researcher would
need to reproduce a particular experiment: the data, all scripts needed for analysis, and the process by which
these scripts should be executed (i.e., a description of the experiment). The workflow should also refer to
any relevant information about how the data was collected and other noteworthy aspects of the data, such
as times when the collection machinery was faulty or non-functional, how the data was collected, etc. Our
support for workflows is complementary to Emulab, which proposes integrating workflow management into
generating—not just analyzing—measurements [7].
     The datapository’s archives may also prove useful for driving trace-driven network emulations. The
networking community is developing several platforms for network emulation, which would benefit from
the ability to play back traces of actual network data. For example, a researcher might set up an experiment
that mirrors a real network topology, process traces at the datapository as they are collected from the real
network, and project the failures in these traces onto the mirrored network.

4.2 Network Monitoring and Control
Network operators must ensure availability and good performance in the face of constantly changing condi-
tions (e.g., changing traffic patterns, link failures) and security threats (e.g., worms, denial-of-service (DoS)
attacks). Unfortunately, critical data often escapes operators’ notice due to both the lack of an infrastructure
to collect and analyze this data and the lack of adequate detection algorithms. We hope that the dataposi-
tory will evolve to support real-time analysis and detection. A real-time analysis facility might even allow
operators to automate some aspects of network control. For example, such a system could detect erroneous
routing advertisements and prevent them from propagating; it could also detect a DoS or worm attacks and
install filters.


5    Discussion and Summary
The datapository is an in-progress facility for shared network data analysis. We believe that a community-
based approach to network data analysis can play a critical role in both networking research and operations.
In contrast to existing approaches that emphasize cataloging and standardizing Internet data [19], we believe
that providing a public computational and storage resource will be an effective stimulus for community
involvement and adoption of the standards that evolve.
     While building the datapository, we have encountered several conflicts between existing workflow and
data management systems and the requirements of network data. We hope that this paper will serve as a
catalyst for the improvement of both our data repository, and the available data management techniques.


                                                       6
References
 [1] Mark Allman, Ethan Blanton, and Wesley M. Eddy. A scalable system for sharing Internet measure-
     ment. In Passive & Active Measurement (PAM), Fort Collins, CO, March 2004.

 [2] David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. Experience with
     an Evolving Overlay Network Testbed. ACM Computer Communications Review, 33(3):13–19, July
     2003.

 [3] Hari Balakrishnan, Magdalena Balazinska, Donald Carney, et al. Retrospective on Aurora. VLDB
     Journal, January 2004.

 [4] Rene Brun and Fons Rademakers. ROOT - an object oriented data analysis framework. In Proc.
     AIHENP’96 Workshop, September 1996.

 [5] CAIDA’s Skitter project, 2002. http://www.caida.org/tools/measurement/skitter/.

 [6] Chuck Cranor, Theodore Johnson, Oliver Spataschek, and Vladislav Shkapenyuk. Gigascope: a stream
     database for network applications. In Proc. ACM SIGMOD, pages 647–651, San Diego, CA, June
     2003.

 [7] Eric Eide, Juliana Freire, Jay Lepreau, Tim Stack, and Leigh Stoller. Integrated scientific workflow
     management for the Emulab network testbed. Submitted for publication, December 2005.

 [8] Nick Feamster. Open Problems in BGP Anomaly Detection. http://www.caida.org/outreach/
     isma/0411/abstracts.xml#NickFeamster2, Nov 2004. Slides available upon request.

 [9] Nick Feamster, David Andersen, Hari Balakrishnan, and M. Frans Kaashoek. Measuring the effects of
     Internet path faults on reactive routing. In Proc. ACM SIGMETRICS, San Diego, CA, June 2003.

[10] Kathleen Fisher and Robert Gruber. Pads: a domain-specific language for processing ad hoc data. In
     Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implemen-
     tation, pages 295–304, Chicago, IL, Jun 2005.

[11] Fotis Georgatos, Florian Gruber, Daniel Karrenberg, Mark Santcroos, Ana Susanj, Henk Uijterwaal,
     and Rene Wilhelm. Providing active measurements as a regular service for ISP’s. In Passive & Active
     Measurement (PAM), April 2001.

[12] Alefiya Hussain, Genevieve Bartlett, Yuri Pryadkin, John Heidemann, Christos Papadopoulos, and
     Joseph Bannister. Experiences with a continuous network tracing infrastructure. In Proc. ACM SIG-
     COMM Workshop on Mining Network Data (MineNet), Philadelphia, PA, August 2005.

[13] Jonathan Ledlie, Chaki Ng, David Holland, Kiran-Kumar Muniswamy-Reddy, Uri Braun, and Margo
     Seltzer. Provenance-aware sensor data storage. In Proc. Workshop on Networking Meets Databases
     (NetDB), April 2005.

[14] Netezza. Business intelligence data warehouse appliance. Netezza Home Page.

[15] University of Oregon. RouteViews. http://www.routeviews.org/.

[16] V. Paxson. End-to-End Routing Behavior in the Internet. IEEE/ACM Transactions on Networking,
     5(5):601–615, 1997.



                                                   7
[17] Vern Paxson. Strategies for sound Internet measurement. In Proc. ACM SIGCOMM Internet Measure-
     ment Conference, Taormina, Sicily, Italy, October 2004.

[18] Vern Paxson, Andrew Adams, and Matt Mathis. Experiences with NIMI. In Passive & Active Mea-
     surement (PAM), April 2000.

[19] Colleen Shannon, David Moore, Ken Keys, Marina Fomenkov, Bradley Huffaker, and k. claffy. The
     Internet measurement data catalog. ACM Computer Communications Review, 35(5):97–100, October
     2005.

[20] Neil Spring, Ratul Mahajan, and David Wetherall. Measuring ISP topologies with Rocketfuel. In Proc.
     ACM SIGCOMM, Pittsburgh, PA, August 2002.

[21] Jeremy Stribling. Planetlab - all pairs pings. http://pdos.csail.mit.edu/~strib/pl_app/, nov
     2005.

[22] Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler,
     Chad Barb, and Abhijeet Joglekar. An integrated experimental environment for distributed systems and
     networks. In Proc. 5th USENIX OSDI, pages 255–270, Boston, MA, December 2002.

[23] Jennifer Widom. Trio: A system for integrated management of data, accuracy, and lineage. In Proc.
     2nd Biennial Conference on Innovative Data Systems Research (CIDR), Pacific Grove, CA, January
     2005.




                                                   8

Contenu connexe

Tendances

A novel cache resolution technique for cooperative caching in wireless mobile...
A novel cache resolution technique for cooperative caching in wireless mobile...A novel cache resolution technique for cooperative caching in wireless mobile...
A novel cache resolution technique for cooperative caching in wireless mobile...csandit
 
Towards a low cost etl system
Towards a low cost etl systemTowards a low cost etl system
Towards a low cost etl systemijdms
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...Alexander Decker
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Architecture and Evaluation on Cooperative Caching In Wireless P2P
Architecture and Evaluation on Cooperative Caching In Wireless  P2PArchitecture and Evaluation on Cooperative Caching In Wireless  P2P
Architecture and Evaluation on Cooperative Caching In Wireless P2PIOSR Journals
 
Cloak-Reduce Load Balancing Strategy for Mapreduce
Cloak-Reduce Load Balancing Strategy for MapreduceCloak-Reduce Load Balancing Strategy for Mapreduce
Cloak-Reduce Load Balancing Strategy for MapreduceAIRCC Publishing Corporation
 
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc NetworkHandling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc NetworkIJCERT
 
Peer to peer cache resolution mechanism for mobile ad hoc networks
Peer to peer cache resolution mechanism for mobile ad hoc networksPeer to peer cache resolution mechanism for mobile ad hoc networks
Peer to peer cache resolution mechanism for mobile ad hoc networksijwmn
 
PROVIDES AN APPROACH BASED ON ADAPTIVE FORWARDING AND LABEL SWITCHING TO IMPR...
PROVIDES AN APPROACH BASED ON ADAPTIVE FORWARDING AND LABEL SWITCHING TO IMPR...PROVIDES AN APPROACH BASED ON ADAPTIVE FORWARDING AND LABEL SWITCHING TO IMPR...
PROVIDES AN APPROACH BASED ON ADAPTIVE FORWARDING AND LABEL SWITCHING TO IMPR...AIRCC Publishing Corporation
 
ODRS: Optimal Data Replication Scheme for Time Efficiency In MANETs
ODRS: Optimal Data Replication Scheme for Time Efficiency In  MANETsODRS: Optimal Data Replication Scheme for Time Efficiency In  MANETs
ODRS: Optimal Data Replication Scheme for Time Efficiency In MANETsIOSR Journals
 
Efficient IOT Based Sensor Data Analysis in Wireless Sensor Networks with Cloud
Efficient IOT Based Sensor Data Analysis in Wireless Sensor Networks with CloudEfficient IOT Based Sensor Data Analysis in Wireless Sensor Networks with Cloud
Efficient IOT Based Sensor Data Analysis in Wireless Sensor Networks with Cloudiosrjce
 
Caching on Named Data Network: a Survey and Future Research
Caching on Named Data Network: a Survey  and Future Research Caching on Named Data Network: a Survey  and Future Research
Caching on Named Data Network: a Survey and Future Research IJECEIAES
 
Effective Approach For Content Based Image Retrieval In Peer-Peer To Networks
Effective Approach For Content Based Image Retrieval In Peer-Peer To NetworksEffective Approach For Content Based Image Retrieval In Peer-Peer To Networks
Effective Approach For Content Based Image Retrieval In Peer-Peer To NetworksIRJET Journal
 
In network aggregation techniques for wireless sensor networks - a survey
In network aggregation techniques for wireless sensor networks - a surveyIn network aggregation techniques for wireless sensor networks - a survey
In network aggregation techniques for wireless sensor networks - a surveyGungi Achi
 

Tendances (17)

A novel cache resolution technique for cooperative caching in wireless mobile...
A novel cache resolution technique for cooperative caching in wireless mobile...A novel cache resolution technique for cooperative caching in wireless mobile...
A novel cache resolution technique for cooperative caching in wireless mobile...
 
Towards a low cost etl system
Towards a low cost etl systemTowards a low cost etl system
Towards a low cost etl system
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Architecture and Evaluation on Cooperative Caching In Wireless P2P
Architecture and Evaluation on Cooperative Caching In Wireless  P2PArchitecture and Evaluation on Cooperative Caching In Wireless  P2P
Architecture and Evaluation on Cooperative Caching In Wireless P2P
 
Cloak-Reduce Load Balancing Strategy for Mapreduce
Cloak-Reduce Load Balancing Strategy for MapreduceCloak-Reduce Load Balancing Strategy for Mapreduce
Cloak-Reduce Load Balancing Strategy for Mapreduce
 
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc NetworkHandling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
Handling Selfishness in Replica Allocation over a Mobile Ad-Hoc Network
 
Sensor Data Management
Sensor Data ManagementSensor Data Management
Sensor Data Management
 
Peer to peer cache resolution mechanism for mobile ad hoc networks
Peer to peer cache resolution mechanism for mobile ad hoc networksPeer to peer cache resolution mechanism for mobile ad hoc networks
Peer to peer cache resolution mechanism for mobile ad hoc networks
 
PROVIDES AN APPROACH BASED ON ADAPTIVE FORWARDING AND LABEL SWITCHING TO IMPR...
PROVIDES AN APPROACH BASED ON ADAPTIVE FORWARDING AND LABEL SWITCHING TO IMPR...PROVIDES AN APPROACH BASED ON ADAPTIVE FORWARDING AND LABEL SWITCHING TO IMPR...
PROVIDES AN APPROACH BASED ON ADAPTIVE FORWARDING AND LABEL SWITCHING TO IMPR...
 
ODRS: Optimal Data Replication Scheme for Time Efficiency In MANETs
ODRS: Optimal Data Replication Scheme for Time Efficiency In  MANETsODRS: Optimal Data Replication Scheme for Time Efficiency In  MANETs
ODRS: Optimal Data Replication Scheme for Time Efficiency In MANETs
 
Efficient IOT Based Sensor Data Analysis in Wireless Sensor Networks with Cloud
Efficient IOT Based Sensor Data Analysis in Wireless Sensor Networks with CloudEfficient IOT Based Sensor Data Analysis in Wireless Sensor Networks with Cloud
Efficient IOT Based Sensor Data Analysis in Wireless Sensor Networks with Cloud
 
C0312023
C0312023C0312023
C0312023
 
Caching on Named Data Network: a Survey and Future Research
Caching on Named Data Network: a Survey  and Future Research Caching on Named Data Network: a Survey  and Future Research
Caching on Named Data Network: a Survey and Future Research
 
Effective Approach For Content Based Image Retrieval In Peer-Peer To Networks
Effective Approach For Content Based Image Retrieval In Peer-Peer To NetworksEffective Approach For Content Based Image Retrieval In Peer-Peer To Networks
Effective Approach For Content Based Image Retrieval In Peer-Peer To Networks
 
In network aggregation techniques for wireless sensor networks - a survey
In network aggregation techniques for wireless sensor networks - a surveyIn network aggregation techniques for wireless sensor networks - a survey
In network aggregation techniques for wireless sensor networks - a survey
 
B0330811
B0330811B0330811
B0330811
 

En vedette

The Mobile Landscape (March 2014)
The Mobile Landscape (March 2014)The Mobile Landscape (March 2014)
The Mobile Landscape (March 2014)David McClelland
 
Introduction to Web Science Institute
Introduction to Web Science InstituteIntroduction to Web Science Institute
Introduction to Web Science InstituteWeb Science Institute
 
"飞住玩"法国-深度游-菜单
"飞住玩"法国-深度游-菜单"飞住玩"法国-深度游-菜单
"飞住玩"法国-深度游-菜单viraree
 
Malalties relacionades amb estils de vida
Malalties relacionades amb estils de vidaMalalties relacionades amb estils de vida
Malalties relacionades amb estils de vidaNoelia Medina Allué
 
Kennismaking Flynth Regio Midden
Kennismaking Flynth Regio MiddenKennismaking Flynth Regio Midden
Kennismaking Flynth Regio MiddenFransJansen505
 
Acatistul sfântului luca al Crimeei
Acatistul sfântului luca al CrimeeiAcatistul sfântului luca al Crimeei
Acatistul sfântului luca al CrimeeiAntonella Stancu
 
Acatistul cuviosului paisie_aghioritul
Acatistul cuviosului paisie_aghioritulAcatistul cuviosului paisie_aghioritul
Acatistul cuviosului paisie_aghioritulAntonella Stancu
 
Jan -city_market_report_hi_resolution
Jan  -city_market_report_hi_resolutionJan  -city_market_report_hi_resolution
Jan -city_market_report_hi_resolutionjimc11
 
RCUK PATINA presentation - Tom Frankland
RCUK PATINA presentation - Tom FranklandRCUK PATINA presentation - Tom Frankland
RCUK PATINA presentation - Tom FranklandWeb Science Institute
 
State of Central Florida Arts Organizations 2012
State of Central Florida Arts Organizations 2012State of Central Florida Arts Organizations 2012
State of Central Florida Arts Organizations 2012PresentMark
 
Data is Dull! Make Challenging Content Interesting with Online Video! (Jan 2012)
Data is Dull! Make Challenging Content Interesting with Online Video! (Jan 2012)Data is Dull! Make Challenging Content Interesting with Online Video! (Jan 2012)
Data is Dull! Make Challenging Content Interesting with Online Video! (Jan 2012)David McClelland
 
FlytelFun
FlytelFunFlytelFun
FlytelFunviraree
 
Zeii-si-miturile-lumii-antice
Zeii-si-miturile-lumii-anticeZeii-si-miturile-lumii-antice
Zeii-si-miturile-lumii-anticeAntonella Stancu
 
SEO SEM and Web Analytics in Newcastle Upon Tyne
SEO SEM and Web Analytics in Newcastle Upon TyneSEO SEM and Web Analytics in Newcastle Upon Tyne
SEO SEM and Web Analytics in Newcastle Upon TyneWeb Social Media
 

En vedette (20)

The Mobile Landscape (March 2014)
The Mobile Landscape (March 2014)The Mobile Landscape (March 2014)
The Mobile Landscape (March 2014)
 
Introduction to Web Science Institute
Introduction to Web Science InstituteIntroduction to Web Science Institute
Introduction to Web Science Institute
 
"飞住玩"法国-深度游-菜单
"飞住玩"法国-深度游-菜单"飞住玩"法国-深度游-菜单
"飞住玩"法国-深度游-菜单
 
Malalties relacionades amb estils de vida
Malalties relacionades amb estils de vidaMalalties relacionades amb estils de vida
Malalties relacionades amb estils de vida
 
Read my blog
Read my blog Read my blog
Read my blog
 
Digital Economies 12 Dec 2011
Digital Economies 12 Dec 2011Digital Economies 12 Dec 2011
Digital Economies 12 Dec 2011
 
Intro to UOSM2012
Intro to UOSM2012Intro to UOSM2012
Intro to UOSM2012
 
Kennismaking Flynth Regio Midden
Kennismaking Flynth Regio MiddenKennismaking Flynth Regio Midden
Kennismaking Flynth Regio Midden
 
Acatistul sfântului luca al Crimeei
Acatistul sfântului luca al CrimeeiAcatistul sfântului luca al Crimeei
Acatistul sfântului luca al Crimeei
 
Acatistul cuviosului paisie_aghioritul
Acatistul cuviosului paisie_aghioritulAcatistul cuviosului paisie_aghioritul
Acatistul cuviosului paisie_aghioritul
 
Jan -city_market_report_hi_resolution
Jan  -city_market_report_hi_resolutionJan  -city_market_report_hi_resolution
Jan -city_market_report_hi_resolution
 
RCUK PATINA presentation - Tom Frankland
RCUK PATINA presentation - Tom FranklandRCUK PATINA presentation - Tom Frankland
RCUK PATINA presentation - Tom Frankland
 
Cool Social Media Tools
Cool Social Media ToolsCool Social Media Tools
Cool Social Media Tools
 
Big Data
Big DataBig Data
Big Data
 
State of Central Florida Arts Organizations 2012
State of Central Florida Arts Organizations 2012State of Central Florida Arts Organizations 2012
State of Central Florida Arts Organizations 2012
 
Data is Dull! Make Challenging Content Interesting with Online Video! (Jan 2012)
Data is Dull! Make Challenging Content Interesting with Online Video! (Jan 2012)Data is Dull! Make Challenging Content Interesting with Online Video! (Jan 2012)
Data is Dull! Make Challenging Content Interesting with Online Video! (Jan 2012)
 
FlytelFun
FlytelFunFlytelFun
FlytelFun
 
Zeii-si-miturile-lumii-antice
Zeii-si-miturile-lumii-anticeZeii-si-miturile-lumii-antice
Zeii-si-miturile-lumii-antice
 
SEO SEM and Web Analytics in Newcastle Upon Tyne
SEO SEM and Web Analytics in Newcastle Upon TyneSEO SEM and Web Analytics in Newcastle Upon Tyne
SEO SEM and Web Analytics in Newcastle Upon Tyne
 
Curriculum Innovation for DE Lunch
Curriculum Innovation for DE LunchCurriculum Innovation for DE Lunch
Curriculum Innovation for DE Lunch
 

Similaire à Internet data mining 2006

Database techniques for resilient network monitoring and inspection
Database techniques for resilient network monitoring and inspectionDatabase techniques for resilient network monitoring and inspection
Database techniques for resilient network monitoring and inspectionTELKOMNIKA JOURNAL
 
The Overview of Discovery and Reconciliation of LTE Network
The Overview of Discovery and Reconciliation of LTE NetworkThe Overview of Discovery and Reconciliation of LTE Network
The Overview of Discovery and Reconciliation of LTE NetworkIRJET Journal
 
Ncct Ieee Software Abstract Collection Volume 1 50+ Abst
Ncct   Ieee Software Abstract Collection Volume 1   50+ AbstNcct   Ieee Software Abstract Collection Volume 1   50+ Abst
Ncct Ieee Software Abstract Collection Volume 1 50+ Abstncct
 
How Does Your Real-time Data Look?
How Does Your Real-time Data Look?How Does Your Real-time Data Look?
How Does Your Real-time Data Look?Supreet Oberoi
 
A flexible phasor data concentrator design
A flexible phasor data concentrator designA flexible phasor data concentrator design
A flexible phasor data concentrator designsunder fou
 
Study on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemStudy on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemijitjournal
 
White Paper: Advanced Cyber Analytics with Greenplum Database
White Paper: Advanced Cyber Analytics with Greenplum DatabaseWhite Paper: Advanced Cyber Analytics with Greenplum Database
White Paper: Advanced Cyber Analytics with Greenplum DatabaseEMC
 
Cloud data management
Cloud data managementCloud data management
Cloud data managementambitlick
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...Tal Lavian Ph.D.
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingeSAT Journals
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom IndustryCloudera, Inc.
 
Iaetsd survey on big data analytics for sdn (software defined networks)
Iaetsd survey on big data analytics for sdn (software defined networks)Iaetsd survey on big data analytics for sdn (software defined networks)
Iaetsd survey on big data analytics for sdn (software defined networks)Iaetsd Iaetsd
 
Efficient Database Management System For Wireless Sensor Network
Efficient Database Management System For Wireless Sensor Network Efficient Database Management System For Wireless Sensor Network
Efficient Database Management System For Wireless Sensor Network Onyebuchi nosiri
 
Anomalous symmetry succession for seek out
Anomalous symmetry succession for seek outAnomalous symmetry succession for seek out
Anomalous symmetry succession for seek outiaemedu
 
Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...IJECEIAES
 

Similaire à Internet data mining 2006 (20)

Database techniques for resilient network monitoring and inspection
Database techniques for resilient network monitoring and inspectionDatabase techniques for resilient network monitoring and inspection
Database techniques for resilient network monitoring and inspection
 
1105.1950
1105.19501105.1950
1105.1950
 
The Overview of Discovery and Reconciliation of LTE Network
The Overview of Discovery and Reconciliation of LTE NetworkThe Overview of Discovery and Reconciliation of LTE Network
The Overview of Discovery and Reconciliation of LTE Network
 
Ncct Ieee Software Abstract Collection Volume 1 50+ Abst
Ncct   Ieee Software Abstract Collection Volume 1   50+ AbstNcct   Ieee Software Abstract Collection Volume 1   50+ Abst
Ncct Ieee Software Abstract Collection Volume 1 50+ Abst
 
How Does Your Real-time Data Look?
How Does Your Real-time Data Look?How Does Your Real-time Data Look?
How Does Your Real-time Data Look?
 
A flexible phasor data concentrator design
A flexible phasor data concentrator designA flexible phasor data concentrator design
A flexible phasor data concentrator design
 
Study on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemStudy on potential capabilities of a nodb system
Study on potential capabilities of a nodb system
 
White Paper: Advanced Cyber Analytics with Greenplum Database
White Paper: Advanced Cyber Analytics with Greenplum DatabaseWhite Paper: Advanced Cyber Analytics with Greenplum Database
White Paper: Advanced Cyber Analytics with Greenplum Database
 
Cloud data management
Cloud data managementCloud data management
Cloud data management
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
Fp3111131118
Fp3111131118Fp3111131118
Fp3111131118
 
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw...
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
 
DIET_BLAST
DIET_BLASTDIET_BLAST
DIET_BLAST
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
 
Iaetsd survey on big data analytics for sdn (software defined networks)
Iaetsd survey on big data analytics for sdn (software defined networks)Iaetsd survey on big data analytics for sdn (software defined networks)
Iaetsd survey on big data analytics for sdn (software defined networks)
 
Efficient Database Management System For Wireless Sensor Network
Efficient Database Management System For Wireless Sensor Network Efficient Database Management System For Wireless Sensor Network
Efficient Database Management System For Wireless Sensor Network
 
Anomalous symmetry succession for seek out
Anomalous symmetry succession for seek outAnomalous symmetry succession for seek out
Anomalous symmetry succession for seek out
 
Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...Granularity analysis of classification and estimation for complex datasets wi...
Granularity analysis of classification and estimation for complex datasets wi...
 
Aa31163168
Aa31163168Aa31163168
Aa31163168
 

Dernier

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Dernier (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Internet data mining 2006

  • 1. Challenges and Opportunities in Internet Data Mining David G. Andersen Nick Feamster Carnegie Mellon University Georgia Institute of Technology CMU-PDL-06-102 Jan 2006 Parallel Data Laboratory Carnegie Mellon University Pittsburgh, PA 15213-3890 Abstract Internet measurement data provides the foundation for the operation and planning of the networks that comprise the Internet, and is a necessary component in research for analysis, simulation, and emulation. Despite its critical role, however, the management of this data—from collection and transmission to storage and its use within applications—remains primarily ad hoc, using techniques created and re-created by each corporation or researcher that uses the data. This paper examines several of the challenges faced when attempting to collect and archive large volumes of network measurement data, and outlines an architecture for an Internet data repository—the datapository—designed to create a framework for collaboratively addressing these challenges.
  • 2. Keywords: network monitoring, databases, data management, data mining
  • 3. 1 Introduction Communications networks produce vast amounts of data that both help network operators manage and plan their networks and enable researchers to study a variety of network characteristics. The data is rich, and it varies from the minute (e.g., kilobits per hour of periodic average link utilization data) to the massive (e.g., gigabits per second of live traffic captures). Its uses are equally diverse: On short timescales, the data can facilitate network monitoring, troubleshooting, and reactive routing [9]. Over minutes to days, the data is useful for network traffic engineering, in which operators attempt to shift the flow of traffic away from over- utilized links onto less busy paths. Over months, the data is useful for capacity planning; on even longer timescales, the data helps researchers perform longitudinal studies. Network data is essential for the operation and planning of the networks that comprise the Internet; it is also vital for analysis, simulation, and emulation. Despite its critical role, the management of this data— from collection and transmission to storage and its use within applications—remains disconcertingly ad hoc, using techniques created and re-created by each corporation or researcher. As one example, consider the storage formats used for node-to-node loss and latency (“ping”) data. A number of different projects collect such data, in formats ranging from: • A hierarchy of directories, one per day, with an N-line text file containing a matrix of ping data between a set of N nodes, with errors recorded in a separate file. (PlanetLab all-pairs pings data [21].) • A generalized binary network data storage format (Skitter’s ARTS++ format [5]). • Stored internally using a ROOT-based file [4] (RIPE’s Test Traffic Machines [11]). • A MySQL database, with one row per pair of probes and a non-standard text format for export upon demand (The RON testbed all-pairs probe data [2]). This paper examines several challenges in collecting and archiving large volumes of network mea- surement data, and outlines an architecture for an Internet data repository—the datapository—designed to create a framework for collaboratively addressing these challenges. Rather than attempt to solve all of these problems, we hope that the datapository will catalyze sharing of tools, formats, and data among researchers and operators. We envision that the datapository will be useful both for storage and analysis, and as a tool for real-time network monitoring. The heterogeneity of data formats and the lack of an infrastructure to analyze it efficiently harm both network operations and the science of network measurement. The data’s unwieldy nature prevents network operators from harnessing the vast amounts of data that could save them time and money (e.g., via network provisioning and traffic engineering) and help defend their networks from attack (e.g., by alerting operators to routing or traffic anomalies). From a scientific perspective, the lack of an analysis facility means that every network measurement researcher must develop “home brew” analysis tools and encounter the same set of traps and pitfalls as previous researchers. Worse yet, after a study is published, the authors’ analysis tools may go unused for long periods of time before another researcher tries to study the same published results years later—at which point the data may have moved to a different location or been re-organized entirely. Even if a researcher successfully locates the data and tools from a particular study, he is left with the challenge of re-discovering how to reproduce the results (“Which parameters did they use to generate that graph?”, “What subset of this data was used?”, etc.). Indeed, for Internet measurement, repeatable experimentation—a major cornerstone of science—requires significant advances in the storage and management of data. The remainder of this paper discusses the challenges that arise in constructing a network data analysis facility (Section 2), presents the initial design of the datapository (Section 3), and presents our vision for the opportunities that we hope the datapository will ultimately provide (Section 4). We warn the reader in 1
  • 4. advance that this paper raises more questions than it answers, though we hope it shines light on a possible path through the measurement minefield. 2 Challenges This section briefly surveys the challenges in realizing a network data collection and storage facility. 2.1 Robust, distributed data collection Collecting network measurement data is by nature distributed; this collection must be resilient to network and server outages, providing the datapository with an eventually complete data set. 1 In our experience, measurement nodes may be offline for days to months before coming back online and may be partitioned from the network for hours to days while still collecting measurements. While work such as Paxson’s Strategies for Sound Internet Measurement [17] provides valuable guide- lines for how to measure and analyze data about the network, less attention has been paid to the collection of this data for subsequent analysis. Unfortunately, when large number of nodes or large volumes of data are involved, the difficulties involved in collecting and organizing can be as large as the difficulties of how to measure the network in the first place. 2.2 Multi-timescale, heterogeneous data The combination of performing real-time analysis and data mining years of archived data presents challenges to both conventional and streaming database systems: Non-standard, irregular data. Network data is information-dense, and we cannot predict how users will query the data, which makes indexing a challenge. Some data is best served by custom search tech- niques. For example, each routing update contains a network prefix and a prefix length. The prefix length specifies the number of significant bits in the prefix. A common operation in prefix queries is to find a longest-prefix match (LPM): Given an IP address, find the prefix with the longest prefix length whose sig- nificant bits are equal to those bits in the IP address. Some data structures (e.g., search tries) can efficiently perform LPM, but the best way to express this query in SQL is as a set of 32 OR’d conditions for each prefix length. SELECT * from table WHERE (prefix = x AND mask=32) OR (prefix = x & 0xfffffffe AND mask=31) OR ... Additionally, network monitoring data is often not amenable to the regular form of standard databases. Packet traces, for instance, have nearly arbitrary content, and a search may examine any of these fields, extract the TCP stream from a series of packets, or examine application-layer headers. RBDMSes do not handle such queries, and existing tools such as tcpdump or ethereal cannot efficiently process terabytes of packet traces. Other data, such as the popular NetFlow format for flow summarization, has similar limitations for data mining. 1 The measurement techniques used must also be robust to these events. This lesson is explained well by Paxson, whose original Internet measurements were triggered from a centralized node, and therefore under-counted some failures [16], or whose measurements depended on the DNS [18]. The authors of this paper have seen these mistakes repeated several times since. 2
  • 5. Raw Data Collection and Storage Analysis, Queries, and External Interfaces Probe & Traceroute Compute Engine BGP Updates Netflow Compute Engine Compute Engines (Input processing) Storage & DB (Public services) Packet Traces SNMP Data Archival Storage Compute Engine (SQL processing) Router Configs Figure 1: The datapository architecture. To process non-standard data sources, the datapository may be able to leverage PADS, a system that takes as input a data format description and parses raw data files of arbitrary formats [10]. Large archival data mining. Aurora [3] provides a promising way to process real-time streams of network data, and AT&T’s Gigascope [6] supports real-time monitoring of gigabit networks. These systems are excellent building blocks for processing raw data, but they are typically designed for either low-speed streams (e.g., financial transactions) or modest-size “windows” of data. Longitudinal studies and joint anal- ysis of network data may require much large windows over high-speed streams. Most conventional solutions are unsuited to our application, and are unable to take advantage of many of the simplifying advantages of a data archive: the data is read-only, is easily cached, follows a strict time ordering, and is amenable to both intra- and inter-query parallelism. Data-mining solutions such as Netezza [14] offer attractive solutions for data mining. Network mea- surement data offers many opportunities for optimization: the data is read-only, is easily cached, and follows a strict time ordering. Unfortunately, current data mining tools do not cope well with the unique data types present in network monitoring data. We believe that the long-term development of the datapository will spur the development of log-processing database management systems. 2.3 Workflow metadata and privacy Metadata about data’s origins, errors, and modifications—its provenance—must be maintained and commu- nicated with the data as it progresses through the analysis workflow [13, 23]. Data producers or the research community must agree upon common metadata formats for data [1] and analysis tools in order to facilitate data sharing. An important part of the workflow is addressing privacy requirements. The privacy requirements for measurement data range from nearly none (e.g., many BGP routing traces, which are publicly available) to extreme (e.g., an un-anonymized packet trace). In order to gain the efficiencies of scale that we hope to realize with the Datapository, it must be able to host and analyze data at the ends and in the middle of the pri- vacy spectrum. In fact, the steps of “declassifying” this data (e.g., using anonymization “scrubbers” [12] or summarization techniques) will be an important part of the analysis workflow that the datapository enables. 3 The datapository The datapository consists of a distributed set of data sources and a central storage and analysis infrastruc- ture, shown in Figure 1. The data is processed by a set of compute engines dedicated to input processing, 3
  • 6. which insert the data into the central storage and database nodes. Once inserted, the data may be analyzed by tools on other compute engines. 3.1 Data Collection The datapository collects data from many sources, including the RON testbed path monitoring nodes and BGP feeds, Abilene data, and the RouteViews BGP archive. Internally, the datapository understands an orthogonal set of data types (which abstract a set of data formats), and access methods: Data types are a canonical version of a particular type of measurement (e.g.traceroutes, end-to-end probes, and BGP routing updates). The datapository is currently archiving and processing BGP routing tables and updates from RON and Abilene, and spam logs from two spam honeypots. We have immediate plans to archive RouteViews BGP feeds and end-to-end probes from the RON testbed. Data formats are different storage formats for a particular data type. Each of the formats listed in the Introduction would be one such data format; other examples include the MRT format commonly used to store BGP routing updates, or ASCII files recording traceroutes. Access methods are implementations of techniques that the datapository obtains data from the data sources. Examples include the code that obtains data (e.g., RouteViews) via FTP, and that which securely copies data from the RON testbed nodes to the datapository via SSH. Central to the datapository is the notion of a feed. A feed represents an (access method, data format, data type) tuple, along with the ancillary data used by the access method or data format (e.g., usernames and passwords, or state corresponding to which objects have already been retrieved). Each feed defines a workflow that uses an access method to obtain data in a particular format, converts that format into the canonical representation for that data type, and inserts the data into storage. 3.2 Data Storage We wish to store data in a cluster of databases that offer an SQL-like interface. Our prior experience and that of others suggests that basing analysis out of a database (we choose SQL out of familiarity) improves consistency, facilitates reuse, and reduces researcher effort [20, 17]. To this end, the present datapository architecture uses a set of MySQL compute engines, each of which has read-only access to the historical data via a distributed filesystem. This approach has the advantage of simplicity and working with off-the-shelf components, but fails to take advantage of the intra-query parallelism that is available in many of the data-mining queries we issue against the datapository. In addition, the current architecture is limited in its ability to simultaneously query the historical data and newly arrived data, which is maintained read-write by a single database node. We view the development of more suitable storage and mining techniques as a prime challenge for the further development of the datapository. 3.3 Analysis and Queries We envision three “users” for the datapository: Network operators, researchers performing analysis, and simulation/emulation systems that require traces as input. Network operators will be able to use the datapository for real-time debugging and network monitoring. By aggregating data from many vantage points into a logically centralized database, operators can efficiently get information from a geographically and topologically diverse set of network locations. Our previous implementation of a public “BGP monitor” was useful to operators in identifying network-wide failures and debugging reachability problems. Such utilities are particularly useful because they grant operators a view of the network from vantage points outside their own network. 4
  • 7. Our instance of the datapository holds datasets that researchers and operators choose to make publicly available. Operators may wish to build private datapositories to analyze confidential data and query that data either in isolation or in combination with the publicly available datapository. Analyzing traffic and routing data can help operators with traffic engineering and capacity planning. The datapository can help network researchers perform long-term longitudinal studies. Network data often exists in two forms: (1) short-term data collected for a particular experiment and (2) long-term data that is collected, but never analyzed in toto due to the massive overhead in running a query For example, the RouteViews BGP routing updates comprise BGP data from about 41 routers in 35 different ISPs [15]. An hour of compressed BGP updates from RouteViews requires one megabyte; as a result, longitudinal queries across multiple years and multiple BGP datasets can become prohibitively expensive. As a shared facility, the datapository can amortize storage, processing, and management costs, such as building and maintaining a database and query infrastructure for longitudinal datasets. We hope that lowering the cost of running queries will foster serendipity by making it easier for researchers to issue exploratory queries. The datapository aggregates multiple data feeds into a single archive, which facilitates performing joint analysis across multiple datasets. Among many examples, our past research has shown the benefits of performing joint analysis across traceroutes, active probes, and BGP routing updates to study the correlations between end-to-end path performance and Internet routing stability [9]. More recently, we have studied the interaction between BGP route hijacking and spam activity [8]. It is our hope that, by providing a common storage and query infrastructure for diverse datasets, Internet researchers will find new ways of finding patterns and correlations across datasets. 3.4 External Interfaces The datapository makes raw and aggregated data available through standard interfaces. The datapository cur- rently supports an XMLRPC export mechanism, and we are implementing bulk raw data export. The XML- RPC interface can export data in various formats, including human-readable output and Matlab-friendly matrices. This data export mechanism allows researchers to focus on analysis techniques without being intimately familiar with the myriad formats for network data. In addition to these export formats, we have created several Web interfaces for browsing data and for network troubleshooting. To facilitate experimentation, we are integrating the datapository with Emulab [22]. Emulab provides hardware resources (nodes and links) that can be configured into a desired network topology. Researchers use this emulated network to evaluate real code and distributed systems. Emulab provides a valuable mid- dle ground between simulation and untamed (and un-repeatable) live networks. However, the accuracy of many emulation experiments—much like simulations—lives and dies by the model and input parameters. Coupling the datapository to Emulab can significantly improve this aspect. Our goal is to foster a seamless flow of data from network traces to Emulab experiments and back. Emulab users will have access to realistic topologies, background traffic patterns, configurations, and even prior experimental and analytical data. The data produced by these experiments can then be inserted into the datapository for archiving and further processing. This integration presents a classical workflow problem, managing the flow of data from the datapository, into emulation (and perhaps the real world, via Emulab’s wide-area nodes), and back. Internet research involves a continual cycle of measuring a system to better understand and evaluate its properties, modeling these properties in a way that makes them easy to integrate into a controlled experiment (e.g., in simulation or emulation), and making changes to the system that is actually deployed. The datapository acts as a critical component in this process by facilitating data collection and management, as well as acting as a driver for trace-based emulation. Section 4.1 describes this process in more detail, as well as its relation to Emulab’s scientific workflow management [7]. 5
  • 8. 4 Opportunities This section explores possible future uses for the datapository. We focus first on how the datapository might help scientific experimentation; then, we explore how it might be employed as a cornerstone of network management—even to the point of controlling network operation. 4.1 Repeatable Scientific Experimentation Two significant challenges in experimentation are repeating a past experiment (under the same or different conditions) and performing new analysis of old data. Without standard techniques for packaging old data and analysis scripts, measurement data is often shuffled about and forgotten or becomes a subset of a con- tinuously collected data stream. At a later date, researchers must recall which subsets of each data stream and which processing scripts they used to produce each set of results. The datapository can save measurement data from these fates by allowing a researcher to store the details of their analysis workflow. This archived workflow names all of the machinery a researcher would need to reproduce a particular experiment: the data, all scripts needed for analysis, and the process by which these scripts should be executed (i.e., a description of the experiment). The workflow should also refer to any relevant information about how the data was collected and other noteworthy aspects of the data, such as times when the collection machinery was faulty or non-functional, how the data was collected, etc. Our support for workflows is complementary to Emulab, which proposes integrating workflow management into generating—not just analyzing—measurements [7]. The datapository’s archives may also prove useful for driving trace-driven network emulations. The networking community is developing several platforms for network emulation, which would benefit from the ability to play back traces of actual network data. For example, a researcher might set up an experiment that mirrors a real network topology, process traces at the datapository as they are collected from the real network, and project the failures in these traces onto the mirrored network. 4.2 Network Monitoring and Control Network operators must ensure availability and good performance in the face of constantly changing condi- tions (e.g., changing traffic patterns, link failures) and security threats (e.g., worms, denial-of-service (DoS) attacks). Unfortunately, critical data often escapes operators’ notice due to both the lack of an infrastructure to collect and analyze this data and the lack of adequate detection algorithms. We hope that the dataposi- tory will evolve to support real-time analysis and detection. A real-time analysis facility might even allow operators to automate some aspects of network control. For example, such a system could detect erroneous routing advertisements and prevent them from propagating; it could also detect a DoS or worm attacks and install filters. 5 Discussion and Summary The datapository is an in-progress facility for shared network data analysis. We believe that a community- based approach to network data analysis can play a critical role in both networking research and operations. In contrast to existing approaches that emphasize cataloging and standardizing Internet data [19], we believe that providing a public computational and storage resource will be an effective stimulus for community involvement and adoption of the standards that evolve. While building the datapository, we have encountered several conflicts between existing workflow and data management systems and the requirements of network data. We hope that this paper will serve as a catalyst for the improvement of both our data repository, and the available data management techniques. 6
  • 9. References [1] Mark Allman, Ethan Blanton, and Wesley M. Eddy. A scalable system for sharing Internet measure- ment. In Passive & Active Measurement (PAM), Fort Collins, CO, March 2004. [2] David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. Experience with an Evolving Overlay Network Testbed. ACM Computer Communications Review, 33(3):13–19, July 2003. [3] Hari Balakrishnan, Magdalena Balazinska, Donald Carney, et al. Retrospective on Aurora. VLDB Journal, January 2004. [4] Rene Brun and Fons Rademakers. ROOT - an object oriented data analysis framework. In Proc. AIHENP’96 Workshop, September 1996. [5] CAIDA’s Skitter project, 2002. http://www.caida.org/tools/measurement/skitter/. [6] Chuck Cranor, Theodore Johnson, Oliver Spataschek, and Vladislav Shkapenyuk. Gigascope: a stream database for network applications. In Proc. ACM SIGMOD, pages 647–651, San Diego, CA, June 2003. [7] Eric Eide, Juliana Freire, Jay Lepreau, Tim Stack, and Leigh Stoller. Integrated scientific workflow management for the Emulab network testbed. Submitted for publication, December 2005. [8] Nick Feamster. Open Problems in BGP Anomaly Detection. http://www.caida.org/outreach/ isma/0411/abstracts.xml#NickFeamster2, Nov 2004. Slides available upon request. [9] Nick Feamster, David Andersen, Hari Balakrishnan, and M. Frans Kaashoek. Measuring the effects of Internet path faults on reactive routing. In Proc. ACM SIGMETRICS, San Diego, CA, June 2003. [10] Kathleen Fisher and Robert Gruber. Pads: a domain-specific language for processing ad hoc data. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implemen- tation, pages 295–304, Chicago, IL, Jun 2005. [11] Fotis Georgatos, Florian Gruber, Daniel Karrenberg, Mark Santcroos, Ana Susanj, Henk Uijterwaal, and Rene Wilhelm. Providing active measurements as a regular service for ISP’s. In Passive & Active Measurement (PAM), April 2001. [12] Alefiya Hussain, Genevieve Bartlett, Yuri Pryadkin, John Heidemann, Christos Papadopoulos, and Joseph Bannister. Experiences with a continuous network tracing infrastructure. In Proc. ACM SIG- COMM Workshop on Mining Network Data (MineNet), Philadelphia, PA, August 2005. [13] Jonathan Ledlie, Chaki Ng, David Holland, Kiran-Kumar Muniswamy-Reddy, Uri Braun, and Margo Seltzer. Provenance-aware sensor data storage. In Proc. Workshop on Networking Meets Databases (NetDB), April 2005. [14] Netezza. Business intelligence data warehouse appliance. Netezza Home Page. [15] University of Oregon. RouteViews. http://www.routeviews.org/. [16] V. Paxson. End-to-End Routing Behavior in the Internet. IEEE/ACM Transactions on Networking, 5(5):601–615, 1997. 7
  • 10. [17] Vern Paxson. Strategies for sound Internet measurement. In Proc. ACM SIGCOMM Internet Measure- ment Conference, Taormina, Sicily, Italy, October 2004. [18] Vern Paxson, Andrew Adams, and Matt Mathis. Experiences with NIMI. In Passive & Active Mea- surement (PAM), April 2000. [19] Colleen Shannon, David Moore, Ken Keys, Marina Fomenkov, Bradley Huffaker, and k. claffy. The Internet measurement data catalog. ACM Computer Communications Review, 35(5):97–100, October 2005. [20] Neil Spring, Ratul Mahajan, and David Wetherall. Measuring ISP topologies with Rocketfuel. In Proc. ACM SIGCOMM, Pittsburgh, PA, August 2002. [21] Jeremy Stribling. Planetlab - all pairs pings. http://pdos.csail.mit.edu/~strib/pl_app/, nov 2005. [22] Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad Barb, and Abhijeet Joglekar. An integrated experimental environment for distributed systems and networks. In Proc. 5th USENIX OSDI, pages 255–270, Boston, MA, December 2002. [23] Jennifer Widom. Trio: A system for integrated management of data, accuracy, and lineage. In Proc. 2nd Biennial Conference on Innovative Data Systems Research (CIDR), Pacific Grove, CA, January 2005. 8