The demands of data-intensive science represent a challenge for diverse scientific communities. Data volumes from various sources are increasing exponentially, creating data management challenges. New approaches and technologies are needed to enable scientists to effectively analyze and store massive amounts of data.
The Challenges of Data-Intensive Science
1. Data-Intensive Research
Jano van Hemert
research.nesc.ac.uk
The University of Edinburgh
2. Downloaded from www.sciencemag.org on July 6, 2009
COMPUTER SCIENCE
Beyond the Data Deluge
The demands of data-intensive science represent a challenge for diverse scientific communities.
Gordon Bell,1 Tony Hey,1 Alex Szalay2
Since at least Newton’s laws of motion in the 17th century, scientists have recognized experimental and theoretical science as the basic research paradigms for understanding nature. In recent decades, computer simulations have become an essential third paradigm: a standard tool for scientists to explore domains that are inaccessible to theory and experiment, such as the evolution of the universe, car passenger crash testing, and predicting climate change. As simulations and experiments yield ever more data, a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science (1). For example, new types of computer clusters are emerging that are optimized for data movement and analysis rather than computing, while in astronomy and other sciences, integrated data systems allow data analysis and storage on site instead of requiring download of large amounts of data.
Today, some areas of science are facing hundred- to thousandfold increases in data volumes from satellites, telescopes, high-throughput instruments, sensor networks, accelerators, and supercomputers, compared to the volumes generated only a decade ago (2). In astronomy and particle physics, these new experiments generate petabytes (1 petabyte = 10^15 bytes) of data per year. In bioinformatics, the increasing volume (3) and the extreme heterogeneity of the data are challenging scientists (4). In contrast to the traditional hypothesis-led approach to biology, Venter and others have argued that a data-intensive inductive approach to genomics (such as shotgun sequencing) is necessary to address large-scale ecosystem questions (5, 6).
Other research fields also face major data management challenges. In almost every laboratory, “born digital” data proliferate in files, spreadsheets, or databases stored on hard drives, digital notebooks, Web sites, blogs, and wikis. The management, curation, and archiving of these digital data are becoming increasingly burdensome for research scientists.
Over the past 40 years or more, Moore’s Law has enabled transistors on silicon chips to get smaller and processors to get faster. At the same time, technology improvements for disks for storage cannot keep up with the ever increasing flood of scientific data generated by the faster computers. In university research labs, Beowulf clusters (groups of usually identical, inexpensive PC computers that can be used for parallel computations) have
Figure: Moon and Pleiades from the VO. Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the World Wide Telescope service. CREDIT: JONATHAN FAY/MICROSOFT
1 Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. 2 Department of Physics and Astronomy, Johns Hopkins University, 3701 San Martin Drive, Baltimore, MD 21218, USA. E-mail: szalay@jhu.edu
www.sciencemag.org SCIENCE VOL 323 6 MARCH 2009 1297
Published by AAAS
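3. Beyond the Data Deluge (doi: 10.1126/science.1171406): The demands of data-intensive science represent a challenge for diverse scientific communities.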
4. NEWS FEATURE 2020 COMPUTING NATURE | Vol 440 | 23 March 2006
J. MAGEE
EVERYTHING, EVERYWHERE
Tiny computers that constantly monitor ecosystems, buildings and even human bodies could turn science on its head. Declan Butler investigates.
Editor's Notes
* This is not about projects, publications
* One of the papers that is signposting
* Sensors, large machines, interaction with data (software), interaction between people, interaction of software on data, ...
* EMBL-EBI has now reached 4.5 petabytes
* MESUR has 1 billion records on usage data
* PACS at 160 GB in August 2009, quadruples every year
* More explicit forms of demands
* A proposed solution
* How do you go about implementing a solution under the fourth paradigm?
* Formulation = an abstract description of the data-intensive challenge
* Execution = an implementation of the challenge that runs on a computational platform
* Interaction = necessary to manage the formulation process and to steer the execution
* Research focuses on progressing computer science
* by evaluating both generic and tailored methodologies
* in a multidisciplinary context with
* rich use cases to test hypotheses
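The split into formulation, execution and interaction described in the notes can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the method used in the talk: the names `Formulation`, `execute` and `on_progress` are hypothetical, the "platform" is a plain local loop over batches, and the sample numbers are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Formulation:
    """Abstract description of the challenge: what to compute, not how or where."""
    name: str
    keep: Callable[[float], bool]        # which records are of interest
    transform: Callable[[float], float]  # what to derive from each kept record


def execute(
    formulation: Formulation,
    records: Sequence[float],
    batch_size: int = 2,
    on_progress: Optional[Callable[[int, int], bool]] = None,
) -> List[float]:
    """Execution: one possible implementation, run locally batch by batch.

    Interaction: after every batch the optional on_progress callback receives
    (batches_done, results_so_far) and returns False to stop the run early.
    """
    results: List[float] = []
    batches_done = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        results.extend(
            formulation.transform(r) for r in batch if formulation.keep(r)
        )
        batches_done += 1
        if on_progress is not None and not on_progress(batches_done, len(results)):
            break  # the user steered the execution to a halt
    return results


# Hypothetical formulation: keep values above 1.0 and scale them by 10.
bright = Formulation("bright-sources", keep=lambda x: x > 1.0,
                     transform=lambda x: x * 10)
data = [0.5, 2.0, 3.0, 0.1, 4.0, 5.0]  # made-up sample records

full_run = execute(bright, data)
# Steering: continue only while fewer than two results have been found.
steered_run = execute(bright, data, on_progress=lambda done, found: found < 2)
```

The same `Formulation` object could in principle be handed to a different `execute` that targets a cluster or a database, which is the point of keeping the abstract description separate from the implementation that runs it.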