Data-Intensive Research
                           Jano van Hemert
                            research.nesc.ac.uk
The University of Edinburgh
Downloaded from www.sciencemag.org on July 6, 2009
COMPUTER SCIENCE

Beyond the Data Deluge

Gordon Bell,1 Tony Hey,1 Alex Szalay2

The demands of data-intensive science represent a challenge for diverse scientific communities.

Since at least Newton’s laws of motion in the 17th century, scientists have recognized experimental and theoretical science as the basic research paradigms for understanding nature. In recent decades, computer simulations have become an essential third paradigm: a standard tool for scientists to explore domains that are inaccessible to theory and experiment, such as the evolution of the universe, car passenger crash testing, and predicting climate change. As simulations and experiments yield ever more data, a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science (1). For example, new types of computer clusters are emerging that are optimized for data movement and analysis rather than computing, while in astronomy and other sciences, integrated data systems allow data analysis and storage on site instead of requiring download of large amounts of data.

Today, some areas of science are facing hundred- to thousandfold increases in data volumes from satellites, telescopes, high-throughput instruments, sensor networks, accelerators, and supercomputers, compared to the volumes generated only a decade ago (2). In astronomy and particle physics, these new experiments generate petabytes (1 petabyte = 10^15 bytes) of data per year. In bioinformatics, the increasing volume (3) and the extreme heterogeneity of the data are challenging scientists (4). In contrast to the traditional hypothesis-led approach to biology, Venter and others have argued that a data-intensive inductive approach to genomics (such as shotgun sequencing) is necessary to address large-scale ecosystem questions (5, 6).

Other research fields also face major data management challenges. In almost every laboratory, “born digital” data proliferate in files, spreadsheets, or databases stored on hard drives, digital notebooks, Web sites, blogs, and wikis. The management, curation, and archiving of these digital data are becoming increasingly burdensome for research scientists.

Over the past 40 years or more, Moore’s Law has enabled transistors on silicon chips to get smaller and processors to get faster. At the same time, technology improvements for disks for storage cannot keep up with the ever increasing flood of scientific data generated by the faster computers. In university research labs, Beowulf clusters (groups of usually identical, inexpensive PC computers that can be used for parallel computations) have

Figure: Moon and Pleiades from the VO. Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the World Wide Telescope service. CREDIT: JONATHAN FAY/MICROSOFT

1 Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. 2 Department of Physics and Astronomy, Johns Hopkins University, 3701 San Martin Drive, Baltimore, MD 21218, USA. E-mail: szalay@jhu.edu

www.sciencemag.org   SCIENCE   VOL 323   6 MARCH 2009   1297
Published by AAAS
DOI: 10.1126/science.1171406




NEWS FEATURE 2020 COMPUTING
NATURE | Vol 440 | 23 March 2006

EVERYTHING, EVERYWHERE
Tiny computers that constantly monitor ecosystems, buildings and even human bodies could turn science on its head. Declan Butler investigates.

Image credit: J. MAGEE
The Challenges of Data-Intensive Science

More Related Content

What's hot

Scientific Data Management
Scientific Data ManagementScientific Data Management
Scientific Data ManagementJian Qin
 
13,573,002 Method Patent The Heart Beacon Cycle
13,573,002 Method Patent The Heart Beacon Cycle13,573,002 Method Patent The Heart Beacon Cycle
13,573,002 Method Patent The Heart Beacon CycleSAW Concepts LLC
 
The Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark DataThe Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark Datavbrant
 
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...Bryan Heidorn
 
Big data from small data: A deep survey of the neuroscience landscape data via
Big data from small data:  A deep survey of the neuroscience landscape data viaBig data from small data:  A deep survey of the neuroscience landscape data via
Big data from small data: A deep survey of the neuroscience landscape data viaNeuroscience Information Framework
 
Beyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeBeyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeEric Kansa
 
OII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
OII Summer Doctoral Programme 2010: Global brain by Meyer & SchroederOII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
OII Summer Doctoral Programme 2010: Global brain by Meyer & SchroederEric Meyer
 
Strong field science core proposal for uph ill site
Strong field science core proposal for uph ill siteStrong field science core proposal for uph ill site
Strong field science core proposal for uph ill siteahsanrabbani
 
Sci 2011 big_data(30_may13)2nd revised _ loet
Sci 2011 big_data(30_may13)2nd revised _ loetSci 2011 big_data(30_may13)2nd revised _ loet
Sci 2011 big_data(30_may13)2nd revised _ loetHan Woo PARK
 

What's hot (14)

Summary of 3DPAS
Summary of 3DPASSummary of 3DPAS
Summary of 3DPAS
 
Keller geo edu
Keller geo eduKeller geo edu
Keller geo edu
 
Scientific Data Management
Scientific Data ManagementScientific Data Management
Scientific Data Management
 
13,573,002 Method Patent The Heart Beacon Cycle
13,573,002 Method Patent The Heart Beacon Cycle13,573,002 Method Patent The Heart Beacon Cycle
13,573,002 Method Patent The Heart Beacon Cycle
 
The Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark DataThe Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark Data
 
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
 
Big data from small data: A deep survey of the neuroscience landscape data via
Big data from small data:  A deep survey of the neuroscience landscape data viaBig data from small data:  A deep survey of the neuroscience landscape data via
Big data from small data: A deep survey of the neuroscience landscape data via
 
Beyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeBeyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional Practice
 
OII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
OII Summer Doctoral Programme 2010: Global brain by Meyer & SchroederOII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
OII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
 
Collins seattle-2014-final
Collins seattle-2014-finalCollins seattle-2014-final
Collins seattle-2014-final
 
Strong field science core proposal for uph ill site
Strong field science core proposal for uph ill siteStrong field science core proposal for uph ill site
Strong field science core proposal for uph ill site
 
eResearch@UCT
eResearch@UCTeResearch@UCT
eResearch@UCT
 
Christine borgman keynote
Christine borgman keynoteChristine borgman keynote
Christine borgman keynote
 
Sci 2011 big_data(30_may13)2nd revised _ loet
Sci 2011 big_data(30_may13)2nd revised _ loetSci 2011 big_data(30_may13)2nd revised _ loet
Sci 2011 big_data(30_may13)2nd revised _ loet
 

Viewers also liked

Mumbai Coast Road - Points made at Public Hearing
Mumbai Coast Road - Points made at Public HearingMumbai Coast Road - Points made at Public Hearing
Mumbai Coast Road - Points made at Public HearingRishi Aggarwal
 
Додаток 47. Матеріал для фасилітації. Допомога учням в плануванні проекту
Додаток 47. Матеріал для фасилітації. Допомога учням в плануванні проектуДодаток 47. Матеріал для фасилітації. Допомога учням в плануванні проекту
Додаток 47. Матеріал для фасилітації. Допомога учням в плануванні проектуSchool 20, Zaporizhzhya
 
Событие в условиях изенений
Событие в условиях изененийСобытие в условиях изенений
Событие в условиях изененийБути Ковец
 
Herramientas colaborativas
Herramientas colaborativasHerramientas colaborativas
Herramientas colaborativasjuancarlostobon
 
Maker Experience: user centered toolkit for makers
Maker Experience: user centered toolkit for makersMaker Experience: user centered toolkit for makers
Maker Experience: user centered toolkit for makersCodemotion
 
Susie Almaneih - Ready, Set, Recycle! 5 Ways to Get your Kids Involved Now
Susie Almaneih - Ready, Set, Recycle! 5 Ways to Get your Kids Involved NowSusie Almaneih - Ready, Set, Recycle! 5 Ways to Get your Kids Involved Now
Susie Almaneih - Ready, Set, Recycle! 5 Ways to Get your Kids Involved NowSusie Almaneih
 
Бутиковская булавка
Бутиковская булавкаБутиковская булавка
Бутиковская булавкаБути Ковец
 
Bitcoin Lending A Smart Way To Invest
Bitcoin Lending A Smart Way To InvestBitcoin Lending A Smart Way To Invest
Bitcoin Lending A Smart Way To InvestSaloni Sardana
 
TDD and mobile development: some forgotten techniques, illustrated with Android
TDD and mobile development: some forgotten techniques, illustrated with AndroidTDD and mobile development: some forgotten techniques, illustrated with Android
TDD and mobile development: some forgotten techniques, illustrated with AndroidCodemotion
 
Computational Biologist-The Next Pharma Scientist?
Computational Biologist-The Next Pharma Scientist?Computational Biologist-The Next Pharma Scientist?
Computational Biologist-The Next Pharma Scientist?Debarati Roy
 
PEGASI Smart Glasses
PEGASI Smart GlassesPEGASI Smart Glasses
PEGASI Smart GlassesBerry Chen
 
Livewire innovation AC120 Arc Chaser
Livewire innovation AC120 Arc ChaserLivewire innovation AC120 Arc Chaser
Livewire innovation AC120 Arc ChaserWAVENIX CO.,LTD.
 
Home automation
Home automationHome automation
Home automationNitesh Ray
 
Gaussian Process Latent Variable Models & applications in single-cell genomics
Gaussian Process Latent Variable Models & applications in single-cell genomicsGaussian Process Latent Variable Models & applications in single-cell genomics
Gaussian Process Latent Variable Models & applications in single-cell genomicsKieran Campbell
 

Viewers also liked (20)

PrashantProfile
PrashantProfilePrashantProfile
PrashantProfile
 
Mumbai Coast Road - Points made at Public Hearing
Mumbai Coast Road - Points made at Public HearingMumbai Coast Road - Points made at Public Hearing
Mumbai Coast Road - Points made at Public Hearing
 
Додаток 47. Матеріал для фасилітації. Допомога учням в плануванні проекту
Додаток 47. Матеріал для фасилітації. Допомога учням в плануванні проектуДодаток 47. Матеріал для фасилітації. Допомога учням в плануванні проекту
Додаток 47. Матеріал для фасилітації. Допомога учням в плануванні проекту
 
Событие в условиях изенений
Событие в условиях изененийСобытие в условиях изенений
Событие в условиях изенений
 
Herramientas colaborativas
Herramientas colaborativasHerramientas colaborativas
Herramientas colaborativas
 
Maker Experience: user centered toolkit for makers
Maker Experience: user centered toolkit for makersMaker Experience: user centered toolkit for makers
Maker Experience: user centered toolkit for makers
 
Susie Almaneih - Ready, Set, Recycle! 5 Ways to Get your Kids Involved Now
Susie Almaneih - Ready, Set, Recycle! 5 Ways to Get your Kids Involved NowSusie Almaneih - Ready, Set, Recycle! 5 Ways to Get your Kids Involved Now
Susie Almaneih - Ready, Set, Recycle! 5 Ways to Get your Kids Involved Now
 
Taller1.2
Taller1.2Taller1.2
Taller1.2
 
Бутиковская булавка
Бутиковская булавкаБутиковская булавка
Бутиковская булавка
 
UKCCSRC Data and Information Archive Briefing - UKCCSRC Strathclyde Biannual ...
UKCCSRC Data and Information Archive Briefing - UKCCSRC Strathclyde Biannual ...UKCCSRC Data and Information Archive Briefing - UKCCSRC Strathclyde Biannual ...
UKCCSRC Data and Information Archive Briefing - UKCCSRC Strathclyde Biannual ...
 
Developing scalable CO2 capture technology
Developing scalable CO2 capture technologyDeveloping scalable CO2 capture technology
Developing scalable CO2 capture technology
 
Bitcoin Lending A Smart Way To Invest
Bitcoin Lending A Smart Way To InvestBitcoin Lending A Smart Way To Invest
Bitcoin Lending A Smart Way To Invest
 
Balanço do Período das Chuvas - POCV 2015/2016 - Defesa Civil de Santo André
Balanço do Período das Chuvas - POCV 2015/2016 - Defesa Civil de Santo AndréBalanço do Período das Chuvas - POCV 2015/2016 - Defesa Civil de Santo André
Balanço do Período das Chuvas - POCV 2015/2016 - Defesa Civil de Santo André
 
TDD and mobile development: some forgotten techniques, illustrated with Android
TDD and mobile development: some forgotten techniques, illustrated with AndroidTDD and mobile development: some forgotten techniques, illustrated with Android
TDD and mobile development: some forgotten techniques, illustrated with Android
 
Changes in the Dutch CCS Landscape - Jan Brouwer, CATO - UKCCSRC Strathclyde ...
Changes in the Dutch CCS Landscape - Jan Brouwer, CATO - UKCCSRC Strathclyde ...Changes in the Dutch CCS Landscape - Jan Brouwer, CATO - UKCCSRC Strathclyde ...
Changes in the Dutch CCS Landscape - Jan Brouwer, CATO - UKCCSRC Strathclyde ...
 
Computational Biologist-The Next Pharma Scientist?
Computational Biologist-The Next Pharma Scientist?Computational Biologist-The Next Pharma Scientist?
Computational Biologist-The Next Pharma Scientist?
 
PEGASI Smart Glasses
PEGASI Smart GlassesPEGASI Smart Glasses
PEGASI Smart Glasses
 
Livewire innovation AC120 Arc Chaser
Livewire innovation AC120 Arc ChaserLivewire innovation AC120 Arc Chaser
Livewire innovation AC120 Arc Chaser
 
Home automation
Home automationHome automation
Home automation
 
Gaussian Process Latent Variable Models & applications in single-cell genomics
Gaussian Process Latent Variable Models & applications in single-cell genomicsGaussian Process Latent Variable Models & applications in single-cell genomics
Gaussian Process Latent Variable Models & applications in single-cell genomics
 

Similar to The Challenges of Data-Intensive Science

4th_paradigm_book_complete_lr
4th_paradigm_book_complete_lr4th_paradigm_book_complete_lr
4th_paradigm_book_complete_lrDominic A Ienco
 
Evolution of e-Research
Evolution of e-ResearchEvolution of e-Research
Evolution of e-ResearchDavid De Roure
 
Data Science definition
Data Science definitionData Science definition
Data Science definitionCarloLauro1
 
Let's talk about Data Science
Let's talk about Data ScienceLet's talk about Data Science
Let's talk about Data ScienceCarlo Lauro
 
Curation of Research Data
Curation of Research DataCuration of Research Data
Curation of Research DataMichael Day
 
The Evolution of e-Research: Machines, Methods and Music
The Evolution of e-Research: Machines, Methods and MusicThe Evolution of e-Research: Machines, Methods and Music
The Evolution of e-Research: Machines, Methods and MusicDavid De Roure
 
Discover Data Portal
Discover Data PortalDiscover Data Portal
Discover Data PortalTom Loughran
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of ScienceGlobus
 
Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?LEARN Project
 
ICSTI Annual Meeting 2014 Tokyo Y. Murayama
ICSTI Annual Meeting 2014 Tokyo Y. MurayamaICSTI Annual Meeting 2014 Tokyo Y. Murayama
ICSTI Annual Meeting 2014 Tokyo Y. MurayamaYasuhiro Murayama
 
How to use science maps to navigate large information spaces? What is the lin...
How to use science maps to navigate large information spaces? What is the lin...How to use science maps to navigate large information spaces? What is the lin...
How to use science maps to navigate large information spaces? What is the lin...Andrea Scharnhorst
 
Knowledge – dynamics – landscape - navigation – what have interfaces to digit...
Knowledge – dynamics – landscape - navigation – what have interfaces to digit...Knowledge – dynamics – landscape - navigation – what have interfaces to digit...
Knowledge – dynamics – landscape - navigation – what have interfaces to digit...Andrea Scharnhorst
 
accelerating-data-driven
accelerating-data-drivenaccelerating-data-driven
accelerating-data-drivenJoshua Chudy
 
06 e science-bio diversity@ pacc 18.07.2014
06 e science-bio diversity@ pacc 18.07.201406 e science-bio diversity@ pacc 18.07.2014
06 e science-bio diversity@ pacc 18.07.2014VinothkumaR Ramu
 
Data curation issues for repositories
Data curation issues for repositoriesData curation issues for repositories
Data curation issues for repositoriesChris Rusbridge
 
Keynote IEEE International Workshop on Cloud Analytics. Dennis Gannon
Keynote IEEE International Workshop on Cloud Analytics. Dennis  GannonKeynote IEEE International Workshop on Cloud Analytics. Dennis  Gannon
Keynote IEEE International Workshop on Cloud Analytics. Dennis GannonMicrosoft Azure for Research
 

The Challenges of Data-Intensive Science

  • 1. Data-Intensive Research. Jano van Hemert, research.nesc.ac.uk. The University of Edinburgh.
  • 2. COMPUTER SCIENCE. Beyond the Data Deluge. Gordon Bell,1 Tony Hey,1 Alex Szalay2. The demands of data-intensive science represent a challenge for diverse scientific communities.
    Since at least Newton’s laws of motion in the 17th century, scientists have recognized experimental and theoretical science as the basic research paradigms for understanding nature. In recent decades, computer simulations have become an essential third paradigm: a standard tool for scientists to explore domains that are inaccessible to theory and experiment, such as the evolution of the universe, car passenger crash testing, and predicting climate change. As simulations and experiments yield ever more data, a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science (1). For example, new types of computer clusters are emerging that are optimized for data movement and analysis rather than computing, while in astronomy and other sciences, integrated data systems allow data analysis and storage on site instead of requiring download of large amounts of data.
    Today, some areas of science are facing hundred- to thousandfold increases in data volumes from satellites, telescopes, high-throughput instruments, sensor networks, accelerators, and supercomputers, compared to the volumes generated only a decade ago (2). In astronomy and particle physics, these new experiments generate petabytes (1 petabyte = 10^15 bytes) of data per year. In bioinformatics, the increasing volume (3) and the extreme heterogeneity of the data are challenging scientists (4). In contrast to the traditional hypothesis-led approach to biology, Venter and others have argued that a data-intensive inductive approach to genomics (such as shotgun sequencing) is necessary to address large-scale ecosystem questions (5, 6).
    Other research fields also face major data management challenges. In almost every laboratory, “born digital” data proliferate in files, spreadsheets, or databases stored on hard drives, digital notebooks, Web sites, blogs, and wikis. The management, curation, and archiving of these digital data are becoming increasingly burdensome for research scientists.
    Over the past 40 years or more, Moore’s Law has enabled transistors on silicon chips to get smaller and processors to get faster. At the same time, technology improvements for disks for storage cannot keep up with the ever increasing flood of scientific data generated by the faster computers. In university research labs, Beowulf clusters (groups of usually identical, inexpensive PC computers that can be used for parallel computations) have ...
    Figure: Moon and Pleiades from the VO. Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the World Wide Telescope service. Credit: Jonathan Fay/Microsoft.
    1 Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. 2 Department of Physics and Astronomy, Johns Hopkins University, 3701 San Martin Drive, Baltimore, MD 21218, USA. E-mail: szalay@jhu.edu
    www.sciencemag.org, Science, Vol 323, 6 March 2009, p. 1297. Published by AAAS. Downloaded from www.sciencemag.org on July 6, 2009.
  • 3. Beyond the Data Deluge (the article from slide 2, shown again), with its summary highlighted: “The demands of data-intensive science represent a challenge for diverse scientific communities.”
  • 4. NEWS FEATURE: 2020 COMPUTING. Nature, Vol 440, 23 March 2006. J. MAGEE. “Everything, everywhere”: Tiny computers that constantly monitor ecosystems, buildings and even human bodies could turn science on its head. Declan Butler investigates.

Editor's Notes

  1. * This is not about projects, publications
  2. * One of the papers that is signposting
  3. * Sensors, large machines, interaction with data (software), interaction between people, interaction of software on data, ...
  4. * EMBL-EBI has now reached 4.5 petabytes * MESUR has 1 billion records of usage data * PACS was at 160 GB in August 2009 and quadruples every year
  5-6. * More explicit forms of demands
  7. * A proposed solution * How do you go about implementing a solution under the fourth paradigm?
  8. * Formulation = an abstract description of the data-intensive challenge * Execution = an implementation of the challenge that runs on a computational platform * Interaction = necessary to manage the formulation process and to steer the execution
  9-21. * Research focuses on progressing computer science * by evaluating both generic and tailored methodologies * in a multidisciplinary context with * rich use cases to test hypotheses
  22-30. * Formulation = an abstract description of the data-intensive challenge * Execution = an implementation of the challenge that runs on a computational platform * Interaction = necessary to manage the formulation process and to steer the execution
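
Note 4 gives a PACS volume of 160 GB in August 2009 that quadruples every year, and slide 2 defines 1 petabyte as 10^15 bytes. The sketch below works through what those two figures imply together. It assumes decimal gigabytes and a growth factor that stays constant in later years, neither of which the notes state; the function names are illustrative only.

```python
# Figures taken from note 4 and slide 2. Constant fourfold growth beyond
# 2009 is an extrapolation assumption, not a claim made in the slides.
PACS_START_BYTES = 160 * 10**9  # 160 GB in August 2009 (decimal GB assumed)
GROWTH_PER_YEAR = 4             # "quadruples every year"
PETABYTE = 10**15               # 1 petabyte = 10^15 bytes


def projected_bytes(years_after_2009: int) -> int:
    """Projected volume after a whole number of years of fourfold growth."""
    return PACS_START_BYTES * GROWTH_PER_YEAR ** years_after_2009


def years_to_reach(target_bytes: int) -> int:
    """Smallest whole number of years until the projection reaches the target."""
    years = 0
    while projected_bytes(years) < target_bytes:
        years += 1
    return years


if __name__ == "__main__":
    # Print the projected volume for each year, expressed in petabytes.
    for offset in range(9):
        print(2009 + offset, projected_bytes(offset) / PETABYTE, "PB")
    print("Years to reach 1 PB:", years_to_reach(PETABYTE))
```

Under these assumptions the projection passes one petabyte after seven years, which illustrates how quickly a modest starting volume reaches the scale quoted for astronomy and particle physics on slide 2.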