Empowering Transformational Science
Chelle Gentemann (Farallon Institute) twitter: @ChelleGentemann
Ryan Abernathey (Columbia / LDEO) twitter: @rabernat
Aimee Barciauskas (Development Seed) twitter: @_aimeeb
(there are lots of links in this presentation! click away!)
SWOT
NISAR
NASA Physical Oceanography Program
Communities build open science.
Open science is more efficient.
Efficient science leads to
transformational results.
Data: time to find, access, clean, & format data for analysis
Software: what tools are easily available?
Compute: access to compute == speed of results
What impacts the velocity of science?
Data, Software, & Compute
80% Data Preparation (download, clean, & organize files)
10% Batch Processing
10% Think about science
Traditional methods of data access
cannot leverage large volumes of data
https://earthdata.nasa.gov/eosdis/cloud-evolution
SWOT
NISAR
Data, Software, Compute
Analytics Optimized Data Store (AODS)
a few examples of
AODS formats
Current method -
NetCDF files - organized into ‘reasonable’ data sizes per file, usually by orbit, granule, or
day. The filename carries information about date, sensor, and version. Reading usually involves
calculating the filename, then opening, reading, processing, and closing each file.
Analytics Optimized Data Store (one example of many different formats)
Zarr - makes large datasets easily accessible to distributed computing. Original data is
stored in directories each having chunked data corresponding to dataset dimensions.
Metadata is read by zarr libraries to read only the chunks necessary to complete a
subsetting request.
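As a rough illustration of the chunking idea, the sketch below stores a 2-D array as fixed-size chunks keyed by chunk index, keeps the shape and chunk size in a small metadata record, and reads only the chunks that overlap a requested subset. It is pure Python and does not use the zarr library or its on-disk layout; the names `ChunkedStore`, `read_subset`, and `chunks_read` are hypothetical.

```python
import math


class ChunkedStore:
    """Toy chunked 2-D array store (hypothetical; not the zarr API)."""

    def __init__(self, data, chunk_shape):
        rows, cols = len(data), len(data[0])
        cr, cc = chunk_shape
        # Metadata describes the array so a reader can locate chunks
        # without touching the data itself.
        self.metadata = {"shape": (rows, cols), "chunks": (cr, cc)}
        self.chunks = {}
        self.chunks_read = 0  # counts chunk reads, for illustration only
        for i in range(math.ceil(rows / cr)):
            for j in range(math.ceil(cols / cc)):
                block = [row[j * cc:(j + 1) * cc]
                         for row in data[i * cr:(i + 1) * cr]]
                self.chunks[(i, j)] = block

    def read_subset(self, row_range, col_range):
        """Return rows r0..r1-1 and columns c0..c1-1, reading only overlapping chunks."""
        r0, r1 = row_range
        c0, c1 = col_range
        cr, cc = self.metadata["chunks"]
        out = [[None] * (c1 - c0) for _ in range(r1 - r0)]
        for i in range(r0 // cr, (r1 - 1) // cr + 1):
            for j in range(c0 // cc, (c1 - 1) // cc + 1):
                block = self.chunks[(i, j)]
                self.chunks_read += 1
                for bi, brow in enumerate(block):
                    for bj, value in enumerate(brow):
                        r, c = i * cr + bi, j * cc + bj
                        if r0 <= r < r1 and c0 <= c < c1:
                            out[r - r0][c - c0] = value
        return out


# An 8 x 8 array split into sixteen 2 x 2 chunks.
data = [[r * 8 + c for c in range(8)] for r in range(8)]
store = ChunkedStore(data, chunk_shape=(2, 2))
subset = store.read_subset((2, 4), (4, 6))
```

Here the requested 2 x 2 region falls inside a single chunk, so only one of the sixteen chunks is touched.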
Technology advances -
Lazy loading - also known as on-demand loading - defers initialization of an object until
the point at which it is needed. Widely used on webpages. Here it delays reading data until
it is needed for compute.
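A minimal sketch of the lazy loading pattern, assuming a hypothetical `LazyArray` wrapper and a stand-in `expensive_read` function (neither comes from Xarray or any other library): creating the object reads nothing, and the data is read once, on first use.

```python
class LazyArray:
    """Defers an expensive load until the values are first needed (hypothetical sketch)."""

    def __init__(self, loader):
        self._loader = loader   # callable that actually reads the data
        self._values = None
        self.load_count = 0     # how many times the loader has run

    @property
    def values(self):
        if self._values is None:   # first access triggers the read
            self._values = self._loader()
            self.load_count += 1
        return self._values

    def mean(self):
        vals = self.values
        return sum(vals) / len(vals)


def expensive_read():
    # Stand-in for reading a large file from disk or object storage.
    return [1.0, 2.0, 3.0, 4.0]


arr = LazyArray(expensive_read)   # nothing has been read yet
loads_before = arr.load_count
result = arr.mean()               # data is read here, once
arr.mean()                        # reuses the cached values
```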
Advanced OSS libraries:
Xarray - library for analyzing labeled multi-dimensional arrays, with support for lazy loading.
Dask - breaks a large computational problem into a network of smaller problems for
distribution across multiple processors
Intake - lightweight set of tools for loading and sharing data in data science projects
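The idea of breaking one large computation into many small tasks can be sketched with the Python standard library. This is not Dask's API: a thread pool stands in for a scheduler, and `chunk_sum_and_count` and `chunked_mean` are hypothetical names. The thread pool only illustrates the decomposition; it is not a claim about speed.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum_and_count(chunk):
    """Small task: partial result for one chunk."""
    return sum(chunk), len(chunk)


def chunked_mean(values, chunk_size, max_workers=4):
    """Mean of `values`, computed as many small per-chunk tasks."""
    chunks = [values[i:i + chunk_size]
              for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(chunk_sum_and_count, chunks))
    # Combine the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count, len(chunks)


values = list(range(1, 101))   # 1..100
mean, n_tasks = chunked_mean(values, chunk_size=10)
```

Each chunk yields a partial sum and count, so the final mean needs only the small partial results rather than the full dataset in one place.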
What does a data store look like?
NetCDF: organized so that each file can fit into RAM, usually by day, orbit, or granule
Zarr: organization and format invisible to user, data accessed by metadata
Time to access data?
https://nbviewer.jupyter.org/github/cgentemann/Biophysical/blob/master/Test_AVISO_zarr_simple_version.ipynb
Modern software tools use lazy loading
to access large datasets
Accessing netCDF data: 11 minutes (depends on computer)
1 - user creates list of filenames
2 - access dataset by reading the metadata distributed through files
Accessing Zarr data: 0.1 seconds (metadata consolidated)
1 - access dataset by reading the consolidated metadata
Calculate mean over region
NetCDF - 12 minutes
Zarr - 4 seconds
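The difference between distributed and consolidated metadata can be sketched by counting file reads. The example below writes small JSON records to a temporary directory; the file layout and the names `write_store`, `open_distributed`, and `open_consolidated` are assumptions for illustration and do not reflect the actual NetCDF or Zarr formats.

```python
import json
import tempfile
from pathlib import Path


def write_store(root, n_files):
    """Write per-file metadata plus one consolidated metadata file."""
    all_meta = {}
    for i in range(n_files):
        meta = {"name": f"granule_{i:03d}", "day": i}
        (root / f"granule_{i:03d}.json").write_text(json.dumps(meta))
        all_meta[meta["name"]] = meta
    (root / "consolidated.json").write_text(json.dumps(all_meta))


def open_distributed(root):
    """Build the dataset index by opening every per-file metadata record."""
    index, opens = {}, 0
    for path in sorted(root.glob("granule_*.json")):
        meta = json.loads(path.read_text())
        opens += 1
        index[meta["name"]] = meta
    return index, opens


def open_consolidated(root):
    """Build the same index from a single consolidated record."""
    index = json.loads((root / "consolidated.json").read_text())
    return index, 1


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_store(root, n_files=50)
    index_a, opens_a = open_distributed(root)
    index_b, opens_b = open_consolidated(root)
```

Both approaches produce the same index, but the first needs one read per file and the second needs a single read. For scale, the timings reported above work out to roughly 6600x for opening the dataset (660 s vs. 0.1 s) and 180x for the regional mean (720 s vs. 4 s).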
My version of lazy loading before I knew Python - on bedrest, pregnant with twins
STOP ------------- THIS IS DIFFERENT ------------------
1 line of code to access a 28-year, global, 25km dataset
1 line of code to select a region, calculate mean, & plot time series
in LESS than 1 minute
Data, Software, Compute
Inspiration: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
SciPy
Data, Software, Compute
Analytics Optimized Data
Store (AODS)
Data Provider’s $ Data Consumer’s $
Scalable Parallel
Computing Frameworks
Agency driven solutions
Grass-Roots Solutions
Pangeo Architecture
Cloud / HPC
Jupyter for interactive data analysis on remote systems.
Xarray provides data structures and an intuitive interface for interacting with datasets.
Parallel computing system allows users to deploy clusters of compute nodes for data processing.
Dask tells the nodes what to do.
Distributed storage
“Analytics Optimized Data Stores” stored on globally-available distributed storage.
@pangeo_data
How can data providers reduce barriers?
Reimagine how cloud data access and tools can enable
transformational science
Publish cloud-optimized data
Interactive tutorials
Contribute to OSS tools
Increase user interactions/feedback
How does minimizing barriers to data
change science?
Levels the playing
field for all who
want to contribute
Impacts: Reduce Time to Science
Traditional Project Timeline
80% Data Preparation (download, clean, & organize files)
10% Batch Processing
10% Think about science
Cloud-based Project Timeline
5% Load AODS
5% Parallel Processing
90% Think about science
Impacts: Reproducibility
Traditional Project Code
# step 1: open data (stored on local hard drive)
>>> data = open_data("/path/to/private/files")
Error: files not found
Cloud-based Project Code
# step 1: open data (globally accessible)
>>> data = open_data("http://catalog.pangeo.io/path/to/dataset")
# step 2: process data
>>> process(data)
Reproducibility in data-driven science requires more than just code!
Thank you!
Open source science
What impacts the velocity of progress?
Data, Software, & Compute
STOP ------------- THIS IS DIFFERENT ------------------
1 line of code to access a 28-year, global, 25km dataset
1 line of code to select a region, calculate mean, & plot time series
in LESS than 1 minute
