BIG DATA (IN BIOLOGY):
INTEGRATING LARGE, FAST MOVING, HETEROGENEOUS DATASETS

Adina Howe
Argonne National Laboratory
Michigan State University

EPA Air Sensors 2013: Data Quality and Applications
March 19, 2013
Introduction – My perspective

[Diagram: Experiment Design, Data Generation, Workflow / Tools, Data Analysis, and Applied Solutions arranged around Engineering, Microbial Ecology, and Bioinformatics]
THE DATA DELUGE
An exponential landscape
Next-generation sequencing growth outpacing computational resources
Log scale!
Stein, Genome Biology, 2010
Effects of low-cost sequencing…

First free-living bacterium sequenced for billions of dollars and years of analysis

Personal genome can be mapped in a few days for a few hundred to a few thousand dollars
Effects of low-cost sequencing on research
Sboner et al., Genome Biology, 2011
Technology
[Diagram labels: Value added; Core competency]
RETHINKING
What it takes to deliver
Technical obstacles in the big data deluge
• Access to the data and its value
• Access to the resources


Democratization of both data and resource access
“80% of awards and 50% of $$ are for grants < $350,000”

Root causes:
• Data volume and velocity “clog”
• Data is very heterogeneous
• Previous efforts are difficult to integrate
• Innovation is necessary but hard

[Diagram repeated from the introduction: Experiment Design, Data Generation, Workflow / Tools, Data Analysis, Applied Solutions]
Social obstacles are the most difficult.
• A shift in costs does not mean a shift in expectations
  • “Give me the answer so I can get back to work.”


• A culture of sharing (data, time, and tools)


• Evolution of necessary training
• Creating teams that can communicate across domains


• Incentives are not strong enough
• Patterns for success (useful data sharing and collaboration) are not apparent or well understood.
POSSIBLE SOLUTIONS
Common solutions: been there, done that




                             http://xkcd.com/927/
What would an ideal solution look like?
• Flexible access to data, tools, and resources
• Cost-effective, consistent, reusable (scalable)
• Rapid exploration
• Incentives to participate, share, communicate
• Community sandbox (vs lab-specific)
• Painless
• Platform which supports an “ecology” of databases, interfaces, and analysis software.
The success of organization: Amazon
• > 50 million users, > 1 million product partners, billions of reviews, dozens of compute services.
• Continually changing/updating data sets.
• Explicitly adopted a service-oriented architecture that enables both internal and external use of this data.
• For example, the Amazon.com website is itself built from over 150 independent services…
• Amazon routinely deploys new services and functionality.

http://highscalability.com/amazon-architecture
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
Amazon development guideline:
Colloquially said, “You should eat your own dogfood.”

Design and implement the database and database functionality to meet your own needs; only use the functionality you’ve explicitly made available to everyone.

To adapt to research: database functionality should be designed in tight integration with researchers who are using it, both at a user interface level and programmatically.
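
The dogfooding guideline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Amazon's or KBase's code: `SampleService`, its `put`/`get`/`search` methods, `internal_report`, and the sample metadata are all hypothetical names invented for the example. The point it tries to show is that the in-house tool is written against the same public methods an outside researcher would call, and never reaches into the private storage.

```python
from dataclasses import dataclass, field


@dataclass
class SampleService:
    """Hypothetical data service; its public methods are the only supported interface."""

    _records: dict = field(default_factory=dict)  # private storage, assumed in-memory

    def put(self, sample_id: str, metadata: dict) -> None:
        """Public: store metadata for one sample."""
        self._records[sample_id] = dict(metadata)

    def get(self, sample_id: str) -> dict:
        """Public: return a copy of one sample's metadata."""
        return dict(self._records[sample_id])

    def search(self, **criteria) -> list:
        """Public: return sorted IDs of samples whose metadata match every criterion."""
        return sorted(
            sample_id
            for sample_id, meta in self._records.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        )


def internal_report(service: SampleService, habitat: str) -> str:
    """In-house tool: calls search() and get() only, never service._records."""
    ids = service.search(habitat=habitat)
    platforms = sorted({service.get(sample_id)["platform"] for sample_id in ids})
    return f"{len(ids)} {habitat} samples; platforms: {', '.join(platforms)}"


service = SampleService()
service.put("s1", {"habitat": "soil", "platform": "Illumina"})
service.put("s2", {"habitat": "soil", "platform": "454"})
service.put("s3", {"habitat": "marine", "platform": "Illumina"})
print(internal_report(service, "soil"))
```

If the in-house report needs something the public methods cannot provide, the guideline suggests adding it to the public interface rather than bypassing it, so that external users gain the same capability.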
If the “customers” aren’t integrated into the development loop:




http://blog.thingsdesigner.com/uploads/id/tree_swing_development_requirements.jpg
DOE Knowledgebase (KBase)
• Emerging software and data environment to enable researchers
• Service-oriented architecture in which biological data are integrated into a single data model, with KBase services loosely coupled to achieve various functions
• Open development environments for community contribution (public data, services, software)
• Provides robust and scalable infrastructure (with some level of support)




                                                 https://kbase.us
KBase uses a service-oriented architecture
[Figure label: Higher level functions]
http://kbase.us/files/6913/4990/5274/Infrastructure.pptx.pdf
DOE KBase Investment
“…may also apply for additional supplemental funding of up to $300,000 per year for development of systems biology and –omics data driven applications in collaboration with the DOE Systems Biology Knowledgebase.”

Free tutorials / workshops are provided for the community.
Advice for the next round…
Data generator:
• Managing expectations and value

Developer:
• “Eat your own dogfood”

Data analyzer:
• Analyze with reproducibility in mind

Big data is a community problem and solution
• Platform / Teams
• Access
• Training
• Communication
Resources
• Amazon interviews
http://highscalability.com/amazon-architecture

• Titus Brown’s blog post on heterogeneous data integration
http://ivory.idyll.org/blog/software-architecture-for-heterogeneous-data-integration.html

• KBase website
http://www.kbase.us

• Software Carpentry – “helping scientists build better software”
http://software-carpentry.org
Thanks!

Please feel free to contact me:

http://adina.github.com
adina@anl.gov




                                  http://cheezburger.com/6983817216

  • 1. BIG DATA (IN BIOLOGY): INTEGRATING LARGE, FAST MOVING, HETEROGENEOUS DATASETS. Adina Howe, Argonne National Laboratory, Michigan State University. EPA Air Sensors 2013: Data Quality and Applications, March 19, 2013
  • 2. Introduction – My perspective. Cycle: Experiment Design, Data Generation, Workflow / Tools, Data analysis, Applied Solutions. At the center: Engineering, Microbial Ecology, Bioinformatics
  • 3. THE DATA DELUGE An exponential landscape
  • 4. Next-generation sequencing growth outpacing computational resources Log Scale! Stein, Genome Biology, 2010
  • 5. Next-generation sequencing growth outpacing computational resources Stein, Genome Biology, 2010
  • 6. Effects of low cost sequencing… The first free-living bacterium was sequenced for billions of dollars and years of analysis. A personal genome can now be mapped in a few days for hundreds to a few thousand dollars.
  • 7. Effects of low cost sequencing on research Sboner et al., Genome Biology, 2011
  • 8. Effects of low cost sequencing on research Sboner et al., Genome Biology, 2011
  • 9. Effects of low cost sequencing on research Sboner et al., Genome Biology, 2011
  • 10. Technology: Value added; Core competency. RETHINKING: What it takes to deliver
  • 11. Technical obstacles in the big data deluge • Access to the data and its value • Access to the resources. Democratization of both data and resource access: “80% of awards and 50% of $$ are for grants < $350,000”. Root causes: • Data volume and velocity “clog” • Data is very heterogeneous • Previous efforts are difficult to integrate • Innovation is necessary but hard. (Diagram: Experiment Design, Data Generation, Workflow / Tools, Data analysis, Applied Solutions)
  • 12. Social obstacles are the most difficult. • A shift in costs does not mean a shift in expectations • “Give me the answer so I can get back to work.” • A culture of sharing (data, time, and tools) • Evolution of necessary training • Creating teams that can communicate across domains • Incentives are not strong enough • Patterns for success (useful data sharing and collaboration) are not apparent or well understood.
  • 14. Common solutions: been there, done that http://xkcd.com/927/
  • 15. What would an ideal solution look like? • Flexible access to data, tools, and resources • Cost effective, consistent, reusable (scalable) • Rapid exploration • Incentives to participate, share, communicate • Community sandbox (vs lab-specific) • Painless • Platform which supports an “ecology” of databases, interfaces, and analysis software.
  • 16. The success of organization: Amazon • > 50 million users, > 1 million product partners, billions of reviews, dozens of compute services. • Continually changing/updating data sets. • Explicitly adopted a service-oriented architecture that enables both internal and external use of this data. • For example, the Amazon.com website is itself built from over 150 independent services… • Amazon routinely deploys new services and functionality. http://highscalability.com/amazon-architecture https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
  • 17. Amazon development guideline: Colloquially said, “You should eat your own dogfood.” Design and implement the database and database functionality to meet your own needs; only use the functionality you’ve explicitly made available to everyone. To adapt to research: database functionality should be designed in tight integration with researchers who are using it, both at a user interface level and programmatically.
  • 18. If the “customers” aren’t integrated into the development loop: http://blog.thingsdesigner.com/uploads/id/tree_swing_development_requirements.jpg
  • 19. DOE Knowledgebase (KBase) • Emerging software and data environment to enable researchers • Service-oriented architecture in which biological data are integrated into a single data model, with KBase services loosely coupled to achieve various functions • Open development environments for community contribution (public data, services, software) • Provides robust and scalable infrastructure (with some level of support) https://kbase.us
  • 20. KBase uses a service-oriented architecture. Higher level functions. http://kbase.us/files/6913/4990/5274/Infrastructure.pptx.pdf
  • 21. DOE KBase Investment: “…may also apply for additional supplemental funding of up to $300,000 per year for development of systems biology and –omics data driven applications in collaboration with the DOE Systems Biology Knowledgebase.” Free tutorials and workshops are provided for the community.
  • 22. Advice for the next round… Big data is a community problem and solution. Data generator: • Managing expectations and value. Developer: • “Eat your own dogfood”. Data analyzer: • Analyze with reproducibility in mind. (Diagram labels: Platform / Access, Teams, Training, Communication)
  • 23. Resources • Amazon interviews http://highscalability.com/amazon-architecture • Titus Brown’s blog post on heterogeneous data integration http://ivory.idyll.org/blog/software-architecture-for-heterogeneous-data-integration.html • KBase website http://www.kbase.us • Software Carpentry – “helping scientists build better software” http://software-carpentry.org
  • 24. Thanks! Please feel free to contact me: http://adina.github.com adina@anl.gov http://cheezburger.com/6983817216
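
The “eat your own dogfood” guideline on slide 17 can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the talk, Amazon, or KBase: `SampleService`, its methods, and the sample metadata are invented for illustration. The only point shown is that the platform team's own tool and an outside researcher's script go through the same public interface.

```python
from dataclasses import dataclass, field


@dataclass
class SampleService:
    """Hypothetical in-memory data service (an assumption for illustration).

    The three public methods are the only supported way to reach the data.
    """
    _records: dict = field(default_factory=dict)

    def put(self, sample_id: str, metadata: dict) -> None:
        # Store a copy so callers cannot mutate service state directly.
        self._records[sample_id] = dict(metadata)

    def get(self, sample_id: str) -> dict:
        return dict(self._records[sample_id])

    def search(self, **criteria) -> list:
        # Return the sorted IDs of samples whose metadata match every criterion.
        return sorted(
            sid for sid, meta in self._records.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        )


def internal_dashboard(service: SampleService) -> str:
    """Built by the platform team, but only through the public API."""
    soil = service.search(environment="soil")
    return f"{len(soil)} soil sample(s): {', '.join(soil)}"


def external_script(service: SampleService) -> list:
    """A researcher's script has exactly the same capabilities."""
    return service.search(environment="soil", sequenced=True)
```

Because `internal_dashboard` never touches `_records` directly, any gap in the public interface is felt first by the developers themselves, which is the feedback loop the guideline is after.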