SlideShare une entreprise Scribd logo
1  sur  29
Why HPCs hate biologists
(and what we‟re doing about it)

        C. Titus Brown
      Assistant Professor
    CSE, MMG, BEACON
   Michigan State University
       September 2012
        ctb@msu.edu
Outline
 The basic problem(s) – resequencing and
 assembly

 Data processing & data flow


 Compression approaches


 Some thoughts for the future
Shotgun genomics
 Collect samples;


 Extract DNA;


 Feed into sequencer;


 Computationally analyze.




                      Wikipedia: Environmental shotgun sequencing.p
Analogy: shredding books
        It was the best of times, it was the wor
          , it was the worst of times, it was the
          isdom, it was the age of foolishness
        mes, it was the age of wisdom, it was th



It was the best of times, it was the worst of times, it was
     the age of wisdom, it was the age of foolishness

          …but for lots and lots of fragments!
Sequencers also produce
errors…
         It was the Gest of times, it was the wor
            , it was the worst of timZs, it was the
            isdom, it was the age of foolisXness
           , it was the worVt of times, it was the
         mes, it was Ahe age of wisdom, it was th
          It was the best of times, it Gas the wor
         mes, it was the age of witdom, it was th
             isdom, it was tIe age of foolishness



It was the best of times, it was the worst of times, it was the
         age of wisdom, it was the age of foolishness
Three basic problems
    Resequencing, counting, and assembly.
1. Resequencing analysis
  We know a reference genome, and want to find
  variants (blue) in a background of errors (red)
2. Counting
  We have a reference genome (or gene set) and
  want to know how much we have. Think gene
             expression/microarrays.
3. Assembly
 We don‟t have a genome or any reference, and we
              want to construct one.
  (This is how all new genomes are sequenced.)
Noisy observations <->
information
         It was the Gest of times, it was the wor
            , it was the worst of timZs, it was the
            isdom, it was the age of foolisXness
           , it was the worVt of times, it was the
         mes, it was Ahe age of wisdom, it was th
          It was the best of times, it Gas the wor
         mes, it was the age of witdom, it was th
             isdom, it was tIe age of foolishness



It was the best of times, it was the worst of times, it was the
         age of wisdom, it was the age of foolishness
The scale of the problem is stunning.
 I estimate a worldwide capacity for DNA
  sequencing of 15 petabases/yr (it‟s probably
  larger).
 Individual labs can generate ~100 Gbp in ~1
  week for $10k.
 This sequencing is at a boutique level:
   Sequencing formats are semi-standard.
   Basic analysis approaches are ~80% cookbook.
   Every biological prep, problem, and analysis is
   different.
 Traditionally, biologists receive no training in
  computation
 …and our computational infrastructure is
  optimizing for high performance computing, not
“Three types of data scientists.”
      (Bob Grossman, U. Chicago, at XLDB)

1. Your data gathering rate is slower than Moore‟s
   Law.
2. Your data gathering rate matches Moore‟s Law.
3. Your data gathering rate exceeds Moore‟s Law.
http://www.genome.gov/sequencingcosts/
“Three types of data scientists.”

1. Your data gathering rate is slower than Moore‟s
Law.
                         => Be lazy, all will work out.

2. Your data gathering rate matches Moore‟s Law.
     => You need to write good software, but all will
                                          work out.

3. Your data gathering rate exceeds Moore‟s Law.
                          => You need serious help.
A few use cases
1.   Real-time pathogen analysis
2.   Cancer genome analysis => diagnosis &
     treatment regimen
3.   Evolution of drug resistance in HIV
4.   Gene expression analysis in agricultural animals
5.   Examining microbial community change in
     response to agriculture and/or global climate
     change
6.   Gene discovery & genome sequencing in non-
     model organisms
Three basic problems
    Resequencing, counting, and assembly.
~2 GB – 2 TB of single-chassis RAM


                                                           "Information"
                                                             "Information"
     Raw data                              "Information"
                            Analysis                           "Information"
   (~10-100 GB)                                ~1 GB             "Information"
                                                                    Database &
                                                                    integration




A few interesting computational challenges:

 - How do we store raw data for (1) provenance and (2)
reanalysis?
 - What kind of infrastructure (hardware/software) is the
“right” approach?
 - Can we reduce the cost of analysis?
A few use cases
1.   Real-time pathogen analysis
2.   Cancer genome analysis => diagnosis &
     treatment regimen
3.   Evolution of drug resistance in HIV
4.   Gene expression analysis in agricultural animals
5.   Examining microbial community change in
     response to agriculture and/or global climate
     change
6.   Gene discovery & genome sequencing in non-
     model organisms
What kind of approaches?

 Hardware solutions may not be appropriate
   Everyone in biology/biomedical informatics has or
    will have these data sets; need “commodity”
    solutions (=> ~cloud?)
   …but current commodity hardware is optimized for
    processing power, while memory and I/O is
    expensive.


 Are there algorithmic approaches with which we
 can apply leverage?
"Information"
                                                             "Information"
     Raw data     Compression              "Information"
                                Analysis                       "Information"
   (~10-100 GB)     (~2 GB)                    ~1 GB             "Information"
                                                                    Database &
                                                                    integration




A software & algorithms approach: can we develop
lossy compression approaches that

1. Reduce data size & remove errors => efficient
   processing?
2. Retain all “information”? (think JPEG)

If so, then we canShort answer the compressed data for
                   store only is: yes, we can.
later reanalysis.
My research driven by my
problems…
 Est ~50 Tbp to comprehensively sample the
  microbial composition of a gram of soil.
 Currently we have approximately 2 Tbp spread
  across 9 soil samples, for one project; 1 Tbp
  across 10 samples for another.

 Need 3 TB RAM on single chassis to do
  assembly of 300 Gbp.
 …estimate 500 TB RAM for 50 Tbp of sequence.


As it turns out, if we can solve that problem, we can
                     solve the rest 
My lab at MSU:
Theoretical => applied solutions.




Theoretical advances
                         Practically useful & usable          Demonstrated
in data structures and
                         implementations, at scale.    effectiveness on real data.
      algorithms
1. CountMin Sketch
    To add element: increment associated counter at all hash locales
    To get count: retrieve minimum counter across all hash locales




                       http://highlyscalable.wordpress.com/2012/0
                       5/01/probabilistic-structures-web-analytics-
                       data-mining/
2. Online, streaming, lossy                  (NOVEL)

compression.




 Transcriptomes, microbial genomes incl
 MDA, and most metagenomes can be assembled
 in under 50 GB of RAM, with identical or
 improved results.

 Core algorithm is single pass, “low” memory.
Streaming Twitter analysis.
(NOVEL)

3. Compressible de Bruijn graphs

            1%
                      5%




           10%       15%




                           Jason Pell & Arend Hintze
Concluding thoughts
 Our approaches provide significant and
  substantial practical and theoretical leverage to
  some really challenging problems.
 They provide a path to the future:
   Many-core compatible; distributable?
   Decreased memory footprint => cloud computing
   can be used for many analyses.
 They are in use, ~dozens of labs using digital
  normalization.
 …although we‟re still in the process of publishing
  them.
There is nothing up my sleeves.
Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog („titus brown blog‟)
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
 Preprints: on arXiv, q-bio:
  „diginorm arxiv‟

      …and I welcome e-mails, ctb@msu.edu
Thank you for the invitation!

Acknowledgements
Lab members involved                Collaborators
   Adina Howe (w/Tiedje)            Jim Tiedje, MSU
   Jason Pell
   Arend Hintze                     Billie Swalla, UW
   Rosangela Canino-                Janet Jansson, LBNL
    Koning
   Qingpeng Zhang                   Susannah Tringe, JGI
   Elijah Lowe
   Likit Preeyanon                 Funding
   Jiarong Guo
   Tim Brom                         USDA NIFA; NSF IOS;
   Kanchan Pavangadkar                   BEACON.
   Eric McDonald

Contenu connexe

Tendances

U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
c.titus.brown
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
c.titus.brown
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
c.titus.brown
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
c.titus.brown
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
c.titus.brown
 
2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx
c.titus.brown
 
The art of good science writing
The art of good science writingThe art of good science writing
The art of good science writing
Keith Bradnam
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
c.titus.brown
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
c.titus.brown
 

Tendances (20)

2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
2013 alumni-webinar
2013 alumni-webinar2013 alumni-webinar
2013 alumni-webinar
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
Jan2016 pac bio giab
Jan2016 pac bio giabJan2016 pac bio giab
Jan2016 pac bio giab
 
2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx
 
The art of good science writing
The art of good science writingThe art of good science writing
The art of good science writing
 
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaScaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
 
Sweden_eemis_big_data
Sweden_eemis_big_dataSweden_eemis_big_data
Sweden_eemis_big_data
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
 
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekGenomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 

En vedette

2011 rwc webquest
2011 rwc webquest2011 rwc webquest
2011 rwc webquest
Takahe One
 
Section 1031 For Legistlative Review 12.16.09
Section 1031 For Legistlative Review 12.16.09Section 1031 For Legistlative Review 12.16.09
Section 1031 For Legistlative Review 12.16.09
Edmund_Wheeler
 
Power Point Gov
Power Point GovPower Point Gov
Power Point Gov
arii827
 
Retirement Planning
Retirement PlanningRetirement Planning
Retirement Planning
CamiloSilva
 
Buscadores (Fodehum)
Buscadores (Fodehum)Buscadores (Fodehum)
Buscadores (Fodehum)
grupo3fodehum
 
Organic fertilization and microbial dynamics
Organic fertilization and microbial dynamicsOrganic fertilization and microbial dynamics
Organic fertilization and microbial dynamics
juveultra
 

En vedette (20)

XBRL in Oracle 11i and R12
XBRL in Oracle 11i and R12XBRL in Oracle 11i and R12
XBRL in Oracle 11i and R12
 
Private Work: How to Secure a Fair Contract + Get Paid
Private Work: How to Secure a Fair Contract + Get PaidPrivate Work: How to Secure a Fair Contract + Get Paid
Private Work: How to Secure a Fair Contract + Get Paid
 
Team8presentation
Team8presentationTeam8presentation
Team8presentation
 
2011 rwc webquest
2011 rwc webquest2011 rwc webquest
2011 rwc webquest
 
Section 1031 For Legistlative Review 12.16.09
Section 1031 For Legistlative Review 12.16.09Section 1031 For Legistlative Review 12.16.09
Section 1031 For Legistlative Review 12.16.09
 
Pagerank
PagerankPagerank
Pagerank
 
Experiments in Web 2.0: creative communications and digital footprints
Experiments in Web 2.0: creative communications and digital footprints Experiments in Web 2.0: creative communications and digital footprints
Experiments in Web 2.0: creative communications and digital footprints
 
Power Point Gov
Power Point GovPower Point Gov
Power Point Gov
 
Retirement Planning
Retirement PlanningRetirement Planning
Retirement Planning
 
Roll Over Power Point Advanced Edu Safety Gew 6 10 09
Roll Over Power Point Advanced Edu Safety Gew 6 10 09Roll Over Power Point Advanced Edu Safety Gew 6 10 09
Roll Over Power Point Advanced Edu Safety Gew 6 10 09
 
Buscadores (Fodehum)
Buscadores (Fodehum)Buscadores (Fodehum)
Buscadores (Fodehum)
 
Organic fertilization and microbial dynamics
Organic fertilization and microbial dynamicsOrganic fertilization and microbial dynamics
Organic fertilization and microbial dynamics
 
Portage County Health Literacy Coalition
Portage County Health Literacy CoalitionPortage County Health Literacy Coalition
Portage County Health Literacy Coalition
 
RSC &amp; RIRG
RSC &amp; RIRGRSC &amp; RIRG
RSC &amp; RIRG
 
2011 Ohio Hispanic Business Summit
2011 Ohio Hispanic Business Summit2011 Ohio Hispanic Business Summit
2011 Ohio Hispanic Business Summit
 
Legal Issues Important for Doing Business in the U.S. | Martijn Steger
Legal Issues Important for Doing Business in the U.S. | Martijn StegerLegal Issues Important for Doing Business in the U.S. | Martijn Steger
Legal Issues Important for Doing Business in the U.S. | Martijn Steger
 
Underage Drinking Parties in San Antonio 2016
Underage Drinking Parties in San Antonio 2016Underage Drinking Parties in San Antonio 2016
Underage Drinking Parties in San Antonio 2016
 
A perfect storm
A perfect stormA perfect storm
A perfect storm
 
Seniorforsker Uffe Jørgensen; Aarhus Universitet
Seniorforsker Uffe Jørgensen; Aarhus UniversitetSeniorforsker Uffe Jørgensen; Aarhus Universitet
Seniorforsker Uffe Jørgensen; Aarhus Universitet
 
Nobody laughed
Nobody laughedNobody laughed
Nobody laughed
 

Similaire à 2012 hpcuserforum talk

BEACON 101: Sequencing tech
BEACON 101: Sequencing techBEACON 101: Sequencing tech
BEACON 101: Sequencing tech
c.titus.brown
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
c.titus.brown
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
c.titus.brown
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
c.titus.brown
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
c.titus.brown
 

Similaire à 2012 hpcuserforum talk (20)

2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
BEACON 101: Sequencing tech
BEACON 101: Sequencing techBEACON 101: Sequencing tech
BEACON 101: Sequencing tech
 
2014 naples
2014 naples2014 naples
2014 naples
 
2014 villefranche
2014 villefranche2014 villefranche
2014 villefranche
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
B.sc biochem i bobi u 2 database
B.sc biochem i bobi u 2 databaseB.sc biochem i bobi u 2 database
B.sc biochem i bobi u 2 database
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 

Plus de c.titus.brown

2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
c.titus.brown
 

Plus de c.titus.brown (17)

2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
 
2014 bosc-keynote
2014 bosc-keynote2014 bosc-keynote
2014 bosc-keynote
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

2012 hpcuserforum talk

  • 1. Why HPCs hate biologists (and what we‟re doing about it) C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University September 2012 ctb@msu.edu
  • 2. Outline  The basic problem(s) – resequencing and assembly  Data processing & data flow  Compression approaches  Some thoughts for the future
  • 3. Shotgun genomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. Wikipedia: Environmental shotgun sequencing.p
  • 4. Analogy: shredding books It was the best of times, it was the wor , it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness …but for lots and lots of fragments!
  • 5. Sequencers also produce errors… It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
  • 6. Three basic problems Resequencing, counting, and assembly.
  • 7. 1. Resequencing analysis We know a reference genome, and want to find variants (blue) in a background of errors (red)
  • 8. 2. Counting We have a reference genome (or gene set) and want to know how much we have. Think gene expression/microarrays.
  • 9. 3. Assembly We don‟t have a genome or any reference, and we want to construct one. (This is how all new genomes are sequenced.)
  • 10. Noisy observations <-> information It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
  • 11. The scale of the problem is stunning.  I estimate a worldwide capacity for DNA sequencing of 15 petabases/yr (it‟s probably larger).  Individual labs can generate ~100 Gbp in ~1 week for $10k.  This sequencing is at a boutique level:  Sequencing formats are semi-standard.  Basic analysis approaches are ~80% cookbook.  Every biological prep, problem, and analysis is different.  Traditionally, biologists receive no training in computation  …and our computational infrastructure is optimizing for high performance computing, not
  • 12. “Three types of data scientists.” (Bob Grossman, U. Chicago, at XLDB) 1. Your data gathering rate is slower than Moore‟s Law. 2. Your data gathering rate matches Moore‟s Law. 3. Your data gathering rate exceeds Moore‟s Law.
  • 14. “Three types of data scientists.” 1. Your data gathering rate is slower than Moore‟s Law. => Be lazy, all will work out. 2. Your data gathering rate matches Moore‟s Law. => You need to write good software, but all will work out. 3. Your data gathering rate exceeds Moore‟s Law. => You need serious help.
  • 15. A few use cases 1. Real-time pathogen analysis 2. Cancer genome analysis => diagnosis & treatment regimen 3. Evolution of drug resistance in HIV 4. Gene expression analysis in agricultural animals 5. Examining microbial community change in response to agriculture and/or global climate change 6. Gene discovery & genome sequencing in non- model organisms
  • 16. Three basic problems Resequencing, counting, and assembly.
  • 17. ~2 GB – 2 TB of single-chassis RAM "Information" "Information" Raw data "Information" Analysis "Information" (~10-100 GB) ~1 GB "Information" Database & integration A few interesting computational challenges: - How do we store raw data for (1) provenance and (2) reanalysis? - What kind of infrastructure (hardware/software) is the “right” approach? - Can we reduce the cost of analysis?
  • 18. A few use cases 1. Real-time pathogen analysis 2. Cancer genome analysis => diagnosis & treatment regimen 3. Evolution of drug resistance in HIV 4. Gene expression analysis in agricultural animals 5. Examining microbial community change in response to agriculture and/or global climate change 6. Gene discovery & genome sequencing in non- model organisms
  • 19. What kind of approaches?  Hardware solutions may not be appropriate  Everyone in biology/biomedical informatics has or will have these data sets; need “commodity” solutions (=> ~cloud?)  …but current commodity hardware is optimized for processing power, while memory and I/O is expensive.  Are there algorithmic approaches with which we can apply leverage?
  • 20. "Information" "Information" Raw data Compression "Information" Analysis "Information" (~10-100 GB) (~2 GB) ~1 GB "Information" Database & integration A software & algorithms approach: can we develop lossy compression approaches that 1. Reduce data size & remove errors => efficient processing? 2. Retain all “information”? (think JPEG) If so, then we canShort answer the compressed data for store only is: yes, we can. later reanalysis.
  • 21. My research driven by my problems…  Est ~50 Tbp to comprehensively sample the microbial composition of a gram of soil.  Currently we have approximately 2 Tbp spread across 9 soil samples, for one project; 1 Tbp across 10 samples for another.  Need 3 TB RAM on single chassis to do assembly of 300 Gbp.  …estimate 500 TB RAM for 50 Tbp of sequence. As it turns out, if we can solve that problem, we can solve the rest 
  • 22. My lab at MSU: Theoretical => applied solutions. Theoretical advances Practically useful & usable Demonstrated in data structures and implementations, at scale. effectiveness on real data. algorithms
  • 23. 1. CountMin Sketch To add element: increment associated counter at all hash locales To get count: retrieve minimum counter across all hash locales http://highlyscalable.wordpress.com/2012/0 5/01/probabilistic-structures-web-analytics- data-mining/
  • 24. 2. Online, streaming, lossy (NOVEL) compression.  Transcriptomes, microbial genomes incl MDA, and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.  Core algorithm is single pass, “low” memory.
  • 26. (NOVEL) 3. Compressible de Bruijn graphs 1% 5% 10% 15% Jason Pell & Arend Hintze
  • 27. Concluding thoughts  Our approaches provide significant and substantial practical and theoretical leverage to some really challenging problems.  They provide a path to the future:  Many-core compatible; distributable?  Decreased memory footprint => cloud computing can be used for many analyses.  They are in use, ~dozens of labs using digital normalization.  …although we‟re still in the process of publishing them.
  • 28. There is nothing up my sleeves. Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog („titus brown blog‟)  Twitter: @ctitusbrown  Grants on Lab Web site: http://ged.msu.edu/interests.html  Preprints: on arXiv, q-bio: „diginorm arxiv‟ …and I welcome e-mails, ctb@msu.edu
  • 29. Thank you for the invitation! Acknowledgements Lab members involved Collaborators  Adina Howe (w/Tiedje)  Jim Tiedje, MSU  Jason Pell  Arend Hintze  Billie Swalla, UW  Rosangela Canino-  Janet Jansson, LBNL Koning  Qingpeng Zhang  Susannah Tringe, JGI  Elijah Lowe  Likit Preeyanon Funding  Jiarong Guo  Tim Brom USDA NIFA; NSF IOS;  Kanchan Pavangadkar BEACON.  Eric McDonald

Notes de l'éditeur

  1. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression =&gt; OLC assembly.