Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI

A historical perspective on the Internet era, leading into genomics and personalized-medicine applications of big data and data science best practices.

Published in: Health & Medicine

  1. © 2014 MapR Technologies. Hadoop for Genomics: What you need to know
  2. DNA Sequencing, pre-2004. [Chart over years: CPU transistors/mm², HDD GB/mm², DNA bp/$ (pre-2004)]
  3. DNA Sequencing, 2004 Disruption. [Chart over years: CPU transistors/mm², HDD GB/mm², DNA bp/$ (pre-2004), DNA bp/$ (post-2004)]
  4. DNA Sequencing, 2004 Disruption. [Chart over years: CPU transistors/mm², HDD GB/mm², DNA bp/$ (pre-2004), DNA bp/$ (post-2004)] Similar disruption occurred for Internet traffic in the mid-1990s
  5. Effect: Many DNA-Based Apps Coming… • 2014: US$ 2B, mostly research, mostly chemical costs • 2020: US$ 20B, mostly clinical, mostly analytics costs. Source: Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning. [Bar chart: Clinical vs. Non-Clinical, 2014 and 2020]
  6. Genomics Value Chain. Steps: Order Test from Clinic; Extract Biosample; BioBank Biosample; DNA Extraction; Sequence Biosample; Secondary Analytics; Tertiary Analytics; Reporting to Clinic. Segments: Academic R&D; Pharma R&D; Clinic Therapy. Axes: Increased scale requirement; Increased feature set requirement
  7. Genomics Value Chain. Academic R&D: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics OK (GATK); Tertiary Analytics Not OK (manual). Pharma R&D: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics Not OK (GATK); Tertiary Analytics Missing, manual. Clinic Therapy: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics Missing; Tertiary Analytics Missing. Axes: Increased scale requirement; Increased feature set requirement. Requirements • Data Intense • Batch • High utilization • Low COGS. Requirements • Data Intense • Interactive • Easy to integrate • Expressive
  8. Target Application: Alleviate / Prevent (Deterministic) Suffering. Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient
  9. What Does Moore’s Law Feel Like? #Dataviz: Lara Croft 230=>40,000 Polygons (1996-2014) http://steamcommunity.com/app/203160/discussions/0/846956188647169800/ http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
  10. Application: Forensics http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/ http://snapshot.parabon-nanolabs.com/ http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
  11. Growth in Resource Capacity
  12. Disruption Circa 2000. [Chart: NASDAQ Composite]
  13. What Happened? What did winners do right to survive the .com recession? [Chart: NASDAQ Composite]
  14. Early 1990s: Early eCommerce Vendor Setup. Diagram labels: Storage; read/write; read/write; Website; Back Office
  15. Early 1990s: Early eCommerce Vendor Setup. Diagram labels: Storage; read/write; read/write; Website; Back Office. Annotations: <= SAN & NAS, Oracle; <= HPC
  16. Late 1990s: Workload became too big. Diagram labels: Storage; read/write; read/write; Website (×4); Back Office (×2)
  17. Survivor Strategy Revealed: Google Publishes • 2003: Google Filesystem (aka GFS) – http://research.google.com/archive/gfs.html • 2004: MapReduce – http://research.google.com/archive/mapreduce.html • 2006: BigTable – http://research.google.com/archive/bigtable.html
  18. Scale-out with Google FS + MapReduce. Diagram labels: read/write; read/write; Website (×4); Storage + Compute Cluster; Back Office (×2)
  19. Genomics: Internet Boom Déjà Vu
  20. DNA Sequencing, post-2004. [Chart: DNA Sequence, NASDAQ Composite]
  21. DNA Sequencing, pre-2004. Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; Sequencer. Annotations: SAN & NAS =>; HPC =>
  22. DNA Sequencing, post-2004. Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; DNA Sequencer Cluster (e.g. Illumina X-Ten)
  23. DNA Sequencing, post-2004. Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; DNA Sequencer Cluster (e.g. Illumina X-Ten). Annotations: HPC bottleneck; Sequencer back-pressure
  24. DNA Sequencing, post-2004. Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; DNA Sequencer Cluster (e.g. Illumina X-Ten). Annotations: HPC bottleneck; Sequencer back-pressure. NAS doesn’t look like a great solution anymore…
  25. Solution: Implemented 2014 @ Complete Genomics with MapR. Diagram labels: write-only; DNA Sequencer Cluster (e.g. Illumina X-Ten); Storage + Compute Cluster; Decentralize I/O (×2)
  26. MapR NFS Gateway on Application Servers. Application Server: mapr-nfsserver; Linux NFS Client; MapR client API. Loopback Mount: localhost:/mapr /mapr. Translate NFS into API Calls. MapR Inline Compression. Chunks 1 to 5, 256MB each, stored across mapr-fileserver S1 to S5 (in the diagram, each chunk number appears three times across the file servers). MapR Data Platform. Network Security: MapR RPC Full Wire Encryption; Client -> Server Communication; Server -> Server Communication. Supported Compression algorithms (per Directory): LZ4, LZF, ZLIB. Network Traffic will be compressed automatically
  27. [WHITEBOARD BREAK]
  28. [REDACTED]
  29. Allows Secondary Analytics to Scale Out. Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient
  30. Secondary Analytics: Acute Pain Point. Diagram labels: FastQ Reads; Aligned Reads; Variants; ADAM + Avocado. Matrix rotation is very I/O intense. Local de novo is best… …only feasible with efficient rotations. Reference: Zerbino & Birney, 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs
  31. Apache Parquet
  32. Row-Oriented Format. Fields: ID, Reference, Position, Next ID, Sequence, Quality. Rows: (read1, chr1, 10000, read2, TTGGAG, ABCDEF); (read2, chr1, 20000, -, TCGTAA, ABCDEF); (read3, chr2, 5000, -, GGGAAC, ABCDEF); (read4, chr3, 1000000, read6, CCCTAC, ABCDEF); (read5, chr4, 900000, -, TTTAAG, ABCDEF). Offsets: 0, 5, 20, 40, 57
  33. Row-Oriented Splitting
  34. Column-Oriented Format. ID: read1 read2 read3 read4 read5. Reference: chr1 chr1 chr2 chr3 chr4. Position: 10000 20000 5000 1000000 900000. Next ID: read2 - - read6 -. Sequence: TTGGAG TCGTAA GGGAAC CCCTAC TTTAAG. Quality: ABCDEF ABCDEF ABCDEF ABCDEF ABCDEF ABCDEF
  35. Column-Oriented Format Partitioning. ID: read1 read2 read3 read4 read5. Reference: chr1 chr1 chr2 chr3 chr4. Position: 10000 20000 5000 1000000 900000. Next ID: read2 - - read6 -. Sequence: TTGGAG TCGTAA TTGGAG GGGAAC TTTAAG. Quality: ABCDEF ABCDEF ABCDEF ABCDEF ABCDEF ABCDEF
  36. Column-Oriented Format Splitting
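
The contrast between the row-oriented and column-oriented slides (32 to 36) can be sketched in a few lines. This is a minimal illustration in plain Python, not Parquet's actual on-disk encoding: the field names are adapted from the slide's header, and `to_columns` and `run_length_encode` are hypothetical helpers introduced only for this example. It pivots the five example reads into columns and shows why a repetitive column such as Reference or Quality compresses well once its values sit next to each other.

```python
# Field names adapted from the slide's table header.
FIELDS = ("id", "reference", "position", "next_id", "sequence", "quality")

# The five example reads from the row-oriented slide, one tuple per read.
# "-" on the slide (no next read) is represented as None here.
ROWS = [
    ("read1", "chr1", 10000, "read2", "TTGGAG", "ABCDEF"),
    ("read2", "chr1", 20000, None, "TCGTAA", "ABCDEF"),
    ("read3", "chr2", 5000, None, "GGGAAC", "ABCDEF"),
    ("read4", "chr3", 1000000, "read6", "CCCTAC", "ABCDEF"),
    ("read5", "chr4", 900000, None, "TTTAAG", "ABCDEF"),
]


def to_columns(rows):
    """Pivot row tuples into one list per field (a column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(FIELDS)}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


COLUMNS = to_columns(ROWS)

# A query on position only touches two columns instead of every full row.
positions_on_chr1 = [
    pos
    for ref, pos in zip(COLUMNS["reference"], COLUMNS["position"])
    if ref == "chr1"
]

if __name__ == "__main__":
    print(run_length_encode(COLUMNS["reference"]))
    print(run_length_encode(COLUMNS["quality"]))
    print(positions_on_chr1)
```

Run-length encoding stands in here for the family of per-column encodings a columnar format can apply; which encodings a real Parquet writer picks depends on the data and configuration.
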
  37. Apache Parquet
  38. Apache Parquet http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
  39. Allows Secondary Analytics to Scale Out. [Chart: GATK / HPC method: flat after chromosome split; Hadoop / Spark method]
  40. Tertiary Analytics
  41. Downstream Analytics: GWAS/PheWAS. Diagram labels: FastQ Reads; Aligned Reads; Variants; Function; Phenotypes; ADAM + Avocado. Scalable GWAS/PheWAS: “Green Field” Territory
  42. Target Application: Alleviate / Prevent Suffering. Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype/Phenotype/Individual Matrix; Cure & Prevent Disease; Medical Records; Patient
  43. GWAS Overview (Genome-wide Association Study) • Which genome features are associated with phenotype X? https://en.wikipedia.org/wiki/Genome-wide_association_study
  44. PheWAS Overview (Phenome-wide …) • Which phenotypes are associated with genome variant X? http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
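
The two questions on the GWAS and PheWAS slides are mirror images of the same scan, which a short sketch can make concrete. The following is a toy illustration under stated assumptions, not the method used in the talk: the cohort is invented, the names `COHORT`, `contingency`, `gwas` and `phewas` are hypothetical, and association is scored with a plain Pearson chi-square test on a 2×2 table. Real studies typically use regression models, covariates and multiple-testing correction.

```python
import math

# Hypothetical toy cohort (not real data): one (variants, phenotypes) pair per person.
COHORT = [
    ({"snp1", "snp2"}, {"disease_x", "trait_y"}),
    ({"snp1", "snp2"}, {"disease_x"}),
    ({"snp1"}, {"disease_x"}),
    ({"snp1"}, {"disease_x"}),
    ({"snp2"}, {"trait_y"}),
    ({"snp2"}, set()),
    (set(), set()),
    (set(), set()),
]


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 d.o.f.) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # a constant variable carries no association signal
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Chi-square survival function with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))


def contingency(cohort, variant, phenotype):
    """Count people by (has variant?, has phenotype?)."""
    a = b = c = d = 0
    for variants, phenotypes in cohort:
        has_v, has_p = variant in variants, phenotype in phenotypes
        if has_v and has_p:
            a += 1
        elif has_v:
            b += 1
        elif has_p:
            c += 1
        else:
            d += 1
    return a, b, c, d


def gwas(cohort, phenotype):
    """GWAS question: which variants are associated with this phenotype?"""
    variants = sorted(set().union(*(v for v, _ in cohort)))
    return sorted(
        (chi_square_2x2(*contingency(cohort, v, phenotype))[1], v) for v in variants
    )


def phewas(cohort, variant):
    """PheWAS question: which phenotypes are associated with this variant?"""
    phenotypes = sorted(set().union(*(p for _, p in cohort)))
    return sorted(
        (chi_square_2x2(*contingency(cohort, variant, p))[1], p) for p in phenotypes
    )


if __name__ == "__main__":
    print(gwas(COHORT, "disease_x"))  # smallest p-value first
    print(phewas(COHORT, "snp1"))
```

Both functions return `(p_value, name)` pairs sorted so that the strongest association comes first; only the axis being scanned differs.
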
  45. Genome × Phenome Analysis. For a given population, a given SNP 𝛿, and a given phenotype ϕ: each matrix entry counts the occurrences of the pair (𝛿, ϕ). Matrix labels: 𝛿1, 𝛿3, 𝛿5; ϕ1, ϕ3, ϕ5. Axes: SPARSE, Billion+ Phenotypes; SPARSE, Billion+ Genotypes
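
The counting step on the Genome × Phenome Analysis slide can be sketched as a sparse co-occurrence count. This is a minimal single-machine illustration, assuming each person is represented as a pair of sets (SNPs carried, phenotypes observed); the population below is invented and `genome_phenome_counts` is a hypothetical name. Only non-zero cells are stored, which is what makes a matrix with billions of rows and columns tractable at all.

```python
from collections import Counter
from itertools import product

# Hypothetical toy population (not real data): (SNPs carried, phenotypes observed).
POPULATION = [
    ({"delta1", "delta3"}, {"phi1"}),
    ({"delta1"}, {"phi1", "phi3"}),
    ({"delta5"}, {"phi5"}),
    (set(), {"phi3"}),
]


def genome_phenome_counts(population):
    """Count, for every (SNP, phenotype) pair, how many people show both.

    The result is a sparse matrix keyed by (snp, phenotype); pairs that
    never co-occur are simply absent and read back as 0.
    """
    counts = Counter()
    for snps, phenotypes in population:
        counts.update(product(snps, phenotypes))
    return counts


COUNTS = genome_phenome_counts(POPULATION)

if __name__ == "__main__":
    for (snp, phenotype), n in sorted(COUNTS.items()):
        print(snp, phenotype, n)
```

In this toy case the dense matrix would have 3 × 3 = 9 cells, while only 4 are stored. On a cluster the same per-person expansion followed by a keyed sum maps naturally onto a map and reduce-by-key step, though the slide does not say how the talk implemented it.
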
  46. Disease Cause via Genome × Phenome Matrix Factorization • Row Eigenvectors of X represent – Sets of related phenotypes (by SNP) • Column Eigenvectors of Y represent – Sets of related SNPs (by phenotype). Matrix labels: 𝛿1, 𝛿3, 𝛿5; ϕ1, ϕ3, ϕ5. Diagram labels: Principal Column Vector; Archetype Genotypes; Archetype Phenotypes; Principal Row Vector. Sparse Matrix Package is Actively Developed in Spark Community
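
To make the idea of principal vectors as "archetype" genotypes and phenotypes concrete, here is a small sketch. The slide does not name a specific factorization, so this example substitutes a singular value decomposition computed by power iteration; that choice, the toy matrix, the row/column orientation (rows as SNPs, columns as phenotypes) and the function names are all assumptions made for illustration.

```python
import math

# Hypothetical count matrix: rows are SNPs, columns are phenotypes.
# The first two SNPs co-occur with the first two phenotypes; the third
# SNP and third phenotype form a separate, weaker block.
MATRIX = [
    [4.0, 4.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]


def matvec(matrix, vec):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def leading_singular_triplet(matrix, iterations=200):
    """Power iteration for the largest singular value and its two vectors.

    Returns (u, sigma, v): u weights the rows (an "archetype genotype"
    under the orientation assumed above) and v weights the columns
    (an "archetype phenotype").
    """
    mt = transpose(matrix)
    v = normalize([1.0] * len(matrix[0]))
    for _ in range(iterations):
        u = normalize(matvec(matrix, v))
        v = normalize(matvec(mt, u))
    av = matvec(matrix, v)
    sigma = math.sqrt(sum(x * x for x in av))
    return normalize(av), sigma, v


U, SIGMA, V = leading_singular_triplet(MATRIX)

if __name__ == "__main__":
    print("row weights (SNPs):", [round(x, 3) for x in U])
    print("column weights (phenotypes):", [round(x, 3) for x in V])
    print("singular value:", round(SIGMA, 4))
```

The leading pair of vectors picks out the dominant block: the first two SNPs and the first two phenotypes get all the weight, and the unrelated third SNP and phenotype get none. A billion-scale sparse matrix would need a distributed sparse solver instead of this dense loop.
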
  47. Generalized Approach: Genome × Phenome Tensor • Maintain individual identity • Aggregating individuals gives up statistical power • Leverage pedigrees – Individuals are not independent observations. Diagram labels: Variants; Phenotypes (shown twice)
  48. Scalable Variant Store => Root out Disease Causes. Model P ~ F(G). Fortunately, this has already been done… Diagram labels: Genotypes; Med Record Phenotypes, e.g. disease risk, drug response
  49. Largest Biometric Database in the World: 1.2B PEOPLE
  50. Why Create Aadhaar? • India: 1.2 billion residents – 640,000 villages, ~60% live under $2/day – ~75% literacy, <3% pay income tax, <20% have bank accounts – ~800 million mobile, ~200-300 million migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40%. Standardize identity => Stop leakage
  51. Aadhaar Biometric Capture & Index. Raw Digital Fingerprint
  52. Aadhaar Biometric ID Creation. F(x): unique features; G(x): uncommon features; H(x): other features • 900MM people loaded in 4 years • In production – 1MM registrations/day – 200+ trillion lookups/day • All built on MapR-DB (HBase). Diagram labels: Low Entropy + Unique; Low Entropy + Infrequent
  53. Consistent, Low Latency. [Chart legend: M7 Read Latency; Others Read Latency]
  54. How Does this Relate to Genomics? F-1(x): common features; F(x): unique features; G(x): uncommon features; H(x): other features. Same data shape and size • Aadhaar: 1B humans, 5MB minutiae • Genome: 7B humans, ~3M variants
  55. How Does this Relate to Genomics? F-1(x): common features; F(x): unique features; G(x): uncommon features; H(x): other features. Phenotype: healthy or sick? Phenotype Partition => Low Entropy
  56. Takeaway 1: Don’t reinvent the wheel. Individuals / fingerprint minutiae (find rare minutiae to uniquely identify) ≈ medical records / genetic variants (find shared variants to get disease root cause)
  57. Takeaway 2: Evolution, not Revolution. [Chart: DNA Sequence, NASDAQ Composite]
  58. Thank You @allenday // @mapr. Now a few slides about MapR’s product… …and proposed next actions
  59. “Quick Start” Package. Engagement includes: 1. Identification of data sources, transformations and reporting engines 2. Access and use of the solution template including source code 3. Training on customizing the solution template to the organization’s requirement 4. Deployment architecture document that enables a production deployment plan for the specific solution. Components: SOLUTION TEMPLATE; KNOWLEDGE TRANSFER; DEPLOYMENT ARCHITECTURE
  60. “Quick Start” 1 – Resequencing with Hadoop: Reduces Storage Hardware Requirements; Accelerates Data Processing Time; Minimal impact to existing data pipelines. “Quick Start” 2 – Variant Analysis with NoSQL: Present data for exploration; Operationalize complex workflows; Web-scale performance
  61. Genomics Value Chain. Academic R&D: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics OK (GATK); Tertiary Analytics Not OK. Pharma R&D: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics Not OK; Tertiary Analytics Missing. Clinic Therapy: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics Missing; Tertiary Analytics Missing
  62. Genomics Value Chain. Academic R&D: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics OK (GATK); Tertiary Analytics Not OK. Pharma R&D: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics Not OK; Tertiary Analytics Missing. Clinic Therapy: Sequence Biosample OK (e.g. ILMN XTen); Secondary Analytics Missing; Tertiary Analytics Missing. Annotations: Addressed by Quick Start 1; Addressed by Quick Start 2
  63. BONUS ROUND
  64. Genealogy Company. Slides credit: Bill Yetman, Hadoop Summit 2014 http://slidesha.re/1vRh3kY
  65. GERMLINE is… • …an algorithm that finds hidden relationships within a pool of DNA • …the reference implementation of that algorithm written in C++. • You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
  66. Projected GERMLINE run times (in hours). [Chart: Hours (0 to 700) vs. Samples (2,500 to 122,500); series: GERMLINE run times, Projected GERMLINE run times] 700 hours = 29+ days. EXPONENTIAL COMPLEXITY
  67. GERMLINE: What’s the Problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting – Stateless, single threaded, prone to swapping (heavy memory usage) – GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply: GERMLINE couldn't scale
  68. Run times for matching (in hours). [Chart: Hours (0 to 180) vs. Samples; series: GERMLINE run times, Jermline run times, Projected GERMLINE run times] Annotations: EXPONENTIAL; LINEAR; HBase Refactor
  69. • Paper submitted describing the implementation • Releasing as an Open Source project soon • [HBase Schema/Algorithm Slides]
  70. Further Growth & Optimization
  71. Underdog (Strand Phasing) performance – Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation. With improved accuracy! Underdog replaces Beagle. [Chart: Total Run Size (0 to 80,000); Total Beagle-Underdog Duration]
  72. Pipeline steps and incremental change… – Incremental change over time – Supporting the business in a “just in time” Agile way. [Chart series: Beagle-Underdog Phasing; Pipeline Finalize; Relationship Processing; Germline-Jermline Results Processing; Germline-Jermline Processing; Beagle Post Phasing; Admixture; Plink Prep; Pipeline Initialization] Annotations: Jermline replaces Germline; Ethnicity V2 Release; Underdog Replaces Beagle; AdMixture on Hadoop
  73. …while the business continues to grow rapidly. [Chart: DNA Database Size (# of processed samples), Jan-12 to Apr-14, axis 0 to 450,000]
