Hadoop ecosystem for genomics
Uri Laserson
Mount Sinai School of Medicine
29 October 2013


Hadoop for Bioinformatics: Building a Scalable Variant Store


Talk at Mount Sinai School of Medicine. Introduction to the Hadoop ecosystem, problems in bioinformatics data analytics, and a specific use case of building a genome variant store backed by Cloudera Impala.

Published in: Technology
  • On slide #77, you mention that a query finished in a couple of seconds on the 1000 Genomes data set. Can you briefly explain the system configuration you used for it? Thank you.

  1. Hadoop ecosystem for genomics. Uri Laserson, Mount Sinai School of Medicine, 29 October 2013.
  2. Agenda
      1. Hadoop overview
          • Historical context
          • Hadoop overview
          • Some sins in bioinformatics
      2. Scalable variant store
          • Possible conventional solutions
          • Hadoop/Impala implementation
  3. Historical Context
  4. [Image-only slide]
  5. Indexing the Web
      • Web is huge
          • Hundreds of millions of pages in 1999
      • How do you index it?
          • Crawl all the pages
          • Rank pages based on relevance metrics
          • Build search index of keywords to pages
          • Do it in real time!
  6. [Image-only slide]
  7. Databases in 1999
      1. Buy a really big machine
      2. Install expensive DBMS on it
      3. Point your workload at it
      4. Hope it doesn’t fail
      5. Ambitious: buy another big machine as backup
  8. [Image-only slide]
  9. Database Limitations
      • Didn’t scale horizontally
          • High marginal cost ($$$)
      • No real fault-tolerance story
      • Vendor lock-in ($$$)
      • SQL unsuited for search ranking
          • Complex analysis (PageRank)
          • Unstructured data
  10. [Image-only slide]
  11. Google does something different
      • Designed their own storage and processing infrastructure
          • Google File System (GFS) and MapReduce (MR)
      • Goals: KISS
          • Cheap
          • Scalable
          • Reliable
  12. Google does something different
      • It worked!
          • Powered Google Search for many years
          • General framework for large-scale batch computation tasks
          • Still used internally at Google to this day
  13. Google benevolent enough to publish: the GFS paper (2003) and the MapReduce paper (2004)
  14. Birth of Hadoop at Yahoo!
      • 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR
      • 2006: Spun out as Apache Hadoop
      • Named after Doug’s son’s yellow stuffed elephant
  15. Open-source proliferation

      Google            Open-source        Function
      GFS               HDFS               Distributed file system
      MapReduce         MapReduce          Batch distributed data processing
      Bigtable          HBase              Distributed DB/key-value store
      Protobuf/Stubby   Thrift or Avro     Data serialization/RPC
      Pregel            Giraph             Distributed graph processing
      Dremel/F1         Cloudera Impala    Scalable interactive SQL (MPP)
      FlumeJava         Crunch             Abstracted data pipelines on Hadoop

      (On the slide, HDFS and MapReduce are grouped together under the label "Hadoop".)
  16. Overview of core technology
  17. HDFS design assumptions
      • Based on Google File System
      • Files are large (GBs to TBs)
      • Failures are common
          • Massive scale means failures very likely
          • Disk, node, or network failures
      • Accesses are large and sequential
      • Files are append-only
  18. HDFS properties
      • Fault-tolerant
          • Gracefully responds to node/disk/network failures
      • Horizontally scalable
          • Low marginal cost
      • High-bandwidth
      [Figure: HDFS storage distribution. An input file is split into blocks 1-5, and each block is stored on three of the five nodes A-E.]
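The storage-distribution figure can be mimicked with a toy placement routine. This is only a minimal sketch: it assumes round-robin placement and a replication factor of 3 (which is what the figure appears to show), and `place_blocks` and `surviving_blocks` are hypothetical names. Real HDFS placement is rack-aware and considerably more involved.

```python
from collections import defaultdict


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Illustrative policy only; this is not the placement HDFS itself uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = defaultdict(list)  # node name -> list of block ids
    for block in range(1, num_blocks + 1):
        for r in range(replication):
            node = nodes[(block - 1 + r) % len(nodes)]
            placement[node].append(block)
    return dict(placement)


def surviving_blocks(placement, failed_nodes):
    """Return the set of blocks still readable after some nodes fail."""
    alive = set()
    for node, blocks in placement.items():
        if node not in failed_nodes:
            alive.update(blocks)
    return alive


nodes = ["A", "B", "C", "D", "E"]
placement = place_blocks(5, nodes)
# With three replicas per block, losing any two nodes loses no data.
print(surviving_blocks(placement, {"A", "B"}))
```

With three replicas spread over distinct nodes, a block only becomes unreadable if all three of its nodes fail at once, which is the sense in which the file system "gracefully responds" to failures.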
  19. MapReduce computation
  20. MapReduce computation
      • Structured as
          1. Embarrassingly parallel “map stage”
          2. Cluster-wide distributed sort (“shuffle”)
          3. Aggregation “reduce stage”
      • Data-locality: process the data where it is stored
      • Fault-tolerance: failed tasks automatically detected and restarted
      • Schema-on-read: data need not be stored conforming to a rigid schema
  21. WordCount example
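The slide's WordCount figure did not survive extraction. As a stand-in, here is a minimal single-process sketch of the three stages (map, shuffle, reduce) in plain Python. The function names are illustrative and this is not Hadoop's API; a real job would run the map and reduce functions on many nodes and let the framework perform the shuffle.

```python
from collections import defaultdict


def map_stage(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_stage(key, values):
    """Reduce: aggregate all counts emitted for one word."""
    return key, sum(values)


def word_count(lines):
    """Chain the three stages over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_stage(line))
    return dict(reduce_stage(k, v) for k, v in shuffle(mapped))


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

Each call to `map_stage` depends only on its own line, which is what makes the map stage embarrassingly parallel; only the shuffle needs to see data from every line.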
  22. Cloudera Hadoop Stack [Figure]
  23. Cloudera Hadoop Stack [Figure]
  24. Cloudera Hadoop Stack [Figure]
  25. Cloudera Hadoop Stack [Figure, with additional systems: Storm (stream), Spark (distributed memory), GraphLab (graph computation)]
  26. Cloudera Impala
      • Modern MPP database built on top of HDFS
      • Designed for interactive queries on terabyte-scale data sets
  27. Cloudera Search
      • Interactive search queries on top of HDFS
      • Built on Solr and SolrCloud
      • Near-realtime indexing of new documents
  28. Serialization/RPC formats
      • Specify schemas/services in user-friendly IDLs
      • Code-generation to multiple languages (wire-compatible/portable)
      • Compact, binary formats
      • Natural support for schema evolution
      • Multiple implementations:
          • Apache Thrift, Apache Avro, Google’s Protocol Buffers
  29. Serialization/RPC formats

      struct Tweet {
        1: required i32 userId;
        2: required string userName;
        3: required string text;
        4: optional Location loc;
        16: optional string language = "english"
      }

      service Twitter {
        void ping();
        bool postTweet(1:Tweet tweet);
        TweetSearchResult searchTweets(1:string query);
      }

  30. Serialization/RPC formats

      struct Observation {
        // can be general contig too
        1: required string chromosome,

        // python-style 0-based slicing
        2: required i64 start,
        3: required i64 end,

        // unique identifier for data set
        // (like UCSC genome browser track)
        4: required string track,

        // these are likely derived from the
        // track; separated for convenient join
        5: optional string experiment,
        6: optional string sample,

        // one of these should be non-null,
        // depending on the type of data
        7: optional string valueStr,
        8: optional i64 valueInt,
        9: optional double valueDouble
      }
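To make the comments in the `Observation` struct concrete, here is a hedged Python mirror of it. This is not code generated by Thrift; it is a hand-written dataclass. The snake_case field names and the "exactly one value field" check are assumptions: the slide only says one of the value fields "should be non-null".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    chromosome: str                     # can be a general contig too
    start: int                          # 0-based, inclusive
    end: int                            # exclusive, as in Python slicing
    track: str                          # identifier of the data set
    experiment: Optional[str] = None
    sample: Optional[str] = None
    value_str: Optional[str] = None
    value_int: Optional[int] = None
    value_double: Optional[float] = None

    def __post_init__(self):
        if not 0 <= self.start <= self.end:
            raise ValueError("expected 0 <= start <= end")
        values = (self.value_str, self.value_int, self.value_double)
        # Assumption: exactly one of the three value fields is populated.
        if sum(v is not None for v in values) != 1:
            raise ValueError("exactly one value field must be set")

    @property
    def length(self):
        """Number of bases covered by the half-open interval [start, end)."""
        return self.end - self.start


obs = Observation("20", 14369, 14370, "dbSNP", value_str="rs6054257")
print(obs.length)
```

With 0-based half-open coordinates, a single-base feature at 1-based position 14370 becomes `start=14369, end=14370`, and the interval length is simply `end - start`.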
  31. Parquet format: row-major format [Figure]
  32. Parquet format: column-major format [Figure]
  33. Parquet format advantages
      • Columnar format
          • read fewer bytes
          • compression more efficient (incl. dictionary encodings)
      • Thrift/Avro/Protobuf-compatible data model
          • Support for nested data structures
      • Binary encodings
      • Hadoop-friendly (“splittable”; implemented in Java)
      • Predicate pushdown
      • http://parquet.io/
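The two columnar advantages listed above (reading fewer bytes, and dictionary encoding) can be illustrated with plain Python lists. This is a conceptual sketch only: it does not read or write the Parquet format, and `to_columns` and `dictionary_encode` are hypothetical helper names. The sample values are taken from the VCF example later in the deck.

```python
rows = [
    {"chrom": "20", "pos": 14370, "ref": "G", "alt": "A"},
    {"chrom": "20", "pos": 17330, "ref": "T", "alt": "A"},
    {"chrom": "20", "pos": 1110696, "ref": "A", "alt": "G,T"},
]


def to_columns(rows):
    """Pivot a row-major list of records into column-major lists."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def dictionary_encode(column):
    """Replace repeated values with small integer codes plus a lookup table."""
    table = []
    index = {}
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(table)
            table.append(value)
        codes.append(index[value])
    return table, codes


columns = to_columns(rows)
# A query that needs only `pos` touches one list instead of every record.
positions = columns["pos"]
# A low-cardinality column collapses to a tiny table plus integer codes.
table, codes = dictionary_encode(columns["chrom"])
print(positions, table, codes)
```

Storing values of one column contiguously is also what makes predicate pushdown cheap: a filter on `chrom` can be evaluated against the encoded column before any other column is read.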
  34. Query Times on TPC-DS Queries [Figure: bar chart of query time in seconds (0-500) for queries Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79, and Q96, comparing four storage formats: Text, Seq w/ Snappy, RC w/ Snappy, Parquet w/ Snappy]
  35. Core paradigm shifts with Hadoop
      • Colocation of storage and compute
      • Fault tolerance with cheap hardware
  36. Benefits of Hadoop ecosystem
      • Inexpensive commodity compute/storage
          • Tolerates random hardware failure
      • Decreased need for high-bandwidth network pipes
          • Co-locate compute and storage
          • Exploit data locality
      • Simple horizontal scalability by adding nodes
          • MapReduce jobs effectively guaranteed to scale
      • Fault-tolerance/replication built-in. Data is durable
      • Large ecosystem of tools
      • Flexible data storage. Schema-on-read. Unstructured data.
  37. Some sins in genomics data infrastructure
  38. HPC separates compute from storage
      HPC is about compute. Hadoop is about data.
      • Storage infrastructure
          • Proprietary, distributed file system
          • Expensive
      • Big network pipe ($$$) between storage and compute
      • Compute cluster
          • High-performance hardware
          • Low failure rate
          • Expensive
      User typically works by manually submitting jobs to scheduler, e.g., LSF, Grid Engine, etc.
  39. Hadoop colocates compute and storage
      HPC is about compute. Hadoop is about data.
      • Compute cluster and storage infrastructure on the same nodes
          • Commodity hardware
          • Data-locality
          • Reduced networking needs
      User typically works by manually submitting jobs to scheduler, e.g., LSF, Grid Engine, etc.
  40. HPC is lower-level than Hadoop
      • HPC only exposes job scheduling
      • Parallelization typically occurs through MPI
          • Very low-level communication primitives
          • Difficult to horizontally scale by simply adding nodes
      • Large data sets must be manually split
      • Failures must be dealt with manually
      • Hadoop has fault-tolerance, data locality, horizontal scalability
  41. File system as DB; text file as LCD (lowest common denominator)
      • Broad joint caller with 25k genomes hits file handle limits
      • Files streamed over network (HPC architecture)
      • Large files split manually
      • Sharing data/collaborating involves copying large files
  42. Job scheduler as workflow tool
      • Submitting jobs to scheduler is very low level
      • Workflow engines/execution models provide high-level execution graphs with fault-tolerance
          • e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive
  43. Poor security/access models
      • Deal with complex set of constraints from a variety of consents/redactions
          • Certain individuals redact certain parts of their genomes
          • Certain samples can only be used as controls for particular studies
          • Different research groups want to control access to the data they generate
          • Clinical trial data must have more rigorous access restrictions
  44. Treating computation as free
      • Many institutions make large clusters available for “free” to the average researcher
      • Focus of dropping sequencing cost has been on biochemistry
  45. Treating computation as free [Figure from: Stein, L. D. The case for cloud computing in genome informatics. Genome Biol (2010).]
  46. Treating computation as free [Figure from: Sboner et al. “The real cost of sequencing: higher than you think”. Genome Biology (2011).]
  47. Lack of benchmarks for tracking progress
      • Need to benchmark whether the quality of methods is improving
      http://www.nist.gov/mml/bbd/ppgenomeinabottle2.cfm
  48. Lack of benchmarks for tracking progress [Figure from: Bradnam et al. “Assemblathon 2”, Gigascience 2, 10 (2013).]
  49. Academic code
      Unreproducible, unbuildable, undocumented, unmaintained, unavailable, backward-incompatible, shitty code
      • Most developers self-taught. Only one-third think formal training is important. [1, 2]
      • “…people in my lab have requested code from authors and received source code with syntax errors in it” [3]
      [1]: Haussler et al. “A Million Cancer Genome Warehouse” (2012)
      [2]: Hannay et al. “How do scientists develop and use scientific software?” (2009)
      [3]: http://ivory.idyll.org/blog/on-code-review-of-scientific-code.html
  50. Fundamentally a barrier to scaling.
  51. [Image-only slide]
  52. NCBI Sequence Read Archive (SRA)
      • Today: 1.14 petabytes
      • One year ago: 609 terabytes
  53. Every ‘ome has a -seq

      ‘ome              -seq
      Genome            DNA-seq
      Transcriptome     RNA-seq, FRT-seq, NET-seq
      Methylome         Bisulfite-seq
      Immunome          Immune-seq
      Proteome          PhIP-seq, Bind-n-seq
  54. Prescriptions for the future
  55. Move to Hadoop-style environment
      • Data centralization on HDFS
      • Data-local execution to avoid moving terabytes
      • Higher-level execution engines to abstract away computations from details of execution
      • Hadoop-friendly, evolvable serialization formats for:
          • Storage- and compute-efficiency
          • Abstracting data model from data storage details
      • Built-in horizontal scalability and fault-tolerance
  56. APIs instead of file formats
      • Service-oriented architectures ensure stable contracts
      • Allows for implementation changes with new technologies
      • Software community has lots of experience with this type of architecture, along with mature tools
      • Can be implemented as language-independent
  57. High-granularity access/common consent
      1. Use technologies with highly-granular access control
          • e.g., Apache Accumulo, cell-based access control
      2. Create common consents for patients to “donate” their data to research
          • e.g., Personal Genome Project, SAGE Portable Legal Consent, NCI “information donor”
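Cell-based access control means that every stored value carries its own visibility requirement, which is checked against the reader's authorizations at scan time. The sketch below is a deliberately simplified model of that idea and not Accumulo's API: each cell lists labels that must all be held, whereas Accumulo's visibility expressions also allow OR combinations. The `Cell` type, the label names, and the sample data are all made up for illustration.

```python
from typing import FrozenSet, NamedTuple


class Cell(NamedTuple):
    row: str                   # e.g. a sample identifier
    column: str                # e.g. a variant position
    value: str                 # e.g. a genotype call
    required: FrozenSet[str]   # labels a reader must hold to see this cell


def visible_cells(cells, authorizations):
    """Return only the cells whose required labels the reader holds."""
    auths = frozenset(authorizations)
    return [c for c in cells if c.required <= auths]


cells = [
    Cell("sample1", "chr5:1000", "0/1", frozenset({"research"})),
    Cell("sample1", "chr19:2000", "1/1", frozenset({"research", "clinical"})),
    Cell("sample2", "chr5:1000", "0/0", frozenset({"controls_only"})),
]

researcher_view = visible_cells(cells, {"research"})
print([(c.row, c.column) for c in researcher_view])
```

Because the check happens per cell, an individual who redacts part of their genome, or a sample restricted to use as a control, can be expressed as labels on the affected cells rather than as separate copies of the data.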
  58. Tools for open-source/reproducibility
      • Software and computations should be open-sourced, e.g., on GitHub
      • Release VMs or IPython notebooks with publications
          • “Executable paper” to generate figures
          • Allow others to easily recompute all analyses
  59. Building scalable variant store
  60. Genomics ETL
      biochemistry → .fastq → short read alignment → .bam → genotype calling → .vcf → analysis
      • Short read alignment is embarrassingly parallel
      • Pileup/variant calling requires distributed sort
      • GATK is a reimplementation of MapReduce; could run on Hadoop
      • Early Hadoop tools
          • Crossbow: short read alignment/variant calling
          • Hadoop-BAM: distributed bamtools
          • BioPig: manipulating large fasta/q
          • Contrail: de-novo assembly
  61. Genomics ETL: GATK best practices [Figure]
  62. ADAM
  63. ADAM
      • Defining an alternative to the BAM format that is Hadoop-friendly, splittable, and designed for distributed computing
      • Format built as Avro objects
      • Data stored as Parquet format (columnar)
      • Attempting to reimplement GATK pipeline to function on Hadoop/Parquet
      • Currently run out of the AMPLab at UC Berkeley
  64. Genomics ETL
      .fastq → short read alignment → .bam → genotype calling → .vcf → analysis
  65. Querying large, integrated variant data
      • Biotech client has thousands of genomes
      • Want to expose ad hoc querying functionality on large scale
          • e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
      • Integrating data with public data sets (e.g., ENCODE, UCSC tracks, dbSNP, etc.)
          • Terabyte-scale annotation sets
  66. Conventional approaches: manual
      • Manually parsing flat files
          • Write ad hoc scripts in perl or python
          • Build data structures in memory for histograms/aggregations
          • Custom script per query

      import numpy as np
      import vdj  # the parsing module used on the slide

      # inhandle and outhandle are assumed to be already-open file handles
      counts_dict = {}
      for chain in vdj.parse_VDJXML(inhandle):
          # count how many times each junction sequence occurs
          try:
              counts_dict[chain.junction] += 1
          except KeyError:
              counts_dict[chain.junction] = 1
      for count in counts_dict.values():
          print(np.int_(count), file=outhandle)
  67. Conventional approaches: database
      • Very feature rich and mature
          • Common analytical tasks (e.g., joins, group-by, etc.)
          • Access control
          • Very mature
      • Scalability issues
      • Indices can be prohibitive
      • RDBMS: schemas can be annoyingly rigid
      • NoSQL: adolescent implementations (but easy to start)
  68. Conventional approaches: domain-specific
      • e.g., PLINK/SEQ
      • Designed for specific use-cases
      • Workflows are highly opinionated/rigid
      • Requires learning another language
      • Scalability issues
  69. Hadoop sol’n: storage
      • Impala/Hive metastore provide a unified, flexible data model
          • Define Avro types for all data
      • Data stored as Parquet format to maximize compression and query performance
  70. Hadoop sol’n: available analytics engines
      • Analytical operations implemented by experts in distributed systems
      • Impala implements RDBMS-style operations
      • Search offers metadata indexing
      • Spark offers in-memory processing for ML
      • HDFS-based analytical engines designed for horizontal scalability
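  71. Variant store architecture [Figure: .vcf files and .csv external annotations pass through ETL into .parquet files described by an Avro schema; the Hive metastore and Impala query engine sit on top; queries arrive through the Impala shell, JDBC, a Thrift service, or a REST API, and results are returned]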
  72. Example schema

      ##fileformat=VCFv4.1
      ##fileDate=20090805
      ##source=myImputationProgramV3.1
      ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
      ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
      ##phasing=partial
      ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
      ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
      ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
      ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
      ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
      ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
      ##FILTER=<ID=q10,Description="Quality below 10">
      ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
      ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
      ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
      20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
      20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
  73. Example schema (same VCF example as slide 72)
  74. Example schema (same VCF example as slide 72)
  75. Why denormalization is good
      • Replace joins with filters
          • For query engines with efficient scans, this simplifies queries and can improve performance
          • Parquet format supports predicate pushdowns, reducing necessary I/O
      • Because storage is cheap, amortize cost of up-front join over simpler queries going forward
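The idea of paying for the join once and filtering afterwards can be sketched in a few lines of Python. The table layouts, field names, and toy rows below are assumptions made for illustration; they are not the client schema described in the talk.

```python
variants = [
    {"variant_id": 1, "chrom": "5", "pos": 1000},
    {"variant_id": 2, "chrom": "16", "pos": 2000},
]
calls = [
    {"variant_id": 1, "sample_id": "s1", "genotype": "0/1"},
    {"variant_id": 2, "sample_id": "s1", "genotype": "1/1"},
    {"variant_id": 2, "sample_id": "s2", "genotype": "0/1"},
]
samples = {"s1": "breast_cancer", "s2": "prostate_cancer"}


def denormalize(variants, calls, samples):
    """Pay for the join once, producing one wide row per genotype call."""
    by_id = {v["variant_id"]: v for v in variants}
    wide = []
    for call in calls:
        row = dict(by_id[call["variant_id"]])   # variant columns
        row.update(call)                        # call columns
        row["sample_study"] = samples[call["sample_id"]]  # sample columns
        wide.append(row)
    return wide


wide_table = denormalize(variants, calls, samples)
# Every later query is a plain filter over the wide table, with no join.
hits = [r for r in wide_table
        if r["chrom"] == "16" and r["sample_study"] == "breast_cancer"]
print(hits)
```

The wide table repeats the variant and sample columns on every row, which costs storage, but a columnar format compresses such repetition well and each query then needs only a scan.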
  76. Example schema

      {
        "name": "VCF",
        "type": "record",
        "fields": [
          { "type": "string", "name": "VCF_CHROM" },
          { "type": "int", "name": "VCF_POS" },
          { "type": "string", "name": "VCF_ID" },
          { "type": "string", "name": "VCF_REF" },
          { "type": "string", "name": "VCF_ALT" },
          ...
        ]
      }

      The elided fields include the per-sample call fields:

      ...
      { "default": null, "doc": "Genotype", "type": ["null", "string"], "name": "VCF_CALL_GT" },
      { "default": null, "doc": "Genotype Quality", "type": ["null", "int"], "name": "VCF_CALL_GQ" },
      { "default": null, "doc": "Read Depth", "type": ["null", "int"], "name": "VCF_CALL_DP" },
      { "default": [], "doc": "Haplotype Quality", "type": "string", "name": "VCF_CALL_HQ" }
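To connect the VCF example on slide 72 with the flat schema above, here is a rough sketch of the ETL step that turns one VCF data line into one record per sample. It is an illustration under stated assumptions, not the pipeline used in the talk: `flatten_vcf_line` and `_to_int` are hypothetical helpers, `SAMPLE_ID` is an assumed field name, and the INFO column is ignored.

```python
def _to_int(text):
    """Convert a VCF integer field, treating missing values as None."""
    if text is None or text == ".":
        return None
    return int(text)


def flatten_vcf_line(line, sample_names):
    """Turn one VCF data line into one flat record per sample."""
    fields = line.split()
    chrom, pos, vid, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    records = []
    for name, call in zip(sample_names, fields[9:]):
        # Trailing FORMAT values may be dropped, so missing keys map to None.
        values = dict(zip(format_keys, call.split(":")))
        records.append({
            "VCF_CHROM": chrom,
            "VCF_POS": int(pos),
            "VCF_ID": vid,
            "VCF_REF": ref,
            "VCF_ALT": alt,
            "SAMPLE_ID": name,
            "VCF_CALL_GT": values.get("GT"),
            "VCF_CALL_GQ": _to_int(values.get("GQ")),
            "VCF_CALL_DP": _to_int(values.get("DP")),
            "VCF_CALL_HQ": values.get("HQ"),
        })
    return records


line = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
        "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")
records = flatten_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
print(records[0])
```

One VCF line with three samples becomes three rows that repeat the variant columns, which is the denormalized layout the previous slide argues for.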
  77. Example variant-filtering query
      • “Give me all SNPs that are:
          • on chromosome 5
          • absent from dbSNP
          • present in COSMIC
          • observed in breast cancer samples
          • absent from prostate cancer samples”
      • On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds
  78. Example variant-filtering query

      SELECT cosmic as snp_id,
             vcf_chrom as chr,
             vcf_pos as pos,
             sample_id as sample,
             vcf_call_gt as genotype,
             sample_affection as phenotype
      FROM hg19_parquet_snappy_join_cached_partitioned
      WHERE COSMIC IS NOT NULL AND
            dbSNP IS NULL AND
            sample_study = "breast_cancer" AND
            VCF_CHROM = "16";
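The shape of this query can be tried out locally without a cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Impala, with a small table named `variants` in place of the real one; the four rows are invented solely to exercise each predicate, and string literals use single quotes, which is the portable SQL form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE variants (
        cosmic TEXT, dbsnp TEXT, vcf_chrom TEXT, vcf_pos INTEGER,
        sample_id TEXT, vcf_call_gt TEXT,
        sample_affection TEXT, sample_study TEXT
    )
""")
# Made-up rows: only the first one satisfies every predicate.
rows = [
    ("COSM1", None, "16", 100, "s1", "0/1", "case", "breast_cancer"),
    ("COSM2", "rs1", "16", 200, "s1", "1/1", "case", "breast_cancer"),
    ("COSM3", None, "5", 300, "s2", "0/1", "case", "breast_cancer"),
    (None, None, "16", 400, "s3", "0/1", "case", "prostate_cancer"),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

query = """
    SELECT cosmic AS snp_id, vcf_chrom AS chr, vcf_pos AS pos,
           sample_id AS sample, vcf_call_gt AS genotype,
           sample_affection AS phenotype
    FROM variants
    WHERE cosmic IS NOT NULL
      AND dbsnp IS NULL
      AND sample_study = 'breast_cancer'
      AND vcf_chrom = '16'
"""
result = conn.execute(query).fetchall()
print(result)
```

Because the table is already denormalized, the whole question is answered by filters on a single table; no join appears in the query.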
  79. Impala execution
      • Query compiled into execution tree, chopped up across all nodes (if possible)
      • Two join implementations
          1. Broadcast: each node gets copy of full right table
          2. Shuffle: both sides of join are partitioned
      • Partitioned tables vastly reduce amount of I/O
      • File formats make enormous difference in query performance
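The difference between the two join strategies can be simulated in one process. This is a conceptual sketch, not Impala's implementation: "nodes" are just list partitions, and `hash_join`, `broadcast_join`, and `shuffle_join` are hypothetical names with invented toy data.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Local hash join: build a table on `right`, probe it with `left`."""
    table = defaultdict(list)
    for rrow in right:
        table[rrow[key]].append(rrow)
    return [{**lrow, **rrow}
            for lrow in left
            for rrow in table.get(lrow[key], [])]


def broadcast_join(left_partitions, right, key):
    """Every node keeps its slice of `left` and receives all of `right`."""
    out = []
    for part in left_partitions:  # one iteration per simulated node
        out.extend(hash_join(part, right, key))
    return out


def shuffle_join(left, right, key, num_nodes):
    """Both sides are hash-partitioned on the key, then joined per node."""
    left_parts = [[] for _ in range(num_nodes)]
    right_parts = [[] for _ in range(num_nodes)]
    for row in left:
        left_parts[hash(row[key]) % num_nodes].append(row)
    for row in right:
        right_parts[hash(row[key]) % num_nodes].append(row)
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        out.extend(hash_join(lpart, rpart, key))
    return out


calls = [
    {"pos": 100, "sample": "s1"},
    {"pos": 200, "sample": "s1"},
    {"pos": 100, "sample": "s2"},
]
annotations = [{"pos": 100, "gene": "G1"}, {"pos": 300, "gene": "G2"}]

by_sample = lambda row: row["sample"]
broadcast = sorted(broadcast_join([calls[:2], calls[2:]], annotations, "pos"), key=by_sample)
shuffled = sorted(shuffle_join(calls, annotations, "pos", 2), key=by_sample)
print(broadcast == shuffled)
```

Both strategies produce the same rows; they differ in what moves over the network. Broadcasting copies the whole right table to every node, which is cheap when that table is small, while shuffling moves both tables once so that matching keys meet on the same node.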
  80. Other desirable query examples
      • “How do the mutations in a given subject compare to the mutations in other phenotypically similar subjects?”
      • “For a given gene, in what pathways and cancer subtypes is it involved?” (connecting phenotypes to annotations)
      • “How common is an observed set of mutations?”
      • “For a given type of cancer, what are the characteristic disruptions?”
  81. Types of queries desired
      • Lots of these queries can be simply translated into SQL queries
      • Similar to functionality provided by PLINK/SEQ, but designed to scale to much larger data sets
  82. All-vs-all eQTL
      • Possible to generate trillions of hypothesis tests
          • 10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values
          • Tested below on 120 billion associations
      • Example queries:
          • “Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)”
              • Finishes in several seconds
          • “Find all cis-eQTLs across the entire genome”
              • Finishes in a couple of minutes
              • Limited by disk throughput
  83. All-vs-all eQTL
      • “Find all SNPs that are:
          • in LD with some lead SNP or eQTL of interest
          • aligned with some functional annotation of interest”
      • Still in testing, but likely finishes in seconds
      Schaub et al., Genome Research, 2012
  84. Conclusions
      • Hadoop ecosystem provides centralized, scalable repository for data
      • An abundance of tools for providing views/analytics into the data store
          • Separate implementation details from data pipelines
      • Software quality/data structures/file formats matter
      • Genomics has much to gain from moving away from HPC architecture toward Hadoop ecosystem architecture
  85. Cloud-based implementation
      • Hadoop-ecosystem architecture easily translates to the cloud (AWS, OpenStack)
      • Provides elastic capacity; no large initial CAPEX
      • Risk of vendor lock-in once data set is large
      • Allows simple sharing of data via public S3 buckets, for example
  86. Future work
      • Broad Institute has experimented with Google’s BigQuery for a variant store
          • BigQuery is Google’s Dremel exposed to public on Google’s cloud
          • Closed-source, only Google cloud
      • Developed API for working with variant data
      • Soon develop Impala-backed implementation of Broad API
          • To be open-sourced
  87. Future work
      • Drive towards several large data warehouses; storage backend optimized for particular access patterns
      • Each can expose one or more APIs for different applications/access levels
      • Haussler, D. et al. A Million Cancer Genome Warehouse. (2012). Tech Report.
  88. Acknowledgements
      • Cloudera: Josh Wills, Jeff Hammerbacher, Impala team (Nong Li), Sandy Ryza
      • Julien Le Dem (Twitter)
      • Our biotech client
      • Mike Schatz (CSHL)
      • Matt Massie
  89. [Image-only slide]
