
Genomics Is Not Special: Towards Data Intensive Biology

Genomics and the life sciences are using antiquated technology to process data. As data volumes grow, many in the biology community are reinventing the wheel without realizing that a rich ecosystem of tools for processing large data sets already exists: Hadoop.

  • Can you share the video of the talk?
  • @Anton Kulaga The only one is Spark, which one may argue is an important one. Starting from zero, Python is probably the most useful/versatile language for doing genomics/data analysis. Also, Spark has a decent Python framework.
  • Most of the frameworks that were listed are written in Scala. So why do you suggest that everyone learn Python instead of Scala?


  1. Genomics Is Not Special: Toward Data-Intensive Biology. Uri Laserson // laserson@cloudera.com // 13 November 2014
  2. © 2014 Cloudera, Inc. All rights reserved. http://omicsmaps.com/ (>25 Pbp / year)
  3. Carr and Church, Nat. Biotech. 27: 1151 (20
  4. For every “-ome” there’s a “-seq”: Genome → DNA-seq; Transcriptome → RNA-seq, FRT-seq, NET-seq; Methylome → Bisulfite-seq; Immunome → Immune-seq; Proteome → PhIP-seq, Bind-n-seq. http://liorpachter.wordpress.com/seq/
  5. (same content as slide 4)
  6. Based on IMGT/LIGM release 201111
  7. (image slide)
  8. (image slide)
  9. Developer/computational efficiency becoming paramount. Genome Biology 12: 125 (2011)
  10. Software and data management have been around since the 1970s: • version control/reproducibility • testing/automation/integration • databases and data formats • API design • Lots (most?) of big data innovation happening in industry
  11. Example query: for each variant that is • overlapping a DNase HS site • predicted to be deleterious • absent from dbSNP, compute the MAF by subpopulation using samples in the Framingham Heart Study.
      CHR  POS       REF  ALT  POP           MAF    POLYPHEN
      7    12289237  A    G    Plain         0.01   possibly damaging
      7    12289237  A    G    Star-bellied  0.03   possibly damaging
      12   2288332   T    C    Plain         0.003  probably damaging
      12   2288332   T    C    Star-bellied  0.09   probably damaging
  12. Available data:
      Data set                  Format             Size
      Population genotypes      VCF                10-100s of billions
      DNase HS sites (ENCODE)   narrowPeak (BED)   <1 million
      dbSNP                     CSV                10s of millions
      Sample phenotypes         JSON               thousands
  13. Why text data is a bad idea:
      • Text is highly inefficient: it compresses poorly, and values must be parsed
      • Text is semi-structured at best: flexible schemas make parsing difficult, and it is difficult to make assumptions about data structure
      • Text poorly separates the roles of delimiters and data: control characters require escaping (ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used)
      • But still almost always better than Excel
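Slide 13’s first point is easy to demonstrate. The sketch below is not from the talk (the data is invented): it encodes the same column of small integers as tab-delimited text and as fixed-width binary. The text form is larger, and recovering the values requires parsing every field.

```python
import struct

# A column of small integer values (e.g., genotype qualities); invented data.
qualities = [48, 49, 21, 54, 43, 3, 35, 61] * 1000

# Text encoding: digits plus delimiters; reading it back means parsing
# every field and worrying about delimiter escaping.
text_encoding = "\t".join(str(q) for q in qualities).encode("ascii")
parsed_back = [int(field) for field in text_encoding.decode("ascii").split("\t")]

# Binary encoding: fixed-width little-endian int16s; no parsing, no delimiters.
binary_encoding = struct.pack(f"<{len(qualities)}h", *qualities)
unpacked = list(struct.unpack(f"<{len(qualities)}h", binary_encoding))

assert parsed_back == unpacked == qualities
print(f"text: {len(text_encoding)} bytes, binary: {len(binary_encoding)} bytes")
```

Real binary formats (Avro and Parquet, discussed later in the deck) add schemas and compression on top, which widens the gap further.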
  14. Some reasons VCF in particular is bad:
      • The number of records (variants) grows with new variants, rather than new genotypes: difficult to write data; adding a sample requires rewriting the entire file
      • Data must be sorted
      • Semi-structured: need to build a parser for each file
      • Conflates two functions: a catalogue of variation and a repository of actual observed genotypes
      • If gzipped, it’s not splittable
      • Variants are not encoded uniquely by the VCF spec
  15. Manually executing the query in Python:

         class IntervalTree(object):
             def update(self, feature):
                 pass  # ...implement tree update
             def overlaps(self, feature):
                 return True or False

         dnase_sites = IntervalTree()
         with open('path/to/dnase.narrowPeak', 'r') as ip:
             for line in ip:
                 feature = parse_feature(line)
                 dnase_sites.update(feature)

         samples = {}
         with open('path/to/samples.json', 'r') as ip:
             for line in ip:
                 sample = json.loads(line)
                 if is_framingham(sample):
                     samples[sample['name']] = sample

         dbsnp = set()
         with open('path/to/dbsnp.csv', 'r') as ip:
             for line in ip:
                 snp = tuple(line.split()[:3])
                 dbsnp.add(snp)

  16. Additional metadata must fit in memory (same code as slide 15)
  17. Can only read from a POSIX filesystem (same code as slide 15)
  18. Manually executing the query in Python (continued):

         genotype_data = {}
         reader = vcf.Reader('path/to/genotypes.vcf')
         for variant in reader:
             if (dnase_sites.overlaps(variant) and
                     is_deleterious(variant) and
                     not in_dbsnp(variant)):
                 for call in variant.samples:
                     if call.sample in samples:
                         pop = samples[call.sample]['population']
                         genotype_data.setdefault((variant, pop), []).append(call)

         mafs = {}
         for (variant, pop) in genotype_data:
             mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])

  19. Genotype data may be split across files (same code as slide 18)
  20. Manually executing the query in Python:
      • If the file is gzipped, it cannot be split without decompressing (use Snappy)
      • Reading files requires access to a POSIX-style file system
      • Probably want to split the VCF file into pieces to parallelize: requires manual scatter-gather
      • Samples may be scattered among multiple VCF files (difficult to append to VCF)
      • Manually implementing a broadcast join: the build side must fit into memory
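The broadcast-join bullet on slide 20 can be sketched in a few lines of plain Python. This is a toy illustration, not the talk’s code; all names and records are invented. The small "build" side is held fully in memory as a dict, and the large "probe" side is streamed past it row by row.

```python
# Build side: a small sample-metadata table, keyed by sample name.
samples = {
    "NA00001": {"population": "Plain"},
    "NA00002": {"population": "Star-bellied"},
}

# Probe side: a large stream of genotype calls (here, just a list).
calls = [
    {"sample": "NA00001", "variant": ("7", 12289237), "gt": "0|1"},
    {"sample": "NA00002", "variant": ("7", 12289237), "gt": "1|1"},
    {"sample": "NA99999", "variant": ("7", 12289237), "gt": "0|0"},  # no metadata
]

def broadcast_join(calls, samples):
    """Inner join: emit each call annotated with its sample's population."""
    for call in calls:
        meta = samples.get(call["sample"])
        if meta is not None:  # inner join drops unmatched probe rows
            yield {**call, "population": meta["population"]}

joined = list(broadcast_join(calls, samples))
print(joined)
```

This is exactly why the build side must fit in memory: the dict is materialized in full before any probe row is processed.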
  21. Manually executing the query in Python on HPC:

         $ bsub -q shared_12h python split_genotypes.py
         $ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
         $ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
         $ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
         $ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
         $ bsub -q shared_12h python merge_maf.py

  22. The same commands, annotated: How to serialize intermediate output? • Manually specify requested resources • Manually split and merge • Babysit and check for errors/failures
  23. HPC separates compute from storage. HPC is about compute; Hadoop is about data. Storage infrastructure: proprietary, distributed file system; expensive. Compute cluster: high-perf, reliable hardware; expensive. Big network pipe ($$$). The user typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)
  24. HPC is lower-level than Hadoop: • HPC only exposes job scheduling • Parallelization typically through MPI: very low-level communication primitives • Difficult to horizontally scale by simply adding nodes • Large data sets must be manually split • Failures must be dealt with manually
  25. HPC uses the file system as a DB and the text file as the lowest common denominator: • All tools assume flat files with POSIX semantics • Sharing data/collaboration involves copying large files • The Broad joint caller with 25k genomes hits file-handle limits • Files are always streamed over the network (HPC architecture)
  26. HPC uses the job scheduler as a workflow tool: • Submitting jobs to a scheduler is low level • Workflow engines/execution models provide high-level execution graphs with built-in fault tolerance • e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive
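The difference slide 26 draws between raw job submission and a workflow engine is essentially dependency ordering plus automatic retry. A minimal sketch, not a real engine, with invented task names:

```python
def run_workflow(tasks, deps, max_retries=2):
    """tasks: name -> callable; deps: name -> list of prerequisite names."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for d in deps.get(name, []):   # run prerequisites first
            run(d)
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break                   # success: stop retrying
            except Exception:
                if attempt == max_retries:
                    raise               # give up after exhausting retries
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
flaky_state = {"failures_left": 1}

def flaky_align():
    # Simulates a transient failure on the first attempt.
    if flaky_state["failures_left"] > 0:
        flaky_state["failures_left"] -= 1
        raise RuntimeError("transient failure")
    log.append("align")

tasks = {"merge": lambda: log.append("merge"),
         "align": flaky_align,
         "call": lambda: log.append("call")}
deps = {"call": ["align"], "merge": ["call"]}
order = run_workflow(tasks, deps)
print(order)
```

Real engines (Oozie, Luigi, Spark) add persistence of intermediate state, distributed execution, and restart from failure, but the contract is the same: declare the graph, let the engine drive it.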
  27. Prepping data for local analysis in R/Python: • Manual script to prepare a CSV file for working locally (same issues as above) • Requires the working set of data to fit into the memory of a single machine • Visualization
  28. Domain-specific tools (e.g., PLINK/Seq):

         $ pseq path/to/project v-stats --mask phe=framingham locset=dnase ref.ex=dbsnp

      v-stats is one of a limited set of specific, useful tasks; the mask is (yet another) custom query specification
  29. Domain-specific tools (e.g., PLINK/Seq): • Works great if your problem fits into the pre-designed computations • Only works if your problem fits into the pre-designed computations • How to do stats by subpopulation? Probably possible, but you need to learn new notation • Must do work to get the data in to begin with • Not obviously parallelizable for performance on large data sets • Built on SQLite underneath
  30. RDBMS and SQL (e.g., MySQL):

         SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
         FROM genotypes g
         INNER JOIN samples s
             ON g.sample = s.sample
         INNER JOIN dnase d
             ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d.end
         LEFT OUTER JOIN dbsnp p
             ON g.chr = p.chr AND g.pos = p.pos
                AND g.ref = p.ref AND g.alt = p.alt
         WHERE s.study = "framingham"
             AND p.pos IS NULL
             AND g.polyphen IN ("possibly damaging", "probably damaging")
         GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
  31. RDBMS and SQL (e.g., MySQL): • Feature-rich and very mature • Highly optimized and allows indexing • Declarative (and abstracted) language for data • Hassle to get data in; data end up formatted one way • No clear scalability story • SQL-only
  32. Problems with the old way: • Expensive • No fault tolerance • No horizontal scalability • Poor separation of data modeling and storage formats • File format proliferation • Inefficient text formats
  33. (image slide)
  34. Indexing the web: • The web is huge: hundreds of millions of pages in 1999 • How do you index it? Crawl all the pages, rank pages based on relevance metrics, build a search index of keywords to pages • Do it in real time!
  35. (image slide)
  36. Databases in 1999: • Buy a really big machine • Install an expensive DBMS on it • Point your workload at it • Hope it doesn’t fail • Ambitious: buy another big machine as a backup
  37. (image slide)
  38. Database limitations: • Didn’t scale horizontally: high marginal cost ($$$) • No real fault-tolerance story • Vendor lock-in ($$$) • SQL unsuited for search ranking: complex analysis (PageRank), unstructured data
  39. (image slide)
  40. Google does something different: • Designed their own storage and processing infrastructure: Google File System (GFS) and MapReduce (MR) • Goals: cheap, scalable, reliable • General framework for large-scale batch computation • Powered Google Search for many years • Still used internally to this day (millions of jobs)
  41. Google was benevolent enough to publish: 2003 (GFS), 2004 (MapReduce)
  42. Birth of Hadoop at Yahoo!: • 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR • 2006: spun out as Apache Hadoop • Named after Doug’s son’s yellow stuffed elephant
  43. Open-source proliferation:
      Google            Open-source      Function
      GFS               HDFS             Distributed file system
      MapReduce         MapReduce        Batch distributed data processing
      Bigtable          HBase            Distributed DB/key-value store
      Protobuf/Stubby   Thrift or Avro   Data serialization/RPC
      Pregel            Giraph           Distributed graph processing
      Dremel/F1         Impala           Scalable interactive SQL (MPP)
      FlumeJava         Crunch           Abstracted data pipelines on Hadoop
  44. Hadoop provides: • Data centralization on HDFS: no rewriting data for each tool/application • Data-local execution to avoid moving terabytes • High-level execution engines: SQL (Impala, Hive), relational algebra (Spark, MapReduce), bulk synchronous parallel (GraphX), distributed in-memory • Built-in horizontal scalability and fault tolerance • Hadoop-friendly, evolvable serialization formats/RPC
  45. Hadoop provides serialization/RPC formats (Avro): • Specify schemas/services in user-friendly IDLs • Code generation to multiple languages (wire-compatible/portable) • Compact, binary formats • Support for schema evolution • Like binary JSON

         record Feature {
             union { null, string } featureId = null;
             union { null, string } featureType = null;  // e.g., DNase HS
             union { null, string } source = null;       // e.g., BED, GFF file
             union { null, Contig } contig = null;
             union { null, long } start = null;
             union { null, long } end = null;
             union { null, Strand } strand = null;
             union { null, double } value = null;
             array<Dbxref> dbxrefs = [];
             array<string> parentIds = [];
             map<string> attributes = {};
         }
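The "schema evolution" bullet is the key operational property. The toy below is not Avro (it uses JSON dicts), but it illustrates the mechanism Avro formalizes: a reader resolves a record written under an older schema by filling missing fields from declared defaults. Field names loosely follow the Feature record above.

```python
import json

# "Reader" schema: field name -> default value. An older writer did not
# know about the strand field, but the reader declares a default for it.
reader_schema = {
    "featureId": None, "featureType": None, "source": None,
    "start": None, "end": None, "strand": None, "attributes": {},
}

def read_record(serialized, schema):
    """Decode a record, filling fields absent in the writer's data."""
    record = json.loads(serialized)
    # Any field the writer did not know about gets the reader's default.
    return {field: record.get(field, default) for field, default in schema.items()}

# A record serialized under the older, strand-less schema.
old_record = json.dumps({"featureId": "f1", "featureType": "DNase HS",
                         "start": 12289200, "end": 12289400})
feature = read_record(old_record, reader_schema)
print(feature["strand"], feature["featureType"])
```

In real Avro the writer's schema travels with the data file, so the reader can do this resolution field by field without guessing.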
  46. APIs instead of file formats: • Service-oriented architectures (SOA) ensure stable contracts • Allows implementation changes with new technologies • The software community has lots of experience with SOA, along with mature tools • Can be implemented in a language-independent fashion
  47. Current file format hairball (diagram)
  48. API-oriented architecture (diagram)
  49. Hadoop provides columnar storage (Parquet): • Designed for general data storage • Columnar format: read fewer bytes, compression more efficient • Splittable • Avro/Thrift-compatible • Predicate pushdown • RLE, dictionary encoding
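The last bullet (RLE, dictionary encoding) is what makes columnar layouts compress so well on genomic data, where columns like chromosome are massively repetitive. A toy sketch of both encodings (not Parquet’s actual implementation; the data is invented):

```python
def rle_encode(values):
    """Run-length encode: [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def dict_encode(values):
    """Dictionary encode: small integer codes plus a lookup table."""
    table = {}
    codes = []
    for v in values:
        codes.append(table.setdefault(v, len(table)))
    dictionary = sorted(table, key=table.get)  # in first-seen order
    return codes, dictionary

chrom_column = ["chr7"] * 4 + ["chr12"] * 3
print(rle_encode(chrom_column))   # [('chr7', 4), ('chr12', 3)]
codes, dictionary = dict_encode(chrom_column)
print(codes, dictionary)          # [0, 0, 0, 0, 1, 1, 1] ['chr7', 'chr12']
```

Parquet combines these: low-cardinality columns are dictionary-encoded first, and the resulting stream of codes is then run-length/bit-packed.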
  50. Hadoop provides columnar storage (Parquet) (diagram)
  51. Hadoop provides columnar storage (Parquet): vertical partitioning (projection pushdown) + horizontal partitioning (predicate pushdown) = read only the data you need!
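Slide 51’s two pushdowns can be sketched with a toy row-group layout. The data and field names are invented, and real Parquet stores serialized column chunks with per-chunk statistics, but the logic is the same: projection reads only the requested columns, and the predicate skips whole row groups whose min/max statistics rule them out.

```python
# Two "row groups", each with per-group min/max statistics for pos.
row_groups = [
    {"stats": {"pos": (100, 500)},
     "columns": {"chr": ["7", "7"], "pos": [100, 500], "gt": ["0|1", "1|1"]}},
    {"stats": {"pos": (900, 1200)},
     "columns": {"chr": ["7", "7"], "pos": [900, 1200], "gt": ["0|0", "0|1"]}},
]

def scan(row_groups, wanted_columns, pos_min):
    """Return rows (as dicts of wanted columns) with pos >= pos_min."""
    out = []
    for group in row_groups:
        lo, hi = group["stats"]["pos"]
        if hi < pos_min:          # predicate pushdown: skip the whole group
            continue
        cols = {c: group["columns"][c] for c in wanted_columns}  # projection
        for i in range(len(group["columns"]["pos"])):
            if group["columns"]["pos"][i] >= pos_min:
                out.append({c: cols[c][i] for c in wanted_columns})
    return out

hits = scan(row_groups, ["pos", "gt"], pos_min=600)
print(hits)
```

Here the first row group is never touched: its max pos (500) already fails the predicate, so neither its bytes nor its values are read.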
  52. Hadoop provides abstractions for data processing: HDFS (scalable, distributed storage); YARN (resource management); MapReduce, Impala (SQL), Solr (search), Spark; ADAM, quince, guacamole, …; bdg-formats (Avro/Parquet)
  53. Hadoop examples: filesystem

         [laserson@bottou01-10g ~]$ hadoop fs -ls /user/laserson
         Found 16 items
         drwx------   - laserson laserson  0 2014-11-12 16:00 .Trash
         drwxr-xr-x   - laserson laserson  0 2014-11-12 00:29 .sparkStaging
         drwx------   - laserson laserson  0 2014-06-07 13:27 .staging
         drwxr-xr-x   - laserson laserson  0 2014-10-30 14:15 1kg
         drwxr-xr-x   - laserson laserson  0 2014-05-08 17:29 bigml
         drwxr-xr-x   - laserson laserson  0 2014-10-30 14:14 book
         drwxrwxr-x   - laserson laserson  0 2014-06-16 12:59 editing
         drwxr-xr-x   - laserson laserson  0 2014-06-06 13:49 gdelt
         -rw-r--r--   3 laserson laserson  0 2014-10-27 16:24 hg19_text
         drwxr-xr-x   - laserson laserson  0 2014-06-12 19:53 madlibport
         drwxr-xr-x   - laserson laserson  0 2014-03-20 18:09 rock-health-python
         drwxr-xr-x   - laserson laserson  0 2014-05-15 13:25 test-udf
         drwxr-xr-x   - laserson laserson  0 2014-08-21 17:58 test_pymc
         drwxr-xr-x   - laserson laserson  0 2014-10-27 22:25 tmp
         drwxr-xr-x   - laserson laserson  0 2014-10-07 20:30 udf-scratch
         drwxr-xr-x   - laserson laserson  0 2014-03-02 13:50 udfs
  54. Hadoop examples: batch MapReduce job

         hadoop jar vcf2parquet-0.1.0-jar-with-dependencies.jar \
             com.cloudera.science.vcf2parquet.VCFtoParquetDriver \
             hdfs:///path/to/variants.vcf hdfs:///path/to/output.parquet
  55. Hadoop examples: interactive Spark shell

         [laserson@bottou01-10g ~]$ spark-shell --master yarn
         Welcome to Spark (ASCII banner) version 1.1.0
         Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
         Type in expressions to have them evaluated.
         [...]
         scala>
  56. Hadoop examples: interactive Spark shell

         def inDbSnp(g: Genotype): Boolean = ???  // true or false
         def isDeleterious(g: Genotype): Boolean = g.getPolyPhen

         val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
         val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()

         val genotypesRDD = sc.adamLoad("path/to/genotypes")
         val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")

         val filteredRDD = genotypesRDD
             .filter(!inDbSnp(_))
             .filter(isDeleterious(_))
             .filter(isFramingham(_))

         val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

         val maf = joinedRDD
             .keyBy(x => (x.getVariant, getPopulation(x)))
             .groupByKey()
             .map(computeMAF(_))
             .saveAsNewAPIHadoopFile("path/to/output")
  57. Hadoop provides abstractions for data processing (same diagram as slide 52)
  58. Genomics ETL: .fastq → (short read alignment) → .bam → (genotype calling) → .vcf → (analysis) → .bed/.gtf/etc.
  59. Hadoop variant store architecture (diagram): clients (Impala shell (SQL), REST API, JDBC) send a SQL query to the Impala engine, which uses the Hive metastore and returns a result set; .vcf files are ETL’d into .parquet
  60. Data denormalization: • Amortize the join cost up front • Replace joins with predicates (allowing predicate pushdown)

         ##fileformat=VCFv4.1
         ##fileDate=20090805
         ##source=myImputationProgramV3.1
         ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
         ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
         ##phasing=partial
         ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
         ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
         ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
         ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
         ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
         ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
         ##FILTER=<ID=q10,Description="Quality below 10">
         ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
         ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
         ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
         ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
         ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
         #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
         20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
         20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
         20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
         20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
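The denormalization slide 60 describes can be sketched directly against the VCF layout above (toy records; the field names are invented for illustration): the wide one-row-per-variant table becomes a long one-row-per-genotype table, so downstream queries are flat predicates instead of joins, at the cost of repeating the variant fields.

```python
# Wide layout, as in VCF: one row per variant, one column per sample.
variants = [
    {"chrom": "20", "pos": 14370, "ref": "G", "alt": "A",
     "calls": {"NA00001": "0|0", "NA00002": "1|0", "NA00003": "1/1"}},
    {"chrom": "20", "pos": 17330, "ref": "T", "alt": "A",
     "calls": {"NA00001": "0|0", "NA00002": "0|1", "NA00003": "0/0"}},
]

def denormalize(variants):
    """One output row per genotype call, with the variant fields repeated."""
    rows = []
    for v in variants:
        for sample, gt in v["calls"].items():
            rows.append({"chrom": v["chrom"], "pos": v["pos"],
                         "ref": v["ref"], "alt": v["alt"],
                         "sample": sample, "gt": gt})
    return rows

rows = denormalize(variants)
print(len(rows))   # 2 variants x 3 samples = 6 rows
```

This is the trade the slide names: the join cost is paid once up front (and columnar compression makes the repeated variant fields cheap), and every later query avoids the join.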
  61. Hadoop solution characteristics: • Data stored in the Parquet columnar format for performance and compression • Impala/Hive metastore provide a unified, flexible data model • Impala implements RDBMS-style operations (by experts in distributed systems) • Spark offers flexible relational-algebra operators (and in-memory computing) • Built-in fault tolerance for computations and horizontal scalability
  62. Example variant-filtering query: “Give me all SNPs that are on chromosome 16, absent from dbSNP, present in COSMIC, and observed in breast cancer samples.” On the full 1000 Genomes data set (~37 billion genotypes), on a 14-node cluster, the query completes in several seconds.

         SELECT cosmic as snp_id,
                vcf_chrom as chr,
                vcf_pos as pos,
                sample_id as sample,
                vcf_call_gt as genotype,
                sample_affection as phenotype
         FROM hg19_parquet_snappy_join_cached_partitioned
         WHERE COSMIC IS NOT NULL
           AND dbSNP IS NULL
           AND sample_study = "breast_cancer"
           AND VCF_CHROM = "16";
  63. Other queries/use cases: • All-vs-all eQTL integrated with ENCODE (>120 billion p-values): “top 20 eQTLs for 5 genes of interest” is interactive; “find all cis-eQTLs” takes several minutes • Population genetics queries (e.g., a backend for PLINK) • Interval arithmetic on large ENCODE data sets • Duke CHGV: ATAV DSL for preparing data for GWAS; week-long queries now take a few hours by parallelizing on Spark
  64. Computational biologists are reinventing the wheel: • e.g., CRAM (columnar storage) • e.g., workflow managers (Galaxy) • e.g., GATK (scatter-gather)
  65. Large-scale data analysis has been solved*: • Cheaper in terms of hardware • Easier in terms of productivity • Built-in horizontal scaling • Built-in fault tolerance • Layered abstractions for data modeling • Hadoop!
  66. Science on Hadoop: • ADAM project for genomics on Spark: http://bdgenomics.org/ • Guacamole for somatic variation on Spark: https://github.com/hammerlab/guacamole/ • Thunder project for neuroimaging on Spark: http://thefreemanlab.com/thunder/ • Quince for a variant store on Impala (currently barebones, but with examples): https://github.com/laserson/quince
  67. Suggestions/resources: • Everyone should learn Python (also, everyone should try some experiments) • Everyone should use version control (e.g., git); GitHub enables easy collaboration; see Titus Brown’s blog • Use the IPython Notebook (Jupyter) for productivity • Big data is often about engineering; use the best tools • For getting industry jobs: show people you know how to code by putting your projects on GitHub; you should feel lucky if others start using your code
  68. (image slide)
  69. Acknowledgements: • Cloudera: Sandy Ryza (Spark development), Nong Li (Impala), Skye Wanderman-Milne (Impala) • Impala genomics collaborators: Kiran Mukhyala, Slaton Lipscomb • ADAM project: Matt Massie, Frank Nothaft, Timothy Danford • Mount Sinai School of Medicine: Jeff Hammerbacher (+ lab) • Duke CHGV: Jonathan Keebler
  70. Thank you.
