2016 bergen-sars

1. A 12-step program for biology to survive and thrive in the era of data-intensive science. C. Titus Brown, Genome Center & Data Science Initiative. Mar 18, 2016. Slides are on slideshare.net/c.titus.brown/
2. My path: math undergrad → evolutionary modeling → developmental biology → computer science & microbiology → veterinary medicine(?) → “data-intensive biology”. Along the way: sea urchin GRNs, chick neural crest, ascidian GRNs, lamprey mRNAseq, Marek’s disease, soil metagenomics, bioinformatics algorithms.
3. My guiding question: What is going to be happening in the next 5 years with biological data generation? (And can I make progress on some of the coming problems?)
4. DNA sequencing rates continue to grow. Stephens et al., 2015 - 10.1371/journal.pbio.1002195
5. (2015 was a good year)
6. Oxford Nanopore sequencing. Slide via Torsten Seemann
7. Nanopore technology. Slide via Torsten Seemann
8. Scaling up --
9. Scaling up --
10. Slide via Torsten Seemann
11. http://ebola.nextflu.org/
12. “Fighting Ebola With a Palm-Sized DNA Sequencer”. See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
13. “DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs. physical parameters – a potential collaboration, via Elizabeth Kujawinski. Another challenge beyond volume and velocity – variety.
14. CRISPR. The challenge with genome editing is fast becoming what to edit, rather than how to do it.
15. A point for reflection… Increasingly, the best guide to the next 10 years of biology is science fiction...
16. Digital normalization. Statement of problem: We can’t run de novo assembly on the transcriptome data sets we have!
17. Shotgun sequencing and coverage. “Coverage” is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
18. Random sampling => deep sampling needed. Typically 10-100x coverage is needed for robust recovery (30-300 Gbp for a human genome).
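To make that arithmetic concrete: with N reads of length L against a genome of size G, the expected coverage is C = N·L/G. A minimal sketch of the calculation (Python; the 3 Gbp genome size and 150 bp read length are illustrative assumptions, not from the slides):

```python
# Coverage C = (number of reads N * read length L) / genome size G.
# Sketch of the arithmetic behind "10-100x => 30-300 Gbp for human".

GENOME_SIZE = 3e9      # human genome, ~3 Gbp (assumed)
READ_LENGTH = 150      # typical Illumina read length (assumed)

def coverage(n_reads, read_length=READ_LENGTH, genome_size=GENOME_SIZE):
    """Average number of reads overlapping each base of the genome."""
    return n_reads * read_length / genome_size

def reads_needed(target_coverage, read_length=READ_LENGTH, genome_size=GENOME_SIZE):
    """Number of reads required to reach a target average coverage."""
    return int(target_coverage * genome_size / read_length)

for c in (10, 100):
    n = reads_needed(c)
    print(f"{c}x coverage: {n:,} reads = {n * READ_LENGTH / 1e9:.0f} Gbp")
# -> 10x needs ~30 Gbp and 100x needs ~300 Gbp, matching the slide's range.
```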
19.-24. Digital normalization (the same diagram, built up step by step across six slides)
25. (Digital normalization is a computational version of library normalization.) Suppose transcript A is 10x more abundant than transcript B. To get 10x coverage of B you need 100x of A – overkill! That extra 100x consumes disk space and, because of errors, memory. We can discard it for you…
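The core rule behind diginorm is compact enough to sketch. This toy version (Python) estimates the coverage of a read’s region as the median count of its k-mers and keeps the read only while that estimate is below a target. The exact dict here stands in for khmer’s Count-Min sketch, and the k and target values are illustrative:

```python
# Toy digital normalization: keep a read only if the median count of its
# k-mers is still below a coverage target. A sketch of the idea only --
# the real implementation (khmer) uses a Count-Min sketch, not an exact dict.
from collections import defaultdict
from statistics import median

K = 20          # k-mer size (illustrative)
TARGET = 20     # coverage target (illustrative)

kmer_counts = defaultdict(int)

def kmers(read):
    return [read[i:i + K] for i in range(len(read) - K + 1)]

def keep_read(read):
    """Single-pass, streaming decision: accept or discard each read."""
    km = kmers(read)
    if not km:
        return False
    if median(kmer_counts[k] for k in km) >= TARGET:
        return False          # region already sampled deeply enough; discard
    for k in km:              # otherwise count the read's k-mers and keep it
        kmer_counts[k] += 1
    return True

# Toy input: one locus sampled at 50x, one at 1x.
reads = ["ACGT" * 10] * 50 + ["TTGACCA" * 6]
normalized = [r for r in reads if keep_read(r)]
print(len(normalized), "reads kept of", len(reads))   # 21 of 51
```

Note how the high-coverage locus is capped at the target while the rare read survives untouched; that is why assembly after diginorm scales with information content rather than data volume.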
26. Some key points --
• Digital normalization is streaming.
• Digital normalization is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass).
• It is currently used primarily as a prefilter for assembly, but it relies on an underlying abstraction (the De Bruijn graph) that is also used in variant calling.
27. Assembly now scales with information content, not data size.
• 10-100 fold decrease in memory requirements
• 10-100 fold speed up in analysis
28. Diginorm is widely useful:
1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep)
3. Osedax symbiont metagenome, a “contaminated metagenome” problem. (Goffredi et al., 2013; pmid 24225886)
29. Anecdata: diginorm is used in Illumina long-read sequencing (?)
30. Computational problems now scale with information content rather than data set size. Most samples can be reconstructed via de novo assembly on commodity computers.
31. Applying digital normalization in a new project – the horse transcriptome. Tamer Mansour, with the Bellone, Finno, Penedo, & Murray labs.
32. Input data

    Tissue       Library             Length  #samples  #frag (M)  #bp (Gb)
    BrainStem    PE fr.firststrand   101     8         166.73     33.68
    Cerebellum   PE fr.firststrand   100     24        411.48     82.3
    Muscle       PE fr.firststrand   126     12        301.94     76.08
    Retina       PE fr.unstranded    81      2         20.3       3.28
    SpinalCord   PE fr.firststrand   101     16        403        81.4
    Skin         PE fr.unstranded    81      2         18.54      3
                 SE fr.unstranded    81      2         16.57      1.34
                 SE fr.unstranded    95      3         105.51     10.02
    Embryo ICM   PE fr.unstranded    100     3         126.32     25.26
                 SE fr.unstranded    100     3         115.21     11.52
    Embryo TE    PE fr.unstranded    100     3         129.84     25.96
                 SE fr.unstranded    100     3         102.26     10.23
    Total                                    81        1917.7     364.07
33. EquCab2 current status - NCBI annotation

    Feature              Acc  Annotation  GFF    RefSeq DB
    Total no. of genes                    25565
    Protein-coding genes                  19686
    Coding RNA           NM   BestRefSeq  764    1097
    Coding RNA           XM   Gnomon      31578  31346
    Non-coding RNA       NR   BestRefSeq  348    726
    Non-coding RNA       XR   Gnomon      3311   3310
    Total                                 36001  36479

    32342 coding transcripts encoded by 19686 genes (average 1.6 transcripts per gene). There are 3034 pseudogenes (with no annotated transcripts).

    Status       count
    Reviewed     4
    Validated    267
    Provisional  540
    Predicted    7
    Inferred     279

    Tamer Mansour
34. Analysis pipeline (flowchart; steps include): library prep, read trimming, mapping to reference, transcriptome assembly, merging replicates, merging by tissue, merging all assemblies; filtering knowns, comparing to public annotation, ORF prediction, ncRNA prediction, variant analysis and dbVar update, haplotype assembly, pooling/diginorm, filter & compare. Tamer Mansour
35. Digital normalization & (e.g.) the horse transcriptome. The computational demands of Cufflinks:
– read binning (processing time);
– construction of gene models: number of genes, number of splice junctions, number of reads per locus, sequencing errors, and complexity of the locus such as gene overlap and multiple isoforms (processing time & memory use).
With diginorm:
– significant reduction in binning time;
– relative increase in the resources required for gene model construction as more samples and tissues are merged;
– false recombinant isoforms? Tamer Mansour
36. Effect of digital normalization. ** Should be very valuable for detection of ncRNA. Tamer Mansour
37. The ORF problem. Hestand et al. 2014: “we identified 301,829 positions with SNPs or small indels within these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading frame of the transcript and appear to be small errors in the equine reference genome” Tamer Mansour
38. We merged the assemblies into six tissue-specific transcription profiles for cerebellum, brainstem, spinal cord, retina, muscle and skin. The final merger of all assemblies overlaps with 63% and 73% of NCBI and Ensembl loci, respectively, capturing about 72% and 81% of their coding bases. Comparing our assembly to the most recent transcriptome annotation shows ~85% overlapping loci. In addition, at least 40% of our annotated loci represent novel transcripts. Tamer Mansour
39. Diginorm can also process data as it comes in – streaming decision making.
40. What do we do when we get new data??
• How do we efficiently process and update our existing resources?
• How do we evaluate whether or not our prior conclusions need to change or be updated?
  – # of genes, & their annotations;
  – differential expression based on new isoforms.
• This is a problem everyone has… and it’s not going away…
41. The data challenge in biology. So we can sequence everything – so what? What does it mean? How can we do better biology with the data? How can we understand?
42. A 12-step program for biology (??) (This was a not terribly successful attempt to be entertaining.)
43. 1. Think repeatability and scaling x 100.
What works for one data set,
doesn’t work as well for three,
and doesn’t work at all for 100.
44. 2. Think streaming / few-pass analysis. Multi-pass: Data → Mapping → Sorting → Calling → Answer, versus 1-pass: Data → Answer.
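As an illustration of the 1-pass pattern, here is a hedged sketch (Python): each read is examined exactly once as it streams by and a per-read decision is made immediately, with no sort or second pass over the data. The FASTQ file name and the mean-quality filter criterion are hypothetical, chosen only to make the sketch concrete:

```python
# One-pass, constant-memory processing: decide about each read as it streams
# by, instead of load -> map -> sort -> call over the full data set.

def stream_reads(fastq_lines):
    """Yield (sequence, quality) pairs from an iterator of FASTQ lines."""
    it = iter(fastq_lines)
    for header in it:                       # 4 lines per FASTQ record
        seq, _, qual = next(it), next(it), next(it)
        yield seq.strip(), qual.strip()

def mean_quality(qual):
    """Mean Phred score, assuming Phred+33 encoding."""
    return sum(ord(c) - 33 for c in qual) / len(qual)

def one_pass_filter(fastq_lines, min_q=20):
    for seq, qual in stream_reads(fastq_lines):
        if mean_quality(qual) >= min_q:
            yield seq                       # answer built read by read

with open("reads.fastq") as fh:             # assumed input file
    n_kept = sum(1 for _ in one_pass_filter(fh))
    print(n_kept, "reads pass, in a single pass over the file")
```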
45. 3. Invest in computational training. Summer NGS workshop (2010-2017).
46. 4. Move beyond PDFs. This is only part of the story! Subramanian et al., doi: 10.1128/JVI.01163-13
47. 5. Focus on a biological question. Generating data for the sake of having data leads you into a data analysis maze – “I’m sure there’s something interesting in there… somewhere.”
  48. 48. "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect." Ref.: Pandey et al. (2014), PLoS One 11, e88889.Via Erich Schwarz The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome" 6. Spend more effort on the unknowns!
49. 7. Invest in data integration. Figure 2: summary of the challenges associated with data integration in the proposed project. Figure via E. Kujawinski
50. 8. Split your information into layers. Protein coding >> ncRNA >> ???
** Should be very valuable for detection of ncRNA.
*** But what the heck do we do with ncRNA information? Tamer Mansour
51. 9. Move to an update model. (Diagram: current information + new data!!!! → update results?)
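One way such an update model could look, sketched in Python with entirely hypothetical names and file paths: persistent summary state is loaded, new data is folded in, and only conclusions whose supporting numbers moved past a threshold are flagged for re-examination, rather than recomputing everything from scratch:

```python
# Sketch of an "update model": keep persistent summary state, fold new data
# into it, and report which prior conclusions need revisiting.
# All names (state.json, gene_counts) are hypothetical illustrations.
import json

def load_state(path="state.json"):
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"gene_counts": {}}          # first run: empty state

def update(state, new_counts, rel_change=0.2):
    """Merge new per-gene read counts; return genes whose totals moved >20%."""
    changed = []
    for gene, n in new_counts.items():
        old = state["gene_counts"].get(gene, 0)
        new = old + n
        state["gene_counts"][gene] = new
        if old == 0 or abs(new - old) / old > rel_change:
            changed.append(gene)            # conclusion may need revisiting
    return changed

state = load_state()
to_revisit = update(state, {"geneA": 120, "geneB": 3})   # new data arrives
print("re-examine conclusions about:", to_revisit)
with open("state.json", "w") as fh:
    json.dump(state, fh)                    # persist for the next batch
```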
52. Candidates for additional steps…
• Invest in data sharing and better “reference” infrastructure.
• Build better tools for computationally exploring hypotheses.
• Invest in “unsupervised” analysis of data (machine learning).
• Learn/apply multivariate stats.
• Invest in social media & preprints & “open”.
53. My future plans?
• Protocols and a (distributed) platform for data discovery & sharing.
• Data analysis and integration in marine biogeochemistry & microbial physiology.
54. Fig. 1: The cycle from data through models to discovery and back to experiment, generating knowledge as the cycle is repeated. Parts of that cycle are standard in particular disciplines, but putting together the full cycle requires transdisciplinary expertise.
55. Training program at UC Davis:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.
(Google “dib training” for details; join the announce list!)
56. Thanks for listening! Please contact me at ctbrown@ucdavis.edu!
