2015 beacon-metagenome-tutorial


  1. 1. Shotgun metagenomics C. Titus Brown UC Davis / School of Veterinary Medicine ctbrown@ucdavis.edu
  2. 2. Shotgun metagenomics • Collect samples; • Extract DNA; • Feed into sequencer; • Computationally analyze. Wikipedia: Environmental shotgun sequencing.png
  3. 3. Computational protocol for assembly
  4. 4. Genome assembly works pretty well
     Metric                      Velvet      IDBA        SPAdes
     Total length (>= 0 bp)      1.6E+08     2.0E+08     2.0E+08
     Total length (>= 1000 bp)   1.6E+08     1.9E+08     1.9E+08
     Largest contig              561,449     979,948     1,387,918
     # misassembled contigs      631         1032        752
     Genome fraction (%)         72.949      90.969      90.424
     Duplication ratio           1.004       1.007       1.004
     Conclusion: SPAdes and IDBA achieve similar results. Dr. Sherine Awad
  5. 5. Evaluating short-read data / coverage plots Assembly starts to work @ ~10x
  6. 6. What’s coming? Lots more data.
  7. 7. Who ya gonna call?? …to do your bioinformatics?
  8. 8. Topics to explore - • Experimental design to enable good data analysis. • Extracting whole genomes and evaluating strain variation. • Technology review and open problems. • Assembly graphs. • Long reads and infinite sequencing – what next? Or just a Q&A. I can talk for 90 minutes straight, too, but I’d rather not.
  9. 9. Assembly graphs. Assembly “graphs” are a way of representing sequencing data that retains all variation; assemblers then crunch this down into a single sequence. Image from Iqbal et al., 2012. PMID 23172865
  10. 10. Long read sequencing See: http://angus.readthedocs.org/en/2015/_static/Torsten_Seemann_LRS.pdf & nanopore etc.
  11. 11. Intro, overview, and case studies
  12. 12. Shotgun metagenomics • Collect samples; • Extract DNA; • Feed into sequencer; • Computationally analyze. Wikipedia: Environmental shotgun sequencing.png
  13. 13. FASTQ etc.
      @SRR606249.17/1                  (name)
      GAGTATGTTCTCATAGAGGTTGGTANNNNT
      +
      B@BDDFFFHHHHHJIJJJJGHIJHJ####1   (quality score)
      @SRR606249.17/2
      CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
      +
      CCCF#################22@GHIJJJ
      Note: /1 and /2 => interleaved
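
To make the interleaved layout on slide 13 concrete, here is a minimal sketch that reads 4-line FASTQ records and pairs each /1 read with the /2 read that follows it. The function names `read_fastq` and `read_pairs` are invented for this example, and the code assumes unwrapped 4-line records; it is not the parser used in the tutorial.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual

def read_pairs(handle):
    """Yield (left, right) mates from an interleaved file: /1 then /2."""
    records = read_fastq(handle)
    for left in records:
        right = next(records, None)
        if right is None:
            raise ValueError("odd number of records in interleaved file")
        names_match = left[0][:-2] == right[0][:-2]
        if not (left[0].endswith("/1") and right[0].endswith("/2") and names_match):
            raise ValueError(f"reads are not a /1, /2 pair: {left[0]}, {right[0]}")
        yield left, right

# The two records shown on the slide.
EXAMPLE = """@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
"""

pairs = list(read_pairs(io.StringIO(EXAMPLE)))
for left, right in pairs:
    print(left[0], right[0], len(left[1]), len(right[1]))
```

Checking the /1 and /2 suffixes up front is a cheap way to catch files that were concatenated rather than interleaved.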
  14. 14. Shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  15. 15. Dealing with shotgun metagenome reads. 1) Annotate/analyze individual reads. 2) Assemble reads into contigs, genomes, etc.
  16. 16. Annotating individual reads • Works really well when you have EITHER o (a) evolutionarily close references o (b) rather long sequences (This is obvious, right?)
  17. 17. Annotating individual reads #2 • We have found that this does not work well with Illumina samples from unexplored environments (e.g. soil). • Sensitivity is fine (correct match is usually there) • Specificity is bad (correct match may be drowned out by incorrect matches)
  18. 18. Assembly: good. Howe et al., 2014, PMID 24632729 Assemblies yield much more significant homology matches.
  19. 19. Recommendation: For reads < 200-300 bp, • Annotate individual reads for human- associated samples, or exploration of well- studied systems. • Otherwise, look to assembly.
  20. 20. So, why assemble? • Increase your ability to assign homology/orthology correctly!! Essentially all functional annotation systems depend on sequence similarity to assign homology. This is why you want to assemble your data.
  21. 21. Why else would you want to assemble? • Assemble new “reference”. • Look for large-scale variation from reference – pathogenicity islands, etc. • Discriminate between different members of gene families. • Discover operon assemblages & annotate on co-incidence of genes. • Reduce size of data!!
  22. 22. Why don’t you want to assemble?? • Abundance threshold – low-coverage filter. • Strain variation • Chimerism
  23. 23. Assembly, in general
  24. 24. Short read lengths are hard. Whiteford et al., Nuc. Acid Res, 2005
  25. 25. Short read lengths are hard. Whiteford et al., Nuc. Acid Res, 2005 Conclusion: even with a read length of 200, the E. coli genome cannot be assembled completely. Why?
  26. 26. Short read lengths are hard. Whiteford et al., Nuc. Acid Res, 2005 Conclusion: even with a read length of 200, the E. coli genome cannot be assembled completely. Why? REPEATS. This is why paired-end sequencing is so important for assembly.
  27. 27. Four main challenges for de novo sequencing. • Repeats. • Low coverage. • Errors. These introduce breaks in the construction of contigs. • Variation in coverage – transcriptomes and metagenomes, as well as amplified genomic. This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.
  28. 28. Repeats • Overlaps don’t place sequences uniquely when there are repeats present. UMD assembly primer (cbcb.umd.edu)
  29. 29. Metagenomics: Mixed community sampling Coverage distribution
  30. 30. Conclusions re mixed community sampling In shotgun metagenomics, you are sampling randomly from the mixed population. Therefore, the lowest abundance member of the population (that you want to observe) drives the required depth of sequencing! 1 in a million => ~50 Tbp sequencing for 10x coverage.
  31. 31. Assembly depends on high coverage HMP mock community assembly; Velvet-based protocol.
  32. 32. Conclusions from previous slide • To recover any contigs at all, you need coverage > 10 (green line). • To recover long sequences, you want coverage > 20 (blue line).
  33. 33. A few assembly stories • Low complexity/Osedax symbionts • High complexity/soil.
  34. 34. Osedax symbionts
      Table S1. Specimens, and collection sites, used in this study
      Site     Dive (1)     Date       Time Zone (months)   # of specimens
      Osedax frankpressi with Rs1 symbiont
      2890m    T486         Oct 2002   8                    2
               T610         Aug 2003   18                   3
               T1069        Jan 2007   59                   2
               DR010        Mar 2009   85                   3
               DR098        Nov 2009   93                   4
               DR204        Oct 2010   104                  1
               DR234        Jun 2011   112                  2
      1820m    T1048        Oct 2006   7                    3
               T1071        Jan 2007   10                   3
               DR012        Mar 2009   36                   4
               DR236 (2)    Jun 2011   63                   2
      O. frankpressi from DR236
      (1) Dive numbers begin with the remotely operated vehicle name; T = Tiburon, DR = Doc Ricketts (both owned and operated by the Monterey Bay Aquarium Research Institute).
      (2) Samples used for genomic analysis were collected during dive DR236.
      Goffredi et al., submitted.
  35. 35. 16s shotgun
  36. 36. Metagenomic assembly followed by binning enabled isolation of fairly complete genomes
  37. 37. Assembly allowed genomic content comparisons to nearest cultured relative
  38. 38. Osedax assembly story Conclusions include: • Osedax symbionts have genes needed for free living stage; • metabolic versatility in carbon, phosphate, and iron uptake strategies; • Genome includes mechanisms for intracellular survival, and numerous potential virulence capabilities.
  39. 39. Osedax assembly story • Low diversity metagenome! • Physical isolation => MDA => sequencing => diginorm => binning => o 94% complete Rs1 o 66-89% complete Rs2 • Note: many interesting critters are hard to isolate => so, basically, metagenomes.
  40. 40. Human-associated communities • “Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization.” Sharon et al. (Banfield lab); Genome Res. 2013 23: 111-120
  41. 41. Setup • Collected 11 fecal samples from premature female, days 15-24. • 260m 100-bp paired-end Illumina HiSeq reads; 400 and 900 base fragments. • Assembled > 96% of reads into contigs > 500bp; 8 complete genomes; reconstructed genes down to 0.05% of population abundance.
  42. 42. Sharon et al., 2013; pmid 22936250
  43. 43. Key strategy: abundance binning • Bin reads by k-mer abundance. • Assemble the most abundant bin. • Remove reads that map to the assembly. Sharon et al., 2013; pmid 22936250
  44. 44. Sharon et al., 2013; pmid 22936250
  45. 45. Tracking abundance Sharon et al., 2013; pmid 22936250
  46. 46. Conclusions • Recovered strain variation, phage variation, abundance variation, lateral gene transfer. • Claim that “recovered genomes are superior to draft genomes generated in most isolate genome sequencing projects.”
  47. 47. Environmental metagenomics: Deepwater Horizon spill • “Transcriptional response of bathypelagic marine bacterioplankton to the Deepwater Horizon oil spill.” Rivers et al., 2013, Moran lab. Pmid 23902988.
  48. 48. Sequencing strategy
  49. 49. Protocol for messenger RNA extraction, assembly, downstream analysis, including differential expression. Note: used overlapping paired- end reads.
  50. 50. Great Prairie Grand Challenge - soil • Together with Janet Jansson, Jim Tiedje, et al. • “Can we make sense of soil with deep Illumina sequencing?”
  51. 51. Great Prairie Grand Challenge - soil • What ecosystem level functions are present, and how do microbes do them? • How does agricultural soil differ from native soil? • How does soil respond to climate perturbation? • Questions that are not easy to answer without shotgun sequencing: o What kind of strain-level heterogeneity is present in the population? o What does the phage and viral population look like? o What species are where?
  52. 52. A “Grand Challenge” dataset (DOE/JGI) [Bar chart: basepairs of sequencing (Gbp), GAII vs. HiSeq, for Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass.] Total: 1,846 Gbp soil metagenome. For comparison: Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp.
  53. 53. Approach 1: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
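
The idea on slide 53 can be sketched in a few lines: stream through the reads, estimate each read's coverage as the median count of its k-mers among the reads kept so far, and discard the read once that estimate reaches the cutoff. This is a simplified illustration, not the khmer implementation: it uses an exact in-memory `Counter` where a real tool would use a fixed-memory probabilistic counter, it ignores reverse complements, and the function names and defaults (`k=20`, `cutoff=10`) are assumptions made for the example.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalization(reads, k=20, cutoff=10):
    """Keep a read only if its estimated coverage so far is below cutoff.

    Coverage is estimated as the median count of the read's k-mers among
    the reads already kept (exact counts here; an assumption of this sketch).
    """
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept

# 100 copies of the same read stand in for a 100x-covered region.
read_a = "ACGTTGCAAGCTTAGGCTAACGATCGGATCCTA"
kept = digital_normalization([read_a] * 100, k=20, cutoff=10)
print(len(kept))  # prints 10
```

Because only kept reads update the counts, the redundant 90x of the abundant genome never enters the table, which is where the disk and memory savings come from.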
  54. 54. Approach 2: Data partitioning (a computational version of cell sorting) Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences. “Divide and conquer” Memory-efficient implementation helps to scale assembly.
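
Partitioning by connectivity can be illustrated with a small union-find: two reads land in the same partition if they share at least one k-mer, directly or through a chain of other reads. This is a toy sketch under stated assumptions (exact k-mer lookup in a dictionary, no reverse complements, no error handling); the function name `partition_reads` and the example sequences are hypothetical and it is not the memory-efficient implementation the slide refers to.

```python
def partition_reads(reads, k=20):
    """Group read indices into partitions connected by shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in first_seen:
                a, b = find(idx), find(first_seen[kmer])
                if a != b:
                    parent[a] = b  # merge the two partitions
            else:
                first_seen[kmer] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())

# Two made-up "genomes" that share no 20-mers, with overlapping reads from each.
g1 = "ACGTTGCAAGCTTAGGCTAACGATCGGATCCTA"
g2 = "GGATTACAGGCATGAGCCACCGCGCCCGGCCTA"
reads = [g1[0:25], g2[0:25], g1[5:30], g2[5:30], g1[8:33]]
print(partition_reads(reads, k=20))  # prints [[0, 2, 4], [1, 3]]
```

Each partition can then be assembled on its own, which is the "divide and conquer" step.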
  55. 55. Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes)
      Total assembly   Total contigs (> 300 bp)   % reads assembled   Predicted protein coding
      2.5 bill         4.5 mill                   19%                 5.3 mill
      3.5 bill         5.9 mill                   22%                 6.8 mill
      Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. Adina Howe
  56. 56. Resulting contigs are low coverage. Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
  57. 57. Corn Prairie Iowa prairie & corn - very even.
  58. 58. Taxonomy– Iowa prairie Note: this is predicted taxonomy of contigs w/o considering abundance or length. (MG-RAST)
  59. 59. Taxonomy – Iowa corn Note: this is predicted taxonomy of contigs w/o considering abundance or length. (MG-RAST)
  60. 60. Strain variation? [Plot: top two allele frequencies vs. position within contig.] Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%. Can measure by read mapping.
  61. 61. Concluding thoughts on assembly (I) • There’s no standard approach yet; almost every paper uses a specialized pipeline of some sort. o More like genome assembly o But unlike transcriptomics…
  62. 62. Concluding thoughts on assembly (II) • Anecdotally, everyone worries about strain variation. o Some groups (e.g. Banfield, us) have found that this is not a problem in their system so far. o Others (viral metagenomes! HMP!) have found this to be a big concern.
  63. 63. Concluding thoughts on assembly (III) • Some groups have found metagenome assembly to be very useful. • Others (us! soil!) have not yet proven its utility.
  64. 64. High coverage is critical. Low coverage is the dominant problem blocking assembly of metagenomes
  65. 65. Questions -- • How much sequencing should I do? • How do I evaluate metagenome assemblies? • Which assembler is best?
  66. 66. Assembly strategies & techniques
  67. 67. Shotgun sequencing and coverage “Coverage” is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  68. 68. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (300 Gbp for human)
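
The coverage arithmetic on slides 67 and 68 is short enough to write down. The sketch below computes mean coverage as reads times read length over genome size, and then uses the standard Poisson approximation for random sampling to estimate what fraction of bases is covered at least a given number of times. The function names are made up for this example, and the Poisson model ignores real-world biases such as GC content or amplification.

```python
import math

def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads overlapping each base."""
    return n_reads * read_length / genome_size

def fraction_covered_at_least(depth, mean_cov):
    """Poisson estimate of the fraction of bases with coverage >= depth."""
    below = sum(
        math.exp(-mean_cov) * mean_cov ** i / math.factorial(i)
        for i in range(depth)
    )
    return 1.0 - below

# 300 Gbp of 100-bp reads on a ~3 Gbp human genome.
print(mean_coverage(3_000_000_000, 100, 3_000_000_000))  # prints 100.0

# At 10x mean coverage, how much of the genome actually reaches 10x?
print(fraction_covered_at_least(10, 10.0))
```

The second number is well below 1, which is one way to see why a mean of 10x is only where assembly "starts to work" and why deeper sampling is needed for robust recovery.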
  69. 69. Coverage distribution matters! (MD amplified)
  70. 70. Assembly depends on high coverage
  71. 71. Shared low-level transcripts may not reach the threshold for assembly.
  72. 72. K-mer based assemblers scale poorly Why do big data sets require big machines?? Memory usage ~ “real” variation + number of errors Number of errors ~ size of data set
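
The scaling argument on slide 72 can be checked with a toy simulation: sample reads from a fixed random genome, optionally introduce substitution errors, and count the distinct k-mers an assembler would have to store. All names and parameters here (`count_distinct_kmers`, a 2,000 bp genome, 1% error rate, k=21) are assumptions chosen for illustration, not measurements from real data.

```python
import random

def count_distinct_kmers(genome, n_reads, read_len, k, error_rate, seed=1):
    """Sample reads with substitution errors; return number of distinct k-mers."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i in range(read_len):
            if rng.random() < error_rate:
                # Substitute a different base at this position.
                read[i] = rng.choice([b for b in "ACGT" if b != read[i]])
        read = "".join(read)
        for i in range(read_len - k + 1):
            seen.add(read[i:i + k])
    return len(seen)

genome_rng = random.Random(0)
genome = "".join(genome_rng.choice("ACGT") for _ in range(2000))

for n_reads in (200, 2000):
    clean = count_distinct_kmers(genome, n_reads, 100, 21, 0.0)
    noisy = count_distinct_kmers(genome, n_reads, 100, 21, 0.01)
    print(n_reads, clean, noisy)
```

Without errors the count can never exceed the number of k-mers in the genome, however many reads are added; with errors it keeps growing with the number of reads.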
  73. 73. De Bruijn graphs scale poorly with data size. Conway T C, Bromage A J, Bioinformatics 2011;27:479-486. © The Author 2011. Published by Oxford University Press.
  74. 74. Practical memory measurements Velvet measurements (Adina Howe)
  75. 75. How much data do we need? (I) (“More” is rather vague…)
  76. 76. Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes)
      Total assembly   Total contigs (> 300 bp)   % reads assembled   Predicted protein coding
      2.5 bill         4.5 mill                   19%                 5.3 mill
      3.5 bill         5.9 mill                   22%                 6.8 mill
      Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. Adina Howe
  77. 77. Resulting contigs are low coverage. Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
  78. 78. How much? (II) • Suppose we need 10x coverage to assemble a microbial genome, and microbial genomes average 5e6 bp of DNA. • Further suppose that we want to be able to assemble a microbial species that is “1 in 100,000”, i.e. 1 in 1e5. • Shotgun sequencing samples randomly, so we must sample deeply to be sensitive. 10x coverage x 5e6 bp x 1e5 =~ 50e11, or 5 Tbp of sequence. Currently this would cost approximately $100k, for 10 full Illumina runs, but in a year we will be able to do it for much less.
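
The back-of-the-envelope estimate on slide 78 (and the "1 in a million" figure on slide 30) is a single multiplication. The helper below is only a restatement of that arithmetic; the function name and argument names are made up for this example, and it assumes, as the slides do, a uniform genome size and purely random sampling.

```python
def required_sequencing_bp(coverage, genome_size_bp, one_in_n):
    """Total bp needed to reach `coverage` on a genome that makes up
    1 in `one_in_n` of the community."""
    return coverage * genome_size_bp * one_in_n

# Slide 78: 10x coverage of a 5 Mbp genome at 1 in 100,000.
print(required_sequencing_bp(10, 5_000_000, 100_000) / 1e12)    # prints 5.0 (Tbp)

# Slide 30: the same genome at 1 in a million.
print(required_sequencing_bp(10, 5_000_000, 1_000_000) / 1e12)  # prints 50.0 (Tbp)
```

The required depth scales linearly with rarity, so each tenfold rarer target costs ten times more sequencing.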
  79. 79. We can estimate sequencing req’d: http://ivory.idyll.org/blog/how-much-sequencing-is-needed.html
  80. 80. “Whoa, that’s a lot of data…” http://ivory.idyll.org/blog/how-much-sequencing-is-needed.html
  81. 81. Some approximate metagenome sizes • Deep carbon mine data set: 60 Mbp (x 10x => 600 Mbp) • Great Prairie soil: 12 Gbp (4x human genome) • Amazon Rain Forest Microbial Observatory soil: 26 Gbp
  82. 82. Partitioning separates reads by genome. When computationally spiking HMP mock data with one E. coli genome (left) or multiple E. coli strains (right), the majority of partitions contain reads from only a single genome (blue) vs multi-genome partitions (green). Partitions containing spiked data are indicated with a *. Adina Howe
  83. 83. Is your assembly good? • Truly reference-free assembly is hard to evaluate. • Traditional “genome” measures like N50 and NG50 simply do not apply to metagenomes, because very often you don’t know what the genome “size” is.
  84. 84. Evaluating assembly [Diagram: reads (noisy observations of some genome) => assembler (a Big Black Box) => predicted genome.] Evaluating correctness of metagenomes is still undiscovered country.
  85. 85. Assembler overlap? See tutorial.
  86. 86. What’s the best assembler? [Bar chart, bird assembly: cumulative Z-score from ranking metrics for BCM*, BCM, ALLP, NEWB, SOAP**, MERAC, CBCB, SOAP*, SOAP, SGA, PHUS, RAY, MLK, ABL.] Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  87. 87. What’s the best assembler? [Bar chart, fish assembly: cumulative Z-score from ranking metrics for BCM, CSHL, CSHL*, SYM, ALLP, SGA, SOAP*, MERAC, ABYSS, RAY, IOB, CTD, IOB*, CSHL**, CTD**, CTD*.] Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  88. 88. What’s the best assembler? [Bar chart, snake assembly: cumulative Z-score from ranking metrics for SGA, PHUS, MERAC, SYMB, ABYSS, CRACS, BCM, RAY, SOAP, GAM, CURT.] Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  89. 89. Answer: for eukaryotic genomes, it depends • Different assemblers perform differently, depending on o Repeat content o Heterozygosity • Generally the results are very good (est completeness, etc.) but different between different assemblers (!) • There Is No One Answer.
  90. 90. Estimated completeness: CEGMA Each assembler lost different ~5% CEGs Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  91. 91. Tradeoffs in N50 and % incl. [Scatter plot, bird assembly: % of estimated genome size present in scaffolds >= 25 Kbp plotted against NG50 scaffold length (bp), for BCM, ALLP, SOAP, NEWB, MERAC, SGA, CBCB, PHUS, RAY, MLK, ABL.] Bradnam et al., Assemblathon 2: http://arxiv.org/pdf/1301.5406v1.pdf
  92. 92. Evaluating assemblies • Every assembly returns different results for eukaryotic genomes. • For metagenomes, it’s even worse. o No systematic exploration of precision, recall, etc. o Very little in the way of cross-comparable data sets o Often sequencing technology being evaluated is out of date o etc. etc.
  93. 93. Our experience • Our metagenome assemblies compare well with others, but we have little in the way of ground truth with which to evaluate. • Scaffold assembly is tricky; we believe in contig assembly for metagenomes, but not scaffolding. • See arXiv paper, “Assembling large, complex metagenomes”, for our suggested pipeline and statistics & references.
  94. 94. Metagenomic assemblies are highly variable Adina Howe et al., arXiv 1212.0159
  95. 95. How to choose a metagenome assembler • Try a few. • Use what seems to perform best (most bp > some minimum) • Try: MEGAHIT and SPAdes and IDBA.
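
The "most bp > some minimum" rule of thumb on slide 95 is easy to compute from a list of contig lengths. The sketch below summarizes each candidate assembly and picks the one with the most bases in contigs longer than a minimum length. The function names, the 1 kb default, and the toy contig lengths are all assumptions for illustration; the assembler names are only used as dictionary keys and the numbers are not real results.

```python
def summarize(contig_lengths, min_length=1000):
    """Basic stats for one assembly, given its contig lengths in bp."""
    long_contigs = [n for n in contig_lengths if n > min_length]
    return {
        "n_contigs": len(contig_lengths),
        "sum_bp": sum(contig_lengths),
        "n_long": len(long_contigs),
        "sum_long_bp": sum(long_contigs),
        "max_contig": max(contig_lengths, default=0),
    }

def pick_best(assemblies, min_length=1000):
    """Return the name of the assembly with the most bp in long contigs."""
    return max(
        assemblies,
        key=lambda name: summarize(assemblies[name], min_length)["sum_long_bp"],
    )

# Toy contig lengths, invented for the example.
assemblies = {
    "megahit": [500, 1500, 3000],
    "spades": [200] * 10 + [1200],
}
for name, lengths in assemblies.items():
    print(name, summarize(lengths))
print(pick_best(assemblies))  # prints megahit
```

This deliberately ignores correctness: an assembler can score well here by producing long misassembled contigs, so it is a first filter rather than an evaluation.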
  96. 96. An example assembly
  97. 97. Deep Carbon data set • Name: DCO_TCO_MM5 • Masimong Gold Mine; microbial cells filtered from fracture water from within a 1.9km borehole. (32,000 year old water?!) • M.C.Y. Lau, C. Magnabosco, S. Grim, G. Lacrampe Couloume, K. Wilkie, B. Sherwood Lollar, D.N. Simkus, G.F. Slater, S. Hendrickson, M. Pullin, T.L. Kieft, O. Kuloyo, B. Linage, G. Borgonie, E. van Heerden, J. Ackerman, C. van Jaarsveld, and T.C. Onstott
  98. 98. DCO_TCO_MM5: 20m reads / 2.1 Gbp in; 5.6m reads / 601.3 Mbp out. Steps: adapter trim & quality filter => diginorm to C=10 => trim high-coverage reads at low-abundance k-mers. “Could you take a look at this? MG-RAST is telling us we have a lot of artificially duplicated reads, i.e. the data is bad.” Entire process took ~4 hours of computation, or so. (Minimum genome size est: 60.1 Mbp)
  99. 99. Assembly stats, DCO_TCO_MM5 (Minimum genome size est: 60.1 Mbp)
      k     All contigs              Contigs > 1kb            Max contig
            N contigs   Sum BP       N contigs   Sum BP
      21    343263      63217837     6271        10537601     9987
      23    302613      63025311     7183        13867885     21348
      25    276261      62874727     7375        15303646     34272
      27    258073      62500739     7424        16078145     48742
      29    242552      62001315     7349        16426147     48746
      31    228043      61445912     7307        16864293     48750
      33    214559      60744478     7241        17133827     48768
      35    203292      60039871     7129        17249351     45446
      37    189948      58899828     7088        17527450     59437
      39    180754      58146806     7027        17610071     54112
      41    172209      57126650     6914        17551789     65207
      43    165563      56440648     6925        17654067     73231
  100. 100. Chose two (DCO_TCO_MM5): • A: k=43 (“long contigs”) o 165563 contigs o 56.4 Mbp o longest contig: 73231 bp • B: k=21 (“high recall”) o 343263 contigs o 63.2 Mbp o longest contig: 9987 bp. How to evaluate??
  101. 101. How many reads map back? Mapped 3.8m paired-end reads (one subsample): • high-recall: 41% of pairs map • longer-contigs: 70% of pairs map + 150k single-end reads: • high-recall: 49% of sequences map • longer-contigs: 79% of sequences map
  102. 102. Annotation/exploration with MG-RAST • You can upload genome assemblies to MG-RAST, and annotate them with coverage; tutorial to follow. • What does MG-RAST do?
  103. 103. Conclusion • This is a pretty good metagenome assembly – > 80% of reads map! • Surprised that the larger dataset (63.2 Mbp, “high recall”) accounts for a smaller percentage of the reads – 49% vs 79% for the 56.4 Mbp “long contigs” data set. • I now suspect that different parameters are recovering different subsets of the sample… • Don’t trust MG-RAST ADR calls.
  104. 104. You can estimate metagenome size…
  105. 105. Estimates of metagenome size Calculation: # reads * (avg read len) / (diginorm coverage) Assumes: few entirely erroneous reads (upper bound); saturation (lower bound). • E. coli: 384k * 86.1 / 5.0 => 6.6 Mbp est. (true: 4.5 Mbp) • MM5 deep carbon: 60 Mbp • Great Prairie soil: 12 Gbp • Amazon Rain Forest Microbial Observatory: 26 Gbp
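
The size calculation on slide 105 can be written out directly. The sketch below restates the slide's formula and reproduces the two examples for which the slides give inputs (E. coli, and the MM5 deep carbon data set at 601.3 Mbp after diginorm to C=10). The function names are invented for this example, and the same caveats as on the slide apply: erroneous reads inflate the estimate, and unsaturated sampling deflates it.

```python
def estimate_metagenome_size(n_reads, avg_read_len, diginorm_coverage):
    """Slide formula: # reads * (avg read len) / (diginorm coverage)."""
    return n_reads * avg_read_len / diginorm_coverage

def estimate_from_total_bp(total_bp, diginorm_coverage):
    """Same estimate when only the total bp after diginorm is known."""
    return total_bp / diginorm_coverage

# E. coli: 384k reads of average length 86.1 at diginorm coverage 5.
ecoli = estimate_metagenome_size(384_000, 86.1, 5.0)
print(round(ecoli / 1e6, 1))  # prints 6.6 (Mbp)

# MM5 deep carbon: 601.3 Mbp of reads retained at C=10.
mm5 = estimate_from_total_bp(601.3e6, 10)
print(round(mm5 / 1e6, 1))    # prints 60.1 (Mbp)
```

The E. coli estimate overshoots the true 4.5 Mbp, which matches the slide's caveat about erroneous reads.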
  106. 106. Extracting whole genomes? So far, we have only assembled contigs, but not whole genomes. Can entire genomes be assembled from metagenomic data? YES. Iverson et al. (2012), from the Armbrust lab, describe a technique for scaffolding metagenome contigs into ~whole genomes. PMID 22301318.
  107. 107. Concluding thoughts • What works? • What needs work? • What will work?
  108. 108. What works? Today, • From deep metagenomic data, you can get the gene and operon content (including abundance of both) from communities. • You can get microarray-like expression information from metatranscriptomics.
  109. 109. What needs work? • Assembling ultra-deep samples is going to require more engineering, but is straightforward. (“Infinite assembly.”) • Building scaffolds and extracting whole genomes has been done, but I am not yet sure how feasible it is to do systematically with existing tools (c.f. Armbrust Lab).
  110. 110. What will work, someday? • Sensitive analysis of strain variation. o Both assembly and mapping approaches do a poor job detecting many kinds of biological novelty. o The 1000 Genomes Project has developed some good tools that need to be evaluated on community samples. • Ecological/evolutionary dynamics in vivo. o Most work done on 16s, not on genomes or functional content. o Here, sensitivity is really important!
  111. 111. The interpretation challenge • For soil, we have generated approximately 1200 bacterial genomes worth of assembled genomic DNA from two soil samples. • The vast majority of this genomic DNA contains unknown genes with largely unknown function. • Most annotations of gene function & interaction are from a few phylogenetically limited model organisms o Est 98% of annotations are computationally inferred: transferred from model organisms to genomic sequence, using homology. o Can these annotations be transferred? (Probably not.) This will be the biggest sequence analysis challenge of the next 50 years.
  112. 112. What are future needs? • High-quality, medium+ throughput annotation of genomes? o Extrapolating from model organisms is both immensely important and yet lacking. o Strong phylogenetic sampling bias in existing annotations. • Synthetic biology for investigating non-model organisms? (Cleverness in experimental biology doesn’t scale.) • Integration of microbiology, community ecology/evolution modeling, and data analysis.

Editor's notes

  • Both BLAST. Shotgun: annotated reads only.
  • Beware of Janet Jansson bearing data.
  • Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to e.g. marine free spawning animals.
  • A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges due to the underlying genome plateaus as the number of sequence reads increases, once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, the number of erroneous edges scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.