9. Phylogenetic trees
H. sapiens ATG CTC TAT GAG
P. troglodytes ATG CTC TTT GAG
G. gorilla ATG CTT TAT TAC
Pairwise matching sites (of 12):
                  P. troglodytes   G. gorilla
H. sapiens              11              9
P. troglodytes                          8
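The pairwise numbers on this slide can be reproduced by counting the aligned sites at which two sequences agree. The following is a minimal sketch, assuming the scores are simple match counts over the 12 aligned bases; the function name `matching_sites` is illustrative, not from the talk.

```python
from itertools import combinations

# Codon alignment copied from the slide (spaces between codons removed).
alignment = {
    "H. sapiens": "ATGCTCTATGAG",
    "P. troglodytes": "ATGCTCTTTGAG",
    "G. gorilla": "ATGCTTTATTAC",
}


def matching_sites(seq_a, seq_b):
    """Count aligned positions where two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a == b for a, b in zip(seq_a, seq_b))


# One score per unordered pair of taxa.
pairwise = {
    (a, b): matching_sites(alignment[a], alignment[b])
    for a, b in combinations(alignment, 2)
}

for (a, b), score in pairwise.items():
    print(f"{a} vs {b}: {score} of {len(alignment[a])} sites match")
```

The pair with the most matching sites (human and chimpanzee) would be joined first in a simple distance-based tree.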
13. Field-sequencing for real
Conditions
100% humidity; 6-13 °C
Essential kit
800 W generator
3x laptops
Centrifuge
Waterbath
Polystyrene boxes (lots)
Kettle(…!)
Yield
>400Mbp data in three days;
A. thaliana ~2.01x coverage
14. Snowdonia, HelloWorld & ‘tent-seq’
Arabidopsis thaliana & Arabidopsis lyrata
Congeneric species;
Reference genomes available
Field-sequenced (MinION) &
Lab-sequenced (Illumina™)
Orthogonal BLAST:
4 sample × sequencer combinations
Compare TRUE & FALSE rates for
varying ID statistic cutoffs
15. Field- vs. lab-sequenced sample ID
Match individual reads to
each reference with BLAST
Compare match lengths in
TRUE and FALSE cases
‘Length bias’ ID stat:
length_TRUE - length_FALSE
Compare TRUE & FALSE
rates as length bias cutoff
varies
MiSeq (lab)
MinION (field)
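A minimal sketch of the length-bias statistic and the cutoff sweep follows. The decision rule is an assumption (a read is called for the TRUE reference when its bias exceeds the cutoff, for the FALSE reference when it falls below the negative cutoff, and left uncalled otherwise), and the match lengths are invented toy values, not data from the study.

```python
def length_bias(len_true, len_false):
    """Length-bias ID statistic for one read: TRUE match length minus FALSE match length."""
    return len_true - len_false


def rates_at_cutoff(biases, cutoff):
    """Return (TRUE-call rate, FALSE-call rate) for a given length-bias cutoff.

    Assumed rule: bias > cutoff is a TRUE call, bias < -cutoff is a FALSE call,
    anything in between is left unassigned.
    """
    n = len(biases)
    true_calls = sum(b > cutoff for b in biases)
    false_calls = sum(b < -cutoff for b in biases)
    return true_calls / n, false_calls / n


# Hypothetical BLAST match lengths in bp: (match to TRUE reference, match to FALSE reference).
toy_hits = [(950, 120), (400, 380), (60, 75), (1200, 0), (310, 450)]
biases = [length_bias(t, f) for t, f in toy_hits]

for cutoff in (0, 50, 200):
    true_rate, false_rate = rates_at_cutoff(biases, cutoff)
    print(f"cutoff {cutoff:>3} bp: TRUE rate {true_rate:.2f}, FALSE rate {false_rate:.2f}")
```

Raising the cutoff trades sensitivity for precision: fewer reads are called, but fewer of the calls are wrong.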
16. Bitty data (1) partial queries
Subsample MinION output
Repeat ID pipeline, record
mean ID stat s_bias
Replicates: N = 30
Simulate from 10^0 to 10^4
reads (≈instant → hours)
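The subsampling experiment can be sketched as below, assuming per-read ID statistics are already available as a list. The subsample sizes run from 1 to 10,000 reads (reading the slide's range as powers of ten, which is an assumption), with 30 replicates each. The input values are randomly generated stand-ins, and `replicate_means` is a hypothetical helper name.

```python
import random
from statistics import mean, pstdev


def replicate_means(per_read_bias, n_reads, n_replicates=30, seed=0):
    """Mean ID statistic in each of n_replicates random subsamples of n_reads reads."""
    rng = random.Random(seed)
    return [mean(rng.sample(per_read_bias, n_reads)) for _ in range(n_replicates)]


# Invented stand-in for the per-read length-bias values of a full MinION run.
_rng = random.Random(42)
toy_bias = [_rng.gauss(200.0, 100.0) for _ in range(20_000)]

# Subsample sizes 10^0 .. 10^4 reads.
results = {n: replicate_means(toy_bias, n) for n in (10**k for k in range(5))}

for n, means in results.items():
    print(f"{n:>6} reads: mean s_bias {mean(means):7.1f}, spread {pstdev(means):6.1f}")
```

The spread across replicates shrinks as the subsample grows, which is the pattern the slide's plot is meant to show.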
17. Bitty data (2) partial references
Take reference genome at
high contiguity
Fragment randomly to
target (low) contiguity
Repeat read identification
using fragmented DB
Simulate N50 ≈1,000bp
to N50 ≈ 10Mbp
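One way to fragment a reference down to a target contiguity is sketched here, working on contig lengths only. The cutting scheme (pick a contig with probability proportional to its length, cut at a uniformly random point, repeat until N50 falls to the target) is an assumption for illustration; the talk does not specify the procedure used.

```python
import random


def n50(lengths):
    """Length of the contig at which the sorted cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


def fragment_to_n50(lengths, target_n50, seed=0):
    """Randomly cut contigs until the N50 is at or below target_n50."""
    if target_n50 < 1:
        raise ValueError("target N50 must be at least 1 bp")
    rng = random.Random(seed)
    pieces = list(lengths)
    while n50(pieces) > target_n50:
        # Longer contigs are proportionally more likely to be cut.
        i = rng.choices(range(len(pieces)), weights=pieces)[0]
        if pieces[i] < 2:
            continue  # a 1 bp piece cannot be cut further
        cut = rng.randint(1, pieces[i] - 1)
        pieces[i:i + 1] = [cut, pieces[i] - cut]
    return pieces


# Toy 'reference' of three contigs (lengths in bp), fragmented to N50 <= 10 kbp.
reference = [1_000_000, 500_000, 250_000]
fragments = fragment_to_n50(reference, target_n50=10_000)
print(len(fragments), "fragments; N50 =", n50(fragments))
```

The fragment lengths can then be used to slice the actual reference sequences before rebuilding the BLAST database.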
18. Keeping it simple: Kew Science Festival
Six species: whole genome-
skim samples with MinION
in preparation
Build BLAST DBs from
skimmed data
Select ‘unknown’ (blinded)
sample, extract DNA and
resequence in real-time
Compare to partial DBs in
six-way BLAST competition
Live ID ?
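The six-way competition could be scored along these lines. This is a sketch under stated assumptions: each read votes for the database giving its longest BLAST match, ties and reads without hits are ignored, and the species with the most votes is the live ID. The voting rule, the species labels `sp_A` to `sp_F`, and the match lengths are all hypothetical.

```python
from collections import Counter


def tally_votes(read_hits):
    """Count, per database, the reads whose longest match is uniquely to that database.

    read_hits: one dict per read, mapping database name -> best match length (bp).
    """
    votes = Counter()
    for hits in read_hits:
        if not hits:
            continue  # read matched nothing
        best = max(hits.values())
        if best <= 0:
            continue
        winners = [db for db, length in hits.items() if length == best]
        if len(winners) == 1:  # ties carry no information, so they are skipped
            votes[winners[0]] += 1
    return votes


# Invented match lengths for four reads against up to six partial databases.
toy_reads = [
    {"sp_A": 820, "sp_B": 90, "sp_C": 0, "sp_D": 0, "sp_E": 40, "sp_F": 0},
    {"sp_A": 300, "sp_B": 300},  # tie: no vote
    {"sp_B": 150, "sp_A": 610},
    {},  # no hits
]

votes = tally_votes(toy_reads)
print(votes.most_common())
```

Because votes accumulate read by read, the tally can be updated as reads come off the sequencer.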
19. de novo genome assembly
Data                                   MiSeq only     MiSeq + MinION
Assembler                              ABySS          hybridSPAdes
Illumina reads, 300 bp paired-end      8,033,488      8,033,488
Illumina data (yield)                  2,418 Mbp      2,418 Mbp
MinION reads, R7.3 + R9 kits,
  N50 ~ 4,410 bp                       -              96,845
MinION data (yield)                    -              240 Mbp
Approx. coverage                       19.49x         19.49x + 2.01x
Assembly key statistics:
  # contigs                            24,999         10,644
  Longest contig                       90 kbp         414 kbp
  N50 contiguity                       7,853 bp       48,730 bp
  Fraction of reference genome (%)     82             88
  Errors, per 100 kbp:
    # N's                              1.7            5.4
    # mismatches                       518            588
    # indels                           120            130
  Largest alignment                    76,935 bp      264,039 bp
CEGMA gene completeness estimate:
  # genes                              219 of 248     245 of 248
  % genes                              88%            99%
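The derived figures in this comparison can be checked with a few lines of arithmetic. The sketch below recomputes the CEGMA percentages and the N50 improvement from the table's own numbers. The implied genome size assumes that coverage is simply yield divided by genome size, which is an assumption, not something stated on the slide.

```python
# Figures copied from the assembly comparison table.
cegma_total = 248
cegma_found = {"MiSeq only": 219, "MiSeq + MinION": 245}
n50_bp = {"MiSeq only": 7_853, "MiSeq + MinION": 48_730}
yield_mbp = {"Illumina": 2_418, "MinION": 240}
coverage_x = {"Illumina": 19.49, "MinION": 2.01}

# CEGMA completeness as a percentage of the 248 core genes.
cegma_pct = {k: 100 * v / cegma_total for k, v in cegma_found.items()}

# Fold change in N50 when MinION reads are added.
n50_fold = n50_bp["MiSeq + MinION"] / n50_bp["MiSeq only"]

# Genome size implied by each platform's figures (assumes coverage = yield / genome size).
implied_genome_mbp = {k: yield_mbp[k] / coverage_x[k] for k in yield_mbp}

for name, pct in cegma_pct.items():
    print(f"{name}: {pct:.1f}% of CEGMA genes")
print(f"N50 improvement: {n50_fold:.1f}-fold")
for name, size in implied_genome_mbp.items():
    print(f"{name}: implied genome size {size:.1f} Mbp")
```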
25. Key:
Extant node
Inferred node
Synteny edge (physical connection)
Phylogeny edge (evolutionary connection)
Identity edge (organismal connection)
Three-colour graphs: phylogeny, synteny & identity
[Figure: example three-colour graph; node labels a, b, c, d, e, x, y, z]
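The key above maps naturally onto a graph with two node classes and three edge classes. The following is a minimal sketch of such a data structure, assuming undirected edges for simplicity. The class name, method names and the toy topology are hypothetical and do not reproduce the figure on the slide.

```python
from dataclasses import dataclass, field

EDGE_KINDS = ("phylogeny", "synteny", "identity")


@dataclass
class ThreeColourGraph:
    """Nodes are extant or inferred; each edge has one of three 'colours'."""
    extant: set = field(default_factory=set)
    inferred: set = field(default_factory=set)
    edges: dict = field(default_factory=lambda: {k: set() for k in EDGE_KINDS})

    def add_node(self, name, inferred=False):
        (self.inferred if inferred else self.extant).add(name)

    def add_edge(self, kind, u, v):
        if kind not in self.edges:
            raise ValueError(f"unknown edge kind: {kind}")
        if u == v or not {u, v} <= (self.extant | self.inferred):
            raise ValueError("edges need two distinct, known nodes")
        self.edges[kind].add(frozenset((u, v)))  # undirected edge

    def neighbours(self, node, kind):
        """Nodes joined to `node` by an edge of the given kind."""
        return {n for edge in self.edges[kind] if node in edge for n in edge if n != node}


# Toy example: x is an inferred ancestor of loci a and b;
# a and c sit on the same chromosome; b and c come from the same organism.
g = ThreeColourGraph()
for name in "abc":
    g.add_node(name)
g.add_node("x", inferred=True)
g.add_edge("phylogeny", "x", "a")
g.add_edge("phylogeny", "x", "b")
g.add_edge("synteny", "a", "c")
g.add_edge("identity", "b", "c")

print(sorted(g.neighbours("x", "phylogeny")))
```

Keeping the three edge sets separate makes it easy to ask colour-specific questions, such as which loci share a chromosome, without touching the phylogeny.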
28. Step back: molecular evolution
“Horizontal gene transfer occurs x more frequently in these lineages,
because of this biology”
“Convergent evolution is rare in most genes, in most organisms, but y times
greater in these gene families …because of this biology”
“New chromosomes are created & destroyed at z, q rates in this
reproductive strategy …because of this biology”
31. The tools aren’t in great shape but the prizes are there
bionode.js
bioboxes.org
Singularity
Portable sequencing, by anyone means
really Big Data
Informatics connecting this data through
explicit models is inference
Scalable, reproducible, sustainable research:
40. Thanks, funders, contacts and questions
Oxford Nanopore Technologies Ltd.:
Dan Turner, Richard Ronan, Gerrard Coyne
U Bangor:
Alexander S.T. Papadopulos (@metallophyte)
RBG Kew:
Postdocs: Andrew Helmstetter (@ajhelmstetter); Tim Coker
Thanks: Dion Devey, Robyn Cowan, Tim Wilkinson, Stephen Dodsworth, Pepijn Kooij, Felix
Forest, Bill Baker, Jan T. Kim, Jenny Williams, Abigail Barker, Mark Lee, Jim Clarkson, Mike
Chester, Ester Gaya, Lisa Pokorny, Laszlo Csiba, Paul Wilkin, Richard Buggs, Mike Fay, Mark
Chase, Ilia Leitch
QMUL
Laura Kelly, Kalina Davies, Steve Rossiter
Oxford
Aris Katzourakis, Oli Pybus, Jayna Raghwani
Others
Forest Research: Daegan Inward, Katy Reed
Dstl: Claire Lonstale, James Taylor
Birmingham: Nick Loman, Josh Quick
U. Utah: Bryn Dentinger
Imperial: James Rosindell
This research was conducted in the Sackler Phylogenomics Laboratory and was supported by the Calleva Foundation Phylogenomic Research Programme and the Sackler Trust
@lonelyjoeparker:
joe.parker@kew.org
Editor's notes
Definitions
Genetic data, what it is, where it’s found, how we get it
A genome / assembly
Annotation and alignment
A phylogeny or tree
Naming stuff
The ladder of life
Binomial / ontological naming
Darwin and The Tree
Networks
Portable sequencing: also long reads and real-time
Portable
Real-time
Long
easy
Data in terrible conditions but anyone can do it
Social media reach: The Atlantic, The Economist
Direct, explicit, orthogonal test – and can it work?
Picture of experimental design
Outline of the study
In terms of bioinformatics questions
Funding: a first pot and timeline…
We compare match lengths, and MinION allows long matches
EXPLAIN AXES: precision improves rapidly
EXPLAIN AXES: a partial REFERENCE would work, too
MORE FUNDING. SO simple a kid could do it? Yes
The challenge I set myself: OK, it’s a simple experiment. Can I build a test simple enough that a child can understand it?
SOCIAL MEDIA
Funding: NANOPORE
Data from one time and place can and should be useful elsewhere
lash a bit of proper genomics
Single reads match whole genes – meat & drink
EXPLAIN AXES postdoc-years PAPER ACCEPTED
Genomes come in all shapes and sizes
Organisms too, life cycles
(A)sexual reproduction; clonal replication
Even genetic alphabet not fixed
And mutation isn’t random
Incongruence and reticulation
Horizontal gene transfer
Incomplete lineage sorting
Hybridization
Recombination
Networks attempt to summarise this
Splits graphs, directed graphical models / planar graphs
Definition
Features
Generalised representations
Phylogeny edge information workable-outable
Other edges present in metadata; inferable
Generative model; easy to interpret…
Here’s a common framework for all these studies
How to infer – sounds like a nightmare
Many of the edges in this network are really there already
Shifting paradigms, making linking easier
Explicitly model phylogeny, synteny and identity
Edge support reflects evidence; deviations from neutrality reflect hypotheses/models/phenomena
Any nodes connecting to an identity edge are considered completely connected
Maximum # edges ~ n(2n-1)/2
Digraphs ~ n!!
Possible ancestors from one locus on n taxa essentially inverse function of when they coalesce (can have m generations of n ancestors until an event where n(m) < n(t))
EXAMPLES
Gene duplication e.g. paralogue in animal
Tetraploid formed then secondary diploidization, e.g. plant
Inversion in a genome
Unlinked loci (e.g. bacterial plasmids) and HGT.
We need enough data to turn observations into empirical comparisons, and those into models and laws
We know a lot about evolutionary mechanisms
And a lot about (a handful of) genomes
What we know tells us “it’s complicated”
Most genes don’t have simple orthologues etc. etc. etc., horizontal transfer etc.
But we don’t, really, have an empirical understanding of how they fit together, e.g.:
- “horizontal gene transfer occurs x more frequently in these lineages, because of this biology”
- adaptive molecular convergence is rare in most genes, in most organisms, but y times greater in these gene families because of this biology
- new chromosomes are created (by duplication, endogenisation, polyploidy) and destroyed (by diploidization) at z rates in this reproductive strategy because of biology
Global databases
Algorithms, methods and theory
Generally bespoke / slow / in-house
Special sauce
Formally linking datasets and models is inferring the network of life
Shifts the job for bioinformatics from something it’s good at – sophisticated, incremental analysis
To something computers in general are great at: linking elements
In this case informatics doesn’t enable research, it is the process of inference
It’s relatively easy to write a new standalone app to do x, or analyse some big dataset
Reproducibility and scaling-up science mean we must work harder on the links
Informatics as inference.
The lonely astronomers.
HPCs to apps: Exponential data, linear understanding.
Pause – to recap
This is important because it’s where we tie it together and show my contribution:
Portable sequencers, easier to use
More places
More experimenters
More data
More noise
Efficient comparison?
Dynamic computation?
Clever hashing
Portable, mass sequencing is really here
Massive potential for de novo genomics; phylogenomics
But while we’re accumulating information at an exponential rate, we’re integrating it linearly, in essence
… where are we going?
Superset of species ID
Distribution of species, sometimes functional focus
We may not have positive controls
We usually don’t know ‘normal’ distribution
From a fringe idea to routine
Gut microbiome
Many other tissues
UTI
Dental
Cardiac
Respiratory
Not just human health; pathogen surveillance
Ecosystems; habitats; communities; niches
Properly ecologists’ domain
Thousands of species
Abundances shift in time and space
All trackable with DNA
Where do ecosystem services come from?
How healthy?
What’s really out there
Longitudinal data
Fixed locations
Parallels with earth sensing
Autonomous in-situ sensor platforms
Data collection in aggregate means we can asymptotically assemble the components we need for the Tree Of Life
This is loosely defined as the Map Of The World for genomic stuff
Not exactly simple, but a computational and engineering challenge, not really intellectually taxing (probably)
Pretty much the biggest goal in evolution
The cosmology of life
Why genomes/chromosomes?
Why that size?
Why organisms?
Where is the root
Sequence-space and the network as a state-space
Inflation
Probability function of you