Real-time phylogenomics
or
‘Some interesting problems in
genomic big data’
Joe Parker
Early-career Research Fellow, RBG Kew
@lonelyjoeparker:
joe.parker@kew.org
Incredible times for bioscience
Images – Wikimedia Commons CC BY-SA (clockwise from top left: Jeroen Rouwkema, @aGastya, author’s own, @RE73)
Background
MENU
Some definitions
Adventures with a genome sequencer
Evolution is complex
Real-time data & mass sequencing
Final thoughts: the cosmology of life
1. Definitions
Genes and genomes
>ENA|AY819028|AY819028.1 Capsicum annuum cultivar Hot 1493 acyltransferase (Pun1) gene, complete cds.
TCATTAGAAGGTCATACCGCTCCACGAAAATGCACCTTGAAAGATATAACACGGACAACGAATCATTATCCCCATCATCACTATTACTCCCACTTCC
CTTGCACTCTTCACTGTCACCACTGACACTCCGCTTGGCAACATTTTCACTAGAATCGACGTAGTCGCTTATCTCCTTTAACTCCGAATCTGATTCG
GACACCGACTGTTCAATTTTCTTTCTTTTTGAGACTTTTTCAACTGCTTCAGTTCTTCTTTTTCTACTGTTACCAGCGGTACCGGCTTTGCGTTTAGG
AATGATGTTTTTTTTACCCATTTTCAACACAATCTACACCTAAAGAACAAATCTCCCATTTTTAGTTCATAGACCACAAGTCTATCAACAGAAATAACT
CAATGCTCAAATGAACCCCCCTCCCCCCAAAAAAAAATTAACAAACACCCCACCATTAAACAGTTCACTACACAAACATACAATAACTGAACCAAAAT
CCAACATGCAATATCAAAACACAACAATTACTAAAATCAAACTAATGCACCTAATCAAACTAATTAGCTATTAATATTTCAATTTTCACTATTTCAGCAA
TCATGTTTTAAAAGAATTTCATACGTCTGAAAATTGATATATATCTAGGGCATTCTCATTTCATAGACCACGGGTCTACGGATAGACCTCGGGTCTAC
GAACAGAAATAGGTCGCTGTTCAAATCAAAATGCCAAAATAACTCTTCAAACAACTATTATCCCACCATTCAACACTTCGTTGCTAAATAAACCACAA
CTAAACCAAAACACCAAATTCGAAGAAAAAATTTCTACATCACTACGAGTTGATTAGCAAAAAAAAACGTTTAAATGGATCTAGAAATGATCGAAACT
TGATTTTAACTAACCTTGCAAAGCAGCAACAACCCCTTAGTAGCTGGAGAAGAAGACGAAATGAAAATGGCATTTTTGGAAGAAGTAGTTTCAAAAG
CAGGAGTTGGGAATTGAAGAGGAGAGAGAGGGTGGGTTTTTTTAAATATTGGAATAATTGGAGGGTGTTAGGTGTATTATATTAAATTTGTAAAGTT
GTAAAAATGATGAATTGGTCCCTTGGCCGATGCGTGGGCCCCACTTTTTCATAAAAAATAAATCAAAAAGAAATTAAGTAGGTATTTGACAAATTAAT
TTTGGAGGGTTCCTTCTTTGCCAATTATTCCCCACTAAGCTACTCTCATTCACTCTTATATTATAGATTATAGTATAAAGTAATACAAACTATGAATTG
TTTTTATATTTTATTTTACAAGTTATGAATAGTGTTTATATAGGTCTCTATTTCCATACAATCACATTTTGTGGGCAGTTTTTTTGGGATTGTCACGAAG
GCGAGGTTTGTTCATTTTGTGGAAAGAGAATTGGATTTCTACATTTTTATCATCTTCTAGGTGTGATGTTGATACTACTATTTGCCCAAATATTTGTTT
TAAACATATTAATATTATGTATCAAAATGTGTACAATATAATTTAACACACGTGCAGTATGCATGTATCGCGAAACTAGTTAATTACATGCATCACATG
TAATAGCAATAGTATTATTGTACGACGTACTAATATATTAGTATCTATTCTAGCTACTAATTTCCTCTTAACCGTCTCCATGCTGAAAACAACGCCACA
GTGCAACGAGCCTTCTATAAAAGTTGAATTATATAAAAATAAGGTACAGTTTAGAAATAAAACTAACAAAAAGGTAACCTATAGTTTGGGGGTTGGGT
AGAGGTTGTTTAGCCAGTAACTCTATTATTTCATTTCCTTTTGTCTATATAAGTGTATCCATATATGCAAGAAATGTCAACCGGCCAGCAGCATATAT
TTATTTGTTAAATTAATTATGGCTTTTGCATTACCATCATCACTTGTTTCAGTTTGTAACAAATCTTTTATCAAACCTTCCTCTCTCACCCCCTCTACAC
TTAGATTTCACAAGCTATCTTTCATCGATCAATCTTTAAGTAATATGTATATCCCTTGTGCATTTTTTTACCCTAAAGTACAACAAAGACTAGAAGACT
CCAAAAATTCTGATGAGCTTTCCCATATAGCCCACTTGCTACAAACATCTCTATCACAAACTCTAGTCTCTTACTATCCTTATGCTGGAAAGTTGAAG
GACAATGCTACTGTTGACTGTAACGATATGGGAGCTGAGTTCTTGAGTGTTCGAATAAAATGTTCCATGTCTGAAATTCTTGATCATCCTCATGCAT
CTCTTGCAGAGAGCATAGTTTTGCCCAAGGATTTGCCTTGGGCGAATAATTGTGAAGGTGGTAATTTGCTTGTAGTTCAAGTAAGTAAGTTTGATTG
TGGGGGAATAGCCATCAGTGTATGCTTTTCGCACAAGATTGGTGATGGTTGCTCTCTGCTTAATTTCCTTAATGATTGGTCTAGCGTTACTCGTGAT
CATACGACAACAACTTTAGTTCCATCTCCTAGATTTGTAGGAGATTCAGTCTTCTCTACACAAAAATATGGTTCTCTCATTACGCCACAAATTTTGTC
CGATCTCAACCAGTGCGTACAGAAAAGACTCATTTTTCCTACAGATAAGTTAGATGCACTTCGAGCTAAGGTAATACTACCATCGTCCATTATTGTTT
GTCTTACGGTATTTTTGAAAAGAATAATATTTAATAGTCTTCTTGAGACATATTTCACTTAACAAGCCTAGGCTATTTAGTCTATTTGTAGAAGCTACT
CTTAAACGCCTCACTTAGTTAATAGCACTCCACTTATTGGTGTCAAAAACTACTCTTGGACATGTCATTTACTTAATAACACTCCACTTAATTATCGAA
Alignment, assembly, annotation
Li et al. (2011)
EBI / NCBI / DDBJ
Three millennia of modelling life
Phylogenetic trees
H. sapiens      ATG CTC TAT GAG
P. troglodytes  ATG CTC TTT GAG
G. gorilla      ATG CTT TAT TAC

                P. troglodytes  G. gorilla
H. sapiens      11              9
P. troglodytes                  8
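A minimal sketch of where the numbers in the pairwise matrix can come from, assuming they count identical sites between each pair of the 12-site aligned sequences on the slide. The function name `identical_sites` is illustrative and not from the talk.

```python
from itertools import combinations

# Toy alignment copied from the slide (four codons, 12 sites per species).
alignment = {
    "H. sapiens": "ATGCTCTATGAG",
    "P. troglodytes": "ATGCTCTTTGAG",
    "G. gorilla": "ATGCTTTATTAC",
}


def identical_sites(a: str, b: str) -> int:
    """Count aligned positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b))


# One entry per unordered species pair, in the order the species are listed.
pairwise = {
    (s1, s2): identical_sites(alignment[s1], alignment[s2])
    for s1, s2 in combinations(alignment, 2)
}

for (s1, s2), n in pairwise.items():
    print(f"{s1} vs {s2}: {n} of {len(alignment[s1])} sites identical")
```

Under that reading the script gives 11, 9 and 8, matching the slide; a tree-building method would then group the most similar pair first.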
Phylogenetic trees
2. Adventures with DNA sequencing
Field-sequencing for real
Conditions
100% humidity; 6-13°C
Essential kit
800 W generator
3x laptops
Centrifuge
Waterbath
Polystyrene boxes (lots)
Kettle(…!)
Yield
>400Mbp data in three days;
A. thaliana ~2.01x coverage
Snowdonia, HelloWorld & ‘tent-seq’
A. thaliana Arabidopsis lyrata
Congeneric species; reference genomes available
Field-sequenced (MinION) & lab-sequenced (Illumina™)
Orthogonal BLAST: 4 sample*sequencer combinations
Compare TRUE & FALSE rates for varying ID statistic cutoffs
Field- vs. lab-sequenced sample ID
Match individual reads to each reference with BLAST
Compare match lengths in TRUE and FALSE cases
‘Length bias’ ID stat: length_TRUE - length_FALSE
Compare TRUE & FALSE rates as length bias cutoff varies
MiSeq (lab)
MinION (field)
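One possible reading of the ‘length bias’ statistic, sketched under stated assumptions. Each read has a longest BLAST alignment against its true species' reference and against the other reference; the statistic is the difference. The decision rule (a true ID when the bias exceeds the cutoff, a false ID when it falls below the negative cutoff, unassigned otherwise) and the toy hit lengths are hypothetical, not taken from the study.

```python
from typing import NamedTuple


class ReadHits(NamedTuple):
    """Longest BLAST alignment lengths (bp) for one read; 0 means no hit."""
    length_true: int   # against the reference of the read's real species
    length_false: int  # against the other species' reference


def length_bias(hit: ReadHits) -> int:
    """The slide's statistic: length_TRUE minus length_FALSE."""
    return hit.length_true - hit.length_false


def rates_at_cutoff(hits, cutoff):
    """Return (true-ID rate, false-ID rate) for one cutoff (assumed rule)."""
    n = len(hits)
    true_ids = sum(length_bias(h) > cutoff for h in hits)
    false_ids = sum(length_bias(h) < -cutoff for h in hits)
    return true_ids / n, false_ids / n


# Hypothetical alignment lengths, for illustration only.
toy_hits = [
    ReadHits(1200, 300),
    ReadHits(800, 790),
    ReadHits(150, 400),
    ReadHits(2500, 0),
    ReadHits(500, 500),
]

for cutoff in (0, 50, 500):
    true_rate, false_rate = rates_at_cutoff(toy_hits, cutoff)
    print(f"cutoff {cutoff:>4} bp: TRUE rate {true_rate:.1f}, FALSE rate {false_rate:.1f}")
```

Raising the cutoff trades sensitivity for specificity: fewer reads are assigned, but fewer are assigned to the wrong species.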
Bitty data (1) partial queries
Subsample MinION output
Repeat ID pipeline, record mean ID stat s_bias
Replicates: N = 30
Simulate from 10^0 – 10^4 reads (≈instant → hours)
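A rough sketch of the subsampling experiment, assuming the read counts on the slide run from 10^0 to 10^4 and that each replicate draws reads without replacement. The per-read bias values are simulated stand-ins, not real BLAST output, and `subsample_mean_bias` is a hypothetical helper name.

```python
import random
from statistics import mean


def subsample_mean_bias(biases, n_reads, replicates=30, seed=0):
    """Mean length bias in each of `replicates` random subsamples of n_reads reads."""
    rng = random.Random(seed)
    return [mean(rng.sample(biases, n_reads)) for _ in range(replicates)]


# Simulated per-read length biases (bp); a placeholder for a real sequencing run.
_rng = random.Random(42)
all_biases = [_rng.gauss(200, 400) for _ in range(20_000)]

for exponent in range(0, 5):  # 10^0 ... 10^4 reads per subsample
    n = 10 ** exponent
    reps = subsample_mean_bias(all_biases, n)
    print(f"{n:>6} reads: replicate means range from {min(reps):8.1f} to {max(reps):8.1f}")
```

The spread between replicates shrinks as more reads are included, which is the quantity of interest when asking how soon after starting a run an identification becomes stable.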
Bitty data (2) partial references
Take reference genome at high contiguity
Fragment randomly to target (low) contiguity
Repeat read identification using fragmented DB
Simulate N50 ≈ 1,000bp to N50 ≈ 10Mbp
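A sketch of the fragmentation idea, working on sequence lengths only. N50 is computed in the usual way (the length at which sequences of that size or longer hold at least half of all bases). The breaking scheme, with breaks falling uniformly along the genome until the target N50 is reached, is an assumption; the slide does not say how the fragmentation was done. Chromosome lengths are hypothetical.

```python
import random


def n50(lengths):
    """Length L such that sequences of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


def fragment(lengths, target_n50, seed=0):
    """Break pieces at random positions until N50 drops to target_n50 or below."""
    if target_n50 < 1:
        raise ValueError("target_n50 must be at least 1")
    rng = random.Random(seed)
    pieces = list(lengths)
    while n50(pieces) > target_n50:
        # Pick a piece with probability proportional to its length,
        # as if a break landed uniformly along the genome.
        i = rng.choices(range(len(pieces)), weights=pieces)[0]
        if pieces[i] < 2:
            continue  # a 1 bp piece cannot be split further
        cut = rng.randint(1, pieces[i] - 1)
        pieces[i:i + 1] = [cut, pieces[i] - cut]
    return pieces


# Hypothetical reference: five chromosomes of 3 Mbp each.
chromosomes = [3_000_000] * 5
fragments = fragment(chromosomes, target_n50=100_000, seed=1)
print(f"{len(fragments)} fragments, N50 = {n50(fragments):,} bp")
```

With real data the same cut points would be applied to the sequences themselves before rebuilding the BLAST database.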
Keeping it simple: Kew Science Festival
Six species: whole genome-skim samples with MinION in preparation
Build BLAST DBs from skimmed data
Select ‘unknown’ (blinded) sample, extract DNA and resequence in real-time
Compare to partial DBs in six-way BLAST competition
Live ID?
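A speculative sketch of how a six-way competition could be scored as reads arrive. It assumes each read has already been searched against every database and carries one score per database with a hit; the read votes for its single best database and the running leader is reported. The voting rule, the tie handling and the `species_N` names are all hypothetical.

```python
from collections import Counter


def best_database(hits):
    """Name of the database with the top score, or None for no hit or a tie."""
    if not hits:
        return None
    top = max(hits.values())
    winners = [db for db, score in hits.items() if score == top]
    return winners[0] if len(winners) == 1 else None


def live_tally(read_stream):
    """Yield the current leader as (database, votes) after each incoming read."""
    tally = Counter()
    for hits in read_stream:
        winner = best_database(hits)
        if winner is not None:
            tally[winner] += 1
        yield tally.most_common(1)[0] if tally else None


# Hypothetical per-read scores, keyed by database name.
stream = [
    {"species_3": 410.0, "species_1": 120.0},
    {},                                        # read with no hit anywhere
    {"species_3": 95.0, "species_5": 95.0},    # tie, left unassigned
    {"species_5": 300.0},
    {"species_3": 220.0, "species_2": 80.0},
]

for i, leader in enumerate(live_tally(stream), start=1):
    print(f"after read {i}: leader = {leader}")
```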
de novo genome assembly
Data                                         MiSeq only    MiSeq + MinION
Assembler                                    Abyss         hybridSPAdes
Illumina reads, 300bp paired-end             8,033,488     8,033,488
Illumina data (yield)                        2,418 Mbp     2,418 Mbp
MinION reads, R7.3 + R9 kits, N50 ~ 4,410bp  -             96,845
MinION data (yield)                          -             240 Mbp
Approx. coverage                             19.49x        19.49x + 2.01x
Assembly key statistics:
  # contigs                                  24,999        10,644
  Longest contig                             90 Kbp        414 Kbp
  N50 contiguity                             7,853 bp      48,730 bp
  Fraction of reference genome (%)           82            88
  Errors, per 100 kbp: # N’s                 1.7           5.4
  # mismatches                               518           588
  # indels                                   120           130
  Largest alignment                          76,935 bp     264,039 bp
CEGMA gene completeness estimate:
  # genes                                    219 of 248    245 of 248
  % genes                                    88%           99%
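A back-of-envelope check of the assembly table, using only the figures shown on the slide. It assumes coverage is simply yield divided by genome size, so dividing yield by the quoted coverage gives the genome size each figure implies. The helper names are illustrative.

```python
def mean_read_length(yield_bp: float, n_reads: int) -> float:
    """Average read length implied by total yield and read count."""
    return yield_bp / n_reads


def implied_genome_size(yield_bp: float, coverage: float) -> float:
    """Genome size implied by coverage = yield / genome size."""
    return yield_bp / coverage


# Values copied from the table above.
runs = {
    "Illumina": {"reads": 8_033_488, "yield_bp": 2_418e6, "coverage": 19.49},
    "MinION": {"reads": 96_845, "yield_bp": 240e6, "coverage": 2.01},
}

for name, run in runs.items():
    length = mean_read_length(run["yield_bp"], run["reads"])
    genome_mbp = implied_genome_size(run["yield_bp"], run["coverage"]) / 1e6
    print(f"{name}: mean read ~{length:,.0f} bp, implied genome ~{genome_mbp:.1f} Mbp")

# CEGMA completeness: genes found out of the 248 in the core set.
cegma_found = {"MiSeq only": 219, "MiSeq + MinION": 245}
for name, found in cegma_found.items():
    print(f"{name}: {100 * found / 248:.1f}% of CEGMA genes")
```

The Illumina mean of about 301 bp agrees with the 300bp read length in the table, and the two implied genome sizes (roughly 124 and 119 Mbp) are close to each other.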
Wait – genes?
Entire chloroplast genome (~150kbp)
Plastid coding loci
Individual field-sequenced MinION reads
Real-time phylogenomics
Filtered reads
Gene models
TAIR10 CDS code
Annotation
SNAP
1:1 reciprocal BLAST
Multiple sequence alignments
MUSCLE
Trimal
Gene trees → Consensus tree
*BEAST
RAxML, TreeAnnotator
Cumulative counts: unique genes; all genes
(‘Lab’ being transported!)
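The ‘1:1 reciprocal BLAST’ step in the pipeline above can be illustrated with a small reciprocal-best-hit filter. This is a sketch under assumptions: the scores are taken to have been parsed already from two BLAST searches (reads against gene models, then gene models against reads), and the read and gene identifiers are hypothetical.

```python
def best_hits(scores):
    """Map each query to its best-scoring subject; scores is {(query, subject): score}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Keep (query, subject) pairs that are each other's best hit in both directions."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical scores: reads searched against gene models, and the reverse search.
forward = {
    ("read_1", "gene_A"): 250.0,
    ("read_1", "gene_B"): 90.0,
    ("read_2", "gene_B"): 180.0,
    ("read_3", "gene_B"): 60.0,
}
reverse = {
    ("gene_A", "read_1"): 245.0,
    ("gene_B", "read_2"): 175.0,
    ("gene_B", "read_3"): 55.0,
}

print(reciprocal_best_hits(forward, reverse))
```

Here `read_3` is dropped: its best hit is `gene_B`, but `gene_B` prefers `read_2`, so the pair is not 1:1.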
3. Life is complex
Evolution is complex
Nakhleh (2009); Suh (2016) Zool. Scripta doi:10.1111/zsc.12213
Zapata et al. (2016) PNAS 113:E4052-E4060 ©2016 National Academy of Sciences
Networks
Strimmer & Moulton (2000) MBE
Solís-Lemus & Ané (2016) PLoS Genet.
Nakhleh (2009) in: Heath & Ramakrishnan, eds., Springer
Three-colour graphs: phylogeny, synteny & identity
Key:
Extant node
Inferred node
Synteny edge (physical connection)
Phylogeny edge (evolutionary connection)
Identity edge (organismal connection)
[Figure: example graph with node labels a, b, c, d, e, x, y, z]
Three-colour graphs: phylogeny, synteny & identity
[Figure, panel ‘Duplication’: node labels a1–a5, b1–b5, b’3]
[Figure, panel ‘Inversion’: node labels a1–a3, b1–b3, c1, c3]
Three-colour graphs: phylogeny, synteny & identity
[Figure, panel ‘Tetraploid hybrid formed’ / ‘Diploidization (secondary loss)’: node labels a1–a4, b1–b4, x1, x3, x4, y1, y3, y4]
[Figure, panel ‘Horizontal gene transfer’: node labels a1, a2, b1, b2, x1–x7]
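A minimal data-structure sketch for the three-colour graphs described in the key: nodes are either extant or inferred, and each edge carries one of three kinds. The class and method names are hypothetical, the toy graph is not a transcription of any figure in the talk, and the use of an identity edge to link loci in the same organism is an assumption about the key's meaning.

```python
from dataclasses import dataclass, field
from enum import Enum


class EdgeKind(Enum):
    SYNTENY = "physical connection"
    PHYLOGENY = "evolutionary connection"
    IDENTITY = "organismal connection"


@dataclass
class ThreeColourGraph:
    extant: dict = field(default_factory=dict)  # node name -> True (extant) / False (inferred)
    edges: set = field(default_factory=set)     # {(frozenset({u, v}), EdgeKind)}

    def add_node(self, name, extant=True):
        self.extant[name] = extant

    def add_edge(self, u, v, kind):
        for node in (u, v):
            if node not in self.extant:
                raise KeyError(f"unknown node {node!r}")
        self.edges.add((frozenset((u, v)), kind))

    def neighbours(self, name, kind):
        """Nodes joined to `name` by an edge of the given kind, sorted by name."""
        return sorted(
            other
            for pair, k in self.edges
            if k is kind and name in pair
            for other in pair - {name}
        )


# Toy example: two loci (a, b) in two sampled genomes plus inferred ancestors.
g = ThreeColourGraph()
for name in ("a1", "b1", "a2", "b2"):
    g.add_node(name, extant=True)
for name in ("a3", "b3"):
    g.add_node(name, extant=False)

for u, v in (("a1", "b1"), ("a2", "b2"), ("a3", "b3")):
    g.add_edge(u, v, EdgeKind.SYNTENY)      # neighbouring loci on one chromosome
for u, v in (("a3", "a1"), ("a3", "a2"), ("b3", "b1"), ("b3", "b2")):
    g.add_edge(u, v, EdgeKind.PHYLOGENY)    # ancestor-descendant links
g.add_edge("a1", "b1", EdgeKind.IDENTITY)   # same organism (assumed meaning)

print(g.neighbours("a3", EdgeKind.PHYLOGENY))
print(g.neighbours("a1", EdgeKind.SYNTENY))
```

Keeping the edge kind on the edge, rather than in three separate graphs, lets events such as duplication or inversion be expressed as changes to synteny edges while phylogeny edges are left alone.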
Step back: molecular evolution
“Horizontal gene transfer occurs x times more frequently in these lineages, because of this biology”
“Convergent evolution is rare in most genes, in most organisms, but y times greater in these gene families … because of this biology”
“New chromosomes are created & destroyed at z, q rates in this reproductive strategy … because of this biology”
4. Real big data
How big? How many?
Species:
• Mammals: 10^3 – 10^5
• Animals: 10^6 – 10^7
• Plants: 10^5 – 10^6
• Bacteria: >10^6 ?
• Fungi: >>10^5-???
DNA sequencing machines:
~10^4-10^5 (each ~10^9-10^10 bp/day)
(Organisms):
(…a lot)
Example                       Typical feature size (Mbp)
Largest known genomes         ~240,000 (10^11)
Vascular plant                ~0.1 – 10,000 (10^8 - 10^10)
Human genome                  3,000 (3x10^9)
Most fungi                    ~0.1 – 10 (10^5 - 10^7)
Bacteria                      ~0.1 – 1 (10^6)
Viruses                       ~0.01 (10^5)
Mitochondria / chloroplasts   0.017 / 0.2 (10^4; 10^5)
‘Barcoding’ locus             ~1000bp (10^2)
Illumina read                 75-300bp
Nanopore read                 100bp – >1Mbp
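A quick order-of-magnitude calculation from the figures on this slide, reading the machine count as 10^4 to 10^5 and the per-machine output as 10^9 to 10^10 bp/day. Expressing the total as human-genome equivalents at 1x is an illustrative choice, not something the slide does.

```python
HUMAN_GENOME_BP = 3 * 10**9            # from the table: 3,000 Mbp

machines = (10**4, 10**5)              # slide's range for sequencing machines
bp_per_machine_day = (10**9, 10**10)   # slide's range for output per machine

low = machines[0] * bp_per_machine_day[0]
high = machines[1] * bp_per_machine_day[1]

print(f"Global output: {low:.0e} to {high:.0e} bp/day")
print(
    "Human-genome equivalents per day (1x): "
    f"{low // HUMAN_GENOME_BP:,} to {high // HUMAN_GENOME_BP:,}"
)
```

So the combined output is on the order of 10^13 to 10^15 bp per day, that is, thousands to hundreds of thousands of human-sized genomes at single coverage.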
The tools aren’t in great shape but the prizes are there
bionode.js
bioboxes.org
Singularity
Portable sequencing, by anyone, means really Big Data
Informatics connecting this data through explicit models is inference
Scalable, reproducible, sustainable research:
Ubiquitous / citizen-sequencing
From lab-based…
… to ‘app store’ genomics
Metagenomics…
© 2016 Katy Reed / FR
EPI2ME; Juul et al. (2016) /
Metrichor.com
Health: defining ‘normal’
Credit:
Darryl Leja, NHGRI
Food-chain: inputs and outputs
Environmental sampling
(Species’ abundance varying in time and space)
(AKA “trampling on the ecologists’ toes”)
Genomic observatories
• [GBIF pic]
• orms
The Tree (network) of Life
iTOL:
Ivica Letunic,
Mariana Ruiz Villarreal
Final thoughts
Thanks, funders, contacts and questions
Oxford Nanopore Technologies Ltd.
Dan Turner, Richard Ronan, Gerrard Coyne
U Bangor:
Alexander S.T. Papadopulos (@metallophyte)
RBG Kew:
Postdocs: Andrew Helmstetter (@ajhelmstetter); Tim Coker
Thanks: Dion Devey, Robyn Cowan, Tim Wilkinson, Stephen Dodsworth, Pepijn Kooij, Felix Forest, Bill Baker, Jan T. Kim, Jenny Williams, Abigail Barker, Mark Lee, Jim Clarkson, Mike Chester, Ester Gaya, Lisa Pokorny, Laszlo Csiba, Paul Wilkin, Richard Buggs, Mike Fay, Mark Chase, Ilia Leitch
QMUL
Laura Kelly, Kalina Davies, Steve Rossiter
Oxford
Aris Katzourakis, Oli Pybus, Jayna Raghwani
Others
Forest Research: Daegan Inward, Katy Reed
Dstl: Claire Lonstale, James Taylor
Birmingham: Nick Loman, Josh Quick
U. Utah: Bryn Dentinger
Imperial: James Rosindell
This research was conducted in the Sackler Phylogenomics Laboratory and was supported by the Calleva Foundation Phylogenomic Research Programme and the Sackler Trust
@lonelyjoeparker:
joe.parker@kew.org

Real-time Phylogenomics: Joe Parker

  • 1. Real-time phylogenomics or ‘Some interesting problems in genomic big data’ Joe Parker Early-career Research Fellow, RBG Kew @lonelyjoeparker: joe.parker@kew.org
  • 2. Incredible times for bioscience Images – Wikimedia commons CC BY-SA (clockwise from top left: Jeroen Rouwkema, @aGastya, author’s own, @RE73)
  • 4. MENU Some definitions Adventures with a genome sequencer Evolution is complex Real-time data & mass sequencing Final thoughts: the cosmology of life
  • 6. Genes and genomes >ENA|AY819028|AY819028.1 Capsicum annuum cultivar Hot 1493 acyltransferase (Pun1) gene, complete cds. TCATTAGAAGGTCATACCGCTCCACGAAAATGCACCTTGAAAGATATAACACGGACAACGAATCATTATCCCCATCATCACTATTACTCCCACTTCC CTTGCACTCTTCACTGTCACCACTGACACTCCGCTTGGCAACATTTTCACTAGAATCGACGTAGTCGCTTATCTCCTTTAACTCCGAATCTGATTCG GACACCGACTGTTCAATTTTCTTTCTTTTTGAGACTTTTTCAACTGCTTCAGTTCTTCTTTTTCTACTGTTACCAGCGGTACCGGCTTTGCGTTTAGG AATGATGTTTTTTTTACCCATTTTCAACACAATCTACACCTAAAGAACAAATCTCCCATTTTTAGTTCATAGACCACAAGTCTATCAACAGAAATAACT CAATGCTCAAATGAACCCCCCTCCCCCCAAAAAAAAATTAACAAACACCCCACCATTAAACAGTTCACTACACAAACATACAATAACTGAACCAAAAT CCAACATGCAATATCAAAACACAACAATTACTAAAATCAAACTAATGCACCTAATCAAACTAATTAGCTATTAATATTTCAATTTTCACTATTTCAGCAA TCATGTTTTAAAAGAATTTCATACGTCTGAAAATTGATATATATCTAGGGCATTCTCATTTCATAGACCACGGGTCTACGGATAGACCTCGGGTCTAC GAACAGAAATAGGTCGCTGTTCAAATCAAAATGCCAAAATAACTCTTCAAACAACTATTATCCCACCATTCAACACTTCGTTGCTAAATAAACCACAA CTAAACCAAAACACCAAATTCGAAGAAAAAATTTCTACATCACTACGAGTTGATTAGCAAAAAAAAACGTTTAAATGGATCTAGAAATGATCGAAACT TGATTTTAACTAACCTTGCAAAGCAGCAACAACCCCTTAGTAGCTGGAGAAGAAGACGAAATGAAAATGGCATTTTTGGAAGAAGTAGTTTCAAAAG CAGGAGTTGGGAATTGAAGAGGAGAGAGAGGGTGGGTTTTTTTAAATATTGGAATAATTGGAGGGTGTTAGGTGTATTATATTAAATTTGTAAAGTT GTAAAAATGATGAATTGGTCCCTTGGCCGATGCGTGGGCCCCACTTTTTCATAAAAAATAAATCAAAAAGAAATTAAGTAGGTATTTGACAAATTAAT TTTGGAGGGTTCCTTCTTTGCCAATTATTCCCCACTAAGCTACTCTCATTCACTCTTATATTATAGATTATAGTATAAAGTAATACAAACTATGAATTG TTTTTATATTTTATTTTACAAGTTATGAATAGTGTTTATATAGGTCTCTATTTCCATACAATCACATTTTGTGGGCAGTTTTTTTGGGATTGTCACGAAG GCGAGGTTTGTTCATTTTGTGGAAAGAGAATTGGATTTCTACATTTTTATCATCTTCTAGGTGTGATGTTGATACTACTATTTGCCCAAATATTTGTTT TAAACATATTAATATTATGTATCAAAATGTGTACAATATAATTTAACACACGTGCAGTATGCATGTATCGCGAAACTAGTTAATTACATGCATCACATG TAATAGCAATAGTATTATTGTACGACGTACTAATATATTAGTATCTATTCTAGCTACTAATTTCCTCTTAACCGTCTCCATGCTGAAAACAACGCCACA GTGCAACGAGCCTTCTATAAAAGTTGAATTATATAAAAATAAGGTACAGTTTAGAAATAAAACTAACAAAAAGGTAACCTATAGTTTGGGGGTTGGGT 
AGAGGTTGTTTAGCCAGTAACTCTATTATTTCATTTCCTTTTGTCTATATAAGTGTATCCATATATGCAAGAAATGTCAACCGGCCAGCAGCATATAT TTATTTGTTAAATTAATTATGGCTTTTGCATTACCATCATCACTTGTTTCAGTTTGTAACAAATCTTTTATCAAACCTTCCTCTCTCACCCCCTCTACAC TTAGATTTCACAAGCTATCTTTCATCGATCAATCTTTAAGTAATATGTATATCCCTTGTGCATTTTTTTACCCTAAAGTACAACAAAGACTAGAAGACT CCAAAAATTCTGATGAGCTTTCCCATATAGCCCACTTGCTACAAACATCTCTATCACAAACTCTAGTCTCTTACTATCCTTATGCTGGAAAGTTGAAG GACAATGCTACTGTTGACTGTAACGATATGGGAGCTGAGTTCTTGAGTGTTCGAATAAAATGTTCCATGTCTGAAATTCTTGATCATCCTCATGCAT CTCTTGCAGAGAGCATAGTTTTGCCCAAGGATTTGCCTTGGGCGAATAATTGTGAAGGTGGTAATTTGCTTGTAGTTCAAGTAAGTAAGTTTGATTG TGGGGGAATAGCCATCAGTGTATGCTTTTCGCACAAGATTGGTGATGGTTGCTCTCTGCTTAATTTCCTTAATGATTGGTCTAGCGTTACTCGTGAT CATACGACAACAACTTTAGTTCCATCTCCTAGATTTGTAGGAGATTCAGTCTTCTCTACACAAAAATATGGTTCTCTCATTACGCCACAAATTTTGTC CGATCTCAACCAGTGCGTACAGAAAAGACTCATTTTTCCTACAGATAAGTTAGATGCACTTCGAGCTAAGGTAATACTACCATCGTCCATTATTGTTT GTCTTACGGTATTTTTGAAAAGAATAATATTTAATAGTCTTCTTGAGACATATTTCACTTAACAAGCCTAGGCTATTTAGTCTATTTGTAGAAGCTACT CTTAAACGCCTCACTTAGTTAATAGCACTCCACTTATTGGTGTCAAAAACTACTCTTGGACATGTCATTTACTTAATAACACTCCACTTAATTATCGAA
  • 7. Alignment, assembly, annotation Li et al. (2011) EBI / NCBI / DDBJ
  • 8. Three millennia of modelling life
  • 9. Phylogenetic trees H. sapiens ATG CTC TAT GAG P. troglodoytes ATG CTC TTT GAG G. gorilla ATG CTT TAT TAC P. troglodoytes G. gorilla H. sapiens P. troglodoytes 11 9 8
  • 12.
  • 13. Field-sequencing for real Conditions 100% humidity; 6-13ºC Essential kit 800w generator 3x laptops Centrifuge Waterbath Polystyrene boxes (lots) Kettle(…!) Yield >400Mbp data in three days; A. thaliana ~2.01x coverage
  • 14. Snowdonia, HelloWorld & ‘tent-seq’ A. thaliana Arabidopsis lyrata Congeneric species; Reference genomes available Field-sequenced (MinION) & Lab-sequenced (Illumina™) Orthogonal BLAST: 4 sample*sequencer combinations Compare TRUE & FALSE rates for varying ID statistic cutoffs
  • 15. Field- vs. lab-sequenced sample ID Match individual reads to each reference with BLAST Compare match lengths in TRUE and FALSE cases ‘Length bias’ ID stat: lengthTRUE - lengthFALSE Compare TRUE & FALSE rates as length bias cutoff varies MiSeq (lab) MinION (field)
  • 16. Bitty data (1) partial queries Subsample MinION output Repeat ID pipeline, record mean ID stat sbias Replicates: N = 30 Simulate from 100 – 104 reads (≈instant → hours)
  • 17. Bitty data (2) partial references Take reference genome at high contiguity Fragment randomly to target (low) contiguity Repeat read identification using fragmented DB Simulate N50 ≈1,000bp to N50 ≈ 10Mbp
  • 18. Keeping it simple: Kew Science Festival Six species: whole genome- skim samples with MinION in preparation Build BLAST DBs from skimmed data Select ‘unknown’ (blinded) sample, extract DNA and resequence in real-time Compare to partial DBs in six-way BLAST competition Live ID ?
  • 19. de novo genome assembly
      Data: MiSeq only | MiSeq + MinION
      Assembler: ABySS | hybridSPAdes
      Illumina reads, 300bp paired-end: 8,033,488 | 8,033,488
      Illumina data (yield): 2,418 Mbp | 2,418 Mbp
      MinION reads, R7.3 + R9 kits, N50 ~ 4,410bp: - | 96,845
      MinION data (yield): - | 240 Mbp
      Approx. coverage: 19.49x | 19.49x + 2.01x
      Assembly key statistics:
        # contigs: 24,999 | 10,644
        Longest contig: 90 Kbp | 414 Kbp
        N50 contiguity: 7,853 bp | 48,730 bp
        Fraction of reference genome (%): 82 | 88
        Errors, per 100 kbp, #N’s: 1.7 | 5.4
        Errors, per 100 kbp, # mismatches: 518 | 588
        Errors, per 100 kbp, # indels: 120 | 130
        Largest alignment: 76,935 bp | 264,039 bp
      CEGMA gene completeness estimate:
        # genes: 219 of 248 | 245 of 248
        % genes: 88% | 99%
  • 20. Wait – genes? Entire chloroplast genome (~150kbp). Plastid coding loci. Individual field-sequenced MinION reads
  • 21. Real-time phylogenomics. Filtered reads → gene models (TAIR10 CDS code; annotation: SNAP; 1:1 reciprocal BLAST) → multiple sequence alignments (MUSCLE, trimAl) → gene trees → consensus tree (*BEAST; RAxML, TreeAnnotator). Cumulative counts: unique genes; all genes. (‘Lab’ being transported!)
  • 22. 3. Life is complex
  • 23. Evolution is complex. Nakhleh (2009); Suh (2016) Zool. Scripta doi:10.1111/zsc.12213; Zapata et al. (2016) PNAS 113:E4052-E4060 ©2016 National Academy of Sciences
  • 24. Networks Strimmer & Moulton (2000) MBE Solís-Lemus & Ané (2016) PLoS Genet. Nakhleh (2009) in: Heath & Ramakrishnan, eds., Springer
  • 25. Three-colour graphs: phylogeny, synteny & identity. Key: extant node; inferred node; synteny edge (physical connection); phylogeny edge (evolutionary connection); identity edge (organismal connection). [Figure: example graph on nodes a, b, c, d, e, x, y, z]
  • 26. Three-colour graphs: phylogeny, synteny & identity. Key: extant node; inferred node; synteny edge (physical connection); phylogeny edge (evolutionary connection); identity edge (organismal connection). [Figure panels: Duplication; Inversion]
  • 27. Three-colour graphs: phylogeny, synteny & identity. Key: extant node; inferred node; synteny edge (physical connection); phylogeny edge (evolutionary connection); identity edge (organismal connection). [Figure panels: Tetraploid hybrid formed, then diploidization (secondary loss); Horizontal gene transfer]
  • 28. Step back: molecular evolution. “Horizontal gene transfer occurs x more frequently in these lineages, because of this biology.” “Convergent evolution is rare in most genes, in most organisms, but y times greater in these gene families …because of this biology.” “New chromosomes are created & destroyed at z, q rates in this reproductive strategy …because of this biology.”
  • 29. 4. Real big data
  • 30. How big? How many?
      Species: Mammals: 10^3 – 10^5; Animals: 10^6 – 10^7; Plants: 10^5 – 10^6; Bacteria: >10^6 ?; Fungi: >>10^5-???
      DNA sequencing machines: ~10^4-5 (each ~10^9-10 bp/day)
      (Organisms): (…a lot)
      Example | Typical feature size (Mbp)
      Largest known genomes | ~240,000 (10^11)
      Vascular plant | ~0.1 – 10,000 (10^8 - 10^10)
      Human genome | 3,000 (3x10^9)
      Most fungi | ~0.1 – 10 (10^5 - 10^7)
      Bacteria | ~0.1 – 1 (10^6)
      Viruses | ~0.01 (10^5)
      Mitochondria / chloroplasts | 0.017 / 0.2 (10^4; 10^5)
      ‘Barcoding’ locus | ~1000bp (10^2)
      Illumina read | 75-300bp
      Nanopore read | 100bp – >1Mbp
  • 31. The tools aren’t in great shape but the prizes are there. Portable sequencing, by anyone, means really Big Data. Informatics connecting this data through explicit models is inference. Scalable, reproducible, sustainable research: bionode.js; bioboxes.org; Singularity
  • 32. Ubiquitous / citizen-sequencing From lab-based… … to ‘app store’ genomics
  • 33. Metagenomics… © 2016 Katy Reed / FR EPI2ME; Juul et al. (2016) / Metrichor.com
  • 36. Environmental sampling (Species’ abundance varying in time and space) (AKA “trampling on the ecologists’ toes”)
  • 38. The Tree (network) of Life iTOL: Ivica Letunic, Mariana Ruiz Villarreal
  • 40. Thanks, funders, contacts and questions. Oxford Nanopore Technologies Ltd.: Dan Turner, Richard Ronan, Gerrard Coyne. U. Bangor: Alexander S.T. Papadopulos (@metallophyte). RBG Kew: Postdocs: Andrew Helmstetter (@ajhelmstetter); Tim Coker. Thanks: Dion Devey, Robyn Cowan, Tim Wilkinson, Stephen Dodsworth, Pepijn Kooij, Felix Forest, Bill Baker, Jan T. Kim, Jenny Williams, Abigail Barker, Mark Lee, Jim Clarkson, Mike Chester, Ester Gaya, Lisa Pokorny, Laszlo Csiba, Paul Wilkin, Richard Buggs, Mike Fay, Mark Chase, Ilia Leitch. QMUL: Laura Kelly, Kalina Davies, Steve Rossiter. Oxford: Aris Katzourakis, Oli Pybus, Jayna Raghwani. Others: Forest Research: Daegan Inward, Katy Reed; Dstl: Claire Lonstale, James Taylor; Birmingham: Nick Loman, Josh Quick; U. Utah: Bryn Dentinger; Imperial: James Rosindell. This research was conducted in the Sackler Phylogenomics Laboratory and was supported by the Calleva Foundation Phylogenomic Research Programme and the Sackler Trust. @lonelyjoeparker: joe.parker@kew.org
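
The three-species alignment on slide 9 is small enough to check by hand. The sketch below counts identical aligned sites for each pair of species, which reproduces the 11, 9 and 8 that appear in the slide's pairwise matrix. Reading those numbers as identity counts is an inference from the arithmetic, and the function name `identical_sites` is illustrative only.

```python
from itertools import combinations

# Codon-aligned sequences from slide 9, with the spaces between codons removed.
alignment = {
    "H. sapiens": "ATGCTCTATGAG",
    "P. troglodytes": "ATGCTCTTTGAG",
    "G. gorilla": "ATGCTTTATTAC",
}


def identical_sites(a: str, b: str) -> int:
    """Count aligned positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b))


# One entry per unordered pair of species, in the order the species are listed.
pairwise = {
    (s, t): identical_sites(alignment[s], alignment[t])
    for s, t in combinations(alignment, 2)
}

for (s, t), n in pairwise.items():
    print(f"{s} vs. {t}: {n}/12 identical sites")
```

Human and chimpanzee share the most sites and so group together first, which is the kind of signal a distance-based tree method would pick up from such a matrix.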

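Slide 15 defines the 'length bias' ID statistic as the match length against the true reference minus the match length against the false one, and then varies a cutoff. A minimal sketch of that sweep follows. The decision rule (call TRUE when the bias is at least the cutoff, FALSE when it is at most minus the cutoff, otherwise make no call), the function names and the toy match lengths are all assumptions made for illustration; they are not the study's pipeline or data.

```python
def length_bias(len_true: int, len_false: int) -> int:
    """Match length against the true reference minus that against the false one."""
    return len_true - len_false


def rates_at_cutoff(hits, cutoff):
    """Return (TRUE rate, FALSE rate) over all reads for one positive cutoff.

    Assumed rule: a read is called TRUE if bias >= cutoff, FALSE if
    bias <= -cutoff, and is left uncalled otherwise.
    """
    if cutoff <= 0:
        raise ValueError("cutoff must be positive")
    if not hits:
        raise ValueError("no reads given")
    biases = [length_bias(t, f) for t, f in hits]
    true_calls = sum(b >= cutoff for b in biases)
    false_calls = sum(b <= -cutoff for b in biases)
    return true_calls / len(biases), false_calls / len(biases)


# Toy (len_true, len_false) BLAST match lengths in bp; invented for illustration.
toy_hits = [(1200, 300), (450, 400), (90, 150), (3000, 0), (200, 210)]

for cutoff in (1, 50, 100, 1000):
    true_rate, false_rate = rates_at_cutoff(toy_hits, cutoff)
    print(f"cutoff {cutoff:>4} bp: TRUE {true_rate:.1f}, FALSE {false_rate:.1f}")
```

In this toy set, raising the cutoff removes the FALSE calls before it removes most of the TRUE ones, because the long matches sit on the correct reference.
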
Editor's notes

  1. Definitions Genetic data, what it is, where it’s found, how we get it A genome / assembly Annotation and alignment A phylogeny or tree
  2. Definitions Genetic data, what it is, where it’s found, how we get it A genome / assembly Annotation and alignment A phylogeny or tree
  3. Naming stuff The ladder of life Binomial / ontological naming Darwin and The Tree Networks
  4. Portable sequencing: also long reads and real-time
  5. Portable Real-time Long easy
  6. Data in terrible conditions but anyone can do it Social media reach The Atlantic, Economist
  7. Direct, explicit, orthogonal test – and can it work? Picture of experimental design Outline of the study In terms of bioinformatics questions Funding: a first pot and timeline…
  8. We compare match lengths, and MinION allows long matches
  9. EXPLAIN AXES: precision improves rapidly
  10. EXPLAIN AXES: a partial REFERENCE would work, too
  11. MORE FUNDING. So simple a kid could do it? Yes. The challenge I set myself: OK, it’s a simple experiment. Can I build a test simple enough that a child can understand it? SOCIAL MEDIA. Funding: NANOPORE
  12. Data from one time and place can and should be useful elsewhere lash a bit of proper genomics
  13. Single reads match whole genes – meat & drink
  14. EXPLAIN AXES postdoc-years PAPER ACCEPTED
  15. Genomes come in all shapes and sizes Organisms too, life cycles (A)sexual reproduction; clonal replication Even genetic alphabet not fixed And mutation isn’t random Incongruence and reticulation Horizontal gene transfer Incomplete lineage sorting Hybridization Recombination
  16. Networks attempt to summarise this Splits graphs, directed graphical models / planar graphs
  17. Definition. Features. Generalised representations. Phylogeny edge information workable-outable. Other edges present in metadata; inferable. Generative model; easy to interpret… Here’s a common framework for all these studies. How to infer – sounds like a nightmare. Many of the edges in this network are really there already. Shifting paradigms, making linking easier. Explicitly model phylogeny, synteny and identity. Edge support reflects evidence; deviations from neutrality reflect hypotheses/models/phenomena. Any nodes connecting to an identity edge are considered completely connected. Maximum # edges ~n(2n-1)/2. Digraphs ~n!!. Possible ancestors from one locus on n taxa: essentially an inverse function of when they coalesce (can have m generations of n ancestors until an event where n(m)<n(t)).
  18. EXAMPLES: Gene duplication, e.g. paralogue in animal. Tetraploid formed then secondary diploidization, e.g. plant. Inversion in a genome. Unlinked loci (e.g. bacterial plasmids) and HGT. How to infer – sounds like a nightmare. Many of the edges in this network are really there already. Shifting paradigms, making linking easier. Explicitly model phylogeny, synteny and identity. Edge support reflects evidence; deviations from neutrality reflect hypotheses/models/phenomena.
  19. EXAMPLES: Gene duplication, e.g. paralogue in animal. Tetraploid formed then secondary diploidization, e.g. plant. Inversion in a genome. Unlinked loci (e.g. bacterial plasmids) and HGT. How to infer – sounds like a nightmare. Many of the edges in this network are really there already. Shifting paradigms, making linking easier. Explicitly model phylogeny, synteny and identity. Edge support reflects evidence; deviations from neutrality reflect hypotheses/models/phenomena.
  20. We need enough data to turn observations into empirical comparisons, and those into models and laws. We know a lot about evolutionary mechanisms, and a lot about a handful of genomes. What we know tells us “it’s complicated”: most genes don’t have simple orthologues, there is horizontal transfer, etc. But we don’t really have an empirical understanding of how they fit together, e.g.: “horizontal gene transfer occurs x more frequently in these lineages, because of this biology”; adaptive molecular convergence is rare in most genes, in most organisms, but y times greater in these gene families, because of this biology; new chromosomes are created (by duplication, endogenisation, polyploidy) and destroyed (by diploidization) at z rates in this reproductive strategy, because of biology.
  21. Global databases. Algorithms, methods and theory. Generally bespoke / slow / in-house. Special sauce. Formally linking datasets and models is inferring the network of life. This shifts the job for bioinformatics from something it’s good at (sophisticated, incremental analysis) to something computers in general are great at: linking elements. In this case informatics doesn’t enable research, it is the process of inference. It’s relatively easy to write a new standalone app to do x, or analyse some big dataset. Reproducibility and scaling-up science mean we must work harder on the links. Informatics as inference. The lonely astronomers.
  22. HPCs to apps: exponential data, linear understanding. Pause – to recap. This is important because it’s where we tie it together and show my contribution. Portable sequencers, easier to use: more places; more experimenters; more data; more noise. Efficient comparison? Dynamic computation? Clever hashing? Portable, mass sequencing is really here. Massive potential for de novo genomics; phylogenomics. But while we’re accumulating information at an exponential rate, we’re integrating it linearly, in essence … where are we going?
  23. Superset of species ID Distribution of species, sometimes functional focus We may not have positive controls We usually don’t know ‘normal’ distribution From a fringe idea to routine
  24. Gut microbiome Many other tissues UTI Dental Cardiac Respiratory Not just human health; pathogen surveillance
  25. Dodgy burgers and provenance Pests, crop inputs Feeds Supporting ecosystem health/services
  26. Ecosystems; habitats; communities; niches Properly ecologists’ domain Thousands of species Abundances shift in time and space All trackable with DNA Where do ecosystem services come from? How healthy?
  27. What’s really out there Longitudinal data Fixed locations Parallels with earth sensing Autonomous in-situ sensor platforms
  28. Data collection in aggregate means we can asymptotically assemble the components we need for the Tree Of Life. This is loosely defined as the Map Of The World for genomic stuff. Not exactly simple, but a computational and engineering challenge, not really intellectually taxing (probably). Pretty much the biggest goal in evolution.
  29. The cosmology of life. Why genomes/chromosomes? Why that size? Why organisms? Where is the root? Sequence-space and the network as a state-space. Inflation. Probability function of you.
  30. Funders Thanks Reach out
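
Slide 17 and note 10 describe fragmenting a high-contiguity reference down to a target N50 and repeating the read identification. The sketch below shows one way to compute N50 and to break contigs at random positions. The length-weighted choice of which contig to break, the function names and the toy contig lengths are assumptions, not the method used in the study.

```python
import random


def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


def fragment(lengths, n_breaks, rng):
    """Introduce n_breaks random breakpoints and return the new contig lengths.

    Assumption: the contig to break is chosen with probability proportional
    to its length, and the cut position within it is uniform.
    """
    pieces = list(lengths)
    for _ in range(n_breaks):
        candidates = [i for i, p in enumerate(pieces) if p > 1]
        if not candidates:
            break  # nothing left that can be split
        weights = [pieces[i] for i in candidates]
        i = rng.choices(candidates, weights=weights)[0]
        cut = rng.randint(1, pieces[i] - 1)
        pieces[i:i + 1] = [cut, pieces[i] - cut]
    return pieces


rng = random.Random(1)
reference = [30_000_000, 25_000_000, 20_000_000]  # toy contig lengths in bp (hypothetical)
fragmented = fragment(reference, n_breaks=1000, rng=rng)
print(f"N50 before: {n50(reference):,} bp; after 1000 breaks: {n50(fragmented):,} bp")
```

To reach a specific target such as N50 ≈ 1,000 bp, one could keep adding breaks until `n50` first drops to or below the target; splitting a contig can only lower N50 or leave it unchanged, so that loop moves in one direction.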