By Florian Maumus and Hadi Quesneville
We present our opinions, recent developments and perspectives regarding whole-genome repeatome annotation.
This talk was presented by Florian Maumus at the Barbados Workshop on the Computational Identification and Analysis of Transposable Elements, Holetown, Barbados, April 18-24, 2014.
1. Barbados Workshop on the Computational Identification and Analysis of Transposable Elements
April 18th - 25th, 2014
Florian Maumus with Hadi Quesneville (URGI-INRA, Versailles, France)
4. De novo repeatome detection
Deep repeatome annotation
Repeat annotation in large genomes
7. Repeat complement = Repeatome
The Repeatome includes:
Transposable elements
Endogenous viruses
Tandem repeats
Ribozymes
Genes
…
= What you get with repeat-finders!
9. Dark matter, the genomic humus
« Repeats »: detected (burst)
Old repeats: detectable? (decay)
Dark matter: background noise (melt)
10. Complexity of the repeatome
Turnover ++
Recent activity +++
Turnover -
Recent activity -
young
old
11. Different history, different challenges
Maize
2.3 Gb genome
About 85% repeats
Human
3.2 Gb genome
About 50% repeats
12. LECA:
Core eukaryotic genes +
Copia, Gypsy, LINEs,
DNA transposons…
TEs have been jumping around genes over evolutionary times
14. Archeology toolbox
[Image: contents of an archaeologist's tool roll, including trowels, a hand shovel, a small brush, a line level and a plumb bob]
26. Genome coverage increase (%)
[Bar chart: genome coverage increase (%) for TEdenovo, RepeatModeler and RepeatScout; y-axis from 0 to 35]
REPET, RepeatScout, and RepeatModeler employ
complementary computational methods that together
give a better representation of repeatome complexity.
27. Conclusions I
TEdenovo outcompetes RepeatModeler and RepeatScout
Greater coverage with
Fewer consensus sequences
Larger consensus sequences
Larger copies
Complementarity of TEdenovo, RepeatModeler and RepeatScout
Comprehensive annotation of complex repeatomes
28. De novo repeatome detection
Deep repeatome annotation
Repeat annotation in large genomes
29. Arabidopsis
120 Mb
Experimental model
CDS Repeatome Dark matter
0% 100%
Three strategies with REPET:
Annotate genome with genomic copies
Use relaxed parameters for HSP detection
Use P-clouds to detect short repeat fragments
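The third strategy relies on counting short oligos. As a rough illustration only (this is not the P-clouds implementation, which, as we understand it, also groups related high-copy oligos into clouds), the Python sketch below counts exact k-mers in one sequence and flags the positions covered by k-mers seen at least `min_count` times. The function names, `k` and `min_count` are hypothetical, and the reverse strand is ignored.

```python
from collections import Counter


def high_copy_kmers(seq, k=12, min_count=3):
    """Return the set of k-mers that occur at least min_count times in seq.

    Simplification: exact matches only, forward strand only.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {
        kmer
        for kmer, n in counts.items()
        if n >= min_count and set(kmer) <= set("ACGT")  # skip N and other symbols
    }


def mask_positions(seq, kmers, k):
    """Return a list of booleans: True where a high-copy k-mer covers the base."""
    seq = seq.upper()
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in kmers:
            for j in range(i, i + k):
                covered[j] = True
    return covered


if __name__ == "__main__":
    toy = "ACGTACGTACGTGGCCAATT"  # toy sequence, not real data
    repeated = high_copy_kmers(toy, k=4, min_count=3)
    mask = mask_positions(toy, repeated, k=4)
    print(sorted(repeated), sum(mask), "bases flagged")
```

On a real genome the counting step would need a dedicated k-mer counter; the in-memory `Counter` here is only meant to show the idea.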
35. Dinucleotide composition
[Chart: values for the 16 dinucleotides (AA to TT) in CDS, TEdenovo, delta_2vs1, delta_3vs2 and delta_4vs3; y-axis from -0.05 to 0.15]
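One way to obtain per-set dinucleotide values of this kind is sketched below in Python. The exact measure plotted on the slide is not stated, so the sketch simply computes relative dinucleotide frequencies over a set of sequences and the difference between two sets. Function names are illustrative.

```python
from collections import Counter
from itertools import product

# The 16 dinucleotides, AA to TT.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(sequences):
    """Relative frequency of each dinucleotide over an iterable of sequences.

    Overlapping dinucleotides are counted; pairs containing other symbols
    (for example N) are ignored.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        return {d: 0.0 for d in DINUCLEOTIDES}
    return {d: counts[d] / total for d in DINUCLEOTIDES}


def composition_difference(freq_a, freq_b):
    """Per-dinucleotide difference between two frequency tables (a minus b)."""
    return {d: freq_a[d] - freq_b[d] for d in DINUCLEOTIDES}


if __name__ == "__main__":
    set_a = ["ACGTACGT", "AATTAATT"]  # toy sequences, not real data
    set_b = ["GGCCGGCC"]
    diff = composition_difference(
        dinucleotide_frequencies(set_a), dinucleotide_frequencies(set_b)
    )
    for dinuc in DINUCLEOTIDES:
        print(dinuc, round(diff[dinuc], 3))
```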
36. Relevance
Genome annotation using the delta_2vs1 copies
masks as much as 23 Mb (19.5%) of the genome
Covers 66% of the reference annotation
and 56% of the TEdenovo annotation
The supplementary annotations from
TEdenovo_2 are highly representative of the A.
thaliana repeatome.
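Figures such as "covers 66% of the reference annotation" can be derived from interval overlaps. The Python sketch below shows one way to compute them, assuming each annotation is a list of half-open `(start, end)` intervals on a single sequence. The function names and the toy coordinates are illustrative and are not taken from the talk.

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_length(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))


def overlap_length(a, b):
    """Number of bases covered by both annotation a and annotation b."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


if __name__ == "__main__":
    reference = [(0, 100), (200, 300)]  # toy coordinates
    new_annotation = [(50, 250)]
    shared = overlap_length(reference, new_annotation)
    print("fraction of reference covered:", shared / covered_length(reference))
    print("fraction of genome masked:", covered_length(new_annotation) / 1000)
```

For a whole genome the same computation would be repeated per chromosome and summed before taking the ratio.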
42. Deep annotation of the A. thaliana repeatome
RepeatScout
RepeatModeler
TEdenovo
Repbase
(+Buisine et al.)
Remove
redundancy
Bundle library
TEannot
Consensus size
43. Deep annotation of the A. thaliana repeatome
selected
not
selected
TEannot
P-clouds
Complete
bundle
annotation
48. • Bundle + P-clouds
=> Repeated and repeat-derived sequences contribute
at least 30% to the A. thaliana genome
Enhanced repeat detection in gene-rich regions
49. Arabidopsis repeats browser
Genes
Buisine et al.
RepeatScout
RepeatModeler
REPET
Deep annotations
24-nt sRNA
50. Conclusions II
Innovative approaches for deep repeatome annotation
About one third of the A. thaliana genome is of repetitive origin (vs 24%)
Increased sensitivity and detection of old repeat remnants
Improved genome evolution and epigenetic analyses
Continuum between repeatome and genomic dark matter
Time
51. De novo repeatome detection
Deep repeatome annotation
Repeat annotation in large genomes
52. All genomes should benefit from the greater quality of
TEdenovo
Adapted from Nina V. Fedoroff (2012) and Steven M. Carr
53. Limitations with REPET
All-by-all genome comparison => LOTS (Gb) of high scoring pairs (HSPs)
HSP files > 1 Gb are not handled by Piler
Grouper can last for weeks
Until recently, impossible to run TEdenovo on whole genomes
that are large and/or highly repetitive
54. Solutions
Use a sample of whole genome as input for TEdenovo (e.g. 300Mb)
(As recommended for RepeatModeler)
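The slide does not say how the sample is drawn. A minimal Python sketch, assuming whole sequences (contigs or scaffolds) are picked at random until the target size is reached, could look like this. The function name, the `seed` parameter and the toy lengths are hypothetical.

```python
import random


def sample_sequences(lengths, target_bp, seed=0):
    """Pick whole sequences at random until about target_bp bases are selected.

    lengths: dict mapping sequence name to length in bp.
    Returns (picked_names, total_bp). The total can overshoot target_bp by
    less than the length of the last sequence picked.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    names = list(lengths)
    rng.shuffle(names)
    picked, total = [], 0
    for name in names:
        if total >= target_bp:
            break
        picked.append(name)
        total += lengths[name]
    return picked, total


if __name__ == "__main__":
    toy_lengths = {"ctg1": 100, "ctg2": 150, "ctg3": 80, "ctg4": 70}  # toy values
    picked, total = sample_sequences(toy_lengths, target_bp=250, seed=1)
    print(picked, total)
```

Sampling whole sequences keeps repeat copies intact; whether that is preferable to sampling fixed-size chunks depends on how fragmented the assembly is.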
59. De novo repeat annotation in large genomes
Future developments
Parallelize Grouper
Parallelize the “Long join” procedure
Establish phyla-specific approaches
Develop strategies to annotate genomes with different
compositions, e.g. old, complex repeatomes as compared
to large plant genomes
60. De novo repeat annotation in large genomes
Future challenges & perspectives
Propose TEdenovo and TEannot pipelines on GALAXY
Deliver REPET compilation for use on a cloud
61. Véronique Jamilloux
Tina Alaeitabar
Timothée Chaumier
Olivier Inizan
Mark Moissette
Hadi Quesneville
THANK YOU !