2. An introduction to Web Apollo.
A webinar for the Ceratitis capitata research community.
Monica Munoz-Torres, PhD | @monimunozto
Berkeley Bioinformatics Open-Source Projects (BBOP)
Genomics Division, Lawrence Berkeley National Laboratory
15 July, 2014
3. Outline
1. What is Web Apollo?
• Definition & working concept.
2. Our Experience With Community
Based Curation.
3. The Manual Annotation Process.
4. Becoming acquainted with Web
Apollo.
4. During this webinar you will:
• Learn to identify homologs of known genes of interest
in your newly sequenced genome.
• Become familiar with the environment and
functionality of the Web Apollo genome annotation
editing tool.
• Receive a brief introduction to the resources available
for the Ceratitis capitata genome.
5. What is Web Apollo?
• Web Apollo is a web-based, collaborative genomic
annotation editing platform.
We need annotation editing tools to modify and refine the
precise location and structure of the genome elements that
predictive algorithms cannot yet resolve automatically.
Find more about Web Apollo at
http://GenomeArchitect.org
and
Genome Biol 14:R93. (2013).
6. Brief history of Apollo*:
a. Desktop:
one person at a time editing a
specific region, annotations
saved in local files; slowed down
collaboration.
b. Java Web Start:
users saved annotations directly
to a centralized database;
potential issues with stale
annotation data remained.
* Biologists could finally visualize computational analyses and
experimental evidence from genomic features and build
manually-curated consensus gene structures. Apollo became a
very popular, open source tool (insects, fish, mammals, birds, etc.).
7. Web Apollo
• Browser-based tool integrated with JBrowse.
• Two new tracks: “Annotation” and “DNA Sequence”
• Allows for intuitive annotation creation and editing,
with gestures and pull-down menus to create and
modify transcript and exon structures, insert comments
(CV, freeform text), etc.
• Customizable look & feel.
• Edits in one client are
instantly pushed to all other
clients: Collaborative!
8. Working Concept
In the context of manual gene annotation,
curation tries to find the best examples
and/or eliminate most errors.
To conduct manual annotation efforts:
Gather and evaluate all available evidence
using quality-control metrics to
corroborate or modify automated
annotation predictions.
Perform sequence similarity searches
(phylogenetic framework) and use
literature and public databases to:
• Predict functional assignments from
experimental data.
• Distinguish orthologs from paralogs,
and classify gene membership in
families and networks.
Automated gene models
Evidence:
cDNAs, HMM domain searches,
alignments with assemblies or
genes from other species.
Manual annotation & curation
9. Dispersed, community-based manual gene
annotation efforts.
We continuously train and support
hundreds of geographically dispersed
scientists from many research
communities to perform biologically
supported manual annotations using
Web Apollo.
– Gate keepers and monitoring.
– Written tutorials.
– Training workshops and geneborees.
– Personalized user support.
10. What we have learned.
By harvesting expertise from dispersed researchers who
assigned functions to predicted and curated peptides,
we have developed more interactive and
responsive tools, as well as better visualization,
editing, and analysis capabilities.
http://people.csail.mit.edu/fredo/PUBLI/Drawing/
11. Collaborative Efforts Improved
Automated Annotations*
In many cases, automated annotations have been
improved (e.g. Apis mellifera; Elsik et al., BMC Genomics 2014, 15:86).
We also learned of the challenges of newer sequencing
technologies, e.g.:
– Frameshifts and indel errors
– Split genes across scaffolds
– Highly repetitive sequences
To face these challenges, we train annotators in
recovering coding sequences in agreement with all
available biological evidence.
12. It is helpful to work together.
Scientific community efforts bring together domain-
specific and natural history expertise that would
otherwise remain disconnected.
Breaking down large amounts of data into
manageable portions and mobilizing groups
of researchers to extract the most accurate
representation of the biology from all
available data distills invaluable
knowledge from genome analysis.
14. A little training goes a long way!
With the right tools, wet lab scientists make exceptional
curators who can easily learn to maximize the
generation of accurate, biologically supported gene
models.
15. Manual Annotation: How do we get there?
In a genome sequencing project…
• Assembly
• Automated annotation
• Manual annotation
• Experimental validation
16. Gene Prediction
Identification of protein-coding genes, tRNAs, rRNAs,
regulatory motifs, repetitive elements (masked), etc.
- Ab initio (DNA composition): Augustus, GENSCAN,
geneid, fgenesh
- Homology-based, e.g. SGP2, fgenesh++
Nucleic Acids 2003 vol. 31 no. 13 3738-3741
17. Gene Annotation
Integration of data from prediction tools to generate a
consensus set of predictions or gene models.
• Models may be organized using:
- automatic integration of predicted sets, e.g. GLEAN
- packaging the necessary tools into a pipeline, e.g. MAKER
• All available biological evidence (e.g. transcriptomes) further
informs the annotation process.
In some cases algorithms and metrics used to generate
consensus sets may actually reduce the accuracy of the
gene’s representation; in such cases it is usually better to
use an ab initio model to create a new annotation.
18. Manual Genome Annotation
• Identifies elements that best represent the underlying
biology.
• Eliminates elements that reflect the systemic errors of
automated genome analyses.
• Determines functional roles through comparative
analysis of well-studied, phylogenetically similar
genome elements using literature, databases, and
the researcher’s experience.
19. Curation Process is Necessary
1. A computationally predicted consensus gene set is
generated using multiple lines of evidence.
2. Manual annotation takes place.
3. Ideally consensus computational predictions will be
integrated with manual annotations to produce an
updated Official Gene Set (OGS).
Otherwise, “incorrect and incomplete genome annotations
will poison every experiment that uses them”.
- M. Yandell.
20. The Collaborative Curation Process at i5K
1) A computationally predicted consensus gene set has
been generated using multiple lines of evidence; e.g.
JAMg Consensus Gene Set v1.
2) i5K Projects will integrate consensus computational
predictions with manual annotations to produce an updated
Official Gene Set (OGS):
» If it’s not on either track, it won’t make the OGS!
» If it’s there and it shouldn’t be, it will still make the OGS!
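The two rules above can be pictured with a small Python sketch. It is only an illustration: the `Model` record, its `replaces` and `label` fields, and the merge rule itself are assumptions made for this example, and the actual i5K merging procedure may work differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    """Hypothetical record for one gene model (field names are illustrative only)."""
    model_id: str
    replaces: Optional[str] = None  # ID of the consensus model this annotation supersedes
    label: Optional[str] = None     # e.g. "Delete", as set in the Information Editor


def build_ogs(consensus, manual):
    """Merge consensus models and manual annotations into an OGS-like list."""
    # Consensus models that a manual annotation replaces or marks for deletion.
    superseded = {m.replaces for m in manual if m.replaces is not None}
    kept_consensus = [m for m in consensus if m.model_id not in superseded]
    # Annotations labeled 'Delete' only remove a model; they contribute nothing.
    kept_manual = [m for m in manual if m.label != "Delete"]
    return kept_consensus + kept_manual


consensus = [Model("cons1"), Model("cons2"), Model("cons3")]
manual = [
    Model("user1", replaces="cons1"),                  # curated replacement
    Model("user2", replaces="cons2", label="Delete"),  # removal request
]
ogs_ids = [m.model_id for m in build_ogs(consensus, manual)]
print(ogs_ids)
```

In this toy run, `cons3` was never reviewed, so it stays in the set whether or not it is correct, which is the point of the second rule.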
21. Consensus set: reference and start point
• In some cases algorithms and metrics used to generate
consensus sets may actually reduce the accuracy of the gene’s
representation; in such cases, use another model (e.g. the Augustus
model) instead to create a new annotation.
• Isoforms: drag original and alternatively spliced form to ‘User-
created Annotations’ area.
• If an annotation needs to be removed from the consensus set,
drag it to the ‘User-created Annotations’ area and label it as
‘Delete’ in the Information Editor.
• Overlapping interests? Collaborate to reach agreement.
• Follow guidelines for i5K Pilot Species Projects as shown at
http://goo.gl/LRu1VY and download the MedFly Annotation
guide from http://goo.gl/YY0tNw
24. Web Apollo: Graphical User Interface (GUI) for editing annotations
• Navigation tools: pan and zoom.
• Search box: go to a scaffold or a gene model.
• Grey bar of coordinates indicates location. You can
also select here in order to zoom to a sub-region.
• ‘View’: change color by CDS, toggle strands, set highlight.
• ‘File’: Upload your own evidence: GFF3, BAM, BigWig, VCF*.
Add combination and sequence search tracks.
• ‘Tools’: Use BLAT to query the genome with a protein
or DNA sequence.
• Available Tracks
• Evidence Tracks Area
• ‘User-created Annotations’ Track
• Login
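For those preparing their own evidence for upload, a GFF3 file is plain text with nine tab-separated columns per feature. The Python sketch below writes a tiny gene, mRNA, and exon hierarchy; the scaffold name, source name, IDs, and coordinates are made up for illustration, and attribute values containing special characters would still need escaping.

```python
def gff3_line(seqid, source, ftype, start, end, strand, attributes,
              score=".", phase="."):
    """Format one feature as a 9-column GFF3 line (1-based, inclusive coordinates)."""
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    attr_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    columns = [seqid, source, ftype, str(start), str(end),
               score, strand, phase, attr_text]
    return "\t".join(columns)


# Hypothetical feature set: one gene with one transcript and two exons.
lines = [
    "##gff-version 3",
    gff3_line("scaffold_1", "my_lab", "gene", 1000, 2500, "+", {"ID": "gene1"}),
    gff3_line("scaffold_1", "my_lab", "mRNA", 1000, 2500, "+",
              {"ID": "mrna1", "Parent": "gene1"}),
    gff3_line("scaffold_1", "my_lab", "exon", 1000, 1400, "+",
              {"ID": "exon1", "Parent": "mrna1"}),
    gff3_line("scaffold_1", "my_lab", "exon", 2000, 2500, "+",
              {"ID": "exon2", "Parent": "mrna1"}),
]
gff3_text = "\n".join(lines) + "\n"
print(gff3_text)
```

The `Parent` attribute is what ties exons to their transcript and the transcript to its gene, so a browser can draw them as one model.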
25. Web Apollo
• Selection of features and sub-features
• Edge-matching
• Evidence Tracks Area
• ‘User-created Annotations’ Track
The editing logic in the server:
§ selects longest ORF as CDS
§ flags non-canonical splice sites
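To make the two server-side rules concrete, here is a rough Python sketch. It is not Web Apollo's code: it assumes a forward-strand sequence, ORFs that run from ATG to the first in-frame stop codon, the standard stop codons, and GT...AG as the only canonical intron boundary. The function names and toy sequences are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript):
    """Return (start, end) of the longest ATG..stop ORF, 0-based and end-exclusive.

    Only the forward strand of the spliced transcript is scanned.
    Returns None when no complete ORF is found.
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best


def noncanonical_introns(genome, exons):
    """Return introns (start, end) that are not bounded by GT...AG.

    `exons` is a sorted list of 0-based, end-exclusive (start, end) pairs
    on the forward strand.
    """
    flagged = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = genome[left_end:right_start].upper()
        if not (intron.startswith("GT") and intron.endswith("AG")):
            flagged.append((left_end, right_start))
    return flagged


# Toy transcript with a short ORF in one frame and a longer one in another.
cds = longest_orf("CCATGAAATAGGATGCCCGGGTTTTGA")

# Toy genome: exon, GT..AG intron, exon, GC..AG intron, exon.
genome = "AAAA" + "GTCCCCAG" + "TTTT" + "GCCCCCAG" + "GGGG"
exons = [(0, 4), (12, 16), (24, 28)]
flagged = noncanonical_introns(genome, exons)
print(cds, flagged)
```

A flagged intron is not necessarily wrong; it is a prompt for the curator to check the supporting evidence at that boundary.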
26. Web Apollo
• DNA Track
• ‘User-created Annotations’ Track
§ There are two new kinds of tracks for:
§ annotation editing
§ sequence alteration editing
28. Web Apollo: The Information Editor
• DBXRefs
• PubMed IDs
• GO terms
• Comments
29. Additional Functionality
In addition to the protein-coding gene annotation that you know and love:
• Non-coding genes: ncRNAs, miRNAs, repeat regions, and TEs
• Sequence alterations (less coverage = more fragmentation)
• Visualization of stage and cell-type specific transcription data as
coverage plots, heat maps, and alignments
30. Webservices & additional tools
• Alignments - Jalview
• BLAST - blastp
• Signal Peptide – search using signalP.
• Just_Annotate_My_proteins:
Pick a Gene Ontology, Enzyme, KEGG, etc. term and it gives you a list
of genes that have a significant Hidden Markov Model alignment to a
SwissProt protein (i.e. only real proteins that have been validated) that has
real experimental evidence (i.e. from the literature) for that term.
The search is conservative and does not allow IEA evidence codes, to
avoid possibly propagating annotation errors. However, the search is run twice:
first every annotated gene is searched against SwissProt. Then a profile
alignment is created with the good matches and searched again.
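The conservative step can be pictured as a simple filter on evidence codes. The Python sketch below is not the tool's code: the record layout, protein names, and term placeholders are invented for illustration, and only the exclusion of IEA (Inferred from Electronic Annotation) comes from the description above.

```python
EXCLUDED_EVIDENCE_CODES = {"IEA"}  # Inferred from Electronic Annotation


def conservative_annotations(annotations):
    """Keep (protein, term, evidence_code) records whose code is not excluded."""
    return [rec for rec in annotations if rec[2] not in EXCLUDED_EVIDENCE_CODES]


# Invented example records, for illustration only.
records = [
    ("protA", "term_X", "IDA"),  # Inferred from Direct Assay: kept
    ("protB", "term_X", "IEA"),  # electronic annotation: dropped
    ("protC", "term_Y", "IMP"),  # Inferred from Mutant Phenotype: kept
]
kept = conservative_annotations(records)
print(kept)
```

Dropping electronically inferred annotations means a term is only transferred when at least one curated, experiment-backed record supports it.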
31. General Process of Curation
1. Select a chromosomal region of interest, e.g. scaffold.
2. Select appropriate evidence tracks.
3. Determine whether a feature in an existing evidence track will
provide a reasonable gene model to start working.
- If yes: select and drag the feature to the ‘User-created
Annotations’ area, creating an initial gene model. If necessary
use editing functions to adjust the gene model.
- If not: let’s talk.
4. Check your edited gene model for integrity and accuracy by
comparing it with available homologs.
Always remember: when annotating gene models using Web
Apollo, you are looking at a ‘frozen’ version of the genome
assembly and you will not be able to modify the assembly itself.
32. So how do I start curating!?
There are a number of ways to find the gene region you wish to annotate. It depends on what
you’re starting with:
a) The protein sequence from another species
b) Sequence from a similar gene
c) You provided Alexie with golden genes and he provided back alignments
d) You provided Alexie with high quality proteins and/or gene family alignments (multi or
single species) and he created domain annotations.
Option 1 – You have a sequence but don’t know where it is in
this genome
1. You will need to BLAT it
2. If protein-based BLAT doesn’t find it, you can BLAST it
3. You can use the i5k BLAST server here:
http://i5k.nal.usda.gov/blastn
4. Or you can use any other tool, for example Geneious
Option 2 – the genome has already been annotated with your
sequences and you have an ID
1. In other words, someone has told you where to look: if you give Alexie
profile alignments of your favorite gene family we could do that for
you.
2. Type the ID in the Search box of Web Apollo
• Web Apollo autocompletes using a case-insensitive search
anchored on the left-hand side of the word
e.g. HaGR will show all hagr objects (up to 10)
3. Choose one of the genes and click Go
You can do that with Domains, Alignments or Gene names provided to you.
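The autocomplete behavior described in Option 2 (case-insensitive, anchored on the left-hand side, up to 10 hits) can be sketched in Python as follows. This is an illustration of the matching rule only, not Web Apollo's implementation; the function name and the example names are made up.

```python
def autocomplete(query, names, limit=10):
    """Return up to `limit` names that start with `query`, ignoring case."""
    prefix = query.casefold()
    matches = [name for name in names if name.casefold().startswith(prefix)]
    return matches[:limit]


# Hypothetical feature names loaded in a search index.
names = ["HaGR1", "hagr2", "HAGR10", "xHaGR3", "HaOR1"]
hits = autocomplete("HaGR", names)
print(hits)
```

Note that `xHaGR3` is not returned: the match is anchored at the start of the name, so typing the beginning of an ID works but typing a fragment from its middle does not.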
Option 3 – Get genes based on a GO, EC, etc. term
This is a fun, new tool Alexie has made, called
Just_Annotate_My_proteins.
33. Example
Live Demonstration using the Apis mellifera genome.
A public Honey Bee Web Apollo Demo is available at
http://genomearchitect.org/WebApolloDemo
35. Thanks!
• Berkeley Bioinformatics Open-source Projects
(BBOP), Berkeley Lab: Web Apollo and Gene
Ontology teams. Suzanna E. Lewis (PI).
• Christine G. Elsik (PI). § University of Missouri.
• Ian Holmes (PI). * University of California Berkeley.
• Arthropod genomics community, i5K Steering
Committee, Alexie Papanicolaou at CSIRO, Monica
Poelchau at USDA/NAL, fringy Richards at HGSC-
BCM, Oliver Niehuis at 1KITE http://www.1kite.org/,
BGI, and the Honey Bee Genome Sequencing
Consortium.
• Web Apollo is supported by NIH grants
5R01GM080203 from NIGMS, and 5R01HG004483
from NHGRI, and by the Director, Office of Science,
Office of Basic Energy Sciences, of the U.S.
Department of Energy under Contract No. DE-
AC02-05CH11231.
• Insect images used with permission:
http://AlexanderWild.com and O. Niehuis.
• For your attention, thank you!
Web Apollo
Gregg Helt
Ed Lee
Colin Diesh §
Deepak Unni §
Rob Buels *
Gene Ontology
Chris Mungall
Seth Carbon
Heiko Dietze
BBOP
Web Apollo: http://GenomeArchitect.org
GO: http://GeneOntology.org
i5K: http://arthropodgenomes.org/wiki/i5K