This is the talk I gave at the 7th Arthropod Genomics Symposium, hosted by the Eck Institute for Global Health at the University of Notre Dame in South Bend, Indiana, USA.
More efficient sequencing technologies mean a dramatic increase in our access to whole genome sequences, and annotation efforts must adapt to keep pace in converting these sequence data into knowledge. The growing number of genome sequencing projects also means there will be a larger reliance on contributions from domain specialists. This is indicative of a curation environment shifting from a traditional centralized model to a geographically dispersed community annotation model, which requires new tools to support collaborative annotation. WebApollo is a successor to the Apollo annotation editor; it provides a web-based environment that allows multiple distributed users to review, edit, and share manual annotations. The WebApollo client is designed as an extension to JBrowse, a genome browser that provides a fast, highly interactive interface for visualization of genomic data. WebApollo allows users to create and modify transcript and exon structures through intuitive gestures, and flags potential problems within these manual annotations.
Web Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGen
1. Web Apollo
A Web-based Genomics Annotation Editing Platform
Eduardo Lee, Gregg Helt, Justin Reese, Monica Munoz-Torres*, Christopher
Childers, Rob Buels, Lincoln Stein, Ian Holmes, Christine Elsik, Suzanna Lewis
Arthropod Genomics Symposium 2013 | South Bend, IN | * @monimunozto
Lawrence Berkeley National Laboratory, Joint Genome Institute, for the US Department of Energy at UCB
2. The first real-time, collaborative genomics
annotation editor on the Web
Easy-to-use environment for multiple,
distributed users to review, update, and share
genome feature markups
Web Apollo is:
3. Working Concept
‘Gene Models’
Automated predictions,
assisted by some evidence
‘Evidence’
cDNA, HMM searches for
protein domains, alignments
of assemblies, curated genes
or other species
‘Manual annotation’:
Correct coordinates for
genes of interest.
4. The need for annotation tools
Assembly
Manual
annotation
Experimental
validation
Automated
Annotation
Requires optimized genome
visualization and editing tools
The need for genome visualization and editing tools prompted the
development of the genome browsers we commonly use.
Annotation editing tools then became necessary.
5. Gather and evaluate all available evidence using quality-
control metrics, to corroborate/modify automated predictions
Use literature and public databases to infer gene function
from experimental data
Run sequence-similarity searches within a phylogenetic
framework (e.g. alignment trees)
To predict protein functional assignments
Distinguish orthologs from paralogs, classify genes as members
of a family
Otherwise, incorrect and incomplete genome annotations will
poison every experiment that uses them
Manual curation is necessary!
6. Access to computational analysis
& experimental evidence
Manual annotation & curation
Compatibility with GMOD
Saved annotations directly to
database (not via email)*
Widely used (initially designed
for centralized, resource-rich
projects).
Apollo: Desktop and Java Web Start*
7. BUT…
Must load all data for a region (range) at once
No [automated]* support for sharing
Possible update conflicts due to stale annotation data
One annotator at a time
Edits from other users not visible without reloading
Require Apollo Download, Chado Install, Java Installation*
Apollo: Desktop and Java Web Start*
8. The need for updated tools
The democratization of genome-scale sequencing
calls for a new kind of annotation editing tool.
• more assembly errors
• lack of gold standard gene
structure training data
9. No installation required (for users).
User interface is a browser-based
Javascript client communicating with an
annotation editing server.
Apollo: on the Web
Is a plug-in for JBrowse, a successor to
the GBrowse genome browser. (GMOD)
Plug-in offers a “User-created
Annotations” track.
Real Time annotation updates;
annotations saved to centralized
database.
Uses dynamic (lazy) data loading:
only the region of interest
Customizable: rules, appearance.
Supports user authentication:
read, edit, review, complete, publish (export).
Automatically promote tracks (script).
10. Navigation tools:
pan and zoom
Search box: go
to a scaffold or a
gene model.
Grey bar of coordinates
indicates location. You can
also select here in order to
zoom to a sub-region.
‘Options’:
change color by
CDS, toggle
strands
‘File’:
Upload your
own evidence:
GFF3, BAM,
BigWig, VCF*
‘Tools’:
Use BLAT to query the
genome with a protein
or DNA sequence.
Available Tracks
Evidence Tracks Area
‘User-created Annotations’ Track
‘Share’: a stable link
shares your view and
shows exactly what
you are seeing (keeps a
record of your
annotation process)
Login
Web Apollo
Graphical User Interface (GUI) for editing annotations
11. Flags non-canonical
splice sites.
Selection of features and
sub-features
Edge-matching
Evidence Tracks Area
‘User-created Annotations’ Track
Web Apollo
The editing logic is on the server:
selects longest ORF as CDS
flags non-canonical splice sites
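The "selects longest ORF as CDS" rule can be illustrated with a minimal Python sketch. This is not Web Apollo's server code (which is Java); the function name `longest_orf` and the simplifications (forward strand only, ATG start, TAA/TAG/TGA stops, spliced transcript passed as a plain string) are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF in seq, or None.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    Only the three forward-strand reading frames are scanned (an assumption).
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best
```

For example, `longest_orf("AAATGCCCTAGATGAAACCCGGGTGA")` skips the short ORF at positions 2 to 11 and returns the longer one that follows it in the same frame.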
15. - BAM
- BigWig
- GFF3
- VCF*
Trellis
Data Broker
(Java)
Static JSON
Generation Pipeline
(Perl)
Server-side Data Service Annotation Editing Engine (Java)
Berkeley DB
realtime store
User
Management
Data Sources
Analysis Pipelines
- BAM
- BED
- BigWig
- GFF3
- MAKER
output
Data Repositories
Chado
MySQL
DAS servers
e.g. Ensembl
Annotation Exports
Local DB.
e.g. Chado
- GFF3
- FASTA
Annotators
Web
Apollo
JBrowse
Apollo Edit Operations
& User Management
User Interface (JavaScript)
JSON
Web Apollo
Architecture
17. Ability to annotate regulatory regions & features
Collapsing and expanding tracks
Sticky ‘User Annotations’ track
Genome slicing: annotating across contigs
Folding of intronic space
Web Apollo at GMOD in the cloud
[Near] Future Enhancements
18. Release
http://genomearchitect.org/webapollo/releases
Demo Site
http://genomearchitect.org/WebApolloDemo
User Guide
http://genomearchitect.org/webapollo/docs/webapollo_user_guide.pdf
At GMOD
http://gmod.org/wiki/WebApollo
Releases & Demo
19. To all our users & contributors! Especially:
Code: Mitch Skinner, Nomi Harris, Thomas Down, Carson Holt.
Feedback: Sue Brown, Sanjay Chellapilla, Daniel Ence, Juergen
Gadau, Nicolae Herndon, Elisabeth Huguet, Carolyn Lawrence,
Sasha Mikheyev, Barry Moore, Jan Oettler, Xiang Qin, Lukas
Schrader, Kim Worley, Mark Yandell, Jing-Jiang Zhou.
Formatting: Anna Bennett.
To our funding agencies:
NIH: NIGMS and NHGRI.
DOE: Office of the Director, Office of Science, Office of Basic
Energy Sciences.
Thanks
- Thank the organizers, especially the scientific committee.
- Introduce the team: a collaboration between three labs.
- The need for genome visualization and editing tools prompted the development of the genome browsers we commonly use.
- It was further necessary to create editing tools.
Before I describe the tools we have been working on, I will first explain why I think manual curation is necessary.
A series of changes has introduced both positive and negative elements into the process of genome sequencing and annotation:
- Cheaper sequencing
- More researchers getting involved
- More genomes being sequenced
- High-throughput RNAseq and improved automated annotation
More assembly errors and the lack of gold standards are further reasons for manual annotation. All of these factors are part of a process we call ‘the democratization of genome-scale sequencing’, which calls for a new kind of tool.
- Apollo API, using JavaScript.
- No installation required.
- JBrowse plugin with “User-created Annotations” track.
- Uses dynamic (lazy) data loading.
- Real-time annotation updates to centralized database.
- Customizable: rules, appearance, user authentication.
- Auto-promote (script).
File: GFF3, BAM, BigWig, VCF (soon).
Tools: The plug-in architecture of the annotation editing engine allows for sequence alignment searches using BLAT.
The editing logic is on the server:
- It selects the longest ORF as CDS.
- It flags non-canonical splice sites.
The server is a Java servlet. It uses the GMOD biological object layer (gbol) data model: an object model and API based on the Chado schema.
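The splice-site flag mentioned in these notes can be sketched as follows. This is a rough Python illustration, not the servlet's actual logic: the function name `noncanonical_splice_sites`, the coordinate convention, and the restriction to forward-strand GT...AG introns are all assumptions.

```python
def noncanonical_splice_sites(genome, exons):
    """Return a list of (position, kind, dinucleotide) flags.

    genome: forward-strand sequence as a string.
    exons: sorted list of 0-based, end-exclusive (start, end) pairs.
    Each intron is expected to begin with GT (donor) and end with AG
    (acceptor); anything else is flagged.
    """
    genome = genome.upper()
    flags = []
    # Walk consecutive exon pairs; the gap between them is the intron.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        donor = genome[left_end:left_end + 2]
        acceptor = genome[right_start - 2:right_start]
        if donor != "GT":
            flags.append((left_end, "donor", donor))
        if acceptor != "AG":
            flags.append((right_start - 2, "acceptor", acceptor))
    return flags
```

A real implementation would also have to handle reverse-strand transcripts, which this sketch leaves out.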
A Berkeley DB stores annotations, annotation edits, and their history.
Take-home message: Annotators can upload their own data.
Data Sources: Analysis Pipelines can process BAM, BED, BigWig, GFF3, and MAKER output without “much” massaging.
Data Repositories: INNOVATION: Trellis, a data broker with a plug-in architecture for both output formats and back-end data stores.
On the back end we implemented 3 plug-ins for:
- UCSC MySQL genome database
- Chado
- DAS servers (e.g. Ensembl)
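The idea of a broker with plug-ins on both sides (back-end stores and output formats) can be shown with a toy Python sketch. Nothing here reflects Trellis's real Java API: the class `DataBroker`, its methods, the in-memory back end, and the feature records are all hypothetical stand-ins for illustration.

```python
import json

class DataBroker:
    """Toy broker: routes a region query to a back-end plug-in, then formats it."""

    def __init__(self):
        self.backends = {}    # name -> callable(ref, start, end) -> list of dicts
        self.formatters = {}  # name -> callable(list of dicts) -> str

    def register_backend(self, name, fetch):
        self.backends[name] = fetch

    def register_formatter(self, name, render):
        self.formatters[name] = render

    def query(self, backend, fmt, ref, start, end):
        features = self.backends[backend](ref, start, end)
        return self.formatters[fmt](features)


# Hypothetical in-memory store standing in for Chado, UCSC MySQL, or a DAS server.
FEATURES = [
    {"ref": "scaffold1", "start": 100, "end": 500, "name": "geneA"},
    {"ref": "scaffold1", "start": 900, "end": 1200, "name": "geneB"},
]

def memory_backend(ref, start, end):
    # Return only features overlapping the half-open interval [start, end).
    return [f for f in FEATURES
            if f["ref"] == ref and f["start"] < end and f["end"] > start]

def json_formatter(features):
    return json.dumps(features, sort_keys=True)

broker = DataBroker()
broker.register_backend("memory", memory_backend)
broker.register_formatter("json", json_formatter)
```

Calling `broker.query("memory", "json", "scaffold1", 0, 600)` returns JSON for the one feature overlapping that region. Adding another store or output format means registering one more callable, without touching the broker itself.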