A Centralized Model Organism Database (CMOD) for the Long Tail of Genomes
Presented at Biocuration 2014 in Toronto http://biocuration2014.events.oicr.on.ca/
See related slides at http://www.slideshare.net/andrewsu/20140116-gmod-short
Centralized Model Organism Database (Biocuration 2014 poster)
1. A Centralized Model Organism Database (CMOD)
for the Long Tail of Genomes
ABSTRACT
Andrew I. Su, Benjamin M. Good, Chinmay Naik and Adriel Carolino
The Scripps Research Institute, La Jolla, California, USA
Background
How Gene Wiki?
We acknowledge support from the National
Institute of General Medical Sciences
(GM089820 and GM083924).
CONTACT
Benjamin Good: bgood@scripps.edu, @bgood
Andrew Su: asu@scripps.edu, @andrewsu
How Gene Wiki? The CMOD visionGENE WIKI EXAMPLEABSTRACT
FUNDING
Progress and status
CONCLUSION
One: structure from text miningThe Dark Matter of genome annotation
We need more hands on deck! We have
multiple positions open for postdocs and
programmers interested in crowdsourcing
and bioinformatics projects (like CMOD)!
1
10
100
1000
10000
100000
1000000
1997
1999
2001
2003
2005
2007
2009
2011
2013
2015
2017
2019
2021
2023
2025
Bacteria
Eukaryotes
Archaea
Model organism databases (MODs) are fantastic resources for
organizing genomic information for commonly-studied
organisms. To facilitate the creation and maintenance of MODs,
the Generic Model Organism Database (GMOD) Project
provides “a set of interoperable open-source software
components for visualizing, annotating, and managing biological
data.”
Provide a database of the
world’s knowledge that
anyone can edit.
- Denny Vrandečić
Despite the obvious success and value of GMOD, the number of
sequenced genomes is growing exponentially. Does this model
scale with the rate of genome sequencing?
Figure courtesy Scott Cain
Wikidata (http://wikidata.org) is an innovative and important
new tool for community-based knowledge management.
Wikidata is supported by the Wikimedia Foundation, which also
operates Wikipedia. In short, Wikidata is to structured data what
Wikipedia is to free text.
Model organism databases are fantastic resources for
genomics researchers. But relatively few model organisms
have stable funding for their database, and the number of
sequenced genomes is increasing exponentially. It seems
impractical to create and fund a model organism database
for each of them. Here, we describe our efforts to build a
Centralized Model Organism Database (CMOD), a single
online resource to support all genomes and organisms. To
scale to the Long Tail of Genomes, CMOD employs an open
editing model in which the entire research community is
empowered to edit and maintain genomic data. We
describe our efforts to systematically populate CMOD with
two core data types across all organisms – genome
annotations and Gene Ontology annotations.
We propose to build a Centralized Model Organism Database
(CMOD), which would house gene and genome annotations for
all genomes. This database would be based on Wikidata,
enabling it to be community-curated, continuously-updated, and
computer-readable.
CMOD
Gene and genome annotations
CMOD data can be accessed using a number of mechanisms.
The Wikidata web interface offers convenient access using a web
browser. The Wikidata application programming interface (API)
and associated programming libraries allow programmers and
bioinformaticians computational access to the data. Wikidata
export to RDF offers compatibility with the Semantic Web and
Linked Data. We also envision that many popular GMOD tools,
including Gbrowse, Jbrowse, and and WebApollo, can be
modified to use CMOD as the back-end data warehouse.
Wikidata
Wikidata currently catalogs over 14 million entities, and
describes those entities in the form of 27 million statements.
This knowledgebase is the product of over 50 million edits. Of
those edits, ~90% are contributed by bots that predominantly
import data from structured resources, and 10% are contributed
by human editors.
This seminal paper identified 517 operons and 103 small
regulatory RNAs in Listeria monocytogenes, an important human
pathogen. Unfortunately, these annotations cannot be
downloaded from the Broad’s “Listeria monocytogenes
Database”, nor NCBI Genome, nor UCSC’s Microbial Genome
Browser, nor EnsemblBacteria, nor any GMOD instance. The
only place they are available is from the Supplementary
information on the Nature website in PDF format.
We have loaded gene and genome annotation data for ~1000
human genes, the human proteins they encode, and their mouse
orthologs according to the data model shown above. The code
repository for managing these data is available at
https://bitbucket.org/sulab/wikidatagenebot.
The Skeptic’s Corner
Will CMOD scale with the exponential growth in sequenced
genomes? Yes, because there is no gatekeeper to adding new
content. Anyone is empowered to directly contribute. Even
though the technical infrastructure is centralized, the data
management is highly distributed.
Who will contribute to CMOD? We envision a wide spectrum of
contributors, from large biocuration/annotation centers adding
large data sets, to individual bioinformaticians who deposit
structured versions of previously unstructured data, to individual
scientists contributing individual annotations.
Will CMOD content be trustworthy? Like Wikipedia, we expect
that Wikidata overall will asymptotically approach perfect
accuracy and completeness. Moreover, because provenance is a
core part of the data model, the presence/absence/type of the
reference can be used to systematically filter the knowledgebase
according to each user’s needs.
Managing genomic information and knowledge is a critical
challenge for biomedical research. Community infrastructure
that allows individuals to collaboratively and collectively organize
knowledge has the potential to be an enabling technology in
biological research. Here, we propose CMOD as one such
application that is particularly focused on the Long Tail of
sequenced genomes.
Cumulative number of sequenced genomes