Developing a Curator Assistant for Functional Analysis of Genome Databases
Requesting $1,451,005 from NSF BIO Advances in Biological Informatics, August 2009
PI: Bruce Schatz, Institute for Genomic Biology, University of Illinois at Urbana-Champaign
coPI: ChengXiang Zhai, Computer Science, University of Illinois, Language Technology
coPI: Susan Brown, Biology, Kansas State University, Arthropod Base Consortium (BeetleBase)
coPI: Donald Gilbert, Bioinformatics, Indiana University, Community Annotation (wFleaBase)
Intellectual Merit
The advent of next-generation sequencing is rapidly decreasing the cost of genomes.
Projections are that costs will decrease from $1M to $100K to $10K in the next 5 to 10 years.
As a result, the number of arthropod genomes will increase from 10 to 1000 to 10,000. This
shifts the major limitation from sequencing to annotating. The current level of annotation is
recognizing genes from sequences, rather than understanding the function of genes.
Traditionally, functional analysis has been performed by human curators who read biological
literature to provide evidence for a genome database of gene function such as FlyBase. To
functionally analyze a genome, biologists develop an ortholog pipeline, where the genes of their
organism have their orthologs computed, and these are used to find the most similar gene in a model
organism database. This process is inexpensive, but inaccurate, compared to manual curation.
We propose to develop a Curator Assistant that will enable the communities that are
generating genomes to analyze the function of their genes by themselves. While the model
organism databases (MODs) have groups of curators, subsequent genome databases have
struggled to find funding for even a single human curator. Such bases will have to be curated by
the communities themselves, by community biologists using software infrastructure to help them
extract functions from community literature. Within the Arthropod Base Consortium (ABC), for
example, only FlyBase is a MOD with professional curators.
During the NSF-funded BeeSpace project, we developed prototype software for automatically
extracting entities and relations from biological literature. The entities include genes, anatomy,
and behavior, while the relations include interaction (gene-gene), expression (gene-anatomy),
and function (gene-behavior). These entities and relations can be used to populate relational
tables to build a genome database. Our prototype works on Drosophila literature and leverages
FlyBase, the MOD for the ABC. Our techniques appear general enough for all arthropods.
We propose to develop a full-fledged Curator Assistant that utilizes machine learning
technologies for natural language processing. These include community dictionaries, heuristic
procedures, and training sets. Given the community collection with relevant literature, the
assistant software suggests candidate relations that the community biologists can select from.
Providing additional knowledge is much easier than reading biological literature, and
mechanisms are provided to specify the desired level of quality and to revise the information itself.
Broader Impact
Our project has been organized via the annual Symposium of the Arthropod Base Consortium.
Our investigators include the BeeSpace PI for informatics and the Symposium organizer for
biology, representing arthropod genomes in particular and animal genomes in general. Our
project will develop language technology for entity-relation semantics into usable infrastructure
and distribute it through GMOD, which already provides the sequence support used by ABC.
We will develop the standards for literature support for customized extraction and curation,
including practical deployment to a distributed community of NSF-funded genome biologists.
Developing a Curator Assistant for Functional Analysis of Genome Databases
PI: Bruce Schatz, Institute for Genomic Biology, University of Illinois at Urbana-Champaign
coPI: ChengXiang Zhai, Computer Science, University of Illinois, Language Technology
coPI: Susan Brown, Biology, Kansas State University, Arthropod Base Consortium (BeetleBase)
coPI: Donald Gilbert, Bioinformatics, Indiana University, Community Annotation (wFleaBase)
1. GENOME SEQUENCING AND BIOCURATION
The advent of next-generation sequencing is rapidly decreasing the cost of genomes.
Projections are that costs will decrease from $1M to $100K to $10K in the next 5 to 10 years.
As a result, the number of arthropod genomes will increase from 10 to 1000 to 10,000. This
shifts the major limitation from sequencing to annotating. The current level of annotation is
recognizing genes from sequences, rather than understanding the function of genes.
Traditionally, functional analysis has been performed by human curators who read biological
literature to provide evidence for a genome database of gene function such as FlyBase. To
functionally analyze a genome, biologists develop an ortholog pipeline, where the genes of their
organism have their orthologs computed, and these are used to find the most similar gene in a model
organism database. This process is inexpensive, but inaccurate, compared to manual curation.
We propose to develop a Curator Assistant that will enable the communities that are
generating genomes to analyze the function of their genes by themselves. While the model
organism databases (MODs) have groups of curators, subsequent genome databases have
struggled to find funding for even a single human curator. Such bases will have to be curated by
the communities themselves, by community biologists using software infrastructure to help them
extract functions from community literature. Within the Arthropod Base Consortium (ABC), for
example, FlyBase is the only MOD with professional curators.
During the NSF-funded BeeSpace project, we developed prototype software for functional
analysis [33], by automatically extracting entities and relations from biological literature. The
entities include genes, anatomy, and behavior, while the relations include interaction (gene-
gene), expression (gene-anatomy), and function (gene-behavior). These entities and relations
can be used to populate relational tables to build a genome database. Our prototype currently
works on Drosophila literature and leverages FlyBase, the MOD for the ABC. Our techniques
are general enough for all arthropods. This is an important taxon of organisms for NSF biologists.
We propose to develop a full-fledged Curator Assistant that utilizes machine learning
technologies for natural language processing. These include community dictionaries, heuristic
procedures, and training sets. Given the community collection with relevant literature, the
assistant software suggests candidate relations that the community biologists can select from.
Providing additional knowledge is much easier than reading biological literature, and
mechanisms are provided to specify the desired level of quality and to revise the information itself.
We will debug the new system on the existing bases such as BeetleBase and wFleaBase, then
deploy more widely to the full bases of the Arthropod Base Consortium as it grows. The
software will be general enough to be widely applicable for genome databases. We will use
GMOD (Generic Model Organism Database) consortium as the distribution mechanism for our
literature curation software, to complement their existing software for sequence curation.
2. GENOME DATABASES AND BIOCURATION
The Curator Assistant will initially focus on arthropod genomes, as organisms of central
interest to NSF. At least half of the described species of living animals are arthropods (jointed
legs, mostly insects), species of great scientific interest for molecular genetics and evolutionary
synthesis. The Arthropod Base Consortium (ABC) has been meeting quarterly for the past 4
years, to discuss their needs for genome databases and data analysis. The inner circle has about
40 scientists, who hold workshops at the major community sites. The outer circle has about 400
scientists, who attend the Annual Symposium [www.k-state.edu/agc/symposium.shtml]. This
community consortium currently includes some 10 resource genomes, including insects
important biologically (bee, beetle, butterfly, aphid), crustaceans important ecologically (water
flea), and vectors important for human diseases (mosquito, tick, louse).
There is a reference genome for this community, the fruit fly Drosophila melanogaster, which
has been a genetic model for 100 years. As the model insect, Drosophila is important enough to
justify a 40-person staff at FlyBase, who manually curate this model organism database (MOD).
Through a close collaboration, the FlyBase literature curation process is serving as the model for
our semantic indexing of biological literature, see Figure 1 below.
The first wave of genomes covered the model genetic organisms; these MODs already had
Bases with human curators. For the arthropods, the only MOD is FlyBase for the insect
Drosophila melanogaster. The second wave of genomes did not have decades of genetics, but
were attempting to jumpstart with genome sequencing. For arthropods, these include the insects
honey bee and flour beetle, both important scientifically and agriculturally. The corresponding
bases, e.g. BeeBase and BeetleBase, were able to gain modest funding, but not for professional
curators, only for postdocs and programmers. Such resources thus went into annotating genes of
particular interest (small numbers) or support of automatic processing (large numbers).
With the third wave, the sequencing is still done at genome centers, but no attempts are made
at manual curation. These Bases, e.g. wFleaBase for Daphnia, spend their limited resources on
community annotation and computation. Beyond the third wave, the sequencing is being done at
campus centers rather than national centers and any curation is done automatically with quality
enhancement by the community itself. Within ABC, ButterflyBase and AphidBase are down this
path and will be working with our group as their genomes mature. The 10,000 arthropod
genomes expected in the next decade will all be in the post-curator era.
From a technology standpoint, this implies that the Curator Assistant must support variable
levels of quality because different bases from different waves will do different amounts of post-
assistant quality improvement. With many curators, the system should generate many candidates
that can be manually checked by human experts. With few curators, the system should generate
few candidates for manual checking, thus higher precision and lower recall. With no curators,
the system should generate highest precision “correct” entries, which are annotated by the
community itself using collaboration technology. In the preliminary work performed in the
BeeSpace project described below, we developed prototype services tuned towards recall and
towards precision, indicating feasibility of developing a fully tunable system for curation quality.
What’s in a Base: An Examination of FlyBase
Several of the fields in FlyBase use structured controlled vocabularies (also known as
ontologies). This makes it much easier (and more robust) to make links within the database, as
well as making it much easier to search the database for information. Moreover, several of these
controlled vocabularies are shared with other databases, and this provides a degree of integration
between them. The controlled vocabularies are only implemented in certain fields in FlyBase.
The initial literature selection is done at FlyBase at Cambridge University while the bulk of the
literature curation is done at FlyBase at Harvard University to populate the gene models in the
database from highlighted facts in the literature articles [8].
Controlled vocabularies currently used by FlyBase are [www.flybase.org]:
• The Gene Ontology (GO). This provides structured controlled vocabularies for the
annotation of gene products (although FlyBase annotates genes with GO terms, as a
surrogate for their products). The GO has three domains: the molecular function of gene
products, the biological process in which they are involved and their cellular component.
• Anatomy. A structured controlled vocabulary of the anatomy of Drosophila
melanogaster, used for the description of phenotypes and where a gene is expressed.
• Development. A structured controlled vocabulary of the development of Drosophila
melanogaster, used for the description of phenotypes and when a gene is expressed.
• The Sequence Ontology (SO). A structured controlled vocabulary for sequence
annotation, for the exchange of annotation data and for the description of sequence
objects in databases. FlyBase describes the genome in a consistent and rigorous manner.
All of these structured controlled vocabularies are in the same format, that used by the Open
Biomedical Ontology group. This format is called the OBO format [www.obo.org].
These controlled vocabularies focus on the most important types of data for genome
databases, namely “gene”, “anatomy”, and types of “function” such as “development” [37]. The
factoids in the official database are relations on these datatypes, such as Interaction (gene-gene),
Expression (gene-anatomy), Function (gene-development). When a FlyBase curator records a
factoid, they also record the type of evidence that enables them to judge its correctness. The list
for genes is given below. Note this implies that even manual curation includes factoids of
different qualities: whether a relation is accepted as true depends on the level of evidence chosen.
The Gene Ontology Guide to GO Evidence Codes contains comprehensive descriptions of
the evidence codes used in GO annotation. FlyBase uses the following evidence codes when
assigning GO data: inferred from mutant phenotype (IMP), inferred from genetic interaction
(IGI), inferred from direct assay (IDA), inferred from physical interaction (IPI), inferred from
expression pattern (IEP), inferred from sequence or structural similarity (ISS), inferred from
electronic annotation (IEA), inferred from reviewed computational analysis (RCA), traceable
author statement (TAS), non-traceable author statement (NAS), inferred by curator (IC), no
biological data available (ND). Note some of these are observational and some computational.
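The factoid datatypes and evidence codes above can be sketched as a simple relational record. The field names and example entries below are illustrative; FlyBase's actual schema (CHADO) is considerably richer.

```python
# Sketch of the factoid relations described above, tagged with GO evidence
# codes. Field names and example rows are illustrative, not FlyBase's schema.
from dataclasses import dataclass

@dataclass
class Factoid:
    relation: str   # "Interaction", "Expression", or "Function"
    gene: str       # subject gene
    target: str     # gene, anatomy term, or behavior/development term
    evidence: str   # GO evidence code, e.g. "IMP", "IGI", "IEA"

facts = [
    Factoid("Interaction", "wg", "en", "IGI"),              # gene-gene
    Factoid("Expression", "wg", "wing disc", "IEP"),        # gene-anatomy
    Factoid("Function", "per", "circadian rhythm", "IMP"),  # gene-behavior
    Factoid("Expression", "dpp", "leg disc", "IEA"),        # electronic only
]

# Whether a factoid is accepted depends on the evidence level chosen:
# here, a strict curation pass drops purely electronic annotations (IEA).
strict = [f for f in facts if f.evidence != "IEA"]
```

Filtering by evidence code is one concrete way the same database can serve both high-recall and high-precision consumers.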
3. CURATOR ASSISTANT SYSTEM
Biocuration [17] is the process of extracting facts from the biological literature to populate a
database about gene function. The curators at the Model Organism Databases (MODs) read
input papers from scientific literature relevant to their organism and extract facts judged to be
correct, which are then used to populate the structured fields of their genome database. There are
currently 10 reference genomes, each with their own group of curators. These groups are falling
behind, with the current scale of literature, and new resource genomes are being denied custom
curator support, due to financial limitations.
In the 5-year BeeSpace project just ending with NSF BIO FIBR funding, we have been
working closely with FlyBase curators to better understand what can be automated within the
biocuration process. We are fortunate in collaborating with John MacMullen from the Graduate
School of Library and Information Science, who specializes in studying the process of
biocuration by analyzing the detailed activities of MOD curators. He is analyzing the curator
annotations in FlyBase, among others, by examining which sentences are highlighted in the texts
and which database entries are inferred from these. Through the BeeSpace project, we also work
with the many curators at the FlyBase project under PI William Gelbart at Harvard University
and the few curators at the BeeBase project under PI Christine Elsik at Georgetown University.
The Group Manager at FlyBase-Cambridge (England), Steven Marygold, provided the Figure
below giving the steps in the FlyBase curation process. He spoke at the ABC working meeting in
December 2007 hosted at our project home site in the Institute for Genomic Biology at the
University of Illinois, slides at www.beespace.uiuc.edu/files/Marygold-ABC.ppt.
Figure 1. FlyBase Literature Curation Process Diagram [27].
The automatic process set up in the Curator Assistant is modeled after this manual process.
The user could be a full biocurator or could be a community member research biologist, thus
differently tuning the system to their needs. They search the literature to choose articles. The
manual curator can only choose tens of articles to skim, but the assisted curator can choose
thousands of articles to be automatically skimmed. The BeeSpace system that the Curator
Assistant leverages contains powerful services for choosing collections well targeted to the
particular purpose, including searching and clustering. The major strength of the automatic
system is breadth: it can cover a much wider selection of the available literature than can
humans. In demonstrating the prototype to many curators at the Arthropod Genomics
Symposium, even the most professional curators spoke longingly of having an automatic system
to filter candidates, in order to cope with the full range of biological literature.
The Curator Assistant will focus on the middle of the diagram, the central core of the curation
process. This process highlights the curatable material and then performs curation: this is
basically finding sentences with functional information and extracting the facts that are described
by the functional sentences. For example, two genes interact with each other (Interaction), a
gene is expressed in a specific part of the anatomy (Expression), a gene regulates a particular
behavior (Function). Key information is usually contained within the abstract, which is why our
current services are effective, even though they cover only Medline and Biological Abstracts.
The manual curators have the advantage of reading the fulltext, so we will be also gathering
fulltext systematically for our community, through the collaboration technology described below.
For the bottom of the diagram, the Curator Assistant will also support error checking of
different kinds by the community curators themselves and by the community biologists
themselves, as described in the later section on Community Annotation and Curation. Finally,
through an arrangement with the GMOD consortium (Generic Model Organism Database
software), who support the GBrowse genome sequence displayer and the CHADO database
schema format, we will be distributing our literature infrastructure software to the broader
genome community to supplement the existing sequence infrastructure software. The concluding
section below on Organization and Schedule contains further details on GMOD relations.
The underlying system uses natural language processing to extract relevant entities and
relations automatically from relevant literature. An entity is a noun phrase representing a
community datatype, e.g. gene name or body part. A relation is a verb phrase representing the
action performed by an entity, e.g. gene A regulates behavior B in organism C. Many projects
extract entities and relations, using template rules for a particular domain. The BeeSpace project
pioneered trained adaptive entity recognition, where sample sentences are used to train the
recognizer for particular entities with high accuracy and software adapts the training to related
domains automatically [18,19] and we will be leveraging off this NSF BIO project, which ends
in August 2009 before the proposed project would begin. We also leverage off our previous
NSF research in digital libraries on interactive support for literature curation [4,22].
The first prototype within the BeeSpace system has already become a production service, with
streamlined v4 interface available at www.beespace.uiuc.edu. The Gene Summarizer was the
subject of an accepted plenary talk at the 2nd International Biocurator Meeting in San Jose in
October 2007 [34]. The Gene Summarizer has two stages: the first highlights the curatable
materials while the second curates these materials in a usable interactive form [25,26]. The
highlighting is tuned for recall, so that sentences containing gene names are automatically
extracted from the literature abstracts, where the entity “gene” is broadly recognized, including
genes, proteins, and gene-like descriptions. The curation is simpler than what is proposed for
the Curator Assistant but is very effective for practicing biologists who use the interactive
system, where each gene sentence is placed automatically into a functional category.
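The recall-tuned highlighting stage described above can be sketched as follows. The gene-name pattern here is a deliberately broad, invented illustration of "gene-like" recognition, not the trained recognizer used in BeeSpace.

```python
# Sketch of the recall-tuned highlighting step: keep any sentence containing
# a gene-like token. The pattern is an illustrative stand-in for the trained
# recognizer, matching symbol-number tokens and short names before "gene".
import re

GENE_LIKE = re.compile(r"\b([A-Za-z]+-?\d+|[a-z]{2,4}\b(?=\s+(?:gene|protein)))")

def highlight(abstract):
    """Split an abstract into sentences; return those with gene-like mentions."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract)
    return [s for s in sentences if GENE_LIKE.search(s)]

abstract = ("The dpp gene is expressed in the wing disc. "
            "Flies were reared at 25C. "
            "Mutations in Ubx-1 alter segment identity.")
hits = highlight(abstract)
```

Tuning for recall means erring toward keeping sentences: the middle sentence, with no gene-like token, is the only one dropped here.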
The first version of this service used a machine learning approach that was trained on the
curator generated sentences from FlyBase, explaining why the curator had entered a particular
factoid into FlyBase relational database. PI Schatz of BeeSpace then visited PI Gelbart of
FlyBase at Harvard and observed the curator process at length. A reciprocal visit by a FlyBase
curator, Sian Giametes, to BeeSpace refined the automatic process and the functional categories.
We then also did specific training with new sentences judged by bee biologists at University of
Illinois and beetle biologists at Kansas State University. A subsequent version was developed
using this training with much higher accuracy than previous dictionary-based versions.
Figures 2 and 3 give examples of using the Gene Summarizer with this insect training on a
Drosophila fly gene and on a Tribolium beetle gene. There are more fly papers than beetle
papers, so the number of highlighted sentences is naturally greater. The functional categories
are: Gene Products (GP), Expression Location (EL), Sequence Information (SI), Wild-type
Function & Phenotypic Information (WFPI), Mutant Phenotype (MP), Genetic Interaction (GI).
Figure 2. Gene Summarization for Automatic Curation on FlyBase collection.
Figure 3. Gene Summarization for Automatic Curation on BeetleBase collection.
4. CURATOR ASSISTANT PROTOTYPE
After integrating the Gene Summarizer in BeeSpace v3, we developed a prototype BeeSpace
v5 that specifically extracted entities and relations from literature. This has deeper curation,
recognizing within a highlighted sentence what entities and relations are mentioned. The
extractors were tuned for precision to produce “correct” factoids, rather than the previous
extractors that were tuned for recall to produce comprehensive coverage of all entities present.
From this, it became clear that the level of precision and recall was a tunable feature of machine
learning and thus it would be feasible to support varying qualities for different purposes.
The precision v5 system was an important prototype for the Curator Assistant, as it showed
that accurate automatic extraction was technically possible. The first version leveraged the
relations within FlyBase and was run on the Drosophila collection of standard articles that we
obtained through collaboration from FlyBase at Indiana University where the software
development is done. The high precision used disambiguation algorithms that enabled
identification of which gene was mentioned. For v3 recall, “wingless” was a particular text
phrase, but for v5 precision, the same word was resolved to a particular gene number. Thus,
accurate linkouts became possible: a recognized gene entity can jump directly to the FlyBase
gene entry for that name, and an anatomy entity can jump directly to the FlyBase anatomical hierarchy.
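The linkout step can be sketched as a lookup from a disambiguated surface phrase to a database identifier and then a URL. The FBgn identifier and the URL pattern below are illustrative assumptions, not guaranteed to match current FlyBase entries.

```python
# Sketch of linkout generation once disambiguation has mapped a surface
# phrase to a gene number. The ID and URL pattern are illustrative.

# Hypothetical disambiguation result: surface phrase -> FlyBase gene ID.
GENE_IDS = {"wingless": "FBgn0284084", "wg": "FBgn0284084"}

def linkout(phrase):
    """Return a database-entry URL for a recognized gene phrase, or None."""
    fbgn = GENE_IDS.get(phrase.lower())
    if fbgn is None:
        return None  # unresolved mention: no linkout, leave as plain text
    return "http://flybase.org/reports/" + fbgn
```

Precision extraction is what makes this safe: a linkout is only generated when the phrase has been resolved to a unique gene number.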
Figure 4 contains a sample output from the v5 prototype on the Drosophila fly collection.
Multiple word phrases are recognized correctly for gene in green, for anatomy in orange, for
behavior in blue, and for chemical in yellow. (Tags are correct if this figure displayed in color.)
Anatomy is dictionary-based, just like gene, using the FlyBase anatomy terms as the base. The
function terms in the categories of behavior and chemical were extracted using heuristics of
certain key words. There was another set of function terms for development, the other category
used in FlyBase, but not many terms were identified with our simple heuristics. Figure 5 shows that
the recognized gene is linked to its corresponding correct gene database entry in FlyBase.
In the proposed project, for entities, we will focus on gene, anatomy, and function
(combining behavior, anatomy, development). For relations, we will focus on different
combinations of these such as Interaction (Gene-Gene), Expression (Gene-Anatomy), Function
(Gene-Behavior etc). We will leverage existing resources for dictionary generation, such as gene
names from NCBI Entrez Gene [www.ncbi.nlm.nih.gov/sites/entrez?db=gene] and anatomy
names from FlyBase [http://flybase.org/static_pages/anatomy/glossary.html]. The relational
indexes in Biological Abstracts include gene and anatomy, providing a rich source of entities
tagged by human curators from phrases in biological literature. FlyMine [www.flymine.org] is a
rich source of query relations, including multistep inferences extracted from FlyBase. We will
also leverage available resources to obtain training data or pseudo training data. In particular,
BioCreative studies [16,29] have resulted in a valuable training set, which we have already used
in gene recognition. Fixed template systems such as Textpresso [30] have hand-generated rules
useful for constructing features in our learning-based framework.
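Dictionary-based entity tagging of the kind described above can be sketched in a few lines. The dictionaries here are tiny illustrative stand-ins for the compiled resources (e.g., Entrez Gene names, FlyBase anatomy terms).

```python
# Sketch of dictionary-based entity tagging, assuming dictionaries have
# already been compiled from resources such as Entrez Gene and FlyBase
# anatomy. The entries below are illustrative, not the real dictionaries.

DICTIONARIES = {
    "gene": {"wingless", "engrailed", "hedgehog"},
    "anatomy": {"wing disc", "antenna", "eye"},
}

def tag_entities(sentence):
    """Return (entity_type, phrase) pairs found by longest-first lookup."""
    text = sentence.lower()
    tags = []
    for etype, terms in DICTIONARIES.items():
        # Try longer terms first so "wing disc" wins over any shorter overlap.
        for term in sorted(terms, key=len, reverse=True):
            if term in text:
                tags.append((etype, term))
    return tags
```

Community curators supplementing these dictionaries with local gene or anatomy names amounts to adding entries to the sets above.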
For the proposed project, we plan to do extensive training to improve the precision of the
dictionaries and of the heuristics, to automatically identify sentence slots for particular entities.
This process greatly improved our previous efforts for entity summarization, as discussed above.
To achieve better results, the community curators can supplement the dictionaries with local
gene names or anatomy names. The next section is a technical discussion of the training
procedures and how such tuning can be feasibly implemented.
Figure 4. Preliminary Work from BeeSpace Prototype v5. Interactive System for
Entity Relations using FlyBase relational database for leverage, with live linkouts.
Figure 5. FlyBase Gene entry (manual) linked to from Curator Assistant (automatic).
We have tried running the Drosophila trained v5 extractors on Tribolium literature, since few
beetle genes have direct names but commonly use the fly gene names. The anatomy is also not
identical but similar in many ways. This process sometimes produces good results as shown in
Figure 6. This version is the initial attempt at a general system for arthropods using prototype
classification: the closer the organism is to the prototype fly, the more accurate the recognition.
Figure 6. Entity Relation v5 on Beetle Tribolium literature. This still uses the FlyBase
training, so it is not as accurate as a trained system would be, but it still produces some useful outputs.
We are currently extracting from a large insect collection from the Biological Abstracts
database. PI Schatz is giving an invited lecture in December 2009 at the annual meeting of the
ESA Entomological Society of America on "Computer support for community knowledge:
information technologies for insect biologists to automatically annotate their molecular
information" and will demonstrate the evolved version of this prototype. coPI Gilbert is giving
an invited talk in the same session on Integrative Physiological and Molecular Insect Systems.
He works on the arthropod water flea, a good test of machine learning for anatomy entities.
PROJECT SCHEDULE FOR CURATOR ASSISTANT
Year 1. Develop v1 leverage FlyBase (base BeeSpace v5). Deploy to BeetleBase.
Year 2. Develop v2 with Trained Recognizers. Deploy to BeetleBase and wFleaBase.
Year 3. Develop v3 with Community Curation. Deploy to entire ABC including Hymenoptera
and Lepidoptera genome databases without curators, and VectorBase, which has curators.
5. ENTITY RELATION EXTRACTION
This project proposes that it is feasible to apply advanced machine learning and natural
language processing techniques to extract various biological entities and relations with tunable
extraction results in a sustainable way through leveraging the increasing amount of training data
from annotations naturally accumulated over time. This sustainability is illustrated in Figure 7.
The main technical component is the trainable and tunable extractor. This extractor can
automatically process large amounts of literature and identify relevant entities and relations
that can become candidate factoids for curation. The extracted results would then be validated
by human curators or anyone with appropriate expertise for validation. The validated results
can be incorporated into structured databases for researcher query or analysis tools to further
process. The growing amount of validated entities and relations naturally serves as additional
training data for the extractor, leading to “organic” improvement of extraction performance
over time.
Figure 7. Extraction Process for Assistant, where Curator tunes the Dictionaries and the Training.
The extractor is trainable due to the use of a machine learning approach to extraction as
opposed to the traditional rule-based approaches. This means that the extractor can learn over
time from the human-validated extraction results to improve its extraction accuracy; the more
training data we have, the better the accuracy of extraction will be. Thus, as we accumulate
more entities and relations, the Curator Assistant will become increasingly intelligent and
powerful, able to replace more of the human labor, and the extractor will become increasingly
scalable, handling large amounts of literature automatically.
The extractor is tunable due to a combination of high-precision techniques such as
dictionary lookup and rule-based recognition with high-recall enhancement from statistical
learning. Informally, our idea is that we can first use dictionary lookup and/or rule-based
methods to obtain a small amount of highly accurate extraction results and then feed these results
as (pseudo) training data to a learning-based extractor to train the extractor to extract more
results, thus increasing recall. A learning-based extractor also generally has parameters to control
the tradeoff of precision and recall, making it possible to tune the system to output either fewer
results with higher precision or more results with higher recall but potentially lower precision.
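The bootstrapping idea described above, where high-precision dictionary matches seed a learning-based extractor, can be sketched as pseudo-labeling. The dictionary and candidate phrases below are invented for illustration.

```python
# Sketch of the bootstrapping step: dictionary hits become (pseudo) positive
# training examples for the learning-based extractor. Entries are illustrative.

GENE_DICT = {"wingless", "engrailed"}

def pseudo_label(phrases):
    """Label each candidate phrase 1 if the dictionary confirms it, else 0.

    Dictionary misses are treated as negatives only for bootstrapping; the
    trained extractor is then free to recall genes the dictionary lacks,
    which is how recall is increased beyond pure dictionary lookup."""
    return [(p, 1 if p.lower() in GENE_DICT else 0) for p in phrases]

training_data = pseudo_label(["wingless", "the fly", "engrailed", "at 25C"])
```

These pairs can then feed the classifier training described in the next section.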
This trainable and tunable extractor will be implemented based on a general learning
framework for information extraction, in which all resources, including dictionaries, human-
generated rules, and existing annotations, can be integrated in a principled way. The basic idea of
using machine learning [1] for extraction is to cast the extraction problem as a classification
problem. For example, for entity extraction, the task would be to classify a candidate phrase as
either being a particular type of entity (e.g., gene) or not, while for relation extraction, the
classification task can be to classify a sentence as either containing a particular relation (e.g.,
gene interaction) or not. The prediction is based on a function that combines various features that
describe an instance (i.e., a phrase or a sentence) in a weighted manner. For example, for gene
prediction, features can include every possible clue that can potentially help make the
prediction. Features can be local syntactic features, such as whether the phrase has capitalized
letters, whether there are parentheses or Greek letters, or whether there is a hyphen, or contextual
features, such as whether the word “gene” or “expressed” occurs in a small window around the
phrase. These features can be combined to generate a score as the basis for the prediction. The exact
way to combine the features and to make the decision would vary from method to method [1].
For example, a commonly used effective classifier is based on logistic regression [1,18]. It
works as follows. Let X be a candidate phrase and f1(X), f2(X), …, fk(X) be k feature values
computed on X; e.g., f1(X)=1 (or 0) can indicate that the first letter of X is (or not) capitalized.
Let Y ∈{0,1} be a binary variable indicating whether X is a gene. The logistic regression
classifier assumes that Y and the features are related through the parameterized function:
p(Y = 1 | X, β1, …, βk) = exp(∑i βi fi(X)) / (1 + exp(∑i βi fi(X))) ∝ exp(∑i βi fi(X)),   where the sum runs over i = 1, …, k,
and the β’s are parameters, learned from training data, that control the weights on the features.
Given any instance X, we can use the formula above to compute p(Y=1|X), and thus can
predict X to be a gene if p(Y=1|X)> p(Y=0|X) (i.e., p(Y=1|X)>0.5), and a non-gene otherwise.
The training data will be of the form of a pair (Xj, Yj) where Xj is a phrase and Yj ∈{0,1} is the
correct prediction for Xj; thus a pair like (Xj, Yj=1) would mean that phrase Xj should be predicted
as a gene, while a pair like (Xj, Yj=0) would mean that phrase Xj should be predicted as not a
gene. In general, we will have many such training pairs, which tell us the expected predictions
for various instances. With a set of such training data {(Xj, Yj)}, j=1,…,n, in the training phase,
we would optimize the parameters (i.e., β’s) to minimize the prediction errors on the training
data. Intuitively, this is to figure out the best settings for these β’s so that ideally for all training
pairs where Yj=1, p(Yj=1| Xj) would be larger than 0.5, while for those where Yj=0, p(Yj=1| Xj)
would be smaller than 0.5.
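As a toy illustration of this training phase, the logistic regression classifier above can be sketched in a few lines of pure Python. This is a gradient-ascent sketch under our own simplifications, not the project's implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(pairs, k, lr=0.5, epochs=200):
    """Fit the weights beta_1..beta_k by gradient ascent on the
    log-likelihood of training pairs (X_j, Y_j), where each X_j is a
    length-k list of feature values f_i(X) and Y_j is 0 or 1."""
    beta = [0.0] * k
    for _ in range(epochs):
        for x, y in pairs:
            p = sigmoid(sum(b * f for b, f in zip(beta, x)))
            for i in range(k):
                beta[i] += lr * (y - p) * x[i]  # gradient of the log-likelihood
    return beta

def predict(beta, x, threshold=0.5):
    """Predict Y=1 (gene) when p(Y=1|X) exceeds the threshold; raising
    or lowering the threshold trades precision against recall."""
    return sigmoid(sum(b * f for b, f in zip(beta, x))) > threshold

# Toy training data: feature vector = [bias, has_capital]
pairs = [([1, 1], 1), ([1, 1], 1), ([1, 0], 0), ([1, 0], 0)]
beta = train_logreg(pairs, k=2)
```

Note that the `threshold` parameter is exactly the tuning knob mentioned earlier: a higher cutoff yields fewer, higher-precision predictions, a lower one yields higher recall.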
Although we used gene prediction as an example to illustrate the idea of this kind of learning
approach, it is clear that the same method can be used for recognizing other entities as well as
relations if X is a candidate sentence and Y indicates whether a certain relation is expressed in X.
There are many other classifiers [1] such as SVM and k-nearest neighbors that we can also use;
they all work in a similar way – using training data to optimize a combination of features for
making a prediction.
A significant advantage of such a learning-based approach over the traditional rule-based
approach (as used in, e.g., the Textpresso system [30]) is that it can keep improving its
performance through leveraging the naturally growing curated database as training data, thus
gradually reducing the need for human effort over time. Indeed, such supervised learning
methods have already been applied successfully for information extraction from biology
literature (see, e.g., [3,9,12,28,35,36,43] ) and many other tasks such as text categorization and
hand-written character recognition.
Such a learning-based method relies on the availability of two critical resources: (1) training
data; (2) effective computable features. The more training data and the more useful features
we have, the higher the extraction accuracy will be. Unfortunately, these two resources
are not always readily available to us. Below we discuss how we can apply advanced machine
learning and NLP techniques to solve these two challenges.
Insufficient training data: All human-generated annotations are naturally available high-
quality training data, but for a new genome, we may have few or no annotations available,
creating a problem of “cold start”. We solve this problem using three strategies:
1. “Borrow” training data from related model organisms that have already been well annotated
through the use of domain adaptation techniques [18,19,20]. For example, our previous work
shows that cross-domain validation (emphasizing features that work well across multiple
domains) can lead to an improvement in the accuracy of extracting genes from a BioCreative test
set [16] by up to 40% [18].
2. Bootstrap with a small number of manually created rules to generate pseudo training
examples (e.g., by assuming that all cases matched by a rule are correct predictions). This
is a generally powerful idea for improving recall, and thus can be expected to be very useful when we
want to tune toward high recall starting from high-precision results. For example, a small set of
human-generated rules can be used for extraction with high accuracy; the generated high
precision results can then be used to train a classifier, which would be able to augment the
extraction results to improve recall. In our previous study, this technique has also been shown to
be very effective when combined with domain adaptation [20].
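The bootstrapping step might look like the following sketch, in which a single hypothetical high-precision rule generates pseudo-labeled examples from unlabeled sentences. The rule pattern and the crude negative-sampling heuristic are ours, for illustration only:

```python
import re

# Hypothetical high-precision rule: a phrase in "the <X> gene" is
# assumed to be a correct positive prediction.
RULE = re.compile(r"the ([A-Za-z0-9-]+) gene")

def pseudo_label(sentences):
    """Turn rule matches over unlabeled text into (phrase, label)
    pseudo-training pairs for a learning-based extractor."""
    examples = []
    for s in sentences:
        hits = set(RULE.findall(s))
        for tok in s.replace(".", "").split():
            if tok in hits:
                examples.append((tok, 1))   # pseudo-positive from the rule
            elif tok.islower() and len(tok) > 3:
                examples.append((tok, 0))   # crude pseudo-negative
    return examples
```

The resulting pairs can then train a classifier that generalizes beyond the rule, improving recall.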
Figure 8 shows some sample results from using the
pseudo training data automatically generated from
entries in a FlyBase table for genetic interaction
relation recognition. Different curves correspond to
using different combinations of features. The best
performing curve uses all the words in a sentence
as features. Note that this top curve also shows that
it is possible to tune the extractor to produce either
high-precision low-recall results or low-precision
high-recall results by applying a different cutoff
threshold to a ranked list of predictions.
Figure 8. Relation Extractor with Tunable Precision-Recall depending on thresholds.
3. In the worst case, we will resort to human annotators to generate a small number of high-
quality training examples with minimum effort using active learning techniques, which allow us
to choose the most useful examples for a human annotator to work on so as to minimize human
effort. The basic idea is to ask a human expert to judge a case about which our classifier is most
uncertain; we can expect the classifier to learn the most from the correct prediction for such
uncertain cases. There are many active learning techniques that we can apply [7,10,40].
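The simplest such technique, uncertainty sampling, can be sketched as follows; this is an illustrative stand-in for, not an implementation of, the cited methods [7,10,40]:

```python
def most_uncertain(candidates, prob, n=3):
    """Uncertainty sampling: pick the n unlabeled instances whose
    predicted probability p(Y=1|X) is closest to 0.5 -- the cases the
    current classifier is least sure about -- for a human to label next.
    `prob` maps an instance to p(Y=1|X) from the current classifier."""
    return sorted(candidates, key=lambda x: abs(prob(x) - 0.5))[:n]
```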
Insufficient effective features: Some entities and relations are easier to extract than others; for
example, organism names are easier to extract than gene names because the former are restricted to a
closed vocabulary while the latter are not. For most entities, we expect that the standard
features defined based on surface forms of words and contextual words around a phrase would
be sufficiently effective for prediction. However, for difficult cases, we may need to extend the
existing feature construction methods to define and extract additional effective features for a
specific entity or relation. We will solve this problem using two strategies:
1. Systematically generate more sophisticated linguistic features based on syntactic and semantic
structures (e.g., dependency relations between words determined by a parser). To improve the
effectiveness of features, it is useful to consider more discriminative features than words. To this
end, we will parse text to obtain syntactic and semantic structures of sentences and
systematically generate a large space of linguistically meaningful features that can potentially
capture more semantic relations and are more discriminative. In our previous study [21], we have
proposed a graph representation that enables systematic enumeration of linguistic features, and
our study has found that using a combination of features of different granularity can improve
performance for relation extraction. In this project, we will apply this methodology to enable the
classifier to work on a large space of features.
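To illustrate features of different granularity, the sketch below enumerates features from a dependency parse at three levels. The input format is a toy stand-in for the graph representation of [21]; any dependency parser could supply it:

```python
def dependency_features(tokens, edges):
    """Enumerate features of decreasing granularity from a dependency
    parse. `tokens` is a list of (word, pos) pairs; `edges` is a list
    of (head_index, dependent_index, relation) triples."""
    feats = set()
    for head, dep, rel in edges:
        hw, hp = tokens[head]
        dw, dp = tokens[dep]
        feats.add("word:%s-%s-%s" % (hw, rel, dw))  # finest: lexicalized
        feats.add("pos:%s-%s-%s" % (hp, rel, dp))   # coarser: parts of speech
        feats.add("rel:%s" % rel)                   # coarsest: relation alone
    return feats
```

Combining all three levels lets the classifier back off from sparse lexical features to the more general syntactic ones.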
2. Involve human experts in the loop of learning so that when the system makes a mistake, the
expert can pinpoint the exact feature responsible for the error; this way, the system can
effectively improve the quality of features through human feature supervision.

Figure 9. Sample gene name disambiguation results.

For example, in some previous experiments, we have discovered that dictionary-based approaches
to gene name recognition are unable to distinguish a gene abbreviation such as “for” from the
common preposition word “for”.
Thus if we just add a feature to the classifier to indicate whether the phrase occurs in a
dictionary, we may potentially misrecognize a preposition like “for” as a gene name. To solve
this problem, we designed a special classifier targeted at disambiguating such cases based on the
distribution patterns of words in the nearby text. The results in Figure 9 show that this technique
can successfully distinguish all the occurrences of “foraging” and “for” (the numbers are the
scores given by the classifier; a positive number indicates a gene, while a negative number a
non-gene). The output from such a disambiguation classifier can be regarded as a high-level
feature that can be fed into a general gene recognizer to tune the classifier toward high precision.
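A minimal sketch of such a context-distribution disambiguator follows. The cue word lists here are invented for illustration; a real system would learn these word distributions from labeled context windows.

```python
# Hypothetical cue lists, for illustration only.
GENE_CUES = {"mutant", "allele", "expression", "foraging", "locus"}
COMMON_CUES = {"the", "of", "a", "in", "to"}

def disambiguate(index, tokens, window=3):
    """Score an ambiguous token such as "for": a positive score means
    the nearby word distribution looks like gene context, while a
    negative score means ordinary prose (e.g., the preposition)."""
    lo = max(0, index - window)
    context = tokens[lo:index] + tokens[index + 1:index + 1 + window]
    score = 0
    for w in context:
        if w.lower() in GENE_CUES:
            score += 1
        if w.lower() in COMMON_CUES:
            score -= 1
    return score
```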
Note that we take a very broad view of features, which makes our framework quite general.
Thus, in addition to leveraging all kinds of training data, we can also incorporate a variety of
other useful resources such as dictionaries and human-generated rules through defining
appropriate features (e.g., a feature can correspond to whether an instance matches a particular
rule or an entry in a dictionary), effectively leveraging results from existing work. Extraction of
entities and relations from biomedical literature has been studied extensively
(see, e.g., [3,5,6,9,11-15,23-24,28,30-31,35,36,38-39,41-43]), including in our own previous work
(e.g., [18-21]). Our framework would enable us to leverage and combine the findings and
resources from all these previous studies to perform large-scale information extraction. For
example, we can obtain a wide range of useful features from previous work and various
strategies for optimizing extraction accuracy.
6. COMMUNITY ANNOTATION and CURATION
The Community itself will eventually have to take over the curator role, with interactive
analysis enabling scientists to use the infrastructure to infer biological functions and
semantic relationships. Today's new genome projects are efforts contributed by many experts
and students, supported and enabled by distributed data sets, wiki project notebooks, genome
maps, annotation and search tools. These projects are not supported in a monolithic way, but via
contributions from biologists at nearly as many institutions as there are labs, numbering in the hundreds.
For example, more than 400 biologists contributed gene annotations to the Daphnia genome
[17]. As this matches the scale of attendance at the Arthropod Genomics Symposium, yet covers a
single arthropod, the number of potential contributors to ArthropodBaseConsortium
annotations clearly runs into the tens of thousands. Each of these is a potential curator, given
effective Curator Assistant infrastructure. See the Collaboration Wikis for the Daphnia
Genomics Consortium [https://dgc.cgb.indiana.edu/display/DGC/] and for Aphid Genomics
Consortium [https://dgc.cgb.indiana.edu/display/aphid/] for arthropod genomes examples.
This is a new model of sustainable scientific activity, with cost-effective collaboration via
widely adopted cyberinfrastructure. Experts and students in focus areas are actively involved,
and contribute according to their means and interest. They join from disparate areas of basic and
applied sciences, educational, governmental, and industry centers (e.g. Daphnia and Aphid
genomes involve EPA and USDA agencies, agricultural and environmental businesses).
We will develop infrastructure to address collaboration support for community annotation.
By providing tunable quality for biological factoids, we provide an automatic system to filter the
literature for curatable knowledge. In current gene annotation systems, such as Apollo
distributed by GMOD, the curator is presented with a blank form in which to write a gene
description. In the Curator Assistant, curators are presented with candidate suggestions, thus greatly
expanding the number of persons who can serve as effective curators. We will also provide
mechanisms for the community to enter their own documents as published into the base
collections for the system, yielding a rich source of full-text articles, and to directly provide their
own factoids from their articles, without the inaccuracy of automatic entity-relation extraction.
Currently, the most popular collaboration tools are wikis. While a wiki excels at simplicity
and flexibility, it lacks validation tools, rich indexing and social instrumentation. We propose to
develop structured social instrumentation for collaborative research environments, including
collaborative curation. In particular, our systems will allow users to offer confidence ratings for
human annotations and for various automated metadata extracts presented to the users. The
users themselves will gain expert status when their annotations receive high confidence ratings.
These ratings and rankings will allow researchers to share expertise and enhance the precision of
automated annotation systems in a mutually beneficial way, with secure transactions.
A relevance rating system will be integrated in the basic functioning of the system itself.
Every view of information (entities, relations, abstracts, document lists) will also include
checkboxes to up-rate or reject/dismiss any listed elements. For example, community members
can judge the quality of the factoids viewed during their usage of the system. Items which are
selected and viewed receive increased relevance ratings. Data items which are
dismissed/rejected are down-rated in relevance and/or validity. The rating system is not
optional: It is transparently embedded within the user experience, which is key to its success.
This model of relevance feedback and validity ratings embedded within the core system has
proven effective in popular commercial social network systems such as YouTube and LastFM.
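The embedded rating model can be sketched minimally as follows. The class name and the score increments are illustrative assumptions, not a specification from this proposal:

```python
class RelevanceRatings:
    """Toy sketch of the embedded rating model: items gain relevance
    when selected or viewed and lose it when dismissed or rejected."""

    def __init__(self):
        self.scores = {}

    def viewed(self, item):
        self.scores[item] = self.scores.get(item, 0) + 1   # up-rate on view

    def dismissed(self, item):
        self.scores[item] = self.scores.get(item, 0) - 2   # down-rate on reject

    def ranked(self):
        """Items ordered from most to least relevant."""
        return sorted(self.scores, key=self.scores.get, reverse=True)
```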
7. PROJECT ORGANIZATION AND SCHEDULE
Our project has been organized via the annual Symposium of the Arthropod Base Consortium
(ABC). This is sponsored by the Arthropod Genomics Center at Kansas State University with
coPI Brown as Director. There have been 3 symposia held thus far in Kansas City, drawing
300-400 attendees, generally representatives of their research laboratories or genome projects.
http://www.k-state.edu/agc/symp2009/ The steering committee for the ABC meets after the
workshop to plan community support; this proposal grew out of these planning meetings.
There have also been specific meetings of the inner circle, 30-40 attendees, once or twice a
year at the main infrastructure sites such as FlyBase. The BeeSpace project hosted the one in
December 2007 at the University of Illinois; the slides for this workshop are at
http://www.beespace.uiuc.edu/groups_abc.php . The investigators for this proposal each spoke
at this meeting, along with the Head of Literature Curation for FlyBase Cambridge. The
proposed project will host a budgeted annual specialty workshop to plan Curator Assistant.
The genome databases being used as test models in this project have already bypassed the
use of professional curators. They arrive later than the post-MOD wave of organisms such as honey
bee, for which a case for a few curators was eventually successful after many grant attempts. So
BeetleBase for Tribolium the flour beetle and wFleaBase for Daphnia the water flea employ a
few biologists and programmers to help with sequencing support and computational pipelines.
The coPIs who lead the bioinformatics for these, respectively Susan Brown and Donald Gilbert,
are influential proponents of the new paradigm for community curation via annotation software.
This proposal is concerned with developing an effective Curator Assistant and testing it as it
evolves to full utility. The infrastructure investigators will develop the software infrastructure, with
Schatz leading the informatics system development and Zhai leading the computer science
research. These were the same roles they played in the BIO FIBR BeeSpace project, which
developed interactive services for functional analysis using computer science research. The
bioinformatics investigators will serve as the initial users; each is the lead for the informatics of a
major community of arthropod biologists with several hundred community members. Tribolium
is an insect close to Drosophila, while Daphnia is a non-insect arthropod far from Drosophila.
The close BeeSpace collaboration with FlyBase will be continued, with both the curator site at
Harvard with PI Bill Gelbart and the software site at Indiana with PI Thom Kaufmann.
Deployment to the full ABC and beyond will begin towards the end of the project. The groups
already identified coordinate multiple related databases. They will be the wave of deployment
after the investigator organisms are effectively using the Curator Assistant. Their coordinators
have expressed great interest while serving on the ABC steering committee. NIH-supported
VectorBase has many curators for mosquitos and ticks, USDA-supported HymenopteraBase has
few curators for bees and wasps, and LepidopteraBase has no curators for butterflies and moths.
There is also an international collaboration for AphidBase hosted at INRA in France.
The GMOD (Generic Model Organism Database) consortium is a bioinformatics group that
provides common infrastructure for over 100 genome projects, including all the ABC genomes
[www.gmod.org/wiki/GMOD_Users]. We have presented our preliminary software at GMOD
meetings [32], using RESTful protocols for linking Genome Browser to Gene Summarizer, and
made arrangements with the coordinator Scott Cain to link our software into GMOD for mass
distribution, during extensive conversations at the GMOD meetings and the ABC meetings. So
the Curator Assistant will become the literature infrastructure for ABC, just as GBrowse is the
sequence infrastructure, and through GMOD made available to the genome biology community.
References Cited
[1] Bishop C (2007) Pattern Recognition and Machine Learning, Springer, 2007.
[2] Buell J, Stone D, Naeger N, Fahrbach S, Bruce C, Schatz B (2009) Experiencing BeeSpace:
Educational Explorations in Behavioral Genomics for High School and Beyond, AAAS Annual
Symposium, Chicago, Feb 2009. curricular materials at www.beespace.uiuc.edu/ebeespace
[3] Chang J, Schutze H, Altman R (2004) GAPSCORE: finding gene and protein names one
word at a time, Bioinformatics, 20(2):216-25.
[4] Chung Y, Pottenger W, Schatz B (1998) Automatic Subject Indexing using an Associative
Neural Network, 3rd Int ACM Conf on Digital Libraries, Pittsburgh, PA, Jun, pp 59-68.
Nominated for Best Paper award.
[5] Cohen A (2005) Unsupervised gene/protein entity normalization using automatically
extracted dictionaries, Proc BioLINK2005 Workshop Linking Biological Literature,
Ontologies and Databases: Mining Biological Semantics. Detroit, MI: Association for
Computational Linguistics; 2005:17-24.
[6] Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I (2004) Extracting
human protein interactions from MEDLINE using a full-sentence parser, Bioinformatics,
20(5):604-11, 2004.
[7] Dasgupta S, Tauman Kalai A, Monteleoni C (2005) Analysis of perceptron-based active
learning, Proceedings of COLT 2005, 249-263, 2005.
[8] Drysdale R, Crosby M, FlyBase Consortium (2005) FlyBase: genes and gene models,
Nucleic Acids Research, 33:D390-D395, Database Issue, doi:10.1093/nar/gki046.
[9] Finkel J, Dingare S, Manning C, Nissim M, Alex B, Grover C (2005) Exploring the
boundaries: gene and protein identification in biomedical text, BMC Bioinformatics, 6 Suppl
1(NIL):S5, 2005.
[10] Freund Y, Seung H, Shamir E, Tishby N (1997) Selective sampling using the query by
committee algorithm, Machine Learning, 28(2-3):133-168.
[11] Fukuda K, Tamura A, Tsunoda T, Takagi T (1998) Toward information extraction:
identifying protein names from biological papers, Pac Symp Biocomput, NIL(NIL):707-18.
[12] Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L, Leser U, Scheffer T (2005)
Systematic feature evaluation for gene name recognition, BMC Bioinformatics, 6 Suppl
1(NIL):S9, 2005.
[13] Hanisch D, Fundel K, Mevissen H, Zimmer R, Fluck J (2005) ProMiner: rule-based protein
and gene entity recognition, BMC Bioinformatics 2005, 6(Suppl
1):S14, doi:10.1186/1471-2105-6-S1-S14.
[14] Hatzivassiloglou V, Duboue P, Rzhetsky A (2001) Disambiguating proteins, genes, and rna
in text: a machine learning approach, Bioinformatics, 17 Suppl 1.:S97-S106.
[15] Hirschman L, Park J, Tsujii J, Wong L, Wu C (2002) Accomplishments and challenges in
literature data mining for biology, Bioinformatics, 18(12):1553-1561.
[16] Hirschman L, Yeh A, Blaschke C, Valencia A (2005) Overview of BioCreAtIvE: critical
assessment of information extraction for biology, BMC Bioinformatics 2005, 6(Suppl
1):S1, doi:10.1186/1471-2105-6-S1-S1.
[17] Howe D, Costanzo M, Fey P, et al. (2008) Big data: The future of biocuration, Nature 455:
47-50; doi:10.1038/455047a.
[18] Jiang J, Zhai C (2006) Exploiting Domain Structure for Named Entity
Recognition, Proceedings of HLT/NAACL 2006.
[19] Jiang J, Zhai C (2007) Instance weighting for domain adaptation in NLP, Proceedings of
the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), 264-271.
[20] Jiang J, Zhai C (2007) A Two-Stage Approach to Domain Adaptation for Statistical
Classifiers , Proc 16th ACM International Conference on Information and Knowledge
Management (CIKM'07), pp 401-410.
[21] Jiang J, Zhai C (2007) A Systematic Exploration of The Feature Space for Relation
Extraction, Proc Human Language Technologies: Annual Conference of the North American
Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), pp 113-120.
[22] Johnson E, Schatz B, Cochrane P (1996) Interactive Term Suggestion for Users of Digital
Libraries: Using Subject Thesauri and Co-occurrence Lists for Information Retrieval, Proc
Digital Libraries '96: 1st ACM Intl Conf on Digital Libraries, March, Bethesda, MD.
[23] Kazama J, Makino T, Ohta Y, Tsujii J (2002) Tuning SVM for biomedical named entity
recognition, Proc workshop on NLP in the biomedical domain, 2002.
[24] Kulick S and others (2004) Integrated Annotation for Biomedical Information Extraction,
Proc HTL-NAACL 2004 Workshop on Linking Biological Literature, Ontologies and
Databases, pp 61-68.
[25] Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B (2006) Automatically generating gene
summaries from biomedical literature, Proc Pacific Symposium on Biocomputing, pp 40-51.
[26] Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B (2007) Generating gene summaries from
biomedical literature: A study of semi-structured summarization, Information Processing and
Management, 43: 1777-1791.
[27] Marygold S (2007) Genetic Literature Curation at FlyBase-Cambridge, presentation at
ArthropodBaseConsortium working group meeting at University of Illinois, Dec 2007.
www.beespace.uiuc.edu/files/Marygold-ABC.ppt
[28] Mika S, Rost B (2004) Protein names precisely peeled off free text, Bioinformatics, 20
Suppl. 1:241-247, 2004.
[29] Morgan A, Hirschman L (2007) Overview of BioCreative II Gene Normalization, Proc of
the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain: CNIO; 2007:17-27.
[30] Muller H, Kenny E, Sternberg P (2004) Textpresso: an ontology-based information retrieval
and extraction system for biological literature, PLoS Biology 2004 Nov; 2(11) e309.
doi:10.1371/journal.pbio.0020309 pmid:15383839. www.textpresso.org
[31] Narayanaswamy M, Ravikumar K, Vijay-Shanker K (2003) A biological named entity
recognizer, Proc Pacific Symposium on Biocomputing, pp 427-38.
[32] Sanders B, Arcoleo D, Schatz B (2008) BeeSpace Navigator Integration with GMOD
GBrowse, 9th annual Bioinformatics Open Source Conference (BOSC 2008), Toronto, ON,
Canada. www.beespace.uiuc.edu/files/BOSC2008_v3.ppt
[33] Schatz B (2002) Building Analysis Environments: Beyond the Genome and the Web,
invited essay for Trends and Controversies section about Mining Information for Functional
Genomics, IEEE Intelligent Systems 17: 70-73 (May/June 2002).
[34] Schatz B (2007) Gene Summarizer: Software for Automatically Generating Structured
Summaries from Biomedical Literature, accepted plenary Presentation to 2nd International
Biocurator Meeting, San Jose. www.canis.uiuc.edu/~schatz/Biocurator.GeneSummarizer.ppt
[35] Settles B (2005) ABNER: an open source tool for automatically tagging genes, proteins and
other entity names in text, Bioinformatics, 21(14):3191-3192, 2005.
[36] Skounakis M, Craven M, Ray S (2003) Hierarchical hidden markov models for information
extraction, Proc of the 18th International Joint Conference on Artificial Intelligence, 2003.
[37] Sokolowski M (2001) Drosophila: genetics meets behaviour. Nature Reviews Genetics,
11(2):2001.
[38] Srinivasan P, Libbus B (2004) Mining Medline for implicit links between dietary substances
and diseases, Bioinformatics, 20 Suppl. 1:290-296, 2004.
[39] Tanabe L, Wilbur W (2002) Tagging gene and protein names in biomedical text,
Proceedings of the workshop on NLP in the biomedical domain, 2002.
[40] Tong S, Koller D (2001) Support vector machine active learning with applications to text
classification, Journal of Machine Learning Research, 2:45-66, 2001.
[41] Tsuruoka Y, Tsujii J (2003) Boosting precision and recall of dictionary-based protein name
recognition, Proc ACL 2003 workshop on Natural language processing in biomedicine, pp
41-48, Morristown, NJ.
[42] Tuason O, Chen L, Liu H, Blake J, Friedman C (2004) Biological nomenclatures: A source
of lexical knowledge and ambiguity, Proc Pacific Symposium on Biocomputing 9, pp 238-249.
[43] Zhou G, Shen D, Zhang J, Su J, Tan S (2005) Recognition of protein/gene names from text
using an ensemble of classifiers, BMC Bioinformatics, 6 Suppl 1(NIL):S7, 2005.