Chapter 1. Introduction



The Exponential Growth of Biomedical Research Data

       The current capabilities of our biomedical research enterprise, exemplified by the

completion of Human Genome Project, enable researchers to quickly and routinely

survey the contents of entire molecular and cellular systems. This capability is generating

a revolution in biomedical research in various profound ways. One significant change is

the availability of staggering amounts of genomic and functional genomic data gathered

at a whole-genome or whole-cell scale. As a result of these tremendous technological breakthroughs, the challenge for biomedical research is shifting from experimental data generation to the organization, curation, and interpretation of these data (Lander ES et al, 2001; Meldrum D et al, 2000).

       Biomedical research literature can be considered a knowledgebase that represents the most complete record of our research enterprise. Reflecting the geometric

growth of available experimental data, the publication rate in biomedicine is also

increasing exponentially. There are currently more than 17 million biomedical articles

already represented in the National Library of Medicine’s biomedical literature database

MEDLINE, including more than 3 million articles published within the last 5 years alone, with roughly 2,000 articles added per day in 2006 (Hunter L et al, 2006; MEDLINE). Keeping abreast of this large

and ever-expanding body of information, and tracking and utilizing what is relevant to one's interests, is increasingly daunting for researchers, especially new investigators. For example, neuroblastoma is a common pediatric tumor but is considered quite rare overall, with approximately 600 new cases diagnosed in the US each
year. However, there are almost 25,000 research articles describing neuroblastoma,

making it virtually impossible for a new investigator to systematically assess historical

research on this topic.

       Furthermore, researchers increasingly need to engage with research fields outside their core competence. The commonly used PubMed system, which provides a convenient query interface for MEDLINE, offers keyword search

and some concept mapping for researchers to narrow down the information they are

looking for (PubMed). However, it lacks the precision (positive predictive value), recall (sensitivity), granularity, and relevance-ranking capabilities that many typical but complex research queries require. One of the most common demands that

general-purpose systems such as PubMed fail to satisfy is the ability to extract and

compile specific knowledge or facts out of literature records. For example, there is no

provision in PubMed-like systems to determine which genes have been studied thus far in

relation to a certain type of malignancy, other than to read through the set of articles

identified by PubMed using keywords defining the concepts “gene” and “cancer” (or the

type of cancer of interest), and then identifying the particular genes one article at a time.

As the literature grows exponentially, this process becomes not only more time-consuming but also less reliable at retrieving the right articles. Consequently, the gap

between what is recognized and what is currently known is widening (Wren JD et al,

2004). Biomedical text mining techniques can help researchers meet this challenge by

developing automated systems to extract the relevant information out of the text and

organize it into a structured knowledgebase.



Data Integration Opportunities in Cancer Research

       The general challenge of biomedical literature knowledge extraction is compounded in cancer research, where there is an acute need to more systematically identify

linkages between genomic data and malignant phenotypes. Characterization of the

molecular aberrations responsible for the onset and progression of malignancy is a major

goal for cancer researchers, and genomic components of the aberrations, ranging from

base pair variance to chromosome deletion, are crucial determinants in this regard.

Despite the existence of some locus-, mutation- and disease-specific resources, there is

currently no central cancer knowledge database in the public domain integrating genomic

findings with phenotypic observations of tumors (Cairns J et al, 2000; Freimer N et al,

2003). While high-throughput screening efforts increasingly allow researchers to identify genome-wide mutational profiles for specific tumors, this information remains diffusely distributed and mostly catalogued in a semi-structured manner throughout the

biomedical literature. Such decentralization is holding back the efforts towards making

rapid and comprehensive inferences of the genomic basis of malignancy onset and

progression in a manner that incorporates cumulative knowledge. Ideally, researchers and

clinicians would likely benefit from a comprehensive cancer knowledgebase that

consolidates experimental work (genome-level investigation), clinical observations

(descriptions of phenotype) and patient outcome (efficacy of treatment). Because the

biomedical literature represents a large proportion of this information, which is both

critically reviewed and largely objective in its presentation of cancer research information, means for more adequately extracting, normalizing, and relating such diverse

collections of information in literature are crucial to solving this data integration problem

in cancer research.

Named Entity Recognition

       Text mining technology has been increasingly applied in biomedical research to assist in meeting the above-mentioned challenges.

There have been significant efforts from both computational linguists and

bioinformaticists within the past 5 years to develop automated biomedical text mining

(BTM) systems (Jensen LJ et al, 2006). BTM tasks include named entity recognition

(NER), information extraction (IE), document retrieval (DR), and literature-based

discovery (LBD). NER, which serves as the basis for most other BTM undertakings, is

the process of identifying mentions of biomedical entities (objects, such as genes and

diseases) in the text. Named entity recognition can at first appear deceptively straightforward, but it has emerged as a challenging and substantial task in BTM research. NER begins with the classification and definition of biomedical entities, which can consume a tremendous amount of effort because biomedical entities are complex and often lack naming standards.

       The process of identifying references to biomedical objects in text is usually split

into two steps: the identification of mentions of specific entity instances in text, such as

“the p53 gene” or “acute lymphoblastic leukemia”; and the assignment of these mentions

to a standard referent (normalization), such as classifying “the p53 gene” as a mention of

the official gene symbol “TP53”, or “ALL” as “acute lymphoblastic leukemia”. Many

biomedical entities either lack controlled vocabularies that can act as sufficient nomenclature standards, or, for historical reasons, their instances in text do not follow the standards that exist. Therefore, normalization is necessary for equating entity values as appropriate, or for placing values into a hierarchical or ontological framework (e.g., “ALL” as a form of “leukemia”). Much BTM research to date has focused upon molecular

entities that tend to be more discretely definable, such as genes and protein-protein

interactions, than phenotypic entities, which are harder to classify semantically

(BioCreAtIvE; McDonald R et al, 2005; Settles BA 2005; Zhou G et al, 2005).

       NER methods include both rule-based and machine-learning approaches. Rule-based approaches use sets of “rules”, alone or in combination, that pre-specify signature grammatical and, especially, character- and word-based patterns within a string of text being considered, and then return Boolean values as output. For example, a rule to identify a gene name could be “This word is a gene if it contains the consecutive letters ‘KIAA’, all of which are capitalized”. There can be some allowance for lexical

variations, such as capitalization, stemming, or punctuation, and some or all rules might

compare the text being considered to a term list, such as a pre-compiled list of known

tumor types. However, the approach cannot rely on the completeness of such dictionary-type lists, in terms of both depth (coverage of unique entity identifiers) and breadth (coverage of the synonyms for each unique identifier), because for most biomedical entities the term lists are always changing and never complete. For complexly formulated text, rule-based approaches typically require

considerable thought and exquisite biological knowledge. Advantages of this approach

are relatively high precision without the requirement for generating extensive training

material. However, disadvantages include high false negative rates, a performance

plateau that is increasingly difficult to overcome, and, for complex and heterogeneous

text, a tendency to generate low recall. Most first-generation systems and many domain-

focused current systems utilize rule-based approaches; when coupled with a term list, this

approach accomplishes both steps of the overall NER task at one time. However, rule-

based systems have enjoyed only modest success for biomedical applications, likely

because their performances have plateaued below rates acceptable for wide use by

researchers, or their application domains have been overly narrow (Hanisch D et al,

2005; Fundel K et al, 2005; Chang JT et al, 2004; Finkel J et al, 2005).
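The rule-and-term-list approach described above can be sketched in a few lines. The following is a minimal illustration only; the rules and the term list are invented for this sketch, not taken from any actual system, and the second rule deliberately shows how a loose pattern invites false positives (it would also match tokens like “DNA”):

```python
import re

# Hypothetical rules: each returns True if the token looks like a gene name.
# The first paraphrases the "KIAA" rule in the text; the second is a loose
# all-caps pattern that illustrates the precision problem.
RULES = [
    lambda tok: bool(re.fullmatch(r"KIAA\d+", tok)),
    lambda tok: bool(re.fullmatch(r"[A-Z][A-Z0-9]{2,5}", tok)),
]

# A pre-compiled term list, analogous to a list of known gene symbols.
KNOWN_GENES = {"TP53", "MYCN", "BRCA1"}

def is_gene_mention(token):
    """Boolean output: True if the token matches the term list or any rule."""
    return token in KNOWN_GENES or any(rule(token) for rule in RULES)

tokens = "Amplification of MYCN and KIAA0101 was observed".split()
hits = [t for t in tokens if is_gene_mention(t)]
# hits == ["MYCN", "KIAA0101"]
```

Because the term list can never be complete, unlisted gene names are missed unless some rule happens to fire, which is the depth/breadth limitation noted above.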

       Given the limitations of rule-based systems, a number of machine-learning

algorithms have been applied to improve the first step of the NER task. Generally, these

algorithms consider and then define sets of features within and surrounding entity

mentions that co-associate with the mentions. These can include orthographic features of

the text (e.g., suffixes, particular sequential combinations of characters or words,

capitalization patterns, etc.) and domain-specific features (e.g., term lists). For example,

the suffix “-ase” usually indicates a protein name, and the noun phrase immediately

preceding the word “gene” is often a gene name. Machine-learning approaches have

several advantages: at their purest, they require no domain knowledge; they can consider thousands or millions of features simultaneously across the entire feature space; and they can provide confidence scores for predictions. However, the

success of machine-learning approaches is dependent upon two critical and costly factors.

First, ML systems depend on the establishment, quality, and representativeness of a set of manually generated training material from which to “learn” features; producing this material requires considerable effort and does not generalize effectively across domains. Second, the most

effective systems incorporate biological knowledge—either in the form of domain-

specific rules or definition of features that are domain-specific (such as specialized

lexicons)—that are likewise costly to implement (McDonald R et al, 2004; Collier N et al,

2000; Tanabe L et al, 2002).
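The orthographic and contextual features described above (the “-ase” suffix, the word preceding “gene”, and so on) can be sketched as a simple feature extractor of the kind used to train a sequence classifier. The feature names and example sentence here are illustrative, not drawn from any specific system:

```python
def token_features(tokens, i):
    """Orthographic and contextual features for token i, of the kind fed to
    a machine-learning sequence tagger (feature names are illustrative)."""
    tok = tokens[i]
    return {
        "suffix3": tok[-3:],                  # e.g., "ase" suggests a protein
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "prev_word": tokens[i - 1] if i > 0 else "<START>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<END>",
    }

tokens = "the TP53 gene encodes a phosphatase".split()
feats = token_features(tokens, 1)   # features for "TP53"
# feats["next_word"] == "gene" -- a strong clue that "TP53" is a gene name
```

A learner such as a Conditional Random Field would weight thousands of such features simultaneously rather than relying on any one of them.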

        It is critical for humans to establish gold-standard examples before machines can learn from them. To reduce annotation ambiguity and disagreement, it is crucial to define the target biomedical entities explicitly. Currently, most NER systems adopt some version of pre-established conceptual definitions, which annotators may apply with very different standards. We have taken a different approach, investing tremendous effort in an iterative annotation process to develop literature-based definitions that draw both conceptual and textual boundaries.

         Step 2 (normalization) is syntactically easier, since the identification of textual boundaries is not necessary. However, it poses significant semantic challenges, because non-unique synonyms must be disambiguated to determine the intended referent. In addition, a comprehensive, thesaurus-like dictionary is needed to match raw entity mentions to their unique identifiers. Classification techniques, rule-based systems, and pattern-matching algorithms have been applied to this problem, and some approaches also use contextual information to disambiguate synonyms (Chen L et al, 2005).
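A dictionary-plus-context normalizer can be sketched as follows. The synonym entries and the single-keyword disambiguation heuristic are toy assumptions for illustration; real systems use comprehensive lexical resources and richer context models:

```python
# A toy thesaurus-like dictionary mapping synonyms to unique identifiers.
SYNONYMS = {
    "p53": "TP53",
    "the p53 gene": "TP53",
    "tp53": "TP53",
    "all": "acute lymphoblastic leukemia",  # non-unique: also an ordinary word
}

def normalize(mention, context=""):
    """Map a raw entity mention to a unique identifier, using surrounding
    text to disambiguate non-unique synonyms such as "ALL"."""
    key = mention.lower()
    if key == "all" and "leukemia" not in context.lower():
        return None  # ambiguous without supporting context
    return SYNONYMS.get(key)

# "ALL" resolves only when the context supports the disease reading.
result = normalize("ALL", context="pediatric leukemia cohort")
```
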

Information Extraction

        Ideally, BTM systems extract and synthesize “facts” out of the literature that

combine entity mentions with relationships between and among the mentions established

in the literature. This work requires NER results, that is, the relationships between the

entities can only be extracted once the individual entities have been identified. Although

biomedically oriented research in this area is not as advanced as NER, BTM researchers

have recently been increasing their efforts on these challenges.

       A most straightforward but powerful approach is co-occurrence. This approach

identifies the relationships between the involved biomedical entities based on their co-

occurrence in the articles, or by considering how close mentions are to each other within

a document. The assumption underlying the co-occurrence method is that if two (or more) entity instances are co-mentioned in a single text record (or defined subset, such as a

sentence or a paragraph), these instances have some type of underlying biological

relationship. As it is possible that entity instances can coincidentally co-occur, systems

commonly use some parameters to rank the relationships, such as the frequency and

location of their co-occurrence. If two entity instances are repeatedly co-mentioned

together in close proximity, it is most likely that they are related. This approach tends to

perform with better recall but at the expense of precision because it has no intelligent

means for distinguishing specific from general relationships. For example, if the

information to be extracted is the causal relationship between gene A and disease

diagnostic labels, this approach will recognize relationships of any kind between gene A

and relevant diseases, including but not limited to direct or causal relationships. In order

to improve precision, some co-occurrence-based IE systems include additional

approaches, such as combining with a customized text-categorization system to

preferentially identify relevant articles or sentences. Co-occurrence-based IE systems are

usually used as exploratory tools for making inferential calls, since they can identify both direct and indirect relationships between entity instances (Jenssen TK et al, 2001; Alako

BT et al, 2005).

       Another approach is to take advantage of natural language processing (NLP)

methodology that combines syntactic and semantic analysis of text. In this approach,

individual tokens in text are first identified and then assigned part-of-speech labels, a process that has been automated with high accuracy. A nested, tree-like structure (built either top-down or bottom-up) is then developed to determine the relationships between noun phrases and beyond, such as subject and object. After a

NER process is applied for assigning semantic labels to specific words and phrases, either

rule-based or machine-learning based processes can be used to extract relationships

between entity mentions. Although the syntactic parsing and the semantic labeling have

been carried out as separate steps by most NLP-based IE systems, results indicate that

better performance can be obtained by integrating the two steps, due in part to the often

complex relationships of biomedical entity mentions. This NLP-based approach can

achieve better precision, but lower recall, largely because of increased challenges in

identifying relationships across sentences. These approaches are also labor-intensive,

since either expert defined sophisticated extraction rules or manually annotated training

corpus are required (Rzhetsky A et al, 2004; Daraselia N et al, 2004; Yakushiji A et al,

2001).
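Full syntactic parsing is beyond a short sketch, but the final rule-based relation-extraction step mentioned above can be illustrated over text in which NER has already inserted semantic labels. The label format, verb list, and sentence are assumptions of this sketch, not a real system's representation:

```python
import re

# After NER assigns semantic labels, a rule-based pattern can extract
# binary relations within a sentence. Verb list and tagging are illustrative.
RELATION_VERBS = r"(activates|inhibits|phosphorylates|regulates)"
PATTERN = re.compile(r"<GENE:(\w+)>\s+" + RELATION_VERBS + r"\s+<GENE:(\w+)>")

tagged = "<GENE:MDM2> inhibits <GENE:TP53> in many tumor types"
m = PATTERN.search(tagged)
relation = (m.group(1), m.group(2), m.group(3)) if m else None
# relation == ("MDM2", "inhibits", "TP53")
```

The rigidity of such patterns is exactly why this family of approaches gains precision but loses recall: any relation expressed across sentences, or with intervening clauses, escapes the pattern.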

         Although there is some research addressing n-ary relationships among a

set of biomedical entities, most IE systems currently classify binary relationships between

same-type entities. These systems most commonly focus on entities and relationships that

are easier to define, such as protein-protein/gene-protein interactions, protein

phosphorylation, other specific relations between genomic entities such as cellular

localizations of proteins, or interactions between proteins and chemicals. Few systems have yet been designed for relating phenotypic attributes, such as gene-disease

relationships (Temkin et al, 2003; McDonald R et al, 2005).

       High-performance systems that can extract many types of relationships and also

distinguish among relationships beyond the sentence level are not yet achievable. This is

due largely to three contributing factors. First, biomedical text is complex and highly

variable in its structure and presentation. Second, many complicating factors need to be

considered, including co-reference (e.g., the use of pronouns), ambiguity in intent, and

variability in formulation. Finally, systems need to incorporate various approaches

simultaneously (e.g., tokenizers, POS taggers, NER systems, parsers, disambiguators),

each of which contributes some measure of error that combines to significantly degrade

finalized output (Ding J et al, 2002).

Document Retrieval

       DR systems typically identify and rank documents pertaining to a certain topic

from a large collection of text. Topics of interest might be derived from user-supplied

search terms or from pre-selecting specified types of documents. Most DR systems

feature keyword search capabilities; advanced keyword searching allows users to input a

combination of search terms and/or to perform advanced functions, such as applying logical operators or imposing limits on terms. Systems then commonly retrieve

documents containing or excluding certain terms that match the search criteria. This

method often retrieves irrelevant articles, and relevance-ranking functions are often

absent or primitive. More sophisticated DR systems go beyond this by applying distance

metrics, such as a vector-space model. With this model, every document is represented as

a vector, which is determined by measuring text-based features and/or document

metadata, such as a list of frequency-based weighted terms identified in each document.

The query vector, which is determined by the relative importance of each query term, is

then compared to document vectors to relevance-rank the documents. The comparison

between document vectors can also calculate document similarity. PubMed is a well-

known DR system that is highly adapted for use as a query interface for MEDLINE.

PubMed uses both keyword searching and a vector model (Glenisson P et al, 2003).
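The vector-space ranking just described can be sketched with sparse term-weight vectors and cosine similarity. The term weights below are toy values standing in for frequency-based weighting:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Frequency-weighted term vectors for two documents and a query (toy values).
doc1 = {"neuroblastoma": 3.0, "MYCN": 2.0, "amplification": 1.0}
doc2 = {"leukemia": 2.0, "translocation": 1.0}
query = {"neuroblastoma": 1.0, "MYCN": 1.0}

ranked = sorted([("doc1", doc1), ("doc2", doc2)],
                key=lambda d: cosine(query, d[1]), reverse=True)
# doc1 ranks first: it shares both query terms with the query vector
```

The same similarity function applied between two document vectors yields the document-similarity measure mentioned above.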

       Advanced DR systems integrate NER or other NLP methods in order to more

accurately assess document content and identify documents that mention certain

biomedical entity mentions. FABLE, MedMiner and Textpresso are examples of systems

that make retrieval decisions by extracting and considering knowledge from gene/protein

mentions in the documents (FABLE; Tanabe L et al, 1999; Muller HM et al, 2004).

Literature-Based Discovery

       An ultimate goal of BTM is to assist with literature-based discovery. LBD can be

defined as a process that discovers testable novel hypotheses by inferring implicit

knowledge in the biomedical literature. An early and often-cited example of LBD arose from a researcher's recognition of facts in two unrelated bodies of biomedical text: one describing Raynaud's disease, in which patients suffer from vasoconstriction, high blood viscosity and platelet aggregability, and the other describing fish oil, whose active ingredient, besides causing vasodilation, can also lower blood viscosity and platelet aggregation. This connection was formed entirely through extensive reading of the

literature, and later the relationship was proved experimentally. The model used in this

seminal example was very simple: if A leads to B, and B leads to C, then it is plausible

that A could lead to C. Based on this closed discovery process (to connect two previously

known relations), this researcher subsequently discovered a novel association between

migraine and magnesium deficiency (also proved experimentally) as well as additional

successes (Swanson DR 1986; Swanson DR 1988; Swanson DR 1990).
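Swanson's closed-discovery model ("if A leads to B, and B leads to C, then plausibly A leads to C") can be sketched directly over a set of known relation pairs. The pairs below paraphrase the fish oil / Raynaud's example; a mined relation set would of course be far larger and noisier:

```python
# Known A-B and B-C relation pairs (illustrative, paraphrasing the example).
known = {
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's disease"),
    ("platelet aggregation", "Raynaud's disease"),
}

def abc_hypotheses(pairs):
    """Propose (A, B, C) hypotheses: A-C links bridged by a shared B term,
    where no direct A-C relation is already known."""
    hypotheses = set()
    for a, b1 in pairs:
        for b2, c in pairs:
            if b1 == b2 and a != c and (a, c) not in pairs:
                hypotheses.add((a, b1, c))
    return hypotheses

hyps = abc_hypotheses(known)
# includes ("fish oil", "blood viscosity", "Raynaud's disease")
```
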

       More challenging LBDs might arise from an open discovery process, which

attempts to derive relationships between two entities of interest through implicit

relationships in literature. For example, the process of identifying candidate genes for a

certain disease is an open discovery process. One example of this process would be to

first identify gene mentions co-occurring in the literature (gene set A) with mentions of a

disease of interest, next identifying co-occurring gene mentions (gene set B) with known

disease genes, and then consider the overlap between the two sets of gene mentions as

candidate genes for the disease. This approach makes two assumptions: that gene set B is functionally related to the known disease genes, and that gene set A bears some relationship to the disease. One potential problem with this approach is that there are many

types of direct and indirect relationships identified in such a process, including the high

likelihood that a substantial number of false positives are generated. NLP-based IE can

certainly help narrow down the relationship types, but further research is needed to

improve the performance of such models. More fundamentally, the literature inevitably contains conflicting and inaccurate statements, which are impossible for an automated algorithm to adjudicate (Weeber M et al, 2005).
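The open-discovery candidate-gene process described above reduces to a set intersection once the co-occurring gene sets have been mined. The gene sets here are invented for illustration, not mined results:

```python
# Open discovery sketch: candidates are genes co-mentioned with the disease
# (set A) that are also co-mentioned with known disease genes (set B).
genes_with_disease = {"ALK", "PHOX2B", "MYCN", "GAPDH"}      # set A (toy)
genes_with_known_genes = {"ALK", "PHOX2B", "CDK4"}           # set B (toy)
known_disease_genes = {"MYCN"}

# Overlap of A and B, excluding genes already known to be disease genes.
candidates = (genes_with_disease & genes_with_known_genes) - known_disease_genes
# candidates == {"ALK", "PHOX2B"}
```

The false-positive risk noted above is visible even here: a ubiquitously mentioned gene would enter set A through incidental co-occurrence, so candidates typically require downstream filtering or ranking.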

       More reliable inference of novel hypotheses and research directions from the literature is likely to be achieved by integrating BTM results with other data types, including curated data sets and experimental data. Expert curation and experimental evidence provide verification, filtering, and relevance-ranking capabilities for information derived from real biological relationships between entities. For

example, researchers have made novel discoveries by transferring text-mined

relationships of a protein to its orthologous proteins based on sequence-similarity

searches. Integrating BTM results with functional genomic data, such as microarray data, has helped researchers rank significant genes and develop novel hypotheses based on both experimental data and prior knowledge in a large-scale, automated fashion (Yandell MD et al, 2002; Raychaudhuri S et al, 2002; Glenisson

P et al, 2004).

Significance

        Along with the rapid expansion of experimental data, the exponential increase in biomedical research text makes it increasingly difficult for researchers to track and utilize the information relevant to their interests, especially in domains outside their core competence. Automated text mining systems can transform the unstructured information in the literature into a structured, queryable knowledgebase. This dissertation research developed well-performing automated entity extractors based on refined manual annotation using iteratively developed, literature-based entity definitions for the domain of genomic variation of malignancy. A co-occurrence-based information extraction process was then integrated with microarray expression data to identify candidate genes for neuroblastoma research. Both functional pathway analysis and RT-PCR experiments validated the contribution of text mining. This thesis demonstrates that, in addition to enabling systematic curation of textual information, biomedical text mining also has inferential capability, especially when combined with experimental data.




Introduction to the Thesis

       Using the genomics of malignancy as a test bed, this thesis has touched upon

every aspect of BTM outlined above. Work regarding the BTM process developed and

employed will be discussed in detail in Chapter 2 and Chapter 3. This thesis has also

established important work regarding information extraction in this domain, which has

been applied to research regarding the pediatric tumor neuroblastoma (Chapter 3 and

Chapter 4). Integration of BTM-extracted information with expression array analytical

results to discover candidate genes for neuroblastoma research will be discussed in detail

in Chapter 4.




Chapter 2. Defining Biomedical Entities for Named Entity Recognition

                                         Yang Jin
                                      Mark A. Mandel
                                      Peter S. White

Abstract

       The performance of machine-learning based named entity recognition is highly

dependent upon the quality of the training data, which is commonly generated by manual

annotation of biomedical text representative of the target domain. The development of

robust definitions of biomedical entities of interest is crucial for highly accurate

recognition but is often neglected by text-mining applications. Because the conceptual and syntactic complexities of biomedical entities often generate ambiguities in assigning text mentions to particular entity classes, entity definitions that exhibit semantic and textual boundaries as distinct as possible are desired. We have created a highly generalizable

process for developing entity definitions specifying both conceptual limits and detailed

textual ranges for target biomedical entities. This process utilizes representative text and

manual annotators to initially define and iteratively refine definitions. The process was

tested within the knowledge domain of genomic variation of malignancy. This work

describes in detail the different types of challenges faced and the corresponding solutions

devised during the definition process. The resulting entity definitions were used to

annotate a training corpus for the development of automated entity extraction algorithms

and for use by the research community. We conclude that manual annotation consistency

is useful for the success of later biomedical text mining tasks, and that explicit, boundary-

defined entity definitions can assist with achieving this goal.



1. Introduction

        Automated information extraction techniques can assist in the acquisition,

management and curation of data. A necessary first step is the ability to automatically recognize biomedical entities in text, also known as named entity recognition (NER).

Development of named entity extractors for biomedical literature has progressed rapidly

in recent years. For example, a number of machine-learning algorithms currently exist for

identifying gene name instances in text (Collier N et al, 2000; Tanabe L et al, 2002;

GENIA; Hanisch D et al, 2005). However, a major shortcoming of many approaches is

that they often minimize efforts to define biomedical entities in an explicit fashion.

Rather, the tendency is often to sidestep this step by adapting or refining existing semantic

standards as the target entities’ conceptual definitions, leaving interpretive details to

manual annotators. Additionally, existing standards often provide little or none of the

semantic depth required to establish concept boundaries with enough rigidity to provide

highly accurate extraction. This tends to create persistent consistency problems in later steps, when training automated extractors and utilizing the extracted entity mentions for particular applications, because conceptual definitions not grounded in the literature often generate significant annotation ambiguity due to the semantic as well as syntactic complexities of biomedical entities in the literature. As a result, automated systems derived from such definitions tend to perform more poorly. For biologists in particular, high

precision is a necessary prerequisite for widespread acceptance of automated tools, in

order to establish a level of reliability acceptable to users.

        Strongly believing in the importance of establishing well-defined, literature-based entity definitions with clear boundaries, specifically designed for biomedical NER practice, the Biomedical Information Extraction Group at the University of Pennsylvania (Penn

BioIE) has developed an iterative annotation process designed to establish a set of

“precise” entity definitions. These definitions are meant to clarify the conceptual

boundaries both semantically and syntactically, while also striking a balance between the

requirements of researchers, annotators, and computational scientists. This paper will

first describe the annotation process developed by the Penn BioIE group, and then

introduce the necessity and challenges of defining biomedical entities, with specific

examples in the literature.

2. Overview of manual annotation process and entity classification








             Figure 2-1. The processes of developing entity definitions and extractors



       Figure 2-1 demonstrates the iterative process developed for establishing and

refining entity definitions, first through manual annotations and then in developing

extractors based on the manually annotated training data. The process begins with the

creation of an initial definition that establishes the general concept and scope of an entity

class, which is supplied by one or more domain experts. Commonly, existing standards and resources are explored and, if deemed suitable, adopted as nuclei for the

process. Subsequently, the domain expert(s) plays the role of adjudicating definition

discrepancies. Manual annotators are then trained with the initial versions of the entity

definitions, from which they manually annotate the selected training corpora. Invariably,

as the annotators encounter the wide diversity of semantic representations of specific

concepts, a need for iterative refinement of the entity definitions emerges. Often, encounters with text require major revisions or even restructuring of definitions to accommodate

such heterogeneity. Accordingly, definitions are continually refined during the analysis of

annotated texts and annotation disambiguation. The Penn BioIE group established frequent communication forums where the emerging definitions and identified exceptions

were fully discussed among annotators and researchers. Communication modalities

included weekly face-to-face meetings, email lists, and live chat. After annotation was

completed, entity extractors were developed by implementing machine-learning

algorithms based on probabilistic models (we used Conditional Random Fields); the

manually annotated texts were utilized as both training and testing data for these

algorithms. Comparison of the annotations produced by the automatic extractors and

human annotators allows for evaluation of the extractor performance.
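
The comparison step can be made concrete with a small sketch. Assuming mentions are represented as (start, end, label) tuples (an illustrative representation of our own, not the actual Penn BioIE data format), extractor output can be scored against the gold manual annotations by exact span match:

```python
# Sketch of evaluating extractor output against gold manual annotations
# by exact span match. The (start, end, label) tuple format is an
# illustrative assumption, not the actual annotation file format.

def evaluate(gold, predicted):
    """Return precision, recall, and F1 over exact-match mention spans."""
    gold_set = set(gold)
    pred_set = set(predicted)
    tp = len(gold_set & pred_set)          # spans found by both
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 4, "gene"), (10, 18, "variation"), (25, 31, "malignancy")]
pred = [(0, 4, "gene"), (10, 18, "variation"), (40, 44, "gene")]
p, r, f = evaluate(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```

Exact-match scoring is deliberately strict: a predicted span that differs from the gold span by even one token counts as both a false positive and a false negative.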

       The target knowledge domain we chose was “Genomic Variation of Malignancy”,

conceptualized as a relationship among three entity classes: Gene, Variation and

Malignancy. As shown in Figure 2-2, the Gene and Variation entities comprise genomic

components of cancer while the Malignancy entity covers phenotypic aspects of

malignancy, including malignancy diagnostic labels and a number of malignancy

phenotypic attributes.








    Figure 2-2. Entity classification scheme for the domain of genomic variation of malignancy




       A total of 1442 MEDLINE abstracts were selected for exploration and annotation

in this study, one subset of which contained many different malignancy types to establish

breadth, and a second subset of which mentioned only one major malignancy

(neuroblastoma) to establish depth. As diagrammed in Figure 2-1, the manual annotation

process was first applied to the corpus with an electronic annotation tool, WordFreak

(http://sourceforge.net/projects/wordfreak). After the entity definitions were refined and

stabilized, the manually annotated data were then used to develop entity and attribute

extractors (McDonald RT et al, 2004, Jin Y et al, 2006). These automated extractors
performed with state-of-the-art accuracy, in part due to the careful design and

management of our annotation process. In the following paragraphs, we will discuss the

challenges we have encountered during the manual annotation process, and why we

believe that consistent entity definitions are critical for the success of later steps in

biomedical text mining.

3. The challenges of defining biomedical entities

       Although we began this task believing we had clear ideas of what information

each entity should cover, it quickly proved challenging to develop detailed working

definitions. Our a priori notions of entity definition adequacy were that definitions

establish distinct and defensible boundaries both conceptually and textually, thereby

providing guidance to the annotators both semantically and syntactically. Solid entity

definitions are an essential foundation for the subsequent steps of developing machine-

learning algorithms and utilizing the extracted information for specific applications. First,

the performance of entity extractors is highly dependent not only on the selection of the

underlying algorithms, but also on the quality of the training data, which are entirely

based on the entity definitions. If the annotators cannot identify specific entity mentions

consistently on the basis of the definitions, it is hard to imagine that automated extractors

can replicate this task reliably. More importantly, without clear definitions, researchers

will certainly run into problems when trying to utilize the extracted mentions, since it will

be difficult to know the precise boundaries of the gathered information.

       As mentioned earlier, we initially defined three major entities in the knowledge

domain of genomic variation of malignancy, based on existing ontological categories and

concepts. However, we quickly found that ontology-based definitions often don’t

precisely reflect what has been conceptualized throughout the biomedical texts

contributed by researchers worldwide. For example, the definition of a gene in the NCI Thesaurus is:

“A functional unit of heredity which occupies a specific position (locus) on a particular

chromosome, is capable of reproducing itself exactly at each cell division, and directs the

formation of a protein or other product.” If annotators use this definition for identifying

gene mentions in the text, they could quickly be confused by many situations such as

whether promoters should be included; how should gene family names be treated; how

about pronoun referents to genes, etc. Thus, we found the need to invoke text-based

working entity definitions, which are most effectively determined as annotators

proceeded with the entity recognition task in the training corpus. Every new mention of

an entity and every new context for a mention provided a test for the pre-developed entity

definition. If a definition could not explicitly lead the annotators to a “correct”, or at least

consistent decision in each case, the problematic mention required further examination,

interpretation, and possibly, refinement of the definition. Through such an iterative

process, we were able to develop fine-tuned entity definitions that provided distinct

boundaries both for semantic scope and contextual range.

        The challenges that we encountered in refining our definitions can be grouped

into four categories: conceptual, syntactic, syntactic/semantic ambiguity, and inter-

annotator agreement. In the following paragraphs we will illustrate these types and give

examples of our devised solutions and their limits.

3.1 Conceptual definition challenges

        As discussed earlier, an entity definition has to clarify both conceptual and textual

boundaries. Initial versions of our definitions were completely conceptual, based on our

understanding of biomedical categories. Surprisingly, more than half of the annotators’

difficulties with definitions fell into this category during the annotation process, and most

of them were reasonable, as the following paragraphs describing the four most common

challenges in this category illustrate. This reflects the semantic complexity and

diversity of biomedical entities, which often cannot be easily defined without some

ambiguity.

3.1.1 Sub-classification of entities

       Based on the classification scheme stated above, our target knowledge domain

was initially divided into three major conceptual classes: gene, genomic variation, and

malignancy. However, this broad conceptual classification was far from sufficient for the

generation of highly accurate extractors. For example, according to the conceptual

definition, the malignancy concept covers all phenotypic information of cancer, including

a tumor’s diagnostic type, the tumor’s anatomical location and cellular composition, and

its differentiation status. Each of these types of information is presented in a variable

and often bewildering array of syntactic and contextual patterns, which increases entropy

and thus erodes the ability of machine-learning approaches to classify mentions. If

instead we further classified the mentions into sub-categories such as those described

above and annotated them as such, entropy is reduced and extractor performance can be

expected to improve. However, a major disadvantage of this approach is that sub-

categorization introduces considerable additional annotation effort. Thus, the annotation

process requires first the establishment of a level of entity granularity that balances the

cost of manual annotation with the application value of the extracted data.

       There are countless ways to further divide entities into their underlying

components. For our purpose, we decided to let the level of granularity be generated by

the annotation process. By beginning with broad classes and subdividing them as needed,

we considered that we would eventually approach an optimal balance between effort and

effectiveness. We considered it to be critical to determine how the text strings represented

subcategories in the real world of biomedical literature. Therefore we divided our

annotation efforts into two stages: data gathering and data classification, as demonstrated

in Figure 2-3 with a genomic variation entity example.








              Figure 2-3. The text-based two-stage entity sub-classification process



       In the example illustrated by Figure 2-3, annotation of our initial concept of

“Genomic Variation” proceeded through a preliminary stage, which we named “Data

Gathering”, before the concept was divided into sub-categories. In this stage, all textual

mentions falling within or partially within our initial concept definition were annotated

regardless of syntax. When sufficient information was gathered, sub-categories were

defined based on their semantic and syntactic representations. In addition, by proceeding

with this exercise, the annotators became familiar with the concepts, definitions, and

emerging challenges of the tasks. By employing this method, the sub-classification

scheme began to approximate how the concepts were actually presented in the text.

3.1.2 Levels of specificity

       Textual entity mentions referring to the same semantic types can range from very

general to quite specific, and not all levels of detail may be appropriate for a particular

project. A gene mention may refer to a specific gene instance in a single cell of a sample,

or to the wild type or a specific variation of the gene; or it may refer to gene families,

super families and generalized classes, which represent classes of genes. For instance,

“MAPK10” or “mitogen-activated protein kinase 10” is a family member of “MAPK”,

which itself belongs to a higher level family “protein kinase”. We made the decision to

include all levels of information for the gene entity except for the most general level such

as “gene”. That is, in the above example, all three levels of gene mentions are legitimate

and should be annotated as such.

       The decision was based on a couple of considerations. First of all, gene class

information is valuable to extract in later steps; although we don’t know

which specific gene it refers to, it does help us narrow down to a class of genes. Second,

if we only include the mentions describing genes at the instance level (the level that can

lead to a specific genomic element), we have to draw a line between gene classes and

instances. Because textual mentions for gene classes and instances are sometimes

interchangeable (researchers tend to use gene class names referring to gene instance

names and vice versa), it will be quite difficult for the automated extractors to distinguish

between the two. And finally, we exclude gene mentions at the most general level, which

contains no information content or application value to extract. In other words, all

information-containing levels of mentions are included.

3.1.3 Conceptual overlaps between entities

       An ideal entity classification scheme should result in independent information

categories without any conceptual overlaps. Unfortunately, the subjective and adaptive

nature of biological objects makes this ideal especially difficult to achieve, especially

when defining two different but related entities. Even a basic concept such as “organism”

is difficult to define when considering entities such as viruses and viroids, self-replicating

machines with attributes necessary but not necessarily sufficient to qualify as life forms.

Because our gene and genomic variation concepts both fall within the genomic domain

and are closely associated, we were very careful to make a clear distinction. Eventually,

our gene entity evolved to encompass solely the names of genes and their downstream

products (i.e., RNAs and proteins), while the genomic variation entity covered specific

descriptions of genomic element variations.

       Although our definitions of gene and genomic variation managed to eventually

establish a reasonable boundary between them, for other entities, we found it sometimes

impossible to avoid the conceptual overlapping problem. We encountered such problems

when trying to make a clear division between the entity classes symptom and disease. The

symptom entity was designed to capture subjective or objective evidence of disease, such

as headache, diarrhea or hyperglycemia, while the disease entity captured specific

pathological processes with a characteristic set of symptoms, such as Long QT Syndrome

or lung cancer. In most cases the distinction appears clear to domain experts until

considerable scrutiny is applied, as it seems to be simple common sense that these

concepts represent two distinct and non-overlapping sets of information. However, when

presented with the broad contextual variation in use and, often, semantic intent, it actually

becomes quite difficult to draw a clear boundary between the two. We quickly found that

many terms can be considered as both symptoms and diseases, depending both upon

intent and the level of domain knowledge available. For example, “arrhythmia” itself is a

disease entity mention, representing a pathological process, but it is also commonly used

as a symptom of another disease, such as Long QT Syndrome. We certainly don’t

want to have two entity types heavily overlapping with each other, since that would make

the classification pointless. This was not the case for the symptom and disease entity

types, whose overlapping mentions accounted for less than approximately 10% overall. Most

conceptually overlapping mentions cannot be put into either category without reading the

text. We left it to the annotators to determine the authors’ intent from the context, and

over time they became quite good at minimizing disagreement.

3.1.4 Domain-specific clarification

       As biological entities tend to be conceptually subjective, we often found it to be

quite challenging and labor-intensive to establish consistent conceptual boundaries. The

process of defining the gene entity is a good example to illustrate this challenge. Initially,

we considered defining a “gene” to be a straightforward task, as this concept is

considered by biologists to be a rather discrete object. The HUGO Gene Nomenclature

Committee (HUGO), the nomenclature body tasked with establishing official names for

human genes, defines a gene as “a DNA segment that contributes to phenotype/function.

In the absence of demonstrated function a gene may be characterized by sequence,

transcription or homology". Building on this, our gene entity was initially defined as the

nominal reference to a gene or its downstream product in biomedical text. However, as

annotations moved forward, annotators raised more and more questions, forcing us to

make difficult determinations on the boundaries as illustrated below.

       An example of biological complexity is the many ways that a gene can contribute

to phenotype. Typically, genes functionally impact biological processes through their

downstream products, proteins. However, there are DNA segments on the genome which

are able to affect phenotype by regulating how genes are expressed in particular

biological contexts. Promoter and enhancer regions, which are distinct segments of DNA

often far removed from the DNA segment that directly contributes to an RNA and/or

protein product, are such examples. These elements control whether and when the gene

itself is expressed. Although biologists disagree about whether promoters should be

considered genes or components of particular genes, annotators were required to make a decision on

the gene entity boundary limits. In this case, we considered our application domain to be

the most important determinant, as the main focus of our gene entity was to capture those

“traditional genes” that could be directly and consistently associated with a protein. Thus,

we limited our scope of genes to include only what we considered to be biologically

functional DNA segments which are translated into protein products.

       There are many more cases that required further clarification of the gene entity

conceptual definition, such as how to deal with segments and multiplexes of

genes/RNAs/proteins. We realized that consistency was more valuable than trying to

establish universal truth; we considered the former to be the key to developing

well-performing automated extractors and increasing the application value of extracted

mentions.

3.2 Syntactic definition challenges

     Even with precise conceptual definitions, we found that guidelines needed to be made

regarding the textual boundaries of the entity mentions. Although many of these were

syntactic nuances, they were a nontrivial source of annotator disagreement. In

order to build reliable automated extractors, we determined that detailed annotation

guidelines were required to keep manual annotations consistent between different

annotators. We designed our guidelines to be practical and based on actual contexts,

specifying to the annotators exactly what to do under any uncertain circumstances that we

had encountered.

3.2.1 Associating a text string to an entity mention

       There are many different ways to associate a text string with an entity mention in

biomedical literature. In order to harvest consistent training data for developing high-

performing automated extractors, we needed to define a series of rules specifying how to

select text strings in the literature as legitimate entity mentions. We allowed entity

references to include more than one word, including punctuation, but not to cross

sentence boundaries.

       Although the majority of the entity mentions were nouns, not all of them were.

For some entity types, such as variation type, other part-of-speech forms were not

uncommon. For example, for genomic variation types that would likely be normalized as

the forms “insertion”, “deletion”, or “translocation”, those variation type mentions were

usually expressed as verbs: “inserted”, “deleted”, or “translocated”. Moreover,

malignancy attribute mentions were nearly always adjectives, such as “well-

differentiated”, “hereditary”, and “malignant”.
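
The normalization implied above can be sketched as a simple lookup. The mapping below covers only the verb forms named in the text and is illustrative, not the actual normalization component:

```python
# Minimal sketch: mapping verb-form variation-type mentions to the
# normalized noun forms discussed in the text. The table is illustrative
# and covers only the examples given, not an exhaustive vocabulary.

NORMALIZED_TYPES = {
    "inserted": "insertion",
    "deleted": "deletion",
    "translocated": "translocation",
}

def normalize_variation_type(token):
    """Map a variation-type mention to its canonical noun form."""
    token = token.lower()
    return NORMALIZED_TYPES.get(token, token)

print(normalize_variation_type("deleted"))    # deletion
print(normalize_variation_type("insertion"))  # insertion
```

A real system would also need to handle inflected and nominalized variants (e.g. "deletions"), which this sketch omits.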

       All modifiers in a noun phrase mention were included as part of the mention, not

only because modifiers can provide very useful information to extract, but also because

some modifiers are indispensable parts of the standard terms. We

observed that this decision made it easier for both manual annotators and machine-

learning extractors to operate since it was difficult to define boundaries on what

modifiers to include in noun phrases. However, modifiers were not included for other

part-of-speech phrases, in order not to complicate the issue. For example, in a noun

phrase malignancy type mention “malignant squamous cell carcinoma”, both “malignant”

and “squamous cell” are the modifiers of “carcinoma”, and both provide very useful

information. “Squamous cell carcinoma” is also a commonly employed name of a type of

cancer. Our experience showed that it was difficult for annotators, and impossible for

automatic extractors, to draw consistent boundaries regarding which modifiers should be

included as part of legitimate mentions.

       Lastly, we found it necessary to make entity-specific rules for some biological

entities. For example, the gene entity mentions commonly appeared in the text as “The

mycn gene…”, necessitating a decision as to whether the article “The” and the noun

“gene” should be included as part of the entity mention. We reasoned that the decision

should depend on how the extracted information was to be further processed and utilized.

Accordingly, we decided to include neither word, since all the extracted gene mentions

were to be subsequently mapped and normalized to official gene symbols.
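
This boundary rule can be sketched with a small trimming function. The regular expression is an illustrative assumption, not the actual annotation tooling:

```python
import re

# Sketch of the entity-specific boundary rule for gene mentions: the
# article "The" and the trailing noun "gene" are excluded so that the
# remaining string can be mapped to an official gene symbol. The regex
# is an illustrative assumption, not part of the actual pipeline.

def gene_mention_span(phrase):
    """Trim a leading article and a trailing 'gene' from a candidate mention."""
    m = re.match(r"(?:the\s+)?(.*?)(?:\s+gene)?$", phrase.strip(), re.IGNORECASE)
    return m.group(1)

print(gene_mention_span("The mycn gene"))  # mycn
```

The trimmed string is what would then be passed to alias-to-symbol normalization.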

3.2.2 Co-reference issue

       Often a single entity is referred to in different ways in the same text, a situation

known as co-reference. Besides its standardized form, an entity instance can also be

referred to by aliases, acronyms, descriptions or pronoun references. For example, the

mycn gene has at least 10 aliases in the literature, including “n-myc”, “oded”, and “v-myc

avian myelocytomatosis viral related oncogene, neuroblastoma derived”. Moreover,

researchers commonly engineer their own acronyms as convenient but non-standard

and often unique aliases. Co-reference is generally recognized as a challenging task for

entity recognition and information extraction. To deal with this issue in manual

annotation, we have classified this problem into the following four categories and made

corresponding decisions for each of them.

A. Extended form vs. acronym

Regular expression: ___ ___ ___ (___)

Examples:

   •   …mitogen-activated protein kinase (MAPK)…-- gene entity mention

   •   …squamous cell carcinoma (SCC)… -- malignancy type entity mention

Our decision: Tag both the extended form and abbreviated form of the entity mention.

For the above examples, “MAPK” is co-referential with “mitogen-activated protein

kinase”, and “SCC” is co-referential with “squamous cell carcinoma”. Both extended

forms and acronyms would be tagged as corresponding entity instances in our system.

Our rationale: Both forms are interchangeable descriptions of entity mentions, and they

should be treated equally.
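
The "extended form (acronym)" pattern in category A can be sketched with a simple regular expression that returns both spans, mirroring the decision to tag both forms. The regex is a deliberate simplification for illustration (for instance, it may capture preceding words into the extended form):

```python
import re

# Hedged sketch of the "extended form (ACRONYM)" co-reference pattern.
# Both spans are returned, mirroring the decision to tag both forms.
# The regex is a simplification for illustration only.

PATTERN = re.compile(r"([A-Za-z][\w-]*(?:\s+[\w-]+)*)\s+\(([A-Z]{2,})\)")

def extended_and_acronym(text):
    """Return (extended form, acronym) pairs found in the text."""
    return PATTERN.findall(text)

print(extended_and_acronym("mitogen-activated protein kinase (MAPK)"))
```

In a real extractor both matched spans would be emitted as separate entity mentions of the same class.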

B. Alias description

Regular expression: …Y…X… or …Y (X)…

Examples:
•   TrkA (NTRK1)…

   •   The N-myc gene, or MYCN…

Our decision: NTRK1 and MYCN are official name designations of the TrkA and N-myc

genes, and here they are being co-referenced accordingly. We decided to tag all different

expression forms of the entity instances, including standard/official nomenclatures,

aliases or descriptions. Like acronyms and their extended forms, these various names are

also tagged individually: in the first example, we tagged “TrkA” and “NTRK1”

separately and without the parentheses, not the combined string “TrkA (NTRK1)”.

Our rationale: Researchers often use unofficial nomenclatures for entity mentions, so we

can’t just annotate standard descriptions. However, they should be normalized later.

C. General vs. specific

Regular expression: X, a (the) Y…

Examples:

   •   C-Kit, a tyrosine kinase which plays an important role, …

   •   K-Ras is an oncogene. The Ras gene…

Our decision: In the examples above, the gene family name “Ras” and the superfamily

name “tyrosine kinase” are used to co-refer to the gene family instances “K-Ras” and “C-

Kit”. In such situations, our annotation guideline treated the general terms and more

specific terms completely independently, regardless of the co-referential relationship

between them. That is, depending on the conceptual definition, if the term was a

legitimate mention, it was tagged as an entity mention no matter what levels of specificity

it had. For those examples, since the gene entity definition included both gene instances

and family names, all four terms were tagged as gene entity mentions. We did not,
however, tag “oncogene”, nor did we extend the tag on “Ras” to include the following

word “gene”. These words, at the highest level of generality, convey no taggable

information.

Our rationale: Based on our decision on tagging all information-containing levels of

mentions and specifically for the examples listed, all gene instances, gene families and

superfamilies are determined legitimate mentions.

D. Pronoun reference

Regular expression: …X…PRONOUN (It, This, etc.)…

Examples:

   •   K-Ras is an oncogene. It is mutated in…

   •   Five point mutations were found in the MYC gene, and they were next to each

       other.

Our decision: In the two examples, “It” is co-referential to “K-Ras”, and “they” is co-

referential to “point mutations”. We generally did not annotate pronouns, although they

may refer to legitimate entity mentions.

Our rationale: Pronoun co-reference is a challenging problem in text mining research,

which involves cross-sentence, whole-record level of relation extraction. Without deeper

parsing of the text, there is no value in extracting the pronoun itself.




3.2.3 Structural overlap between entity mentions

       Entities can overlap not only conceptually, but also literally, with their textual

mentions in the literature. Annotation guidelines were developed for the following

situations:

A. Entity within entity – tag within tag

          This refers to the situation in which one entity mention is completely included in the

textual range of another. As the two intertwined entity mentions could belong to either

the same or different entities, we divided this category of problem into two sub-

categories. If the two mentions were in the same entity, only the subsuming entity

mention was tagged. For example, in “mitogen-activated protein kinase kinase kinase”,

there exist 7 distinct gene entity mentions: mitogen-activated protein; mitogen-activated

protein kinase; mitogen-activated protein kinase kinase; mitogen-activated protein kinase

kinase kinase; and three mentions of “kinase”. While this type of situation was a source

of confusion among new annotators, we considered it both unnecessary and costly to tag

all possible mention permutations. As the mention with the largest range was always the

one being discussed, only the outermost mention was tagged as a gene

mention. In fact, this situation led to the adoption of a more generalized guiding

principle, where the annotation should reflect the author intent whenever possible

(although exceptions were encountered, such as poorly written abstracts where the intent

from the context occasionally and obviously differed from the actual word or phrase

used).
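
The "outermost mention" rule for same-type nesting can be sketched as a filter over candidate spans. The span tuples are an illustrative representation, not the WordFreak format:

```python
# Sketch of the "tag within tag" rule for same-type mentions: when one
# candidate span is fully contained in another of the same entity type,
# only the outermost (subsuming) span is kept. Offsets are illustrative.

def outermost(spans):
    """Keep only spans not strictly contained in another span."""
    return [
        (s, e) for (s, e) in spans
        if not any((os <= s and e <= oe) and (os, oe) != (s, e)
                   for (os, oe) in spans)
    ]

# Nested candidates inside "mitogen-activated protein kinase kinase kinase"
candidates = [(0, 24), (0, 31), (0, 38), (0, 45)]
print(outermost(candidates))  # [(0, 45)]
```

Note that this filter applies only to candidates of the same entity type; overlapping mentions of different types were both kept, as described below.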

         If two completely overlapping mentions instead belonged to different entity types,

we annotated both. These mentions were usually related, and they both often provided

valuable information. Some entities, such as malignancy attributes, often appeared as part

of another entity mention. For instance, “colon cancer” is a malignancy type mention, and

“colon” is a malignancy site mention. “Hirschsprung disease 1” is another example, in which

“Hirschsprung disease” is a disease mention while the whole phrase is a gene mention.

B. Entity co-identity – double tagging

        This category represents the situation in which two entity mentions share the exact

same text. We annotated the same text twice with the two corresponding labels under

such circumstances. For example, in the phrase “deletion of the K-ras gene”, “K-ras” was

tagged as both a gene entity mention and a variation-location mention.

C. Discontinuous mentions – chaining

        Sometimes mentions of several entities of the same type shared a common

substring. When written together in the text, the common part occurred only once, for the

first or last mention, and the other mentions were represented only by their differing parts.

For example, in the text “H-, K-, and N-ras…”, there are really three gene mentions: “H-

ras”, “K-ras” and “N-ras”, but a limitation of our annotation software prevented tagging

of discontinuous mentions as one parent mention (in the example above, only “N-ras”

could be tagged). For the other two discontinuous mentions, we developed a chaining

procedure through which annotators were able to link the component parts (“H-” and

“K-” with “ras”) by inserting comments into the annotation in a standard format.

       Chaining was strictly limited within one sentence in order not to complicate issues

for subsequent syntactic parsing of sentences. Employing the same logic, entity mentions

were not allowed to cross sentence boundaries.
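
The intended result of chaining can be sketched as expanding a coordinated, discontinuous mention into its full component mentions. The regex and function below are assumptions for demonstration, not the actual annotation tooling:

```python
import re

# Illustrative sketch of resolving a coordinated, discontinuous mention
# such as "H-, K-, and N-ras" into its full component gene names, in the
# spirit of the chaining procedure described in the text.

def expand_chained(text):
    """Expand prefixes like 'H-, K-, and N-ras' to full gene names."""
    m = re.match(r"((?:[A-Z]-,\s*)+)(?:and\s+)?([A-Z])-(\w+)", text)
    if not m:
        return [text]
    prefixes = re.findall(r"([A-Z])-,", m.group(1))
    stem = m.group(3)
    return [f"{p}-{stem}" for p in prefixes] + [f"{m.group(2)}-{stem}"]

print(expand_chained("H-, K-, and N-ras"))  # ['H-ras', 'K-ras', 'N-ras']
```

In the actual workflow this linkage was recorded manually via standardized comments, since the annotation software could not tag discontinuous spans directly.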

3.3 Syntactical vs. Semantic – ambiguity challenges

       We considered ambiguity in mentions to be the most common and difficult

challenge in our annotation experience, as it truly reflects the limitation of human-

invented texts in fully communicating author intent. In biomedical text, we found it not

uncommon that an identical text string could represent completely different concepts, and

the frequency of ambiguity appeared to be much higher than for non-biological text. In

the following paragraphs, we will use mainly gene entity examples to illustrate the

elusive nature of this problem.

       We found ambiguity to occur both within and outside gene entities. Genes have a

tradition of being independently named, with poor adherence to or awareness of

standards. Researchers tended to coin new acronyms for gene names, and as a result,

there are more gene names than available short combinations of letters and numbers for

symbols and aliases. Thus, many aliases resemble each other just by

chance. Since each gene has multiple non-unique aliases alongside one unique gene symbol,

there exists a serious internal ambiguity problem among the aliases. Based on our

calculation, for human genes alone, as many as 3% of genes share the same

aliases, and the numbers are even higher when other species are included. Also, many species

share the same gene names, especially mouse and human (Chen L et al,

2005). For example, p90 is the common alias shared by the distinct gene symbols CANX

and TFRC. As a protein naming convention, p90 simply refers to a protein with a

molecular weight of 90 kDa. Therefore, it is not surprising that two proteins carry the

same name.
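
The alias-ambiguity problem can be sketched as a lookup against an alias-to-symbol table. The tiny table below is illustrative, seeded only with the example from the text:

```python
# Minimal sketch of the alias-ambiguity problem: an alias that maps to
# more than one official gene symbol is ambiguous and needs context to
# resolve. The table is a tiny illustrative stand-in for a real alias
# resource, seeded with the p90 example from the text.

ALIAS_TABLE = {
    "p90": {"CANX", "TFRC"},   # shared alias from the example above
    "n-myc": {"MYCN"},         # illustrative single-symbol entry
}

def resolve(alias):
    """Return (candidate official symbols, ambiguity flag) for an alias."""
    symbols = ALIAS_TABLE.get(alias.lower(), set())
    return symbols, len(symbols) > 1

symbols, ambiguous = resolve("p90")
print(sorted(symbols), ambiguous)  # ['CANX', 'TFRC'] True
```

A real normalizer would return all candidates and defer to (often quite distant) context to pick one, as the text notes.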

       When such gene mentions appear in literature, (often quite distant) context is the

only way to clarify which gene is in discussion, although sometimes it offers no

assistance. Another type of within-entity gene ambiguity that we recognized was the

frequent apparent inability to distinguish a gene from its downstream products, based

purely on the text string of the mention. Although initially, our gene entity was designed

to capture only the nomenclatures of functional genomic elements, we soon discovered

that researchers were frequently using the same referents to represent a gene and also its

RNA and protein products in the literature. Without looking at the context, a gene

mention “mycn” had an almost equal probability of referring to a gene or its downstream

product, and both the gene and its mRNA were referred to as being “expressed” to create

an mRNA or a protein product, respectively. In addition, authors also tended to obscure

the conceptual boundaries between a gene and its downstream products. For example,

while it is a given protein X that performs a biological function, we commonly found the corresponding gene X described as performing that action. It became apparent that while researchers were personally clear about these distinctions, their descriptions did not adequately convey them. In fact, in several cases, we found it impossible

to determine whether certain gene mentions referred to a gene or its RNA or protein

products even when considering the entire article. This pervasive ambiguity ultimately led us to include genes’ downstream products when annotating gene entity mentions. We therefore created a single entity class, gene, but included labels for partially subdividing its mentions, acknowledging that mentions could not always be assigned perfectly to one of the three classes. If the text did not make clear whether a mention referred to a gene or a protein, the mention was annotated as “gene.generic”, as opposed to “gene.gene/RNA” or “gene.protein”.
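As an illustration of the resulting annotation scheme, a gene mention can be recorded as a standoff annotation carrying one of the three labels. The helper below is a hypothetical sketch, not our actual annotation tooling:

```python
GENE_LABELS = {"gene.generic", "gene.gene/RNA", "gene.protein"}

def annotate(text, start, end, label):
    """Record one gene mention as a standoff annotation (offsets into raw text)."""
    if label not in GENE_LABELS:
        raise ValueError("unknown label: %s" % label)
    return {"mention": text[start:end], "start": start, "end": end, "label": label}

sentence = "mycn is expressed at high levels in this tumor."
# Context does not settle gene vs. protein, so the mention gets "gene.generic".
ann = annotate(sentence, 0, 4, "gene.generic")
print(ann["mention"], ann["label"])  # mycn gene.generic
```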

Besides the challenges mentioned above, it was common to encounter gene entity mentions that were easily confused with objects belonging to other entity types. This is because genes have been named in a wide variety of ways, from the use of lay language to the invention of specialized and often clever acronyms. For example, “Cat”

is an official gene symbol for the gene catalase, but it can also refer to an animal. “NB” is the acronym of the well-known pediatric cancer neuroblastoma, but it is also the official name of a gene locus putatively located on chromosome 1p36.

This cross-entity ambiguity problem was also commonly seen for other entity classes,

such as variation type. As an example, “Insertion” and “deletion” are well-defined

variation type mentions, but they are also frequently used to denote biological or clinical

actions. Regardless of the type of ambiguity, the task for our manual annotators was to make their best call on the intended referent of each text string and annotate it as such. Sometimes annotators needed to take the entire abstract or, rarely, the entire article into consideration in order to determine what a particular mention truly represented. Depending on the nature of the biomedical entities and how representative the training data was, the subsequent automatic extractors were able to disambiguate problematic text strings to a certain degree by taking local contextual features into account.

3.4 Annotator perceptions

       Even if perfect entity definitions and annotation guidelines could somehow be

created, there would still be variations among human annotators in understanding and

applying them during the annotation process, and we certainly encountered lively

discussion regarding some topics. Manual annotation is usually divided among multiple annotators so that more documents can be completed in a shorter period of time, but the downside is that this introduces inconsistencies between annotators. Even with only one annotator, there will be variability in how the guidelines are applied.

       We took two approaches to deal with this problem. First, annotators were told to

discuss anything unclear, and we promoted frequent discussion to establish a consistent path. Second, a dual, sequential-pass manual annotation process was developed and applied to better adjudicate different annotators’ work and to produce training data that was as consistent as possible. In this process, every document was annotated de novo by one annotator and then checked by a second, more experienced and consistent annotator, who was charged with identifying and revising any first-pass annotations considered incorrect. Edited items were then subject to group review, and senior annotators used this editing process as an opportunity to educate less experienced annotators when repeated error patterns were identified.
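The adjudication step of the dual-pass process amounts to diffing two annotation passes over the same document. A minimal sketch, assuming annotations are reduced to (start, end, label) tuples (a simplification of the real annotation records):

```python
def adjudicate(first_pass, second_pass):
    """Diff two annotation passes: report spans the second annotator added,
    removed, or relabeled, for subsequent group review."""
    spans1 = {(s, e): lab for s, e, lab in first_pass}
    spans2 = {(s, e): lab for s, e, lab in second_pass}
    added = sorted((s, e, spans2[(s, e)]) for (s, e) in spans2 if (s, e) not in spans1)
    removed = sorted((s, e, spans1[(s, e)]) for (s, e) in spans1 if (s, e) not in spans2)
    relabeled = sorted(((s, e), spans1[(s, e)], spans2[(s, e)])
                       for (s, e) in spans1
                       if (s, e) in spans2 and spans1[(s, e)] != spans2[(s, e)])
    return added, removed, relabeled

first = [(0, 4, "gene.generic"), (10, 18, "gene.protein")]
second = [(0, 4, "gene.gene/RNA"), (10, 18, "gene.protein"), (25, 30, "gene.generic")]
added, removed, relabeled = adjudicate(first, second)
```

Here the second pass added one mention and relabeled another; both changes would be queued for review.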

3.5 Publication-based errors

       Typographical and grammatical errors, though infrequent, are inevitable, and

some were observed in entity mentions during our process. Due to copyright considerations, we were not authorized to change the text in such cases; instead, we skipped tagging the affected mentions and added explanatory comments.

4. Application

        As a result of the generation and application of these carefully refined entity

definitions and annotation guidelines, 1,442 MEDLINE abstracts were manually annotated. Of these, 1,157 files have been made publicly available (release 0.9, BioIE web site). Since the release, the data have been widely used by the biomedical text mining community for a variety of purposes, including entity recognition and normalization, and this usage is likely to increase (Cohen KB et al, 2005).

       Because of the consistency of the training data across the corpus, the developed

entity and attribute extractors perform with high precision and recall rates. Table 2-1

indicates the performance of three entity extractors built with this data (McDonald RT et

al, 2004; Jin Y et al, 2006).

        Entity               Precision      Recall      F-measure
        Gene                   0.864        0.787         0.824

        Variation Type         0.8556       0.7990        0.8263
        Location               0.8695       0.7722        0.8180
        State-Initial          0.8430       0.8286        0.8357
        State-Sub              0.8035       0.7809        0.7920
        Overall                0.8541       0.7870        0.8192

        Malignancy type        0.8456       0.8218        0.8335

                   Table 2-1: Entity extractor performance on evaluation data
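The F-measures in Table 2-1 are the harmonic means of the corresponding precision and recall values, which can be verified directly:

```python
def f_measure(precision, recall):
    """F-measure as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproducing two rows of Table 2-1:
print(round(f_measure(0.864, 0.787), 3))    # 0.824  (gene extractor)
print(round(f_measure(0.8556, 0.7990), 4))  # 0.8263 (variation type)
```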

5. Conclusion

       Manual annotation is an indispensable step to create training data for developing

machine-learning automated extractors. In order to generate extractors that perform with

accuracies high enough to be acceptable to the biomedical research community,

consistently annotated training data is a prerequisite. Although we did not formally prove

it, our experience has been that investment in developing literature-based entity definitions and annotation guidelines yields far better extracted information with distinct conceptual boundaries, which in turn increases the opportunity for practical application. We concluded that, rather than trying to construct unifying definitions that maximize acceptance and minimize contention among domain experts, a consistent and generally defensible definition was preferable when specifying entity boundaries and extents. More important for us was to consider how the extracted information would be used and, once that was determined, how to maintain consistency throughout the training corpus.

References

Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures.

Bioinformatics, 21: 248-256. (2005).



Cohen KB, Fox L, Ogren PV, Hunter L: Corpus design for biomedical natural language

processing. Proceedings of the ACL-ISMB workshop on linking biological literature,

ontologies and databases, pp. 38-45. Association for Computational Linguistics. (2005).



Collier N, Nobata C, Tsujii J: Extracting the names of genes and gene products with a

hidden Markov model. In Proceedings of the 18th International Conference on

Computational Linguistics, Saarbrücken, Germany. (2000).



GENIA: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ (2004).



Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J: ProMiner: rule-based protein

and gene entity recognition. BMC Bioinformatics. 6: S14. (2005).



Jin Y, McDonald RT, Lerman K, Mandel MA, Carroll S, Liberman MY, Pereira FC,

Winters RS, White PS: Automated recognition of malignancy mentions in biomedical

literature. BMC Bioinformatics, 7: 492. (2006).



McDonald RT, Winters RS, Mandel M, Jin Y, White PS, Pereira F: An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics, 20(17): 3249-3251. (2004).



Penn BioIE: http://bioie.ldc.upenn.edu/index.jsp



Tanabe L, Wilbur W: Tagging gene and protein names in biomedical text,

Bioinformatics, 18:1124-1132. (2002).




Chapter 3. Automated Recognition of Malignancy Mentions in
                        Biomedical Literature

                                            Yang Jin
                                      Ryan T. McDonald
                                         Kevin Lerman
                                        Mark A. Mandel
                                         Steven Carroll
                                       Mark Y. Liberman
                                     Fernando C. N. Pereira
                                        R. Scott Winters
                                         Peter S. White


                       Published: BMC Bioinformatics, 7:492, 2006

Abstract

   Background: The rapid proliferation of biomedical text makes it increasingly

difficult for researchers to identify, synthesize, and utilize developed knowledge in their

fields of interest. Automated information extraction procedures can assist in the

acquisition and management of this knowledge. Previous efforts in biomedical text

mining have focused primarily upon named entity recognition of well-defined molecular

objects such as genes, but less work has been performed to identify disease-related

objects and concepts. Furthermore, promise has been tempered by an inability to

efficiently scale approaches in ways that minimize manual efforts and still perform with

high accuracy. Here, we have applied a machine-learning approach previously successful

for identifying molecular entities to a disease concept to determine if the underlying

probabilistic model effectively generalizes to unrelated concepts with minimal manual

intervention for model retraining.


Results: We developed a named entity recognizer (MTag), an entity tagger for

recognizing clinical descriptions of malignancy presented in text. The application uses

the machine-learning technique Conditional Random Fields with additional domain-

specific features. MTag was tested with 1,010 training and 432 evaluation documents

pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83

recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using

string matching of text with a neoplasm term list, MTag performed with a much higher

recall rate (92.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns.

Application of MTag to all MEDLINE abstracts yielded the identification of 580,002

unique and 9,153,340 overall mentions of malignancy. Significantly, addition of an

extensive lexicon of malignancy mentions as a feature set for extraction had minimal

impact on performance.

   Conclusions: Together, these results suggest that the identification of disparate

biomedical entity classes in free text may be achievable with high accuracy and only

moderate additional effort for each new application domain.

Background

   The biomedical literature collectively represents the acknowledged historical

perception of biological and medical concepts, including findings pertaining to disease-

related research. However, the rapid proliferation of this information makes it

increasingly difficult for researchers and clinicians to peruse, query, and synthesize it for

biomedical knowledge gain. Automated information extraction methods, which have

recently been increasingly concentrated upon biomedical text, can assist in the acquisition

and management of this data. Although text mining applications have been successful in

other domains and show promise for biomedical information extraction, issues of

scalability impose significant impediments to broad use in biomedicine. Particular

challenges for text mining include the requirement for highly specified extractors in order

to generate accuracies sufficient for users; considerable effort by highly trained computer

scientists with substantial input by biomedical domain experts to develop extractors; and

a significant body of manually annotated text—with comparable effort in generating

annotated corpora—for training machine-learning extractors. In addition, the high

number and wide diversity of biomedical entity types, along with the high complexity of

biomedical literature, makes auto-annotation of multiple biomedical entity classes a

difficult and labor-intensive task.

   Most biomedical text mining efforts to date have focused upon molecular object

(entity) classes, especially the identification of gene and protein names. Automated

extractors for these tasks have improved considerably in the last few years [1-13]. We

recently extended this focus to include genomic variations [14]. Although there have

been efforts to apply automated entity recognition to the identification of phenotypic and

disease objects [15-17], these systems are broadly focused and often do not perform as

well as those utilizing more recently-evolved machine-learning techniques for such tasks

as gene/protein name recognition. Recently, Skounakis and colleagues have applied a

machine-learning algorithm to extract gene-disorder relations [18], while van Driel and

co-workers have made attempts to extract phenotypic attributes from Online Mendelian

Inheritance in Man [19]. However, more extensive work on medical entity class

recognition is necessary because it is an important prerequisite for utilizing text

information to link molecular and phenotypic observations, thus improving the

association between laboratory research and clinical applications described in the

literature.

    In the current work, we explore scalability issues relating to entity extractor

generality and development time, and also determine the feasibility of efficiently

capturing disease descriptions. We first describe an algorithm for automatically

recognizing a specific disease entity class: malignant disease labels. This algorithm,

MTag, is based upon the probability model Conditional Random Fields (CRFs) that has

been shown to perform with state-of-the-art accuracy for entity extraction tasks [5, 14].

CRF extractors consider a large number of syntactic and semantic features of text

surrounding each putative mention [20, 21]. MTag was trained and evaluated on

MEDLINE abstracts and compared with a baseline vocabulary matching method. An

MTag output format that provides HTML-visualized markup of malignant mentions was

developed. Finally, we applied MTag to the entire collection of MEDLINE abstracts to

generate an annotated corpus and an extensive vocabulary of malignancy mentions.
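CRF taggers such as MTag score candidate label sequences using overlapping features of each token and its neighbors. The fragment below is a hypothetical illustration of such per-token features, not the actual MTag feature set:

```python
def token_features(tokens, i):
    """Hypothetical per-token features of the kind a CRF tagger consults."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "suffix3": w[-3:].lower(),   # suffixes such as '-oma' are often indicative
        "is_capitalized": w[:1].isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

feats = token_features(["metastatic", "neuroblastoma", "cells"], 1)
print(feats["suffix3"], feats["prev_word"])  # oma metastatic
```

During training the CRF learns a weight for each (feature, label) pair, so indicative suffixes and neighboring words contribute directly to tagging decisions.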

Results

MTag performance

    Manually annotated text from a corpus of 1,442 MEDLINE abstracts was used to

train and evaluate MTag. Abstracts were derived from a random sampling of two

domains: articles pertaining to the pediatric tumor neuroblastoma and articles describing

genomic alterations in a wide variety of malignancies. Two separate training experiments

were performed, either with or without the inclusion of malignancy-specific features,

which were the addition of a lexicon of malignancy mentions and a list of indicative

suffixes. In each case, MTag was trained with the same randomly selected 1,010 training

documents and then evaluated with a separate set of 432 documents pertaining to cancer

genomics. The extractor took approximately 6 hours to train on a 733 MHz PowerPC G4

with 1 GB SDRAM. Once trained, MTag can annotate a new abstract in a matter of

seconds.

   For evaluation purposes, manual annotations were treated as gold-standard files

(assuming 100% annotation accuracy). We first evaluated the MTag model with all

biological feature sets included. Our experiments resulted in 0.846 precision, 0.831 recall,

and 0.838 F-measure on the evaluation set. Additionally, the two subset corpora

(neuroblastoma-specific and genome-specific) were tested separately. As expected, the

extractor performed with higher accuracy with the more narrowly defined corpus

(neuroblastoma) than with the corpus more representative of various malignancies (genome-specific). On the neuroblastoma corpus the extractor achieved 0.88 precision, 0.87 recall, and 0.88 F-measure, while on the genome-specific corpus it achieved 0.77 precision,

identifying mentions of malignancy in a document set demonstrating a more diverse

collection of mentions.

   To determine the impact of the biological feature sets we included to provide domain

specificity, we excluded these feature sets to create a generic MTag. This extractor was

then trained and evaluated using the identical set of files used to train the biological

MTag version. Somewhat surprisingly, the extractor performed with similar accuracy

with the generic model, resulting in 0.851 precision, 0.818 recall, and 0.834 F-measure

on the evaluation set. These results suggested that at least for this class of entities, the

extractor performs the task of identifying malignancy mentions efficiently without the

use of a specialized lexicon.

Extraction versus string matching

   We next determined performance of MTag relative to a baseline system that could be

easily employed. For the baseline system, the NCI neoplasm ontology, a term list of

5,555 malignancies, was used as a lexicon to identify malignancy mentions [22]. Lexicon

terms were individually queried against text by case-insensitive exact string matching. A

subset of 39 abstracts randomly selected from the testing set, which together contained

202 malignancy mentions, were used to compare the automated extractor and baseline

results. MTag identified 190 of the 202 mentions correctly (94.1%), while the NCI list

identified only 85 mentions (42.1%), all of which were also identified by the extractor.

We also determined the performance of string matching that instead used the set of

malignancy mentions identified in the manually curated training set annotations (1,010

documents) as a matching lexicon. This system identified 79 of 202 mentions (39.1%).

Combining the manually-derived lexicon with the NCI lexicon yielded 124 of 202

matches (61.4%).
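The baseline described above can be sketched as case-insensitive exact-phrase matching of a lexicon against abstract text (a simplified illustration; the real comparison used the full 5,555-term NCI neoplasm list):

```python
import re

def lexicon_matches(text, lexicon):
    """Case-insensitive exact-phrase matching of lexicon terms against text."""
    found = set()
    for term in lexicon:
        if re.search(r"(?i)\b" + re.escape(term) + r"\b", text):
            found.add(term.lower())
    return found

abstract = "Patients with acute myeloid leukemia (AML) were studied."
lexicon = ["acute myeloid leukemia", "AML", "neuroblastoma"]
hits = lexicon_matches(abstract, lexicon)
print(sorted(hits))  # ['acute myeloid leukemia', 'aml']
```

Such matching can only find terms already in the lexicon, which is why its recall (85/202, or 42.1%) falls well below MTag's.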

   A closer analysis of the 68 malignancy mentions missed by the string matching with

combined lists but positively identified by MTag revealed two general subclasses of additional malignancy mentions. The majority of MTag-unique mentions were lexical or

modified variations of malignancies present either in the training data or in the NCI

lexicon, such as minor variations in spelling and form (e.g., “leukaemia” versus

“leukemia”), and acronyms (e.g., “AML” in place of “acute myeloid leukemia”). More

importantly, a substantial minority of mentions identified only by MTag were instances

of the extractor determining new mentions of malignancies that were, in many cases,

neither obvious nor represented in readily available lexicons. For example, “temporal

lobe benign capillary haemangioblastoma” and “parietal lobe ganglioglioma” are neither

in the NCI list nor the training set per se, nor approximated as such by a lexical variant. This

suggests that MTag contributes a significant learning component.

Application to MEDLINE

       MTag was then used to extract mentions of malignancy from all MEDLINE

abstracts through 2005. Extraction took 1,642 CPU-hours (68.4 CPU-days; 2.44 days on

our 28-CPU cluster) to process 15,433,668 documents. A total of 9,153,340 redundant

mentions and 580,002 unique mentions (ignoring case) were identified. Interestingly, the

ratio of unique new mentions identified relative to the number of abstracts analyzed was

relatively uniform, ranging from a rate of 0.183 new mentions per abstract for the first

0.1% of documents to a rate of 0.038 new mentions per abstract for the last 1% of

documents. This indicated that a substantial rate of new mentions was being maintained

throughout the extraction process.

   The 25 mentions found in the greatest number of abstracts by MTag are listed in

Table 1. Six of these phrases (pulmonary, fibroblasts, neoplastic, neoplasm metastasis, extramural, and abdominal) did not match our definition of malignancy. Of

these, only “extramural” is not frequently associated with malignancy descriptions and is

likely the result of containing character n-grams that are generally indicative of

malignancy mentions. The remaining five phrases are likely the result of the extractor

failing to properly define mention boundaries in certain cases (e.g., tagging “neoplasm”

rather than “brain neoplasm”), or alternatively, shared use of an otherwise indicative

character string (e.g., “opl” in “brain neoplasm” and “neoplastic”) between a true positive

and a false positive.

   For comparison, we also determined the corresponding number of articles identified

both by keyword searching of PubMed and by exact string matching of MEDLINE for

each of the 19 most common true malignancy types (Table 1). Overall, MTag’s

comparative recall was 1.076 versus PubMed keyword searching and 0.814 versus string

matching. As PubMed keyword searching uses concept mapping to relate keywords to

related concepts, thus providing query expansion, the document retrieval totals derived

from this approach do not strictly compare to MTag’s approach. Furthermore, the exact

string totals would be inflated relative to the MTag totals, as for example the phrase

“myeloid leukemia” would be counted both for this category and for a category

“leukemia” with exact string matching, but would only be counted for the former phrase

by MTag. To adjust for these discrepancies, for MTag document totals listed in Table 1,

we included documents that were tagged with malignancy mentions that were both strict

syntactic parents and biological children of the phrase used. For example, we included

articles identified by MTag with the phrase “small-cell lung cancer” within the total for

the phrase “lung cancer”.

   Comparison of these totals between MTag articles and PubMed keyword searching

revealed that MTag provided high recall for most malignancies. Interestingly, there are

three malignancy mention instances (“carcinoma”, “sarcoma”, “melanoma”) that have

more MTag-identified articles than for PubMed keyword searches. This suggests that a

more formalized normalization of MTag-derived mentions might assist both with

efficiency and recall if employed in concert with the manual annotation procedure

currently employed by MEDLINE. Furthermore, MTag’s document recall compared

quite favorably to exact string matching. Only two of the 25 malignancy mentions

yielded less than 60% as many articles via MTag as via PubMed exact string matching

(“bone neoplasms” and “lung cancer”). In these two cases, the concept-mapping PubMed

search identifies the articles with a broader range beyond the search terms. For example,

a PubMed search for the term “lung cancer” identifies articles describing “lung

neoplasms”, while for “bone neoplasms”, articles focusing on related concepts such as

“osteoma” and “sphenoid meningioma” are identified by PubMed. Generally, MTag

recall would be expected to improve further after a subsequent normalization process that

maps equivalent phrases to a standard referent.

   To assess document-level precision, we randomly selected 100 abstracts identified by

MTag each for the malignancies “breast cancer” and “adenocarcinoma”. Manual

evaluation of these abstracts showed that all of the articles were directly describing the

respective malignancies. Finally, we evaluated both the 250 most frequently mentioned

malignancies as well as a random set of 250 extracted malignancy mentions from the all-

MEDLINE-extracted set. For the frequently occurring mentions, 72.06% were considered

to be true malignancies; this set corresponds to 0.043% of all malignancy mentions. For

the random set, 78.93% were true malignancies. This suggests that such extracted

mention sets might serve as a first-pass exhaustive lexicon of malignancy mentions.

Comparison of the entire set of unique mentions with the NCI neoplasm list showed that

1,902 of the 5,555 NCI terms (34.2%) were represented in the extracted literature.




Software

MTag is platform independent, written in Java, and requires Java 1.4.2 or higher to

run. The software is freely available under the GNU General Public License at

http://bioie.ldc.upenn.edu/index.jsp?page=soft_tools_MalignancyTaggers.html. MTag

has been engineered to directly accept files downloaded from PubMed and formatted in

MEDLINE format as input. MTag provides output options of text or HTML file versions

of the extractor results. The text file repeats the input file with recognized malignancy

mentions appended at the end of the file. The HTML file provides markup of the original

abstract with color-highlighted malignancy mentions, as shown in Figure 1.
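The HTML markup step can be sketched as wrapping each extracted mention span in a highlighted element (a simplified illustration of the output format, not MTag's actual implementation):

```python
def highlight(text, spans, color="yellow"):
    """Wrap each (start, end) mention span in a highlighted <span>.
    Spans must be non-overlapping; they are applied right to left so that
    earlier character offsets remain valid."""
    for start, end in sorted(spans, reverse=True):
        text = (text[:start]
                + '<span style="background:%s">%s</span>' % (color, text[start:end])
                + text[end:])
    return text

marked = highlight("Advanced neuroblastoma was studied.", [(9, 22)])
print(marked)
# Advanced <span style="background:yellow">neuroblastoma</span> was studied.
```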

Discussion

   We have adapted an entity extraction approach that has been shown to be successful

for recognition of molecular biological entities and have shown that it also performs with

high accuracy for disease labels. It is evident that an F-measure of 0.83 is not sufficient as

a stand-alone approach for curation tasks, such as the de novo population of databases.

However, such an approach provides highly enriched material for manual curators to

utilize further. As was determined by our comparisons with lexical string matching and

PubMed-based approaches, our extraction method demonstrated substantial improvement

and efficiency over commonly employed methods for document retrieval. Furthermore,

MTag appeared to be accurately predicting malignancy mentions by learning and

exploiting syntactic patterns encountered in the training corpus.

   Analysis of mis-annotations would likely suggest additional features and/or heuristics

that could boost performance considerably. For example, anatomical and histological

descriptions were frequent among MTag false positive mentions. Incorporation of

lexicons for these entity types as negative features within the MTag model would likely

increase precision. Our training set also does not include a substantial number of

documents that do not contain mentions of malignancy; recent unpublished work from

our group suggests that inclusion of such documents significantly impacts extractor

performance in a positive manner.

Unlike the first iteration of our CRF model [14], the MTag application required only modest retraining and customization effort (several weeks vs. several months; see Methods). To our surprise, the addition of biological features,

including an extensive lexicon for malignancy mentions, provided very little boost to the

recall rate. This provides evidence that our general CRF model is flexible, broadly

applicable, and if these results hold true for additional entity types, might lessen the need

for creating highly specified extractors. In addition, the need for extensive domain-

specific lexicons, which do not readily exist for many disease attributes, might be

obviated. If so, one approach to comprehensive text mining of biomedical literature might

be to employ a series of modular extractors, each of which is quickly generated and then

trained for a particular entity or relation class. Conversely, it is important to note that the

entity class of malignancy possesses a relatively discrete conceptualization relative to

certain other phenotypic and disease concepts. Further adaptation of our extractor model

for more variably described entity types, such as morphological and developmental

descriptions of neoplasms, is underway. However, the finding that biological feature

addition provided minimal gain in accuracy suggests that further improvements may require more than merely identifying and adding additional domain-specific

features. Significantly, challenges in rapid generation of annotations for extractor

training, as well as procedures for efficient and accurate entity normalization, still

remain.

   When combined with expert evaluation of output, extractors can assist with

vocabulary building for targeted entity classes. To demonstrate feasibility, we extracted

mentions of malignancy for all pre-2006 MEDLINE abstracts. Our results indicate that

MTag can generate such a vocabulary readily and with moderate computational resources

and expertise. With manual intervention, this list could be linked to the underlying

literature records and also integrated with other ontological and database resources, such

as the Gene Ontology, UMLS, caBIG, or tumor-specific databases [23-25]. Since

normalization of disease-descriptive term lists requires considerable specialized

expertise, the role of an extractor in this setting more appropriately serves as an

information harvester. However, this role is important, as such supervised lists are often

not readily available, due in part to the variability with which phenotypic and disease concepts can be described, and in part to the lack of nomenclature standards in many

cases.

   Finally, to our knowledge, MTag is one of the first directed efforts to automatically

extract entity mentions in a disease-oriented domain with high accuracy. Therefore,

applications such as MTag could contribute to the extraction and integration of

unstructured, medically-oriented information, such as physician notes and physician-

dictated letters to patients and practitioners. Future work will include determining how

well similar extractors perform for identifying mentions of malignant attributes with

greater (e.g. tumor histology) and lesser (e.g. tumor clinical stage) semantic and syntactic

heterogeneity.



Conclusions

   MTag can automatically identify and extract mentions of malignancy with high

accuracy from biomedical text. Generation of MTag required only moderate

computational expertise, development time, and domain knowledge. MTag substantially

outperformed information retrieval methods using specialized lexicons. MTag also

demonstrated the ability to assist with the generation of a literature-based vocabulary for

all neoplasm mentions, which is of benefit for data integration procedures requiring

normalization of malignancy mentions. Parallel iteration of the core algorithm used for

MTag could provide a means for more systematic annotation of unstructured text, involving the identification of many entity types and application to phenotypic and medical classes of information.

Methods

Task definition

   Our task was to develop an automated method that would accurately identify and

extract strings of text corresponding to a clinician’s or researcher’s reference to cancer

(malignancy). Our definition of the extent of the label “malignancy” was generally the

full noun phrase encompassing a mention of a cancer subtype, such that “neuroblastoma”,

“localized neuroblastoma”, and “primary extracranial neuroblastoma” were considered to

be distinct mentions of malignancy. Directly adjacent prepositional phrases, such as

“cancer <of the lung>”, were not allowed, as these constructions often denoted ambiguity

as to exact type. Within these confines, the task included identification of all variable

descriptions of particular malignancies, such as the forms “squamous cell carcinoma”

(histological observation) or “lung cancer” (anatomical location), both of which are

underspecified forms of “lung squamous cell carcinoma”. Our formal definition of the

semantic type “malignancy” can be found at the Penn BioIE website [26].
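The extent conventions above can be illustrated with a small sketch (hypothetical, and not part of the actual MTag pipeline) that encodes one gold malignancy mention as token-level BIO labels, the representation typically used to train sequence taggers; the tokenization and the label names B-MAL/I-MAL are illustrative stand-ins, not the Penn BioIE scheme:

```python
# Hypothetical sketch: encode a gold malignancy mention as token-level
# BIO labels (B-MAL = first token of a mention, I-MAL = inside, O = outside).
# Tokenizer and label names are illustrative, not the Penn BioIE conventions.

def bio_labels(tokens, mention):
    """Label each token B-MAL/I-MAL/O for one contiguous mention (a token list)."""
    labels = ["O"] * len(tokens)
    n = len(mention)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == mention:
            labels[i] = "B-MAL"
            for j in range(i + 1, i + n):
                labels[j] = "I-MAL"
            break
    return labels

tokens = "Children with localized neuroblastoma often fare well .".split()
# Per the extent definition, the full noun phrase is one distinct mention:
print(bio_labels(tokens, ["localized", "neuroblastoma"]))
# → ['O', 'O', 'B-MAL', 'I-MAL', 'O', 'O', 'O', 'O']
```

Under this convention, "neuroblastoma" and "localized neuroblastoma" yield different label sequences, reflecting their status as distinct mentions.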

Corpora

   In order to train and test the extractor with both depth and breadth of entity mention,

we combined two corpora. The first corpus concentrated upon a specific

malignancy (neuroblastoma) and consisted of 1,000 randomly selected abstracts

identified by querying PubMed with the query terms “neuroblastoma” and “gene”. The

second corpus consisted of 600 abstracts previously selected as likely containing gene

mutation instances for genes commonly mutated in a wide variety of malignancies. These

sets were combined to create a single corpus of 1,442 abstracts, after eliminating 158

abstracts that appeared to be non-topical, had no abstract body, or were not written in

English. This set was manually annotated for tokenization, part-of-speech assignments,

and malignancy named entity recognition, the latter in strict adherence to our pre-

established entity class definition [27, 28]. Sequential dual-pass annotations were

performed on all documents by experienced annotators with biomedical knowledge, and

discrepancies were resolved through forum discussions. A total of 7,303 malignancy

mentions were identified in the document set. These annotations are available in corpus

release v0.9 from our BioIE website [29].
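The abstract-elimination step described above can be sketched as a simple filter; the record fields below are hypothetical stand-ins for PubMed metadata, and the non-topical judgment (which was made manually) is not modeled:

```python
# Hypothetical sketch of the corpus-filtering step: keep only records that
# have an abstract body and are tagged as English. Field names are
# illustrative, not the actual PubMed record format.

def keep_record(record):
    """Return True if the record should stay in the combined corpus."""
    has_body = bool(record.get("abstract", "").strip())
    is_english = record.get("language") == "eng"
    return has_body and is_english

records = [
    {"pmid": "1", "abstract": "Neuroblastoma is ...", "language": "eng"},
    {"pmid": "2", "abstract": "", "language": "eng"},         # no abstract body
    {"pmid": "3", "abstract": "Le neuroblastome ...", "language": "fre"},
]
corpus = [r for r in records if keep_record(r)]
print([r["pmid"] for r in corpus])
# → ['1']
```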

Algorithm

   Based on the manually annotated data, an automatic malignancy mention extractor

(MTag) was developed using the probability model Conditional Random Fields (CRFs)

[20]. We have previously demonstrated that this model yields state-of-the-art accuracy

for recognition of molecular named entity classes [5, 14]. CRFs model the conditional
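Although the exact MTag feature set is not reproduced here, CRF taggers of this kind typically represent each token by orthographic and contextual features. The following sketch shows what such a per-token feature function might look like; the feature names and choices are illustrative assumptions, not MTag's actual features:

```python
# Hypothetical sketch of per-token feature extraction for a CRF sequence
# tagger. Real systems (including MTag) use richer feature sets; these
# orthographic and contextual features are only illustrative.

def token_features(tokens, i):
    """Map the token at position i to a dict of features for a CRF."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.hasdigit": any(c.isdigit() for c in word),
        "suffix3": word[-3:],                       # e.g. "-oma", "-ase"
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<s>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

sent = "primary extracranial neuroblastoma was diagnosed".split()
f = token_features(sent, 2)
print(f["suffix3"], f["prev.lower"], f["next.lower"])
# → oma extracranial was
```

A CRF toolkit (e.g. the sklearn-crfsuite package) would then take one such feature dict per token, paired with the token's BIO label, as training input.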

Chapter 1. Introduction

  • 1. Chapter 1. Introduction The Exponential Growth of Biomedical Research Data The current capabilities of our biomedical research enterprise, exemplified by the completion of Human Genome Project, enable researchers to quickly and routinely survey the contents of entire molecular and cellular systems. This capability is generating a revolution in biomedical research in various profound ways. One significant change is the availability of staggering amounts of genomic and functional genomic data gathered at a whole genome or whole cell scale. As the result of such tremendous technology breakthroughs, the challenge for biomedical research is being shifted from experimental data generation to the organization, curation and interpretation of these data (Lander ES et al, 2001; Meldrum D et al, 2000). Biomedical research literature can be considered to be a knowledgebase that comprises the most complete status of our research enterprise. Reflecting the geometric growth of available experimental data, the publication rate in biomedicine is also increasing exponentially. There are currently more than 17 million biomedical articles already represented in the National Library of Medicine’s biomedical literature database MEDLINE, including more than 3 million articles published within last 5 years alone and 2,000 per day in 2006 (Hunter L et al, 2006; MEDLINE). Keeping abreast of this large and ever-expanding body of information is increasingly daunting for researchers in order to track and utilize what’s relevant to their interests, especially for new investigators. For example, the pediatric tumor neuroblastoma is a common pediatric tumor but considered to be quite rare overall, with approximately 600 new cases diagnosed in the US each 1
  • 2. year. However, there are almost 25,000 research articles describing neuroblastoma, making it virtually impossible for a new investigator to systematically assess historical research on this topic. Furthermore, researchers have the increasing need to get in touch with the research fields outside their core competence. The commonly used PubMed system, which provides a convenient query interface for MEDLINE, provides keyword search and some concept mapping for researchers to narrow down the information they are looking for (PubMed). However, its capabilities lack the precision (positive predictive value), recall (sensitivity), granularity, and relevance ranking capabilities that many typical but complex research queries have. One of the most popular demands that general-purpose systems such as PubMed fail to satisfy is the ability to extract and compile specific knowledge or facts out of literature records. For example, there is no provision in PubMed-like systems to determine which genes have been studied thus far in relation to a certain type of malignancy, other than to read through the set of articles identified by PubMed using keywords defining the concepts “gene” and “cancer” (or the type of cancer of interest), and then identifying the particular genes one article at a time. With the exponentially increasing literature size, the process will not only be more time consuming, but also be less reliable on getting the right articles. Consequently, the gap between what is recognized and what is currently known is widening (Wren JD et al, 2004). Biomedical text mining techniques can help researchers meet this challenge by developing automated systems to extract the relevant information out of the text and organize it into a structured knowledgebase. 2
  • 3. Data Integration Opportunities in Cancer Research The general challenge of biomedical literature knowledge extraction is confounded in cancer research, including an acute need to more systematically identify linkages between genomic data and malignant phenotypes. Characterization of the molecular aberrations responsible for the onset and progression of malignancy is a major goal for cancer researchers, and genomic components of the aberrations, ranging from base pair variance to chromosome deletion, are crucial determinants in this regard. Despite the existence of some locus-, mutation- and disease-specific resources, there is currently no central cancer knowledge database in the public domain integrating genomic findings with phenotypic observations of tumors (Cairns J et al, 2000; Freimer N et al, 2003). While high-throughput screening efforts increasing allow researchers to identify genome-wide mutational profiles for specific tumors, this information is largely diffusely distributed and is mostly catalogued in a semi-structured manner throughout the biomedical literature. Such decentralization is holding back the efforts towards making rapid and comprehensive inferences of the genomic basis of malignancy onset and progression in a manner that incorporates cumulative knowledge. Ideally, researchers and clinicians would likely benefit from a comprehensive cancer knowledgebase that consolidates experimental work (genome-level investigation), clinical observations (descriptions of phenotype) and patient outcome (efficacy of treatment). Because the biomedical literature represents a large proportion of this information, which is both critically reviewed and eventually objective in its presentation of cancer research information, means for more adequately extracting, normalizing and relating such diverse collections of information in literature are crucial to solving this data integration problem 3
  • 4. in cancer research. Named Entity Recognition The successful development of text mining technology has been increasingly applied in biomedical research to assist with meeting the above-mentioned challenges. There have been significant efforts from both computational linguists and bioinformaticists within the past 5 years to develop automated biomedical text mining (BTM) systems (Jensen LJ et al, 2006). BTM tasks include named entity recognition (NER), information extraction (IE), document retrieval (DR), and literature-based discovery (LBD). NER, which serves as the basis for most other BTM undertakings, is the process of identifying mentions of biomedical entities (objects, such as genes and diseases) in the text. Named entity recognition can be at first deceptively straightforward, but it is has emerged as a challenging and considerable task in BTM research. NER begins with the classification and definition of biomedical entities, which easily consumes tremendous amount of effort because of the complex and lack-of-standard nature in biomedical entities. The process of identifying references to biomedical objects in text is usually split into two steps: the identification of mentions of specific entity instances in text, such as “the p53 gene” or “acute lymphoblastic leukemia”; and the assignment of these mentions to a standard referent (normalization), such as classifying “the p53 gene” as a mention of the official gene symbol “TP53”, or “ALL” as “acute lymphoblastic leukemia”. Many biomedical entities either lack controlled vocabularies that can act as sufficient nomenclature standards, or the instances in text are not expressed with the standards due to historical reasons. Therefore, normalization is absolutely necessary for equating entity 4
  • 5. values as appropriate, or placing values into a hierarchical or ontological framework (e.g., “ALL” as a form of “leukemia”. Much BTM research to date has focused upon molecular entities that tend to be more discretely definable, such as genes and protein-protein interactions, than phenotypic entities, which are harder to classify semantically (BioCreAtIvE; McDonald R et al, 2005; Settles BA 2005; Zhou G et al, 2005). NER methods include both rule-based and machine-learning approaches. Rule- based approaches use sets of “rules”, alone or in combination, that pre-state signature grammatical and especially character and word-based patterns within a string of text being considered, and then return Boolean values as an output. For example, a rule to identify a gene name could be “This word is a gene if it contains the consecutive letters ‘KIAA”, all of which are capitalized”. There can be some allowance for lexical variations, such as capitalization, stemming, or punctuation, and some or all rules might compare the text being considered to a term list, such as a pre-compiled list of known tumor types. However, the performance of the approach can’t count on the completion of the dictionary-type list in terms of both depth (the completion of the entity unique identifiers) and breadth (the completion of the synonyms for each unique identifier) because for most biomedical entities, the term lists are always changing and never complete. For complexly formulated text, rule-based approaches typically require considerable thought and exquisite biological knowledge. Advantages of this approach are relatively high precision without the requirement for generating extensive training material. However, disadvantages include high false negative rates, a performance plateau that is increasingly difficult to overcome, and, for complex and heterogeneous text, a tendency to generate low recall. Most first-generation systems and many domain- 5
  • 6. focused current systems utilize rule-based approaches; when coupled with a term list, this approach accomplishes both steps of the overall NER task at one time. However, rule- based systems have enjoyed only modest success for biomedical applications, likely because their performances have plateaued below rates acceptable for wide use by researchers, or their application domains have been overtly narrow (Hanisch D et al, 2005; Fundel K et al, 2005; Chang JT et al, 2004; Finkel J et al, 2005). Given the limitations of rule-based systems, a number of machine-learning algorithms have been applied to improve the first step of the NER task. Generally, these algorithms consider and then define sets of features within and surrounding entity mentions that co-associate with the mentions. These can include orthographic features of the text (e.g., suffixes, particular sequential combinations of characters or words, capitalization patterns, etc.) and domain-specific features (e.g., term lists). For example, the suffix “-ase” usually indicates a protein name, and the noun phrase immediately preceding the word “gene” is often a gene name. Machine-learning approaches have several advantages: at their purest, they require no domain knowledge; they can consider thousands or millions of features simultaneously; they can provide confidence scores for predictions; and they can consider the entire feature space simultaneously. However, the success of machine-learning approaches is dependent upon two critical and costly factors. First, ML systems require the establishment, quality, and representativeness of a set of manually generated training material from which to “learn” features, a process that requires considerable effort and does not generalize effectively. Second, the most effective systems incorporate biological knowledge—either in the form of domain- specific rules or definition of features that are domain-specific (such as specialized 6
lexicons), which are likewise costly to implement (McDonald R et al, 2004; Collier N et al, 2000; Tanabe L et al, 2002). It is critical that humans first establish gold-standard examples before machines can learn from them. To reduce annotation ambiguity and disagreement, it is crucial to define the target biomedical entities explicitly. Currently, most NER systems adopt some version of pre-established conceptual definitions, which annotators may apply with very different standards. We have taken a different approach, investing substantial effort in an iterative annotation process to develop literature-based definitions that draw both conceptual and textual boundaries.

Step 2 of the NER task (normalization) is syntactically easier, since identification of textual boundaries is not necessary. However, it poses significant semantic challenges, because non-unique synonyms must be disambiguated to determine the intended referent. In addition, a comprehensive, thesaurus-like dictionary is necessary in order to match raw entity mentions to their unique identifiers. Classification techniques, rule-based systems, and pattern-matching algorithms have been utilized to address this issue, and some approaches also use contextual information to disambiguate synonyms (Chen L et al, 2005).

Information Extraction

Ideally, BTM systems extract and synthesize "facts" from the literature that combine entity mentions with relationships between and among those mentions as established in the literature. This work requires NER results; that is, relationships between entities can only be extracted once the individual entities have been identified. Although biomedically oriented research in this area is not as advanced as NER, BTM researchers
have recently been increasing their efforts on these challenges.

The most straightforward yet powerful approach is co-occurrence. This approach identifies relationships between biomedical entities based on their co-occurrence in articles, or by considering how close mentions are to each other within a document. The co-occurrence method assumes that if two (or more) entity instances are co-mentioned in a single text record (or a defined subset, such as a sentence or a paragraph), these instances have some type of underlying biological relationship. Because entity instances can coincidentally co-occur, systems commonly use parameters such as the frequency and location of co-occurrence to rank the relationships. If two entity instances are repeatedly co-mentioned in close proximity, they are likely to be related. This approach tends to perform with better recall but at the expense of precision, because it has no intelligent means of distinguishing specific from general relationships. For example, if the information to be extracted is the causal relationship between gene A and disease diagnostic labels, this approach will recognize relationships of any kind between gene A and relevant diseases, including but not limited to direct or causal relationships. To improve precision, some co-occurrence-based IE systems incorporate additional approaches, such as a customized text-categorization system to preferentially identify relevant articles or sentences. Co-occurrence-based IE systems are typically used as exploratory tools for making inferential calls, since they can identify both direct and indirect relationships between entity instances (Jenssen TK et al, 2001; Alako BT et al, 2005).

Another approach takes advantage of natural language processing (NLP)
methodology that combines syntactic and semantic analysis of text. In this approach, individual tokens in text are first identified and assigned part-of-speech labels, a process that has been automated with high accuracy. A nested, tree-like structure is then developed (either top-down or bottom-up) in order to determine the relationships between noun phrases and beyond, such as subject and object. After an NER process is applied to assign semantic labels to specific words and phrases, either rule-based or machine-learning processes can be used to extract relationships between entity mentions. Although most NLP-based IE systems have carried out syntactic parsing and semantic labeling as separate steps, results indicate that better performance can be obtained by integrating the two, due in part to the often complex relationships of biomedical entity mentions. This NLP-based approach can achieve better precision, but lower recall, largely because of the increased challenge of identifying relationships across sentences. These approaches are also labor-intensive, since either expert-defined extraction rules or manually annotated training corpora are required (Rzhetsky A et al, 2004; Daraselia N et al, 2004; Yakushiji A et al, 2001).

Although some research has addressed n-ary relationships among sets of biomedical entities, most IE systems currently classify binary relationships between same-type entities. These systems most commonly focus on entities and relationships that are easier to define, such as protein-protein and gene-protein interactions, protein phosphorylation, other specific relations between genomic entities such as the cellular localization of proteins, or interactions between proteins and chemicals. Few systems have yet been designed for relating phenotypic attributes, such as gene-disease
relationships (Temkin et al, 2003; McDonald R et al, 2005).

High-performance systems that can extract many types of relationships and also distinguish among relationships beyond the sentence level are not yet achievable. This is due largely to three contributing factors. First, biomedical text is complex and highly variable in its structure and presentation. Second, many complicating factors need to be considered, including co-reference (e.g., the use of pronouns), ambiguity in intent, and variability in formulation. Finally, systems need to incorporate various components simultaneously (e.g., tokenizers, POS taggers, NER systems, parsers, disambiguators), each of which contributes some measure of error that combines to significantly degrade the final output (Ding J et al, 2002).

Document Retrieval

DR systems typically identify and rank documents pertaining to a certain topic from a large collection of text. Topics of interest might be derived from user-supplied search terms or from pre-selecting specified types of documents. Most DR systems feature keyword search capabilities; advanced keyword searching allows users to input a combination of search terms and/or to perform advanced functions, such as applying logical operations or imposing limits on terms. Systems then commonly retrieve documents containing or excluding terms that match the search criteria. This method often retrieves irrelevant articles, and relevance-ranking functions are often absent or primitive. More sophisticated DR systems go beyond this by applying distance metrics, such as a vector-space model. With this model, every document is represented as a vector determined by measuring text-based features and/or document metadata, such as a list of frequency-based weighted terms identified in each document.
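The vector-space model just described can be sketched in a few lines of code. The following is a minimal, illustrative implementation, assuming whitespace tokenization and a simple TF-IDF weighting scheme with cosine similarity; it is not the weighting used by PubMed or any particular DR system.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build one sparse vector (a dict of term -> weight) per document,
    weighting each term by frequency * inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # number of documents containing each term
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0
```

A query is vectorized the same way as a document, and documents are relevance-ranked by their cosine similarity to the query vector; comparing two document vectors with the same function yields a document-similarity score.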
The query vector, which is determined by the relative importance of each query term, is then compared to the document vectors to relevance-rank the documents. Comparison between document vectors can also be used to calculate document similarity. PubMed is a well-known DR system that is highly adapted for use as a query interface for MEDLINE. PubMed uses both keyword searching and a vector model (Glenisson P et al, 2003). Advanced DR systems integrate NER or other NLP methods in order to more accurately assess document content and identify documents containing particular biomedical entity mentions. FABLE, MedMiner and Textpresso are examples of systems that make retrieval decisions by extracting and considering knowledge from gene/protein mentions in the documents (FABLE; Tanabe L et al, 1999; Muller HM et al, 2004).

Literature-Based Discovery

An ultimate goal of BTM is to assist with literature-based discovery. LBD can be defined as a process that generates testable novel hypotheses by inferring implicit knowledge from the biomedical literature. An early and often-cited example of LBD arose from a researcher's recognition of complementary facts in two unrelated bodies of biomedical text: one describing Raynaud's disease, in which patients suffer from vasoconstriction, high blood viscosity and platelet aggregability, and the other describing fish oil, whose active ingredient not only causes vasodilation but also lowers blood viscosity and platelet aggregation. This connection was formed entirely through extensive reading of the literature, and the relationship was later confirmed experimentally. The model used in this seminal example was very simple: if A leads to B, and B leads to C, then it is plausible that A could lead to C. Using this closed discovery process (connecting two previously known relations), the same researcher subsequently discovered a novel association between
migraine and magnesium deficiency (also confirmed experimentally), as well as additional successes (Swanson DR 1986; Swanson DR 1988; Swanson DR 1990).

More challenging LBD might arise from an open discovery process, which attempts to derive relationships between entities of interest through implicit relationships in the literature. For example, the process of identifying candidate genes for a certain disease is an open discovery process. One instance of this process would be to first identify gene mentions co-occurring in the literature with mentions of a disease of interest (gene set A), next identify gene mentions co-occurring with known disease genes (gene set B), and then consider the overlap between the two sets as candidate genes for the disease. This approach makes two assumptions: that gene set B is functionally related to the known disease genes, and that gene set A bears some relationship to the disease. One potential problem with this approach is that many types of direct and indirect relationships are identified in such a process, including the high likelihood that a substantial number of false positives are generated. NLP-based IE can certainly help narrow down the relationship types, but further research is needed to improve the performance of such models. More fundamentally, the literature inevitably contains conflicting and inaccurate statements, which are impossible for an automated algorithm to adjudicate (Weeber M et al, 2005).

More reliable inference of novel hypotheses and research directions from the literature is likely to be achieved by integrating BTM results with other data types, including curated data sets and experimental data. Expert curation and experimental evidence provide verification, filtering, and relevance-ranking capabilities for information derived from real biological relationships between entities. For
example, researchers have made novel discoveries by transferring text-mined relationships of a protein to its orthologous proteins based on sequence-similarity searches. Integrating BTM results with functional genomic data, such as microarray data, has helped researchers rank significant genes and develop novel hypotheses based on both experimental data and prior knowledge in a large-scale, automated fashion (Yandell MD et al, 2002; Raychaudhuri S et al, 2002; Glenisson P et al, 2004).

Significance

Along with the rapid expansion of experimental data, the exponential increase in biomedical research text makes it increasingly difficult for researchers to track and utilize the information relevant to their interests, especially in domains outside their core competence. Automated text mining systems can process the unstructured information in the literature into a structured, queryable knowledgebase. This dissertation research developed well-performing automated entity extractors based on refined manual annotation using iteratively developed, literature-based entity definitions in the domain of genomic variation of malignancy. A co-occurrence-based information extraction process was integrated with microarray expression data to identify candidate genes for neuroblastoma research. Both functional pathway analysis and RT-PCR experiments validated the contribution of text mining. This thesis demonstrates that, in addition to enabling systematic curation of textual information, biomedical text mining also has inferential capability, especially when combined with experimental data.
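The co-occurrence-based open discovery process described above can be sketched as follows: genes co-mentioned with a disease form set A, genes co-mentioned with known disease genes form set B, and their intersection yields candidates. This is an illustrative toy in which simple substring matching stands in for real NER output; the sentences and entity names in the test of this sketch are hypothetical examples, not results from this thesis.

```python
from collections import defaultdict
from itertools import combinations

def sentence_cooccurrence(sentences, entities):
    """Count sentence-level co-mentions for each pair of known entities.
    `entities` is a set of already-recognized entity names (NER output);
    naive substring matching is used purely for illustration."""
    pairs = defaultdict(int)
    for sent in sentences:
        found = sorted(e for e in entities if e in sent)
        for a, b in combinations(found, 2):
            pairs[(a, b)] += 1
    return pairs

def candidate_genes(pairs, disease, known_disease_genes, min_count=1):
    """Open discovery: intersect genes co-mentioned with the disease (set A)
    with genes co-mentioned with known disease genes (set B)."""
    def partners(entity):
        return {other
                for (a, b), n in pairs.items() if n >= min_count
                for other in ((b,) if a == entity else (a,) if b == entity else ())}
    set_a = partners(disease)
    set_b = set().union(*(partners(g) for g in known_disease_genes)) if known_disease_genes else set()
    return (set_a & set_b) - set(known_disease_genes) - {disease}
```

The `min_count` threshold is a stand-in for the frequency-based ranking parameters mentioned earlier; a real system would also weight pairs by proximity and filter them with text categorization or NLP-based IE to reduce false positives.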
Introduction to the Thesis

Using the genomics of malignancy as a test bed, this thesis touches upon every aspect of BTM outlined above. The BTM processes developed and employed are discussed in detail in Chapters 2 and 3. This thesis has also established important work regarding information extraction in this domain, which has been applied to research on the pediatric tumor neuroblastoma (Chapters 3 and 4). The integration of BTM-extracted information with expression array analytical results to discover candidate genes for neuroblastoma research is discussed in detail in Chapter 4.
Chapter 2. Defining Biomedical Entities for Named Entity Recognition

Yang Jin
Mark A. Mandel
Peter S. White

Abstract

The performance of machine-learning based named entity recognition is highly dependent upon the quality of the training data, which is commonly generated by manual annotation of biomedical text representative of the target domain. The development of robust definitions of the biomedical entities of interest is crucial for highly accurate recognition but is often neglected by text-mining applications. Because the conceptual and syntactic complexities of biomedical entities often generate ambiguity in assigning text mentions to particular entity classes, entity definitions should exhibit semantic and textual boundaries that are as distinct as possible. We have created a highly generalizable process for developing entity definitions that specify both conceptual limits and detailed textual ranges for target biomedical entities. This process utilizes representative text and manual annotators to initially define and then iteratively refine definitions. The process was tested within the knowledge domain of genomic variation of malignancy. This work describes in detail the different types of challenges faced and the corresponding solutions devised during the definition process. The resulting entity definitions were used to annotate a training corpus for the development of automated entity extraction algorithms and for use by the research community. We conclude that manual annotation consistency is important for the success of subsequent biomedical text mining tasks, and that explicit, boundary-defined entity definitions can assist in achieving this goal.
1. Introduction

Automated information extraction techniques can assist in the acquisition, management and curation of data. A necessary first step is the ability to automatically recognize biomedical entities in text, also known as named entity recognition (NER). Development of named entity extractors for biomedical literature has progressed rapidly in recent years. For example, a number of machine-learning algorithms currently exist for identifying gene name instances in text (Collier N et al, 2000; Tanabe L et al, 2002; GENIA; Hanisch D et al, 2005). However, a major shortcoming of many approaches is that they minimize the effort spent defining biomedical entities in an explicit fashion. Rather, the tendency is to bypass this step by adapting or refining existing semantic standards as the target entities' conceptual definitions, leaving interpretive details to the manual annotators. Additionally, existing standards often provide little or none of the semantic depth required to establish concept boundaries with enough rigidity to support highly accurate extraction. This tends to create significant consistency problems in later steps, when training automated extractors and when utilizing the extracted entity mentions for particular applications, because non-literature-based conceptual definitions often generate substantial annotation ambiguity due to the semantic and syntactic complexities of biomedical entities in the literature. As a result, automated systems derived from such standards tend to perform more poorly. For biologists in particular, high precision is a necessary prerequisite for widespread acceptance of automated tools, in order to establish a level of reliability acceptable to users.

Believing strongly in the importance of establishing well-defined, literature-based entity definitions with clear boundaries specifically designed for biomedical NER practice,
the Biomedical Information Extraction Group at the University of Pennsylvania (Penn BioIE) has developed an iterative annotation process designed to establish a set of "precise" entity definitions. These definitions are meant to clarify conceptual boundaries both semantically and syntactically, while also striking a balance among the requirements of researchers, annotators, and computational scientists. This paper first describes the annotation process developed by the Penn BioIE group, and then introduces the necessity and challenges of defining biomedical entities, with specific examples from the literature.

2. Overview of manual annotation process and entity classification

Figure 2-1. The processes of developing entity definitions and extractors

Figure 2-1 demonstrates the iterative process developed for establishing and refining entity definitions, first through manual annotation and then in developing extractors based on the manually annotated training data. The process begins with the creation of an initial definition that establishes the general concept and scope of an entity
class, which is supplied by one or more domain experts. Commonly, existing standards and resources are explored and, if deemed suitable, adopted as nuclei for the process. Subsequently, the domain expert(s) adjudicate definition discrepancies. Manual annotators are then trained with the initial versions of the entity definitions, from which they manually annotate the selected training corpora. Invariably, as the annotators encounter the wide diversity of semantic representations of specific concepts, a need for iterative refinement of the entity definitions emerges. Often, encounters with text require major revisions or even restructuring of definitions to accommodate such heterogeneity. Accordingly, definitions are continually refined during the analysis of annotated texts and annotation disambiguation. The Penn BioIE group established frequent communication forums in which emerging definitions and identified exceptions were fully discussed among annotators and researchers. Communication modalities included weekly face-to-face meetings, email lists, and live chat. After annotation was completed, entity extractors were developed by implementing machine-learning algorithms utilizing probability models (we used Conditional Random Fields); the manually annotated texts were utilized as both training and testing data for these algorithms. Comparison of the annotations produced by the automatic extractors and the human annotators allows for evaluation of extractor performance.

The target knowledge domain we chose was "Genomic Variation of Malignancy", conceptualized as a relationship among three entity classes: Gene, Variation and Malignancy. As shown in Figure 2-2, the Gene and Variation entities comprise the genomic components of cancer, while the Malignancy entity covers the phenotypic aspects of malignancy, including malignancy diagnostic labels and a number of malignancy
phenotypic attributes.

Figure 2-2. Entity classification scheme for the domain of genomic variation of malignancy

A total of 1442 MEDLINE abstracts were selected for exploration and annotation in this study; one subset contained many different malignancy types to establish breadth, and a second subset mentioned only one major malignancy (neuroblastoma) to establish depth. As diagrammed in Figure 2-1, the manual annotation process was first applied to the corpus with an electronic annotation tool, WordFreak (http://sourceforge.net/projects/wordfreak). After the entity definitions were refined and stabilized, the manually annotated data were then used to develop entity and attribute extractors (McDonald RT et al, 2004; Jin Y et al, 2006). These automated extractors
performed with state-of-the-art accuracy, in part due to the careful design and management of our annotation process. In the following paragraphs, we discuss the challenges we encountered during the manual annotation process, and why we believe that consistent entity definitions are critical for the success of later steps in biomedical text mining.

3. The challenges of defining biomedical entities

Although we began this task believing we had clear ideas of what information each entity should cover, it quickly proved challenging to develop detailed working definitions. Our a priori notion of entity definition adequacy was that definitions establish distinct and defensible boundaries both conceptually and textually, thereby providing guidance to the annotators both semantically and syntactically. Solid entity definitions are an essential foundation for the subsequent steps of developing machine-learning algorithms and utilizing the extracted information for specific applications. First, the performance of entity extractors is highly dependent not only on the selection of the underlying algorithms, but also on the quality of the training data, which is entirely based on the entity definitions. If the annotators cannot identify specific entity mentions consistently on the basis of the definitions, it is hard to imagine that automated extractors can replicate the task reliably. More importantly, without clear definitions, researchers will certainly run into problems when trying to utilize the extracted mentions, since it will be difficult to know the precise boundaries of the gathered information.

As mentioned earlier, we initially defined three major entities in the knowledge domain of genomic variation of malignancy, based on existing ontological categories and concepts. However, we quickly found that ontology-based definitions often do not
precisely reflect what has been conceptualized throughout the biomedical texts contributed by researchers worldwide. For example, a gene as defined by the NCI thesaurus is: "A functional unit of heredity which occupies a specific position (locus) on a particular chromosome, is capable of reproducing itself exactly at each cell division, and directs the formation of a protein or other product." If annotators used this definition for identifying gene mentions in text, they could quickly be confused by many situations, such as whether promoters should be included, how gene family names should be treated, and how to handle pronoun referents to genes. Thus, we found the need to invoke text-based working entity definitions, which were most effectively determined as annotators proceeded with the entity recognition task in the training corpus. Every new mention of an entity, and every new context for a mention, provided a test of the pre-developed entity definition. If a definition could not explicitly lead the annotators to a "correct", or at least consistent, decision in each case, the problematic mention required further examination, interpretation, and possibly refinement of the definition. Through such an iterative process, we were able to develop fine-tuned entity definitions that provided distinct boundaries for both semantic scope and contextual range.

The challenges that we encountered in refining our definitions can be grouped into four categories: conceptual, syntactic, syntactic/semantic ambiguity, and inter-annotator agreement. In the following paragraphs we illustrate these types and give examples of our devised solutions and their limits.

3.1 Conceptual definition challenges

As discussed earlier, an entity definition has to clarify both conceptual and textual boundaries. Initial versions of our definitions were completely conceptual, based on our
understanding of biomedical categories. Surprisingly, more than half of the annotators' difficulties with definitions fell into this category during the annotation process, and most of them were reasonable, as the following paragraphs describing the four most common challenges in this category illustrate. This reflects the semantic complexity and diversity of biomedical entities, which often cannot be defined without some ambiguity.

3.1.1 Sub-classification of entities

Based on the classification scheme stated above, our target knowledge domain was initially divided into three major conceptual classes: gene, genomic variation, and malignancy. However, this broad conceptual classification was far from sufficient for the generation of highly accurate extractors. For example, according to the conceptual definition, the malignancy concept covers all phenotypic information of cancer, including a tumor's diagnostic type, its anatomical location and cellular composition, and its differentiation status. Each of these types of information is presented in a variable and often bewildering array of syntactic and contextual patterns, which increases entropy and thus erodes the ability of machine-learning approaches to classify mentions. If we instead further classified the mentions into sub-categories such as those described above and annotated them as such, entropy would be reduced and extractor performance could be expected to improve. However, a major disadvantage of this approach is that sub-categorization introduces considerable additional annotation effort. Thus, the annotation process first requires the establishment of a level of entity granularity that balances the cost of manual annotation with the application value of the extracted data.

There are countless ways to further divide entities into their underlying
components. For our purposes, we decided to let the level of granularity be generated by the annotation process. By beginning with broad classes and subdividing them as needed, we expected to eventually approach an optimal balance between effort and effectiveness. We considered it critical to determine how the text strings represented subcategories in the real world of biomedical literature. We therefore divided our annotation efforts into two stages, data gathering and data classification, as demonstrated in Figure 2-3 with a genomic variation entity example.

Figure 2-3. The text-based two-stage entity sub-classification process

In the example illustrated by Figure 2-3, annotation of our initial concept of "Genomic Variation" proceeded through a preliminary stage of annotation, which we named "Data Gathering", before it was divided into sub-categories. In this stage, all textual mentions falling within or partially within our initial concept definition were annotated regardless of syntax. When sufficient information was gathered, sub-categories were
defined based on their semantic and syntactic representations. In addition, by proceeding with this exercise, the annotators became familiar with the concepts, definitions, and emerging challenges of the tasks. By employing this method, the sub-classification scheme began to approximate how the concepts were actually presented in the text.

3.1.2 Levels of specificity

Textual entity mentions referring to the same semantic type can range from very general to quite specific, and not all levels of detail may be appropriate for a particular project. A gene mention may refer to a specific gene instance in a single cell of a sample, or to the wild type or a specific variation of the gene; or it may refer to gene families, superfamilies and generalized classes, which represent classes of genes. For instance, "MAPK10" or "mitogen-activated protein kinase 10" is a member of the "MAPK" family, which itself belongs to a higher-level family, "protein kinase". We made the decision to include all levels of information for the gene entity except the most general level, such as "gene" itself. That is, in the above example, all three levels of gene mentions are legitimate and should be annotated as such.

The decision was based on several considerations. First, gene class information is valuable to extract in later steps; although we do not know which specific gene a class mention refers to, it does help narrow the reference to a class of genes. Second, if we only included mentions describing genes at the instance level (the level that can lead to a specific genomic element), we would have to draw a line between gene classes and instances. Because textual mentions of gene classes and instances are sometimes interchangeable (researchers tend to use gene class names to refer to gene instances and vice versa), it would be quite difficult for automated extractors to distinguish
between the two. Finally, we excluded gene mentions at the most general level, which contain no information content or application value. In other words, all information-containing levels of mentions are included.

3.1.3 Conceptual overlaps between entities

An ideal entity classification scheme should result in independent information categories without any conceptual overlap. Unfortunately, the subjective and adaptive nature of biological objects makes this ideal difficult to achieve, especially when defining two different but related entities. Even a basic concept such as "organism" is difficult to define when considering entities such as viruses and viroids, self-replicating machines with attributes necessary but not necessarily sufficient to qualify as life forms. Because our gene and genomic variation concepts both fall within the genomic domain and are closely associated, we were very careful to make a clear distinction. Eventually, our gene entity evolved to encompass solely the names of genes and their downstream products (i.e., RNAs and proteins), while the genomic variation entity covered specific descriptions of genomic element variations.

Although our definitions of gene and genomic variation eventually established a reasonable boundary between them, for other entities we found it sometimes impossible to avoid conceptual overlap. We encountered such problems when trying to make a clear division between the entity classes symptom and disease. The symptom entity was designed to capture subjective or objective evidence of disease, such as headache, diarrhea or hyperglycemia, while the disease entity captured specific pathological processes with a characteristic set of symptoms, such as Long QT Syndrome or lung cancer. As in most cases, the distinction often appears clear to domain experts unless
considerable scrutiny is applied, as it seems to be simple common sense that these concepts represent two distinct and non-overlapping sets of information. However, when presented with the broad contextual variation in use and, often, in semantic intent, it actually becomes quite difficult to draw a clear boundary between the two. We quickly found that many terms can be considered both symptoms and diseases, depending upon intent and the level of domain knowledge available. For example, "arrhythmia" is itself a disease entity mention, representing a pathological process, but it is also commonly used as evidence (a symptom) of another disease, such as Long QT Syndrome. We certainly do not want two entity types that heavily overlap with each other, since that would make the classification pointless. This was not the case for the symptom and disease entity types, whose overlapping mentions constituted less than approximately 10% of the total. Most conceptually overlapping mentions cannot be assigned to either category without reading the text. We left it to the annotators to determine authors' intent based on the context, and over time they became quite good at minimizing disagreement.

3.1.4 Domain-specific clarification

Because biological entities tend to be conceptually subjective, we often found it quite challenging and labor-intensive to establish consistent conceptual boundaries. The process of defining the gene entity illustrates this challenge well. Initially, we considered defining a "gene" to be straightforward, as this concept is considered by biologists to be a rather discrete object. The HUGO Gene Nomenclature Committee (HUGO), the nomenclature body tasked with establishing official names for human genes, defines a gene as "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence,
transcription or homology". Accordingly, our gene entity was initially defined as a nominal reference to a gene or its downstream product in biomedical text. However, as annotation moved forward, annotators raised more and more questions, forcing us to make difficult boundary determinations, as illustrated below.

An example of biological complexity is the many ways in which a gene can contribute to phenotype. Typically, genes functionally impact biological processes through their downstream products, proteins. However, there are DNA segments in the genome that affect phenotype by regulating how genes are expressed in particular biological contexts. Promoter and enhancer regions, which are distinct segments of DNA (often far) removed from the DNA segment that directly contributes to an RNA and/or protein product, are such examples. These elements control whether and when the gene itself is expressed. Although biologists disagree about whether promoters should be considered genes or components of particular genes, annotators were required to make a decision on the gene entity boundary limits. In this case, we considered our application domain to be the most important determinant, as the main focus of our gene entity was to capture those “traditional genes” that could be directly and consistently associated with a protein. Thus, we limited our scope of genes to include only what we considered to be biologically functional DNA segments that are translated into protein products.

There were many more cases that required further clarification of the gene entity's conceptual definition, such as how to deal with segments and multiplexes of genes/RNAs/proteins. We realized that consistency was more valuable than trying to establish universal truth; we considered the former to be the key to developing well-performing automated extractors and to increasing the application value of extracted
mentions.

3.2 Syntactic definition challenges

Even with precise conceptual definitions, we found that guidelines needed to be made regarding the textual boundaries of entity mentions. Although many of these concerned syntactic nuances, they were nonetheless a nontrivial source of annotator disagreement. In order to build consistent automated extractors, we determined that detailed annotation guidelines were required to make manual annotations consistent between different annotators. We designed our guidelines to be practical and based on actual contexts, specifying to the annotators exactly what to do in any uncertain circumstances that we had encountered.

3.2.1 Associating a text string to an entity mention

There are many different ways to associate a text string with an entity mention in biomedical literature. In order to harvest consistent training data for developing high-performing automated extractors, we needed to define a series of rules specifying how to select text strings in the literature as legitimate entity mentions. We allowed entity references to include more than one word, including punctuation, but not to cross sentence boundaries. Although the majority of entity mentions were nouns, not all of them were. For some entity mentions, such as variation type, other part-of-speech forms were not uncommon. For example, genomic variation types that would likely be normalized to the forms “insertion”, “deletion”, or “translocation” were usually expressed as verbs: “inserted”, “deleted”, or “translocated”. Moreover, malignancy attribute mentions were nearly always adjectives, such as “well-
differentiated”, “hereditary”, and “malignant”.

All modifiers in a noun phrase mention were included as part of the mention, both because modifiers can provide very useful information to be extracted and because some modifiers are indispensable parts of the standard terms. We observed that this decision made it easier for both manual annotators and machine-learning extractors to operate, since it was difficult to define boundaries on which modifiers to include in noun phrases. However, modifiers were not included for other part-of-speech phrases, in order not to complicate the issue. For example, in the noun phrase malignancy type mention “malignant squamous cell carcinoma”, both “malignant” and “squamous cell” are modifiers of “carcinoma”, and both provide very useful information. “Squamous cell carcinoma” is also a commonly employed name for a type of cancer. Our experience showed that it was difficult for annotators, and impossible for automatic extractors, to draw consistent boundaries regarding which modifiers should be included as part of legitimate mentions.

Lastly, we found it necessary to make entity-specific rules for some biological entities. For example, gene entity mentions commonly appeared in the text as “The mycn gene…”, necessitating a decision as to whether the article “The” and the noun “gene” should be included as part of the entity mention. We reasoned that the decision should depend on how the extracted information was to be further processed and utilized. Accordingly, we decided to include neither word, since all extracted gene mentions were to be subsequently mapped and normalized to official gene symbols.

3.2.2 Co-reference issue

Often a single entity is referred to in different ways in the same text, a situation
  • 30. known as co-reference. Besides its standardized form, an entity instance can also be referred to by aliases, acronyms, descriptions or pronoun references. For example, the mycn gene has at least 10 aliases in the literature, including “n-myc”, “oded”, and “v-myc avian myelocytomatosis viral related oncogene, neuroblastoma derived”. Moreover, researchers commonly engineer their own acronyms as self-convenient but non-standard and often unique aliases. Co-reference is generally recognized as a challenging task for entity recognition and information extraction. To deal with this issue in manual annotation, we have classified this problem into the following four categories and made corresponding decisions for each of them. A. Extended form vs. acronym Regular expression: ___ ___ ___ (___) Examples: • …mitogen-activated protein kinase (MAPK)…-- gene entity mention • …squamous cell carcinoma (SCC)… -- malignancy type entity mention Our decision: Tag both the extended form and abbreviated form of the entity mention. For the above examples, “MAPK” is co-referential with “mitogen-activated protein kinase”, and “SCC” is co-referential with “squamous cell carcinoma”. Both extended forms and acronyms would be tagged as corresponding entity instances in our system. Our rationale: Both forms are interchangeable descriptions of entity mentions, and they should be treated equally. B. Alias description Regular expression: …Y…X… or …Y (X)… Examples: 30
  • 31. TrkA (NTRK1)… • The N-myc gene, or MYCN… Our decision: NTRK1 and MYCN are official name designations of the TrkA and N-myc genes, and here they are being co-referenced accordingly. We decided to tag all different expression forms of the entity instances, including standard/official nomenclatures, aliases or descriptions. Like acronyms and their extended forms, these various names are also tagged individually: in the first example, we tagged “TrkA” and “NTRK1” separately and without the parentheses, not the combined string “TrkA (NTRK1)”. Our rationale: Researchers often use unofficial nomenclatures for entity mentions, so we can’t just annotate standard descriptions. However, they should be normalized later. C. General vs. specific Regular expression: X, a (the) Y… Examples: • C-Kit, a tyrosine kinase which plays an important role, … • K-Ras is an oncogene. The Ras gene… Our decision: In the examples above, the gene family name “Ras” and the superfamily name “tyrosine kinase” are used to co-refer to the gene family instances “K-Ras” and “C- Kit”. In such situations, our annotation guideline treated the general terms and more specific terms completely independently, regardless of the co-referential relationship between them. That is, depending on the conceptual definition, if the term was a legitimate mention, it was tagged as an entity mention no matter what levels of specificity it had. For those examples, since the gene entity definition included both gene instances and family names, all four terms were tagged as gene entity mentions. We did not, 31
however, tag “oncogene”, nor did we extend the tag on “Ras” to include the following word “gene”. These words, at the highest level of generality, convey no taggable information.

Our rationale: Based on our decision to tag all information-containing levels of mentions, and specifically for the examples listed, all gene instances, gene families and superfamilies were determined to be legitimate mentions.

D. Pronoun reference

Regular expression: …X…PRONOUN (It, This, etc.)…
Examples:
• K-Ras is an oncogene. It is mutated in…
• Five point mutations were found in the MYC gene, and they were next to each other.

Our decision: In the two examples, “It” is co-referential with “K-Ras”, and “they” is co-referential with “point mutations”. We generally did not annotate pronouns, although they may refer to legitimate entity mentions.

Our rationale: Pronoun co-reference is a challenging problem in text mining research, as it involves cross-sentence, whole-record relation extraction. Without deeper parsing of the text, there is no value in extracting the pronoun itself.

3.2.3 Structural overlap between entity mentions

Entities can overlap not only conceptually, but also literally, with their textual mentions in the literature. Annotation guidelines were developed for the following
situations:

A. Entity within entity – tag within tag

This refers to the situation in which one entity mention is completely included in the textual range of another. As the two intertwined entity mentions could belong to either the same or different entities, we divided this category of problem into two sub-categories.

If the two mentions were of the same entity, only the subsuming entity mention was tagged. For example, in “mitogen-activated protein kinase kinase kinase”, there exist 7 distinct gene entity mentions: mitogen-activated protein; mitogen-activated protein kinase; mitogen-activated protein kinase kinase; mitogen-activated protein kinase kinase kinase; and three mentions of “kinase”. While this type of situation was a source of confusion among new annotators, we considered it both unnecessary and costly to tag all possible mention permutations. As the mention with the largest range was always the one being discussed, only the outermost mention was tagged as a gene mention. In fact, this situation led to the adoption of a more general guiding principle: the annotation should reflect the author's intent whenever possible (although exceptions were encountered, such as poorly written abstracts where the intent evident from the context occasionally and obviously differed from the actual word or phrase used).

If two completely overlapping mentions instead belonged to different entity types, we annotated both. These mentions were usually related, and both often provided valuable information. Some entities, such as malignancy attributes, often appeared as part of another entity mention. For instance, “colon cancer” is a malignancy type mention, and “colon” is a malignancy site mention. “Hirschsprung disease 1” is another example, in which
“Hirschsprung disease” is a disease mention while the whole phrase is a gene mention.

B. Entity co-identity – double tagging

This category represents the situation in which two entity mentions share the exact same text. We annotated the same text twice with the two corresponding labels under such circumstances. For example, in the phrase “deletion of the K-ras gene”, “K-ras” was tagged as both a gene entity mention and a variation-location mention.

C. Discontinuous mentions – chaining

Sometimes mentions of several entities of the same type shared a common substring. When written together in the text, the common part occurred only once, attached to the first or last mention, and the other mentions were represented only by their differing parts. For example, in the text “H-, K-, and N-ras…”, there are really three gene mentions: “H-ras”, “K-ras” and “N-ras”, but a limitation of our annotation software prevented tagging of discontinuous mentions as one parent mention (in the example above, only “N-ras” could be tagged). For the other two discontinuous mentions, we developed a chaining procedure through which annotators were able to link the component parts (“H-” and “K-” with “ras”) by inserting comments into the annotation in a standard format. Chaining was strictly limited to within one sentence in order not to complicate subsequent syntactic parsing of sentences. By the same logic, entity mentions were not allowed to cross sentence boundaries.

3.3 Syntactic vs. semantic – ambiguity challenges

We considered ambiguity in mentions to be the most common and difficult challenge in our annotation experience, as it truly reflects the limitations of human-written text in fully communicating author intent. In biomedical text, we found it not
uncommon that an identical text string could represent completely different concepts, and the frequency of ambiguity appeared to be much higher than in non-biological text. In the following paragraphs, we use mainly gene entity examples to illustrate the elusive nature of this problem.

We found ambiguity to occur both within and outside gene entities. Genes have a tradition of being independently named, with poor adherence to, or awareness of, standards. Researchers have tended to make up new acronyms for gene names, with the result that there are more gene names than available short combinations of letters and numbers for symbols and aliases. Thus, many aliases are similar purely by chance. Since each gene has multiple non-unique aliases alongside one unique gene symbol, a very serious internal ambiguity problem exists among the aliases. By our calculation, for human genes alone, as many as 3% of genes share an alias with another gene, and the numbers are even higher when other species are included. Moreover, many species have parallel gene-naming traditions, especially mouse and human (Chen L et al, 2005). For example, p90 is a common alias shared by the distinct gene symbols CANX and TFRC. As a protein naming convention, p90 simply denotes a protein with a molecular weight of 90 kDa, so it is not surprising that two proteins carry the same name. When such gene mentions appear in the literature, (often quite distant) context is the only way to clarify which gene is under discussion, although sometimes it offers no assistance.

Another type of within-gene-entity ambiguity that we recognized was the frequent apparent inability to distinguish a gene from its downstream products based purely on the text string of the mention. Although initially our gene entity was designed
to capture only the nomenclatures of functional genomic elements, we soon discovered that researchers frequently used the same referents to represent a gene and also its RNA and protein products in the literature. Without looking at the context, a gene mention such as “mycn” had almost an equal probability of referring to a gene or to its downstream product, and both the gene and its mRNA were referred to as being “expressed” to create an mRNA or a protein product, respectively. In addition, authors also tended to obscure the conceptual boundaries between a gene and its downstream products. For example, while a given protein X performs biological functions, we found it common that the corresponding gene X was described as performing this action. It became apparent that while researchers were personally clear regarding these distinctions, their descriptions did not adequately convey them. In fact, in several cases, we found it impossible to determine whether certain gene mentions referred to a gene or to its RNA or protein products even when considering the entire article. This overwhelming ambiguity problem finally prompted our decision to include genes' downstream products when annotating gene entity mentions. Ultimately, we created a single gene entity class but also included labels for partially subdividing its mentions, recognizing that mentions could not always be perfectly divided into the three subclasses. If it was not clear in the text whether a mention referred to a gene or a protein, the mention was annotated as “gene.generic”, as opposed to “gene.gene/RNA” or “gene.protein”.

Besides the challenges mentioned above, it was common to encounter gene entity mentions that were easily confused with objects belonging to other entity types. This is because genes have been named by a wide variety of methods, from the use of lay language to the invention of specialized and often clever acronyms. For example, “Cat”
is an official gene symbol for the gene catalase, while it could also refer to a kind of animal. “NB” is the acronym of the well-known pediatric cancer neuroblastoma, but it is also an official name of a gene locus putatively located on chromosome 1p36. This cross-entity ambiguity problem was also commonly seen for other entity classes, such as variation type. As an example, “insertion” and “deletion” are well-defined variation type mentions, but they are also frequently used to denote biological or clinical actions. Regardless of the type of ambiguity problem, the task for our manual annotators was to make their best call in identifying the intended reference of the text strings and to annotate them as such. Sometimes annotators needed to take the entire abstract or, rarely, the entire article into consideration in order to determine what particular mentions truly represented. Depending on the nature of the biomedical entities and how representative the training data was, the subsequent automatic extractors were able to disambiguate problematic text strings to a certain degree by taking local contextual features into account.

3.4 Annotator perceptions

Even if perfect entity definitions and annotation guidelines could somehow be created, there would still be variation among human annotators in understanding and applying them during the annotation process, and we certainly encountered lively discussion regarding some topics. Usually, manual annotation is done by multiple annotators in order to get more files done within a shorter period of time, but the downside is that this introduces more inconsistencies between annotators. Even with only one annotator, there will be variability in the application of guidelines.

We took two approaches to deal with this problem. First, annotators were told to
discuss anything unclear, and we promoted frequent discussion to determine a consistent path. Second, a dual, sequential-pass manual annotation process was developed and applied to better adjudicate different annotators' work and to produce training data that was as consistent as possible. During this process, every document was annotated de novo by one annotator and then checked by a second, more experienced and consistent annotator, who was charged with identifying and revising any first-pass annotations considered to be incorrect. Edited items were then subject to review by the group, and senior annotators used this editing process as an opportunity to educate less experienced annotators when repeated error patterns were identified.

3.5 Publication-based errors

Typographical and grammatical errors, though infrequent, are inevitable, and some were observed in entity mentions during our process. Due to copyright considerations, we were not authorized to change the text in such cases; instead, we skipped tagging the affected mentions and added explanatory comments.

4. Application

As a result of the generation and application of these carefully refined entity definitions and annotation guidelines, 1,442 MEDLINE abstracts were manually annotated. Of these, 1,157 files have been made publicly available (release 0.9, BioIE web site). Since the release, the data has been widely used by the biomedical text mining community for a variety of purposes, including entity recognition and normalization, and usage is likely to increase (Cohen KB et al, 2005).

Because of the consistency of the training data across the corpus, the developed entity and attribute extractors perform with high precision and recall rates. Table 2-1
indicates the performance of three entity extractors built with this data (McDonald RT et al, 2004; Jin Y et al, 2006).

Entity             Precision   Recall   F-measure
Gene               0.864       0.787    0.824
Variation Type     0.8556      0.7990   0.8263
Location           0.8695      0.7722   0.8180
State-Initial      0.8430      0.8286   0.8357
State-Sub          0.8035      0.7809   0.7920
Overall            0.8541      0.7870   0.8192
Malignancy type    0.8456      0.8218   0.8335

Table 2-1: Entity extractor performance on evaluation data

5. Conclusion

Manual annotation is an indispensable step in creating training data for developing machine-learning automated extractors. In order to generate extractors that perform with accuracies high enough to be acceptable to the biomedical research community, consistently annotated training data is a prerequisite. Although we did not formally prove it, our experience has been that investment in developing literature-based entity definitions and annotation guidelines yields far better extracted information with distinct conceptual boundaries, which in turn increases the opportunity for practical application. We concluded that, rather than trying to construct unifying definitions that maximize acceptance and minimize contention amongst domain experts, a consistent and defensible definition was preferable when making decisions to specify entity boundaries and scope. More important for us was to consider how the extracted information would be used and, once that was determined, how to maintain consistency throughout the training corpus.
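As an aside, the F-measures reported in Table 2-1 are simply the harmonic mean of the corresponding precision and recall values. The following minimal sketch (illustrative only; the function name is ours and this is not part of the original evaluation pipeline) reproduces the Gene row from its precision and recall:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reproducing the Gene row of Table 2-1 from its precision (0.864) and recall (0.787):
gene_f = f_measure(0.864, 0.787)
print(round(gene_f, 3))  # → 0.824, matching the reported F-measure
```

The same check holds for the other rows (e.g., Variation Type: 2 × 0.8556 × 0.7990 / (0.8556 + 0.7990) ≈ 0.8263).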
References

Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics, 21: 248-256. (2005).

Cohen KB, Fox L, Ogren PV, Hunter L: Corpus design for biomedical natural language processing. Proceedings of the ACL-ISMB workshop on linking biological literature, ontologies and databases, pp. 38-45. Association for Computational Linguistics. (2005).

Collier N, Nobata C, Tsujii J: Extracting the names of genes and gene products with a hidden Markov model. In Proceedings of the 18th International Conference on Computational Linguistics, Saarbrucken, Germany. (2000).

GENIA: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ (2004).

Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J: ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics, 6: S14. (2005).

Jin Y, McDonald RT, Lerman K, Mandel MA, Carroll S, Liberman MY, Pereira FC, Winters RS, White PS: Automated recognition of malignancy mentions in biomedical literature. BMC Bioinformatics, 7: 492. (2006).

McDonald RT, Winters RS, Mandel M, Jin Y, White PS, Pereira F: An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics 22(20):
  • 41. 3249-3251. (2004). Penn BioIE: http://bioie.ldc.upenn.edu/index.jsp Tanabe L, Wilbur W: Tagging gene and protein names in biomedical text, Bioinformatics, 18:1124-1132. (2002). 41
Chapter 3. Automated Recognition of Malignancy Mentions in Biomedical Literature

Yang Jin, Ryan T. McDonald, Kevin Lerman, Mark A. Mandel, Steven Carroll, Mark Y. Liberman, Fernando C. N. Pereira, R. Scott Winters, Peter S. White

Published: BMC Bioinformatics, 7:492, 2006

Abstract

Background: The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Furthermore, promise has been tempered by an inability to efficiently scale approaches in ways that minimize manual efforts and still perform with high accuracy. Here, we have applied a machine-learning approach previously successful for identifying molecular entities to a disease concept to determine if the underlying probabilistic model effectively generalizes to unrelated concepts with minimal manual intervention for model retraining.
Results: We developed a named entity recognizer (MTag), an entity tagger for recognizing clinical descriptions of malignancy presented in text. The application uses the machine-learning technique Conditional Random Fields with additional domain-specific features. MTag was trained with 1,010 documents and evaluated with 432 documents, all pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83 recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using string matching of text with a neoplasm term list, MTag performed with a much higher recall rate (92.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns. Application of MTag to all MEDLINE abstracts yielded the identification of 580,002 unique and 9,153,340 overall mentions of malignancy. Significantly, the addition of an extensive lexicon of malignancy mentions as a feature set for extraction had minimal impact on performance.

Conclusions: Together, these results suggest that the identification of disparate biomedical entity classes in free text may be achievable with high accuracy and only moderate additional effort for each new application domain.

Background

The biomedical literature collectively represents the acknowledged historical perception of biological and medical concepts, including findings pertaining to disease-related research. However, the rapid proliferation of this information makes it increasingly difficult for researchers and clinicians to peruse, query, and synthesize it for biomedical knowledge gain. Automated information extraction methods, which have recently been increasingly concentrated upon biomedical text, can assist in the acquisition and management of this data. Although text mining applications have been successful in
  • 44. other domains and show promise for biomedical information extraction, issues of scalability impose significant impediments to broad use in biomedicine. Particular challenges for text mining include the requirement for highly specified extractors in order to generate accuracies sufficient for users; considerable effort by highly trained computer scientists with substantial input by biomedical domain experts to develop extractors; and a significant body of manually annotated text—with comparable effort in generating annotated corpora—for training machine-learning extractors. In addition, the high number and wide diversity of biomedical entity types, along with the high complexity of biomedical literature, makes auto-annotation of multiple biomedical entity classes a difficult and labor-intensive task. Most biomedical text mining efforts to date have focused upon molecular object (entity) classes, especially the identification of gene and protein names. Automated extractors for these tasks have improved considerably in the last few years [1-13]. We recently extended this focus to include genomic variations [14]. Although there have been efforts to apply automated entity recognition to the identification of phenotypic and disease objects [15-17], these systems are broadly focused and often do not perform as well as those utilizing more recently-evolved machine-learning techniques for such tasks as gene/protein name recognition. Recently, Skounakis and colleagues have applied a machine-learning algorithm to extract gene-disorder relations [18], while van Driel and co-workers have made attempts to extract phenotypic attributes from Online Mendelian Inheritance in Man [19]. However, more extensive work on medical entity class recognition is necessary because it is an important prerequisite for utilizing text information to link molecular and phenotypic observations, thus improving the 44
association between laboratory research and clinical applications described in the literature.

In the current work, we explore scalability issues relating to entity extractor generality and development time, and also determine the feasibility of efficiently capturing disease descriptions. We first describe an algorithm for automatically recognizing a specific disease entity class: malignant disease labels. This algorithm, MTag, is based upon the probability model Conditional Random Fields (CRFs), which has been shown to perform with state-of-the-art accuracy for entity extraction tasks [5, 14]. CRF extractors consider a large number of syntactic and semantic features of the text surrounding each putative mention [20, 21]. MTag was trained and evaluated on MEDLINE abstracts and compared with a baseline vocabulary matching method. An MTag output format that provides HTML-visualized markup of malignant mentions was developed. Finally, we applied MTag to the entire collection of MEDLINE abstracts to generate an annotated corpus and an extensive vocabulary of malignancy mentions.

Results

MTag performance

Manually annotated text from a corpus of 1,442 MEDLINE abstracts was used to train and evaluate MTag. Abstracts were derived from a random sampling of two domains: articles pertaining to the pediatric tumor neuroblastoma and articles describing genomic alterations in a wide variety of malignancies. Two separate training experiments were performed, either with or without the inclusion of malignancy-specific features, namely a lexicon of malignancy mentions and a list of indicative suffixes. In each case, MTag was trained with the same randomly selected 1,010 training
  • 46. documents and then evaluated with a separate set of 432 documents pertaining to cancer genomics. The extractor took approximately 6 hours to train on a 733 MHz PowerPC G4 with 1 GB SDRAM. Once trained, MTag can annotate a new abstract in a matter of seconds. For evaluation purposes, manual annotations were treated as gold-standard files (assuming 100% annotation accuracy). We first evaluated the MTag model with all biological feature sets included. Our experiments resulted in 0.846 precision, 0.831 recall, and 0.838 F-measure on the evaluation set. Additionally, the two subset corpora (neuroblastoma-specific and genome-specific) were tested separately. As expected, the extractor performed with higher accuracy with the more narrowly defined corpus (neuroblastoma) than with the corpus more representative for various malignancies (genome-specific). The neuroblastoma corpus performed with 0.88 precision, 0.87 recall, and 0.88 F-measure, while the genome-specific corpus performed with 0.77 precision, 0.69 recall, and 0.73 F-measure. These results likely reflect the increased challenge of identifying mentions of malignancy in a document set demonstrating a more diverse collection of mentions. To determine the impact of the biological feature sets we included to provide domain specificity, we excluded these feature sets to create a generic MTag. This extractor was then trained and evaluated using the identical set of files used to train the biological MTag version. Somewhat surprisingly, the extractor performed with similar accuracy with the generic model, resulting in 0.851 precision, 0.818 recall, and 0.834 F-measure on the evaluation set. These results suggested that at least for this class of entities, the extractor performs the task of identifying malignancy mentions efficiently without the 46
use of a specialized lexicon.

Extraction versus string matching

We next determined the performance of MTag relative to a baseline system that could be easily employed. For the baseline system, the NCI neoplasm ontology, a term list of 5,555 malignancies, was used as a lexicon to identify malignancy mentions [22]. Lexicon terms were individually queried against the text by case-insensitive exact string matching. A subset of 39 abstracts randomly selected from the testing set, which together contained 202 malignancy mentions, was used to compare the automated extractor and baseline results. MTag identified 190 of the 202 mentions correctly (94.1%), while the NCI list identified only 85 mentions (42.1%), all of which were also identified by the extractor. We also determined the performance of string matching that instead used the set of malignancy mentions identified in the manually curated training set annotations (1,010 documents) as a matching lexicon. This system identified 79 of 202 mentions (39.1%). Combining the manually derived lexicon with the NCI lexicon yielded 124 of 202 matches (61.4%).

A closer analysis of the 68 malignancy mentions missed by string matching with the combined lists but positively identified by MTag revealed two general subclasses of additional malignancy mentions. The majority of MTag-unique mentions were lexical or modified variations of malignancies present either in the training data or in the NCI lexicon, such as minor variations in spelling and form (e.g., “leukaemia” versus “leukemia”) and acronyms (e.g., “AML” in place of “acute myeloid leukemia”). More importantly, a substantial minority of mentions identified only by MTag were instances of the extractor determining new mentions of malignancies that were, in many cases,
neither obvious nor represented in readily available lexicons. For example, “temporal lobe benign capillary haemangioblastoma” and “parietal lobe ganglioglioma” are neither in the NCI list nor in the training set per se, nor approximated as such by a lexical variant. This suggests that MTag contributes a significant learning component.

Application to MEDLINE

MTag was then used to extract mentions of malignancy from all MEDLINE abstracts through 2005. Extraction took 1,642 CPU-hours (68.4 CPU-days; 2.44 days on our 28-CPU cluster) to process 15,433,668 documents. A total of 9,153,340 redundant mentions and 580,002 unique mentions (ignoring case) were identified. Interestingly, the ratio of unique new mentions identified relative to the number of abstracts analyzed was relatively uniform, ranging from a rate of 0.183 new mentions per abstract for the first 0.1% of documents to a rate of 0.038 new mentions per abstract for the last 1% of documents. This indicates that a substantial rate of new mentions was maintained throughout the extraction process.

The 25 mentions found in the greatest number of abstracts by MTag are listed in Table 1. Six of these phrases (pulmonary, fibroblasts, neoplastic, neoplasm metastasis, extramural, and abdominal) did not match our definition of malignancy. Of these, only “extramural” is not frequently associated with malignancy descriptions; it is likely tagged because it contains character n-grams that are generally indicative of malignancy mentions. The remaining five phrases are likely the result of the extractor failing to properly define mention boundaries in certain cases (e.g., tagging “neoplasm” rather than “brain neoplasm”) or, alternatively, shared use of an otherwise indicative character string (e.g., “opl” in “brain neoplasm” and “neoplastic”) between a true positive
and a false positive.

For comparison, we also determined the corresponding number of articles identified both by keyword searching of PubMed and by exact string matching of MEDLINE for each of the 19 most common true malignancy types (Table 1). Overall, MTag’s comparative recall was 1.076 versus PubMed keyword searching and 0.814 versus string matching. Because PubMed keyword searching uses concept mapping to relate keywords to related concepts, thus providing query expansion, the document retrieval totals derived from this approach are not strictly comparable to MTag’s. Furthermore, the exact string totals would be inflated relative to the MTag totals: for example, the phrase “myeloid leukemia” would be counted both for this category and for a category “leukemia” with exact string matching, but would be counted only for the former phrase by MTag. To adjust for these discrepancies, in the MTag document totals listed in Table 1 we included documents that were tagged with malignancy mentions that were both strict syntactic parents and biological children of the phrase used. For example, we included articles identified by MTag with the phrase “small-cell lung cancer” within the total for the phrase “lung cancer”.

Comparison of these totals between MTag articles and PubMed keyword searching revealed that MTag provided high recall for most malignancies. Interestingly, three malignancy mention instances (“carcinoma”, “sarcoma”, “melanoma”) had more MTag-identified articles than PubMed keyword searches. This suggests that a more formalized normalization of MTag-derived mentions might assist both with efficiency and recall if employed in concert with the manual annotation procedure currently employed by MEDLINE. Furthermore, MTag’s document recall compared
quite favorably to exact string matching. Only two of the 25 malignancy mentions yielded less than 60% as many articles via MTag as via PubMed exact string matching (“bone neoplasms” and “lung cancer”). In these two cases, the concept-mapping PubMed search identifies articles from a broader range of concepts beyond the search terms. For example, a PubMed search for the term “lung cancer” identifies articles describing “lung neoplasms”, while for “bone neoplasms”, articles focusing on related concepts such as “osteoma” and “sphenoid meningioma” are identified by PubMed. Generally, MTag recall would be expected to improve further after a subsequent normalization process that maps equivalent phrases to a standard referent.

To assess document-level precision, we randomly selected 100 abstracts identified by MTag for each of the malignancies “breast cancer” and “adenocarcinoma”. Manual evaluation of these abstracts showed that all of the articles directly described the respective malignancies. Finally, we evaluated both the 250 most frequently mentioned malignancies and a random set of 250 extracted malignancy mentions from the all-MEDLINE-extracted set. For the frequently occurring mentions, 72.06% were considered to be true malignancies; this set corresponds to 0.043% of all malignancy mentions. For the random set, 78.93% were true malignancies. This suggests that such extracted mention sets might serve as a first-pass exhaustive lexicon of malignancy mentions. Comparison of the entire set of unique mentions with the NCI neoplasm list showed that 1,902 of the 5,555 NCI terms (34.2%) were represented in the extracted literature.

Software
MTag is platform independent, written in Java, and requires Java 1.4.2 or higher to run. The software is freely available under the GNU General Public License at http://bioie.ldc.upenn.edu/index.jsp?page=soft_tools_MalignancyTaggers.html. MTag has been engineered to directly accept as input files downloaded from PubMed and formatted in MEDLINE format. MTag provides output options of text or HTML versions of the extractor results. The text file repeats the input file with recognized malignancy mentions appended at the end of the file. The HTML file provides markup of the original abstract with color-highlighted malignancy mentions, as shown in Figure 1.

Discussion

We have adapted an entity extraction approach that has been shown to be successful for recognition of molecular biological entities, and have shown that it also performs with high accuracy for disease labels. It is evident that an F-measure of 0.83 is not sufficient for a stand-alone approach to curation tasks, such as the de novo population of databases. However, such an approach provides highly enriched material for manual curators to utilize further. As determined by our comparisons with lexical string matching and PubMed-based approaches, our extraction method demonstrated substantial improvements in accuracy and efficiency over commonly employed methods for document retrieval. Furthermore, MTag appeared to be accurately predicting malignancy mentions by learning and exploiting syntactic patterns encountered in the training corpus.

Analysis of mis-annotations would likely suggest additional features and/or heuristics that could boost performance considerably. For example, anatomical and histological descriptions were frequent among MTag false positive mentions. Incorporation of lexicons for these entity types as negative features within the MTag model would likely
increase precision. Our training set also did not include a substantial number of documents containing no mentions of malignancy; recent unpublished work from our group suggests that inclusion of such documents significantly improves extractor performance.

Unlike the first iteration of our CRF model [14], the MTag application required only modest retraining and customization time (several weeks vs. several months; see Methods). To our surprise, the addition of biological features, including an extensive lexicon of malignancy mentions, provided very little boost to the recall rate. This provides evidence that our general CRF model is flexible and broadly applicable and, if these results hold true for additional entity types, might lessen the need for creating highly specialized extractors. In addition, the need for extensive domain-specific lexicons, which do not readily exist for many disease attributes, might be obviated. If so, one approach to comprehensive text mining of the biomedical literature might be to employ a series of modular extractors, each of which is quickly generated and then trained for a particular entity or relation class.

Conversely, it is important to note that the entity class of malignancy possesses a relatively discrete conceptualization relative to certain other phenotypic and disease concepts. Further adaptation of our extractor model for more variably described entity types, such as morphological and developmental descriptions of neoplasms, is underway. However, the finding that biological feature addition provided minimal gain in accuracy suggests that further improvements may be difficult to obtain merely by identifying and adding additional domain-specific features. Significantly, challenges in the rapid generation of annotations for extractor training, as well as procedures for efficient and accurate entity normalization, still
remain. When combined with expert evaluation of output, extractors can assist with vocabulary building for targeted entity classes. To demonstrate feasibility, we extracted mentions of malignancy from all pre-2006 MEDLINE abstracts. Our results indicate that MTag can generate such a vocabulary readily and with moderate computational resources and expertise. With manual intervention, this list could be linked to the underlying literature records and also integrated with other ontological and database resources, such as the Gene Ontology, UMLS, caBIG, or tumor-specific databases [23-25]. Since normalization of disease-descriptive term lists requires considerable specialized expertise, the role of an extractor in this setting is more appropriately that of an information harvester. This role is nonetheless important, as such supervised lists are often not readily available, due in part to the variability with which phenotypic and disease descriptions can be expressed, and in part to the lack of nomenclature standards in many cases.

Finally, to our knowledge, MTag is one of the first directed efforts to automatically extract entity mentions in a disease-oriented domain with high accuracy. Applications such as MTag could therefore contribute to the extraction and integration of unstructured, medically oriented information, such as physician notes and physician-dictated letters to patients and practitioners. Future work will include determining how well similar extractors perform at identifying mentions of malignant attributes with greater (e.g., tumor histology) and lesser (e.g., tumor clinical stage) semantic and syntactic heterogeneity.
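The precision, recall, and F-measure figures quoted throughout the Results and Discussion are related by the standard balanced (harmonic-mean) F formula. The following minimal sketch is not code from the MTag distribution; it simply shows how the reported 0.838 F-measure follows from the 0.846 precision and 0.831 recall of the full-feature model, and how the 94.1% mention-level figure follows from 190 of 202 mentions found:

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and
    false-negative mention counts."""
    return tp / (tp + fp), tp / (tp + fn)

# F-measure implied by the full-feature evaluation figures
print(round(f_measure(0.846, 0.831), 3))  # 0.838

# MTag's mention-level hit rate in the baseline comparison: 190 of 202
print(round(190 / 202, 3))  # 0.941
```

Because F is a harmonic mean, it is dominated by the lower of the two component scores, which is why the genome-specific corpus (recall 0.69) yields a noticeably lower F (0.73) despite its 0.77 precision.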
Conclusions

MTag can automatically identify and extract mentions of malignancy from biomedical text with high accuracy. Generation of MTag required only moderate computational expertise, development time, and domain knowledge. MTag substantially outperformed information retrieval methods using specialized lexicons. MTag also demonstrated the ability to assist with the generation of a literature-based vocabulary of all neoplasm mentions, which is of benefit for data integration procedures requiring normalization of malignancy mentions. Parallel iteration of the core algorithm used for MTag could provide a means for more systematic annotation of unstructured text, involving the identification of many entity types and application to phenotypic and medical classes of information.

Methods

Task definition

Our task was to develop an automated method that would accurately identify and extract strings of text corresponding to a clinician’s or researcher’s reference to cancer (malignancy). Our definition of the extent of the label “malignancy” was generally the full noun phrase encompassing a mention of a cancer subtype, such that “neuroblastoma”, “localized neuroblastoma”, and “primary extracranial neuroblastoma” were considered to be distinct mentions of malignancy. Directly adjacent prepositional phrases, as in “cancer <of the lung>”, were not allowed, as these constructions often denote ambiguity as to exact type. Within these confines, the task included identification of all variable descriptions of particular malignancies, such as the forms “squamous cell carcinoma” (histological observation) or “lung cancer” (anatomical location), both of which are
underspecified forms of “lung squamous cell carcinoma”. Our formal definition of the semantic type “malignancy” can be found at the Penn BioIE website [26].

Corpora

In order to train and test the extractor with both depth and breadth of entity mention, we combined two corpora. The first corpus concentrated on a specific malignancy (neuroblastoma) and consisted of 1,000 randomly selected abstracts identified by querying PubMed with the query terms “neuroblastoma” and “gene”. The second corpus consisted of 600 abstracts previously selected as likely to contain gene mutation instances for genes commonly mutated in a wide variety of malignancies. These sets were combined to create a single corpus of 1,442 abstracts, after eliminating 158 abstracts that appeared to be non-topical, had no abstract body, or were not written in English. This set was manually annotated for tokenization, part-of-speech assignments, and malignancy named entity recognition, the latter in strict adherence to our pre-established entity class definition [27, 28]. Sequential dual-pass annotations were performed on all documents by experienced annotators with biomedical knowledge, and discrepancies were resolved through forum discussions. A total of 7,303 malignancy mentions were identified in the document set. These annotations are available in corpus release v0.9 from our BioIE website [29].

Algorithm

Based on the manually annotated data, an automatic malignancy mention extractor (MTag) was developed using the probability model Conditional Random Fields (CRFs) [20]. We have previously demonstrated that this model yields state-of-the-art accuracy for recognition of molecular named entity classes [5, 14]. CRFs model the conditional