download

Department of
Computer Science
University of Bari
Knowledge Acquisition &
Machine Learning Lab
CILC 2006
Convegno Italiano di Logica Computazionale
26-27 giugno 2006, Dipartimento di Informatica, Bari
Learning for Biomedical Information
Extraction with ILP
Margherita Berardi Vincenzo Giuliano Donato Malerba

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Outline of the talk
 IE for Biomedicine
 Looking around
 IE problem formulation
 which representation model on data? which
features?
 which framework for reasoning?
 Mutual Recursion in IE
 Text processing & domain knowledge
 Application to studies on mitochondrial
genome
 Conclusions & Future work

What is “Information Extraction”
Filling slots in a database from sub-segments of text.As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-
source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
NAME TITLE ORGANIZATION

What is “Information Extraction”
Filling slots in a database from sub-segments of text.As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-
source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
NAME TITLE ORGANIZATION
Bill Gates CEO Microsoft
Bill Veghte VP Microsoft
Richard Stallman founder Free Soft..
IE

IE from Biomedical Texts: Motivation
 Complexity of biological systems:
 Too many specialized biological tasks
 Several entities interacting in a single phenomenon
 Many conditions to simultaneously verify
 Complexity of biomedical languages:
 Several nomenclatures, dictionaries, lexica
 tending to quickly become obsolete
Too much to read!
Genome decoding  increasing amount of published
literature

IE History
 Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER
[’92-’96]
 Most early work dominated by hand-built models
 E.g. SRI’s FASTUS, hand-built FSMs.
 But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and
then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
 Wrapper Induction: initially hand-build, then ML [Soderland ’96],
[Kushmeric ’97], …
 Most learning attempts based on statistical approaches
 Learning of production rules constrained by probability measures (e.g.,
HMMs, Probabilistic Context-free Grammars)
 Some recent logic-based approaches
 Rapier (Califf ’98)
 SRV (Freitag ’98)
 INTHELEX (Ferilli et al. ’01)
 FOIL-based (Aitken ’02)
 Aleph-based (Goadrich et al. ’04)

Learning Language in biomedicine
 BioCreAtIvE - Critical Assessment for Information Extraction in Biology
(http://biocreative.sourceforge.net/)
 BioNLP, Natural language processing of biology text
(http://www.bionlp.org)
 ACL/COLING Workshops on Natural Language Processing in Biomedicine
 SIGIR Workshops on Text Analysis for Bioinformatics
 Special Interest Group in Text Mining since ISMB’03 (Intelligent Systems
for Molecular Biology): BioLINK (Biology Literature, Information and
Knowledge)
 PSB (Pacific Symposium on Biocomputing) tracks
 Genomic tracks in TREC (Text Retrieval Conference)
 PASCAL challenges on information extraction http://nlp.shef.ac.uk/pascal/
 Workshops: IJCAI, ECAI, ECML/PKDD, ICML (Learning Language in Logic
since ’99, challenge task on Extracting Relations from Biomedical Texts)

Is there “Logic” in language learning?
 IE systems limitations, in general:
 Portability (domain-dependent, task-dependent)
 Scalability (work well on “relevant” data)
 Statistics-based approaches
 wide coverage,
 scalability,
 no semantics,
 no domain knowledge
 Logic-based approaches:
 natural encoding of natural language statements and queries in first-
order logic,
 human-comprehensible models,
 domain knowledge
 refinement of models
[R. J. Mooney, Learning for Semantic Interpretation: Scaling Up Without Dumbing Down, ICML Workshop on
Language Learning in Logic, 1999]

IE problem formulation for HmtDB
 HmtDB resource of variability data associated to clinical
phenotypes concerning human mithocondrial genome
(http://www.hmdb.uniba.it/)

Textual Entity Extraction
Ex: “Cytoplasts from two unrelated patients with MELAS (mitochondrial
myopathy, encephalopathy, lactic acidosis, and strokelike episodes) harboring
an A-*G transition at nucleotide position 3243 in the tRNALeU(UUR) gene of
the mitochondrial genome were fused with human cells lacking endogenous
mitochondrial DNA (mtDNA)”
pathology associated to the mutation under study,
substitution that causes the mutation,
type of the mutation,
position in the DNA where the mutation occurs,
gene correlated to the mutation.
By modelling the sentence structure:
substitution(X)  follows (Y,X), type (Y)
Extractors cannot be learned independently!!!

Textual Entity Extraction
 Each entity is characterized
by some slots defining a
template
 The task is to learn rules to
fill slots (template filling)
 Relations in data may
allow:
 intra-template
dependencies to be
learned
 context-sensitive
application of “extractors”
Mutation
Sampled population
DNA sample tissue
DNA screening method
…
Title
Abstract
Introduction
Methods

The learning task
 Classification
 Each class (slot) is a concept (target predicate), each
model (template filler) induced for the class is a logical
theory explaining the concept (set of predicate
definitions)
 Predefined models of classification should be provided
 Importance of domain knowledge and first-order
representations
 Usefulness of mutual recursion (concept dependencies)
 ILP = Inductive Learning  Logic Programming
 From IL: inductive reasoning from observations and
background knowledge
 From LP: first-order logic as representation formalism

ATRE
(Apprendimento di Teorie Ricorsive da Esempi)
http://www.di.uniba.it/~malerba/software/atre/
Given
 a set of concepts C1, C2, ... , Cr
 a set of objects O described in a language LO
 a background knowledge BK described in a language LBK
 a language of hypotheses LH that defines the space of
hypotheses SH
 a user’s preference criterion PC
Find
a (possibly recursive) logical theory T for the concepts C1,
C2, ... , Cr , such that T is complete and consistent with
respect to the set of observations and satisfies the
preference criterion PC.

ATRE
Main Characteristics
 Learning problem: induce recursive theories from examples
 ILP setting: learning from interpretations
 Observation language: ground multiple-head clauses
 Hypothesis language: non-ground definite clauses
 Constraints: linkedness + range-restrictedness
 Generalization model: generalized implication
 Search strategy for a recursive theory: separate-and-
parallel-conquer
 Continuous and discrete attributes and relations
 Background knowledge: intensionally defined

Data preparation
ATRE’s observation language: multiple-head clauses
 Enumeration of positive and negative examples
(expert users manual annotations + unlabelled
tokens)
 Descriptions of examples: which features?
 Statistical (frequencies)
 Lexical (alphanumeric, capitalized, …)
 Syntactical (nouns, verbs, adjectives, …)
 Domain-specific (dictionaries)

Text processing
 The GATE (A General Architecture for Text Engeneering)
framework (http://gate.ac.uk/)
 ANNIE is the IE core:
 Tokeniser
 Sentence Splitter
 POS tagger
 Morphological Analyser
 Gazetteers
 Semantic tagger (JAPE transducer)
 Orthomatcher (orthographic coreference)
 Some domain specific gazetteers have been added
(diseases, enzymes, genes, methods of analysis)

Text processing
 Some reg. expr. to capture some domain specific patterns
(alphanumeric strings, appositions, etc.)
 Shallow acronym resolution
Screening operations:
 Some POSs (nouns, verbs, adjectives, numbers, symbols)
 Punctuation
 stopwords (glimpse.cs.arizona.edu. )
Stemming (Porter)

Text description
 word_to_string(token)
Numerical:
 lenght(token), word_frequency(token),
distance_word_category(token1,token2)
Structural:
 s_part_of(token1,token2), first(token), last(token),
first_is_char(token), first_is_numeric(token),
middle_is_char(token), middle_is_numeric(token),
last_is_char(token), last_is_numeric(token),
single_char(token), follows(token1,token2)
Lexical:
 type_of(token), type_POS(token)
Domain dependent:
 word_category(token)

Application
 We considered 71 documents selected by
biologists
 Expert users manually annotated occurrences of
entities of interest, namely
Mutation: position, type, substitution, type_position, locus
Subjects: nationality, method, pathology, category, number
 The extraction process (both learning and
recognition) is locally performed to text portions
of interest, automatically classified

Textual portions of papers were categorized in five
classes: Abstract, Introduction, Materials & Methods,
Discussion and Results
The abstract of each paper was processed
0,00
10,00
20,00
30,00
40,00
50,00
60,00
70,00
80,00
90,00
100,00
Abstract Introduction Methods Results Discussion
Correctlyclassified(%)
Avg. No. of categories correctly classified

An A-to-G mutation at nucleotide position (np) 3243 in the mitochondrial
tRNALeu(UUR) gene is closely associated with various clinical
phenotypes of diabetes mellitus.
[annotation(3)=substitution, annotation(4)=no_tag, annotation(5)=no_tag,
annotation(6)=no_tag, annotation(7)=position, annotation(8)=no_tag,
annotation(9)=locus, annotation(10)=no_tag, annotation(11)=no_tag,
annotation(12)=no_tag, annotation(13)=no_tag, annotation(14)=no_tag,
annotation(15)=no_tag, annotation(16)=pathology],

[part_of(1,2)=true, contain(2,3)=true, …, contain(2,16)=true,
word_to_string(3)=‘A-to-G', word_to_string(4)='mutation',
word_to_string(5)='nucleotid',
word_to_string(6)='position',word_to_string(7)='3243',
word_to_string(8)='mitochondri', word_to_string(9)='trnaleu(uur)',
word_to_string(10)='gene', word_to_string(11)='clos',
word_to_string(12)='associat', word_to_string(13)='variou',
word_to_string(14)='clinic', word_to_string(15)='phenotyp',
word_to_string(16)='diabetes_mellitus', type_of(3)=upperinitial, …,
type_of(7)=numeric, type_POS(3)=jj, type_POS(4)=nn, …, type_POS(15)=nns,
word_frequency(3)=3, word_frequency(4)=6, …, word_frequency(16)=1,
word_category(9)=locus, word_category(16)=disease,
distance_word_category(9,16)=1, follows(3,4)=true, follows(4,5)=true,…,
follows(14,15)=true, follows(15,16)=true]).
Example description

Background knowledge
 follows(X,Z)  follows(X,Y)=true, follows(Y,Z)=true
 char_number_char(X)=true  first_is_char(X)=true,
middle_is_numeric(X)=true, last_is_char(X)=true
 number_char_char(X)=true  first_is_numeric(X)=true,
middle_is_char(X)=true, last_is_char(X)=true
 char_char_number(X)=true  first_is_char(X)=true,
middle_is_char(X)=true, last_is_numeric(X)=true
Domain knowledge:
 word_to_string(X)=transition  word_to_string(X)=transversion
 word_to_string(X)=substitution  word_to_string(X)=replacement

Experiments
 Mutation template
 6-fold cross validation
 The user manually annotates 355 tokens (8.65 per
abstract)
 About 11% positives

Experiments

Learned theories
annotation(X1)=position 
follows(X2,X1)=true, type_of(X1)=numeric, follows(X1,X3)=true,
word_category(X3)=gene, word_to_string(X2)=position.
annotation(X1)=type 
follows(X1,X2)=true, word_frequency(X2) in [8..140],
follows(X3,X1)=true, annotation(X3)=substitution
annotation(X1)=position 
follows(X2,X1)=true, annotation(X2)=substitution, follows(X3,X1)=true,
follows(X1,X4)=true, word_frequency(X4) in [6..6],
annotation(X3)=type, follows(X1,X5)=true, annotation(X5)=locus,
word_frequency(X1) in [1..2]

Wrap-up
 IE in Biomedicine
 The ILP approach to IE within a multi-relational framework
allows to implicitly define
 Domain knowledge
 Learning from users’ interaction
 Relational representations
 Learning relational patterns to allow context-sensitive application of
models
 Recursive Theory Learning in IE: ATRE
 Efforts on text processing level:
 Ambiguities
 Data sparseness
 Noise
 Encouraging results on a real-world data set

Where from here?
 Test on available corpus for Bio IE
 Genia
 BioCreative
 NLPBA
 Genic interaction challenges
 Investigation of semisupervised approaches: online
extension of dictionaries
 How to encapsulate taxonomical knowledge?
 Can information extracted by ATRE be really used as
background knowledge for genomic database mining?

download

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to download

Similar to download (20)

More from butest

More from butest (20)

download

Editor's Notes