1. Department of
Computer Science
University of Bari
Knowledge Acquisition &
Machine Learning Lab
CILC 2006
Convegno Italiano di Logica Computazionale
26-27 giugno 2006, Dipartimento di Informatica, Bari
Learning for Biomedical Information
Extraction with ILP
Margherita Berardi Vincenzo Giuliano Donato Malerba
2. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Outline of the talk
IE for Biomedicine
Looking around
IE problem formulation
which representation model on data? which
features?
which framework for reasoning?
Mutual Recursion in IE
Text processing & domain knowledge
Application to studies on mitochondrial
genome
Conclusions & Future work
3. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-
source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
NAME TITLE ORGANIZATION
4. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
What is “Information Extraction”
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
IE
5. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
IE from Biomedical Texts: Motivation
Complexity of biological systems:
Too many specialized biological tasks
Several entities interacting in a single phenomenon
Many conditions to simultaneously verify
Complexity of biomedical languages:
Several nomenclatures, dictionaries, lexica
tending to quickly become obsolete
Too much to read!
Genome decoding has led to an increasing amount of published
literature
6. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
IE History
Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER
[’92-’96]
Most early work dominated by hand-built models
E.g. SRI’s FASTUS, hand-built FSMs.
But by the 1990s, some machine learning: Lehnert, Cardie, Grishman and
then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Wrapper Induction: initially hand-built, then ML [Soderland ’96],
[Kushmerick ’97], …
Most learning attempts based on statistical approaches
Learning of production rules constrained by probability measures (e.g.,
HMMs, Probabilistic Context-free Grammars)
Some recent logic-based approaches
Rapier (Califf ’98)
SRV (Freitag ’98)
INTHELEX (Ferilli et al. ’01)
FOIL-based (Aitken ’02)
Aleph-based (Goadrich et al. ’04)
7. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Learning Language in biomedicine
BioCreAtIvE - Critical Assessment for Information Extraction in Biology
(http://biocreative.sourceforge.net/)
BioNLP, Natural language processing of biology text
(http://www.bionlp.org)
ACL/COLING Workshops on Natural Language Processing in Biomedicine
SIGIR Workshops on Text Analysis for Bioinformatics
Special Interest Group in Text Mining since ISMB’03 (Intelligent Systems
for Molecular Biology): BioLINK (Biology Literature, Information and
Knowledge)
PSB (Pacific Symposium on Biocomputing) tracks
Genomic tracks in TREC (Text Retrieval Conference)
PASCAL challenges on information extraction http://nlp.shef.ac.uk/pascal/
Workshops: IJCAI, ECAI, ECML/PKDD, ICML (Learning Language in Logic
since ’99, challenge task on Extracting Relations from Biomedical Texts)
8. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Is there “Logic” in language learning?
IE systems limitations, in general:
Portability (domain-dependent, task-dependent)
Scalability (work well on “relevant” data)
Statistics-based approaches
wide coverage,
scalability,
no semantics,
no domain knowledge
Logic-based approaches:
natural encoding of natural language statements and queries in first-
order logic,
human-comprehensible models,
domain knowledge
refinement of models
[R. J. Mooney, Learning for Semantic Interpretation: Scaling Up Without Dumbing Down, ICML Workshop on
Language Learning in Logic, 1999]
9. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
IE problem formulation for HmtDB
HmtDB: a resource of variability data associated with clinical
phenotypes concerning the human mitochondrial genome
(http://www.hmdb.uniba.it/)
10. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Textual Entity Extraction
Ex: “Cytoplasts from two unrelated patients with MELAS (mitochondrial
myopathy, encephalopathy, lactic acidosis, and strokelike episodes) harboring
an A→G transition at nucleotide position 3243 in the tRNALeu(UUR) gene of
the mitochondrial genome were fused with human cells lacking endogenous
mitochondrial DNA (mtDNA)”
pathology associated with the mutation under study,
substitution that causes the mutation,
type of the mutation,
position in the DNA where the mutation occurs,
gene correlated to the mutation.
By modelling the sentence structure:
substitution(X) ← follows(Y,X), type(Y)
Extractors cannot be learned independently!!!
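The dependency between extractors can be sketched in Python. This is only an illustration under stated assumptions: the first rule is the one on the slide, the second rule and the token list are hypothetical, and follows(Y,X) is read as "X comes immediately after Y".

```python
# Each rule is (head label, label required on the immediately preceding token).
RULES = [
    ("substitution", "type"),      # substitution(X) <- follows(Y,X), type(Y)  (from the slide)
    ("position", "substitution"),  # hypothetical second rule, for illustration only
]

def extract(n_tokens, seed_labels):
    """Apply the dependent rules until no new label can be derived."""
    labels = dict(seed_labels)
    changed = True
    while changed:
        changed = False
        for x in range(1, n_tokens):
            if x in labels:
                continue
            y = x - 1  # follows(y, x): x comes right after y
            for head, needed in RULES:
                if labels.get(y) == needed:
                    labels[x] = head
                    changed = True
                    break
    return labels

tokens = ["transition", "A-to-G", "3243"]    # hypothetical token sequence
with_type = extract(len(tokens), {0: "type"})
without_type = extract(len(tokens), {})
```

With the seed label on token 0, the other two labels are derived in turn; without it nothing is derived, which is why the extractors cannot be treated as independent.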
11. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Textual Entity Extraction
Each entity is characterized
by some slots defining a
template
The task is to learn rules to
fill slots (template filling)
Relations in data may
allow:
intra-template
dependencies to be
learned
context-sensitive
application of “extractors”
Mutation
Sampled population
DNA sample tissue
DNA screening method
…
Title
Abstract
Introduction
Methods
12. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
The learning task
Classification
Each class (slot) is a concept (target predicate), each
model (template filler) induced for the class is a logical
theory explaining the concept (set of predicate
definitions)
Predefined models of classification should be provided
Importance of domain knowledge and first-order
representations
Usefulness of mutual recursion (concept dependencies)
ILP (Inductive Logic Programming) = Inductive Learning + Logic Programming
From IL: inductive reasoning from observations and
background knowledge
From LP: first-order logic as representation formalism
13. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
ATRE
(Apprendimento di Teorie Ricorsive da Esempi)
http://www.di.uniba.it/~malerba/software/atre/
Given
a set of concepts C1, C2, ... , Cr
a set of objects O described in a language LO
a background knowledge BK described in a language LBK
a language of hypotheses LH that defines the space of
hypotheses SH
a user’s preference criterion PC
Find
a (possibly recursive) logical theory T for the concepts C1,
C2, ... , Cr , such that T is complete and consistent with
respect to the set of observations and satisfies the
preference criterion PC.
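The two requirements on T can be sketched in Python as follows. This is a simplified illustration, not ATRE itself: it assumes a single concept, non-recursive clauses represented as Python predicates, and hypothetical example dictionaries.

```python
def covers(theory, example):
    """A theory covers an example when at least one of its clauses does."""
    return any(clause(example) for clause in theory)

def is_complete(theory, positives):
    """Complete: every positive example is covered."""
    return all(covers(theory, e) for e in positives)

def is_consistent(theory, negatives):
    """Consistent: no negative example is covered."""
    return not any(covers(theory, e) for e in negatives)

# Hypothetical clause for the concept "position": a numeric token preceded by "position".
theory = [lambda e: e["type_of"] == "numeric" and e["previous_word"] == "position"]

positives = [{"type_of": "numeric", "previous_word": "position"}]
negatives = [{"type_of": "numeric", "previous_word": "patients"}]

complete = is_complete(theory, positives)
consistent = is_consistent(theory, negatives)
```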
14. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
ATRE
Main Characteristics
Learning problem: induce recursive theories from examples
ILP setting: learning from interpretations
Observation language: ground multiple-head clauses
Hypothesis language: non-ground definite clauses
Constraints: linkedness + range-restrictedness
Generalization model: generalized implication
Search strategy for a recursive theory: separate-and-
parallel-conquer
Continuous and discrete attributes and relations
Background knowledge: intensionally defined
15. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Data preparation
ATRE’s observation language: multiple-head clauses
Enumeration of positive and negative examples
(expert users' manual annotations + unlabelled
tokens)
Descriptions of examples: which features?
Statistical (frequencies)
Lexical (alphanumeric, capitalized, …)
Syntactical (nouns, verbs, adjectives, …)
Domain-specific (dictionaries)
17. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Text processing
The GATE (A General Architecture for Text Engineering)
framework (http://gate.ac.uk/)
ANNIE is the IE core:
Tokeniser
Sentence Splitter
POS tagger
Morphological Analyser
Gazetteers
Semantic tagger (JAPE transducer)
Orthomatcher (orthographic coreference)
Some domain specific gazetteers have been added
(diseases, enzymes, genes, methods of analysis)
18. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Text processing
Some regular expressions to capture domain-specific patterns
(alphanumeric strings, appositions, etc.)
Shallow acronym resolution
Screening operations:
Some POSs (nouns, verbs, adjectives, numbers, symbols)
Punctuation
Stopwords (glimpse.cs.arizona.edu)
Stemming (Porter)
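A rough Python sketch of the screening steps above (punctuation removal, stopword removal, stemming). The stopword list is a tiny illustrative one and the stemmer is a naive suffix stripper standing in for Porter's algorithm, so both are assumptions; the POS-based filter is not covered.

```python
import string

STOPWORDS = {"a", "an", "the", "at", "in", "of", "with", "is"}  # illustrative only
SUFFIXES = ("ing", "ed", "es", "s")  # naive stand-in, NOT the Porter stemmer

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def screen(tokens):
    """Drop punctuation and stopwords, then stem what is left."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if all(ch in string.punctuation for ch in low):
            continue  # pure punctuation token
        if low in STOPWORDS:
            continue
        kept.append(naive_stem(low))
    return kept

screened = screen(["The", "phenotypes", ",", "associated", "with", "various", "mutations"])
```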
19. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Text description
word_to_string(token)
Numerical:
lenght(token), word_frequency(token),
distance_word_category(token1,token2)
Structural:
s_part_of(token1,token2), first(token), last(token),
first_is_char(token), first_is_numeric(token),
middle_is_char(token), middle_is_numeric(token),
last_is_char(token), last_is_numeric(token),
single_char(token), follows(token1,token2)
Lexical:
type_of(token), type_POS(token)
Domain dependent:
word_category(token)
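Some of the structural descriptors listed above could be computed as in this Python sketch. The exact semantics of each predicate are assumed from its name, tokens are assumed to be non-empty strings, and the fact keys simply mimic the notation used in these slides.

```python
def describe(tokens):
    """Build a dictionary of facts for a token sequence (tokens numbered from 1)."""
    facts = {}
    for i, tok in enumerate(tokens, start=1):
        facts[f"word_to_string({i})"] = tok
        facts[f"first_is_char({i})"] = tok[0].isalpha()
        facts[f"first_is_numeric({i})"] = tok[0].isdigit()
        facts[f"last_is_char({i})"] = tok[-1].isalpha()
        facts[f"last_is_numeric({i})"] = tok[-1].isdigit()
        facts[f"single_char({i})"] = len(tok) == 1
        if i > 1:
            # follows(a,b): token b comes immediately after token a
            facts[f"follows({i - 1},{i})"] = True
    return facts

facts = describe(["position", "3243"])
```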
20. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Application
We considered 71 documents selected by
biologists
Expert users manually annotated occurrences of
entities of interest, namely
Mutation: position, type, substitution, type_position, locus
Subjects: nationality, method, pathology, category, number
The extraction process (both learning and
recognition) is performed locally on the text portions
of interest, which are automatically classified
21. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Textual portions of papers were categorized in five
classes: Abstract, Introduction, Materials & Methods,
Discussion and Results
The abstract of each paper was processed
[Figure: bar chart of the percentage of correctly classified text portions per category (Abstract, Introduction, Methods, Results, Discussion); y-axis: Correctly classified (%), 0 to 100. Caption: Avg. No. of categories correctly classified]
22. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
An A-to-G mutation at nucleotide position (np) 3243 in the mitochondrial
tRNALeu(UUR) gene is closely associated with various clinical
phenotypes of diabetes mellitus.
[annotation(3)=substitution, annotation(4)=no_tag, annotation(5)=no_tag,
annotation(6)=no_tag, annotation(7)=position, annotation(8)=no_tag,
annotation(9)=locus, annotation(10)=no_tag, annotation(11)=no_tag,
annotation(12)=no_tag, annotation(13)=no_tag, annotation(14)=no_tag,
annotation(15)=no_tag, annotation(16)=pathology],
[part_of(1,2)=true, contain(2,3)=true, …, contain(2,16)=true,
word_to_string(3)='A-to-G', word_to_string(4)='mutation',
word_to_string(5)='nucleotid',
word_to_string(6)='position',word_to_string(7)='3243',
word_to_string(8)='mitochondri', word_to_string(9)='trnaleu(uur)',
word_to_string(10)='gene', word_to_string(11)='clos',
word_to_string(12)='associat', word_to_string(13)='variou',
word_to_string(14)='clinic', word_to_string(15)='phenotyp',
word_to_string(16)='diabetes_mellitus', type_of(3)=upperinitial, …,
type_of(7)=numeric, type_POS(3)=jj, type_POS(4)=nn, …, type_POS(15)=nns,
word_frequency(3)=3, word_frequency(4)=6, …, word_frequency(16)=1,
word_category(9)=locus, word_category(16)=disease,
distance_word_category(9,16)=1, follows(3,4)=true, follows(4,5)=true,…,
follows(14,15)=true, follows(15,16)=true]).
Example description
24. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Experiments
Mutation template
6-fold cross validation
The user manually annotates 355 tokens (8.65 per
abstract)
About 11% positives
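A k-fold split can be sketched in Python as below. How the abstracts were actually assigned to folds is not stated, so the round-robin assignment and the item count in the usage lines are assumptions.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists; each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]  # round-robin assignment
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Illustrative usage: 12 hypothetical abstracts, 6 folds.
splits = list(k_fold_indices(12, 6))
```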
25. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Experiments
26. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Learned theories
annotation(X1)=position ←
follows(X2,X1)=true, type_of(X1)=numeric, follows(X1,X3)=true,
word_category(X3)=gene, word_to_string(X2)=position.
annotation(X1)=type ←
follows(X1,X2)=true, word_frequency(X2) in [8..140],
follows(X3,X1)=true, annotation(X3)=substitution.
annotation(X1)=position ←
follows(X2,X1)=true, annotation(X2)=substitution, follows(X3,X1)=true,
follows(X1,X4)=true, word_frequency(X4) in [6..6],
annotation(X3)=type, follows(X1,X5)=true, annotation(X5)=locus,
word_frequency(X1) in [1..2].
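To show how the first clause for position could be matched against a description, here is a Python sketch. The description `d` is hypothetical, and the variables X2 and X3 are treated as existentially quantified over the follows facts, which is an assumed reading of the clause.

```python
def position_rule(x1, d):
    """annotation(X1)=position if follows(X2,X1), type_of(X1)=numeric,
    follows(X1,X3), word_category(X3)=gene, word_to_string(X2)=position."""
    predecessors = [a for (a, b) in d["follows"] if b == x1]
    successors = [b for (a, b) in d["follows"] if a == x1]
    return (
        d["type_of"].get(x1) == "numeric"
        and any(d["word_to_string"].get(x2) == "position" for x2 in predecessors)
        and any(d["word_category"].get(x3) == "gene" for x3 in successors)
    )

# Hypothetical three-token description: "position 3243 gene".
d = {
    "follows": {(1, 2), (2, 3)},
    "type_of": {2: "numeric"},
    "word_to_string": {1: "position", 2: "3243", 3: "gene"},
    "word_category": {3: "gene"},
}

matches = [x for x in (1, 2, 3) if position_rule(x, d)]
```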
27. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Wrap-up
IE in Biomedicine
The ILP approach to IE within a multi-relational framework
makes it possible to implicitly define
Domain knowledge
Learning from users’ interaction
Relational representations
Learning relational patterns to allow context-sensitive application of
models
Recursive Theory Learning in IE: ATRE
Efforts on text processing level:
Ambiguities
Data sparseness
Noise
Encouraging results on a real-world data set
28. CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari
Where from here?
Test on available corpora for Bio IE
Genia
BioCreative
NLPBA
Genic interaction challenges
Investigation of semisupervised approaches: online
extension of dictionaries
How to encapsulate taxonomical knowledge?
Can information extracted by ATRE be really used as
background knowledge for genomic database mining?
Editor's Notes
What is IE? As a task it is… Starting with some text… and an empty database with a defined ontology of fields and records, use the information in the text to fill the database.
ML… although this is an area where ML has not yet trounced the hand-built systems. In some of the latest evaluations, a hand-built system shared 1st place with an ML system. Many companies now make a business from IE (from the Web): WhizBang, Inxight, Intelliseek, ClearForest.
Data sparseness, robustness
CV i.e. it is divided into 5 folds (Four are used for training and one for testing in turn).
Initial ILP research dealt with concept learning in the form of predicate definition learning
ATRE is a multiple-concept learning system, which solves the following problem:
Since the generation of a clause depends on the chosen seed, several seeds have to be chosen such that at least one seed per incomplete predicate definition is kept. Therefore, the search space is actually a forest of as many search-trees as the number of chosen seeds. The parallel exploration of the forest related to odd and even numbers. Specialization hierarchies are traversed top-down. Search proceeds towards deeper and deeper levels of the specialization hierarchies until at least a user-defined number of consistent clauses is found. A supervisor task decides whether the search should carry on or not on the basis of the results returned by the concurrent tasks. When the search is stopped, the supervisor selects the “best” consistent clause according to the user’s preference criterion. This strategy has the advantage that simpler consistent clauses are found first, independently of the predicates to be learned. First learning step. Consistent clauses in red.
Second learning step
If we guarantee the following two conditions: ……………………… then after a finite number of steps a theory $T$, which is complete and consistent, is built. If we denote by $\mathrm{LHM}(T_i)$ the least Herbrand model of a theory $T_i$, the stepwise construction of theories entails that $\mathrm{LHM}(T_i) \subseteq \mathrm{LHM}(T_{i+1})$ for each $i \in \{0, 1, \ldots, n-1\}$, since the addition of a clause to a theory can only augment the LHM.
In order to guarantee the first of the two conditions it is possible to proceed as follows. First, a positive example $e^+$ of a predicate $p$ to be learned is selected, such that $e^+$ is not in $\mathrm{LHM}(T_i)$. The example $e^+$ is called the seed. Then the space of definite clauses more general than $e^+$ is explored, looking for a clause $C$, if any, such that $\mathrm{neg}(\mathrm{LHM}(T_i \cup \{C\})) = \emptyset$. In this way we guarantee that the second condition above holds as well. When found, $C$ is added to $T_i$, giving $T_{i+1}$. If some positive examples are not included in $\mathrm{LHM}(T_{i+1})$ then a new seed is selected and the process is repeated. The second condition is more difficult to guarantee because of the non-monotonicity property. The approach followed in ATRE to remove inconsistency due to the addition of a clause to the theory consists of simple syntactic changes in the theory, which eventually create new layers. The layering of a theory introduces a first variation of the classical separate-and-conquer strategy sketched above, since the addition of a locally consistent clause generated in the conquer stage is preceded by a global consistency check.
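The claim that adding a clause can only enlarge the least Herbrand model can be checked on a toy case. The Python sketch below assumes ground definite clauses only, written as (head, body) pairs, and uses hypothetical odd/even facts; it is not how ATRE computes models.

```python
def least_herbrand_model(clauses):
    """Forward-chain over ground definite clauses until a fixpoint is reached."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in model and all(atom in model for atom in body):
                model.add(head)
                changed = True
    return model

# Theories built stepwise by adding one clause at a time.
t0 = [("even(0)", [])]
t1 = t0 + [("odd(1)", ["even(0)"])]
t2 = t1 + [("even(2)", ["odd(1)"])]

models = [least_herbrand_model(t) for t in (t0, t1, t2)]
```

Each model in `models` contains the previous one, matching the inclusion between the least Herbrand models of successive theories.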
Learning multi-relational patterns from multi-relational data and background knowledge. It makes it possible to navigate the relational structure of the data.