Department of Computer Science, University of Bari
Knowledge Acquisition & Machine Learning Lab
CILC 2006, Convegno Italiano di Logica Computazionale
26-27 June 2006, Dipartimento di Informatica, Bari
Learning for Biomedical Information Extraction with ILP
Margherita Berardi, Vincenzo Giuliano, Donato Malerba
Outline of the talk
• IE for Biomedicine
• Looking around
• IE problem formulation
   - which representation model on data? which features?
   - which framework for reasoning?
• Mutual Recursion in IE
• Text processing & domain knowledge
• Application to studies on mitochondrial genome
• Conclusions & Future work
What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-
source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
IE output:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
IE from Biomedical Texts: Motivation
• Complexity of biological systems:
   - Too many specialized biological tasks
   - Several entities interacting in a single phenomenon
   - Many conditions to simultaneously verify
• Complexity of biomedical languages:
   - Several nomenclatures, dictionaries, lexica
   - tending to quickly become obsolete
Too much to read!
Genome decoding → increasing amount of published literature
IE History
• Message Understanding Conference (MUC), DARPA [’87-’95]; TIPSTER [’92-’96]
• Most early work dominated by hand-built models
   - E.g. SRI’s FASTUS, hand-built FSMs.
• But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek ’97], BBN [Bikel et al. ’98]
• Wrapper Induction: initially hand-built, then ML [Soderland ’96], [Kushmerick ’97], …
• Most learning attempts based on statistical approaches
   - Learning of production rules constrained by probability measures (e.g., HMMs, Probabilistic Context-free Grammars)
• Some recent logic-based approaches
   - Rapier (Califf ’98)
   - SRV (Freitag ’98)
   - INTHELEX (Ferilli et al. ’01)
   - FOIL-based (Aitken ’02)
   - Aleph-based (Goadrich et al. ’04)
Learning Language in biomedicine
• BioCreAtIvE - Critical Assessment for Information Extraction in Biology (http://biocreative.sourceforge.net/)
• BioNLP, Natural language processing of biology text (http://www.bionlp.org)
• ACL/COLING Workshops on Natural Language Processing in Biomedicine
• SIGIR Workshops on Text Analysis for Bioinformatics
• Special Interest Group in Text Mining since ISMB’03 (Intelligent Systems for Molecular Biology): BioLINK (Biology Literature, Information and Knowledge)
• PSB (Pacific Symposium on Biocomputing) tracks
• Genomic tracks in TREC (Text Retrieval Conference)
• PASCAL challenges on information extraction (http://nlp.shef.ac.uk/pascal/)
• Workshops: IJCAI, ECAI, ECML/PKDD, ICML (Learning Language in Logic since ’99, challenge task on Extracting Relations from Biomedical Texts)
Is there “Logic” in language learning?
• IE systems limitations, in general:
   - Portability (domain-dependent, task-dependent)
   - Scalability (work well on “relevant” data)
• Statistics-based approaches:
   - wide coverage,
   - scalability,
   - no semantics,
   - no domain knowledge
• Logic-based approaches:
   - natural encoding of natural language statements and queries in first-order logic,
   - human-comprehensible models,
   - domain knowledge,
   - refinement of models
[R. J. Mooney, Learning for Semantic Interpretation: Scaling Up Without Dumbing Down, ICML Workshop on Learning Language in Logic, 1999]
IE problem formulation for HmtDB
• HmtDB: a resource of variability data associated with clinical phenotypes concerning the human mitochondrial genome (http://www.hmdb.uniba.it/)
Textual Entity Extraction
Ex: “Cytoplasts from two unrelated patients with MELAS (mitochondrial myopathy, encephalopathy, lactic acidosis, and strokelike episodes) harboring an A→G transition at nucleotide position 3243 in the tRNALeu(UUR) gene of the mitochondrial genome were fused with human cells lacking endogenous mitochondrial DNA (mtDNA)”
Entities to extract:
• pathology associated with the mutation under study,
• substitution that causes the mutation,
• type of the mutation,
• position in the DNA where the mutation occurs,
• gene correlated with the mutation.
By modelling the sentence structure:
substitution(X) ← follows(Y,X), type(Y)
Extractors cannot be learned independently!
Textual Entity Extraction
• Each entity is characterized by some slots defining a template
• The task is to learn rules to fill slots (template filling)
• Relations in data may allow:
   - intra-template dependencies to be learned
   - context-sensitive application of “extractors”
[Figure labels: Mutation, Sampled population, DNA sample tissue, DNA screening method, …; Title, Abstract, Introduction, Methods]
The learning task
• Classification
   - Each class (slot) is a concept (target predicate); each model (template filler) induced for the class is a logical theory explaining the concept (set of predicate definitions)
   - Predefined models of classification should be provided
• Importance of domain knowledge and first-order representations
• Usefulness of mutual recursion (concept dependencies)
• ILP = Inductive Learning ∩ Logic Programming
   - From IL: inductive reasoning from observations and background knowledge
   - From LP: first-order logic as representation formalism
ATRE
(Apprendimento di Teorie Ricorsive da Esempi, i.e. learning recursive theories from examples)
http://www.di.uniba.it/~malerba/software/atre/
Given
• a set of concepts C1, C2, ..., Cr
• a set of objects O described in a language LO
• a background knowledge BK described in a language LBK
• a language of hypotheses LH that defines the space of hypotheses SH
• a user’s preference criterion PC
Find
a (possibly recursive) logical theory T for the concepts C1, C2, ..., Cr, such that T is complete and consistent with respect to the set of observations and satisfies the preference criterion PC.
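The two requirements on T can be illustrated with a small Python sketch. This is not ATRE's implementation: a theory is simplified here to a list of Python predicates over token dictionaries (a stand-in for first-order clauses), and the names `covers`, `is_complete` and `is_consistent`, as well as the toy examples, are assumptions made for illustration only.

```python
from typing import Callable, Dict, List

Token = Dict[str, object]          # hypothetical flat description of one example
Rule = Callable[[Token], bool]     # stand-in for a clause of the theory


def covers(theory: List[Rule], example: Token) -> bool:
    """A theory covers an example if at least one of its rules fires on it."""
    return any(rule(example) for rule in theory)


def is_complete(theory: List[Rule], positives: List[Token]) -> bool:
    """Complete: every positive example is covered."""
    return all(covers(theory, e) for e in positives)


def is_consistent(theory: List[Rule], negatives: List[Token]) -> bool:
    """Consistent: no negative example is covered."""
    return not any(covers(theory, e) for e in negatives)


# Toy data for a "position" concept (invented for this sketch).
positives = [{"text": "3243", "numeric": True, "prev": "position"}]
negatives = [
    {"text": "1998", "numeric": True, "prev": "in"},
    {"text": "gene", "numeric": False, "prev": "the"},
]

too_general = [lambda t: t["numeric"]]
refined = [lambda t: t["numeric"] and t["prev"] == "position"]
```

In this toy setting `too_general` is complete but not consistent (it also covers "1998"), while `refined` satisfies both requirements.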
ATRE
Main Characteristics
• Learning problem: induce recursive theories from examples
• ILP setting: learning from interpretations
• Observation language: ground multiple-head clauses
• Hypothesis language: non-ground definite clauses
• Constraints: linkedness + range-restrictedness
• Generalization model: generalized implication
• Search strategy for a recursive theory: separate-and-parallel-conquer
• Continuous and discrete attributes and relations
• Background knowledge: intensionally defined
Data preparation
ATRE’s observation language: multiple-head clauses
• Enumeration of positive and negative examples (expert users’ manual annotations + unlabelled tokens)
• Descriptions of examples: which features?
   - Statistical (frequencies)
   - Lexical (alphanumeric, capitalized, …)
   - Syntactical (nouns, verbs, adjectives, …)
   - Domain-specific (dictionaries)
Text processing
• The GATE (General Architecture for Text Engineering) framework (http://gate.ac.uk/)
• ANNIE is the IE core:
   - Tokeniser
   - Sentence Splitter
   - POS tagger
   - Morphological Analyser
   - Gazetteers
   - Semantic tagger (JAPE transducer)
   - Orthomatcher (orthographic coreference)
• Some domain-specific gazetteers have been added (diseases, enzymes, genes, methods of analysis)
Text processing
• Some regular expressions to capture domain-specific patterns (alphanumeric strings, appositions, etc.)
• Shallow acronym resolution
Screening operations:
   - Some POSs (nouns, verbs, adjectives, numbers, symbols)
   - Punctuation
   - Stopwords (glimpse.cs.arizona.edu)
Stemming (Porter)
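A minimal Python sketch of the screening and stemming steps, under stated assumptions: the stopword list is a tiny invented sample (not the glimpse list), the POS-based screening is left out, and `crude_stem` is a simple suffix stripper used as a stand-in, not the Porter algorithm. The function names are hypothetical.

```python
import string

# Tiny illustrative stopword list (assumption; not the list used in the talk).
STOPWORDS = {"an", "at", "in", "the", "is", "with", "of"}

# Suffixes tried in order by the stand-in stemmer (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def screen(tokens):
    """Lowercase tokens, drop punctuation and stopwords, stem what is left."""
    kept = []
    for tok in tokens:
        lowered = tok.lower()
        if all(ch in string.punctuation for ch in lowered):
            continue  # pure punctuation (or empty) token
        if lowered in STOPWORDS:
            continue
        kept.append(crude_stem(lowered))
    return kept


sentence = ["An", "A-to-G", "mutation", "at", "nucleotide", "position",
            "3243", ",", "in", "the", "gene", "phenotypes", "."]
result = screen(sentence)
# result == ['a-to-g', 'mutation', 'nucleotide', 'position', '3243', 'gene', 'phenotyp']
```

A real Porter stemmer would also reduce "nucleotide" further; the stand-in only shows where stemming sits in the pipeline.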
Text description
• word_to_string(token)
Numerical:
• lenght(token), word_frequency(token), distance_word_category(token1,token2)
Structural:
• s_part_of(token1,token2), first(token), last(token), first_is_char(token), first_is_numeric(token), middle_is_char(token), middle_is_numeric(token), last_is_char(token), last_is_numeric(token), single_char(token), follows(token1,token2)
Lexical:
• type_of(token), type_POS(token)
Domain dependent:
• word_category(token)
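Some of the structural descriptors can be sketched in Python as follows. The definitions are assumptions: in particular, "middle" is taken to mean every character between the first and the last one, tokens are assumed non-empty, and `describe` is a hypothetical helper that collects the facts in a dictionary rather than in ATRE's observation language.

```python
from typing import Dict, List, Tuple


def first_is_char(tok: str) -> bool:
    return tok[0].isalpha()


def first_is_numeric(tok: str) -> bool:
    return tok[0].isdigit()


def middle_is_char(tok: str) -> bool:
    middle = tok[1:-1]  # assumption: everything between first and last character
    return bool(middle) and middle.isalpha()


def middle_is_numeric(tok: str) -> bool:
    middle = tok[1:-1]
    return bool(middle) and middle.isdigit()


def last_is_char(tok: str) -> bool:
    return tok[-1].isalpha()


def last_is_numeric(tok: str) -> bool:
    return tok[-1].isdigit()


def single_char(tok: str) -> bool:
    return len(tok) == 1


def describe(tokens: List[str]) -> Dict[Tuple, object]:
    """Collect descriptor values for a token sequence, numbering tokens from 1."""
    facts: Dict[Tuple, object] = {}
    for i, tok in enumerate(tokens, start=1):
        facts[("word_to_string", i)] = tok
        facts[("first_is_char", i)] = first_is_char(tok)
        facts[("first_is_numeric", i)] = first_is_numeric(tok)
        facts[("middle_is_char", i)] = middle_is_char(tok)
        facts[("middle_is_numeric", i)] = middle_is_numeric(tok)
        facts[("last_is_char", i)] = last_is_char(tok)
        facts[("last_is_numeric", i)] = last_is_numeric(tok)
        facts[("single_char", i)] = single_char(tok)
    # follows(i, i+1): token i+1 comes right after token i
    for i in range(1, len(tokens)):
        facts[("follows", i, i + 1)] = True
    return facts


facts = describe(["A3243G", "3243", "G"])
```

For the hypothetical token "A3243G" this yields first_is_char, middle_is_numeric and last_is_char all true, which is the kind of pattern the background knowledge can later combine.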
Application
• We considered 71 documents selected by biologists
• Expert users manually annotated occurrences of entities of interest, namely
   Mutation: position, type, substitution, type_position, locus
   Subjects: nationality, method, pathology, category, number
• The extraction process (both learning and recognition) is performed locally on text portions of interest, which are automatically classified
Textual portions of papers were categorized into five classes: Abstract, Introduction, Materials & Methods, Discussion and Results
The abstract of each paper was processed
[Bar chart: correctly classified (%), axis from 0 to 100, for Abstract, Introduction, Methods, Results, Discussion; legend: Avg. No. of categories correctly classified]
Example description
An A-to-G mutation at nucleotide position (np) 3243 in the mitochondrial tRNALeu(UUR) gene is closely associated with various clinical phenotypes of diabetes mellitus.
[annotation(3)=substitution, annotation(4)=no_tag, annotation(5)=no_tag,
annotation(6)=no_tag, annotation(7)=position, annotation(8)=no_tag,
annotation(9)=locus, annotation(10)=no_tag, annotation(11)=no_tag,
annotation(12)=no_tag, annotation(13)=no_tag, annotation(14)=no_tag,
annotation(15)=no_tag, annotation(16)=pathology],

[part_of(1,2)=true, contain(2,3)=true, …, contain(2,16)=true,
word_to_string(3)='A-to-G', word_to_string(4)='mutation',
word_to_string(5)='nucleotid',
word_to_string(6)='position', word_to_string(7)='3243',
word_to_string(8)='mitochondri', word_to_string(9)='trnaleu(uur)',
word_to_string(10)='gene', word_to_string(11)='clos',
word_to_string(12)='associat', word_to_string(13)='variou',
word_to_string(14)='clinic', word_to_string(15)='phenotyp',
word_to_string(16)='diabetes_mellitus', type_of(3)=upperinitial, …,
type_of(7)=numeric, type_POS(3)=jj, type_POS(4)=nn, …, type_POS(15)=nns,
word_frequency(3)=3, word_frequency(4)=6, …, word_frequency(16)=1,
word_category(9)=locus, word_category(16)=disease,
distance_word_category(9,16)=1, follows(3,4)=true, follows(4,5)=true, …,
follows(14,15)=true, follows(15,16)=true]).
Background knowledge
• follows(X,Z) ← follows(X,Y)=true, follows(Y,Z)=true
• char_number_char(X)=true ← first_is_char(X)=true, middle_is_numeric(X)=true, last_is_char(X)=true
• number_char_char(X)=true ← first_is_numeric(X)=true, middle_is_char(X)=true, last_is_char(X)=true
• char_char_number(X)=true ← first_is_char(X)=true, middle_is_char(X)=true, last_is_numeric(X)=true
Domain knowledge:
• word_to_string(X)=transition ← word_to_string(X)=transversion
• word_to_string(X)=substitution ← word_to_string(X)=replacement
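The effect of these clauses can be sketched in Python. This is only an illustration under assumptions: the recursive follows clause is rendered as a transitive closure over (earlier, later) pairs, char_number_char is computed directly on the token string, and the domain knowledge is simplified to a lookup table that replaces a word with its generalisation (in the logic program the token would satisfy both). All function names are hypothetical.

```python
def transitive_closure(pairs):
    """Close a set of follows(x, y) pairs under follows(X,Z) <- follows(X,Y), follows(Y,Z)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure


def char_number_char(tok: str) -> bool:
    """first_is_char, middle_is_numeric and last_is_char all hold."""
    return (len(tok) >= 3 and tok[0].isalpha()
            and tok[1:-1].isdigit() and tok[-1].isalpha())


# Simplified rendering of the domain knowledge clauses.
GENERALISES_TO = {"transversion": "transition", "replacement": "substitution"}


def generalise(word: str) -> str:
    return GENERALISES_TO.get(word, word)


direct = {(3, 4), (4, 5), (5, 6)}
closed = transitive_closure(direct)
# closed == {(3, 4), (4, 5), (5, 6), (3, 5), (4, 6), (3, 6)}
```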
Experiments
• Mutation template
   - 6-fold cross validation
   - The user manually annotates 355 tokens (8.65 per abstract)
   - About 11% positives
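A k-fold split of the kind mentioned above can be sketched in Python. The slides do not say how the folds were formed, so splitting at the level of whole abstracts, the round-robin assignment and the identifiers below are all assumptions made for illustration.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical abstract identifiers; the real corpus is not reproduced here.
abstracts = [f"abstract_{i}" for i in range(12)]
splits = list(k_fold_splits(abstracts, 6))
```

Each of the 6 iterations learns a theory on `train` and evaluates it on the held-out `test` abstracts.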
Learned theories
annotation(X1)=position ←
   follows(X2,X1)=true, type_of(X1)=numeric, follows(X1,X3)=true,
   word_category(X3)=gene, word_to_string(X2)=position.
annotation(X1)=type ←
   follows(X1,X2)=true, word_frequency(X2) in [8..140],
   follows(X3,X1)=true, annotation(X3)=substitution.
annotation(X1)=position ←
   follows(X2,X1)=true, annotation(X2)=substitution, follows(X3,X1)=true,
   follows(X1,X4)=true, word_frequency(X4) in [6..6],
   annotation(X3)=type, follows(X1,X5)=true, annotation(X5)=locus,
   word_frequency(X1) in [1..2].
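The dependency between concepts in the first two clauses can be illustrated with a Python sketch. It is a simplification under stated assumptions: follows is restricted to immediate adjacency, the token features (frequencies, the gene category) are invented, and the substitution label is supplied as a seed because no clause for it is shown above. Rules are applied repeatedly until no new annotation is produced.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    text: str
    frequency: int                 # hypothetical word_frequency value
    is_numeric: bool = False
    category: Optional[str] = None


def rule_position(i: int, toks: List[Token], ann: Dict[int, str]) -> bool:
    """First clause: numeric token preceded by 'position', followed by a gene."""
    return (0 < i < len(toks) - 1
            and toks[i].is_numeric
            and toks[i - 1].text == "position"
            and toks[i + 1].category == "gene")


def rule_type(i: int, toks: List[Token], ann: Dict[int, str]) -> bool:
    """Second clause: depends on the annotation of the preceding token."""
    return (0 < i < len(toks) - 1
            and 8 <= toks[i + 1].frequency <= 140
            and ann.get(i - 1) == "substitution")


RULES = [("position", rule_position), ("type", rule_type)]


def annotate(toks: List[Token], seed: Dict[int, str]) -> Dict[int, str]:
    """Apply the rules until a fixpoint is reached, starting from seed labels."""
    ann = dict(seed)
    changed = True
    while changed:
        changed = False
        for i in range(len(toks)):
            if i in ann:
                continue
            for label, rule in RULES:
                if rule(i, toks, ann):
                    ann[i] = label
                    changed = True
                    break
    return ann


tokens = [
    Token("a-to-g", 3),
    Token("transition", 6),
    Token("position", 10),
    Token("3243", 1, is_numeric=True),
    Token("trnaleu(uur)", 2, category="gene"),
]
with_seed = annotate(tokens, {0: "substitution"})
without_seed = annotate(tokens, {})
```

With the seed, token 1 is labelled `type` and token 3 `position`; without it, only `position` is found, which shows why the extractors depend on each other.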
Wrap-up
• IE in Biomedicine
• The ILP approach to IE within a multi-relational framework makes it possible to implicitly define
   - Domain knowledge
   - Learning from users’ interaction
   - Relational representations
   - Learning relational patterns to allow context-sensitive application of models
• Recursive Theory Learning in IE: ATRE
• Efforts on text processing level:
   - Ambiguities
   - Data sparseness
   - Noise
• Encouraging results on a real-world data set
Where from here?
• Test on available corpora for Bio IE
   - Genia
   - BioCreative
   - NLPBA
   - Genic interaction challenges
• Investigation of semi-supervised approaches: online extension of dictionaries
• How to encapsulate taxonomical knowledge?
• Can information extracted by ATRE really be used as background knowledge for genomic database mining?

More Related Content

What's hot

CV-English.doc
CV-English.docCV-English.doc
CV-English.docbutest
 
Presentation
PresentationPresentation
Presentationsidra ali
 
Detecting the High Level Similarities in Software Implementation Process Usin...
Detecting the High Level Similarities in Software Implementation Process Usin...Detecting the High Level Similarities in Software Implementation Process Usin...
Detecting the High Level Similarities in Software Implementation Process Usin...IOSR Journals
 
ONTOLOGY VISUALIZATION PROTÉGÉ TOOLS – A REVIEW
ONTOLOGY VISUALIZATION PROTÉGÉ TOOLS – A REVIEW ONTOLOGY VISUALIZATION PROTÉGÉ TOOLS – A REVIEW
ONTOLOGY VISUALIZATION PROTÉGÉ TOOLS – A REVIEW ijait
 
Artista a network for ar tifical immune sys tems
Artista a network for ar tifical immune sys temsArtista a network for ar tifical immune sys tems
Artista a network for ar tifical immune sys temsUltraUploader
 
Representation of ontology by Classified Interrelated object model
Representation of ontology by Classified Interrelated object modelRepresentation of ontology by Classified Interrelated object model
Representation of ontology by Classified Interrelated object modelMihika Shah
 
Genetic algorithms in molecular design of novel fabrics Sylvia Wower
Genetic algorithms in molecular design of novel fabrics Sylvia Wower Genetic algorithms in molecular design of novel fabrics Sylvia Wower
Genetic algorithms in molecular design of novel fabrics Sylvia Wower Sylvia Wower
 
So I have an SD File... What do I do next?
So I have an SD File... What do I do next?So I have an SD File... What do I do next?
So I have an SD File... What do I do next?baoilleach
 
Translating natural language competency questions into sparql queries web2013
Translating natural language competency questions into sparql queries   web2013Translating natural language competency questions into sparql queries   web2013
Translating natural language competency questions into sparql queries web2013Leila Zemmouchi-Ghomari
 

What's hot (18)

Reference Ontology Presentation
Reference Ontology PresentationReference Ontology Presentation
Reference Ontology Presentation
 
CV-English.doc
CV-English.docCV-English.doc
CV-English.doc
 
PhDc exam presentation
PhDc exam presentationPhDc exam presentation
PhDc exam presentation
 
B017441015
B017441015B017441015
B017441015
 
Presentation
PresentationPresentation
Presentation
 
Keynote at AgroLT 2008
Keynote at AgroLT 2008Keynote at AgroLT 2008
Keynote at AgroLT 2008
 
Detecting the High Level Similarities in Software Implementation Process Usin...
Detecting the High Level Similarities in Software Implementation Process Usin...Detecting the High Level Similarities in Software Implementation Process Usin...
Detecting the High Level Similarities in Software Implementation Process Usin...
 
H017445260
H017445260H017445260
H017445260
 
ONTOLOGY VISUALIZATION PROTÉGÉ TOOLS – A REVIEW
ONTOLOGY VISUALIZATION PROTÉGÉ TOOLS – A REVIEW ONTOLOGY VISUALIZATION PROTÉGÉ TOOLS – A REVIEW
ONTOLOGY VISUALIZATION PROTÉGÉ TOOLS – A REVIEW
 
IBSB tutorial
IBSB tutorialIBSB tutorial
IBSB tutorial
 
Artista a network for ar tifical immune sys tems
Artista a network for ar tifical immune sys temsArtista a network for ar tifical immune sys tems
Artista a network for ar tifical immune sys tems
 
Representation of ontology by Classified Interrelated object model
Representation of ontology by Classified Interrelated object modelRepresentation of ontology by Classified Interrelated object model
Representation of ontology by Classified Interrelated object model
 
Ontology
OntologyOntology
Ontology
 
Genetic algorithms in molecular design of novel fabrics Sylvia Wower
Genetic algorithms in molecular design of novel fabrics Sylvia Wower Genetic algorithms in molecular design of novel fabrics Sylvia Wower
Genetic algorithms in molecular design of novel fabrics Sylvia Wower
 
Artificial Intelligence of the Web through Domain Ontologies
Artificial Intelligence of the Web through Domain OntologiesArtificial Intelligence of the Web through Domain Ontologies
Artificial Intelligence of the Web through Domain Ontologies
 
CV
CVCV
CV
 
So I have an SD File... What do I do next?
So I have an SD File... What do I do next?So I have an SD File... What do I do next?
So I have an SD File... What do I do next?
 
Translating natural language competency questions into sparql queries web2013
Translating natural language competency questions into sparql queries   web2013Translating natural language competency questions into sparql queries   web2013
Translating natural language competency questions into sparql queries web2013
 

Similar to download

download
downloaddownload
downloadbutest
 
download
downloaddownload
downloadbutest
 
Semantics for Bioinformatics: What, Why and How of Search, Integration and An...
Semantics for Bioinformatics: What, Why and How of Search, Integration and An...Semantics for Bioinformatics: What, Why and How of Search, Integration and An...
Semantics for Bioinformatics: What, Why and How of Search, Integration and An...Amit Sheth
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Sciencedrnigam
 
Greene Bosc2008
Greene Bosc2008Greene Bosc2008
Greene Bosc2008bosc_2008
 
Recruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security FeaturesRecruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security Featurestheijes
 
The seven-deadly-sins-of-bioinformatics3960
The seven-deadly-sins-of-bioinformatics3960The seven-deadly-sins-of-bioinformatics3960
The seven-deadly-sins-of-bioinformatics3960mare34
 
The Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsThe Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsDuncan Hull
 
Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069Thomas Burguiere
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsmikaelhuss
 
II-SDV 2012 From (Text) Mining to Models: Applying Large-Scale Text Mining on...
II-SDV 2012 From (Text) Mining to Models: Applying Large-Scale Text Mining on...II-SDV 2012 From (Text) Mining to Models: Applying Large-Scale Text Mining on...
II-SDV 2012 From (Text) Mining to Models: Applying Large-Scale Text Mining on...Dr. Haxel Consult
 
Computational of Bioinformatics
Computational of BioinformaticsComputational of Bioinformatics
Computational of Bioinformaticsijtsrd
 
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...kevig
 
Knowledge Science for AI-based biomedical and clinical applications
Knowledge Science for AI-based biomedical and clinical applicationsKnowledge Science for AI-based biomedical and clinical applications
Knowledge Science for AI-based biomedical and clinical applicationsCatia Pesquita
 
[DSC Croatia 22] Writing scientific papers about data science projects - Mirj...
[DSC Croatia 22] Writing scientific papers about data science projects - Mirj...[DSC Croatia 22] Writing scientific papers about data science projects - Mirj...
[DSC Croatia 22] Writing scientific papers about data science projects - Mirj...DataScienceConferenc1
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the partsCarole Goble
 

Similar to download (20)

download
downloaddownload
download
 
download
downloaddownload
download
 
Semantics for Bioinformatics: What, Why and How of Search, Integration and An...
Semantics for Bioinformatics: What, Why and How of Search, Integration and An...Semantics for Bioinformatics: What, Why and How of Search, Integration and An...
Semantics for Bioinformatics: What, Why and How of Search, Integration and An...
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Science
 
Prosdocimi ucb cdao
Prosdocimi ucb cdaoProsdocimi ucb cdao
Prosdocimi ucb cdao
 
A biologist in e-Science
A biologist in e-ScienceA biologist in e-Science
A biologist in e-Science
 
Greene Bosc2008
Greene Bosc2008Greene Bosc2008
Greene Bosc2008
 
BioPortal: ontologies and integrated data resources at the click of a mouse
BioPortal: ontologies and integrated data resourcesat the click of a mouseBioPortal: ontologies and integrated data resourcesat the click of a mouse
BioPortal: ontologies and integrated data resources at the click of a mouse
 
Recruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security FeaturesRecruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security Features
 
Semantic annotation of biomedical data
Semantic annotation of biomedical dataSemantic annotation of biomedical data
Semantic annotation of biomedical data
 
The seven-deadly-sins-of-bioinformatics3960
The seven-deadly-sins-of-bioinformatics3960The seven-deadly-sins-of-bioinformatics3960
The seven-deadly-sins-of-bioinformatics3960
 
The Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsThe Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of Bioinformatics
 
Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
II-SDV 2012 From (Text) Mining to Models: Applying Large-Scale Text Mining on...
II-SDV 2012 From (Text) Mining to Models: Applying Large-Scale Text Mining on...II-SDV 2012 From (Text) Mining to Models: Applying Large-Scale Text Mining on...
II-SDV 2012 From (Text) Mining to Models: Applying Large-Scale Text Mining on...
 
Computational of Bioinformatics
Computational of BioinformaticsComputational of Bioinformatics
Computational of Bioinformatics
 
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
Top 5 MOST VIEWED LANGUAGE COMPUTING ARTICLE - International Journal on Natur...
 
Knowledge Science for AI-based biomedical and clinical applications
Knowledge Science for AI-based biomedical and clinical applicationsKnowledge Science for AI-based biomedical and clinical applications
Knowledge Science for AI-based biomedical and clinical applications
 
[DSC Croatia 22] Writing scientific papers about data science projects - Mirj...
[DSC Croatia 22] Writing scientific papers about data science projects - Mirj...[DSC Croatia 22] Writing scientific papers about data science projects - Mirj...
[DSC Croatia 22] Writing scientific papers about data science projects - Mirj...
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the parts
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同

  • 1. Department of Computer Science, University of Bari. Knowledge Acquisition & Machine Learning Lab. CILC 2006, Convegno Italiano di Logica Computazionale, 26-27 June 2006, Dipartimento di Informatica, Bari. Learning for Biomedical Information Extraction with ILP. Margherita Berardi, Vincenzo Giuliano, Donato Malerba
  • 2. Outline of the talk: IE for Biomedicine; Looking around; IE problem formulation; which representation model on data? which features?; which framework for reasoning?; Mutual Recursion in IE; Text processing & domain knowledge; Application to studies on mitochondrial genome; Conclusions & Future work
  • 3. What is “Information Extraction” As a task: filling slots in a database from sub-segments of text. Example text: October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Table to fill (empty): NAME, TITLE, ORGANIZATION
  • 4. What is “Information Extraction” As a task: filling slots in a database from sub-segments of text. The news text of slide 3 is repeated; IE fills the table:
      NAME              TITLE    ORGANIZATION
      Bill Gates        CEO      Microsoft
      Bill Veghte       VP       Microsoft
      Richard Stallman  founder  Free Soft..
  • 5. IE from Biomedical Texts: Motivation. Complexity of biological systems: too many specialized biological tasks; several entities interacting in a single phenomenon; many conditions to simultaneously verify. Complexity of biomedical languages: several nomenclatures, dictionaries, lexica, tending to quickly become obsolete. Too much to read! Genome decoding → increasing amount of published literature
  • 6. IE History. Message Understanding Conference (MUC), DARPA [’87-’95]; TIPSTER [’92-’96]. Most early work dominated by hand-built models, e.g. SRI’s FASTUS, hand-built FSMs. But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek ’97], BBN [Bikel et al. ’98]. Wrapper induction: initially hand-built, then ML [Soderland ’96], [Kushmerick ’97], … Most learning attempts are based on statistical approaches: learning of production rules constrained by probability measures (e.g., HMMs, Probabilistic Context-free Grammars). Some recent logic-based approaches: Rapier (Califf ’98); SRV (Freitag ’98); INTHELEX (Ferilli et al. ’01); FOIL-based (Aitken ’02); Aleph-based (Goadrich et al. ’04)
  • 7. Learning Language in biomedicine. BioCreAtIvE, Critical Assessment for Information Extraction in Biology (http://biocreative.sourceforge.net/); BioNLP, Natural language processing of biology text (http://www.bionlp.org); ACL/COLING Workshops on Natural Language Processing in Biomedicine; SIGIR Workshops on Text Analysis for Bioinformatics; Special Interest Group in Text Mining since ISMB’03 (Intelligent Systems for Molecular Biology): BioLINK (Biology Literature, Information and Knowledge); PSB (Pacific Symposium on Biocomputing) tracks; Genomic tracks in TREC (Text Retrieval Conference); PASCAL challenges on information extraction (http://nlp.shef.ac.uk/pascal/); Workshops: IJCAI, ECAI, ECML/PKDD, ICML (Learning Language in Logic since ’99, challenge task on Extracting Relations from Biomedical Texts)
  • 8. Is there “Logic” in language learning? IE systems limitations, in general: portability (domain-dependent, task-dependent); scalability (work well on “relevant” data). Statistics-based approaches: wide coverage; scalability; no semantics; no domain knowledge. Logic-based approaches: natural encoding of natural language statements and queries in first-order logic; human-comprehensible models; domain knowledge; refinement of models. [R. J. Mooney, Learning for Semantic Interpretation: Scaling Up Without Dumbing Down, ICML Workshop on Language Learning in Logic, 1999]
  • 9. IE problem formulation for HmtDB. HmtDB: a resource of variability data associated with clinical phenotypes, concerning the human mitochondrial genome (http://www.hmdb.uniba.it/)
  • 10. Textual Entity Extraction. Example: “Cytoplasts from two unrelated patients with MELAS (mitochondrial myopathy, encephalopathy, lactic acidosis, and strokelike episodes) harboring an A→G transition at nucleotide position 3243 in the tRNALeu(UUR) gene of the mitochondrial genome were fused with human cells lacking endogenous mitochondrial DNA (mtDNA)”. Entities to extract: the pathology associated with the mutation under study; the substitution that causes the mutation; the type of the mutation; the position in the DNA where the mutation occurs; the gene correlated with the mutation. By modelling the sentence structure: substitution(X) ← follows(Y,X), type(Y). Extractors cannot be learned independently!
  • 11. Textual Entity Extraction. Each entity is characterized by some slots defining a template. The task is to learn rules to fill slots (template filling). Relations in data may allow: intra-template dependencies to be learned; context-sensitive application of “extractors”. [Figure labels: templates Mutation, Sampled population, DNA sample tissue, DNA screening method, …; paper sections Title, Abstract, Introduction, Methods]
  • 12. The learning task. Classification: each class (slot) is a concept (target predicate); each model (template filler) induced for the class is a logical theory explaining the concept (set of predicate definitions). Predefined models of classification should be provided. Importance of domain knowledge and first-order representations. Usefulness of mutual recursion (concept dependencies). ILP = Inductive Learning ∩ Logic Programming. From IL: inductive reasoning from observations and background knowledge. From LP: first-order logic as representation formalism
  • 13. ATRE (Apprendimento di Teorie Ricorsive da Esempi, i.e. Learning Recursive Theories from Examples) http://www.di.uniba.it/~malerba/software/atre/ Given: a set of concepts C1, C2, ..., Cr; a set of objects O described in a language L_O; a background knowledge BK described in a language L_BK; a language of hypotheses L_H that defines the space of hypotheses S_H; a user’s preference criterion PC. Find: a (possibly recursive) logical theory T for the concepts C1, C2, ..., Cr, such that T is complete and consistent with respect to the set of observations and satisfies the preference criterion PC.
  • 14. ATRE Main Characteristics. Learning problem: induce recursive theories from examples. ILP setting: learning from interpretations. Observation language: ground multiple-head clauses. Hypothesis language: non-ground definite clauses. Constraints: linkedness + range-restrictedness. Generalization model: generalized implication. Search strategy for a recursive theory: separate-and-parallel-conquer. Continuous and discrete attributes and relations. Background knowledge: intensionally defined
  • 15. Data preparation. ATRE’s observation language: multiple-head clauses. Enumeration of positive and negative examples (expert users’ manual annotations + unlabelled tokens). Descriptions of examples: which features? Statistical (frequencies); Lexical (alphanumeric, capitalized, …); Syntactical (nouns, verbs, adjectives, …); Domain-specific (dictionaries)
  • 16. [slide with no transcribed text]
  • 17. Text processing. The GATE (A General Architecture for Text Engineering) framework (http://gate.ac.uk/). ANNIE is the IE core: Tokeniser; Sentence Splitter; POS tagger; Morphological Analyser; Gazetteers; Semantic tagger (JAPE transducer); Orthomatcher (orthographic coreference). Some domain-specific gazetteers have been added (diseases, enzymes, genes, methods of analysis)
  • 18. Text processing. Some regular expressions to capture domain-specific patterns (alphanumeric strings, appositions, etc.). Shallow acronym resolution. Screening operations: some POSs (nouns, verbs, adjectives, numbers, symbols); punctuation; stopwords (glimpse.cs.arizona.edu). Stemming (Porter)
  • 19. Text description. word_to_string(token). Numerical: lenght(token), word_frequency(token), distance_word_category(token1,token2). Structural: s_part_of(token1,token2), first(token), last(token), first_is_char(token), first_is_numeric(token), middle_is_char(token), middle_is_numeric(token), last_is_char(token), last_is_numeric(token), single_char(token), follows(token1,token2). Lexical: type_of(token), type_POS(token). Domain dependent: word_category(token)
  • 20. Application. We considered 71 documents selected by biologists. Expert users manually annotated occurrences of the entities of interest, namely Mutation (position, type, substitution, type_position, locus) and Subjects (nationality, method, pathology, category, number). The extraction process (both learning and recognition) is performed locally on the text portions of interest, which are automatically classified
  • 21. Textual portions of papers were categorized into five classes: Abstract, Introduction, Materials & Methods, Discussion and Results. The abstract of each paper was processed. [Chart: correctly classified (%), on a 0-100 scale, for the categories Abstract, Introduction, Methods, Results, Discussion; caption: Avg. No. of categories correctly classified]
  • 22. Example description. Sentence: “An A-to-G mutation at nucleotide position (np) 3243 in the mitochondrial tRNALeu(UUR) gene is closely associated with various clinical phenotypes of diabetes mellitus.” Its description:
      [annotation(3)=substitution, annotation(4)=no_tag, annotation(5)=no_tag, annotation(6)=no_tag, annotation(7)=position, annotation(8)=no_tag, annotation(9)=locus, annotation(10)=no_tag, annotation(11)=no_tag, annotation(12)=no_tag, annotation(13)=no_tag, annotation(14)=no_tag, annotation(15)=no_tag, annotation(16)=pathology]
      ←
      [part_of(1,2)=true, contain(2,3)=true, …, contain(2,16)=true, word_to_string(3)='A-to-G', word_to_string(4)='mutation', word_to_string(5)='nucleotid', word_to_string(6)='position', word_to_string(7)='3243', word_to_string(8)='mitochondri', word_to_string(9)='trnaleu(uur)', word_to_string(10)='gene', word_to_string(11)='clos', word_to_string(12)='associat', word_to_string(13)='variou', word_to_string(14)='clinic', word_to_string(15)='phenotyp', word_to_string(16)='diabetes_mellitus', type_of(3)=upperinitial, …, type_of(7)=numeric, type_POS(3)=jj, type_POS(4)=nn, …, type_POS(15)=nns, word_frequency(3)=3, word_frequency(4)=6, …, word_frequency(16)=1, word_category(9)=locus, word_category(16)=disease, distance_word_category(9,16)=1, follows(3,4)=true, follows(4,5)=true, …, follows(14,15)=true, follows(15,16)=true]).
  • 23. Background knowledge:
      follows(X,Z) ← follows(X,Y)=true, follows(Y,Z)=true
      char_number_char(X)=true ← first_is_char(X)=true, middle_is_numeric(X)=true, last_is_char(X)=true
      number_char_char(X)=true ← first_is_numeric(X)=true, middle_is_char(X)=true, last_is_char(X)=true
      char_char_number(X)=true ← first_is_char(X)=true, middle_is_char(X)=true, last_is_numeric(X)=true
    Domain knowledge: word_to_string(X)=transition; word_to_string(X)=transversion; word_to_string(X)=substitution; word_to_string(X)=replacement
  • 24. Experiments. Mutation template. 6-fold cross validation. The user manually annotates 355 tokens (8.65 per abstract). About 11% positives
  • 25. Experiments
  • 26. Learned theories:
      annotation(X1)=position ← follows(X2,X1)=true, type_of(X1)=numeric, follows(X1,X3)=true, word_category(X3)=gene, word_to_string(X2)=position.
      annotation(X1)=type ← follows(X1,X2)=true, word_frequency(X2) in [8..140], follows(X3,X1)=true, annotation(X3)=substitution
      annotation(X1)=position ← follows(X2,X1)=true, annotation(X2)=substitution, follows(X3,X1)=true, follows(X1,X4)=true, word_frequency(X4) in [6..6], annotation(X3)=type, follows(X1,X5)=true, annotation(X5)=locus, word_frequency(X1) in [1..2]
  • 27. Wrap-up. IE in Biomedicine. The ILP approach to IE within a multi-relational framework makes it possible to implicitly define: domain knowledge; learning from users’ interaction; relational representations; learning relational patterns to allow context-sensitive application of models. Recursive Theory Learning in IE: ATRE. Efforts on the text processing level: ambiguities; data sparseness; noise. Encouraging results on a real-world data set
  • 28. Where from here? Test on available corpora for Bio IE: Genia; BioCreative; NLPBA; Genic interaction challenges. Investigation of semi-supervised approaches: online extension of dictionaries. How to encapsulate taxonomical knowledge? Can information extracted by ATRE really be used as background knowledge for genomic database mining?
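
As a rough illustration of how the token descriptors of slide 19, the background knowledge of slide 23 and the first learned clause of slide 26 fit together, here is a minimal Python sketch. It is not ATRE's implementation. The function names mirror the slide predicates, but their definitions are assumptions (in particular, the "middle" of a token is taken to be every character strictly between the first and the last). The four-token toy description and the `gene` category on its last token are hypothetical.

```python
from itertools import product

# Assumed reading of the structural predicates: "middle" means every
# character strictly between the first and the last one.
def first_is_char(s):
    return s[:1].isalpha()

def last_is_char(s):
    return s[-1:].isalpha()

def middle_is_numeric(s):
    return len(s) >= 3 and s[1:-1].isdigit()

def char_number_char(s):
    """char_number_char(X) <- first_is_char(X), middle_is_numeric(X), last_is_char(X)."""
    return first_is_char(s) and middle_is_numeric(s) and last_is_char(s)

def type_of(s):
    """Simplified lexical type; only the 'numeric' value is needed below."""
    return "numeric" if s.isdigit() else "other"

def transitive_closure(pairs):
    """follows(X,Z) <- follows(X,Y), follows(Y,Z), applied until nothing new is derived."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (x, y), (y2, z) in product(tuple(closure), repeat=2):
            if y == y2 and (x, z) not in closure:
                closure.add((x, z))
                changed = True
    return closure

def position_tokens(tokens, follows):
    """annotation(X1)=position <- follows(X2,X1), type_of(X1)=numeric,
    follows(X1,X3), word_category(X3)=gene, word_to_string(X2)=position."""
    hits = []
    for x1, tok in tokens.items():
        if type_of(tok["string"]) != "numeric":
            continue
        has_x2 = any(b == x1 and tokens[a]["string"] == "position" for a, b in follows)
        has_x3 = any(a == x1 and tokens[b].get("category") == "gene" for a, b in follows)
        if has_x2 and has_x3:
            hits.append(x1)
    return sorted(hits)

# Hypothetical toy description: stemmed tokens left after screening.
tokens = {
    1: {"string": "mutation"},
    2: {"string": "position"},
    3: {"string": "3243"},
    4: {"string": "trnaleu(uur)", "category": "gene"},
}
follows = transitive_closure({(1, 2), (2, 3), (3, 4)})

print(char_number_char("A3243G"))         # True
print(position_tokens(tokens, follows))   # [3]
```

The clause is evaluated against the transitively closed `follows` relation, which is one plausible reading of how the background knowledge is used. A learner restricted to adjacent tokens would pass the three original pairs instead.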

Editor's Notes

  1. First I’ll introduce the peculiarities of SDM. They are particularly interesting because the practice of geo-referencing data has caused a growing demand for powerful exploratory data analysis techniques that go beyond classical statistical and data mining techniques and, among other things, support the analysis of socio-economic phenomena from a spatial point of view. In this talk I’ll focus on a specific task, the discovery of spatial association rules. For this purpose I’ll present ARES, a system to extract association rules from census data, and illustrate an application of ARES to mine spatial association rules on North West England 1998 census data in order to study the mortality risk in Greater Manchester county.
  2. What is IE? As a task it is… Starting with some text and an empty database with a defined ontology of fields and records, use the information in the text to fill the database.
  3. ML… although this is an area where ML has not yet trounced the hand-built systems. In some of the latest evaluations, a hand-built system shared 1st place with an ML system. Many companies are now making a business from IE (from the Web): WhizBang, Inxight, Intelliseek, ClearForest.
  4. Data sparseness, robustness
  5. CV: the data set is divided into 5 folds (in turn, four are used for training and one for testing).
  6. Initial ILP research dealt with concept learning in the form of predicate definition learning.
  7. ATRE is a multiple-concept learning system, which solves the following problem:
  8. Since the generation of a clause depends on the chosen seed, several seeds have to be chosen such that at least one seed per incomplete predicate definition is kept. Therefore, the search space is actually a forest of as many search trees as the number of chosen seeds. [Figure: the parallel exploration of the forest for the odd and even numbers example.] Specialization hierarchies are traversed top-down. Search proceeds towards deeper and deeper levels of the specialization hierarchies until at least a user-defined number of consistent clauses is found. A supervisor task decides whether the search should carry on or not, on the basis of the results returned by the concurrent tasks. When the search is stopped, the supervisor selects the “best” consistent clause according to the user’s preference criterion. This strategy has the advantage that simpler consistent clauses are found first, independently of the predicates to be learned. First learning step; consistent clauses are shown in red.
  9. Second learning step
  10. CV: the data set is divided into 5 folds (in turn, four are used for training and one for testing).
  11. If we guarantee the following two conditions: … then after a finite number of steps a theory T, which is complete and consistent, is built. If we denote by LHM(T_i) the least Herbrand model of a theory T_i, the stepwise construction of theories entails that LHM(T_i) ⊆ LHM(T_{i+1}) for each i ∈ {0, 1, …, n-1}, since the addition of a clause to a theory can only augment the LHM.
  12. In order to guarantee the first of the two conditions it is possible to proceed as follows. First, a positive example e+ of a predicate p to be learned is selected, such that e+ is not in LHM(T_i). The example e+ is called the seed. Then the space of definite clauses more general than e+ is explored, looking for a clause C, if any, such that neg(LHM(T_i ∪ {C})) = ∅. In this way we guarantee that the second condition above holds as well. When found, C is added to T_i, giving T_{i+1}. If some positive examples are not included in LHM(T_{i+1}), then a new seed is selected and the process is repeated. The second condition is more difficult to guarantee because of the non-monotonicity property. The approach followed in ATRE to remove inconsistency due to the addition of a clause to the theory consists of simple syntactic changes in the theory, which eventually create new layers. The layering of a theory introduces a first variation of the classical separate-and-conquer strategy sketched above, since the addition of a locally consistent clause generated in the conquer stage is preceded by a global consistency check.
  13. Learning multi-relational patterns from multi-relational data and background knowledge. It allows navigating the relational structure of the data.
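
The monotonicity statement in note 11 and the completeness and consistency checks behind notes 11 and 12 can be illustrated with a small Python sketch. It computes least Herbrand models by naive forward chaining over ground definite clauses. The odd/even theory is an assumption made for illustration (note 8 only mentions odd and even numbers), and the clauses are grounded by hand over 0..3. This is not a description of how ATRE itself is implemented.

```python
def least_herbrand_model(clauses):
    """Naive forward chaining over ground definite clauses (head, body)."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in model and all(atom in model for atom in body):
                model.add(head)
                changed = True
    return model

# Background knowledge: succ(0,1), succ(1,2), succ(2,3) as ground facts.
succ = [(n, n + 1) for n in range(3)]
bk = [(("succ", y, x), ()) for y, x in succ]

# Stepwise construction T0, T1, T2 (assumed example theory):
#   T0 = BK + even(0)
#   T1 = T0 + ground instances of odd(X)  <- succ(Y,X), even(Y)
#   T2 = T1 + ground instances of even(X) <- succ(Y,X), odd(Y)
t0 = bk + [(("even", 0), ())]
t1 = t0 + [(("odd", x), (("succ", y, x), ("even", y))) for y, x in succ]
t2 = t1 + [(("even", x), (("succ", y, x), ("odd", y))) for y, x in succ]

models = [least_herbrand_model(t) for t in (t0, t1, t2)]

positives = {("even", 0), ("odd", 1), ("even", 2), ("odd", 3)}
negatives = {("odd", 0), ("even", 1), ("odd", 2), ("even", 3)}

# LHM(T_i) is a subset of LHM(T_{i+1}): adding clauses never removes atoms.
monotone = all(a <= b for a, b in zip(models, models[1:]))
# Complete: every positive example is in the final model.
complete = positives <= models[-1]
# Consistent: no negative example is in the final model.
consistent = not (negatives & models[-1])

print(monotone, complete, consistent)  # True True True
```

In this toy theory odd(3) only becomes derivable at T2, once the even clause is added. That is the kind of dependency between predicate definitions that makes learning them independently inadequate.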