Generating Lexical Information for Terminology
in a Bioinformatics Ontology
Hammad Afzal (1,3), Paul Buitelaar (1), Philipp Cimiano (2), John McCrae (2), Tobias Wunner (1)

(1) Unit for Natural Language Processing, Digital Enterprise Research Institute,
National University of Ireland, Galway, Ireland
(2) Semantic Computing Group, Center of Excellence (CITEC),
Bielefeld University, Bielefeld, Germany
(3) Department of Computer Science, College of Telecommunication Engineering, National
University of Sciences and Technology, Pakistan

Motivation
 Lack of linguistic expressiveness in formally specified ontologies
 Ontologies are typically developed to provide a shared view of a domain’s knowledge.
 They do not necessarily support natural language processing (NLP) tasks.

 Solutions:
 Terminologies can include linguistic information to facilitate the use of ontologies for text
processing, e.g. the Specialist Lexicon contains lexical variants of many terms that are
used in the biomedical domain.
 The Simple Knowledge Organization System (SKOS) format provides a standard way to
represent knowledge organization systems using the Resource Description Framework
(RDF).

 Limitations:
 SKOS provides a data model for representing classification schemes such as thesauri
by introducing a further typology of labels (preferred, alternative, hidden, etc.). It
is not intended to associate more sophisticated lexical and linguistic information
with an arbitrary ontology.

Desiderata for Ontology-Lexicon model
 Separation between linguistic and ontological Level
 Develop lexica independently of specific ontologies for the same domain
 Allow different lexica for each ontology

 Independence between linguistic and ontological level
 No mutual constraints
 Ontological structures/concepts do not need to have a corresponding representation
of linguistic structure and vice versa

 Detailed information on linguistic realization
 Part of speech, morphology (inflection, decomposition), syntactic structure
(subcategorization frames), etc.

 Support for multi-linguality

Towards our approach: LexInfo
 Recent principled approaches to associate linguistic information
to an arbitrary ontology:
 LingInfo: modeling morpho-syntactic decomposition of (complex) terms [Buitelaar et al.
2006]

 LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]

 Lexical Markup Framework (LMF): ISO standardized model for representing machine
readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]

 LexInfo: building on LMF as a core, develop a model which “subsumes”
LingInfo and LexOnto for flexibly associating linguistic information to
ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]

Case Study: Lexicalizing a bioinformatics ontology
 Creating a LexInfo-based lexicon for lexical enrichment of a bioinformatics
ontology i.e. the myGrid ontology (Wolstencroft et al., 2007).

 Lexical information is derived from semantic lexicons such as WordNet
(Fellbaum, 1998), and a domain related corpus.

Key points:

 The capture of morpho-syntactic behavior such as part-of-speech (POS),
decomposition, lemmatization and sub-categorization behaviour of lexical
elements.
 The lexicalized terms along with their linguistic information are added to
the OWL-based lexicon based on the LexInfo model.

MyGrid Ontology

 Supports Service Description of bioinformatics resources through service
annotation.

 Manual annotation is a slow process: e.g. in Taverna/Feta only ~15-20% of
services are functionally described. The result is a growing backlog
of un-annotated services.

 Some NLP-based attempts to automate service description using the
myGrid ontology have been reported.

 Lexicalization of the myGrid ontology can improve the performance of such
approaches.

• LexInfo
 A principled way to enrich ontologies with linguistic information.

 Provides a framework for automatic construction of 'lexicalized
ontologies' on top of existing ontologies and lexical resources (Buitelaar
et al, 2009)

• Main characteristics:
 Two separate domains of discourse, kept apart by using different namespaces:
 Domain ontology and LexInfo model
 The domain ontology defines the classes, properties and individuals in that
domain.
 The main entities in the lexical domain of discourse are instances of the class
LexicalEntry.
 LexInfo attaches lexical information (e.g. part-of-speech, morphological,
subcategorization) to lexical entries.

Rest of the talk
• Methodology
 Dual approach towards lexicalization of myGrid ontology
 Collection of Bioinformatics Corpus
 Lexicalization of Class Labels
 Lexicalization of Property Labels

• Statistics, Experiments and Results
 Semi-automatically created lexicon
 Automatically generated lexicon

• What’s Next

Methodology - I
 Dual approach towards lexicalization of myGrid ontology
 Semi-automatically created LexInfo-based lexicon.

 Automatically created lexicon using LexInfo ontology lexicalization
service.

 Difference:

 In the semi-automatically created lexicon, the linguistic information is
mainly derived from the domain corpus and manually analyzed to verify its
correctness.

 In automatic generation, a generic POS tagger and domain-independent
lexical resources are used to derive morpho-syntactic behaviour from
an automatic analysis of the labels of the concepts, properties and
individuals in the ontology.

Methodology - II
 Collection of Bioinformatics Corpus

 Domain specific behaviour (linguistic information) of the lexical entries is
derived from 2691 full text journal articles of BMC Bioinformatics.

 The GeniaTagger is used to get POS information; the tags of interest are
Nouns, Proper Nouns, Verbs and Adjectives.

 Syntactic information is derived using the Stanford parser.
 Currently, we have worked only on the syntactic behaviour of properties
(owl:ObjectProperty and owl:DatatypeProperty in particular), not of classes.
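
The restriction to nouns, proper nouns, verbs and adjectives can be sketched as a filter over tagger output. This is a minimal illustration, assuming the tagger emits (token, tag) pairs with Penn Treebank style tags; the `keep_content_words` helper and the sample sentence are hypothetical and are not part of GeniaTagger.

```python
# Penn Treebank tag prefixes: NN* covers nouns and proper nouns,
# VB* covers verbs, JJ* covers adjectives.
TAGS_OF_INTEREST = ("NN", "VB", "JJ")


def keep_content_words(tagged_tokens):
    """Keep only the (token, tag) pairs whose tag is of interest."""
    return [(tok, tag) for tok, tag in tagged_tokens
            if tag.startswith(TAGS_OF_INTEREST)]


# Hypothetical tagger output for one sentence.
tagged = [("The", "DT"), ("service", "NN"), ("performs", "VBZ"),
          ("multiple", "JJ"), ("alignments", "NNS"), ("in", "IN"),
          ("Medline", "NNP")]
kept = keep_content_words(tagged)
print(kept)
```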

Methodology - III
• Lexicalization of Class Labels (Step-wise approach)

1. A LexicalEntry is created for each Class in the domain ontology and is
linked to that Class through the hasSense property.

2. The LexicalEntry is instantiated as one of its sub-classes (e.g. Noun, Verb,
Adjective, etc.).

3. The POS tag is derived from a semantic lexicon such as WordNet and
further supported by evidence from the associated domain corpus.

4. The lexical form (Lemma, WordForm, etc.) is attached to the lexical
entry through the corresponding relation: hasLemma or hasWordForm.
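
The four steps above can be sketched in plain Python as a list of subject-predicate-object triples. This is only an illustration under assumptions: the `lexicalize_class` helper, the `lex:`, `lexinfo:` and `mygrid:` prefixes, and the small `POS_LOOKUP` table standing in for WordNet and the domain corpus are hypothetical and are not the LexInfo API. Only LexicalEntry, hasSense, hasLemma and the Noun/Verb/Adjective sub-classes are taken from the slides.

```python
# Hypothetical stand-in for POS evidence from WordNet and the domain corpus.
POS_LOOKUP = {"alignment": "Noun", "perform": "Verb", "multiple": "Adjective"}


def lexicalize_class(class_name, pos_lookup=POS_LOOKUP):
    """Return illustrative triples for one single-word class label."""
    lemma = class_name.lower()
    entry = f"lex:{lemma}_entry"
    # Steps 2-3: pick the LexicalEntry sub-class from the POS evidence;
    # fall back to the generic LexicalEntry when nothing is known.
    entry_type = pos_lookup.get(lemma, "LexicalEntry")
    return [
        (entry, "rdf:type", f"lexinfo:{entry_type}"),
        # Step 1: link the entry to the domain class.
        (entry, "lexinfo:hasSense", f"mygrid:{class_name}"),
        # Step 4: attach the lexical form.
        (entry, "lexinfo:hasLemma", lemma),
    ]


triples = lexicalize_class("Alignment")
for triple in triples:
    print(triple)
```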

Methodology - III
• Lexicalization of Class Labels (Single Word)

The linking of LexicalEntry with a domain Class, and attachment of
grammatical information and lemma with LexicalEntry

Methodology - III
• Lexicalization of Class Labels (Multi-Word)

 LexInfo associates a LexicalEntry with a ListOfComponents, which holds an
ordered list of Components; its size is given as a DataProperty of the
ListOfComponents.

 Each of the Components is linked with a LexicalEntry.

 The validity of Component as a legitimate LexicalEntry is derived from its
presence in the myGrid ontology as a separate entity, or its substantive
existence in the domain corpus.
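
The decomposition of a multi-word label can be sketched as follows. This is a minimal illustration, assuming labels are underscore-separated; the `decompose` helper, the dictionary layout and the `known` set (standing in for "present in the ontology or attested in the corpus") are hypothetical and do not reflect the actual LexInfo data model.

```python
def decompose(label, known_entries):
    """Split a multi-word label into an ordered list of components."""
    components = []
    for position, token in enumerate(label.split("_"), start=1):
        lemma = token.lower()
        components.append({
            "order": position,
            "lexical_entry": lemma,
            # A component counts as legitimate only if it is known
            # elsewhere (ontology entity or domain corpus).
            "valid": lemma in known_entries,
        })
    return {"label": label, "size": len(components), "components": components}


# Hypothetical set of entries attested in the ontology or the corpus.
known = {"sequence", "similarity", "search"}
result = decompose("Sequence_similarity_Search", known)
print(result["size"], [c["lexical_entry"] for c in result["components"]])
```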

Methodology - III
• Lexicalization of Class Labels (Multi-Word)
 An example of morphological decomposition of a multi-word class label
(from the myGrid ontology).

Methodology - IV
• Lexicalization of Property Labels (Steps)
 Morphological decomposition as well as the syntactic analysis of the property
label is performed.

 The property labels are automatically tokenized, and the tokens are then linked with
LexicalEntries (in the same way as for classes).

 On the syntactic level, the tokens are analyzed to attach their respective syntactic
behaviour, which is then linked with the subcategorization frames.

 The LexInfo model provides various specializations of subcategorization frames, such
as Transitive, TransitivePP, IntransitivePP, AdjectiveNP, NounPP and Noun2PP.

 Syntactic arguments such as Subject, Object and PObject, linked with the LexicalEntry,
are mapped to semantic arguments such as Domain, Range and RangeOfProperty
corresponding to the object property.
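
The mapping from syntactic to semantic arguments can be sketched for a Transitive frame. This is a hedged illustration: the `transitive_behaviour` helper and the dictionary layout are hypothetical, and treating task_performed_by as the property whose subject realizes the Range is an assumption made here for the example, not something the slides specify.

```python
def transitive_behaviour(property_name, verb, inverse=False):
    """Describe a Transitive frame and its argument mapping."""
    # Assumption: for the inverse property, Subject and Object swap
    # their roles with respect to Domain and Range.
    subject, obj = ("Range", "Domain") if inverse else ("Domain", "Range")
    return {
        "property": property_name,
        "verb": verb,
        "frame": "Transitive",
        "arg_map": {"Subject": subject, "Object": obj},
    }


performs = transitive_behaviour("performs_task", "perform")
performed_by = transitive_behaviour("task_performed_by", "perform", inverse=True)
print(performs["arg_map"])
print(performed_by["arg_map"])
```

Both properties share the same verb and frame; only the argument map differs.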

Methodology - IV
• Lexicalization of Property Labels

 In automatic lexicon generation, the lexical entries are derived
automatically by processing the labels in the ontology using the LILAC
grammar.

 LILAC production rules state part-of-speech patterns that apply to the
label. For example, a label with the structure “N Prep” gives rise to a
lexicon entry of type “NounPP”.

 Currently, LexInfo uses 73 rules to generate lexicons automatically
(further details on LexInfo homepage).
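
The idea of a rule that maps a part-of-speech pattern to a lexicon entry type can be sketched as a lookup. This is not the LILAC grammar itself: the `RULES` table and the `entry_type` helper are hypothetical, and only the first rule ("N Prep" gives NounPP) is taken from the slides. The other two rules are invented examples in the same spirit.

```python
# POS pattern -> lexicon entry type.
RULES = [
    (("N", "Prep"), "NounPP"),     # from the slides
    (("V",), "Transitive"),        # hypothetical
    (("Adj", "N"), "NounPhrase"),  # hypothetical
]


def entry_type(pos_pattern, rules=RULES):
    """Return the entry type of the first rule matching the pattern."""
    pattern = tuple(pos_pattern)
    for rule_pattern, result in rules:
        if pattern == rule_pattern:
            return result
    return None  # no rule applies to this label


print(entry_type(["N", "Prep"]))
```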

Methodology - IV
• Lexicalization of Property Labels
– Lexicalization of ObjectProperty produces.

Statistics - I
• Some statistics about the myGrid ontology

Ontology Constructs                      Total Number of Occurrences
Classes                                  475
  Single word class labels               88
  Two word class labels                  200
  Three or more word class labels        187
ObjectProperties                         8
  Single word property labels            1
  Two word property labels               4
  Three or more word property labels     3
DataProperties                           0
Individuals                              0
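
Counts by label length, as in the table above, could be obtained by bucketing labels by their number of words. This is a small sketch, assuming underscore-separated labels; the `word_count_bucket` helper is hypothetical and the sample labels are ones quoted elsewhere in these slides, not the full ontology.

```python
from collections import Counter


def word_count_bucket(label):
    """Classify a label by its number of underscore-separated words."""
    n = len(label.split("_"))
    if n == 1:
        return "single word"
    if n == 2:
        return "two words"
    return "three or more words"


labels = ["Alignment", "Medline", "PIRSF_report",
          "Sequence_similarity_Search", "DDBJ_Amino_Acid_Database"]
buckets = Counter(word_count_bucket(label) for label in labels)
print(buckets)
```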

Statistics - II
• Semi-automatically generated LexInfo-based lexicon of the myGrid ontology.

LexInfo Constructs    Specialized Constructs   Example Labels                  Number of Entries in ‘myGrid Lexicon’
LexicalEntries        Adjective                Multiple                        21
                      Noun                     Alignment                       752
                      Proper Noun              Medline                         253
                      Verb                     Perform                         4
                      NounPhrase               Sequence_similarity_Search      369
                      AdjectivePhrase          Tertiary_Structure_Prediction   16
                      VerbPhrase               Performs_task                   1
Written-Form                                                                   1044
List-of-Components                                                             387
Syntactic-Behaviour   Transitive               produces                        4
                      NounPP                   is_part_of                      4

Statistics - III
• Statistics about the automatically generated LexInfo-based lexicon of the
myGrid ontology using the LexInfo lexicon generation service.

LexInfo Constructs    Specialized Constructs   Example Labels                  # of Entries in ‘myGrid Lexicon’
LexicalEntry          Adjective                local                           131
                      Noun                     Record                          973
                      Proper Noun              Maize                           15
                      Verb                     Perform                         19
                      NounPhrase               Genotype-phenotype-database     1069
                      ProperNounPhrase         UniProt                         1
                      VerbPhrase                                               0
List-of-Components                                                             1071
Syntactic-Behaviour   Transitive               produces                        3
                      NounPP                   is_part_of                      4
                      IntransitivePP           produced_by                     1

Discussion
Semi-Automatically created Lexicon

Lexicalization of Classes

 Most of the LexicalEntries are of type Noun, NounPhrase and ProperNoun

 There are few Verb occurrences.
 Class labels are mostly named using nouns, whereas object properties are
typically named using verbs.
 The small number of ObjectProperties (8) resulted in a small number
of verbs in the lexicon.

 The number of Proper Nouns is 253, 32 of which are created from
single-word Class names.

 387 ListOfComponents are created from the 387 multi-word class names
in the myGrid ontology; 371 of them correspond to NounPhrases and 16
are AdjectivePhrases.

Discussion
Semi-Automatically created Lexicon

Lexicalization of ObjectProperties

 is_identifier_of and is_part_of are lexicalized as Nouns (identifier and part).
 SyntacticBehavior is linked to the subcategorization frame of type NounPP
(Noun: identifier, Prep: of and Noun: part, Prep: of).

 performs_task and task_performed_by are lexicalized as Verb (perform).
 SyntacticBehavior is linked to the subcategorization frame of type Transitive.
 The two properties are inverses of each other and are lexicalized using the
same verb; however, the mapping of syntactic arguments to domain and
range is reversed between the two cases.

 produces and produced_by are lexicalized as Verb (produce).
 performs_task is recognized as a VerbPhrase with performs as a Verb and a
Transitive subcategorization frame linked with it.

 The syntactic behaviors of has_identifier and has_part are also modeled
as NounPP.

Discussion
Automatically generated Lexicon using LexInfo service

Lexicalization of Classes (differences from the semi-automatically created)

 The number of Adjectives has increased significantly to 131, and the number of
ProperNouns has dropped steeply to 15.
 The reason is that ProperNouns are incorrectly identified as Adjectives by our POS tagger
(Stanford Tagger), e.g. DDBJ in DDBJ_Amino_Acid_Database and PIRSF in PIRSF_report.
 This problem can be resolved by using domain corpora or by consulting a domain
thesaurus or dictionary.

 The number of Verbs has increased to 19.
 This is again due to a POS tagging error: gerunds such as “manipulating” and “predicting” are
incorrectly identified as Verbs.

 The identification of ProperNounPhrase is incorrect due to a tokenization error.
 “UniProt” is tokenized as two proper nouns, “uni” and “prot”, although it is a single word,
i.e. name of a bioinformatics database.
 This can also be resolved using a domain corpus or thesaurus.
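
One way a domain thesaurus could prevent such a split is to protect known terms before tokenizing. The sketch below assumes that the split arises from camel-case tokenization, which the slides do not state; the `tokenize` helper and the `protected` set are hypothetical and are not part of the LexInfo service.

```python
import re

# Zero-width boundary between a lowercase and an uppercase letter,
# e.g. "UniProt" -> "Uni", "Prot".
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def tokenize(label, protected=frozenset()):
    """Split a label on underscores and camel-case boundaries."""
    tokens = []
    for part in label.split("_"):
        if part in protected:
            tokens.append(part)  # keep known domain terms whole
        else:
            tokens.extend(CAMEL_BOUNDARY.split(part))
    return tokens


print(tokenize("UniProt"))
print(tokenize("UniProt", protected={"UniProt"}))
```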

Discussion
Automatically generated Lexicon using LexInfo service

Lexicalization of ObjectProperties

 ObjectProperties are mostly lexicalized correctly.

 The only error is in the lexicalization of “produced_by”, which is recognized as
IntransitivePP. This is caused by an error in the ontology lexicalization
(LILAC) rules, which treat a past-participle verb
followed by “by” as an occurrence of IntransitivePP.

Implementation
• Initial implementation of LexInfo Model as API – Univ. Bielefeld, DERI –
National Univ. of Ireland, Galway
– https://lexinfo.googlecode.com/svn

Future Work
 Linguistically enriched ontology for improvement of service annotation
 The linguistically enriched lexicon associated with the myGrid ontology can
improve the performance of literature based approaches for automatic annotation
of bioinformatics web services.

 Optimization of the LexInfo model by including WordNet etc.
 Generate all possible lexicalizations of given ontological constructs by
using synsets from WordNet and by extracting semantically similar verbs from
VerbNet and FrameNet.

 The LexInfo API is currently under development
 It allows the creation, management and serialization of ontology lexica according to the
LexInfo model. An early prototype of a lexicon generation service based on the LexInfo
model is also available at: http://code.google.com/p/lexinfo/

Acknowledgments
• Supported in part by the European Union under Grant No. 248458 for the Monnet
project as well as by the Science Foundation Ireland under Grant No.
SFI/08/CE/I1380 (Lion-2).

• Thanks to Thomas Wangler, Michael Sintek and Matthias Mantel for their valuable
contributions in designing the LexInfo model and developing the LexInfo API.

References
• Afzal, H., Stevens, R. and Nenadic, G. Mining Semantic Descriptions of Bioinformatics Web
Resources from the Literature, In Proceedings of the 6th European Semantic Web Conference (ESWC
2009), LNCS 5554, Springer-Verlag: 535-549.

• Afzal, H., Stevens, R., Nenadic, G. Towards Semantic Annotation of Bioinformatics Services:
Building a Controlled Vocabulary, In Proceedings of the Third International Symposium on Semantic
Mining in Biomedicine (SMBM 2008):5-12.

• Buitelaar, P., Declerck, T., Frank, A., Racioppa, S., Kiesel, M., Sintek, M., Engel, R., Romanelli, M.,
Sonntag, D., Loos, B., Micelli, V., Porzel, R. and Cimiano, P. LingInfo: Design and Applications of a
Model for the Integration of Linguistic Information in Ontologies. In Proceedings of OntoLex06, a
workshop at LREC, Genoa, Italy.

• Buitelaar, P., Cimiano, P., Haase, P. and Sintek, M. Towards Linguistically Grounded
Ontologies. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), Lecture Notes
in Computer Science, Springer 2009.

• Cimiano, P., Haase, P., Herold, M., Mantel, M. and Buitelaar, P.: LexOnto: A model for ontology
lexicons for ontology-based NLP. In Proceedings of the OntoLex (From Text to Knowledge: The
Lexicon/Ontology Interface) workshop at ISWC07 (International Semantic Web Conference).

• Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M. and Soria, C.: Lexical markup
framework: ISO standard for semantic information in NLP lexicons. In Proceedings of the Workshop
of the GLDV Working Group on Lexicography at the Biennial Spring Conference of the GLDV

Resources Used
• BMC Bioinformatics: http://www.biomedcentral.com/bmcbioinformatics/
• Genia Tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
• Stanford Parser: http://nlp.stanford.edu/downloads/lex-parser.shtml
• Stanford Tagger: http://nlp.stanford.edu/software/tagger.shtml
• TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
