Generating Lexical Information for Terminology
          in a Bioinformatics Ontology
    Hammad Afzal (1,3), Paul Buitelaar (1), Philipp Cimiano (2), John McCrae (2),
                                Tobias Wunner (1)

(1) Unit for Natural Language Processing, Digital Enterprise Research Institute,
    National University of Ireland, Galway, Ireland
(2) Semantic Computing Group, Center of Excellence (CITEC),
    Bielefeld University, Bielefeld, Germany
(3) Department of Computer Science, College of Telecommunication Engineering,
    National University of Sciences and Technology, Pakistan
Motivation
 Lack of linguistic expressiveness in formally specified ontologies
     They are typically developed to provide a shared view of a domain’s knowledge.
     They do not necessarily support natural language processing (NLP) tasks.

 Solutions:
     Terminologies that include linguistic information make ontologies easier to use for
      text processing, e.g. the Specialist Lexicon contains lexical variants of many terms
      used in the biomedical domain.
     The Simple Knowledge Organization System (SKOS) format provides a standard way to
      represent knowledge organization systems using the Resource Description Framework
      (RDF).

     Limitations:
         SKOS provides a data model for classification schemes such as thesauri and
          introduces a typology of labels (preferred, alternative, hidden), but it is not
          intended to associate more sophisticated lexical and linguistic information
          with an arbitrary ontology.
Desiderata for Ontology-Lexicon model
  Separation between linguistic and ontological Level
    Develop lexica independently of specific ontologies for the same domain
    Allow different lexica for each ontology


  Independence between linguistic and ontological level
    No mutual constraints
    Ontological structures/concepts do not need to have a corresponding representation
     of linguistic structure and vice versa


  Detailed information on linguistic realization
    Part of speech, morphology (inflection, decomposition), syntactic structure
     (subcategorization frames), etc.


  Support for multi-linguality
Towards our approach: LexInfo
 Recent principled approaches to associate linguistic information
  to an arbitrary ontology:
    LingInfo: modeling morpho-syntactic decomposition of (complex) terms [Buitelaar et al.
     2006]

    LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]

    Lexical Markup Framework (LMF): ISO standardized model for representing machine
     readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]


 LexInfo: building on LMF as a core, develop a model which “subsumes”
  LingInfo and LexOnto for flexibly associating linguistic information to
  ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]
Case Study: Lexicalizing a bioinformatics ontology
   Creating a LexInfo-based lexicon for lexical enrichment of a bioinformatics
    ontology i.e. the myGrid ontology (Wolstencroft et al., 2007).

   Lexical information is derived from semantic lexicons such as WordNet
    (Fellbaum, 1998), and a domain related corpus.


Key points:

        The capture of morpho-syntactic behavior such as part-of-speech (POS),
         decomposition, lemmatization and sub-categorization behaviour of lexical
         elements.
        The lexicalized terms along with their linguistic information are added to
         the OWL-based lexicon based on the LexInfo model.
Case Study: Lexicalizing a bioinformatics ontology
 MyGrid Ontology

       Supports Service Description of bioinformatics resources through service
        annotation.

       Manual annotation is a slow process: e.g. in Taverna/Feta only ~15-20% of
        services are functionally described. The result is a growing backlog
        of un-annotated services.

       Some NLP-based attempts to automate service description using the
        myGrid ontology have been reported.

       Lexicalization of the myGrid ontology can improve the performance of such
        approaches.
Case Study: Lexicalizing a bioinformatics ontology
• LexInfo
      A principled way to enrich ontologies with linguistic information.

      Provides a framework for automatic construction of 'lexicalized
       ontologies' on top of existing ontologies and lexical resources (Buitelaar
       et al, 2009)

•   Main characteristics:
     Two separate domains of discourse, kept apart by using different namespaces:
      the domain ontology and the LexInfo model.
     The domain ontology defines the classes, properties and individuals in that
      domain.
     The main entities in the lexical domain of discourse are instances of the class
      LexicalEntry.
     LexInfo attaches lexical information (e.g. part-of-speech, morphological and
      subcategorization information) to lexical entries.
Rest of the talk
• Methodology
      Dual approach towards lexicalization of myGrid ontology
      Collection of Bioinformatics Corpus
      Lexicalization of Class Labels
      Lexicalization of Property Labels

• Statistics, Experiments and Results
    Semi-automatically created lexicon
    Automatically generated lexicon

• What’s Next
Methodology - I
 Dual approach towards lexicalization of myGrid ontology
    Semi-automatically created LexInfo-based lexicon.

    Automatically created lexicon using LexInfo ontology lexicalization
     service.

    Difference:

        In Semi-automatically created lexicon, the linguistic information has been
         mainly derived from the domain corpus, and manually analyzed to verify
         correctness

        In automatic generation, a generic POS tagger and domain-independent
         lexical resources are used to derive morpho-syntactic behaviour from
         an automatic analysis of the labels of the concepts, properties and
         individuals in the ontology.
Methodology - II
 Collection of Bioinformatics Corpus

    Domain specific behaviour (linguistic information) of the lexical entries is
     derived from 2691 full text journal articles of BMC Bioinformatics.

    The GeniaTagger is used to get POS information; the tags of interest are
     Nouns, Proper Nouns, Verbs and Adjectives.

    Syntactic information is derived using the Stanford parser.
        Currently, we have worked only on the syntactic behaviour of properties
         (owl:ObjectProperty and owl:DatatypeProperty in particular) and not of classes.
Methodology - III
• Lexicalization of Class Labels (Step-wise approach)

   1. A LexicalEntry is created for each Class in the domain ontology and is
      linked to that Class through the hasSense property.

   2. The LexicalEntry is instantiated as one of its sub-classes (e.g. Noun, Verb,
      Adjective).

   3. The POS tag is derived from a semantic lexicon such as WordNet and
      further supported by the associated domain corpus.

   4. The lexical form (Lemma, WordForm, etc.) is attached to the lexical
      entry through the corresponding relation: hasLemma or hasWordForm.
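
The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the LexInfo API: the LexicalEntry dataclass, the toy POS_LEXICON dictionary (standing in for WordNet and the domain corpus) and the helper lexicalize_class are hypothetical names. Only hasSense, hasLemma and the Noun/Verb/Adjective sub-classes come from the slides.

```python
from dataclasses import dataclass

# Toy stand-in for a WordNet / domain-corpus lookup (assumption, not real data).
POS_LEXICON = {"alignment": "Noun", "multiple": "Adjective", "perform": "Verb"}


@dataclass
class LexicalEntry:
    entry_type: str  # LexInfo sub-class: Noun, Verb, Adjective, ...
    has_sense: str   # name of the ontology class the entry refers to (hasSense)
    has_lemma: str   # canonical form attached through hasLemma


def lexicalize_class(class_name: str) -> LexicalEntry:
    """Create a LexicalEntry for a single-word class label."""
    lemma = class_name.lower()
    # Step 3: look up the POS tag; falling back to Noun is an assumption.
    pos = POS_LEXICON.get(lemma, "Noun")
    # Steps 1, 2 and 4: create the typed entry, link it to the class,
    # and attach the lemma.
    return LexicalEntry(entry_type=pos, has_sense=class_name, has_lemma=lemma)


entry = lexicalize_class("Alignment")
print(entry)
```

In a real setting the entries would be serialized as OWL individuals rather than kept as Python objects.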
Methodology - III
• Lexicalization of Class Labels (Single Word)

   The linking of LexicalEntry with a domain Class, and attachment of
   grammatical information and lemma with LexicalEntry
Methodology - III
• Lexicalization of Class Labels (Multi-Word)

    LexInfo associates a LexicalEntry with a ListOfComponents, which holds an
     ordered list of Components; its size is given as a DataProperty of the
     ListOfComponents.

    Each of the Components is linked with a LexicalEntry.

    The validity of a Component as a legitimate LexicalEntry is established by its
     presence in the myGrid ontology as a separate entity, or by its substantive
     presence in the domain corpus.
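
The decomposition described above can be sketched as follows. The class and function names are hypothetical, the underscore split is an assumption based on the label style seen in the statistics tables (e.g. Sequence_similarity_Search), and the known_entries set stands in for the check against the ontology or the domain corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    position: int       # order of the component within the label
    lexical_entry: str  # lemma of the LexicalEntry the component links to


@dataclass
class ListOfComponents:
    components: list = field(default_factory=list)

    @property
    def size(self) -> int:
        # Size is derived from the list here; LexInfo stores it as a property.
        return len(self.components)


def decompose(label: str, known_entries: set) -> ListOfComponents:
    """Split an underscore-separated label into an ordered component list."""
    result = ListOfComponents()
    for position, token in enumerate(label.split("_"), start=1):
        lemma = token.lower()
        # A component is accepted only if it is a lexical entry in its own right.
        if lemma not in known_entries:
            raise ValueError(f"no LexicalEntry for component {token!r}")
        result.components.append(Component(position, lemma))
    return result


known = {"sequence", "similarity", "search"}
loc = decompose("Sequence_similarity_Search", known)
print(loc.size, [c.lexical_entry for c in loc.components])
```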
Methodology - III
• Lexicalization of Class Labels (Multi-Word)
    An example of morphological decomposition of a multi-word class label
     (from the myGrid ontology).
Methodology - IV
• Lexicalization of Property Labels (Steps)
    Both morphological decomposition and syntactic analysis of the property
     label are performed.

    The property labels are automatically tokenized, and the tokens are then linked to
     LexicalEntries (as for classes).

    At the syntactic level, the tokens are analyzed to determine their syntactic
     behaviour, which is then linked to a subcategorization frame.

    The LexInfo model provides various specializations of subcategorization frames, such
     as Transitive, TransitivePP, IntransitivePP, AdjectiveNP, NounPP and Noun2PP.

    The syntactic arguments linked with the LexicalEntry (Subject, Object, PObject, etc.)
     are mapped to the semantic arguments of the object property (Domain, Range,
     RangeOfProperty).
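
The argument mapping in the last step can be illustrated with a pair of inverse properties that share one verb. This is a minimal sketch under stated assumptions: the SyntacticBehaviour dataclass, the verbalize helper and the placeholder values service_A and task_B are hypothetical, and the direction of each mapping (which property maps Subject to Domain) is assumed rather than taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class SyntacticBehaviour:
    verb: str     # lemma of the LexicalEntry
    frame: str    # subcategorization frame, e.g. Transitive
    subject: str  # semantic argument filled by the Subject
    obj: str      # semantic argument filled by the Object


# Same verb and frame, but the mapping to Domain and Range is swapped.
PERFORMS_TASK = SyntacticBehaviour("perform", "Transitive", "Domain", "Range")
TASK_PERFORMED_BY = SyntacticBehaviour("perform", "Transitive", "Range", "Domain")


def verbalize(behaviour, domain_value, range_value):
    """Render a property assertion as 'subject verb-s object'."""
    values = {"Domain": domain_value, "Range": range_value}
    return f"{values[behaviour.subject]} {behaviour.verb}s {values[behaviour.obj]}"


print(verbalize(PERFORMS_TASK, "service_A", "task_B"))
print(verbalize(TASK_PERFORMED_BY, "task_B", "service_A"))
```

Both calls yield the same sentence, which is the point of swapping the mapping for the inverse property.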
Methodology - IV
• Lexicalization of Property Labels

    In automatic lexicon generation, the lexical entries are derived
     automatically by processing the labels in the ontology using the LILAC
     grammar.

    LILAC production rules state part-of-speech patterns that apply to the
     label. For example, a label with the structure “N Prep” gives rise to a
     lexicon entry of type “NounPP”.

    Currently, LexInfo uses 73 rules to generate lexicons automatically
     (further details on LexInfo homepage).
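
The idea of pattern-based rules can be sketched as a small lookup. This is not the LILAC rule syntax, which the slides do not show: the RULES table and frame_for are hypothetical, and only the “N Prep” to NounPP rule is stated on the slide. The “V” rule is an assumption, and the “V Prep” rule mirrors the behaviour later reported for produced_by.

```python
# Each rule pairs a part-of-speech pattern with a subcategorization frame.
RULES = [
    (("N", "Prep"), "NounPP"),          # stated on the slide
    (("V",), "Transitive"),             # assumption for illustration
    (("V", "Prep"), "IntransitivePP"),  # behaviour reported for produced_by
]


def frame_for(pos_pattern):
    """Return the frame of the first rule whose pattern equals the input."""
    for pattern, frame in RULES:
        if tuple(pos_pattern) == pattern:
            return frame
    return None  # no rule applies to this label


print(frame_for(["N", "Prep"]))  # NounPP
```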
Methodology - IV
• Lexicalization of Property Labels
   – Lexicalization of ObjectProperty produces.
Statistics - I
•   Some of the statistics about the myGrid ontology



        Ontology Constructs   Label type                           Count   Total Number of
                                                                           Occurrences
        Classes               Single-word class labels                88   475
                              Two-word class labels                  200
                              Three-or-more-word class labels        187
        ObjectProperties      Single-word property labels              1   8
                              Two-word property labels                 4
                              Three-or-more-word property labels       3
        DataProperties                                                     0
        Individuals                                                        0
Statistics - II
•   Semi-automatically generated LexInfo based lexicon of the myGrid ontology.
    LexInfo               Specialized       Example Labels                  Number of Entries
    Constructs            Constructs                                        in ‘myGrid Lexicon’
    LexicalEntries        Adjective         Multiple                        21
                          Noun              Alignment                       752
                          Proper Noun       Medline                         253
                          Verb              Perform                         4
                          NounPhrase        Sequence_similarity_Search      369
                          AdjectivePhrase   Tertiary_Structure_Prediction   16
                          VerbPhrase        Performs_task                   1
    Written-Form                                                            1044
    List-of-Components                                                      387
    Syntactic-Behaviour   Transitive        produces                        4
                          NounPP            is_part_of                      4
Statistics - III
•   Statistics about the automatically generated LexInfo based lexicon of the
    myGrid ontology using LexInfo lexicon generation service.

    LexInfo               Specialized         Example Labels                # of Entries in
    Constructs            Constructs                                        ‘myGrid Lexicon’
    LexicalEntry          Adjective           local                         131
                          Noun                Record                        973
                          Proper Noun         Maize                         15
                          Verb                Perform                       19
                          NounPhrase          Genotype-phenotype-database   1069
                          ProperNounPhrase    UniProt                       1
                          VerbPhrase                                        0
    List-of-Components                                                      1071
    Syntactic-Behaviour   Transitive          produces                      3
                          NounPP              is_part_of                    4
                          IntransitivePP      produced_by                   1
Discussion
Semi-Automatically created Lexicon

Lexicalization of Classes

  Most of the LexicalEntries are of type Noun, NounPhrase and ProperNoun

  Not many Verb occurrences.
        Class labels are mostly named using nouns, whereas object properties are
         typically named using verbs.
        The small number of ObjectProperties (8) resulted in a small number
         of verbs in the lexicon.

  The number of Proper Nouns is 253; 32 of which are created from single-
   word Class names.

  387 ListOfComponents are created from the 387 multi-word class names
   in the myGrid ontology; 371 of them correspond to NounPhrases and 16
   to AdjectivePhrases.
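
The counts on this slide can be cross-checked against the class-label statistics given earlier (88 single-word, 200 two-word and 187 three-or-more-word class labels). A small arithmetic check, with variable names chosen here only for illustration:

```python
# Class-label counts from the "Statistics - I" table.
single_word, two_word, three_plus_word = 88, 200, 187

total_classes = single_word + two_word + three_plus_word
multi_word_classes = two_word + three_plus_word

# Phrase types reported for the multi-word class names on this slide.
noun_phrases, adjective_phrases = 371, 16

print(total_classes)                     # 475
print(multi_word_classes)                # 387
print(noun_phrases + adjective_phrases)  # 387
```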
Discussion
Semi-Automatically created Lexicon

Lexicalization of ObjectProperties

  is_identifier_of, and is_part_of lexicalized as Nouns (part and identifier)
        SyntacticBehavior linked to the subcategorization frame of type NounPP
         (Noun: identifier, Prep: of and Noun: part, Prep:of).

  performs_task and task_performed_by lexicalized as Verb (perform).
        SyntacticBehavior linked to the subcategorization frame of type Transitive.
        The two properties are inverses of each other and are lexicalized using the
         same verb; however, the mapping of syntactic arguments to domain and
         range is inverted between the two cases.

  produces and produced_by are lexicalized as Verb (produce).
        performs_task is recognized as a VerbPhrase, with performs as a Verb and a
         Transitive subcategorization frame linked to it.

  The syntactic behaviors of has_identifier and has_part are also modeled
   as NounPP.
Discussion
Automatically generated Lexicon using LexInfo service

Lexicalization of Classes (differences from the semi-automatically created)

    The number of Adjectives has increased significantly to 131, and the number of
     ProperNouns has dropped sharply to 15.
         The reason is that ProperNouns are incorrectly identified as Adjectives by our POS tagger
          (Stanford Tagger), e.g. DDBJ (in DDBJ_Amino_Acid_Database) and PIRSF (in PIRSF_report)
          are tagged as Adjectives.
         This problem can be resolved by using domain corpora, or by consulting a domain
          thesaurus or dictionary.

    The number of Verbs has increased to 19.
         This is again due to POS tagging errors: gerunds such as “manipulating” and “predicting” are
          incorrectly identified as Verbs.

    The identification of ProperNounPhrase is incorrect due to a tokenization error.
         “UniProt” is tokenized as two proper nouns, “uni” and “prot”, although it is a single word,
          i.e. name of a bioinformatics database.
         This can also be resolved using a domain corpus or thesaurus.
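
The remedy suggested above, consulting a domain thesaurus or dictionary, can be sketched as a post-correction step over tagger output. Everything here is hypothetical: the DOMAIN_PROPER_NOUNS set is a stand-in for a real domain resource (seeded with the names mentioned on this slide), correct_tags is an illustrative helper, and the input tags are made up apart from DDBJ being tagged as an Adjective.

```python
# Stand-in for a domain thesaurus; a real one would be far larger.
DOMAIN_PROPER_NOUNS = {"ddbj", "pirsf", "uniprot"}


def correct_tags(tagged_tokens):
    """Override generic tagger output for tokens found in the domain list."""
    corrected = []
    for token, tag in tagged_tokens:
        if token.lower() in DOMAIN_PROPER_NOUNS:
            tag = "ProperNoun"  # trust the domain resource over the tagger
        corrected.append((token, tag))
    return corrected


tagged = [("DDBJ", "Adjective"), ("Amino", "Noun"),
          ("Acid", "Noun"), ("Database", "Noun")]
print(correct_tags(tagged))
```

The same lookup could be applied before tokenization so that a known name such as UniProt is kept as a single token.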
Discussion
Automatically generated Lexicon using LexInfo service

Lexicalization of ObjectProperties

  ObjectProperties are mostly lexicalized correctly.

  The only error is in the lexicalization of “produced_by”, which is recognized as
   IntransitivePP. This is caused by an error in the ontology lexicalization
   (LILAC) rules, which treat a past-participle verb
   followed by “by” as an occurrence of IntransitivePP.
Implementation
• Initial implementation of LexInfo Model as API – Univ. Bielefeld, DERI –
  National Univ. of Ireland, Galway
   – https://lexinfo.googlecode.com/svn
Future Work
  Linguistically enriched ontology for improvement of service annotation
     The linguistically enriched lexicon associated with the myGrid ontology can
      improve the performance of literature based approaches for automatic annotation
      of bioinformatics web services.

  Optimization of the LexInfo model by including WordNet and similar resources
        Generate all possible lexicalizations of given ontological constructs by
         using synsets from WordNet and by extracting semantically similar verbs from
         VerbNet and FrameNet.

  LexInfo API is currently under development
        Allows the creation, management and serialization of ontology lexica according to the
         LexInfo model. An early prototype of a lexicon generation service based on LexInfo
         model is also made available. Available at: http://code.google.com/p/lexinfo/
Acknowledgments
•   Supported in part by the European Union under Grant No. 248458 for the Monnet
    project as well as by the Science Foundation Ireland under Grant No.
    SFI/08/CE/I1380 (Lion-2).

•   Thanks to Thomas Wangler, Michael Sintek and Matthias Mantel for their valuable
    contributions in designing the LexInfo model and developing the LexInfo API.
References
•   Afzal, H., Stevens, R. and Nenadic, G. Mining Semantic Descriptions of Bioinformatics Web
    Resources from the Literature. In Proceedings of the 6th European Semantic Web Conference (ESWC
    2009), LNCS 5554, Springer-Verlag: 535-549.

•   Afzal, H., Stevens, R. and Nenadic, G. Towards Semantic Annotation of Bioinformatics Services:
    Building a Controlled Vocabulary. In Proceedings of the Third International Symposium on Semantic
    Mining in Biomedicine (SMBM 2008): 5-12.

•   Buitelaar, P., Declerck, T., Frank, A., Racioppa, S., Kiesel, M., Sintek, M., Engel, R., Romanelli, M.,
    Sonntag, D., Loos, B., Micelli, V., Porzel, R. and Cimiano, P. LingInfo: Design and Applications of a
    Model for the Integration of Linguistic Information in Ontologies. In Proceedings of OntoLex06, a
    workshop at LREC, Genoa, Italy.

•   Buitelaar, P., Cimiano, P., Haase, P. and Sintek, M. Towards Linguistically Grounded
    Ontologies. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), Lecture Notes
    in Computer Science, Springer 2009.

•   Cimiano, P., Haase, P., Herold, M., Mantel, M. and Buitelaar, P. LexOnto: A model for ontology
    lexicons for ontology-based NLP. In Proceedings of the OntoLex (From Text to Knowledge: The
    Lexicon/Ontology Interface) workshop at ISWC07 (International Semantic Web Conference).

•   Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M. and Soria, C. Lexical markup
    framework: ISO standard for semantic information in NLP lexicons. In Proceedings of the Workshop
    of the GLDV Working Group on Lexicography at the Biennial Spring Conference of the GLDV
Resources Used
•   BMC Bioinformatics: http://www.biomedcentral.com/bmcbioinformatics/
•   Genia Tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
•   Stanford Parser: http://nlp.stanford.edu/downloads/lex-parser.shtml
•   Stanford Tagger: http://nlp.stanford.edu/software/tagger.shtml
•   TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
KOS Management - The case of the Organic.Edunet Ontology
 
Knowledge Organization Systems (KOS): Management of Classification Systems in...
Knowledge Organization Systems (KOS): Management of Classification Systems in...Knowledge Organization Systems (KOS): Management of Classification Systems in...
Knowledge Organization Systems (KOS): Management of Classification Systems in...
 
Pedagogical applications of corpus data for English for General and Specific ...
Pedagogical applications of corpus data for English for General and Specific ...Pedagogical applications of corpus data for English for General and Specific ...
Pedagogical applications of corpus data for English for General and Specific ...
 
Literature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resourcesLiterature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resources
 
Laboratory for applied ontology
Laboratory for applied ontologyLaboratory for applied ontology
Laboratory for applied ontology
 
NLP in Web Data Extraction (Omer Gunes)
NLP in Web Data Extraction (Omer Gunes)NLP in Web Data Extraction (Omer Gunes)
NLP in Web Data Extraction (Omer Gunes)
 
LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...
LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...
LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...
 
LDL 2012 - Linking to ISOcat Data Categories
LDL 2012 - Linking to ISOcat Data CategoriesLDL 2012 - Linking to ISOcat Data Categories
LDL 2012 - Linking to ISOcat Data Categories
 
TDWG VoMaG Vocabulary management workflow, 2013-10-31
TDWG VoMaG Vocabulary management workflow, 2013-10-31TDWG VoMaG Vocabulary management workflow, 2013-10-31
TDWG VoMaG Vocabulary management workflow, 2013-10-31
 
LEXICOGRAPHY
LEXICOGRAPHY LEXICOGRAPHY
LEXICOGRAPHY
 
download
downloaddownload
download
 
download
downloaddownload
download
 
Project proposal for a fishery ontology service
Project proposal for a fishery ontology serviceProject proposal for a fishery ontology service
Project proposal for a fishery ontology service
 
PA5-2_iconf08.doc.doc
PA5-2_iconf08.doc.docPA5-2_iconf08.doc.doc
PA5-2_iconf08.doc.doc
 
PA5-2_iconf08.doc.doc
PA5-2_iconf08.doc.docPA5-2_iconf08.doc.doc
PA5-2_iconf08.doc.doc
 

Generating Lexical Information for Terminology in a Bioinformatics Ontology

  • 1. Generating Lexical Information for Terminology in a Bioinformatics Ontology
      Hammad Afzal (1,3), Paul Buitelaar (1), Philipp Cimiano (2), John McCrae (2), Tobias Wunner (1)
      (1) Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
      (2) Semantic Computing Group, Center of Excellence (CITEC), Bielefeld University, Bielefeld, Germany
      (3) Department of Computer Science, College of Telecommunication Engineering, National University of Sciences and Technology, Pakistan
  • 2. Motivation
      • Lack of linguistic expressiveness in formally specified ontologies
          • They are typically developed to provide a shared view of a domain’s knowledge.
          • They do not necessarily support natural language processing (NLP) tasks.
      • Solutions:
          • Terminologies that include linguistic information make ontologies easier to use for text processing, e.g. the Specialist Lexicon contains lexical variants of many terms used in the biomedical domain.
          • The Simple Knowledge Organization System (SKOS) format provides a standard way to represent knowledge organization systems using the Resource Description Framework (RDF).
      • Limitations:
          • SKOS provides a data model for classification schemas such as thesauri, adding a typology of labels (preferred, alternative, hidden, etc.). It is not intended to associate more sophisticated lexical and linguistic information with an arbitrary ontology.
  • 3. Desiderata for Ontology-Lexicon model
      • Separation between linguistic and ontological level
          • Develop lexica independently of specific ontologies for the same domain
          • Allow different lexica for each ontology
      • Independence between linguistic and ontological level
          • No mutual constraints
          • Ontological structures/concepts do not need to have a corresponding representation of linguistic structure, and vice versa
      • Detailed information on linguistic realization
          • Part of speech, morphology (inflection, decomposition), syntactic structure (subcategorization frames), etc.
      • Support for multilinguality
  • 4. Towards our approach: LexInfo
      • Recent principled approaches to associating linguistic information with an arbitrary ontology:
          • LingInfo: modeling morpho-syntactic decomposition of (complex) terms [Buitelaar et al. 2006]
          • LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]
          • Lexical Markup Framework (LMF): ISO-standardized model for representing machine-readable lexica (agnostic about the connection with an ontology) [Francopoulo et al. 2007]
          • LexInfo: building on LMF as a core, a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information with ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]
  • 5. Case Study: Lexicalizing a bioinformatics ontology
      • Creating a LexInfo-based lexicon for the lexical enrichment of a bioinformatics ontology, namely the myGrid ontology (Wolstencroft et al., 2007).
      • Lexical information is derived from semantic lexicons such as WordNet (Fellbaum, 1998) and from a domain-related corpus.
      • Key points:
          • Morpho-syntactic behaviour of lexical elements is captured: part-of-speech (POS), decomposition, lemmatization and subcategorization.
          • The lexicalized terms, along with their linguistic information, are added to an OWL-based lexicon built on the LexInfo model.
  • 6. Case Study: Lexicalizing a bioinformatics ontology
      myGrid Ontology
      • Supports service description of bioinformatics resources through service annotation.
      • Manual annotation is a slow process: e.g. in Taverna/Feta only ~15-20% of services are functionally described. The result is a growing backlog of un-annotated services.
      • Some NLP-based attempts to automate service description that use the myGrid ontology have been reported.
      • Lexicalization of the myGrid ontology can improve the performance of such approaches.
  • 7. Case Study: Lexicalizing a bioinformatics ontology
      • LexInfo
          • A principled way to enrich ontologies with linguistic information.
          • Provides a framework for automatic construction of 'lexicalized ontologies' on top of existing ontologies and lexical resources (Buitelaar et al., 2009).
      • Main characteristics:
          • Two separate domains of discourse, kept apart by using different namespaces: the domain ontology and the LexInfo model.
          • The domain ontology defines the classes, properties and individuals in that domain.
          • The main entities in the lexical domain of discourse are instances of the class LexicalEntry.
          • LexInfo attaches lexical information (e.g. part-of-speech, morphology, subcategorization) to lexical entries.
  • 8. Rest of the talk
      • Methodology
          • Dual approach towards lexicalization of the myGrid ontology
          • Collection of bioinformatics corpus
          • Lexicalization of class labels
          • Lexicalization of property labels
      • Statistics, experiments and results
          • Semi-automatically created lexicon
          • Automatically generated lexicon
      • What’s next
  • 9. Methodology - I
      • Dual approach towards lexicalization of the myGrid ontology
          • Semi-automatically created LexInfo-based lexicon.
          • Automatically created lexicon using the LexInfo ontology lexicalization service.
      • Difference:
          • In the semi-automatically created lexicon, the linguistic information is mainly derived from the domain corpus and manually analyzed to verify correctness.
          • In automatic generation, a generic POS tagger and domain-independent lexical resources are used to derive morpho-syntactic behaviour from an automatic analysis of the labels of the concepts, properties and individuals in the ontology.
  • 10. Methodology - II
      • Collection of bioinformatics corpus
          • Domain-specific behaviour (linguistic information) of the lexical entries is derived from 2,691 full-text journal articles from BMC Bioinformatics.
          • The GeniaTagger is used to get POS information; the tags of interest are nouns, proper nouns, verbs and adjectives.
          • Syntactic information is derived using the Stanford parser.
          • Currently, we have worked only on the syntactic behaviour of properties (owl:ObjectProperty and owl:DatatypeProperty in particular) and not of classes.
  • 11. Methodology - III
      • Lexicalization of class labels (step-wise approach)
          1. A LexicalEntry is created for each Class in the domain ontology and is linked to the Class through the hasSense property.
          2. The LexicalEntry is instantiated as one of its sub-classes (e.g. Noun, Verb, Adjective, etc.).
          3. The POS tag is derived from a semantic lexicon such as WordNet and further supported by the associated domain corpus.
          4. The lexical form (Lemma, WordForm, etc.) is attached to the lexical entry through the corresponding relation: hasLemma or hasWordForm.
  • 12. Methodology - III
      • Lexicalization of class labels (single word)
          • Figure: linking a LexicalEntry with a domain Class, and attaching grammatical information and a lemma to the LexicalEntry.
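The four steps for a single-word class label can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LexicalEntry` dataclass, the `TOY_POS_LEXICON` dictionary (standing in for a WordNet or corpus lookup) and the fallback to Noun are hypothetical and are not part of the LexInfo API; only the names hasSense, hasLemma and hasWordForm come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LexicalEntry:
    pos: str                      # LexInfo sub-class name, e.g. "Noun"
    sense: str                    # ontology class this entry points to (hasSense)
    lemma: Optional[str] = None   # attached via hasLemma
    word_forms: List[str] = field(default_factory=list)  # attached via hasWordForm


# Hypothetical stand-in for a WordNet / domain-corpus POS lookup.
TOY_POS_LEXICON = {"alignment": "Noun", "perform": "Verb", "multiple": "Adjective"}


def lexicalize_class(class_name: str, pos_lexicon=TOY_POS_LEXICON) -> LexicalEntry:
    """Create a lexical entry for a single-word class label."""
    lemma = class_name.lower()
    # Steps 2-3: choose the sub-class from the POS lookup.
    # Falling back to "Noun" is an assumption made for this sketch.
    pos = pos_lexicon.get(lemma, "Noun")
    # Step 1: create the entry and link it to the class (hasSense).
    entry = LexicalEntry(pos=pos, sense=class_name)
    # Step 4: attach the lexical forms (hasLemma / hasWordForm).
    entry.lemma = lemma
    entry.word_forms.append(class_name)
    return entry


entry = lexicalize_class("Alignment")
print(entry)
```

A real lexicon would serialize these objects as OWL individuals in the lexicon namespace; the dataclass only mirrors the shape of the information.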
  • 13. Methodology - III
      • Lexicalization of class labels (multi-word)
          • LexInfo associates a LexicalEntry with a ListOfComponents, which holds an ordered list of Components; its size is given as a data property of the ListOfComponents.
          • Each Component is linked with a LexicalEntry.
          • A Component counts as a legitimate LexicalEntry if it is present in the myGrid ontology as a separate entity, or if it occurs substantively in the domain corpus.
  • 14. Methodology - III
      • Lexicalization of class labels (multi-word)
          • Figure: an example of morphological decomposition of a multi-word class label from the myGrid ontology.
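As a rough sketch of the multi-word case, the decomposition can be modelled as an ordered list of components whose size is stored alongside it. The class and function names below are hypothetical, the label is split on underscores only, and `known_entries` stands in for the validity check against the ontology and the domain corpus described above.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Component:
    position: int   # 1-based position in the label
    entry: str      # lemma of the LexicalEntry this component links to


@dataclass
class ListOfComponents:
    components: List[Component]

    @property
    def size(self) -> int:
        # Corresponds to the size data property of ListOfComponents.
        return len(self.components)


def decompose(label: str, known_entries: Set[str]) -> ListOfComponents:
    """Split an underscore-separated label into ordered components."""
    components = []
    tokens = [t for t in label.split("_") if t]
    for position, token in enumerate(tokens, start=1):
        lemma = token.lower()
        # A component is only accepted if it is a known entry, i.e. it exists
        # in the ontology or the domain corpus (simplified to a set here).
        if lemma not in known_entries:
            raise ValueError(f"no lexical entry for component {token!r}")
        components.append(Component(position, lemma))
    return ListOfComponents(components)


known = {"sequence", "similarity", "search"}
loc = decompose("Sequence_similarity_Search", known)
print(loc.size, [c.entry for c in loc.components])
```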
  • 15. Methodology - IV
      • Lexicalization of property labels (steps)
          • Both morphological decomposition and syntactic analysis of the property label are performed.
          • The property labels are automatically tokenized, and the tokens are then linked with LexicalEntries (as for classes).
          • On the syntactic level, the tokens are analyzed to attach their syntactic behaviour, which is then linked with subcategorization frames.
          • The LexInfo model provides various specializations of subcategorization frames, such as Transitive, TransitivePP, IntransitivePP, AdjectiveNP, NounPP and Noun2PP.
          • Syntactic arguments linked with the LexicalEntry (Subject, Object, PObject, etc.) are mapped to semantic arguments (Domain, Range, RangeOfProperty, etc.) of the corresponding object property.
  • 16. Methodology - IV
      • Lexicalization of property labels
          • In automatic lexicon generation, the lexical entries are derived automatically by processing the labels in the ontology using the LILAC grammar.
          • LILAC production rules state part-of-speech patterns that apply to the label. For example, a label with the structure “N Prep” gives rise to a lexicon entry of type “NounPP”.
          • Currently, LexInfo uses 73 rules to generate lexicons automatically (further details on the LexInfo homepage).
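The idea of mapping part-of-speech patterns to lexicon entry types can be illustrated with a tiny rule table. This is not the LILAC grammar or its syntax: only the “N Prep” to “NounPP” rule is taken from the slide, the single-verb rule is an illustrative assumption, and the toy tag dictionary replaces a real POS tagger.

```python
from typing import Optional, Tuple

# Toy POS dictionary; a real system would call a POS tagger instead.
TOY_TAGS = {"part": "N", "identifier": "N", "of": "Prep", "produces": "V"}

# (POS pattern, lexicon entry type). Only the first rule comes from the slide;
# the second is a hypothetical rule added for illustration.
RULES = [
    (("N", "Prep"), "NounPP"),
    (("V",), "Transitive"),
]


def tag_label(label: str) -> Tuple[str, ...]:
    """Tag each underscore-separated token; unknown tokens get the tag 'X'."""
    return tuple(TOY_TAGS.get(tok.lower(), "X") for tok in label.split("_"))


def entry_type(label: str) -> Optional[str]:
    """Return the entry type of the first rule whose pattern matches the label."""
    tags = tag_label(label)
    for pattern, frame in RULES:
        if tags == pattern:
            return frame
    return None


for label in ("part_of", "produces", "unknown_label"):
    print(label, "->", entry_type(label))
```

Because the rules are tried in order and match on the whole tag sequence, a wrong tag or an overly general pattern directly yields a wrong entry type.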
  • 17. Methodology - IV
      • Lexicalization of property labels
          • Figure: lexicalization of the ObjectProperty “produces”.
  • 18. Statistics - I
      • Statistics about the myGrid ontology

      Ontology construct                        Total number of occurrences
      Classes                                   475
        Single-word class labels                88
        Two-word class labels                   200
        Three-or-more-word class labels         187
      ObjectProperties                          8
        Single-word property labels             1
        Two-word property labels                4
        Three-or-more-word property labels      3
      DataProperties                            0
      Individuals                               0
  • 19. Statistics - II
      • Semi-automatically generated LexInfo-based lexicon of the myGrid ontology

      LexInfo construct    Specialized construct   Example label                    Entries in ‘myGrid Lexicon’
      LexicalEntries       Adjective               Multiple                         21
                           Noun                    Alignment                        752
                           Proper Noun             Medline                          253
                           Verb                    Perform                          4
                           NounPhrase              Sequence_similarity_Search       369
                           AdjectivePhrase         Tertiary_Structure_Prediction    16
                           VerbPhrase              Performs_task                    1
      Written-Form                                                                  1044
      List-of-Components                                                            387
      Syntactic-Behaviour  Transitive              produces                         4
                           NounPP                  is_part_of                       4
  • 20. Statistics - III
      • Automatically generated LexInfo-based lexicon of the myGrid ontology, produced with the LexInfo lexicon generation service

      LexInfo construct    Specialized construct   Example label                    Entries in ‘myGrid Lexicon’
      LexicalEntry         Adjective               local                            131
                           Noun                    Record                           973
                           Proper Noun             Maize                            15
                           Verb                    Perform                          19
                           NounPhrase              Genotype-phenotype-database      1069
                           ProperNounPhrase        UniProt                          1
                           VerbPhrase                                               0
      List-of-Components                                                            1071
      Syntactic-Behaviour  Transitive              produces                         3
                           NounPP                  is_part_of                       4
                           IntransitivePP          produced_by                      1
  • 21. Discussion
      Semi-automatically created lexicon: lexicalization of classes
      • Most of the LexicalEntries are of type Noun, NounPhrase and ProperNoun.
      • Not many Verb occurrences.
          • Class labels are mostly named using nouns, whereas object properties are typically named using verbs.
          • The small number of ObjectProperties (8 properties) resulted in a small number of verbs in the lexicon.
      • The number of Proper Nouns is 253, 32 of which are created from single-word Class names.
      • 387 ListOfComponents are created from the 387 multi-word class names in the myGrid ontology; 371 of them correspond to NounPhrases and 16 are AdjectivePhrases.
  • 22. Discussion
      Semi-automatically created lexicon: lexicalization of ObjectProperties
      • is_identifier_of and is_part_of are lexicalized as Nouns (identifier and part).
          • Their SyntacticBehavior is linked to a subcategorization frame of type NounPP (Noun: identifier, Prep: of and Noun: part, Prep: of).
      • performs_task and task_performed_by are lexicalized as a Verb (perform).
          • Their SyntacticBehavior is linked to a subcategorization frame of type Transitive.
          • The two properties are inverses of each other and are lexicalized using the same verb; however, the mapping of syntactic arguments to domain and range is inverted between the two.
      • produces and produced_by are lexicalized as a Verb (produce).
      • performs_task is recognized as a VerbPhrase with performs as a Verb and a Transitive subcategorization frame linked with it.
      • The syntactic behaviours of has_identifier and has_part are also modeled as NounPP.
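The inverted argument mapping for a pair of inverse properties can be sketched as follows. The slide only states that the mapping is inverted; which property maps Subject to Domain is an assumption here, and the `ArgumentMap` class, the `LEXICON` table and the example values (`BLAST_service`, `alignment`) are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArgumentMap:
    subject: str   # semantic argument filled by the syntactic Subject
    object: str    # semantic argument filled by the syntactic Object


# Both properties share the verb "perform" and a Transitive frame;
# only the syntactic-to-semantic mapping differs (assumed direction).
LEXICON = {
    "performs_task": ("perform", ArgumentMap(subject="domain", object="range")),
    "task_performed_by": ("perform", ArgumentMap(subject="range", object="domain")),
}


def verb_arguments(prop: str, domain_value: str, range_value: str) -> Tuple[str, str, str]:
    """Return (subject, verb, object) for a property assertion."""
    verb, mapping = LEXICON[prop]
    values = {"domain": domain_value, "range": range_value}
    return values[mapping.subject], verb, values[mapping.object]


# The same fact stated through either property yields the same verbal reading.
print(verb_arguments("performs_task", "BLAST_service", "alignment"))
print(verb_arguments("task_performed_by", "alignment", "BLAST_service"))
```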
  • 23. Discussion
      Automatically generated lexicon using the LexInfo service: lexicalization of classes (differences from the semi-automatically created lexicon)
      • The number of Adjectives has increased significantly to 131, and the number of ProperNouns has decreased steeply to 15.
          • The reason is that ProperNouns are incorrectly identified as Adjectives by our POS tagger (Stanford Tagger), e.g. DDBJ in DDBJ_Amino_Acid_Database and PIRSF in PIRSF_report are tagged as Adjectives.
          • This problem can be resolved by using domain corpora, or by consulting a domain thesaurus or dictionary.
      • The number of Verbs has increased to 19.
          • This is again due to a POS tagging error: gerunds such as “manipulating” and “predicting” are incorrectly identified as Verbs.
      • The identification of a ProperNounPhrase is incorrect due to a tokenization error.
          • “UniProt” is tokenized as two proper nouns, “uni” and “prot”, although it is a single word, i.e. the name of a bioinformatics database.
          • This can also be resolved using a domain corpus or thesaurus.
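One way the tokenization error could be avoided with a domain term list is sketched below. It assumes, purely for illustration, that the unwanted split comes from a camel-case rule; the slides do not say how the service tokenizes labels. The `tokenize_label` function and the `protected` set are hypothetical.

```python
import re
from typing import FrozenSet, List

# Matches an all-caps run (acronym), a capitalized or lowercase word, or digits.
_CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize_label(label: str, protected: FrozenSet[str] = frozenset()) -> List[str]:
    """Split a label on underscores and camel case, keeping protected terms whole."""
    tokens: List[str] = []
    for part in label.split("_"):
        if not part:
            continue
        if part in protected:
            # Domain term found in the thesaurus/dictionary: do not split it.
            tokens.append(part)
        else:
            tokens.extend(_CAMEL.findall(part))
    return tokens


domain_terms = frozenset({"UniProt"})
print(tokenize_label("UniProt"))                # naive split into two tokens
print(tokenize_label("UniProt", domain_terms))  # kept as a single token
print(tokenize_label("DDBJ_Amino_Acid_Database"))
```

The same lookup could also supply a ProperNoun tag for terms such as DDBJ or PIRSF before the generic POS tagger is consulted.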
  • 24. Discussion
      Automatically generated lexicon using the LexInfo service: lexicalization of ObjectProperties
      • ObjectProperties are mostly lexicalized correctly.
      • The only error is in the lexicalization of “produced_by”, which is recognized as IntransitivePP. This is caused by an error in the ontology lexicalization (LILAC) rules, which treat a past-participle verb followed by “by” as an occurrence of IntransitivePP.
  • 25. Implementation
      • Initial implementation of the LexInfo model as an API
          • Univ. Bielefeld
          • DERI, National Univ. of Ireland, Galway
          • https://lexinfo.googlecode.com/svn
  • 26. Future Work
      • Linguistically enriched ontology for improvement of service annotation
          • The linguistically enriched lexicon associated with the myGrid ontology can improve the performance of literature-based approaches for automatic annotation of bioinformatics web services.
      • Optimization of the LexInfo model by including WordNet and similar resources
          • Generate all possible lexicalizations of given ontological constructs by using synsets from WordNet, and extract semantically similar verbs from VerbNet and FrameNet.
      • The LexInfo API is currently under development
          • It allows the creation, management and serialization of ontology lexica according to the LexInfo model. An early prototype of a lexicon generation service based on the LexInfo model is also available at: http://code.google.com/p/lexinfo/
  • 27. Acknowledgments
      • Supported in part by the European Union under Grant No. 248458 for the Monnet project, as well as by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
      • Thanks to Thomas Wangler, Michael Sintek and Matthias Mantel for their valuable contributions in designing the LexInfo model and developing the LexInfo API.
  • 28. References
      • Afzal, H., Stevens, R. and Nenadic, G.: Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), LNCS 5554, Springer-Verlag: 535-549.
      • Afzal, H., Stevens, R. and Nenadic, G.: Towards Semantic Annotation of Bioinformatics Services: Building a Controlled Vocabulary. In Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008): 5-12.
      • Buitelaar, P., Declerck, T., Frank, A., Racioppa, S., Kiesel, M., Sintek, M., Engel, R., Romanelli, M., Sonntag, D., Loos, B., Micelli, V., Porzel, R. and Cimiano, P.: LingInfo: Design and Applications of a Model for the Integration of Linguistic Information in Ontologies. In Proceedings of OntoLex06, a workshop at LREC, Genoa, Italy.
      • Buitelaar, P., Cimiano, P., Haase, P. and Sintek, M.: Towards Linguistically Grounded Ontologies. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), Lecture Notes in Computer Science, Springer 2009.
      • Cimiano, P., Haase, P., Herold, M., Mantel, M. and Buitelaar, P.: LexOnto: A model for ontology lexicons for ontology-based NLP. In Proceedings of the OntoLex (From Text to Knowledge: The Lexicon/Ontology Interface) workshop at ISWC07 (International Semantic Web Conference).
      • Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M. and Soria, C.: Lexical markup framework: ISO standard for semantic information in NLP lexicons. In Proceedings of the Workshop of the GLDV Working Group on Lexicography at the Biennial Spring Conference of the GLDV.
  • 29. Resources Used
      • BMC Bioinformatics: http://www.biomedcentral.com/bmcbioinformatics/
      • Genia Tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
      • Stanford Parser: http://nlp.stanford.edu/downloads/lex-parser.shtml
      • Stanford Tagger: http://nlp.stanford.edu/software/tagger.shtml
      • TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/