Text mining tools for semantically enriching the scientific literature
Sophia Ananiadou
Director, National Centre for Text Mining
School of Computer Science, University of Manchester
Need for enriching the literature
• Need for semantic search i.e. beyond keywords
• Need for technologies enabling focused
  semantic search via the creation of semantic
  metadata from literature

 “The current scientific literature, were it to be
  presented in semantically accessible form,
  contains huge amounts of undiscovered
  science”
  Peter Murray-Rust, Data-driven science: A Scientist’s view.
  NSF/JISC Repositories Workshop, 2007
Impact of text mining
• Extraction of named entities (genes, proteins,
  metabolites, etc)
• Discovery of concepts allows semantic annotation of
  documents
   – Improves information access by going beyond index
     terms, enabling semantic querying
   – Improves clustering, classification of documents
   – Visualisation based on semantic metadata derived
     from text mining results
Beyond named entities: facts
• Extraction of relationships, events (facts)
  for knowledge discovery
  – Information extraction, more sophisticated
    annotation of texts (fact annotation)
  – Enables even more advanced semantic
    querying
Enriched annotation
• Text Mining provides enriched annotation
  layers
  – users will be able to pose an easily expressed semantic query
    and get back facts matching that query, rather than just sets
    of documents that they then have to read
    • Information Extraction and not just Information
      Retrieval
    • Fact extraction and not just sentence extraction
Annotations derived from Text Mining
[Figure: text processing pipeline. Raw (unstructured) text goes through part-of-speech tagging, named entity recognition and deep syntactic parsing to become annotated (structured) text. A lexicon and an ontology feed the text processing.]
Multi-layered annotations for the example sentence:
  Sentence:        Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .
  Part of speech:  NN IN NN VBZ VBN IN NN IN JJ NN NNS .
  Parse tree:      S, NP, VP and PP constituents over the sentence
  Named entities:  protein_molecule, organic_compound, cell_line (in sentence order)
  Event:           negative regulation
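The annotation layers on this slide can be sketched as standoff annotations over one shared token list. This is a minimal illustration rather than NaCTeM's actual data format: the `Span` class, the helper names and the exact token spans given to each entity and to the event are assumptions read off the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labelled token range: [start, end) indexes into TOKENS."""
    start: int
    end: int
    label: str


# Tokens and part-of-speech tags as they appear on the slide.
TOKENS = ["Secretion", "of", "TNF", "was", "abolished", "by", "BHA",
          "in", "PMA-stimulated", "U937", "cells", "."]
POS = ["NN", "IN", "NN", "VBZ", "VBN", "IN", "NN",
       "IN", "JJ", "NN", "NNS", "."]

# Assumed span boundaries: the slide gives the labels but not exact extents.
ENTITIES = [
    Span(2, 3, "protein_molecule"),   # TNF
    Span(6, 7, "organic_compound"),   # BHA
    Span(9, 11, "cell_line"),         # U937 cells
]
EVENTS = [Span(4, 5, "negative_regulation")]  # trigger word: abolished


def text_of(span):
    """Return the surface text covered by a span."""
    return " ".join(TOKENS[span.start:span.end])


def layers_for(index):
    """Collect every annotation layer that covers one token."""
    layers = {"token": TOKENS[index], "pos": POS[index]}
    for name, layer in (("entity", ENTITIES), ("event", EVENTS)):
        for span in layer:
            if span.start <= index < span.end:
                layers[name] = span.label
    return layers
```

Keeping each layer separate from the text means a new layer, such as parse constituents, can be added without touching the existing ones.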
Mining associations from MEDLINE
• FACTA: Finding Associated Concepts with
  Text Analysis
   – What diseases are related to a particular chemical?
   – What proteins are related to a particular disease?
   – etc.

• EBIMed http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
• PubMatrix http://pubmatrix.grc.nia.nih.gov/
     :
• FACTA http://text0.mib.man.ac.uk/software/facta/
   – Quick and interactive
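The kind of question FACTA answers ("what diseases are related to a particular chemical?") can be illustrated with plain co-occurrence counting. This is only a sketch: the slide does not describe FACTA's actual indexing or ranking, and the `DOCS` index, the placeholder concept names and the `associated` function are all hypothetical.

```python
from collections import Counter

# Hypothetical toy index: each abstract reduced to a set of
# (concept name, concept type) pairs found by named entity recognition.
DOCS = [
    {("chemical_A", "chemical"), ("disease_X", "disease")},
    {("chemical_A", "chemical"), ("disease_Y", "disease")},
    {("chemical_A", "chemical"), ("disease_X", "disease"), ("protein_P", "protein")},
    {("chemical_B", "chemical"), ("disease_Z", "disease")},
]


def associated(query, wanted_type, docs=DOCS):
    """Rank concepts of one type by how many documents they share with the query."""
    counts = Counter()
    for concepts in docs:
        names = {name for name, _ in concepts}
        if query not in names:
            continue
        for name, concept_type in concepts:
            if concept_type == wanted_type and name != query:
                counts[name] += 1
    return counts.most_common()
```

For example, `associated("chemical_A", "disease")` ranks `disease_X` (two shared documents) ahead of `disease_Y` (one). A real system would likely use a statistical association measure rather than raw counts.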
[Screenshots: FACTA interface, with the callouts "Query" and "Click!"]
Innovative Technologies applied to:
• Term recognition
• Named entity recognition
• Fact extraction
  (together: Semantic Mark-up)
  – semantic mark-up improves search
  – classifying, linking documents
  – knowledge discovery, hidden links,
    associations, hypothesis generation
Natural Language Processing technologies
• Part-of-speech tagging: GENIA
  – Tuned to biomedical text: 97-99% precision
• Dictionary-based named-entity recognition
• Deep parsing
  – Predicate argument relations (90%)
• Protein-protein interaction extraction
• Event / fact extraction
Automatic Term Recognition




http://www.nactem.ac.uk/software/termine/
Recognising and Disambiguating
Acronyms in Biomedical Literature




        http://www.nactem.ac.uk/software/acromine
Named-entity recognition

    The peri-kappa B site mediates human immunodeficiency
    virus type 2 enhancer activation in monocytes …
    [Entity labels shown under the sentence: DNA, virus, cell_type]

• Entity types (defined by ontologies)
   – Genes/protein names
   – Enzymes, substances, metabolites, etc.
   – GO ontology, KEGG, ChEBI, etc.
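Dictionary-based named entity recognition, listed earlier among the NLP technologies, can be sketched as a longest-match lookup against a lexicon. The `LEXICON` entries, the span each label covers and the `tag` function are assumptions made for illustration; the slide shows only the labels DNA, virus and cell_type under the sentence.

```python
# Hypothetical lexicon: lower-cased token sequences mapped to entity types.
LEXICON = {
    ("peri-kappa", "b", "site"): "DNA",
    ("human", "immunodeficiency", "virus", "type", "2"): "virus",
    ("monocytes",): "cell_type",
}
MAX_LEN = max(len(entry) for entry in LEXICON)


def tag(sentence):
    """Return (surface text, entity type) pairs, preferring the longest match."""
    tokens = sentence.split()
    lowered = [token.lower() for token in tokens]
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at i first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = LEXICON.get(tuple(lowered[i:i + n]))
            if label:
                found.append((" ".join(tokens[i:i + n]), label))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found


SENTENCE = ("The peri-kappa B site mediates human immunodeficiency "
            "virus type 2 enhancer activation in monocytes")
```

Calling `tag(SENTENCE)` returns the three entities in sentence order. Exact matching like this misses spelling variants, which is one reason resources with term variants matter.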
Leveraging resources
• Annotated texts (GENIA corpus, GENIA
  event corpus)
• Resources for bio-text mining
  – resource-building NLP tools for text-based
    knowledge harvesting (NaCTeM)
  – BioLexicon
    • Over 1.5M lexical entries for bio-text mining and
      growing….
    • Containing rich linguistic information for bio-text
      mining
Population Process
[Figure: populating the Bio-Lexicon.
 Existing repositories: chemical, disease, enzyme, species names; gene/protein names.
 Medline abstracts: named entity recognition yields new gene/protein names.
 Processing steps: subclustering of term variants; term mapping by normalization.
 Manual curation: terminological verbs.
 Verb subcategorization (on-going): verb subcategorization frames.
 All of these feed the Bio-Lexicon.]
Semantic search based on facts
• MEDIE: an interactive advanced IR
  system retrieving facts
• Performs a semantic search
• Core technology annotates texts
    – GENIA tagger   – syntactic structures
    – Enju (deep parser)   – facts
    – Dictionary-based named entity recognition
J. Tsujii
Medie system overview
[Figure: Off-line, the input textbase passes through a deep parser and an entity recognizer to produce a semantically-annotated textbase. On-line, a RegionAlgebra search engine takes a query and returns search results from that textbase.]
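The off-line/on-line split in this overview can be sketched as an index built once from parser output and then queried many times. This is a simplified stand-in: the real system uses the Enju deep parser and a region algebra search engine, neither of which is reproduced here. The `Fact` tuple, the hand-written parser output and the second, placeholder sentence are assumptions.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str   # logical subject, after normalising passives
    verb: str      # base form of the predicate
    obj: str       # logical object
    sentence: str  # source sentence


# Off-line stage: hypothetical output of a deep parser plus entity recognizer.
PARSED = [
    Fact("BHA", "abolish", "secretion of TNF",
         "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells."),
    Fact("protein_A", "activate", "protein_B",
         "Protein_A activates protein_B."),  # placeholder sentence
]


def build_index(facts):
    """Group facts by verb so on-line queries avoid a full scan."""
    index = {}
    for fact in facts:
        index.setdefault(fact.verb, []).append(fact)
    return index


def search(index, subject=None, verb=None, obj=None):
    """On-line stage: return facts matching every field that is given."""
    if verb is not None:
        candidates = index.get(verb, [])
    else:
        candidates = [fact for facts in index.values() for fact in facts]
    return [fact for fact in candidates
            if (subject is None or fact.subject == subject)
            and (obj is None or fact.obj == obj)]


INDEX = build_index(PARSED)
```

A query such as `search(INDEX, verb="abolish")` finds the TNF sentence even though it is phrased in the passive, which is the point of matching on predicate-argument structure rather than keywords.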
Sentence Retrieval System Using Semantic Representation: MEDIE
InfoPubMed
• An interactive Information Extraction system and
  an efficient PubMed search tool, helping users to
  find information about biomedical entities such
  as genes, proteins, and the interactions
  between them.
• System components
  – Deep parsing technology
  – Extraction of protein-protein interactions
  – Multi-window interface on a browser
InfoPubMed
Interactions and not just co-occurrences.
Calculated using ML and deep semantics.
Semantic Information Retrieval
http://nactem4.mc.man.ac.uk:8080/Kleio/

• KLEIO: a semantically enriched
  information retrieval system for biology
• Offers textual and metadata searches
  across MEDLINE
• Leverages terminology technologies
  – Named entity recognition: gene, protein,
    metabolite, organ, disease, symptom
KLEIO architecture
Fewer documents with more precise query
Linking and enriching pathways with text

• REFINE (BBSRC)
  – MCISB and NaCTeM (Kell, Ananiadou, Tsujii)
  – to integrate text mining techniques with
    visualisation technologies for better
    understanding of the evidence for biochemical
    and signalling pathways
  – to enrich pathway models encoded in the
    Systems Biology Markup Language (SBML)
    with evidence derived from text mining
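2 Steps for linking text with pathways
[Figure, read from bottom to top.
 Literature: "… IkappaB is phosphorylated …", "… Ikappa B ubiquitination …", "… degradation of IkB …"
 Step 1, Event Extraction, gives biological events (state pairs): IkB and IkB P; IkB and IkB U; IkB and a symbol lost in extraction.
 Step 2, Pathway Construction, gives pathways: IkB, IkB P, IkB U, then the same lost symbol.]
Tsujii-lab, Tokyo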
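The two steps, event extraction followed by pathway construction, can be sketched with the three literature snippets from this slide. Pattern matching stands in for the real event extraction here, and the state names, the fixed event order and the reading of the final step as degradation are assumptions based on the figure and the snippets.

```python
import re

# Spelling variants of the protein name seen in the snippets.
NAME = re.compile(r"I\s?kappa\s?B|IkB", re.IGNORECASE)

# Hypothetical trigger patterns, one per event type.
TRIGGERS = {
    "phosphorylation": re.compile(r"phosphorylat", re.IGNORECASE),
    "ubiquitination": re.compile(r"ubiquitinat", re.IGNORECASE),
    "degradation": re.compile(r"degrad", re.IGNORECASE),
}

# Assumed order of events and resulting states.
ORDER = ["phosphorylation", "ubiquitination", "degradation"]
STATE_AFTER = {
    "phosphorylation": "IkB-P",
    "ubiquitination": "IkB-U",
    "degradation": "degraded",
}


def extract_events(snippets):
    """Step 1: find (event type, normalised protein name) pairs in text."""
    events = []
    for text in snippets:
        if not NAME.search(text):
            continue
        for event_type, pattern in TRIGGERS.items():
            if pattern.search(text):
                events.append((event_type, "IkB"))
    return events


def build_pathway(events):
    """Step 2: chain the extracted events into an ordered list of states."""
    found = {event_type for event_type, _ in events}
    path = ["IkB"]
    for step in ORDER:
        if step in found:
            path.append(STATE_AFTER[step])
    return path


SNIPPETS = [
    "... IkappaB is phosphorylated ...",
    "... Ikappa B ubiquitination ...",
    "... degradation of IkB ...",
]
```

Here `build_pathway(extract_events(SNIPPETS))` links three separate mentions, each using a different spelling of the protein name, into one chain of states. Each link in that chain can keep a pointer back to the sentence that supports it.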
Event Annotation - Example
Statistics & References
• Statistics
  – 36,114 events have been identified in, and
    annotated on,
     • 1,000 Medline abstracts, which contain
     • 9,372 sentences
  – Kim, Jin-Dong, Tomoko Ohta and Jun'ichi
    Tsujii (2008) Corpus annotation for
    mining biomedical events from
    literature. BMC Bioinformatics
  – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA
Acknowledgements
• Junichi Tsujii and his lab (University of Tokyo) MEDIE,
  InfoPubMed, event annotation
• Yoshimasa Tsuruoka (NER, FACTA, KLEIO, REFINE)
• Naoaki Okazaki (TerMine, AcroMine)
• Yutaka Sasaki (BioLexicon, NER, KLEIO)
• John McNaught (BioLexicon, BOOTStrep project)
• Chikashi Nobata (KLEIO)
• Douglas Kell (REFINE)

Text mining tools for semantically enriching scientific literature

  • 1. Text mining tools for semantically enriching the scientific literature Sophia Ananiadou Director National Centre for Text Mining School of Computer Science University of Manchester
  • 2. Need for enriching the literature • Need for semantic search i.e. beyond keywords • Need for technologies enabling focused semantic search via the creation of semantic metadata from literature “The current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science” Peter Murray-Rust, Data-driven science: A Scientist’s view. NSF/JISC Repositories Workshop, 2007
  • 3. Impact of text mining • Extraction of named entities (genes, proteins, metabolites, etc) • Discovery of concepts allows semantic annotation of documents – Improves information access by going beyond index terms, enabling semantic querying – Improves clustering, classification of documents – Visualisation based on semantic metadata derived from text mining results
  • 4. Beyond named entities: facts • Extraction of relationships, events (facts) for knowledge discovery – Information extraction, more sophisticated annotation of texts (fact annotation) – Enables even more advanced semantic querying
  • 5. Enriched annotation • Text Mining provides enriched annotation layers – the user will be able to carry out an easily expressed semantic query which will deliver facts matching that semantic query rather than just sets of documents he has to read… • Information Extraction and not just Information Retrieval • Fact extraction and not just sentence extraction
  • 6. Annotations derived from Text Mining [Figure: multi-layered annotations] • Pipeline: raw (unstructured) text → part-of-speech tagging → named entity recognition → deep syntactic parsing → annotated (structured) text; text processing draws on a lexicon and an ontology • Example sentence: “Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells.” – Part-of-speech layer: NN IN NN VBZ VBN IN NN IN JJ NN NNS . – Syntactic layer: parse tree (S, NP, VP, PP) – Semantic layer: protein_molecule, organic_compound, cell_line, negative regulation
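The layered annotation on slide 6 can be pictured as standoff spans over one token sequence. The following is a minimal sketch only: the `Span` class, the layer names, and the exact token ranges assigned to each semantic label are assumptions made for illustration, since the slide lists the labels without their alignments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)
    layer: str  # hypothetical layer name: "pos", "entity" or "event"
    label: str


# Example sentence and part-of-speech tags as given on the slide.
TOKENS = ["Secretion", "of", "TNF", "was", "abolished", "by", "BHA", "in",
          "PMA-stimulated", "U937", "cells", "."]
POS = ["NN", "IN", "NN", "VBZ", "VBN", "IN", "NN", "IN", "JJ", "NN", "NNS", "."]


def build_layers():
    """Return every annotation layer as a flat list of standoff spans."""
    spans = [Span(i, i + 1, "pos", tag) for i, tag in enumerate(POS)]
    # Assumed alignments: the slide names these labels but not their spans.
    spans += [
        Span(2, 3, "entity", "protein_molecule"),   # TNF
        Span(6, 7, "entity", "organic_compound"),   # BHA
        Span(9, 11, "entity", "cell_line"),         # U937 cells
        Span(4, 5, "event", "negative regulation"), # abolished
    ]
    return spans


def covering(spans, token_index):
    """List (layer, label) pairs for all spans that cover one token."""
    return [(s.layer, s.label) for s in spans if s.start <= token_index < s.end]


def text_of(span):
    """Recover the surface text of a span from the shared token list."""
    return " ".join(TOKENS[span.start:span.end])


SPANS = build_layers()
```

Keeping the layers as spans over shared token indices means a new layer can be added without touching the text or the existing layers.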
  • 7. Mining associations from MEDLINE • FACTA: Finding Associated Concepts with Text Analysis – What diseases are related to a particular chemical? – What proteins are related to a particular disease? – etc. • EBIMed http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp • PubMatrix http://pubmatrix.grc.nia.nih.gov/ • FACTA http://text0.mib.man.ac.uk/software/facta/ – Quick and interactive
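Questions such as "what diseases are related to a particular chemical?" can be approached by counting which concepts co-occur with a query concept in the same abstract. The sketch below is a generic co-occurrence ranking with pointwise mutual information as a tie-breaker; it is not taken from the slides and does not claim to reproduce how FACTA, EBIMed or PubMatrix score associations. The function name and the placeholder concept identifiers (`chem:A`, `dis:X`, and so on) are hypothetical.

```python
import math
from collections import Counter


def associated_concepts(docs, query):
    """Rank concepts that co-occur with `query`.

    `docs` is a list of sets, each holding the concept identifiers
    recognised in one abstract. Returns (concept, co-occurrence count,
    PMI) tuples, sorted by count, then PMI, then name.
    """
    n = len(docs)
    doc_freq = Counter(c for d in docs for c in d)
    co_freq = Counter(c for d in docs if query in d for c in d if c != query)
    ranked = []
    for concept, k in co_freq.items():
        # PMI = log2( P(query, concept) / (P(query) * P(concept)) )
        pmi = math.log2((k * n) / (doc_freq[query] * doc_freq[concept]))
        ranked.append((concept, k, pmi))
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked


# Toy, made-up input: five "abstracts" with placeholder concept ids.
DOCS = [
    {"chem:A", "dis:X"},
    {"chem:A", "dis:X", "prot:P"},
    {"chem:A", "dis:Y"},
    {"dis:Y", "prot:P"},
    {"dis:X"},
]

RANKED = associated_concepts(DOCS, "chem:A")
```

With a precomputed index from concept to abstracts, the same counts can be served quickly enough for interactive use.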
  • 11. Innovative Technologies applied to: • Term recognition • Named entity recognition • Fact extraction Semantic Mark-up – semantic mark-up improves search – classifying, linking documents – knowledge discovery, hidden links, associations, hypothesis generation
  • 12. Natural Language Processing technologies • Part-of-speech tagging: GENIA – Tuned to biomedical text: 97-99% precision • Dictionary-based named-entity recognition • Deep parsing – Predicate argument relations (90%) • Protein-protein interaction extraction • Event / fact extraction
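Dictionary-based named-entity recognition, listed on slide 12, can be sketched as a greedy longest-match lookup against a lexicon. This is a simplified illustration under stated assumptions: the function name, the `max_len` parameter and the three-entry lexicon are hypothetical (the entries reuse labels shown on slide 6), and real systems add normalisation and disambiguation on top.

```python
def dictionary_ner(tokens, lexicon, max_len=4):
    """Greedy longest-match dictionary lookup.

    `lexicon` maps a lower-cased, space-joined term to an entity type.
    Returns (start, end, type) tuples with `end` exclusive.
    """
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "U937 cells" beats "U937".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in lexicon:
                matches.append((i, i + n, lexicon[key]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return matches


# Hypothetical mini-lexicon; labels follow the example on slide 6.
LEXICON = {
    "tnf": "protein_molecule",
    "bha": "organic_compound",
    "u937 cells": "cell_line",
}

SENTENCE = ["Secretion", "of", "TNF", "was", "abolished", "by", "BHA", "in",
            "PMA-stimulated", "U937", "cells", "."]

ENTITIES = dictionary_ner(SENTENCE, LEXICON)
```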
  • 16. Recognising and Disambiguating Acronyms in Biomedical Literature http://www.nactem.ac.uk/software/acromine
  • 17. Named-entity recognition • Example: “The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …” (entity labels shown: DNA, virus, cell_type) • Entity types (defined by ontologies) – Genes/protein names – Enzymes, substances, metabolites, etc. – GO ontology, KEGG, ChEBI, etc.
  • 19. Leveraging resources • Annotated texts (GENIA corpus, GENIA event corpus) • Resources for bio-text mining – resource-building NLP tools for text-based knowledge harvesting (NaCTeM) – BioLexicon • Over 1.5M lexical entries for bio-text mining and growing…. • Containing rich linguistic information for bio-text mining
  • 20. Population Process [Figure: BioLexicon population process] • Inputs: existing repositories (chemical, disease, enzyme, species names; gene/protein names); Medline abstracts • Steps: subclustering of term variants; named entity recognition (new gene/protein names); term mapping by normalization; verb subcategorization (terminological verbs, verb subcategorization frames) • Output: BioLexicon • Manual curation on-going
  • 21. Semantic search based on facts • MEDIE: an interactive advanced IR system retrieving facts • Performs a semantic search • Core technology annotates texts – GENIA tagger – syntactic structures – Enju (deep parser) – facts – Dictionary-based named entity recognition J. Tsujii
  • 22. MEDIE system overview [Figure: system architecture] • Off-line: input textbase → deep parser and entity recognizer → semantically-annotated textbase • On-line: query → region algebra search engine → search results
  • 23. Sentence Retrieval System Using Semantic Representation MEDIE
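Retrieving facts rather than documents can be illustrated with a small subject/verb/object index. The sketch below is a toy stand-in, not MEDIE's implementation: the slides describe a deep parser and a region algebra search engine, whereas this example uses a plain in-memory filter. The `Fact` type, the `search` function, and the way the slide 6 sentence is split into subject, verb and object are all assumptions.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    verb: str      # base form of the predicate
    obj: str
    sentence: str  # source sentence the fact was extracted from


def search(facts, subject=None, verb=None, obj=None):
    """Return facts matching every field that is not None (case-insensitive)."""
    def ok(wanted, actual):
        return wanted is None or wanted.lower() == actual.lower()

    return [f for f in facts
            if ok(subject, f.subject) and ok(verb, f.verb) and ok(obj, f.obj)]


FACTS = [
    # Assumed analysis of the passive example sentence from slide 6:
    # the agent "BHA" becomes the logical subject.
    Fact("BHA", "abolish", "secretion of TNF",
         "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells."),
    # Placeholder fact, included only so the filter has something to reject.
    Fact("entity_1", "activate", "entity_2", "(placeholder sentence)"),
]

HITS = search(FACTS, verb="abolish")
```

Because the passive sentence is stored with its logical subject, a query for what BHA abolishes matches it even though "BHA" appears after the verb in the text.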
  • 25. InfoPubMed • An interactive Information Extraction system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them. • System components – Deep parsing technology – Extraction of protein-protein interactions – Multi-window interface on a browser
  • 26. InfoPubMed Interactions and not just co-occurrences. Calculated using ML and deep semantics.
  • 27. Semantic Information Retrieval http://nactem4.mc.man.ac.uk:8080/Kleio/ • KLEIO: a semantically enriched information retrieval system for biology • Offers textual and metadata searches across MEDLINE • Leverages terminology technologies • Named entity recognition: gene, protein, metabolite, organ, disease, symptom
  • 31. Linking and enriching pathways with text • REFINE (BBSRC): MCISB and NaCTeM (Kell, Ananiadou, Tsujii) – to integrate text mining techniques with visualisation technologies for better understanding of the evidence for biochemical and signalling pathways – to enrich pathway models encoded in the Systems Biology Markup Language (SBML) with evidence derived from text mining
  • 32. 2 Steps for linking text with pathways [Figure: IkB pathway diagram] • Event extraction: literature → biological events – “… IkappaB is phosphorylated …” – “… Ikappa B ubiquitination …” – “… degradation of IkB …” • Pathway construction: biological events → pathways Tsujii-lab, Tokyo
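The first of the two steps, turning literature snippets into biological events, can be sketched with trigger-word matching and name normalisation over the three IkB snippets on slide 32. This is a deliberately naive stand-in chosen for brevity: the slides elsewhere describe deep parsing and machine learning, not keyword rules. The trigger table, the event-type labels and the regular expression are assumptions.

```python
import re

# Hypothetical trigger stems mapped to illustrative event-type labels.
TRIGGERS = {
    "phosphorylat": "Phosphorylation",
    "ubiquitinat": "Ubiquitination",
    "degradation": "Degradation",
}

# Matches the spelling variants on the slide: IkappaB, Ikappa B, IkB.
IKB_NAME = re.compile(r"\bI(?:kappa\s?|k)B\b")


def extract_events(snippets):
    """Return (event_type, normalised_entity) pairs, one per matching snippet."""
    events = []
    for text in snippets:
        if not IKB_NAME.search(text):
            continue
        lowered = text.lower()
        for stem, event_type in TRIGGERS.items():
            if stem in lowered:
                events.append((event_type, "IkB"))
                break  # keep at most one event per snippet in this sketch
    return events


SNIPPETS = [
    "… IkappaB is phosphorylated …",
    "… Ikappa B ubiquitination …",
    "… degradation of IkB …",
]

EVENTS = extract_events(SNIPPETS)
```

Normalising the three spellings to one identifier is what allows the second step to attach all three events to the same pathway node.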
  • 34. Statistics & References • Statistics – 36,114 events have been identified in and annotated on 1,000 Medline abstracts, which contain 9,372 sentences • Reference – Kim, Jin-Dong, Tomoko Ohta and Jun'ichi Tsujii (2008) Corpus annotation for mining biomedical events from literature. BMC Bioinformatics – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA
  • 35. Acknowledgements • Junichi Tsujii and his lab (University of Tokyo) MEDIE, InfoPubMed, event annotation • Yoshimasa Tsuruoka (NER, FACTA, KLEIO, REFINE) • Naoaki Okazaki (TerMine, AcroMine) • Yutaka Sasaki (BioLexicon, NER, KLEIO) • John McNaught (BioLexicon, BOOTStrep project) • Chikashi Nobata (KLEIO) • Douglas Kell (REFINE)