
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Analysis of Scientific Literature Stefan Geißler (Kairntech, Germany)

Dr. Christoph Haxel at Dr. Haxel Consult
11 Oct 2022

  1. “Automatic relation extraction from biomedical content for knowledge graph creation and maintenance” Stefan Geißler - Kairntech “AI SDV” Vienna, Oct 11, 2022 info@kairntech.com
  2. Introducing Kairntech
     • French/German NLP/AI specialist, created in December 2018
     • Grenoble, France (headquarters), Paris, and Heidelberg, Germany
     • Selected customers: Boehringer-Ingelheim, SCAI-Fraunhofer, AFP, French Government, Groupe Revue Fiduciaire,…
     • Mission: Making NLP / Machine Learning accessible for domain experts (“No code / Low code platform”)
     • Stefan Geißler, Co-founder, Kairntech, Heidelberg
  3. Project Intro & Background
     • Spring 2021, Fraunhofer SCAI: “We are looking for an automatic procedure to extract information about entities and their relations from a larger corpus on psychiatric disorders and render that in the BEL language. Can you do that?”
     • BEL (Biological Expression Language) captures and encodes biological relations
     • “Phosphorylation of glycogen synthase kinase 3beta at Threonine, 668 increases the degradation of Amyloid precursor protein.”
     (Copied from Martin Hofmann-Apitius)
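
To give a rough idea of what "rendering in BEL" means for the example sentence on this slide, the sketch below assembles a BEL-style statement from its parts. The helper names (`protein`, `bel_statement`) are hypothetical, and the choice of HGNC symbols and the exact encoding are assumptions for illustration; the slides do not show the encoding actually produced in the project.

```python
from typing import Optional


def protein(namespace: str, symbol: str, modification: Optional[str] = None) -> str:
    """Render a protein abundance term, optionally with a modification argument."""
    if modification:
        return f"p({namespace}:{symbol}, {modification})"
    return f"p({namespace}:{symbol})"


def bel_statement(subject: str, relation: str, obj: str) -> str:
    """Join subject, relation and object into one BEL-style statement."""
    return f"{subject} {relation} {obj}"


# Assumed encoding of the slide's example sentence (illustrative only).
subject = protein("HGNC", "GSK3B", "pmod(Ph, Thr, 668)")
obj = f"deg({protein('HGNC', 'APP')})"
statement = bel_statement(subject, "increases", obj)
print(statement)
```

The point of the subject/relation/object split is that each statement maps directly onto one edge of a knowledge graph.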
  4. The challenge: updating knowledge graphs
     • Existing knowledge graphs become outdated quickly; updating them is time-consuming and challenging
     • Hypotheses:
       • Data volumes are large, but not infinite
       • Knowledge is complex, but can be formalized
     • [Figure: number of articles on PubMed on “TAU phosphorylation”]
     (Copied from Martin Hofmann-Apitius)
  5. Assessing the “Pharmacome”
     • Previous efforts at SCAI: manual creation of KGs on genes, proteins, and drugs for neurodegenerative diseases
     • Graphs often involve thousands of genes, proteins, and their interactions
     • On Alzheimer’s disease alone, SCAI built 24 specific subgraphs
     • Main goal: find information for drug repurposing. Which druggable, known substances have a therapeutic effect on the investigated diseases?
     (Copied from Martin Hofmann-Apitius)
  6. Relevant but hard to find: Indirect relations
     • “A interacts with B which stimulates C which blocks D …”
     • Chains of interactions over multiple entities are notoriously hard to find, often unknown even to individuals who are aware of the specific isolated steps
     • Approach: harvest interactions automatically and introduce them into large, machine-readable, indication-wide Knowledge Graphs
     (Copied from Martin Hofmann-Apitius)
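
Once individual interactions are stored as graph edges, a chain like the one quoted on this slide becomes a path search. The following is a minimal sketch using breadth-first search over (source, relation, target) triples; the function name `find_chain` and the toy edges are assumptions for illustration, not part of the presented system.

```python
from collections import deque


def find_chain(edges, start, goal):
    """Return the shortest chain [entity, relation, entity, ...] from start to goal, or None."""
    graph = {}
    for source, relation, target in edges:
        graph.setdefault(source, []).append((relation, target))

    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, target in graph.get(node, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, path + [relation, target]))
    return None


# Toy edges mirroring the slide's example chain.
edges = [
    ("A", "interacts with", "B"),
    ("B", "stimulates", "C"),
    ("C", "blocks", "D"),
]
chain = find_chain(edges, "A", "D")
print(" ".join(chain))
```

Each edge may come from a different publication, which is why the full chain can be absent from any single paper yet recoverable from the graph.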
  7. Towards Automatic Knowledge Graph Creation: Relation Extraction
     Knowledge Graphs:
     • Network of entities and the relations between them
     • Facilitate investigating many relevant questions (compared to traditional relational DBs)
     • E.g. in bio-sciences: pathways (“A stimulates B which blocks C which increases D…”)
     • Growing adoption in the industry
     • Many intuitive data management systems
     Relation Extraction:
     • Challenging NLP task: identify relations from natural language text
     • Co-occurrence: a relation is assumed to hold when entities occur together in text
       • Benefits: fast, simple
       • Drawbacks: overgeneration, no relation type
     • Hand-crafted rules: apply a complex grammar
       • Benefits: can be very accurate, can detect types of relations
       • Drawbacks: complex process of rule building; how to model relations beyond sentence level?
     • Machine learning: training on annotated text
       • Benefits: highest quality
       • Drawbacks: data- and computation-intensive
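
The co-occurrence approach listed on this slide can be sketched in a few lines, which also makes its drawbacks visible: every pair of entities sharing a sentence is reported, with no relation type. The function name, the naive sentence splitter, and the toy text below are assumptions for illustration.

```python
import re
from itertools import combinations


def cooccurrence_pairs(text, entities):
    """Return every unordered entity pair that appears together in one sentence."""
    pairs = set()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Crude substring matching; a real system would use entity recognition.
        present = sorted(e for e in entities if e.lower() in lowered)
        pairs.update(combinations(present, 2))
    return pairs


# Toy text written for this example.
text = "GSK3B phosphorylates TAU. APP is cleaved by BACE1. TAU was discussed."
entities = ["GSK3B", "TAU", "APP", "BACE1"]
pairs = cooccurrence_pairs(text, entities)
print(sorted(pairs))
```

Note that the output says nothing about whether the pair is linked by "increases", "decreases", or merely mentioned together, which is the gap the rule-based and machine-learning approaches address.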
  8. Kairntech Studio (build a pipeline)
     • Import documents: TXT, PDF, Word, HTML, XML, JSON…
     • Create a training dataset: explore and label text with manual or assisted text annotation tools
     • Create and compare learning models, using built-in and state-of-the-art algorithms
     • Create NLP pipelines, combining models, client taxonomy, knowledge graphs, technical components…
     • Usage via a no-code, easy-to-use web application or via REST API
  9. Kairntech Server (run the pipeline in production)
     • Access via REST API
     • Knowledge applications: large-scale entity recognition; multi-topic; multi-lingual; constantly updated; linked to background information
     NLP Pipeline:
     • Link entities to WikiData taxonomies
     • Refine entity extraction: custom housekeeping to improve recognition quality (filtering)
     • Compute additional entity properties: proteins receive a label about protein modification (e.g. phosphorylation)
     • Build relations between entities: apply the custom-trained relation extraction model
     • Format results to output: return results in BEL format for integration in the Knowledge Graph environment
  10. Entity Linking & Namespaces
     Entities are not just strings; they need to be associated with the real-world objects they refer to.
     A subset of namespaces is used:
     • HGNC (HUGO Gene Nomenclature Committee) for proteins
     • MeSH (Medical Subject Headings) for pathologies
     • ChEBI (Chemical Entities of Biological Interest) for drug/chemical abundances
     • GO (Gene Ontology) for biological processes
     Recognized entities are:
     • Disambiguated (“cancer”: animal or disease?)
     • Scored (prominence/weight of the concept in this context)
     • Normalized (synonyms → preferred form, e.g. “NIDDM” → “Diabetes Mellitus Type 2”)
     • Linked: entities are linked to world knowledge
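
The normalization step on this slide (synonym → preferred form within a namespace) can be pictured as a lookup. The sketch below is a deliberately simple dictionary-based version; the table contents beyond the slide's “NIDDM” example and the function name `normalize` are assumptions, and real entity linking additionally needs the context-dependent disambiguation and scoring listed above.

```python
# Hypothetical synonym table: lower-cased mention -> (namespace, preferred form).
SYNONYMS = {
    "niddm": ("MeSH", "Diabetes Mellitus Type 2"),
    "type 2 diabetes": ("MeSH", "Diabetes Mellitus Type 2"),
}


def normalize(mention):
    """Map a text mention to (namespace, preferred form), or None if unknown."""
    key = " ".join(mention.lower().split())  # case-fold and collapse whitespace
    return SYNONYMS.get(key)


print(normalize("NIDDM"))
print(normalize("Type  2 diabetes"))
```

Returning the namespace together with the preferred form keeps the result usable as a namespaced identifier in the later BEL output step.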
  11. NLP Pipeline for Entity Recognition & Linking: Example!
  12. Relationship extraction
     • Trained on a relation training data set
     • The table below lists the most important relations addressed in the project
     • The model determines whether a relation holds between any pair of entities or not (“NoRelation”) and, if yes, which one
     • Quality decreases with smaller numbers of training examples per relationship type (as expected)
     • Manual inspection of (a sample of) the results from Kairntech by SCAI experts: ~73% of the relations returned by Kairntech are valid
     RELATION
     NoRelation
     increases
     decreases
     regulates
     positiveCorrelation
     association
     negativeCorrelation
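
The pair-wise decision described on this slide can be sketched as a loop over ordered entity pairs, where a classifier assigns one of the listed labels and “NoRelation” pairs are discarded. The classifier below is only a hard-coded placeholder standing in for a trained model; `stub_model`, `extract_relations`, and the example sentence are assumptions for illustration.

```python
from itertools import permutations

# Relation labels as listed on the slide.
RELATION_LABELS = [
    "NoRelation", "increases", "decreases", "regulates",
    "positiveCorrelation", "association", "negativeCorrelation",
]


def stub_model(sentence, source, target):
    """Placeholder for a trained classifier: returns hard-coded labels."""
    known = {
        ("Drug X", "protein Y"): "increases",
        ("Drug X", "protein Z"): "decreases",
    }
    return known.get((source, target), "NoRelation")


def extract_relations(sentence, entities, classify):
    """Classify every ordered entity pair and keep those with a real relation."""
    relations = []
    for source, target in permutations(entities, 2):
        label = classify(sentence, source, target)
        if label not in RELATION_LABELS:
            raise ValueError(f"unexpected label: {label}")
        if label != "NoRelation":
            relations.append((source, label, target))
    return relations


sentence = "Drug X increases protein Y and decreases protein Z."
relations = extract_relations(sentence, ["Drug X", "protein Y", "protein Z"], stub_model)
print(relations)
```

Because most entity pairs in a text are unrelated, “NoRelation” is typically the dominant class, which is one reason the rarer relation types suffer when training examples are scarce.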
  13. Complete NLP Pipeline with Relationship Extraction: Example!
  14. Results displayed as a Knowledge Graph: Example! “Which proteins are known to be positively correlated both with Schizophrenia as well as Bipolar Disorders?”
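
A question like the one on this slide reduces to a set intersection once relations are stored as triples. The sketch below uses an in-memory list of triples with placeholder protein names; the data and the function name `correlated_with` are made up for illustration and do not reflect actual project results or the graph database used.

```python
# Placeholder triples (subject, relation, object); not real findings.
triples = [
    ("PROTEIN_A", "positiveCorrelation", "Schizophrenia"),
    ("PROTEIN_A", "positiveCorrelation", "Bipolar Disorder"),
    ("PROTEIN_B", "positiveCorrelation", "Schizophrenia"),
    ("PROTEIN_C", "negativeCorrelation", "Bipolar Disorder"),
    ("PROTEIN_C", "positiveCorrelation", "Schizophrenia"),
]


def correlated_with(triples, disease, relation="positiveCorrelation"):
    """Return all subjects linked to the disease by the given relation."""
    return {s for s, r, o in triples if r == relation and o == disease}


shared = correlated_with(triples, "Schizophrenia") & correlated_with(triples, "Bipolar Disorder")
print(sorted(shared))
```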
  15. “We now have relation extraction accuracies of between 70% and 80%. Not long ago you had to be happy to get that for just entity extraction. Having this for relations now is outstanding.” (Martin Hofmann-Apitius, SCAI)
     Automated relationship extraction with the Kairntech Natural Language Processing Pipeline: Link entities to taxonomies → Refine entity extraction → Create relations between entities → BEL output
  16. Ongoing work
     • Speed up the analysis by parallelizing the process: make use of SCAI’s high-performance computing cluster and leverage SCAI expertise in parallelizing complex software processes
     • Investigate extension to other therapeutic areas
     • Define a joint approach and offering for industry use cases on:
       • Computing topic-specific knowledge graphs
       • Updating/extending knowledge graphs
     • Expand the approach to cover large chunks (all?) of Medline?
     • Joint (Kairntech & SCAI) communication efforts: webinars, publications
     • Cf. www.biorxiv.org/content/10.1101/2022.03.07.483233v1
  17. Conclusion / findings
     • Kairntech off-the-shelf entity extraction performs well even in this highly specific subdomain
     • The Kairntech notion of processing pipelines makes it possible to define sophisticated processing chains (here: entity recognition, application of a specific custom model (ModType), relation extraction, output encoding into BEL syntax)
     • Relation results were assessed and declared useful by SCAI experts after detailed manual analysis
     • Relation extraction from a large literature corpus to feed Knowledge Graphs is feasible
  18. Thank you for your attention! info@kairntech.com