Ce diaporama a bien été signalé.
Le téléchargement de votre SlideShare est en cours. ×

AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Andersson (Artificial Researcher IT GmbH, Austria)

Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité

Consultez-les par la suite

1 sur 23 Publicité

AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Andersson (Artificial Researcher IT GmbH, Austria)

Télécharger pour lire hors ligne

In the patent domain, all types of issues, from very specific search requirements to the linguistic characteristics of the text domain, are accentuated. Consequently, to develop patent text mining tools for scientists and patent experts, we need to understand their daily work tasks, as well as the linguistic character of the text genre (i.e., patentese). Patent text is a mixture of legal and domain-specific terms. In processing technical English texts, a multi-word unit method is often deployed as a word-formation strategy to expand the working vocabulary, i.e., introducing a new concept without the invention of an entirely new word. This productive word formation is a well-known challenge for traditional natural language processing tools utilizing supervised machine learning algorithms due to limited domain-specific training data. Deep learning technologies have been introduced to overcome the reduction in performance of traditional NLP tools. In the Artificial Researcher technologies, we have integrated explicit and implicit linguistic knowledge into the deep learning algorithms, essential for domain-specific text mining tools. In this talk, we will present a step-by-step process of how we have developed the mentioned text mining tools. For the final outline, we will also demonstrate how these tools can be integrated in a cross-genre passage retrieval system, based on a technology from 2016 that still holds the state-of-the-art within the patent text mining research community in 2022.

In the patent domain, all types of issues, from very specific search requirements to the linguistic characteristics of the text domain, are accentuated. Consequently, to develop patent text mining tools for scientists and patent experts, we need to understand their daily work tasks, as well as the linguistic character of the text genre (i.e., patentese). Patent text is a mixture of legal and domain-specific terms. In processing technical English texts, a multi-word unit method is often deployed as a word-formation strategy to expand the working vocabulary, i.e., introducing a new concept without the invention of an entirely new word. This productive word formation is a well-known challenge for traditional natural language processing tools utilizing supervised machine learning algorithms due to limited domain-specific training data. Deep learning technologies have been introduced to overcome the reduction in performance of traditional NLP tools. In the Artificial Researcher technologies, we have integrated explicit and implicit linguistic knowledge into the deep learning algorithms, essential for domain-specific text mining tools. In this talk, we will present a step-by-step process of how we have developed the mentioned text mining tools. For the final outline, we will also demonstrate how these tools can be integrated in a cross-genre passage retrieval system, based on a technology from 2016 that still holds the state-of-the-art within the patent text mining research community in 2022.

Publicité
Publicité

Plus De Contenu Connexe

Similaire à AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Andersson (Artificial Researcher IT GmbH, Austria) (20)

Plus par Dr. Haxel Consult (20)

Publicité

Plus récents (20)

AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Andersson (Artificial Researcher IT GmbH, Austria)

  1. 1. Domain Knowledge makes Artificial Intelligence Smart Linda Andersson AI-SDV 2022 © Artificial Researcher,2022 1
  2. 2. Linda Andersson CEO/CTO Artificial Researcher IT GmbH Awards • 2018 Commercial Viability award by the Austrian Angel Investors Association. • 2017 PhD high potential R&D award, i2c Award, Vienna University of Technology • 1999 Finalist in a Venture Cup held at Mälardalens University College. Research fields • Natural Language Understanding • Natural Language Processing • Information Extraction • Information Retrieval Academic merits • Information design 1998-2001 • General Linguistics 2001-2003 • Computational Linguistics 2006-2009 • Language Engineering 2004 • Library & Information Science 2004-2006 • Computer Science 2009-2019 © Artificial Researcher,2022 2
  3. 3. “There is little you can do about prior art which wasn’t found in the search that you made before filing the patent application.” Barry Franks Hynell, “Opposition before the EPO”, preface to Winning with IP: managing intellectual property today, ISBN 978-1-9998329-6-4 © Artificial Researcher,2022 3
  4. 4. Information Need ”Prior Art” No 37435 Benz Patent – Motorwagen (1886 ) Steering mechanism Search for: • Problem and Solution (A) • Scope of the invention (A,B,C) • Specific technical details (B,C) Engine function A C C The transport vehicle B © Artificial Researcher,2022 4
  5. 5. What make a AI Tool technical successful? • A successful text mining solution does not only focus on developing the technology or the best deep learning algorithm, it is much more complex. The technology design reflects: • Task Complexity • Information needs, retrieve relevant paragraphs and not just documents • Language Complexity • Word formation of new words are particular important for the patent text genre. • Domain text-genre Complexity • Multi word terms © Artificial Researcher,2022 5
  6. 6. Known item search Explorative search Find relevant paragraphs Scientific Summarization ComparativeSummarization Requests Output example Reading text Different Information Needs Task Complexity © Artificial Researcher,2022 6
  7. 7. Next generation Search Technology Our search software generate the query in the specific domain Retrieves relevant paragraphs Graph Search & Text Box Search We have developed text mining technologies by combining traditional Natural Language Processing and Deep Learning technologies. © Artificial Researcher,2022 7 Task Complexity
  8. 8. Example of automatic query generation Automatic query expansion terms © Artificial Researcher,2022 8 Automatic Query formulation - From text segment to automatic Boolean query Task Complexity
  9. 9. Automatic Query formulation - From terms to concept term graphs https://graph-demo.artificialresearcher.com/ Term: environment climate fuel biofuel (fuel OR gas OR engine OR vehicle OR oil OR coal OR energy OR hydrocarbon OR steam OR liquid OR fluid OR furnace OR lubricant OR hydrogen OR feed) AND ("diesel fuel"~4 OR "liquid fuel"~4 OR "fuel gas"~4 OR "fuel oil"~4 OR "fuel material"~4 OR "hydrogen gas"~4 OR "diesel oil"~4 OR "liquid hydrocarbon"~4 OR "burning gas"~4 OR "synthetic gas"~4 OR "suitable hydrocarbon"~4 OR "mixed gas"~4 OR "tail gas"~4) With context similarity of 0.85 © Artificial Researcher,2022 9 Task Complexity
  10. 10. Artificial Researcher-Search Solutions Relation graph for terms extracted from patent and scientific publications • Direct access • Relevant paragraphs (passages) • Abstracts • Bibliographic information • Links to full text documents • Download • Paragraphs including all bibliographic data, including identified technical terms, subject classification terms, scientific classifications • Standard Index Collections • Open Access scientific articles • access to 204,582,649 – CORE.uk • Patents • EP full-text data for text analytics • Medical collections • COVID-19 Open Research Dataset © Artificial Researcher,2022 10 Task Complexity
  11. 11. How indices and ontologies are generated • Technology and data flow overview • Language and Domain Complexity © Artificial Researcher,2022 11
  12. 12. From Text to Structured Semantic Representation of Patents and Scientific Publications © Artificial Researcher,2022 12
  13. 13. Any Data collections NLP pipeline, PoS & Chunker, Lexico-Semantic Patterns Harmonization & Text Normalization External Language Resources, Gazetteers Ontology (OWL, JSON) Ontology & Index Repositories Meta Data External Information Resources API + output input + Search options Graph Search Artificial Researcher NLP-Toolkit Word Similarity Model Context Similarity Model Task Model 1 2 3 4 5 6 Documents are processed through several text analysis modules: 1. Text Normalization and harmonization (including PDF converter) 2. NLP and relation identification 3. Artificial Researcher NLP-Toolkit - Technical Terms 4. Packaged as enhanced XML/JSON 5. Software AR Ontology Services and AR Passage Retrieval 6. End user applications rest APIs, desktop script, GUI Cloud platforms (Google, AZURE2023) Internal Information Resources (analyses are processed with SaaS – so no data are stored the cloud) Artificial Research Data Pipeline solution Assembly line for creating Indices and Ontologies © Artificial Researcher,2022 13
  14. 14. Combining NLP & Distributional Semantics 𝐽𝑜𝑖𝑛𝑒𝑑𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 = ෍ 𝑖,𝑗=1,𝑛 𝑖≠𝑗 𝑖<𝑗 𝑁 cos 𝑤𝑖 , 𝑤𝑗 𝑁 • wi, wj represent each word vector pair cosine similarity of a MWT • N is the number of words for a MWT (Andersson 2013-2017) Embedding identifies similarities between different words, and the standard reference threshold for w2v 0.60 • Underwear similar to underpants , undergarment, panties, underclothes • Strength similar to strengths, strength, toughness, stronger, sfrength Technical semantic relations are a mixture of single words and phrases • synthetic fibers synonym to polyester fibers • thrips hypernym to bulb fly larvae © Artificial Researcher,2022 14 Language and Domain Complexity
  15. 15. What is Technical Term? Depends on who you are asking! © Artificial Researcher,2022 15 Language and Domain Complexity
  16. 16. BERT – Technical Term Prediction • Task: detect domain term spans in sentences A network synchronization method allows Domain Term O I I I O • Using pre-trained SciBERT Model • With additional training data for Technical Term detection • Randomly picked 22 patents with different International Patent Classification codes • Sentences 10,337, Technical Term dictionary of size 5,099 Fink T., Andersson L., Hanbury A. (2021) Detecting Multi Word Terms in patents the same way as entities / World Patent Information, 67 102078; 1 - 6 © Artificial Researcher,2022 16 Language and Domain Complexity
  17. 17. Technical term identification - Evaluation using WIPO PEARL Terminology DB RECALL AND PRECISION ASSESSMENT WIPO CONCEPT RECALL 81% ALL DOMAIN TERM RECALL 96% ALL DOMAIN TERM PRECISION 85% © Artificial Researcher,2022 17 Language and Domain Complexity
  18. 18. What is Semantic Lexical Relations Hyponym, a narrow term e.g. horse is a hyponym of farm animals Hypernym, a broad term, e.g. farm animals a hypernym of pig, cow, sheep, horse Hyponymy relations: • part of • kind of • member of • type of • domain specific relation …. farm animals such as pigs, sheep, cows and horses © Artificial Researcher,2022 18 Language and Domain complexity
  19. 19. Better than Distributional Semantics brake pedal: vehicle operating pedal, conventional hydraulic brake system pedal devices position brake actuating member brake actuating member hydraulically-assisted rack pinion steering gear brake operating member conventional braking system pair pedals accelerator pedal case pedal device pedal device Related concept Artificial Researcher’s text mining technology Extraction using pre-trained contextual models only brake pedal PatBERT SciBERT conventional hydraulic brake system 0.69 0.89 hydraulically-assisted rack pinion steering gear 0.49 0.80 conventional braking system 0.66 0.84 Technical terms Using semantically enriched data can retrieve up to 60% more related concepts than traditional pre-trained contextual models (a.k.a. Neural Network-based methods, BERT-family). © Artificial Researcher,2022 19 Language and Domain complexity
  20. 20. Proof-of-concept -Passage retrieval performance on CLEF-IP 2013 text collection. Method PRES@ 100 Recall@ 100 MAP@ 100 Prec(D) (Albarede et. al. 2021) 0.461 0.603 0.173 0.231 (Andersson et. al., 2016) 0.444 0.560 0.187 0.282 (Andersson et. al., 2017) 0.558 0.647 0.269 0.207 (Luo J and Yang H. 2013) 0.433 0.540 0.191 0.213 Artificial Researcher Passage Retrieval ServiceTM Artificial Researcher Graph Search ServiceTM Albarede L., & Mulhem P., Goeuriot L., Le Pape-Gardeux C., Sylvain M., Trindad C. (2021). Passage retrieval in context: Experiments on Patents. Andersson L, Lupu M, Palotti J., Hanbury A., Rauber A. (2016) When is the time Ripe for Natural Language Processing for Patent Passage Retrieval? In Proceedings of the 25th ACM International Conference on Conference on Information and Knowledge Management (CIKM 2016) Andersson L, Rekabsaz N., Hanbury A. (2017) Automatic query expansion for patent passage retrieval using paradigmatic and syntagmatic information The first WiNLP Workshop co-located with the Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver Luo J and Yang H. (2013). Query formulation for prior art search-Georgetown university at clef-ip 2013. In Proc. of CLEF. © Artificial Researcher,2022 20 Language and Domain complexity
  21. 21. New REST API service Text classification and semantic relation extraction Output Automatic classification according to taxonomies such as IPC/CPC, MeSH, PLoS, WIPO Scientific Subject fields Input © Artificial Researcher,2022 21
  22. 22. Artificial Researcher - Services & Products Test access: API key b10225791d2a478c9ff3d8cd106ad65a • Artificial Researcher Passage Retrieval Service • Access: rest API, Desktop App Script (Python) • https://swagger.artificialresearcher.com/ (developer) • https://passageretrieval.artificialresearcher.com/ (demo) • SaaS or on-premises software (Docker) • Artificial Researcher Ontology Service • Access: rest API, Desktop App Script (Python) • https://swagger.artificialresearcher.com/?urls.primaryName=onto-api (developer) • Data format OWL or JSON • SaaS or on-premises software (Docker) • Ontology generator – only SaaS • Artificial Researcher NLP-toolkit Services • Access: rest API • https://swagger.artificialresearcher.com/?urls.primaryName=unified-nlp-server (developer) • SaaS or on-premises software (Docker) • Artificial Researcher Graph Search Service • UX: https://graph-demo.artificialresearcher.com/ (demo) © Artificial Researcher,2022 22
  23. 23. Thank you for your attention artificialresearcher.com linda.andersson@artificialresearcher.com © Artificial Researcher,2022 23

×