SlideShare une entreprise Scribd logo
1  sur  1
Télécharger pour lire hors ligne
2. Efficient Dictionaries
1. Abstract 7. Recall Benchmark Results
www.nextmovesoftware.co.uk
www.nextmovesoftware.com
NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
England CB4 0EY
Improved Chemical Text mining of Patents using
automatic spelling correction and infinite dictionaries
Roger Sayle1, Paul-Hongxing Xie2, Plamen Petrov2, Jon Winter3 and Sorel Muresan2
1 NextMove Software Ltd, Cambridge, UK 2 AstraZeneca, Mölndal, Sweden 3 AstraZeneca, Alderley Park, UK
The text mining of patents for chemical structures of pharmaceutical interest
poses a number of unique challenges not encountered in other fields of text
mining. Unlike fields such as bioinformatics where the number of terms of
interest is enumerable and static, systematic chemical nomenclature can
describe an infinite number of molecules. Hence the dictionary techniques
that are commonly used for gene names, diseases, species etc. have limited
utility when searching for novel therapeutic compounds in patents.
Additionally, the length and composition of IUPAC-like names makes them
more susceptible typographical problems; OCR failures, human errors and
hyphenation and line breaking issues. This work describes a novel technique,
called CaffeineFix, designed to efficiently and correctly identify chemical
names in free text, even in the presence of typographical errors. This forms a
pre-processing pass, independent of the name-to-structure software used,
and is shown to greatly improve results in our study.
10. Bibliography
1. Roger Sayle, “Foreign Language Translation of Chemical Nomenclature by
Computer”, JCIM, Vol. 49, No. 3, pp. 519-530, 2009.
2. James Rhodes, Stephen Boyer, Jeffrey Kreule, Ying Chen and Patricia Ordonez,
“Mining Patents using Molecular Similarity Search”, Pacific Symposium on
Biocomputing, Vol. 12, pp. 304-315, 2007.
3. Ithipol Suriyawongkul, Chris Southan and Sorel Muresan, “The Cinderella of
Biological Data Integration: Addressing the Challenges of Entity and Relationship
Mining from Patent Sources”, In Data Integration in the Life Sciences, Springer 2010.
The CaffeineFix algorithm efficiently represents the set or dictionary of
entities to recognize as a minimal finite state machine (FSM). This encoding
allows it to very efficiently match a string against very large dictionaries;
significantly more efficient than the hashing or search tree based methods
used by Java or C++’s STL library.
The FSM below demonstrates the encoding a dictionary for small nitrogen
containing heterocycles, containing pyrrole, pyrazole, imidazole, pyridine,
pyridazine, pyrimidine and pyrazine.
A novel advance over previous spelling checking implementations is to
observe that FSM representations can encode an infinite number of terms
by permitting backward edges, and using stacks (push down automata) for
balancing brackets in context-free (IUPAC-like) grammars.
Consider a simple chemical grammar consisting of optionally halo substituted
alkanes; where the prefixes “bromo”, “chloro” and “fluoro” may be
repeatedly applied to “methane”, “ethane”, “propane” or “butane”.
4. Infinite Dictionaries (Grammars)
3. Improved Tokenization
A major benefit of using FSMs for chemical text mining is improved
tokenization (the splitting of free text into words/terms). Chemical names
can contain spaces, hyphens, digits, commas, parentheses, brackets, braces,
apostrophes, periods, superscripts and Greek characters that confound most
NLP tools. This is made harder still by typos, OCR errors, line breaks and
hyphenation, XML/HTML tags, line and page numbers. The ability to FSMs to
efficiently recognize valid prefixes, allows characters such as spaces and
commas to be delimiters in some contexts but part of the recognized entity
in others.
This FSM can recognize an infinite number of strings include “methane”,
“chloroethane”, “2-bromo-propane”, “chloro-bromo-methane” and so on.
5. Systematic IUPAC/CAS-like Grammar
6. Automatic Spelling Correction
8. Precision Benchmark Results
9. Conclusions
To recognize the majority of nomenclature used in medicinal chemistry,
CaffeineFix’s current IUPAC dictionary (FSM) contains 452,126 states. This
covers 98.97% of the OpenEye Lexichem generated names for the NCI00
database, and 91.23% of the 71K names in the Maybridge 2003 catalogue.
Another major benefit of using FSMs in CaffeineFix is the ability to perform
“fuzzy” matching, returning all possible terms within a limited Levenshtein
(or string-edit) distance of a query string. By backtracking over the FSM, it is
possible to determine the minimum number of single character insertions,
deletions or substitutions required to transform one string into another.
When only a single unique “correction” can be found within a given radius,
the misspelling may be corrected automatically.
CaffeineFix suggests “1,2-dichlorobenzene” to replace “12-dichlorobenzene”,
“dodec-2-ene” for “didec-2-ene” and “spiro[2.3]hexane” for “spiro[2.2]hexane”.
Additionally penalties/distances may be parameterized (as in bioinformatics)
such that homoglyphic substitutions found in OCR (between “1”, “l” and “I”,
between “0” and “O” or insertion of “<br/>”) cost less than other changes.
CaffeineFix’s use of finite state machines allows it to efficiently recognize an
infinite number systematic IUPAC-like chemical names in free text, even in
the presence of OCR and other typographical errors. This is shown to
improve the extraction of relevant structures from pharmaceutical patents.
The methods were applied to mining IBM’s text database of 12 million US,
European and world patents. Non-English abstracts were translated using
OpenEye’s Lexichem translation functionality. The analysis yielded a total of
13,523,284 unique chemical names, including 7,262,798 systematic names.
Using a suite of name-to-structure convertors including Lexichem, ChemAxon
and OPSIN allowed 92.2% of these names to be converted to structures. In
total, 5,805,172 unique canonical SMILES were indexed.
To assess the quality of structures extracted, a set of 50 US patents denoting
top selling drugs was used as a benchmark. These included US4255431
(Losec), US4681893 (Lipitor), US4847265 (Plavix), US6566360 (Levitra) and so
on. The objective of the benchmark was to report the highest Tanimoto
similarity, using MACCS 166-bit keys , between the query “drug” and the
compounds extracted from its patent.
CaffeineFix Both IBM SIMPLE Neither
8 12 8 22
Arimidex
Tarceva
Paxil
Aciphex
Bextra
Nexium
Iressa
Casodex
Patanol
Ambien
Cialis
Flovent
Zofran
CaffeineFix (D=1) and Lexichem equalled IBM’s SIMPLE annotator with 20/50
exact matches, and ahead of OSCAR/OPSIN’s 17. Multiple name-to-structure
programs and more aggressive spelling correction produced better results.

Contenu connexe

Similaire à Improved Chemical Text mining of Patents using automatic spelling correction and infinite dictionaries

LeadMine: A grammar and dictionary driven approach to chemical entity recogni...
LeadMine: A grammar and dictionary driven approach to chemical entity recogni...LeadMine: A grammar and dictionary driven approach to chemical entity recogni...
LeadMine: A grammar and dictionary driven approach to chemical entity recogni...NextMove Software
 
2011-11-28 Open PHACTS at RSC CICAG
2011-11-28 Open PHACTS at RSC CICAG2011-11-28 Open PHACTS at RSC CICAG
2011-11-28 Open PHACTS at RSC CICAGopen_phacts
 
2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europe2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europeopen_phacts
 
Biomedical literature mining
Biomedical literature miningBiomedical literature mining
Biomedical literature miningLars Juhl Jensen
 
Biological literature mining - from information retrieval to biological disco...
Biological literature mining - from information retrieval to biological disco...Biological literature mining - from information retrieval to biological disco...
Biological literature mining - from information retrieval to biological disco...Lars Juhl Jensen
 
Welch Wordifier Bosc2009
Welch Wordifier Bosc2009Welch Wordifier Bosc2009
Welch Wordifier Bosc2009bosc
 
Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019Pistoia Alliance
 
AI & ML in Drug Design: Pistoia Alliance CoE
AI & ML in Drug Design: Pistoia Alliance CoEAI & ML in Drug Design: Pistoia Alliance CoE
AI & ML in Drug Design: Pistoia Alliance CoEPistoia Alliance
 
Information extraction from EHR
Information extraction from EHRInformation extraction from EHR
Information extraction from EHRAshis Chanda
 
Biomedical-named entity recognition using CUDA accelerated KNN algorithm
Biomedical-named entity recognition using CUDA accelerated KNN algorithmBiomedical-named entity recognition using CUDA accelerated KNN algorithm
Biomedical-named entity recognition using CUDA accelerated KNN algorithmTELKOMNIKA JOURNAL
 
Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsou...
Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsou...Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsou...
Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsou...Graham Smith
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...NextMove Software
 
Ondex: Data integration and visualisation
Ondex: Data integration and visualisationOndex: Data integration and visualisation
Ondex: Data integration and visualisationBiogeeks
 
Assessing Drug Safety Using AI
Assessing Drug Safety Using AIAssessing Drug Safety Using AI
Assessing Drug Safety Using AIDatabricks
 

Similaire à Improved Chemical Text mining of Patents using automatic spelling correction and infinite dictionaries (20)

Automatic vs manual curation of a multisource chemical dictionary
Automatic vs manual curation of a multisource chemical dictionaryAutomatic vs manual curation of a multisource chemical dictionary
Automatic vs manual curation of a multisource chemical dictionary
 
Automated identification and conversion of chemical names to structure search...
Automated identification and conversion of chemical names to structure search...Automated identification and conversion of chemical names to structure search...
Automated identification and conversion of chemical names to structure search...
 
Automated Identification and Conversion of Chemical Names to Structure Search...
Automated Identification and Conversion of Chemical Names to Structure Search...Automated Identification and Conversion of Chemical Names to Structure Search...
Automated Identification and Conversion of Chemical Names to Structure Search...
 
LeadMine: A grammar and dictionary driven approach to chemical entity recogni...
LeadMine: A grammar and dictionary driven approach to chemical entity recogni...LeadMine: A grammar and dictionary driven approach to chemical entity recogni...
LeadMine: A grammar and dictionary driven approach to chemical entity recogni...
 
2011-11-28 Open PHACTS at RSC CICAG
2011-11-28 Open PHACTS at RSC CICAG2011-11-28 Open PHACTS at RSC CICAG
2011-11-28 Open PHACTS at RSC CICAG
 
Chibucos annot go_final
Chibucos annot go_finalChibucos annot go_final
Chibucos annot go_final
 
2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europe2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europe
 
Biomedical literature mining
Biomedical literature miningBiomedical literature mining
Biomedical literature mining
 
Biological literature mining - from information retrieval to biological disco...
Biological literature mining - from information retrieval to biological disco...Biological literature mining - from information retrieval to biological disco...
Biological literature mining - from information retrieval to biological disco...
 
Welch Wordifier Bosc2009
Welch Wordifier Bosc2009Welch Wordifier Bosc2009
Welch Wordifier Bosc2009
 
Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019
 
AI & ML in Drug Design: Pistoia Alliance CoE
AI & ML in Drug Design: Pistoia Alliance CoEAI & ML in Drug Design: Pistoia Alliance CoE
AI & ML in Drug Design: Pistoia Alliance CoE
 
Information extraction from EHR
Information extraction from EHRInformation extraction from EHR
Information extraction from EHR
 
Identification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpider
Identification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpiderIdentification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpider
Identification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpider
 
Biomedical-named entity recognition using CUDA accelerated KNN algorithm
Biomedical-named entity recognition using CUDA accelerated KNN algorithmBiomedical-named entity recognition using CUDA accelerated KNN algorithm
Biomedical-named entity recognition using CUDA accelerated KNN algorithm
 
Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsou...
Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsou...Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsou...
Enabling HTS Hit follow up via Chemo informatics, File Enrichment, and Outsou...
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...
 
Bioinformatics.pptx
Bioinformatics.pptxBioinformatics.pptx
Bioinformatics.pptx
 
Ondex: Data integration and visualisation
Ondex: Data integration and visualisationOndex: Data integration and visualisation
Ondex: Data integration and visualisation
 
Assessing Drug Safety Using AI
Assessing Drug Safety Using AIAssessing Drug Safety Using AI
Assessing Drug Safety Using AI
 

Plus de NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 

Plus de NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 

Dernier

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...KokoStevan
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterMateoGardella
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 

Dernier (20)

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 

Improved Chemical Text mining of Patents using automatic spelling correction and infinite dictionaries

  • 1. 2. Efficient Dictionaries 1. Abstract 7. Recall Benchmark Results www.nextmovesoftware.co.uk www.nextmovesoftware.com NextMove Software Limited Innovation Centre (Unit 23) Cambridge Science Park Milton Road, Cambridge England CB4 0EY Improved Chemical Text mining of Patents using automatic spelling correction and infinite dictionaries Roger Sayle1, Paul-Hongxing Xie2, Plamen Petrov2, Jon Winter3 and Sorel Muresan2 1 NextMove Software Ltd, Cambridge, UK 2 AstraZeneca, Mölndal, Sweden 3 AstraZeneca, Alderley Park, UK The text mining of patents for chemical structures of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields such as bioinformatics where the number of terms of interest is enumerable and static, systematic chemical nomenclature can describe an infinite number of molecules. Hence the dictionary techniques that are commonly used for gene names, diseases, species etc. have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and composition of IUPAC-like names makes them more susceptible typographical problems; OCR failures, human errors and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently and correctly identify chemical names in free text, even in the presence of typographical errors. This forms a pre-processing pass, independent of the name-to-structure software used, and is shown to greatly improve results in our study. 10. Bibliography 1. Roger Sayle, “Foreign Language Translation of Chemical Nomenclature by Computer”, JCIM, Vol. 49, No. 3, pp. 519-530, 2009. 2. James Rhodes, Stephen Boyer, Jeffrey Kreule, Ying Chen and Patricia Ordonez, “Mining Patents using Molecular Similarity Search”, Pacific Symposium on Biocomputing, Vol. 12, pp. 304-315, 2007. 3. Ithipol Suriyawongkul, Chris Southan and Sorel Muresan, “The Cinderella of Biological Data Integration: Addressing the Challenges of Entity and Relationship Mining from Patent Sources”, In Data Integration in the Life Sciences, Springer 2010. The CaffeineFix algorithm efficiently represents the set or dictionary of entities to recognize as a minimal finite state machine (FSM). This encoding allows it to very efficiently match a string against very large dictionaries; significantly more efficient than the hashing or search tree based methods used by Java or C++’s STL library. The FSM below demonstrates the encoding a dictionary for small nitrogen containing heterocycles, containing pyrrole, pyrazole, imidazole, pyridine, pyridazine, pyrimidine and pyrazine. A novel advance over previous spelling checking implementations is to observe that FSM representations can encode an infinite number of terms by permitting backward edges, and using stacks (push down automata) for balancing brackets in context-free (IUPAC-like) grammars. Consider a simple chemical grammar consisting of optionally halo substituted alkanes; where the prefixes “bromo”, “chloro” and “fluoro” may be repeatedly applied to “methane”, “ethane”, “propane” or “butane”. 4. Infinite Dictionaries (Grammars) 3. Improved Tokenization A major benefit of using FSMs for chemical text mining is improved tokenization (the splitting of free text into words/terms). Chemical names can contain spaces, hyphens, digits, commas, parentheses, brackets, braces, apostrophes, periods, superscripts and Greek characters that confound most NLP tools. This is made harder still by typos, OCR errors, line breaks and hyphenation, XML/HTML tags, line and page numbers. The ability to FSMs to efficiently recognize valid prefixes, allows characters such as spaces and commas to be delimiters in some contexts but part of the recognized entity in others. This FSM can recognize an infinite number of strings include “methane”, “chloroethane”, “2-bromo-propane”, “chloro-bromo-methane” and so on. 5. Systematic IUPAC/CAS-like Grammar 6. Automatic Spelling Correction 8. Precision Benchmark Results 9. Conclusions To recognize the majority of nomenclature used in medicinal chemistry, CaffeineFix’s current IUPAC dictionary (FSM) contains 452,126 states. This covers 98.97% of the OpenEye Lexichem generated names for the NCI00 database, and 91.23% of the 71K names in the Maybridge 2003 catalogue. Another major benefit of using FSMs in CaffeineFix is the ability to perform “fuzzy” matching, returning all possible terms within a limited Levenshtein (or string-edit) distance of a query string. By backtracking over the FSM, it is possible to determine the minimum number of single character insertions, deletions or substitutions required to transform one string into another. When only a single unique “correction” can be found within a given radius, the misspelling may be corrected automatically. CaffeineFix suggests “1,2-dichlorobenzene” to replace “12-dichlorobenzene”, “dodec-2-ene” for “didec-2-ene” and “spiro[2.3]hexane” for “spiro[2.2]hexane”. Additionally penalties/distances may be parameterized (as in bioinformatics) such that homoglyphic substitutions found in OCR (between “1”, “l” and “I”, between “0” and “O” or insertion of “<br/>”) cost less than other changes. CaffeineFix’s use of finite state machines allows it to efficiently recognize an infinite number systematic IUPAC-like chemical names in free text, even in the presence of OCR and other typographical errors. This is shown to improve the extraction of relevant structures from pharmaceutical patents. The methods were applied to mining IBM’s text database of 12 million US, European and world patents. Non-English abstracts were translated using OpenEye’s Lexichem translation functionality. The analysis yielded a total of 13,523,284 unique chemical names, including 7,262,798 systematic names. Using a suite of name-to-structure convertors including Lexichem, ChemAxon and OPSIN allowed 92.2% of these names to be converted to structures. In total, 5,805,172 unique canonical SMILES were indexed. To assess the quality of structures extracted, a set of 50 US patents denoting top selling drugs was used as a benchmark. These included US4255431 (Losec), US4681893 (Lipitor), US4847265 (Plavix), US6566360 (Levitra) and so on. The objective of the benchmark was to report the highest Tanimoto similarity, using MACCS 166-bit keys , between the query “drug” and the compounds extracted from its patent. CaffeineFix Both IBM SIMPLE Neither 8 12 8 22 Arimidex Tarceva Paxil Aciphex Bextra Nexium Iressa Casodex Patanol Ambien Cialis Flovent Zofran CaffeineFix (D=1) and Lexichem equalled IBM’s SIMPLE annotator with 20/50 exact matches, and ahead of OSCAR/OPSIN’s 17. Multiple name-to-structure programs and more aggressive spelling correction produced better results.