Translating IUPAC-like Chemical Nomenclature to and from Simplified Chinese

Roger Sayle and Daniel Lowe
NextMove Software
Cambridge, UK

ACS National Meeting, Philadelphia, USA 22nd August 2012
motivation

• Whilst most scientific articles are written in English, a
  significant fraction of chemistry research remains
  only available in other languages.
• The growing use of overseas contract research
  organizations is increasing the prevalence of foreign
  languages in Electronic Lab Notebooks and internal
  technical reports.
• Existing language tools, such as Google Translate,
  often perform poorly on technical (domain-specific)
  terminology, such as IUPAC-like nomenclature.
google search statistics

                Lang        Query                          Search Hits
                  en        benzoic acid                   6,670,000
                  zh        苯甲酸                            4,090,000
                  ja        安息香酸                           2,000,000
                  es        ácido benzoico                 439,000
                  ru        Бензойная кислота              310,000
                  de        benzoesäure                    267,000
                  fr        acide benzoïque                179,000
                  nl        benzoëzuur                     139,000


ibm simple patent database
C07D abstract languages
                    Code     Language         Count     Fraction
                    en       English          316338      70.96%
                    fr       French           104924      23.54%
                    de       German            22694       5.09%
                    ja       Japanese           1178       0.26%
                    zh       Chinese             310       0.07%
                    es       Spanish             230       0.05%
                    ko       Korean              120       0.03%
                    ru       Russian              16       0.00%
                    pt       Portuguese            7       0.00%
                    fi       Finnish               4       0.00%
                             Total            445821     100.00%

hattori08 benchmark languages
                    Drug Name   Company                Patent              Language
                    Aciphex     Eisai Co               EP268956(A2)        EN
                    Aldara      3M Pharmaceuticals     EP145340(A2)        EN
                    Aricept     Eisai Co               EP296560(A2)        EN
                    Arimidex    AstraZeneca            EP296749(A1)        EN
                    Atacand     AstraZeneca            EP459136(A1)        EN
                    Avapro      Bristol-Myers Squibb   WO1991014679(A1)    FR
                    Benicar     Sankyo Pharma          EP503785(A1)        EN
                    Bextra      Pfizer                 WO1996025405(A1)    EN
                    Casodex     AstraZeneca            EP100172(A1)        EN
                    Celebrex    Pfizer                 WO1995015316(A1)    EN
                    Cialis      Lilly ICOS             WO1995019978(A1)    EN
                    Coreg       GlaxoSmithKline        DE2815926(A1)       DE
                    Cozaar      Merck & Co.            EP253310(A2)        EN
                    Detrol      Pfizer                 EP0325571(A1)       EN
                    Diovan      Novartis               EP443983(A1)        DE
                    Femara      Novartis               EP236940(A2)        EN
                    Flovent     GlaxoSmithKline        NL8100707(A)        NL
                    Lamisil     Novartis               EP24587(A1)         EN
                    Lescol      Novartis               WO1984002131(A1)    EN
                    Levitra     Bayer                  WO1999024433(A1)    DE
                    Nasonex     Schering-Plough        EP57401(A1)         EN
                    Patanol     Nestle SA              EP235796(A2)        EN
                    Paxil       GlaxoSmithKline        EP266574(A2)        EN
                    Reyataz     Bristol-Myers Squibb   WO1997040029(A1)    EN
                    Spiriva     Boehringer Ingelheim   EP418716(A1)        DE
                    Sustiva     Bristol-Myers Squibb   EP582455(A1)        EN
                    Tarceva     Genentech              WO1996030347(A1)    EN
                    Vigamox     Alcon                  EP550903(A1)        DE
                    Zofran      GlaxoSmithKline        DE3502508(A1)       DE
                    Zomig       Medpointe Pharm        WO1991018897(A1)    EN

previous work

• Roger Sayle, “Foreign Language Translation of
  Chemical Nomenclature by Computer”, Journal of
  Chemical Information and Modeling, Vol. 49, No. 3,
  pp. 519-530, 2009.
• Roger Sayle, “Preserving Nuance in Chemical
  Nomenclature Translation”, The ATA Chronicle (The
  Journal of the American Translators Association), Vol.
  38, October 2009.
international nomenclature

• Morphology/semantics preserved across languages:
     –   English: 2-(4-chlorophenoxy)acetic acid
     –   Chinese: 2-(4-氯苯氧基)醋酸
     –   German: 2-(4-chlorphenoxy)essigsäure
     –   French: acide 2-(4-chlorophénoxy)acétique
     –   Spanish: ácido 2-(4-clorofenoxi)acético
     –   Swedish: 2-(4-klorofenoxi)ättiksyra
     –   Italian: acido 2-(4-clorofenossi)acetico
     –   Polish: kwas 2-(4-chlorofenoksy)octowy
     –   Japanese: 2-(4-クロロフェノキシ)酢酸

chinese elements

•   hydrogen                     氢                     [H]
•   oxygen                       氧                     [O]
•   nitrogen                     氮                     [N]
•   gold                         金                     [Au]
•   silver                       银                     [Ag]
•   iron                         铁                     [Fe]
•   lead                         铅                     [Pb]

chinese alkanes

•   methane                      甲烷                    C
•   ethane                       乙烷                    CC
•   propane                      丙烷                    CCC
•   butane                       丁烷                    CCCC
•   pentane                      戊烷                    CCCCC
•   hexane                       己烷                    CCCCCC
•   heptane                      庚烷                    CCCCCCC

chinese alkyl groups

•   methyl                       甲基                    *C
•   ethyl                        乙基                    *CC
•   propyl                       丙基                    *CCC
•   butyl                        丁基                    *CCCC
•   pentyl                       戊基                    *CCCCC
•   hexyl                        己基                    *CCCCCC
•   heptyl                       庚基                    *CCCCCCC
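
The alkane and alkyl group tables share a pattern: the first character (甲, 乙, 丙, ...) gives the number of carbons, and the suffix (烷 or 基) gives the class. A minimal sketch of that mapping follows. The function name is hypothetical, and it covers only the seven straight-chain stems listed in these slides.

```python
# Stems as listed in the slides: position in the string = number of carbons.
STEMS = "甲乙丙丁戊己庚"

# Suffix -> SMILES prefix; "*" marks the attachment point of an alkyl group.
SUFFIXES = {"烷": "", "基": "*"}


def chain_to_smiles(name):
    """Convert a two-character straight-chain name (e.g. 丙烷, 丁基) to SMILES."""
    if len(name) != 2 or name[0] not in STEMS or name[1] not in SUFFIXES:
        raise ValueError(f"unsupported name: {name!r}")
    carbons = STEMS.index(name[0]) + 1
    return SUFFIXES[name[1]] + "C" * carbons


if __name__ == "__main__":
    for example in ("甲烷", "丙烷", "庚基"):
        print(example, chain_to_smiles(example))
```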

chinese functional groups

•   hydroxy-                     羟基                    *O
•   amino-                       胺基                    *N
•   fluoro-                      氟                     *F
•   chloro-                      氯                     *Cl
•   bromo-                       溴                     *Br
•   iodo-                        碘                     *I
•   nitro-                       硝基                    *N(=O)=O

chinese carbocycles

•   benzene                                 苯              c1ccccc1
•   naphthalene                             萘              c1ccc2ccccc2c1
•   azulene                                 薁
•   anthracene                              蒽
•   phenanthrene                            菲
•   fluorene                                芴
•   pyrene                                  芘

chinese heterocycles

•   furan                        呋喃                    o1cccc1
•   pyrrole                      吡咯                    [nH]1cccc1
•   thiophene                    噻吩                    s1cccc1
•   pyridine                     吡啶                    n1ccccc1
•   pyridazine                   哒嗪                    n1ncccc1
•   pyrimidine                   嘧啶                    n1cnccc1
•   pyrazine                     吡嗪                    n1ccncc1

chinese carboxylic acids

•   formic acid                  蚁酸                    C(=O)O
•   acetic acid                  醋酸                    CC(=O)O
•   propionic acid               丙酸                    CCC(=O)O
•   butyric acid                 酪酸                    CCCC(=O)O
•   valeric acid                 戊酸                    CCCCC(=O)O
•   hexanoic acid                己酸                    CCCCCC(=O)O
•   enanthic acid                庚酸                    CCCCCCC(=O)O

example #1: aspirin




• 2-acetoxybenzoic acid
• 2-乙酰氧基苯甲酸
example #2: alprenolol




• 1-(2-allylphenoxy)-3-(isopropylamino)propan-2-ol
• 1-(2-烯丙基苯氧基)-3-(异丙基氨基)丙-2-醇
example #3: aminopyrine




• 4-dimethylamino-1,5-dimethyl-2-phenyl-pyrazol-3-one
• 4-二甲氨基-1,5-二甲基-2-苯基-吡唑-3-酮
example #4: caffeine




• 1,3,7-trimethylpurine-2,6-dione
• 1,3,7-三甲基嘌呤-2,6-二酮
simplified chinese to english for
 chemical text mining of patents
• Early efforts in (Chinese) chemical nomenclature
  translation focused on translating and round-tripping
  delimited and often machine-generated IUPAC
  names.
• More recently, the focus has shifted to applying
  this technology to chemical named entity
  recognition (CNER) in pharmaceutical patents.
• The rest of this presentation describes the current
  state of the art on this challenge, and the influence
  of Chinese OCR on compound retrieval rates.
leadmine introduction

• LeadMine is NextMove Software’s application for
  chemical text mining, built upon CaffeineFix for
  automatic chemical spelling correction.
• R. Sayle, P.H. Xie and S. Muresan, “Improved
  Chemical Text Mining of Patents with Infinite
  Dictionaries and Automatic Spelling Correction”, J.
  Chem. Inf. Model. 52(1), pp. 51-62, January 2012.
• Recent improvements to LeadMine and CaffeineFix
  were described earlier in the week in the CINF
  session on “Chemical Information in Patents”.
leadmine v2.0 chinese support

• Chinese is supported by preprocessing the input text,
  substituting/transliterating the hanzi characters
  found in IUPAC-like names into English.
• The current translation dictionary (June 2012)
  includes 1046 substitution rules.
• Traditional Chinese is supported internally by
  automatically mapping 166 traditional characters
  found in IUPAC-like names to their simplified forms.
• Efficient matching is made possible by “compiling”
  these rules into a 7000-line Java source file.
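
The preprocessing described above can be sketched as a traditional-to-simplified character mapping followed by greedy longest-match substitution. This is only an illustration under stated assumptions: the handful of rules below are inferred from the name pairs shown earlier in these slides and stand in for the real 1046-rule dictionary, the two traditional characters are examples, and the function name is hypothetical. It uses plain dictionary lookups rather than the compiled Java matcher the slides mention.

```python
# Illustrative substitution rules only (hanzi -> English fragment).
RULES = {
    "氯": "chloro",
    "苯氧基": "phenoxy",
    "醋酸": "acetic acid",
    "乙酰氧基": "acetoxy",
    "苯甲酸": "benzoic acid",
    "苯": "benzene",
}

# Example traditional -> simplified mappings (the slides mention 166 in total).
TRAD_TO_SIMP = str.maketrans({"醯": "酰", "氫": "氢"})

MAX_KEY_LEN = max(len(key) for key in RULES)


def translate(text):
    """Replace hanzi fragments with English, preferring the longest rule."""
    text = text.translate(TRAD_TO_SIMP)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so 苯甲酸 wins over 苯.
        for n in range(min(MAX_KEY_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += n
                break
        else:
            # No rule matched: copy locants, hyphens, brackets unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(translate("2-(4-氯苯氧基)醋酸"))
    print(translate("2-乙醯氧基苯甲酸"))
```

Trying the longest rule first matters because short rules such as 苯 are prefixes of longer ones such as 苯甲酸 and 苯氧基.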
example noisy input

5.如权利要求1所述的化合物,所述化合物是:2_(4-氯-苯氧
基)-2-曱基-1^-噻唑-2-基-丙酰胺;2—曱基-2-(4-曱硫基-苯氧
基)-N-噻唑-2-基-丙酰胺;2-(6-氯-吡啶-2-基氧基)-2-曱基-N-噻唑
-2-基-丙酰胺; 2_曱基-2-(萘-1 -基氧基)-N-噻唑-2-基-丙酰胺
; 2_曱基-2-(萘-2-基氧基)->^噻唑-2-基-丙酰胺; 2-(2,4-二
氟-苯氧基)-2-曱基-N-噻唑-2-基- 丙酰胺; 2-(4-氟-苯硫基)-2-
曱基-N-瘗唑-2-基-丙酰胺; 2-曱基-2-(4-苯氧基-苯氧基)-1^-噻
唑-2-基-丙酰胺; 2_甲基-N-噻唑-2-基-2-(4'-三氟甲氧基-联苯-
4-基氧基)-丙酰胺; 2-(苯并[1,3]二氧杂环戊烯-5-基氧基)-
^(5-氯-噻唑-2-基)-2-曱基-丙酰胺;N-(5-氯-噻唑-2-基)-2-(2,4-
二氟-苯氧基)-2-曱基-丙酰胺; 2-(5-氯-吡啶-3-基氧基)-N-(5-氯
-瘗唑-2-基)-2-曱基-丙酰胺;...

example annotated output




benchmark #1: CN101622231A

• The UTF-16 encoded XML for this patent was kindly
  provided/suggested by Steve Boyer (IBM/Collabra).
• This document corresponds to US patent application
  2010/0144772 A1; the non-OCR full text is available
  from Google Patents.
• Claim #5 lists 168 fully exemplified structures
  explicitly claimed by this patent.




benchmark #1: CN101622231A

• 144/168 (85.7%) can be recognized by LeadMine v2
  from the English text (US 2010144772).
• 0/168 (0%) can be recognized by LeadMine from the
  text translated using Google Translate.
• 5/168 (3%) could be recognized by LeadMine v1,
  which uses OpenEye Scientific Software’s Lexichem
  for Chinese translation.
• 71/168 (42%) can be recognized by LeadMine v2,
  which uses NextMove Software’s Chinese translation.
• Only 9/168 (5.4%) can be recognized by LeadMine v2
  when OCR correction is disabled.
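
The percentages in the bullets above are recall figures over the 168 names in claim #5. A small sketch that reproduces them, and also expresses the Chinese result relative to the English one, is given below. The labels and function names are illustrative, and the ratio compares counts only; it does not check that the same compounds were found in both texts.

```python
TOTAL_NAMES = 168  # exemplified structures listed in claim #5

# Hit counts as reported on this slide.
HITS = {
    "English text (LeadMine v2)": 144,
    "Google Translate output": 0,
    "Chinese, LeadMine v1 (Lexichem)": 5,
    "Chinese, LeadMine v2": 71,
    "Chinese, LeadMine v2, no OCR correction": 9,
}


def recall_percent(found, total=TOTAL_NAMES):
    """Recall as a percentage of the names that could have been found."""
    return 100.0 * found / total


def report():
    """One formatted line per configuration."""
    return [
        f"{label}: {found}/{TOTAL_NAMES} ({recall_percent(found):.1f}%)"
        for label, found in HITS.items()
    ]


# Chinese hits as a fraction of the English hits (count ratio only).
chinese_vs_english = recall_percent(
    HITS["Chinese, LeadMine v2"], HITS["English text (LeadMine v2)"]
)

if __name__ == "__main__":
    print("\n".join(report()))
    print(f"Chinese relative to English: {chinese_vs_english:.1f}%")
```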
benchmark #1: CN101622231A

• Three compounds (39, 79 and 161) can be recovered
  from the Chinese text but could not be found in the
  English patent! These appear to be typos corrected
  during translation.
• A mistake by the translators has also resulted in an
  accidental duplication of names 33, 34 and 35, such
  that there are 171 compounds in the CN patent.
• A significant cause of failures is incorrect OCR of
  the character “甲”, as found in methyl (甲基), which
  is frequently misread as “曱”.
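
One simple way to handle such homoglyph errors is a character-level substitution applied before translation. The sketch below is an assumption about how this could be done, not a description of CaffeineFix: the 曱 to 甲 pair comes from the bullet above, while 瘗 to 噻 is inferred from the noisy input example (瘗唑 where 噻唑 is expected). Both 曱 and 瘗 are legitimate characters elsewhere, so a replacement like this is only reasonable inside candidate chemical names.

```python
# OCR confusions observed in the example input: misread -> intended.
OCR_HOMOGLYPHS = str.maketrans({
    "曱": "甲",  # as in methyl (甲基)
    "瘗": "噻",  # as in thiazole (噻唑); inferred from the noisy example
})


def fix_homoglyphs(name):
    """Replace known OCR look-alike characters in a candidate chemical name."""
    return name.translate(OCR_HOMOGLYPHS)


if __name__ == "__main__":
    print(fix_homoglyphs("2-曱基-N-瘗唑-2-基-丙酰胺"))
```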
benchmark #2: CN1964970

• The non-searchable PDF of CN1964970 was provided
  by the NIBR-IT text mining group at Novartis.
• These images can be freely downloaded via EPO.
• This document corresponds to US patent application
  US20050234042.
• The challenge is to extract the 57 exemplified
  structures from claim #24 (on pages 19-23).
• This benchmark tested the quality/impact of Chinese
  Optical Character Recognition (OCR) software.

benchmark #2: CN1964970

• Two OCR approaches were evaluated: the open-source
  tesseract from HP/Google and the
  commercial product Abbyy FineReader version 11.
• For tesseract, conversion to 300x300 dpi image files
  was done using GPL Ghostscript v9.05.
• Of the two lossless image file formats supported by
  tesseract’s leptonica library (v1.68), TIFF g3 format
  produced smaller files than PNG.
• Abbyy FineReader processes pages significantly
  faster than the 1 min/page of Google’s tesseract.
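
The tesseract pipeline above (PDF to 300 dpi TIFF G3 pages, then OCR) could be driven from a script along the following lines. This is a sketch under assumptions: the exact options used for the benchmark are not given in the slides, the file names are hypothetical, and the helper functions only build the command lines so they can be inspected before anything is run.

```python
from pathlib import Path


def ghostscript_cmd(pdf, out_dir, dpi=300):
    """Command to render each PDF page to a bilevel TIFF G3 image."""
    pattern = str(Path(out_dir) / "page-%03d.tif")
    return [
        "gs", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffg3",          # TIFF with CCITT Group 3 compression
        f"-r{dpi}",                 # same resolution in x and y
        f"-sOutputFile={pattern}",  # one numbered file per page
        str(pdf),
    ]


def tesseract_cmd(image, lang="chi_sim"):
    """Command to OCR one page image; output goes to <image stem>.txt."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


if __name__ == "__main__":
    # Hypothetical file names; print the commands rather than running them.
    print(" ".join(ghostscript_cmd("CN1964970.pdf", "pages")))
    print(" ".join(tesseract_cmd("page-019.tif")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once Ghostscript and tesseract (with Simplified Chinese language data) are installed.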
benchmark #2: CN1964970

• 50/57 (88%) of compounds can be found by
  LeadMine from the English patent, US20050234042.
• 0/57 compounds can be found using Abbyy
  FineReader’s pure “Simplified Chinese” setting.
• 4/57 (7%) can be found using FineReader’s hybrid
  “Simplified Chinese and English” language setting.
• 2/57 (3.5%) can be found using HP/Google’s open
  source tesseract package.
• Similar results were also seen with CN1684951.

conclusions

• In some respects, the ability to mine any chemical
  structures from Chinese patents is encouraging.
• On high quality OCR input, about 50% of the names
  found in English CNER can be retrieved from Chinese.
• Results are highly dependent upon the quality of
  the OCR software used (IBM appears to have the best).
• Abbyy FineReader’s “Simplified Chinese and English”
  setting performed best of the tested OCR software.
• Interestingly, some compounds are only found in the
  Chinese text.

potential future work

• Evaluate additional Chinese OCR packages, including
  Adobe Acrobat, COCR2, ReadIris 14 and Dan Ching.
• Teach LeadMine’s Chinese translation about more
  problematic homoglyph errors from Chinese OCR.
• Make use of (tesseract’s) per-character OCR
  confidence during CaffeineFix spelling correction.
• Perhaps fund a (Chinese speaking) summer student
  to develop improved “training data” files for Google’s
  tesseract, to better support the fonts used by the
  Chinese patent office.
acknowledgements

• Sorel Muresan, Plamen Petrov and Paul Hongxing
  Xie, AstraZeneca R&D, Mölndal, SE.
• Steve Boyer, IBM/Collabra, San Jose, USA.
• Fatma Oezdemir-Zaech, NIBR-IT, Novartis, Basel, CH.
• OpenEye Scientific Software, Santa Fe, USA.
• Wikipedia.


• Thank you for your time.

Grant Readiness 101 TechSoup and Remy Consulting
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 

Translating IUPAC-like Chemical Nomenclature to and from Simplified Chinese

  • 1. Translating IUPAC-like Chemical Nomenclature to and from Simplified Chinese
      Roger Sayle and Daniel Lowe, NextMove Software, Cambridge, UK
      ACS National Meeting, Philadelphia, USA 22nd August 2012
  • 2. motivation
      • Whilst most scientific articles are written in English, a significant fraction of chemistry research remains only available in other languages.
      • The growing use of overseas contract research organizations is increasing the prevalence of foreign languages in Electronic Lab Notebooks and internal technical reports.
      • Existing language tools, such as Google Translate, often perform poorly on technical (domain-specific) terminology, such as IUPAC-like nomenclature.
  • 3. google search statistics

        Lang  Query               Search Hits
        en    benzoic acid        6,670,000
        zh    苯甲酸               4,090,000
        ja    安息香酸              2,000,000
        es    ácido benzoico      439,000
        ru    Бензойная кислота   310,000
        de    benzoesäure         267,000
        fr    acide benzoïque     179,000
        nl    benzoëzuur          139,000

  • 4. ibm simple patent database: c07D abstract languages

        Code  Language     Count    Fraction
        en    English      316338    70.96%
        fr    French       104924    23.54%
        de    German        22694     5.09%
        ja    Japanese       1178     0.26%
        zh    Chinese         310     0.07%
        es    Spanish         230     0.05%
        ko    Korean          120     0.03%
        ru    Russian          16     0.00%
        pt    Portuguese        7     0.00%
        fi    Finnish           4     0.00%
              Total        445821   100.00%
  • 5. hattori08 benchmark languages

        Drug Name  Company               Patent            Language
        Aciphex    Eisai Co              EP 268956(A2)     EN
        Aldara     3M Pharmaceuticals    EP 145340(A2)     EN
        Aricept    Eisai Co              EP 296560(A2)     EN
        Arimidex   AstraZeneca           EP 296749(A1)     EN
        Atacand    AstraZeneca           EP 459136(A1)     EN
        Avapro     Bristol-Myers Squibb  WO1991014679(A1)  FR
        Benicar    Sankyo Pharma         EP 503785(A1)     EN
        Bextra     Pfizer                WO1996025405(A1)  EN
        Casodex    AstraZeneca           EP 100172(A1)     EN
        Celebrex   Pfizer                WO1995015316(A1)  EN
        Cialis     Lilly ICOS            WO1995019978(A1)  EN
        Coreg      GlaxoSmithKline       DE 2815926(A1)    DE
        Cozaar     Merck & Co.           EP 253310(A2)     EN
        Detrol     Pfizer                EP 0325571(A1)    EN
        Diovan     Novartis              EP 443983(A1)     DE
        Femara     Novartis              EP 236940(A2)     EN
        Flovent    GlaxoSmithKline       NL 8100707(A)     NL
        Lamisil    Novartis              EP 24587(A1)      EN
        Lescol     Novartis              WO1984002131(A1)  EN
        Levitra    Bayer                 WO1999024433(A1)  DE
        Nasonex    Schering-Plough       EP 57401(A1)      EN
        Patanol    Nestle SA             EP 235796(A2)     EN
        Paxil      GlaxoSmithKline       EP 266574(A2)     EN
        Reyataz    Bristol-Myers Squibb  WO1997040029(A1)  EN
        Spiriva    Boehringer Ingelheim  EP 418716(A1)     DE
        Sustiva    Bristol-Myers Squibb  EP 582455(A1)     EN
        Tarceva    Genentech             WO1996030347(A1)  EN
        Vigamox    Alcon                 EP 550903(A1)     DE
        Zofran     GlaxoSmithKline       DE 3502508(A1)    DE
        Zomig      Medpointe Pharm       WO1991018897(A1)  EN
  • 6. previous work
      • Roger Sayle, “Foreign Language Translation of Chemical Nomenclature by Computer”, Journal of Chemical Information and Modeling, Vol. 49, No. 3, pp. 519-530, 2009.
      • Roger Sayle, “Preserving Nuance in Chemical Nomenclature Translation”, The ATA Chronicle (The Journal of the American Translators Association), Vol. 38, October 2009.
  • 7. international nomenclature
      • Morphology/semantics preserved across languages:
          – English: 2-(4-chlorophenoxy)acetic acid
          – Chinese: 2-(4-氯苯氧基)醋酸
          – German: 2-(4-chlorphenoxy)essigsäure
          – French: acide 2-(4-chlorophénoxy)acétique
          – Spanish: ácido 2-(4-clorofenoxi)acético
          – Swedish: 2-(4-klorofenoxi)ättiksyra
          – Italian: acido 2-(4-clorofenossi)acetico
          – Polish: kwas 2-(4-chlorofenoksy)octowy
          – Japanese: 2-(4-クロロフェノキシ)酢酸
  • 8. chinese elements
      • hydrogen  氢  [H]
      • oxygen    氧  [O]
      • nitrogen  氮  [N]
      • gold      金  [Au]
      • silver    银  [Ag]
      • iron      铁  [Fe]
      • lead      铅  [Pb]
  • 9. chinese alkanes
      • methane   甲烷  C
      • ethane    乙烷  CC
      • propane   丙烷  CCC
      • butane    丁烷  CCCC
      • pentane   戊烷  CCCCC
      • hexane    己烷  CCCCCC
      • heptane   庚烷  CCCCCCC
  • 10. chinese alkyl groups
      • methyl    甲基  *C
      • ethyl     乙基  *CC
      • propyl    丙基  *CCC
      • butyl     丁基  *CCCC
      • pentyl    戊基  *CCCCC
      • hexyl     己基  *CCCCCC
      • heptyl    庚基  *CCCCCCC
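The alkane and alkyl slides follow a regular pattern: a chain-length character (the Heavenly Stems 甲, 乙, 丙, ...) followed by 烷 for the alkane or 基 for the substituent. The following is a minimal Python sketch of that pattern, covering only the C1 to C7 entries listed on the slides; the function names are illustrative and are not taken from any NextMove software.

```python
# Chain-length characters used on the slides for C1 to C7 (Heavenly Stems).
STEMS = "甲乙丙丁戊己庚"
# Matching English stems, in the same order.
ENGLISH = ["meth", "eth", "prop", "but", "pent", "hex", "hept"]


def alkane(n):
    """Return (English name, Chinese name, SMILES) for an n-carbon straight-chain alkane."""
    return ENGLISH[n - 1] + "ane", STEMS[n - 1] + "烷", "C" * n


def alkyl(n):
    """Return (English name, Chinese name, SMILES fragment) for an n-carbon alkyl group."""
    # "*" marks the attachment point, as on the slide.
    return ENGLISH[n - 1] + "yl", STEMS[n - 1] + "基", "*" + "C" * n


if __name__ == "__main__":
    for n in range(1, 8):
        print(*alkane(n), *alkyl(n))
```

Running the script prints one row per chain length, reproducing the two tables above.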
  • 11. chinese functional groups
      • hydroxy-  羟基  *O
      • amino-    胺基  *N
      • fluoro-   氟    *F
      • chloro-   氯    *Cl
      • bromo-    溴    *Br
      • iodo-     碘    *I
      • nitro-    硝基  *N(=O)=O
  • 12. chinese carbocycles
      • benzene       苯  c1ccccc1
      • naphthalene   萘  c1ccc2ccccc2c1
      • azulene       薁
      • anthracene    蒽
      • phenanthrene  菲
      • fluorene      芴
      • pyrene        芘
  • 13. chinese heterocycles
      • furan       呋喃  o1cccc1
      • pyrrole     吡咯  [nH]1cccc1
      • thiophene   噻吩  s1cccc1
      • pyridine    吡啶  n1ccccc1
      • pyridazine  哒嗪  n1ncccc1
      • pyrimidine  嘧啶  n1cnccc1
      • pyrazine    吡嗪  n1ccncc1
  • 14. chinese carboxylic acids
      • formic acid     蚁酸  C(=O)O
      • acetic acid     醋酸  CC(=O)O
      • propionic acid  丙酸  CCC(=O)O
      • butyric acid    酪酸  CCCC(=O)O
      • valeric acid    戊酸  CCCCC(=O)O
      • hexanoic acid   己酸  CCCCCC(=O)O
      • enanthic acid   庚酸  CCCCCCC(=O)O
  • 15. example #1: aspirin
      • 2-acetoxybenzoic acid
      • 2-乙酰氧基苯甲酸
  • 16. example #2: alprenolol
      • 1-(2-allylphenoxy)-3-(isopropylamino)propan-2-ol
      • 1-(2-烯丙基苯氧基)-3-(异丙基氨基)丙-2-醇
  • 17. example #3: aminopyrine
      • 4-dimethylamino-1,5-dimethyl-2-phenyl-pyrazol-3-one
      • 4-二甲氨基-1,5-二甲基-2-苯基-吡唑-3-酮
  • 18. example #4: caffeine
      • 1,3,7-trimethylpurine-2,6-dione
      • 1,3,7-三甲基嘌呤-2,6-二酮
  • 19. simplified chinese to english for chemical text mining of patents
      • Early efforts in (Chinese) chemical nomenclature translation focused on translating and round-tripping delimited and often machine-generated IUPAC names.
      • More recently, the focus has shifted to applying this technology to chemical named entity recognition (CNER) in pharmaceutical patents.
      • The rest of this presentation describes the current state of the art on this challenge, and the influence of Chinese OCR on compound retrieval rates.
  • 20. leadmine introduction
      • LeadMine is NextMove Software’s application for chemical text mining, built upon CaffeineFix for automatic chemical spelling correction.
      • R. Sayle, P.H. Xie and S. Muresan, “Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction”, J. Chem. Inf. Model. 52(1), pp. 51-62, January 2012.
      • Recent improvements to LeadMine and CaffeineFix were described earlier in the week in the CINF session on “Chemical Information in Patents”.
  • 21. leadmine v2.0 chinese support
      • Chinese is supported by preprocessing the input text, substituting/transliterating the hanzi characters found in IUPAC-like names into English.
      • The current translation dictionary (June 2012) includes 1046 substitution rules.
      • Traditional Chinese is supported by automatically mapping, internally, 166 traditional characters found in IUPAC-like names to their simplified forms.
      • Efficient matching is made possible by “compiling” these rules into a 7000-line Java source file.
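The preprocessing described on slide 21 can be illustrated with a toy substitution pass. The sketch below is written in Python and makes several assumptions: the two small rule tables are invented for illustration (they are not the 1046-rule LeadMine dictionary or its 166-character traditional mapping), and greedy longest-match scanning is one plausible matching strategy, not necessarily what the compiled Java rules do.

```python
# Hypothetical traditional-to-simplified mapping (illustrative subset only).
TRADITIONAL_TO_SIMPLIFIED = {"醯": "酰", "氫": "氢"}

# Hypothetical substitution rules, just enough to cover slides 15 and 18.
RULES = {
    "乙酰氧基": "acetoxy",
    "苯甲酸": "benzoic acid",
    "甲基": "methyl",
    "嘌呤": "purine",
    "三": "tri",
    "二": "di",
    "酮": "one",
}


def translate(name):
    """Rewrite a Chinese IUPAC-like name into English by greedy longest-match substitution."""
    # Step 1: normalise traditional characters to their simplified forms.
    name = "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in name)
    # Step 2: try longer rules first so "苯甲酸" wins over "甲基"-style partial matches.
    keys = sorted(RULES, key=len, reverse=True)
    out, i = [], 0
    while i < len(name):
        for key in keys:
            if name.startswith(key, i):
                out.append(RULES[key])
                i += len(key)
                break
        else:
            # Locants, hyphens and brackets are shared across languages; copy them through.
            out.append(name[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(translate("2-乙酰氧基苯甲酸"))
    print(translate("1,3,7-三甲基嘌呤-2,6-二酮"))
```

With these rules the two calls reproduce the English names for aspirin and caffeine given on slides 15 and 18. A real dictionary also has to handle reordering (for example acids in French or Spanish word order), which plain substitution does not address.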
  • 22. example noisy input
      5.如权利要求1所述的化合物,所述化合物是:2_(4-氯-苯氧 基)-2-曱基-1^-噻唑-2-基-丙酰胺;2—曱基-2-(4-曱硫基-苯氧 基)-N-噻唑-2-基-丙酰胺;2-(6-氯-吡啶-2-基氧基)-2-曱基-N-噻唑 -2-基-丙酰胺; 2_曱基-2-(萘-1 -基氧基)-N-噻唑-2-基-丙酰胺 ; 2_曱基-2-(萘-2-基氧基)->^噻唑-2-基-丙酰胺; 2-(2,4-二 氟-苯氧基)-2-曱基-N-噻唑-2-基- 丙酰胺; 2-(4-氟-苯硫基)-2- 曱基-N-瘗唑-2-基-丙酰胺; 2-曱基-2-(4-苯氧基-苯氧基)-1^-噻 唑-2-基-丙酰胺; 2_甲基-N-噻唑-2-基-2-(4'-三氟甲氧基-联苯- 4-基氧基)-丙酰胺; 2-(苯并[1,3]二氧杂环戊烯-5-基氧基)- ^(5-氯-噻唑-2-基)-2-曱基-丙酰胺;N-(5-氯-噻唑-2-基)-2-(2,4- 二氟-苯氧基)-2-曱基-丙酰胺; 2-(5-氯-吡啶-3-基氧基)-N-(5-氯 -瘗唑-2-基)-2-曱基-丙酰胺;...
  • 23. example annotated output
  • 24. benchmark #1: CN101622231A
      • The UTF-16 encoded XML for this patent was kindly provided/suggested by Steve Boyer (IBM/Collabra).
      • This document corresponds to US patent application 2010/0144772 A1; the non-OCR full text is available from Google Patents.
      • Claim #5 lists 168 fully exemplified structures explicitly claimed by this patent.
  • 25. benchmark #1: CN101622231A
      • 144/168 (85.7%) can be recognized by LeadMine v2 from the English text (US 2010144772).
      • 0/168 (0%) can be recognized by LeadMine from the text translated using Google Translate.
      • 5/168 (3%) could be recognized by LeadMine v1, which uses OpenEye Scientific Software’s Lexichem for Chinese translation.
      • 71/168 (42%) can be recognized by LeadMine v2, which uses NextMove Software’s Chinese translation.
      • Only 9/168 (5.4%) can be recognized when OCR correction is disabled.
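The percentages on slide 25 are simple recall figures over the 168 claimed structures. As a quick check, the short Python sketch below recomputes them from the counts on the slide, and also computes the Chinese result relative to the English one (71 of 144), a rough ratio that ignores the few compounds found only in the Chinese text. The dictionary labels are paraphrases of the slide bullets, not official names.

```python
TOTAL_CLAIMED = 168  # structures listed in claim #5

# Counts as reported on the slide; keys are descriptive labels only.
RECOGNIZED = {
    "English text, LeadMine v2": 144,
    "Google Translate output": 0,
    "LeadMine v1 with Lexichem translation": 5,
    "LeadMine v2 Chinese translation": 71,
    "LeadMine v2, OCR correction disabled": 9,
}


def percent(hits, total):
    """Return hits as a percentage of total."""
    return 100.0 * hits / total


if __name__ == "__main__":
    for label, hits in RECOGNIZED.items():
        print(f"{label}: {hits}/{TOTAL_CLAIMED} = {percent(hits, TOTAL_CLAIMED):.1f}%")
    # Chinese recall relative to what is recoverable from the English text.
    print(f"Chinese vs English: 71/144 = {percent(71, 144):.1f}%")
```

The relative figure comes out just under one half, in line with the "about 50%" quoted in the conclusions.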
  • 26. benchmark #1: CN101622231A
      • Three compounds (39, 79 and 161) can be recovered from the Chinese text but could not be found in the English patent! These appear to be typos corrected during translation.
      • A mistake by the translators has also resulted in an accidental duplication of names 33, 34 and 35, such that there are 171 compounds in the CN patent.
      • A significant cause of failures is OCR misrecognition of “甲” (as found in methyl), which is frequently misread as “曱”.
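The homoglyph problem described above lends itself to a character-level repair step before translation. The Python sketch below is an assumption-laden illustration, not the CaffeineFix algorithm: only the 曱/甲 confusion is stated on the slide, the 瘗/噻 pair is inferred from the noisy example on slide 22 (where 瘗唑 appears in place of 噻唑), and treating underscores and long dashes as hyphens is likewise a guess based on that example.

```python
# Homoglyph repairs: the first is stated on the slide, the second is inferred
# from the noisy example text and should be treated as an assumption.
HOMOGLYPHS = {"曱": "甲", "瘗": "噻"}

# Assumed separator repairs: underscore and em dash (U+2014) read as hyphen.
SEPARATORS = {"_": "-", "\u2014": "-"}

# One translation table for str.translate, built once.
REPAIR_TABLE = str.maketrans({**HOMOGLYPHS, **SEPARATORS})


def repair(text):
    """Replace known OCR confusions with the characters expected in chemical names."""
    return text.translate(REPAIR_TABLE)


if __name__ == "__main__":
    print(repair("2_曱基-2-(4-氟-苯硫基)-N-瘗唑-2-基-丙酰胺"))
```

Blind replacement like this is only reasonable inside text already suspected to be a chemical name; applied to a whole document it could corrupt legitimate uses of the replaced characters.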
  • 27. benchmark #2: CN1964970
      • The non-searchable PDF of CN1964970 was provided by the NIBR-IT text mining group at Novartis.
      • These images can be freely downloaded via EPO.
      • This document corresponds to US patent application US20050234042.
      • The challenge is to extract the 57 exemplified structures from claim #24 (on pages 19-23).
      • This benchmark tested the quality/impact of Chinese Optical Character Recognition (OCR) software.
  • 28. benchmark #2: CN1964970
      • Two OCR approaches were evaluated: the open source tesseract from HP/Google and the commercial product Abbyy FineReader version 11.
      • For tesseract, conversion to 300x300 dpi image files was done using GPL Ghostscript v9.05.
      • Of the two lossless image file formats supported by tesseract’s leptonica library (v1.68), TIFF g3 format produced smaller files than PNG.
      • Abbyy FineReader processes pages significantly faster than the 1 min/page of Google’s tesseract.
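The slides do not give the exact commands used, so the following Python sketch only shows one plausible way to reproduce the tesseract pipeline: rasterise the PDF to 300 dpi TIFF G3 pages with Ghostscript, then run tesseract with its Simplified Chinese language data. The file names and output pattern are hypothetical; the `tiffg3` device, the `-r` resolution flag and the `chi_sim` language code are standard Ghostscript and tesseract options, but whether the authors used these settings is an assumption.

```python
import subprocess


def ghostscript_cmd(pdf_path, out_pattern="page-%03d.tif", dpi=300):
    """Build a Ghostscript command that writes one TIFF G3 image per PDF page."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffg3",              # bilevel, fax-style G3 compression
        f"-r{dpi}",                     # same resolution in x and y
        f"-sOutputFile={out_pattern}",  # %03d expands to the page number
        pdf_path,
    ]


def tesseract_cmd(image_path, out_base, lang="chi_sim"):
    """Build a tesseract command; the recognised text is written to out_base + '.txt'."""
    return ["tesseract", image_path, out_base, "-l", lang]


def run(cmd):
    """Execute a command, raising CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; both tools must be installed for run() to succeed.
    print(" ".join(ghostscript_cmd("CN1964970.pdf")))
    print(" ".join(tesseract_cmd("page-019.tif", "page-019")))
```

Keeping command construction separate from execution makes it easy to log or inspect the exact invocation for each page before running it.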
  • 29. benchmark #2: CN1964970
      • 50/57 (88%) of compounds can be found by LeadMine from the English patent, US20050234042.
      • 0/57 compounds can be found using Abbyy FineReader’s pure “Simplified Chinese” setting.
      • 4/57 (7%) can be found using FineReader’s hybrid “Simplified Chinese and English” language setting.
      • 2/57 (3.5%) can be found using HP/Google’s open source tesseract package.
      • Similar results were also seen with CN1684951.
  • 30. conclusions
      • In some respects, the ability to mine any chemical structures from Chinese patents is encouraging.
      • On high quality OCR input, about 50% of the names found in English CNER can be retrieved from Chinese.
      • Results are highly dependent upon the quality of OCR software used (IBM looks to have/use the best).
      • Abbyy FineReader’s “Simplified Chinese and English” performed best of the tested OCR software.
      • Interestingly, some things are only found in Chinese.
  • 31. potential future work
      • Evaluate additional Chinese OCR packages, including Adobe Acrobat, COCR2, ReadIris 14 and Dan Ching.
      • Teach LeadMine’s Chinese translation about more problematic homoglyph errors from Chinese OCR.
      • Make use of (tesseract’s) per-character OCR confidence during CaffeineFix spelling correction.
      • Perhaps fund a (Chinese speaking) summer student to develop improved “training data” files for Google’s tesseract, to better support the fonts used by the Chinese patent office.
  • 32. acknowledgements
      • Sorel Muresan, Plamen Petrov and Paul Hongxing Xie, AstraZeneca R&D, Molndal, SE.
      • Steve Boyer, IBM/Collabra, San Jose, USA.
      • Fatma Oezdemir-Zaech, NIBR-IT, Novartis, Basel, CH.
      • OpenEye Scientific Software, Santa Fe, USA.
      • Wikipedia.
      • Thank you for your time.