Towards automated mining
of chemical structures
in Chinese Patents
Daniel Bonniot de Ruisselet
ChemAxon
ICIC 2013, Vienna
October 16th 2013
Why Chinese patents matter
• Volume, exploding...
• Increasingly innovative
• Potential infringement, lawsuits
  – Apple (2008, 2012, 2013), Schneider Electric, Samsung, ...
• Hard to access because of language

Why chemical mining matters
• Find interesting patent(s) using text search
– Each patent can contain 100s of chemical names
– Convert them automatically to structures
– Enables chemical calculations

• Find interesting patent(s) using chemical structure search
– Requires building a chemical database index

• Track structures across multiple patents
– Including multiple languages
– Searching for prior art, infringement, …
– Chemical similarity search

• ...
Putting it together
Chinese patents matter
&
chemical mining matters
→
Chemical mining of Chinese
patents matters

ChemAxon?
• Cheminformatics, since 1998
• All of the top 15 global pharmas are customers
• Chemical database: indexing and searching
• English Name to Structure
• Document to Structure
• Missing piece: Chinese Name to Structure

Chinese Name to Structure

邓巍 (Wei Deng, a.k.a. David)
Builds on English name to structure
Specific dictionaries
Changes in algorithms...
The Challenges
1. Chinese texts have no spaces
2. Ester & Salt
   乙酸乙酯 (乙酸 acetic acid + 乙 ethyl + 酯 ester) = ethyl acetate
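
The first two challenges can be illustrated with a small sketch. It is not ChemAxon's algorithm: it assumes a greedy longest-match lookup against a tiny, hypothetical fragment dictionary (`FRAGMENTS`), which is one common way to split unspaced text into known name parts.

```python
# Hypothetical mini-dictionary: Chinese name fragment -> English fragment.
# A real system would need far larger, curated dictionaries.
FRAGMENTS = {
    "乙酸": "acetic acid",
    "乙": "ethyl",
    "酯": "ester",
    "盐酸": "hydrochloric acid",
    "盐": "salt",
    "酸": "acid",
}


def segment(name, dictionary=FRAGMENTS):
    """Split an unspaced name into dictionary fragments, longest match first."""
    max_len = max(len(key) for key in dictionary)
    tokens = []
    pos = 0
    while pos < len(name):
        # Try the longest possible fragment first, then shorter ones.
        for size in range(min(max_len, len(name) - pos), 0, -1):
            candidate = name[pos:pos + size]
            if candidate in dictionary:
                tokens.append(candidate)
                pos += size
                break
        else:
            raise ValueError(f"no dictionary entry at position {pos}")
    return tokens


tokens = segment("乙酸乙酯")
print(tokens)                          # ['乙酸', '乙', '酯']
print([FRAGMENTS[t] for t in tokens])  # ['acetic acid', 'ethyl', 'ester']
```

The fragments come out in the Chinese order (acid, alkyl group, ester), so a further reordering step is still needed to reach the English name "ethyl acetate".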
The Challenges
3. English: name alterations
   丁烷 → buta + ane → butane
4. Chinese: many characters have different meanings
   盐 = salt
   酸 = acid
   盐酸 = hydrochloric acid
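
As a rough illustration of challenge 3, the sketch below joins two English fragments and drops the stem's final "a" before a vowel. This is a deliberately simplified rule chosen for the example, and `join_fragments` is a hypothetical helper, not part of any named product.

```python
VOWELS = "aeiou"


def join_fragments(stem, suffix):
    """Join two name fragments, eliding the stem's final 'a' before a vowel."""
    if stem.endswith("a") and suffix and suffix[0] in VOWELS:
        return stem[:-1] + suffix
    return stem + suffix


print(join_fragments("buta", "ane"))    # butane
print(join_fragments("buta", "diene"))  # butadiene
```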
OCR Error correction
3-(笨基)丙酸 (OCR misread 笨 for 苯)
苯 = benzene
苯基 = phenyl
丙酸 = propionic acid
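
One way to picture this kind of correction is a lookup of visually confusable characters checked against known name fragments. The sketch below assumes exactly that: `KNOWN_FRAGMENTS` and `CONFUSABLE` are hypothetical toy tables, and the actual method used in the product is not described on the slide.

```python
# Hypothetical data: known name fragments and visually confusable characters.
KNOWN_FRAGMENTS = {"苯", "苯基", "丙酸"}
CONFUSABLE = {"笨": ["苯"]}


def correct_fragment(fragment):
    """Return a known fragment reachable by swapping one confusable character."""
    if fragment in KNOWN_FRAGMENTS:
        return fragment
    for i, ch in enumerate(fragment):
        for replacement in CONFUSABLE.get(ch, []):
            candidate = fragment[:i] + replacement + fragment[i + 1:]
            if candidate in KNOWN_FRAGMENTS:
                return candidate
    return fragment  # left unchanged when no correction is found


print(correct_fragment("笨基"))  # 苯基
```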
Chinese Document to Structure
• Additional challenge: no spaces
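• 如式I所示的{5-[2-(4-正辛基苯基)乙基]-2,2-二甲基-1,3-二氧六环-5-基}氨基甲酸叔丁酯是合成芬戈莫德及其衍生物的重要中间体。
  (English: tert-butyl {5-[2-(4-n-octylphenyl)ethyl]-2,2-dimethyl-1,3-dioxan-5-yl}carbamate, shown in formula I, is an important intermediate in the synthesis of fingolimod and its derivatives.)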

• XML Markup
  – Patent metadata
  – Encoding of characters
  – Tags (e.g. <p>)
• Document annotation

Document to Database

Validation 1: Chinese name to structure
• Test set: 38,600 Chinese names + CAS number
• Contains unusual, incorrect, ambiguous names, radicals, inorganic salts, ...
• Conversion rate = 59–79%
• Accuracy = 91%
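
The slide does not define the two figures. The sketch below assumes one plausible reading: conversion rate is the share of names that yield any structure, and accuracy is the share of converted names whose structure matches the reference. The function name and the toy data are hypothetical, and structures are compared as plain strings for simplicity.

```python
def conversion_metrics(results):
    """results: list of (predicted_structure_or_None, reference_structure)."""
    converted = [(pred, ref) for pred, ref in results if pred is not None]
    conversion_rate = len(converted) / len(results)
    correct = sum(1 for pred, ref in converted if pred == ref)
    # Accuracy is measured only over the names that were converted.
    accuracy = correct / len(converted) if converted else 0.0
    return conversion_rate, accuracy


# Toy data: three of four names convert, two of those three are right.
toy_results = [
    ("CCO", "CCO"),
    ("CC(=O)O", "CC(=O)O"),
    (None, "c1ccccc1"),
    ("CCN", "CCO"),
]
rate, accuracy = conversion_metrics(toy_results)
print(rate)  # 0.75
```

Under this reading the two numbers are independent: a converter can refuse many names (low conversion rate) while being right about most of the names it does convert (high accuracy).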

Validation 2: Chinese patents
• 54K Chinese patents with automated English translation
• Filter: structures with at least 20 heavy atoms, and patents with at least 20 structures
• Remaining: 2,108 patents
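
The filter can be sketched as follows, assuming heavy-atom counts are already computed per structure and that the structure filter is applied before the per-patent count (the slide does not state the order). `filter_patents` and the toy data are hypothetical.

```python
def filter_patents(patents, min_heavy_atoms=20, min_structures=20):
    """patents: dict of patent id -> list of heavy-atom counts, one per structure."""
    kept = {}
    for patent_id, heavy_atom_counts in patents.items():
        # Keep only structures that are large enough.
        large = [n for n in heavy_atom_counts if n >= min_heavy_atoms]
        # Keep the patent only if enough large structures remain.
        if len(large) >= min_structures:
            kept[patent_id] = large
    return kept


toy_patents = {
    "A": [25] * 20,        # 20 large structures: kept
    "B": [25] * 19 + [5],  # only 19 large structures: dropped
    "C": [10] * 30,        # many structures, all small: dropped
}
print(sorted(filter_patents(toy_patents)))  # ['A']
```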

Conclusions
• Patent volume in Chinese is booming
• It is important to mine & monitor it
• Automated solutions are needed, but hard
• General purpose auto translation is not enough
• Chinese N2S already gives better results
• ChemAxon can build solutions for specific workflows
• More collaboration with patent providers is needed to keep improving quality and solutions
谢谢! (Thank you!)

Extra information

Automatic OCR Error Correction
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
→ (2R)-2-methylsulfanyl-3-hydroxybutanedioate
Λr-benzyl-Λr-[3-(lH-tetrazol-5-yl)phenyl]propanamide
→ N-benzyl-N-[3-(1H-tetrazol-5-yl)phenyl]propanamide
我们日前止在研究开友中文化字名称的 OCR 白动纠错工力能
→ 我们目前正在研究开发中文化学名称的 OCR 自动纠错功能
(English: We are currently researching and developing automatic OCR error correction for Chinese chemical names)
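
The English examples above show typical OCR confusions (rn for m, 1 for l, 0 for o). A minimal sketch of one possible approach follows: generate alternatives at each confusable spot and accept the first candidate found in a vocabulary. `CONFUSIONS` and `VOCABULARY` are hypothetical toy tables, and the search grows exponentially with the number of confusable spots, so it suits short fragments only.

```python
import re
from itertools import product

# Hypothetical confusion table: OCR output -> possible intended text.
CONFUSIONS = {"rn": ("rn", "m"), "1": ("1", "l"), "0": ("0", "o")}
# Hypothetical vocabulary of valid name fragments.
VOCABULARY = {"methylsulfanyl", "hydroxybutanedioate"}
PATTERN = re.compile("(" + "|".join(map(re.escape, CONFUSIONS)) + ")")


def correct_word(word):
    """Return a vocabulary word reachable by undoing OCR confusions."""
    if word in VOCABULARY:
        return word
    # Split while keeping the confusable pieces as separate parts.
    parts = PATTERN.split(word)
    options = [CONFUSIONS.get(part, (part,)) for part in parts]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate in VOCABULARY:
            return candidate
    return word  # left unchanged when nothing matches


print(correct_word("rnethylsulfany1"))      # methylsulfanyl
print(correct_word("hydr0xybutanedi0ate"))  # hydroxybutanedioate
```

Checking candidates against a vocabulary keeps genuine digits, such as locants, from being rewritten just because they look like letters.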
From Document to Structures

Non-searchable patent (50 pages)

Structure (text + image) + location
ChemAxon’s “Document to Structure”
• Extract chemical information from documents
– Names: powered by the Naming Technology
– Also import SMILES, InChI, CAS number …
– Images: OSRA, ...
– Works with scanned non-searchable PDF
– Returns structures and their location in the document
ChemAxon’s “Document to Structure”
• Supported formats:
– MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …
– Embedded structure objects (ChemDraw, Symyx, Marvin, …)
– PDF, text, XML, HTML

ChemAxon’s “Document to Database”
• Data in DB:
– Structures
– Source (name, smiles, embedded, …) and location
– Documents, Authors, Metadata...

• Questions:
– What structures appear in a specific document?
– What documents contain a structure/substructure/...?
– What documents written since 2010 in location X contain substructure S?
– ...
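
The questions above map naturally onto relational queries. The sketch below uses Python's built-in `sqlite3` with a hypothetical two-table schema. It is not the product's schema, and exact matching on SMILES strings stands in for real structure or substructure search, which would need a chemistry toolkit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per document, one row per structure occurrence.
conn.executescript("""
CREATE TABLE documents (doc_id TEXT PRIMARY KEY, year INTEGER, location TEXT);
CREATE TABLE occurrences (doc_id TEXT, smiles TEXT, source TEXT, page INTEGER);
""")
conn.executemany("INSERT INTO documents VALUES (?, ?, ?)",
                 [("D1", 2009, "CN"), ("D2", 2012, "CN")])
conn.executemany("INSERT INTO occurrences VALUES (?, ?, ?, ?)",
                 [("D1", "CCO", "name", 3),
                  ("D2", "CCO", "smiles", 7),
                  ("D2", "c1ccccc1", "name", 1)])


def structures_in(doc_id):
    """What structures appear in a specific document?"""
    rows = conn.execute(
        "SELECT DISTINCT smiles FROM occurrences WHERE doc_id = ? ORDER BY smiles",
        (doc_id,))
    return [row[0] for row in rows]


def documents_with(smiles, since=None, location=None):
    """What documents contain a structure, optionally filtered by year and location?"""
    query = ("SELECT DISTINCT d.doc_id FROM documents d "
             "JOIN occurrences o ON o.doc_id = d.doc_id WHERE o.smiles = ?")
    params = [smiles]
    if since is not None:
        query += " AND d.year >= ?"
        params.append(since)
    if location is not None:
        query += " AND d.location = ?"
        params.append(location)
    query += " ORDER BY d.doc_id"
    return [row[0] for row in conn.execute(query, params)]


print(structures_in("D2"))
print(documents_with("CCO", since=2010, location="CN"))
```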

Scaling API-first – The story of a global engineering organization
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

ICIC 2013 Conference Proceedings Daniel Bonniot ChemAxon

  • 1. Towards automated mining of chemical structures in Chinese Patents Daniel Bonniot de Ruisselet ChemAxon ICIC 2013, Vienna October 16th 2013
  • 5. Why Chinese patents matter • Volume, exploding... • Increasingly innovative • Potential infringement, lawsuits ● Apple (2008, 2012, 2013), Schneider Electric, Samsung, ... • Hard to access because of language
  • 6. Why chemical mining matters • Find interesting patent(s) using text search – Each patent can contain 100s of chemical names – Convert them automatically to structures – Enables chemical calculations • Find interesting patent(s) using chemical structure search – Requires building a chemical database index • Track structures across multiple patents – Including multiple languages – Searching for prior art, infringement, … – Chemical similarity search • ...
  • 7. Putting it together: Chinese patents matter & chemical mining matters → chemical mining of Chinese patents matters
  • 8. ChemAxon? • Cheminformatics, since 1998 • All of the top 15 global pharmas are customers • Chemical database: indexing and searching • English Name to Structure • Document to Structure • Missing piece: Chinese Name to Structure
  • 9. Chinese Name to Structure • 邓巍 (Wei Deng, a.k.a. David) • Builds on English name to structure • Specific dictionaries • Changes in algorithms...
  • 10. The Challenges 1. Chinese texts have no spaces 2. Ester & Salt: 乙酸乙酯 = ethyl acetate
  • 11. The Challenges 3. English: name alterations, 丁烷 → buta + ane → butane 4. Chinese: many characters have different meanings: 盐 = salt, 酸 = acid, 盐酸 = hydrochloric acid
  • 12. OCR Error correction: 3-(笨基)丙酸, where 笨基 should read 苯基 • 苯 = benzene • 苯基 = phenyl • 丙酸 = propionic acid
  • 14. Chinese Document to Structure • Additional challenge: no spaces • 如式I所示的{5-[2-(4-正辛基苯基)乙基]-2,2-二甲基-1,3-二氧六环-5-基}氨基甲酸叔丁酯是合成芬戈莫德及其衍生物的重要中间体。 (English: "The tert-butyl {5-[2-(4-n-octylphenyl)ethyl]-2,2-dimethyl-1,3-dioxan-5-yl}carbamate shown in formula I is an important intermediate for synthesizing fingolimod and its derivatives.") • XML Markup ● Patent metadata ● Encoding of characters ● Tags (e.g. <p>) • Document annotation
  • 21. Validation 1: Chinese name to structure • Test set: 38,600 Chinese names + CAS number • Contains unusual, incorrect, ambiguous names, radicals, inorganic salts • Conversion rate = 59–79% • Accuracy = 91%
  • 22. Validation 2: Chinese patents • 54K Chinese patents with automated English translation • Filter: structures with at least 20 heavy atoms, and patents with at least 20 structures • Remaining: 2,108 patents
  • 23. Validation 2: Chinese patents
  • 24. Conclusions • Patent volume in Chinese is booming • It is important to mine & monitor it • Automated solutions are needed, but hard • General-purpose automatic translation is not enough • Chinese N2S already gives better results • ChemAxon can build solutions for specific workflows • More collaboration with patent providers is needed to keep improving quality and solutions • 谢谢! (Thank you!)
  • 26. Automatic OCR Error Correction • (2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate → (2R)-2-methylsulfanyl-3-hydroxybutanedioate • Λr-benzyl-Λr-[3-(lH-tetrazol-5-yl)phenyl]propanamide → N-benzyl-N-[3-(1H-tetrazol-5-yl)phenyl]propanamide • 我们日前止在研究开友中文化字名称的 OCR 白动纠错工力能 → 我们目前正在研究开发中文化学名称的 OCR 自动纠错功能 (English: "We are currently researching and developing automatic OCR error correction for Chinese chemical names")
  • 27. From Document to Structures: non-searchable patent (50 pages) → structures (text + image) + location
  • 28. ChemAxon’s “Document to Structure” • Extract chemical information from documents – Names: powered by the Naming Technology – Also import SMILES, InChI, CAS number … – Images: OSRA, ... – Works with scanned non-searchable PDF – Returns structures and their location in the document
  • 29. ChemAxon’s “Document to Structure” • Supported formats: – MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt … – Embedded structure objects (ChemDraw, Symyx, Marvin, …) – PDF, text, XML, HTML
  • 30. ChemAxon’s “Document to Database” • Data in DB: – Structures – Source (name, smiles, embedded, …) and location – Documents, Authors, Metadata... • Questions: – What structures appear in a specific document? – What documents contain a structure/substructure/...? – What documents written since 2010 in location X contain substructure S? – ...
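
The challenges listed on slides 10 to 12 (no spaces between words, characters whose meaning depends on their neighbours, and OCR misreadings such as 笨 for 苯) can be illustrated with a toy dictionary approach. The sketch below is only an illustration, not ChemAxon's implementation: the fragment dictionary, the confusion table, and the function names segment and correct_ocr are all assumptions made for this example, built from the name fragments quoted on the slides. It uses greedy longest-match segmentation, which is one common technique for unspaced text, and accepts an OCR substitution only when it reduces the number of unrecognised tokens.

```python
# Hypothetical toy dictionary: only fragments quoted on the slides.
FRAGMENTS = {
    "盐酸": "hydrochloric acid",
    "盐": "salt",
    "酸": "acid",
    "苯基": "phenyl",
    "苯": "benzene",
    "丙酸": "propionic acid",
    "丁烷": "butane",
}

# Hypothetical OCR confusion table: misread character -> intended character.
OCR_CONFUSIONS = {"笨": "苯"}


def segment(name, lexicon=FRAGMENTS):
    """Greedy longest-match segmentation of an unspaced name.

    Characters that start no known fragment are emitted as single tokens.
    """
    max_len = max(len(key) for key in lexicon)
    tokens = []
    i = 0
    while i < len(name):
        # Try the longest candidate first so that 盐酸 wins over 盐 + 酸.
        for size in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i:i + size]
            if piece in lexicon:
                tokens.append(piece)
                i += size
                break
        else:
            tokens.append(name[i])
            i += 1
    return tokens


def unknown_count(name, lexicon=FRAGMENTS):
    """Number of tokens that the lexicon does not recognise."""
    return sum(1 for token in segment(name, lexicon) if token not in lexicon)


def correct_ocr(name, confusions=OCR_CONFUSIONS, lexicon=FRAGMENTS):
    """Apply a confusion-table substitution only if it yields fewer unknown tokens."""
    chars = list(name)
    for i, ch in enumerate(chars):
        if ch in confusions:
            candidate = chars[:i] + [confusions[ch]] + chars[i + 1:]
            if unknown_count("".join(candidate), lexicon) < unknown_count("".join(chars), lexicon):
                chars = candidate
    return "".join(chars)


if __name__ == "__main__":
    print(segment("盐酸"))              # ['盐酸']
    print(correct_ocr("3-(笨基)丙酸"))   # 3-(苯基)丙酸
    print(segment("3-(苯基)丙酸"))       # ['3', '-', '(', '苯基', ')', '丙酸']
```

Longest match is what lets 盐酸 resolve to hydrochloric acid instead of salt followed by acid. A real system would need a far larger dictionary and a way to handle cases where the greedy choice is wrong, which this sketch does not attempt.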