The application of text and data mining to enhance the RSC publication archive

The Application of Text and Data Mining to Enhance the Royal Society of Chemistry Publication Archive
Antony Williams
Emerging Trends in Scholarly Publishing™ Seminar, Washington, April 24th 2014

So, I’m writing an article…

And these…I will lose data 

Data in Publications
• This is not new, you know the story…
• So much data of value is contained within a
publication and delivered in a PDF form
• PDF files, and unclear licensing/copyright,
limit access to the data, so I cannot easily rework,
reuse, repurpose, text mine etc.
• “I specialize in XXXX. I want a database of
YYYY extracted from publications and made
available, for free, with the capabilities I
need, and the publishers should just do it”

And over the years, progress…
• There is much progress with open access, data
access, licensing, enhanced articles, open
data, free online tools, open source code,
publishers waking up, scientists contributing
• We should be excited at what is available now,
what the future holds, what opportunities exist
in front of us

It is so difficult to navigate…
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?

“Data enable” publications?
• We would LOVE to bring data out of our archive
• What could we do?
• Find chemical names and generate structures
• Find chemical images and generate structures
• Find reactions – and make a database!
• Find data (MP, BP, LogP) and host. Build models!
• Find figures and database them
• Find spectra (and link to structures)
• Validate the data algorithmically
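As a minimal sketch of the "find data" idea, the snippet below pulls melting points out of experimental text with a regular expression. The sample sentence, the pattern and the function name `find_melting_points` are illustrative assumptions, not the RSC pipeline.

```python
import re

# Hypothetical pattern: "mp" or "m.p." followed by a value or a range in °C.
MP_PATTERN = re.compile(
    r"\bm\.?p\.?\s*:?\s*(\d+(?:\.\d+)?)(?:\s*[-–]\s*(\d+(?:\.\d+)?))?\s*°C",
    re.IGNORECASE,
)

def find_melting_points(text):
    """Return (low, high) tuples in °C; low == high for single values."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results

# Made-up sentence, for illustration only.
sample = "White solid, mp 142-144 °C; the hydrate had m.p. 98.5 °C."
print(find_melting_points(sample))
```

Real articles report properties in many more ways than this pattern covers, which is one reason algorithmic validation of the extracted values matters.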

Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride
( 5 ml ) and benzene ( 50 ml ) were charged into a glass
reaction vessel equipped with a mechanical stirrer ,
thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product N-(β-chloroethyl)-N-
methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a
solid residue

But names = structures
• Systematic names can be generated FROM
chemical structures algorithmically
• …and structures from systematic names

But what of trivial names?
• What about trivial names, trade names, CAS
numbers, multilingual names etc.?

Searching that lipid in patents

ChemSpider
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our
journals and our collaborators
• Structure-centric hub for web-searching
• …and a really big dictionary!!!
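The "really big dictionary" idea can be pictured with a toy lookup in which trivial names, CAS numbers and foreign-language names all resolve to one structure record. This is only a sketch: the table, the record layout and the function name `resolve` are assumptions for illustration, and the single record shown is the well-known data for aspirin.

```python
# One structure record shared by every synonym (standard data for aspirin).
ASPIRIN = {
    "smiles": "CC(=O)Oc1ccccc1C(=O)O",
    "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
}

# Keys are stored lower-case with single spaces.
SYNONYMS = {
    "aspirin": ASPIRIN,                # trivial name
    "acetylsalicylic acid": ASPIRIN,   # common chemical name
    "2-acetoxybenzoic acid": ASPIRIN,  # systematic-style name
    "acetylsalicylsäure": ASPIRIN,     # German name
    "50-78-2": ASPIRIN,                # CAS Registry Number
}

def resolve(identifier, table=SYNONYMS):
    """Normalise case and whitespace, then look the identifier up."""
    key = " ".join(identifier.lower().split())
    return table.get(key)

for query in ["Aspirin", "50-78-2", "  Acetylsalicylic   ACID ", "unobtainium"]:
    record = resolve(query)
    print(query.strip(), "->", record["inchikey"] if record else "not in dictionary")
```

Names that miss the dictionary are the ones left for automated name to structure conversion.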

Experimental/Predicted Properties

Chemical vendors and data sources

How is DERA going?
• We have text-mined all 21st century articles…
>100k articles from 2000-2013
• Marked up with XML and published onto the
HTML forms of the articles
• Required multiple iterations of the
dictionaries, markup and text mining
• New visualization tools in development – not
just chemical names. Adding chemical and
biomedical terms markup also!
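A rough sketch of dictionary-driven XML markup follows. The `<compound>` element, its `smiles` attribute and the two-entry dictionary are made up for illustration; the actual RSC markup schema is not shown in these slides.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mini-dictionary mapping names to SMILES strings.
DICTIONARY = {"thionyl chloride": "ClS(Cl)=O", "benzene": "c1ccccc1"}

def mark_up(text, dictionary=DICTIONARY):
    """Wrap every dictionary name in a <compound> element and escape the rest."""
    # Longest names first so multi-word names win over shorter ones.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE
    )
    parts, last = [], 0
    for m in pattern.finditer(text):
        parts.append(escape(text[last:m.start()]))  # text before the name
        smiles = dictionary[m.group(1).lower()]
        parts.append(
            f"<compound smiles={quoteattr(smiles)}>{escape(m.group(1))}</compound>"
        )
        last = m.end()
    parts.append(escape(text[last:]))  # trailing text
    return "".join(parts)

marked = mark_up("Unreacted thionyl chloride & benzene were stripped.")
print(marked)
```

Escaping the surrounding text keeps the output well-formed XML even when the article text contains characters such as `&`.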

But Context Gives Reactions
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride
( 5 ml ) and benzene ( 50 ml ) were charged into a glass
reaction vessel equipped with a mechanical stirrer ,
thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product N-(β-chloroethyl)-N-
methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a
solid residue

Is It Easy?
• Text-mining draws on dictionaries: ontologies, RSC ontologies (methods, reactions) and chemistry
• Known names: curated dictionaries
• Unknown names: automated name to structure conversion (ACD N2S, OPSIN)
• Outputs: marked-up XML and a chemical structures SD file
• Marked-up XML goes through production processes to give XML ready for publication
• CDX integration (coming soon)

So… compounds and reactions
• ChemSpider is a compounds repository
• We are building a Reactions Repository
• “Reaction Validation” procedures to check data
• Ontological approaches to classify the reactions
• But why stop at chemicals and reactions?

But publication data is FIGURES

So Turn “Figures” Into Data
[Figure: a FIGURE converted into EXTRACTED DATA]

Early Test Experiments

• 74 supplementary data documents / 3444 pages
• Extracted content in 1069 page instances to
produce 1151 spectra; > 80% of peaks extracted
to within 1-2 decimal places
• Working on batch extraction and production of
spectral data
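A figure such as "> 80% of peaks extracted" implies comparing extracted peak positions against reference values within some tolerance. A minimal sketch of that comparison is below; the peak lists, the 0.02 ppm tolerance and the function name are all hypothetical.

```python
def fraction_within_tolerance(extracted, reference, tolerance=0.02):
    """Fraction of reference peaks that have an extracted peak within `tolerance`."""
    if not reference:
        raise ValueError("reference peak list is empty")
    hits = sum(
        1
        for ref in reference
        if any(abs(ref - peak) <= tolerance for peak in extracted)
    )
    return hits / len(reference)

# Made-up peak positions in ppm, for illustration only.
reference = [2.57, 4.24, 4.35, 4.47, 6.95]
extracted = [2.57, 4.24, 4.36, 4.52, 6.95]
print(fraction_within_tolerance(extracted, reference))
```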

Validating Spectra
• How will we check data consistency?
• How do we know the structure and the spectra
match?
• Predict spectra and use algorithmic checking.
• Flag “suspect data” and crowdsource data
checking

1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35
(t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J =
2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H,
ArH)
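A "text spectrum" like the one above can be turned into structured peaks and checked algorithmically. The sketch below parses the shift, multiplicity and proton count of each peak and compares the total proton count with an expected value. The regular expression and function names are assumptions, and the expected count of 21 is chosen only to illustrate the check; in practice it would come from the molecular formula of the proposed structure.

```python
import re

PEAK = re.compile(
    r"(\d+\.\d+)(?:\s*[–-]\s*(\d+\.\d+))?"  # shift, or shift range, in ppm
    r"\s*\(\s*([a-z]+)\s*,\s*(\d+)H"         # multiplicity and proton count
)

def parse_text_spectrum(text):
    """Return one dict per reported peak."""
    peaks = []
    for low, high, multiplicity, n_h in PEAK.findall(text):
        peaks.append({
            "shift": float(low),
            "shift_end": float(high) if high else None,
            "multiplicity": multiplicity,
            "protons": int(n_h),
        })
    return peaks

def check_proton_count(peaks, expected):
    """Compare the summed integrals with the expected number of protons."""
    total = sum(p["protons"] for p in peaks)
    return total == expected, total

spectrum = (
    "δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), "
    "4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), "
    "4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), "
    "7.18–7.94 (m, 11H, ArH)"
)
peaks = parse_text_spectrum(spectrum)
ok, total = check_proton_count(peaks, expected=21)  # 21 is an assumed value
print(len(peaks), "peaks,", total, "protons, consistent:", ok)
```

A mismatch between the summed integrals and the formula is one cheap way to flag "suspect data" before any spectrum prediction is run.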

Visualization of Spectral Data
• For spectra associated with compounds we
will be viewing “interactive spectra”

What are we extracting?
• Compounds from compound names
• Reactions from the text
• Spectral extraction – from figures and text
• Extraction of data from “tables” – not only
CSV files but tables in the publication

BUT I hate text mining data
• DERA: using pipelining tools for text-mining
so we will be able to process documents
for mark-up
• Compound extraction/markup
• Reaction extraction/conversion
• Extract data from tables
• Convert “text spectra” to generate spectral
libraries
• REALLY???? AGGHHHHH!

DERA is FINE for an archive
The WRONG WAY otherwise!
• We should NOT be mining data out of future
publications
• Structures should be submitted “correctly”
• Spectra should be in digital spectral formats,
not images
• ESI should be RICH and interactive
• Data should be open, available, with
metadata and provenance

We can solve for Authors here
Will it be used though???

ChemSpider as a Foundation
• >30 million chemicals (and growing) with
associated experimental and predicted
property data, analytical data, links out to
hundreds of data sources, patents, journal
articles, books etc…is a lot of data!
• ChemSpider is free to access for everyone –
and the API means people program against it
• What projects can we benefit?

Support grant-based services
• Multiple European consortium-based grants
• PharmaSea (FP7 funded)
• Open PHACTS (IMI funded)
• UK National Chemical Database Service
(http://cds.rsc.org) – developing data repository
for lab data, integrating Electronic Lab Notebooks
• Open Drug Discovery projects

Open PHACTS
• 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using
semantic web technologies
• Open code, open data, open standards
• Academics, Pharmas, Publishers…
• To put medicines in the pipeline…

The Open PHACTS community ecosystem

Open Source Drug Discovery
India

Conclusions
• Great progress in mining the archive for
compounds
• Reaction extraction and spectral data are
underway
• All of the resulting data will be available to
the chemistry community

And that article I’m writing

And linking will InChI forward

Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
