The Royal Society of Chemistry (RSC) is one of the world’s most prominent scientific societies and STM publishers. Our contributions to the scientific community include a wide range of resources that help chemists access chemistry-related data, information and knowledge. These include ChemSpider, a compound-centric platform linking over 30 million chemical compounds with internet-based resources. Using this compound database and its associated chemical identifiers as a basis, the RSC is applying text and data mining to data-enable our archive of scientific publications. This presentation provides an overview of our technical approaches to text- and data-enabling the archive, how we are developing an integrated database of chemical compounds, reactions, and physical and analytical data, and how it will be used to facilitate scientific discovery.
The application of text and data mining to enhance the RSC publication archive
1. The Application of Text and Data Mining to Enhance the Royal Society of Chemistry Publication Archive
Antony Williams
Emerging Trends in Scholarly Publishing™ Seminar, Washington, April 24th 2014
5. Data in Publications
• This is not new, you know the story…
• So much data of value is contained within a
publication and delivered in a PDF form
• PDF files, and unclear licensing/copyright,
limit access to data so I can rework, reuse,
repurpose, text mine etc.
• “I specialize in XXXX. I want a database of
YYYY extracted from publications and made
available, for free, with the capabilities I
need, and the publishers should just do it”
6. And over the years, progress…
• There is much progress with open access, data
access, licensing, enhanced articles, open
data, free online tools, open source codes,
publishers waking up, scientists contributing
• We should be excited at what is available now,
what the future holds, what opportunities exist
in front of us
7. It is so difficult to navigate…
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Working On Now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?
8. “Data enable” publications?
• We would LOVE to bring data out of our archive
• What could we do?
• Find chemical names and generate structures
• Find chemical images and generate structures
• Find reactions – and make a database!
• Find data (MP, BP, LogP) and host. Build models!
• Find figures and database them
• Find spectra (and link to structures)
• Validate the data algorithmically
10. Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride
(5 ml) and benzene (50 ml) were charged into a glass
reaction vessel equipped with a mechanical stirrer,
thermometer and reflux condenser.
The reaction mixture was heated at reflux with stirring, for a
period of about one-half hour.
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product N-(β-chloroethyl)-N-
methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a
solid residue.
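
As a rough illustration of what a text-mining pass can pull out of a paragraph like the one above, the sketch below uses a regular expression to find reagent quantities. The pattern, the unit list and the function name `find_quantities` are assumptions made for this example; they are not the tooling used on the RSC archive.

```python
import re

PARAGRAPH = (
    "The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea "
    "prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged "
    "into a glass reaction vessel. The reaction mixture was heated at reflux with "
    "stirring, for a period of about one-half hour."
)

# Hypothetical pattern: a one- or two-word lowercase name followed by
# "(<number> <unit>)". The lookahead stops "and" being read as part of a name.
QUANTITY = re.compile(
    r"(?<![a-z])(?!and )([a-z]+(?: [a-z]+)?) "
    r"\(\s*(\d+(?:\.\d+)?)\s*(ml|g|mg|mmol)\s*\)"
)

def find_quantities(text):
    """Return (name, amount, unit) tuples for every quantity the pattern finds."""
    return [(name, float(amount), unit)
            for name, amount, unit in QUANTITY.findall(text)]

for name, amount, unit in find_quantities(PARAGRAPH):
    print(f"{name}: {amount} {unit}")
```

A pattern like this only catches short, simple names; the long systematic name at the start of the paragraph needs a dedicated name recognizer.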
12. But names = structures
• Systematic names can be generated FROM
chemical structures algorithmically
13. But names = structures
• …and structures from systematic names
14. But what of trivial names?
• What about trivial names, trade names, CAS
numbers, multilingual names etc.?
16. • ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our
journals and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
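
The "really big dictionary" idea can be sketched as a plain lookup table that maps trivial names, foreign-language names and CAS numbers onto one structure. The three-compound table below is a hypothetical miniature written for this example; a real dictionary would be generated from the compound database, and `normalize` and `lookup` are illustrative names only.

```python
# Hypothetical miniature dictionary: every synonym points at one SMILES string.
SYNONYMS = {
    "benzene": "c1ccccc1",
    "benzol": "c1ccccc1",             # German name
    "71-43-2": "c1ccccc1",            # CAS number
    "thionyl chloride": "ClS(Cl)=O",
    "7719-09-7": "ClS(Cl)=O",         # CAS number
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "acetylsalicylic acid": "CC(=O)Oc1ccccc1C(=O)O",
    "50-78-2": "CC(=O)Oc1ccccc1C(=O)O",  # CAS number
}

def normalize(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def lookup(name):
    """Return the SMILES for a known synonym, or None if it is not in the table."""
    return SYNONYMS.get(normalize(name))

print(lookup("Aspirin"))
print(lookup("Benzol"))
print(lookup("unobtainium"))
```

Systematic names would not go through a table like this; they can be converted to structures algorithmically, so the dictionary only has to cover the names that cannot.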
26. How is DERA going?
• We have text-mined all 21st century articles:
>100k articles from 2000-2013
• Marked up with XML and published onto the
HTML forms of the articles
• Required multiple iterations of the
dictionaries, markup and text mining
• New visualization tools in development – not
just chemical names. Add chemical and
biomedical terms markup also!
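
A minimal sketch of the markup step, assuming the recognized names are already in a dictionary: each known name in the text is wrapped in an XML element that carries its structure. The `<compound>` element, its `smiles` attribute and the function `mark_up` are invented for this example and are not the schema used on the published articles.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical dictionary of recognized names and their SMILES strings.
KNOWN = {"thionyl chloride": "ClS(Cl)=O", "benzene": "c1ccccc1"}

def mark_up(text, known=KNOWN):
    """Wrap each known compound name in a <compound> element, escaping the rest."""
    # Longest names first, so a short name never pre-empts a longer one.
    names = sorted(known, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names), re.IGNORECASE)
    pieces, position = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[position:match.start()]))
        smiles = known[match.group(0).lower()]
        pieces.append(
            f"<compound smiles={quoteattr(smiles)}>{escape(match.group(0))}</compound>"
        )
        position = match.end()
    pieces.append(escape(text[position:]))
    return "".join(pieces)

print(mark_up("Thionyl chloride (5 ml) and benzene (50 ml) were charged"))
```

Escaping the unmatched text as well as the names keeps the output well-formed when an article contains characters such as `&` or `<`.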
31. But Context Gives Reactions
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride
(5 ml) and benzene (50 ml) were charged into a glass
reaction vessel equipped with a mechanical stirrer,
thermometer and reflux condenser.
The reaction mixture was heated at reflux with stirring, for a
period of about one-half hour.
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product N-(β-chloroethyl)-N-
methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a
solid residue.
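
One simple way to read context is to look for a cue phrase such as "to yield" and treat compounds first mentioned before it as inputs and compounds first mentioned after it as products. The sketch below does only that; the cue list and the function `assign_roles` are assumptions for this example, and the compound names are assumed to come from an earlier name-recognition step.

```python
import re

START = "N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea"
PRODUCT = "N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea"
TEXT = (
    f"The {START} prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) "
    "were charged into a glass reaction vessel. After this time the benzene and "
    "unreacted thionyl chloride were stripped from the reaction mixture under reduced "
    f"pressure to yield the desired product {PRODUCT} as a solid residue."
)

# Hypothetical cue phrases that announce the product of a preparation.
PRODUCT_CUE = re.compile(r"\bto (?:yield|give|afford)\b")

def assign_roles(text, compound_names):
    """Split names into inputs and products by their first mention relative to the cue."""
    cue = PRODUCT_CUE.search(text)
    boundary = cue.start() if cue else len(text)
    roles = {"inputs": [], "products": []}
    for name in compound_names:
        position = text.find(name)
        if position == -1:
            continue  # name not mentioned in this paragraph
        roles["inputs" if position < boundary else "products"].append(name)
    return roles

ROLES = assign_roles(TEXT, [START, "thionyl chloride", "benzene", PRODUCT])
print(ROLES)
```

This does not separate the solvent (benzene) from the reactants; telling those apart needs more context than a single cue phrase.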
35. So… compounds and reactions
• ChemSpider is a compounds repository
• We are building a Reactions Repository
• “Reaction Validation” procedures to check data
• Ontological approaches to classify the reactions
• But why stop at chemicals and reactions?
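
One check that a "Reaction Validation" procedure could include is an element balance: the atoms on the left of the reaction should equal the atoms on the right. The sketch below is an assumption about what such a check might look like, not the procedure used for the Reactions Repository. The molecular formulas were worked out by hand from the names in the text-mining example, and SO2 and HCl are assumed as the usual by-products of a thionyl chloride chlorination; neither appears in the slides.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in a simple molecular formula (no brackets or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def is_balanced(reactants, products):
    """True when both sides of the reaction contain the same atoms."""
    left = sum((parse_formula(f) for f in reactants), Counter())
    right = sum((parse_formula(f) for f in products), Counter())
    return left == right

# Formulas derived by hand for this example (assumptions, see lead-in).
HYDROXYETHYL_UREA = "C7H9F3N4O2S"
CHLOROETHYL_UREA = "C7H8ClF3N4OS"

print(is_balanced([HYDROXYETHYL_UREA, "SOCl2"], [CHLOROETHYL_UREA, "SO2", "HCl"]))
print(is_balanced([HYDROXYETHYL_UREA, "SOCl2"], [CHLOROETHYL_UREA]))
```

Text-mined reactions rarely list by-products, so an imbalance is better treated as a flag for review than as proof of an extraction error.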
41. Early Test Experiments
74 supplementary data documents (3444 pages)
Extracted content from 1069 page instances to
produce 1151 spectra, with >80% of peaks extracted
to within 1-2 decimal places
Working on batch extraction and production of
spectral data
42. Validating Spectra
• How will we check data consistency?
• How do we know the structure and the spectra
match?
• Predict spectra and use algorithmic checking.
• Flag “suspect data” and crowd source data
checking
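
The algorithmic check could be as simple as comparing extracted peak positions with predicted ones and flagging a spectrum when too many peaks have no predicted partner. This is a sketch under stated assumptions: the tolerance, the threshold, the function names and the peak values are all hypothetical, and the predicted peaks would come from a spectrum-prediction tool that is not shown.

```python
def unmatched_peaks(extracted, predicted, tolerance=0.3):
    """Return extracted peaks with no predicted peak within the tolerance."""
    return [peak for peak in extracted
            if all(abs(peak - other) > tolerance for other in predicted)]

def is_suspect(extracted, predicted, tolerance=0.3, max_unmatched_fraction=0.2):
    """Flag a spectrum when too large a share of its peaks is unexplained."""
    if not extracted:
        return True  # nothing extracted is itself worth a second look
    missing = unmatched_peaks(extracted, predicted, tolerance)
    return len(missing) / len(extracted) > max_unmatched_fraction

# Hypothetical chemical shifts in ppm, made up for this example.
print(is_suspect([7.26, 3.45, 1.20], [7.30, 3.40, 1.25]))
print(is_suspect([7.26, 9.80], [7.30]))
```

Spectra that come back flagged would then go to the crowd for checking rather than being rejected outright.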
46. Visualization of Spectral Data
• For spectra associated with compounds we
will be viewing “interactive spectra”
47. What are we extracting?
• Compounds from compound names
• Reactions from the text
• Spectral extraction – from figures and text
• Extraction of data from “tables” – not only
CSV files but tables in the publication
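
Spectra reported as text usually follow a regular pattern, which makes them a good candidate for extraction. The sketch below parses a made-up 1H NMR string into a list of peaks; the example string, the regular expression and the function `parse_text_spectrum` are assumptions for illustration and will not cover every reporting style found in the archive.

```python
import re

# Hypothetical "text spectrum" written in a common reporting style.
TEXT_SPECTRUM = "1H NMR (400 MHz, CDCl3) δ 7.26 (s, 1H), 3.45 (t, 2H), 1.20 (d, 3H)"

# shift (multiplicity, nH), for example "3.45 (t, 2H)"
PEAK = re.compile(r"(\d+\.\d+)\s*\((\w+),\s*(\d+)H\)")

def parse_text_spectrum(text):
    """Return one dict per reported peak: shift in ppm, multiplicity, proton count."""
    return [
        {"shift": float(shift), "multiplicity": mult, "protons": int(count)}
        for shift, mult, count in PEAK.findall(text)
    ]

for peak in parse_text_spectrum(TEXT_SPECTRUM):
    print(peak)
```

Once peaks are held as structured records like these, they can be linked to the compound they belong to and collected into a spectral library.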
48. BUT I hate text mining data
• DERA: using pipelining tools for text-mining
so we will be able to process documents
for mark-up
• Compound extraction/markup
• Reaction extraction/conversion
• Extract data from tables
• Convert “text spectra” to generate spectral
libraries
• REALLY???? AGGHHHHH!
49. DERA is FINE for an archive
The WRONG WAY otherwise!
• We should NOT be mining data out of future
publications
• Structures should be submitted “correctly”
• Spectra should be in digital spectral formats,
not images
• ESI should be RICH and interactive
• Data should be open, available, with
metadata and provenance
51. We can solve for Authors here
Will it be used though???
52. ChemSpider as a Foundation
• >30 million chemicals (and growing) with
associated experimental and predicted
property data, analytical data, links out to
hundreds of data sources, patents, journal
articles, books etc…is a lot of data!
• ChemSpider is free to access for everyone –
and the API means people can program against it
• What projects can we benefit?
53. Support grant-based services
• Multiple European consortium-based grants
• PharmaSea (FP7 funded)
• Open PHACTS (IMI funded)
• UK National Chemical Database Service
(http://cds.rsc.org) – developing a data repository
for lab data, integrating Electronic Lab Notebooks
• Open Drug Discovery projects
55. • 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using
semantic web technologies
• Open code, open data, open standards
• Academics, Pharmas, Publishers…
• To put medicines in the pipeline…
59. Conclusions
• Great progress in mining the archive for
compounds
• Reaction extraction and spectral data are
underway
• All of the resulting data will be available to
the chemistry community