Data mining dissertations
Adventures and Experiences in the World of Chemistry
Antony Williams
CLIR/DLF Postdoctoral Fellowship Summer Seminar,
July 2014
What a small world…
• Who’s got an ORCID?
• Who has heard of, or been involved with, AltMetrics?
• Who has edited a Wikipedia page?
• Who has direct experience of text mining?
• All slides already on Slideshare here:
• www.slideshare.net/AntonyWilliams
Before we start….
• Context – why do we want to mine data?
• Our experiences in extracting theses:
– Text and data mining
– Chemistry as an example
– Before you start
– Resources and tools
Contents
• Let’s map together all historical chemistry
data and build systems to integrate
• Heck, let’s integrate chemistry and biology
data and add in disease data too
• Let’s model the data and see if we can
extract new relationships – quantitative and
qualitative
• Let’s make it all available on the web
Taking on a big challenge…
• We’re going to map the world
• We’re going to take photos of as many
places as we can and link them together
• We’ll let people annotate and curate the map
• Then let’s make it available free on the web
• We’ll make it available for decision making
• Put it on Mobile Devices, Give it Away
What about this….
I’m from here…on Google
Wikipedia
The Power of Contribution
How do you spell Afonwen?
And there’s Denbigh…
• So the world can be mapped…
• We can enter a 3D world within the map
• We can add annotations
• We can use the data, reference it, we can
extract it, we can make decisions with it
• And we can do it on our lap, in our hands
• Let’s do this for chemistry…
Whoa…
• Once upon a time we built a database….
In a basement not far away…
ChemSpider
ChemSpider and Data Validation
Dictionary Linking
• This is not new, you know the story…
• So much data of value contained within a
publication and delivered in a PDF form
• “PDF files, and especially unclear licensing,
don’t allow me at the data so I can rework,
reuse, repurpose, text mine etc.”
• “I specialize in XXXX. I want a database of
YYYY extracted from publications and made
available, for free, with capabilities I need,
and the publishers should just do it”
Data in a Scientific Publication
It is so difficult to navigate…
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?
• Manage “all” of the chemistry data associated
with chemical substances
• Data to be downloadable, reusable, interactive
• Build a platform that enables the scientist
• Data storage, validation, standardization and
curation
• Collaborative data sharing
• Provide data platform that can enable and
enhance publishing of scientific papers
We set a vision…
• Every compound from every article at RSC is
extracted, in a database, and linked
• Chemical properties are extracted, databased
and used for predictive models
• Data tables are downloadable, interactive and
not just “dumb-PDFs”
• …and what can we extract from chemistry
theses too?
XXX Years from Now at RSC
• We are seen as one of the repositories for
published AND unpublished research data
• An intuitive platform for research data
management in the cloud
• Individual, collaborative and public data
management of diverse data in the cloud
• …and where all data referenced in a
thesis is available at a button click
XXX Years from Now at RSC
• But how does it map onto your domain??
So this is chemistry…
Mining as an allegory
• You have a mountain of stuff which
contains valuable nuggets
• You (more or less) know what you’re
looking for
• You know what you’re going to do with it
once you have it
Mining as an allegory - intent
• You get lots of stuff out
• It requires sifting and grading
• It’s a triumph if you manage to extract
80-90% of what is there
• You will go back to the heap and redo it
Mining as an allegory - result
• That which is easy to get out - is well
known and unlikely to be novel
• The novel and interesting stuff is likely to
be rare and not easily defined
Mining as an allegory - effort
• Do the initial investigations by hand
• Send in the machines later
• Still needs some humans tweaking
Mining as allegory - automation
Context
• From Utopia Documents team
• Good at extracting structure from typeset PDFs
• http://pdfx.cs.man.ac.uk/
PDFX
OCR recognition
• Underlining doesn’t help OCR
• In this case it was the only signpost to the
department, supervisor and funding details
• Hardcopy
• Scanned and OCR’d PDF
• PDF derived from Word
• Word or LaTeX
• …and for OCR not all are born equal
• …and of course history and language are
major influences: “Oil of vitriol”
Building blocks to mine…
• Ontologies, taxonomies, dictionaries
• But these are very domain focussed…
• As an example, Open PHACTS spend a
lot of effort mapping biology to chemistry
to disease over many data sources
More building blocks
• Provide a controlled vocabulary – what
your data describes, where it came from
• Provide a shared vocabulary for
integrating with other people’s data
What can ontologies do for me?
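As a rough illustration of what a controlled vocabulary buys you, the sketch below maps synonyms found in running text onto one preferred term. The two-entry dictionary, the sample sentence and the function names are all made up for this example; a real project would load a curated dictionary or ontology rather than hard-code one.

```python
import re

# Hypothetical mini-dictionary: preferred term -> known synonyms.
# A real project would load this from a curated dictionary or ontology.
VOCABULARY = {
    "sulfuric acid": ["oil of vitriol", "sulphuric acid", "H2SO4"],
    "aspirin": ["acetylsalicylic acid", "2-acetoxybenzoic acid"],
}


def build_lookup(vocabulary):
    """Map every lower-cased synonym (and the preferred term itself) to the preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        for term in [preferred, *synonyms]:
            lookup[term.lower()] = preferred
    return lookup


def tag_terms(text, lookup):
    """Return (matched text, preferred term) pairs in order of appearance."""
    # Longest alternatives first so multi-word names win over shorter ones.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(map(re.escape, terms)) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [(m.group(1), lookup[m.group(1).lower()]) for m in pattern.finditer(text)]


LOOKUP = build_lookup(VOCABULARY)
sample = "The residue was treated with oil of vitriol, then aspirin was added."
hits = tag_terms(sample, LOOKUP)
```

Every hit carries both the wording the author used and the shared term, which is what makes integration with other people's data possible.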
Questions to ask:
(1) Has someone already produced an ontology
covering your area? (Places to look:
Bioportal, OBO Foundry.)
(2) Do they take requests?
(3) Are they responsive?
(4) Is the ontology kept up to date?
Early days for ontologies and any ontology will
almost certainly be a long way from complete!
Best practices: experiences
from biomedical ontologies
• Best that these don’t change
• Best that everyone calls them the same things
• Best that they are unambiguous
• Meanwhile, back in the real world
What things are you looking for?
• Place names – somewhat ambiguous
• Species names – can change with time
• Diseases – every pharmaceutical company
has a different list
• People – can be very ambiguous: Authors
and researchers are hard to map…except
for Google it seems!
How easy?
http://www.amazon.com/-/e/B004YRPRV2
http://orcid.org/0000-0002-2668-4821
Thankfully people follow…
Google Scholar Citations
http://scholar.google.com/citations?user=O2L8nh4AAAAJ
ORCID take up???
• All publications easily connected but also
– Important in early scientific career –
consider every data point contribution, every
“research object”
– Every article
– Every presentation
– Thesis and dissertation
– Provenance….and feeding AltMetrics
So the benefits of ORCIDs?
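One practical point when mining ORCIDs out of text: the last character of an ORCID iD is a check character (ISO 7064 MOD 11-2), so a mined identifier can be sanity-checked before anything is linked to it. The sketch below is a minimal validator; the function names are ours, and the second identifier is a deliberately mistyped version of the first.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if the string has the ORCID iD shape and its check character matches."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


valid = is_valid_orcid("0000-0002-2668-4821")   # the ORCID iD quoted in these slides
typo = is_valid_orcid("0000-0002-2668-4812")    # last two digits swapped on purpose
```

A passing checksum only says the string is well formed; it does not prove the iD belongs to the person named next to it.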
The Alt-Metrics Manifesto
AltMetrics via Plum Analytics
Usage, Citations, Social Media
Detailed Usage Statistics
Indexed and Searchable
ORCIDs for reputation…
Tinman - mutant fly embryos lack a heart.
Van Gogh - hair-like bristles on wings have a swirling pattern.
INDY - acronym for I'm Not Dead Yet, they live twice as long as normal; from the scene in
the movie "Monty Python and the Holy Grail"
Ken and Barbie - males and females lack external genitalia.
Tribbles - some cells divide uncontrollably
Cheap date - flies are extra-sensitive to alcohol.
Cleopatra - flies die when Cleopatra gene interacts with another gene, Asp.
Kojak - no hairs on wings.
Maggie - fly development is arrested; named after Maggie Simpson, whose development
also seems to be arrested.
Oh my… Fruit fly gene names
• http://stlists.blogspot.co.uk/2005/05/fruitfly-gene-names.html
• those that belong to the Emperor,
• embalmed ones,
• those that are trained,
• suckling pigs,
• mermaids,
• fabulous ones,
• stray dogs,
• those included in the present classification,
• those that tremble as if they were mad,
• innumerable ones,
• those drawn with a very fine camelhair brush,
• others,
• those that have just broken a flower vase,
• those that from a long way off look like flies.
Allegedly from “Celestial Emporium of Benevolent Knowledge”
The Analytical Language of John Wilkins, Jorge Luis Borges
Animal classification
• Are you just identifying entities?
• Are you looking for sentiment?
• In chemistry names will lead you to a
recipe for synthesis, and analytical data
about that compound
Classification after “things”
• Used to aid discovery - directly
• Used to aid discovery - indirectly
• Extract data in electronic form for reuse
• Needs to be use case driven – why, then
what/how comes later
End result
• Automation can give good results
• Especially when looked at in bulk
• Less easy to judge at the article level
• People accept discovery is fuzzy
• Not so with data points
• (but maybe can screen out)
Quality
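A minimal sketch of why bulk numbers and article-level numbers can tell different stories, using precision and recall. The two "articles", their gold-standard compound lists and the tool output are invented purely for illustration.

```python
def precision_recall(extracted, expected):
    """Precision and recall of extracted items against a gold-standard set."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical gold standard and tool output for two articles.
GOLD = {
    "article_1": {"aspirin", "benzene", "thionyl chloride"},
    "article_2": {"iloprost"},
}
FOUND = {
    "article_1": {"aspirin", "benzene", "thionyl chloride", "reflux"},
    "article_2": set(),
}

# Article-level view: one score pair per article.
per_article = {name: precision_recall(FOUND[name], GOLD[name]) for name in GOLD}


def pooled(items_by_article):
    """Flatten to (article, item) pairs so identical names in different articles stay distinct."""
    return {(name, item) for name, items in items_by_article.items() for item in items}


# Bulk view: everything scored together.
in_bulk = precision_recall(pooled(FOUND), pooled(GOLD))
```

In this toy case the bulk figures look respectable while the second article was missed entirely, which is the gap between "discovery is fuzzy" and "data points are not".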
• Chemical names are both difficult and rewarding.
• Difficult in the sense that they can break
standard software.
• Rewarding in the sense that you can extract
useful information about the molecule they’re
referring to without a dictionary.
• Some examples…
Chemistry-specific challenges
and opportunities
• …and it gets worse
A series of mono and di-N-2,3-epoxypropyl N-
phenylhydrazones have been prepared on a large scale
by reaction of the corresponding N-phenylhydrazones of
9-ethyl-3-carbazolecarbaldehyde, 9-ethyl-3,6-
carbazoledicarbaldehyde, 4-dimethyl-amino-, 4-
diethylamino-, 4-benzylethylamino-, 4-(diphenylamino)-,
4-(4,4-4′-dimethyl-diphenylamino)-, 4-(4-
formyldiphenylamino)- and 4-(4-formyl-4′-
methyldiphenyl-amino)benzaldehyde with
epichlorohydrin in the presence of KOH and anhydrous
Na(2)SO(4).
From Molecules, via the BioNLP list
Annotate this...
How many explicit compounds?
• How many numbered compounds
actually are named in a given
paper?
• iloprost (1)
• tributyl-1-hexynylstannane (2)
• the desired 2-heptyne (3)
• methyl–Pd(II) iodide 4 or 4′
• alkynylstannane 5
• the hypervalent stannate 6
• (alkynyl)(methyl)Pd(II) complex 7
• the desired methylalkyne 8
• compounds 9–14
• the stannyl precursors 15 and 16
• methylated compounds 17 and 18
• stannyl precursor 19
• iloprost methyl ester 20
• “iloprost methyl ester” is the real
name, but you need to know that
iloprost is a monocarboxylic acid!
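A rough sketch of pulling the compound labels out of mentions like those listed above, including ranges such as 9–14. The mentions are copied from the slide; the regular expression and function name are our own guess at a workable rule, and it only looks at labels trailing the mention.

```python
import re

# Mentions copied from the slide; each ends with one or more compound labels.
MENTIONS = [
    "iloprost (1)",
    "tributyl-1-hexynylstannane (2)",
    "the desired 2-heptyne (3)",
    "methyl–Pd(II) iodide 4 or 4′",
    "alkynylstannane 5",
    "the hypervalent stannate 6",
    "(alkynyl)(methyl)Pd(II) complex 7",
    "the desired methylalkyne 8",
    "compounds 9–14",
    "the stannyl precursors 15 and 16",
    "methylated compounds 17 and 18",
    "stannyl precursor 19",
    "iloprost methyl ester 20",
]

# Trailing labels: a number (optionally primed), optionally in parentheses,
# optionally continued by a range dash or by "and"/"or".
LABEL_TAIL = re.compile(r"\(?(\d+′?(?:\s*(?:[–-]|and|or)\s*\d+′?)*)\)?$")


def extract_labels(mention):
    """Return the compound labels at the end of a mention, expanding numeric ranges."""
    match = LABEL_TAIL.search(mention.strip())
    if not match:
        return []
    labels = []
    for token in re.split(r"\s+(?:and|or)\s+", match.group(1)):
        bounds = re.split(r"\s*[–-]\s*", token)
        if len(bounds) == 2 and all(b.isdigit() for b in bounds):
            labels.extend(str(n) for n in range(int(bounds[0]), int(bounds[1]) + 1))
        else:
            labels.append(token)
    return labels


all_labels = [label for mention in MENTIONS for label in extract_labels(mention)]
```

Locants inside names (the "1" in tributyl-1-hexynylstannane, the "2" in 2-heptyne) are skipped because only the tail of the mention is matched. Deciding which labels correspond to a real, explicit name still needs chemistry knowledge.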
Names from structures
• Systematic names can be generated FROM
chemical structures algorithmically
General-purpose parsers do
NOT get chemical names
Visualization by bpodgursky.com using d3.js; parsing by
Stanford’s CoreNLP.
But names can reverse back
to structures…
• OPSIN (chemical name to structure)
http://opsin.ch.cam.ac.uk/
Tools to try
Not all names are systematic..
Antony Williams vs Identifiers
Passport ID
Dad, Tony, others
SSN
Green Card
License
5 email addresses
ChemSpiderman (blog, Twitter account, Facebook, Friendfeed)
OpenID
….
Many Names, One Structure
Aspirin on ChemSpider
Unique Structure Identifiers
Structure Searching the Web
Certainly happens with Welsh!
• All of the tasks below are possible to varying extents.
Pioneered on journal abstracts and journal full text.
– Named entity recognition: what is this about? Where are
the places mentioned? Who are the people?
– Clustering and classification: which other dissertations
are like this one? What genres of dissertations are there?
– Event extraction: what processes (chemical reactions,
gene expression) occur? What are the participants?
– Citation analysis: who do dissertations cite?
– What sentiments towards the citations do authors express?
Dissertation analysis
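For the "which other dissertations are like this one?" question, a very small sketch using bag-of-words cosine similarity is shown below. The three one-line "abstracts" and their identifiers are invented for the example, and real work would use proper tokenization, stop words and weighting.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; a crude stand-in for real tokenization."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, corpus):
    """Return the id of the document closest to query_id, excluding itself."""
    vectors = {doc_id: bag_of_words(text) for doc_id, text in corpus.items()}
    others = (doc_id for doc_id in vectors if doc_id != query_id)
    return max(others, key=lambda doc_id: cosine_similarity(vectors[query_id], vectors[doc_id]))


# Hypothetical one-line abstracts standing in for full dissertations.
CORPUS = {
    "thesis_a": "synthesis of carbazole hydrazones and their NMR spectra",
    "thesis_b": "NMR spectra and synthesis of thiadiazole ureas",
    "thesis_c": "sentiment analysis of citations in humanities dissertations",
}

closest = most_similar("thesis_a", CORPUS)
```

The same vectors can feed a clustering step to look for genres of dissertations.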
• Dissertation copyright varies
• Institution
• Author
• Published or not?
Copyright issues
• Probably less structured than papers
• Not much work has been done here before
Dissertation specifics
• In addition to the above, there are different tasks
we can perform on scientific publications and
dissertations
• For example:
– Stylometrics (to find out who wrote this)
– Language identification
– What else?....
Digital Humanities textual
analysis tasks
• We would LOVE to bring data out of our
archive
• What could we do?
• Find chemical names and generate structures
• Find chemical images and generate structures
• Find reactions – and make a database!
• Find data (MP, BP, LogP) and host. Build
models!
• Find figures and database them
• Find spectra (and link to structures)
• Validate the data algorithmically
“Data enable” publications?
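As a sketch of the "find data (MP, BP, LogP)" step, the snippet below looks for melting points written in a couple of common ways. The pattern, the function name and the sample sentence are assumptions made for illustration; archive text will contain many more variants (and OCR damage) than this handles.

```python
import re

# "mp", "m.p." or "melting point", an optional ":" or "=", then a value
# or a range in degrees Celsius.
MP_PATTERN = re.compile(
    r"\b(?:m\.?p\.?|melting point)\s*[:=]?\s*"
    r"(\d+(?:\.\d+)?)"
    r"(?:\s*[–-]\s*(\d+(?:\.\d+)?))?"
    r"\s*°\s*C",
    re.IGNORECASE,
)


def find_melting_points(text):
    """Return (low, high) pairs in °C; a single value is reported as low == high."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results


# Hypothetical sentence, not taken from any real article.
SAMPLE = "White solid, mp 123–125 °C. A second batch had a melting point: 88 °C."
melting_points = find_melting_points(SAMPLE)
```

Each value still has to be tied back to the right compound before it is worth hosting or modelling.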
RSC Archive – since 1841
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride
( 5 ml ) and benzene ( 50 ml ) were charged into a glass
reaction vessel equipped with a mechanical stirrer ,
thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product N-(β-chloroethyl)-N-
methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a
solid residue
• 13C NMR (CDCl3, 100 MHz): δ = 14.12
(CH3), 30.11 (CH, benzylic methane),
30.77 (CH, benzylic methane), 66.12
(CH2), 68.49 (CH2), 117.72, 118.19,
120.29, 122.67, 123.37, 125.69, 125.84,
129.03, 130.00, 130.53 (ArCH), 99.42,
123.60, 134.69, 139.23, 147.21, 147.61,
149.41, 152.62, 154.88 (ArC)
Text spectra?
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me,
C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H,
Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd,
1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH),
7.18–7.94 (m, 11H, ArH)
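Peak listings like the 13C one above are regular enough to turn back into numbers. The sketch below is one possible parser, not the tooling used at RSC: it reads the nucleus, solvent and frequency from the header, drops the parenthesised assignments, and keeps the chemical shifts. Function and field names are our own.

```python
import re

# Peak listing copied from the slide above.
C13_TEXT = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, "
    "123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)

# Header such as "13C NMR (CDCl3, 100 MHz)".
HEADER = re.compile(r"(\d+[A-Z][a-z]?) NMR \(([^,]+),\s*(\d+(?:\.\d+)?) MHz\)")


def strip_parentheses(text):
    """Drop everything inside parentheses, including nested ones."""
    depth = 0
    kept = []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


def parse_peak_listing(text):
    """Parse a text peak listing into header fields and a list of shifts in ppm."""
    header = HEADER.search(text)
    if header is None or "δ" not in text:
        raise ValueError("not a recognisable NMR peak listing")
    body = text.split("δ", 1)[1]
    shifts = [float(s) for s in re.findall(r"\d+\.\d+", strip_parentheses(body))]
    return {
        "nucleus": header.group(1),
        "solvent": header.group(2).strip(),
        "frequency_mhz": float(header.group(3)),
        "shifts_ppm": shifts,
    }


spectrum = parse_peak_listing(C13_TEXT)
```

The 1H listing needs more care: its assignments carry multiplicities and coupling constants, and an entry such as 7.18–7.94 is a range rather than a single shift, which this sketch does not model.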
Turn “Figures” Into Data
Make it interactive
SO MANY reactions!
Reactions From Patents
Experimental data checker
• http://chemicaltagger.ch.cam.ac.uk/
Tools to try: ChemicalTagger
• ChemicalTagger
Tools to try
How is DERA going?
• We have text-mined all 21st century articles…
>100k articles from 2000-2013
• Marked up with XML and published onto the
HTML forms of the articles
• Required multiple iterations based on
dictionaries, markup, text mining iterations
• New visualization tools in development – not
just chemical names. Add chemical and
biomedical terms markup also!
Work in Progress
• Dictionary (ontologies)
• RSC ontologies (methods, reactions)
• Dictionary (chemistry)
• Text-mining
• Curated dictionaries for known names
• ACD N2S
• OPSIN
• Unknown names: automated name to structure conversion
• XML ready for publication
• Marked-up XML
• Production processes
• CDX integration (coming soon)
• Chemical structures SD file
Is It Easy?
Our Supporting Ontologies
• The ‘National Compound Collection’
• Extracting compounds manually from theses
• 700 theses, 44,000 compounds (growing…)
• 4 months, 12 UK institutions
• Deposited into ChemSpider
A pilot examining theses
• Screening for interesting drug candidates
• Mapping the chain from author to institution
to data to industry
• British Library involved (EThOS collection)
• Build a business model for this
Pilot objectives
• Funders encouraging submission from new
dissertations
• Mining of old collections (mostly automated,
likely to need manual QA)
• Extension to other areas of chemical science
…and future (ideal)
• Don’t reinvent the wheel
• Research your domain to find work already
underway and test tools for value/utility
In your domain???
Most Domains are Active
A good place to start
• NaCTeM tools for e.g. sentiment analysis
http://www.nactem.ac.uk/opminpackage/opinion_analysis
Tools to try
There is always something new
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
Thank you

The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 

Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

  • 1. Data mining dissertations Adventures and Experiences in the World of Chemistry Antony Williams CLIR/DLF Postdoctoral Fellowship Summer Seminar, July 2014
  • 2. What a small world…
  • 3. • Who’s got an ORCID? • Who has heard of, or been involved with, AltMetrics? • Who has edited a Wikipedia page? • Who has direct experience of text mining? • All slides already on Slideshare here: • www.slideshare.net/AntonyWilliams Before we start….
  • 4. • Context – why do we want to mine data? • Our experiences in extracting theses: – Text and data mining – Chemistry as an example – Before you start – Resources and tools Contents
  • 5. • Let’s map together all historical chemistry data and build systems to integrate • Heck, let’s integrate chemistry and biology data and add in disease data too • Let’s model the data and see if we can extract new relationships – quantitative and qualitative • Let’s make it all available on the web Taking on a big challenge…
  • 6.
  • 7. • We’re going to map the world • We’re going to take photos of as many places as we can and link them together • We’ll let people annotate and curate the map • Then let’s make it available free on the web • We’ll make it available for decision making • Put it on Mobile Devices, Give it Away What about this….
  • 11.
  • 12.
  • 13.
  • 14. The Power of Contribution
  • 15. How do you spell Afonwen?
  • 17. • So the world can be mapped… • We can enter a 3D world within the map • We can add annotations • We can use the data, reference it, we can extract it, we can make decisions with it • And we can do it on our lap, in our hands • Let’s do this for chemistry… Whoa…
  • 18. • Once upon a time we built a database…. In a basement not far away…
  • 20. ChemSpider and Data Validation
  • 23. • This is not new, you know the story… • So much data of value contained within a publication and delivered in a PDF form • “PDF files, and especially unclear licensing, don’t allow me at the data so I can rework, reuse, repurpose, text mine etc.” • “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with capabilities I need, and the publishers should just do it” Data in a Scientific Publication
  • 24. It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
  • 25. • Manage “all” of the chemistry data associated with chemical substances • Data to be downloadable, reusable, interactive • Build a platform that enables the scientist • Data storage, validation, standardization and curation • Collaborative data sharing • Provide data platform that can enable and enhance publishing of scientific papers We set a vision…
  • 26. • Every compound from every article at RSC is extracted, in a database, and linked • Chemical properties are extracted, databased and used for predictive models • Data tables are downloadable, interactive and not just “dumb-PDFs” • …and what can we extract from chemistry theses too? XXX Years from Now at RSC
  • 27. • We are seen as one of the repositories for published AND unpublished research data • An intuitive platform for research data management in the cloud • Individual, collaborative and public data management of diverse data in the cloud • …and where all data referenced in a thesis is available at a button click XXX Years from Now at RSC
  • 28. • But how does it map onto your domain?? So this is chemistry…
  • 29. Mining as an allegory
  • 30. • You have a mountain of stuff which contains valuable nuggets • You (more or less) know what you’re looking for • You know what you’re going to do with it once you have it Mining as an allegory - intent
  • 31. • You get lots of stuff out • It requires sifting and grading • It’s a triumph if you manage to extract 80-90% of what is there • You will go back to the heap and redo it Mining as an allegory - result
  • 32. • That which is easy to get out - is well known and unlikely to be novel • The novel and interesting stuff is likely to be rare and not easily defined Mining as an allegory - effort
  • 33. • Do the initial investigations by hand • Send in the machines later • Still needs some humans tweaking Mining as allegory - automation
  • 35. • From Utopia Documents team • Good at extracting structure from typeset pdfs • http://pdfx.cs.man.ac.uk/ PDFX
  • 36. OCR recognition • Underlining doesn’t help OCR • In this case it was the only signpost to the department, supervisor and funding details
  • 37. • Hardcopy • Scanned and OCR’d PDF • PDF derived from Word • Word or LaTeX • …and for OCR not all are born equal • …and of course history and language are a major influence. “Oil of vitriol” Building blocks to mine…
  • 38. • Ontologies, taxonomies, dictionaries • But these are very domain focussed… • As an example, Open PHACTS spend a lot of effort mapping biology to chemistry to disease over many data sources More building blocks
  • 39.
  • 40. • Provide a controlled vocabulary – what your data describes, where it came from • Provide a shared vocabulary for integrating with other people’s data What can ontologies do for me?
  • 41. Questions to ask: (1) Has someone already produced an ontology covering your area? (Places to look: BioPortal, OBO Foundry.) (2) Do they take requests? (3) Are they responsive? (4) Is the ontology kept up to date? Early days for ontologies and any ontology will almost certainly be a long way from complete! Best practices: experiences from biomedical ontologies
  • 42. • Best that these don’t change • Best that everyone calls them the same things • Best that they are unambiguous • Meanwhile, back in the real world What things are you looking for?
  • 43. • Place names – somewhat ambiguous • Species names – can change with time • Diseases – every pharmaceutical company has a different list • People – can be very ambiguous: Authors and researchers are hard to map…except for Google it seems! How easy?
  • 49. • All publications easily connected but also – Important in early scientific career – consider every data point contribution, every “research object” – Every article – Every presentation – Thesis and dissertation – Provenance….and feeding AltMetrics So the benefits of ORCIDs?
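
As an aside to the ORCID slide, one practical reason identifiers like this are easy to mine is that they are self-checking. The sketch below is not from the slides: it assumes the ISO 7064 MOD 11-2 check character that ORCID documents for its iDs, and the function names are illustrative only.

```python
def orcid_check_character(base15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if a hyphenated or compact ORCID iD has a correct check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


# The iD shown on the closing slide of this deck passes the check.
print(is_valid_orcid("0000-0002-2668-4821"))
```

A check like this only catches transcription or OCR errors; it says nothing about whether the iD belongs to the author named in a thesis.
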
  • 51.
  • 52. AltMetrics via Plum Analytics
  • 57. Tinman - mutant fly embryos lack a heart. Van Gogh - hair-like bristles on wings have a swirling pattern. INDY - acronym for I'm Not Dead Yet, they live twice as long as normal; from the scene in the movie "Monty Python and the Holy Grail". Ken and Barbie - males and females lack external genitalia. Tribbles - some cells divide uncontrollably. Cheap date - flies are extra-sensitive to alcohol. Cleopatra - flies die when Cleopatra gene interacts with another gene, Asp. Kojak - no hairs on wings. Maggie - fly development is arrested; named after Maggie Simpson, whose development also seems to be arrested. Oh my... Fruitfly gene names • http://stlists.blogspot.co.uk/2005/05/fruitfly-gene-names.html
  • 58.
  • 59. • those that belong to the Emperor, • embalmed ones, • those that are trained, • suckling pigs, • mermaids, • fabulous ones, • stray dogs, • those included in the present classification, • those that tremble as if they were mad, • innumerable ones, • those drawn with a very fine camelhair brush, • others, • those that have just broken a flower vase, • those that from a long way off look like flies. Allegedly from “Celestial Emporium of Benevolent Knowledge” The Analytical Language of John Wilkins, Jorge Luis Borges Animal classification
  • 60. • Are you just identifying entities? • Are you looking for sentiment? • In chemistry names will lead you to a recipe for synthesis, and analytical data about that compound Classification after “things”
  • 61. • Used to aid discovery - directly • Used to aid discovery - indirectly • Extract data in electronic form for reuse • Needs to be use case driven – why, then what/how comes later End result
  • 62. • Automation can give good results • Especially when looked at in bulk • Less easy to judge at the article level • People accept discovery is fuzzy • Not so with data points • (but maybe can screen out) Quality
  • 63. • Chemical names are both difficult and rewarding. • Difficult in the sense that they can break standard software. • Rewarding in the sense that you can extract useful information about the molecule they’re referring to without a dictionary. • Some examples… Chemistry-specific challenges and opportunities
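
To make the "without a dictionary" point concrete, here is a deliberately naive sketch (not the speaker's method) that guesses a compound class from the ending of a name. The suffix table and the function name are assumptions chosen for illustration; real nomenclature parsing, as done by tools such as OPSIN, is far more involved.

```python
# Hypothetical suffix table: a toy heuristic, not a real nomenclature grammar.
SUFFIX_CLASSES = {
    "oic acid": "carboxylic acid",
    "aldehyde": "aldehyde",
    "stannane": "organostannane",
    "amine": "amine",
    "yne": "alkyne",
    "one": "ketone",
    "ol": "alcohol",
}


def classify_by_suffix(name: str):
    """Return a guessed compound class from the name's ending, or None."""
    lowered = name.strip().lower()
    # Try longer suffixes first so "oic acid" wins over shorter endings.
    for suffix in sorted(SUFFIX_CLASSES, key=len, reverse=True):
        if lowered.endswith(suffix):
            return SUFFIX_CLASSES[suffix]
    return None


for name in ["2-heptyne", "tributyl-1-hexynylstannane", "benzaldehyde", "iloprost"]:
    print(name, "->", classify_by_suffix(name))
```

Trivial names such as iloprost carry no such signal, which is where curated dictionaries come back in.
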
  • 64.
  • 65.
  • 66.
  • 67.
  • 68.
  • 69. • …and it gets worse
  • 70. A series of mono and di-N-2,3-epoxypropyl N-phenylhydrazones have been prepared on a large scale by reaction of the corresponding N-phenylhydrazones of 9-ethyl-3-carbazolecarbaldehyde, 9-ethyl-3,6-carbazoledicarbaldehyde, 4-dimethyl-amino-, 4-diethylamino-, 4-benzylethylamino-, 4-(diphenylamino)-, 4-(4,4-4′-dimethyl-diphenylamino)-, 4-(4-formyldiphenylamino)- and 4-(4-formyl-4′-methyldiphenyl-amino)benzaldehyde with epichlorohydrin in the presence of KOH and anhydrous Na(2)SO(4). From Molecules, via the BioNLP list Annotate this...
  • 71. How many explicit compounds? • How many numbered compounds actually are named in a given paper? • iloprost (1) • tributyl-1-hexynylstannane (2) • the desired 2-heptyne (3) • methyl–Pd(II) iodide 4 or 4′ • alkynylstannane 5 • the hypervalent stannate 6 • (alkynyl)(methyl)Pd(II) complex 7 • the desired methylalkyne 8 • compounds 9–14 • the stannyl precursors 15 and 16 • methylated compounds 17 and 18 • stannyl precursor 19 • iloprost methyl ester 20 • “iloprost methyl ester” is the real name, but you need to know that iloprost is a monocarboxylic acid!
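
The counting question on this slide can be sketched in a few lines. The example below is an illustration only, under the assumption that a compound label is a trailing number (optionally primed, ranged, or joined by "and"/"or") at the end of each mention; the regular expression and function name are hypothetical, and the mentions are the ones listed on the slide.

```python
import re

# Trailing label expression such as "(2)", "4 or 4′", "9–14", "15 and 16".
LABEL_TAIL = re.compile(r"\(?(\d+′?(?:\s*(?:–|-|and|or|,)\s*\d+′?)*)\)?$")


def expand_labels(mention: str) -> list:
    """Return the individual compound labels referenced at the end of a mention."""
    match = LABEL_TAIL.search(mention.strip())
    if not match:
        return []
    labels = []
    for part in re.split(r"\s*(?:and|or|,)\s*", match.group(1)):
        span = re.fullmatch(r"(\d+)\s*[–-]\s*(\d+)", part)
        if span:
            # Expand a range such as 9–14 into its members.
            start, end = int(span.group(1)), int(span.group(2))
            labels.extend(str(n) for n in range(start, end + 1))
        elif part:
            labels.append(part)
    return labels


mentions = [
    "iloprost (1)", "tributyl-1-hexynylstannane (2)", "the desired 2-heptyne (3)",
    "methyl–Pd(II) iodide 4 or 4′", "alkynylstannane 5", "the hypervalent stannate 6",
    "(alkynyl)(methyl)Pd(II) complex 7", "the desired methylalkyne 8", "compounds 9–14",
    "the stannyl precursors 15 and 16", "methylated compounds 17 and 18",
    "stannyl precursor 19", "iloprost methyl ester 20",
]
all_labels = [label for m in mentions for label in expand_labels(m)]
print(len(all_labels), "labelled compounds from", len(mentions), "mentions")
```

Note how locants inside names, such as the "2" in 2-heptyne, are skipped because only the tail of the mention is inspected; a name that genuinely ends in a digit would fool this sketch.
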
  • 72.
  • 73. Names from structures • Systematic names can be generated FROM chemical structures algorithmically
  • 74. General-purpose parsers do NOT get chemical names Visualization by bpodgursky.com using d3.js; parsing by Stanford’s CoreNLP.
  • 75. But names can reverse back to structures…
  • 76. • OPSIN (chemical name to structure) http://opsin.ch.cam.ac.uk/ Tools to try
  • 77. Not all names are systematic.. Antony Williams vs Identifiers Passport ID Dad, Tony, others SSN Green Card License 5 email addresses ChemSpiderman (blog, Twitter account, Facebook, Friendfeed) OpenID ….
  • 78. Many Names, One Structure
  • 83. • All of the tasks below are possible to varying extents. Pioneered on journal abstracts and journal full text. – Named entity recognition: what is this about? Where are the places mentioned? Who are the people? – Clustering and classification: which other dissertations are like this one? What genres of dissertations are there? – Event extraction: what processes (chemical reactions, gene expression) occur? What are the participants? – Citation analysis: who do dissertations cite? – What sentiments towards the citations do authors express? Dissertation analysis
  • 84. • Dissertation copyright varies • Institution • Author • Published or not? Copyright issues
  • 85. • Probably less structured than papers • Not much work has been done here before Dissertation specifics
  • 86. • In addition to the above, there are different tasks we can perform on scientific publications and dissertations • For example: • Stylometrics (to find out who wrote this) • Language identification • What else?.... Digital Humanities textual analysis tasks
  • 87. • We would LOVE to bring data out of our archive • What could we do? • Find chemical names and generate structures • Find chemical images and generate structures • Find reactions – and make a database! • Find data (MP, BP, LogP) and host. Build models! • Find figures and database them • Find spectra (and link to structures) • Validate the data algorithmically “Data enable” publications?
  • 88. RSC Archive – since 1841
  • 89. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  • 90. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
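
As a rough illustration of what mining such an experimental paragraph involves, the sketch below pulls "reagent ( amount unit )" pairs out of the tokenized sentence above. It is an assumption-laden toy, not how ChemicalTagger or the RSC pipeline works: the pattern only handles plain-word reagent names and a short, made-up list of units.

```python
import re

# Plain-word name followed by a parenthesised quantity, e.g. "benzene ( 50 ml )".
QUANTITY = re.compile(
    r"([A-Za-z][A-Za-z ]*?)\s*\(\s*(\d+(?:\.\d+)?)\s*(ml|mL|g|mg|mmol)\s*\)"
)


def extract_reagents(text: str) -> list:
    """Return (name, amount, unit) tuples for simple parenthesised quantities."""
    found = []
    for name, amount, unit in QUANTITY.findall(text):
        # Drop a leading conjunction picked up from the running sentence.
        name = re.sub(r"^(?:and|or|with)\s+", "", name.strip())
        found.append((name, float(amount), unit))
    return found


sample = (
    "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged "
    "into a glass reaction vessel"
)
print(extract_reagents(sample))
```

Systematic names with locants, brackets, and Greek letters, like the urea in the same paragraph, are exactly what this simple pattern cannot capture, which is the motivation for chemistry-aware taggers.
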
  • 91. • 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC) Text spectra?
  • 92. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
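
Text spectra like the two above are regular enough to be parsed back into data. The following is a minimal sketch, assuming peaks are written as "shift (multiplicity, nH, ...)" with an optional shift range; the pattern and field names are illustrative and will not cover every reporting style found in theses.

```python
import re

# Shift (optionally a range), then "(multiplicity, nH" as in "4.24 (d, 1H, ...".
PEAK = re.compile(r"(\d+\.\d+)(?:[–-](\d+\.\d+))?\s*\((\w+),\s*(\d+)H")


def parse_1h_nmr(text: str) -> list:
    """Return one dict per peak with shift, optional range end, multiplicity, protons."""
    peaks = []
    for low, high, multiplicity, protons in PEAK.findall(text):
        peaks.append({
            "shift": float(low),
            "shift_end": float(high) if high else None,
            "multiplicity": multiplicity,
            "protons": int(protons),
        })
    return peaks


spectrum = (
    "δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), "
    "4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), "
    "4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), "
    "7.18–7.94 (m, 11H, ArH)"
)
peaks = parse_1h_nmr(spectrum)
print(len(peaks), "peaks,", sum(p["protons"] for p in peaks), "protons")
```

Summing the proton counts gives a quick consistency check against the molecular formula of the structure the spectrum is linked to, one simple form of validating the data algorithmically.
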
  • 99. Tools to try: ChemicalTagger
  • 101. How is DERA going? • We have text-mined all 21st century articles… >100k articles from 2000-2013 • Marked up with XML and published onto the HTML forms of the articles • Required multiple iterations based on dictionaries, markup, text mining iterations • New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
  • 106. Dictionary (ontologies) RSC ontologies (methods, reactions) Dictionary (chemistry) Text-mining Curated dictionaries for known names ACD N2S OPSIN Unknown names: automated name to structure conversion XML ready for publication Marked-up XML Production processes CDX integration (coming soon) Chemical structures SD file Is It Easy?
  • 108. • The ‘National Compound Collection’ • Extracting compounds manually from theses • 700 theses, 44,000 compounds (growing…) • 4 months, 12 UK institutions • Deposited into ChemSpider A pilot examining theses
  • 109. • Screening for interesting drug candidates • Mapping the chain from author to institution to data to industry • British Library involved (EThOS collection) • Build a business model for this Pilot objectives
  • 110. • Funders encouraging submission from new dissertations • Mining of old collections (mostly automated, likely to need manual QA) • Extension to other areas of chemical science …and future (ideal)
  • 111. • Don’t reinvent the wheel • Research your domain to find work already underway and test tools for value/utility In your domain??? Most Domains are Active
  • 112. A good place to start
  • 113. • NaCTeM tools, e.g. for sentiment analysis http://www.nactem.ac.uk/opminpackage/opinion_analysis Tools to try
  • 114. • NaCTeM tools, e.g. for sentiment analysis Tools to try
  • 115. There is always something new
  • 116. Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams Thank you

Editor's notes

  1. Can easily spot titles, authors, abstract, headings
  2. In this case, the OCR used by the scanning organisation couldn’t cope with underlined titles, and the context pointers were lost
  3. This should be a single node, a noun. But because of the punctuation in the chemical name, the Stanford parser, which is very good but trained on the Wall Street Journal, has interpreted it as a huge baroque sentence.
  4. Not sure who the audience for this will be but worth mentioning if there’s a humanities audience.