Can machines understand the Scientific
Literature?

Peter Murray-Rust
University of Cambridge
Open Knowledge Foundation
Vilnius University, 2014-01-24, LT
Themes
• Collaboration with COD/IBT
• The Semantic Web
• The power and need for Open
• Multidisciplinarity
• “Artificial Intelligence / Google for Science”

• Open, volunteer-based communities
OpenStreetMap

Built by 1 million volunteers; no central
funding
History of OSM mapping Vilnius 2009-10

Users donate GPS traces
From Saulius Grazulis
The Semantic Web
"The Semantic Web is an extension of the
current web in which information is given well-defined meaning, better enabling computers
and people to work in cooperation."
Tim Berners-Lee, James Hendler, Ora Lassila, The
Semantic Web, Scientific American, May 2001

CC-BY-SA Images from Wikipedia
Artificial Intelligence in science
In 1970 chess and chemistry were the sandboxes for AI. Some
approaches:
• Lookup (Knowledge)
• Natural Language Processing (NLP)
• Brute force calculation (inc. physical methods)
• Tree-pruning and heuristics
• Logic (cf. OWL-DL)
• Human-machine integration (crowdsourcing)
• Computer Vision
Domain-specific Turing test: Can a machine pass a first-year
chemistry exam?
The scientist’s amanuensis
• "The bane of my life is doing things I know computers could do
for me" (Dan Connolly, W3C)
Example: A semantic amanuensis could
• Give me a daily digest of mineralogy papers
• Extract all the crystal structures from them
• Compute physical properties with GULP and NWChem
• Compare the results statistically
• Preserve and distribute the complete operation
• Prepare the results for publication

The semantic web is like having a personal amanuensis
Linked Open Data – the world’s knowledge
[LOD cloud diagram. Region labels: RDF triples; Music, Social, Art, Literature; Knowledge bases; DBPedia; Lib; GOV.uk; Comp; PDB; GOV; Ontologies; BIO]
very little physical science 
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
Part of a COD RDF entry

The Semantic Web understands this
Linked Open data from Wikipedia

“Which Rivers flow into the Rhine and are longer
than 50 kilometers?” or “Which Skyscrapers
in China have more than 50 floors and have
been constructed before the year 2000?”
Open Crystallography?
“Which countries where tropical diseases are
endemic have published structures of chiral
natural products?”
CC-BY-SA from Wikipedia
MathML

Mathematics Markup Language
Energy of c.c.p lattice of argon

Automatic!

Human-friendly

4 pages clipped

Many editors and tools exist
We used MathWeaver

Machine-friendly
CML (Chemical Markup Language)

Automatic!

Human-friendly

Machine-friendly
Innovation with Componentisation

Individual, manual,
unreusable, flaky

Commodity, standard,
reliable, re-usable
Current scientific information flow
… is broken for data-rich science
Non-semantic
data

PDF

Lineprinter output

Human input
Text files

Data extraction
difficult and
incomplete

Human
readers
Semantic network closes the loop
Measurement

Computation

Semantic
Authoring

Analysis

Community

Data available for
e-science and reuse

Data mined from
document
The network grows autonomously

Human-machine

Human-human

Machine-human

Machine-machine
Humans and machines use different
languages
How a machine reads a chemical thesis

nodes are compounds; arrows are reactions
We can’t turn a hamburger into a cow

But we can now
turn PDFs into
Science
Chemical Computer Vision

Raw Mobile photo; problems:
Shadows, contrast, noise, skew, clipping
Binarization (pixels = 0,1)

Irregular edges
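A minimal sketch of the binarization step, assuming a fixed global threshold (the slide does not say which thresholding method was used). The function name `binarize`, the threshold of 128 and the sample patch are illustrative assumptions.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 1 (ink) if darker than threshold, else 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# Hypothetical 3x4 grayscale patch: a dark vertical stroke on a light background,
# with one mid-grey pixel (120) next to the stroke.
patch = [
    [250, 30, 240, 245],
    [248, 25, 120, 250],
    [251, 35, 238, 244],
]
binary = binarize(patch)
for row in binary:
    print(row)
```

The mid-grey pixel falls just under the threshold and becomes ink, which is one way the irregular edges noted on the slide can arise.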
Hough transform for lines

Finds orientation and position (not extent)
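The voting idea behind the Hough transform can be sketched as follows. This is a simplified illustration, not the talk's implementation: the function name `hough_lines`, the 1-degree angular step, the 1-pixel rho bins and the sample pixels are all assumptions.

```python
import math


def hough_lines(points, n_theta=180):
    """Vote in (theta, rho) space; return the strongest (theta_degrees, rho, votes)."""
    votes = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            # Normal form of a line: rho = x*cos(theta) + y*sin(theta)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(t, rho)] = votes.get((t, rho), 0) + 1
    (t, rho), count = max(votes.items(), key=lambda kv: kv[1])
    return 180.0 * t / n_theta, rho, count


# Hypothetical foreground pixels: a horizontal line y = 5 plus two noise pixels.
pixels = [(x, 5) for x in range(100)] + [(3, 11), (14, 1)]
theta_deg, rho, count = hough_lines(pixels)
print(theta_deg, rho, count)
```

The result is only an angle and a distance from the origin. Nothing in the accumulator records where the line starts or stops, which matches the slide's remark that the transform finds orientation and position but not extent.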
Canny edge detection
Thinning: thick lines to 1-pixel
Chemical Optical Character Recognition

Small alphabet, clean typefaces, clear boundaries make
this relatively tractable. Problems are “I” “O” etc.
TITLES

DATA!!
2000+ points
UNITS
TICKS
SCALE
QUANTITY
Dumb PDF

Automatic
extraction

CSV
Smoothing
Gaussian Filter

2nd Derivative

Semantic
Spectrum
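A rough sketch of how Gaussian smoothing followed by a second derivative can locate a peak in extracted spectrum data. The kernel width, the finite-difference scheme, the edge handling and the synthetic signal are assumptions for illustration; the slides do not give these details.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Convolve with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values):
    """Central finite difference; the two endpoints are left at 0.0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2.0 * values[i] + values[i + 1]
    return d2


# Hypothetical spectrum: one peak centred at index 20 plus a small deterministic ripple.
signal = [math.exp(-((i - 20) ** 2) / 18.0) + 0.01 * math.sin(2.5 * i) for i in range(41)]
d2 = second_derivative(smooth(signal))
# A peak shows up as the most negative second derivative.
peak_index = min(range(len(d2)), key=lambda i: d2[i])
print(peak_index)
```

Smoothing first matters because differentiation amplifies high-frequency noise; without it the ripple would compete with the real peak.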
PROPERTIES (Name-Value-Units-Error)
[Figure: example property statements with Name (N), Value (V), Units (U) and Error (E) marked]
Note CML supports value ranges and errors
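The Name-Value-Units-Error pattern could be captured along these lines. This is a hedged sketch: the `Property` class, the regular expression and the input strings are hypothetical, and it does not reproduce CML or handle the value ranges that the slide says CML supports.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Property:
    name: str
    value: float
    units: str
    error: Optional[float] = None


# Assumed surface form: "<name> <value> [± <error>] <units>"
PATTERN = re.compile(
    r"^(?P<name>.+?)\s+(?P<value>-?\d+(?:\.\d+)?)"
    r"(?:\s*±\s*(?P<error>\d+(?:\.\d+)?))?\s+(?P<units>\S+)$"
)


def parse_property(text):
    """Return a Property, or None when the text does not fit the assumed form."""
    m = PATTERN.match(text.strip())
    if m is None:
        return None
    error = float(m.group("error")) if m.group("error") else None
    return Property(m.group("name"), float(m.group("value")), m.group("units"), error)


# Hypothetical inputs, not taken from any paper.
mp = parse_property("melting point 123.5 ± 0.5 °C")
bl = parse_property("bond length 2.25 Å")
print(mp)
print(bl)
```

Keeping the error as an optional field means a value reported without an uncertainty is still a valid record rather than a parse failure.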
“nuggets” in a scientific paper
places

project
Value ranges

quantity
units
chemical
Humans aren’t designed to mine this … 
Natural Language Processing

Part of speech tagging (Wordnet, Brown Corpus, etc.)
Chemical NLP components
Parsing chemical sentences
http://wwmm.ch.cam.ac.uk/chemicaltagger

Typical chemical synthesis
Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.
Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
Mathematics

CML is being integrated with
computable (content) MathML
PDF 

AMI
HTML 
Evolution of ultraviolet
vision in the largest avian
radiation - the passerines
Anders Ödeen 1* , Olle
Håstad 2,3 and Per Alström 4

Styles, superscripts
and diåcritics
preserved!
PDF 
Turdus iliacus
Taeniopygia guttata
Serinus canaria
Lanius excubitor
Melopsittacus undulatus
Pavo cristatus
Sturnus vulgaris
Dolichonyx oryzivorus
Ficedula hypoleuca
Vaccinium myrtillus
Falco tinnunculus

Turdus
Pomatostomus
Leothrix
Amytornis
Acanthisitta
Orthonyx x 2
Malurus
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhampus x 2
Typical phylo tree: 60 nodes, complex and minuscule annotation,
vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
AMI
0.84
0.91
0.93
0.95
Posterior
probability

23.12
34.54
37.21
38.55

NexML
HTML

AMI can MEASURE
Branch lengths!

Acanthisitta
Acrocephalus
Ailuroedus
Ailuroedus
Amytornis
Camptostoma

Acanthisittidae
Acanthizidae
Acrocephalidae
Callaeidae
Campephagidae
Cnemophilidae
Corvidae

Genus

Family
Supplemental
Information (CIFs)
harvested
from Publications

ACS
IUCr

RSC

ELS
As-Cl Bond lengths

Short

Long
Long
Short
Link to Journal
The Blue Obelisk – Open Chemistry
•Open Data, Open Standards, Open Source
•consistent and complementary
•non-divisive and fun
•CDK
•JChempaint
•Jmol
•JOELib
•JUMBO
•NMRShiftDB
•Octet
•Openbabel
•QSAR
•WWMM
•JSpecView
•http://www.blueobelisk.org
Blue Obelisk 2005-03-13
Recommendations for Open
Crystallography
• Require Open Crystal Data for all publications
• Deposition of Open Data in COD
• Integrate CIF dictionaries as RDF into Linked
Open Data
• Integrate COD into Linked Open Data Cloud
• CCDC/ICSD to publish RAW author CIFs Openly
• http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
CIFDIC

COD
The network grows autonomously

Human-machine

Human-human

Machine-human

Machine-machine
Tim Berners-Lee’s Open data
http://5stardata.info
★ make your stuff available on the Web (whatever format) under an OPEN license
★★ make it available as structured data (i.e. NOT PDF)
★★★ use non-proprietary formats (e.g., CSV)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context
Resources rated on the slide: CIFDIC, ACS, IUCr, CRYSTALEYE
• statement: "Nitrazepam" "target" "Gamma-aminobutyric-acid receptor subunit alpha-1"
• triple:
  drugbank:DB01595             // compound
  drugbank_vocabulary:target   // predicate
  drugbank:872 .               // target
“Compound hasTarget Target”
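As an illustration of the triple above, the prefixed names can be expanded to full URIs and written out as one N-Triples-style line. The namespace URIs below are placeholders invented for this sketch, not the real DrugBank namespaces, and `expand` and `to_ntriple` are hypothetical helper names.

```python
# Placeholder namespace URIs, chosen only for illustration (assumption).
PREFIXES = {
    "drugbank": "http://example.org/drugbank/",
    "drugbank_vocabulary": "http://example.org/drugbank_vocabulary/",
}


def expand(curie):
    """Turn 'prefix:local' into a full URI using PREFIXES."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def to_ntriple(subject, predicate, obj):
    """Serialise one subject-predicate-object triple as a single line."""
    return "<{}> <{}> <{}> .".format(expand(subject), expand(predicate), expand(obj))


# Compound hasTarget Target, using the identifiers from the slide.
triple = ("drugbank:DB01595", "drugbank_vocabulary:target", "drugbank:872")
line = to_ntriple(*triple)
print(line)
```

Because every part of the statement is a URI, a machine can follow each one to further data, which is the point of the fourth and fifth stars.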
Some statistics
• 3,000,000 scholarly publications/year => ca 10,000 / day
• ~ $1000 APC / pub, typesetting $10 per page
• Subscriptions ~ $10,000,000,000 / year
• 20% ?? of current pubs Open or accessible
• Article ~ 1 MByte, 15 pp (w/o data, images)
• Download and processing ca 1 sec/page
• arXiv $7 per article
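A back-of-envelope calculation from the figures above suggests how modest the computing load of mining the whole literature would be. The inputs are the slide's own round numbers; the variable names and the assumption of continuous single-threaded processing are mine.

```python
PUBS_PER_YEAR = 3_000_000   # from the slide
PAGES_PER_ARTICLE = 15      # from the slide
MBYTES_PER_ARTICLE = 1      # from the slide
SECONDS_PER_PAGE = 1        # download and processing, from the slide

pubs_per_day = PUBS_PER_YEAR / 365
cpu_hours_per_day = pubs_per_day * PAGES_PER_ARTICLE * SECONDS_PER_PAGE / 3600
gbytes_per_day = pubs_per_day * MBYTES_PER_ARTICLE / 1000
# Assumption: one machine processes pages one at a time, around the clock.
machines_needed = cpu_hours_per_day / 24

print(round(pubs_per_day), round(cpu_hours_per_day, 1),
      round(gbytes_per_day, 1), round(machines_needed, 2))
```

On these numbers the daily output is a little over 8,000 articles, which the slide rounds up to 10,000, and fewer than two machines running continuously could keep pace.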
RCUK, Wellcome, ERC, NSF … require fully OPEN

According to EU Vice-President Neelie Kroes, speaking at Research Data Alliance, we are entering a new “era of open science”, which will be “good
for citizens, good for scientists and good for society”.
She explicitly highlighted the transformative potential of open access, open data, open
software and open educational resources – mentioning the EU’s policy requiring open access
to all publications and data resulting from EU funded research.
http://blog.okfn.org/2013/03/21/we-are-entering-an-era-of-open-science-says-eu-vp-neeliekroes/#sthash.3SWDXDE6.dpuf
Open Definition
• “A piece of data or content is open if anyone is
free to use, reuse, and redistribute it —
subject only, at most, to the requirement to
attribute and/or share-alike.”
OPEN                    NOT OPEN
PDB
COD, Crystaleye         CCDC, ICSD
RSC/ACS/IUCr CIFs       Elsevier/Wiley/Springer CIFs
Acta Cryst E            Acta Cryst ABCD (default)
CIF dictionaries
Panton Principles for Open Data in Science
Why? Wanted to avoid the mess in OA
• Peter Murray-Rust, Cameron
Neylon, Rufus Pollock, John
Wilbanks
2008 -> 2010 (launch) at Panton Arms
Launch 2010
Panton Fellowships (2012)
Peter Murray-Rust, John Wilbanks, Jordan Hatcher, Jenny Molloy, Rufus Pollock, Cameron Neylon

“Licence STM Data as CC0”
ContentMining Targets
• PLOS, BMC (species/phylo): Ross Mounce (Bath)
• MDPI (metabolism, molecules): Andy Howlett, Mark Williamson (Cambridge)
• Crystallography (PMR, COD)
In 2014-04 ALL papers are minable in UK:
• Species/phylogenetics (ca 10,000 /year)
• Crystallographic recipes
• Metabolism
Hackathons
Large-scale Mining
CRAWLING

SEMANTICS
Raw content

Publisher
Site

Publisher-Specific
Crawler

Scientific Search Indexes
Current/Daily awareness
Mashups/LOD
New data
Validation / reproducibility
Reformatting
Semantic Objects
USE

AMI
Metadata

CKAN

STORAGE

PDF
XML
HTML
SVG
PNG
DOCX, XLS
CSV, CIF

Science

BackingStore
GoogleAPI?,
AWS?
Benefits of ContentMining
• Liberation of fulltext data.
• Liberation of supplemental data (PDF, DOCx)
• Normalization of syntax and vocabulary
• Integration with Open resources (Wikip/media), Pubchem, ChEBI, ChEMBL
• Open non-proprietary search indexes
• Validation (self-consistency, against standards,
computability, fraud)
10 million spectra published /year
Review of the NMR data reported in the Supporting
Information in this article evidences instances where some of
the spectra were inappropriately edited to remove
impurities. A coauthor and former student, Dr. Bruno
Anxionnat, has shared with me formal communication in
which he states “I would like to take full responsibility for this
entire situation. I was in charge of making the SI of my papers
and I erased some peaks without telling anybody. All my
supervisors (Pr. Cossy, Dr. Gomez Pardo and Dr. Ricci) trusted
me and I wasn't dependable. I am the only one who has to be
blamed for all that, in any case them. I know my behavior is
highly unethical. I am deeply sorry for what I have done and
for hurting people….”
Some thanks
• Jenny Molloy (Oxford), Max Hauessler (UCSD)
• Joe Townsend, Nick Day, Jim Downing, Mark
Williamson, Peter Corbett, Daniel Lowe and
others UCC Cambridge.
• Ross Mounce (Bath)
• Saulius Grazulis (COD)
Take-away messages
• Lost/unused STM* data costs 30-100 Billion / yr [1]
• Licence: DATA as CCZero and TEXT as CC-BY
• Content Mining for DATA is a RIGHT
• Apathy is our worst enemy
• Trust and empower young people

“A piece of content or data is open if anyone is
free to use, reuse, and redistribute it — subject
only, at most, to the requirement to attribute
and/or share-alike.”
*Scientific Technical Medical

[1] PMR: submission to UK Hargreaves process

Contenu connexe

Tendances

ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesespetermurrayrust
 
Open data and Open Science
Open data and Open ScienceOpen data and Open Science
Open data and Open Sciencepetermurrayrust
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machinespetermurrayrust
 
OpenNotebookScience NOW!
OpenNotebookScience NOW!OpenNotebookScience NOW!
OpenNotebookScience NOW!petermurrayrust
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trustpetermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neurosciencepetermurrayrust
 
Copyright Reform and Open Data
Copyright Reform and Open DataCopyright Reform and Open Data
Copyright Reform and Open Datapetermurrayrust
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolutionpetermurrayrust
 
Semantic Web in Physical Science
Semantic Web in Physical ScienceSemantic Web in Physical Science
Semantic Web in Physical Sciencepetermurrayrust
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)TheContentMine
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in NeuroscienceTheContentMine
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesTheContentMine
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trialspetermurrayrust
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical TrialsTheContentMine
 
ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiDataTheContentMine
 
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Ross Mounce
 
Principles and practice of Open Science
Principles and practice of Open SciencePrinciples and practice of Open Science
Principles and practice of Open Sciencepetermurrayrust
 
Content Mining of Science in Cambridge
Content Mining of Science in CambridgeContent Mining of Science in Cambridge
Content Mining of Science in CambridgeTheContentMine
 

Tendances (20)

Making Theses USEFUL
Making Theses USEFULMaking Theses USEFUL
Making Theses USEFUL
 
ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and theses
 
Open data and Open Science
Open data and Open ScienceOpen data and Open Science
Open data and Open Science
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
OpenNotebookScience NOW!
OpenNotebookScience NOW!OpenNotebookScience NOW!
OpenNotebookScience NOW!
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
Petermrjisc20141201
Petermrjisc20141201Petermrjisc20141201
Petermrjisc20141201
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Copyright Reform and Open Data
Copyright Reform and Open DataCopyright Reform and Open Data
Copyright Reform and Open Data
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolution
 
Semantic Web in Physical Science
Semantic Web in Physical ScienceSemantic Web in Physical Science
Semantic Web in Physical Science
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trials
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trials
 
ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiData
 
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
 
Principles and practice of Open Science
Principles and practice of Open SciencePrinciples and practice of Open Science
Principles and practice of Open Science
 
Content Mining of Science in Cambridge
Content Mining of Science in CambridgeContent Mining of Science in Cambridge
Content Mining of Science in Cambridge
 

Similaire à Machines Can Understand the Scientific Literature

ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesTheContentMine
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic BiologyTheContentMine
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biologypetermurrayrust
 
Social Machines of Scholarly Collaboration
Social Machines of Scholarly CollaborationSocial Machines of Scholarly Collaboration
Social Machines of Scholarly CollaborationDavid De Roure
 
Emerging Forms of Data and Analytics
Emerging Forms of Data and AnalyticsEmerging Forms of Data and Analytics
Emerging Forms of Data and AnalyticsDavid De Roure
 
Can machines understand the scientific literature
Can machines understand the scientific literatureCan machines understand the scientific literature
Can machines understand the scientific literaturepetermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in NeuroscienceTheContentMine
 
Digital Scholarship Intersection Scale Social Machines
Digital Scholarship Intersection Scale Social MachinesDigital Scholarship Intersection Scale Social Machines
Digital Scholarship Intersection Scale Social MachinesDavid De Roure
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKpetermurrayrust
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of ScienceGlobus
 
AH-XLDBEurope-position-09 jun2011
AH-XLDBEurope-position-09 jun2011AH-XLDBEurope-position-09 jun2011
AH-XLDBEurope-position-09 jun2011Alex Hardisty
 
towards interoperable archives: the Universal Preprint Service initiative
towards interoperable archives:  the Universal Preprint Service initiativetowards interoperable archives:  the Universal Preprint Service initiative
towards interoperable archives: the Universal Preprint Service initiativeHerbert Van de Sompel
 
High throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and thesesHigh throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and thesespetermurrayrust
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migrationpetermurrayrust
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search petermurrayrust
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsJon Voss
 
ContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific LiteratureContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific Literaturepetermurrayrust
 
When will there be a digital revolution in the humanities?
When will there be a digital revolution in the humanities?When will there be a digital revolution in the humanities?
When will there be a digital revolution in the humanities?Martin Wynne
 

Similaire à Machines Can Understand the Scientific Literature (20)

ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and theses
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
Social Machines of Scholarly Collaboration
Social Machines of Scholarly CollaborationSocial Machines of Scholarly Collaboration
Social Machines of Scholarly Collaboration
 
Emerging Forms of Data and Analytics
Emerging Forms of Data and AnalyticsEmerging Forms of Data and Analytics
Emerging Forms of Data and Analytics
 
Can machines understand the scientific literature
Can machines understand the scientific literatureCan machines understand the scientific literature
Can machines understand the scientific literature
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Digital Scholarship Intersection Scale Social Machines
Digital Scholarship Intersection Scale Social MachinesDigital Scholarship Intersection Scale Social Machines
Digital Scholarship Intersection Scale Social Machines
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
Making Theses USEFUL
Making Theses USEFULMaking Theses USEFUL
Making Theses USEFUL
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
 
AH-XLDBEurope-position-09 jun2011
AH-XLDBEurope-position-09 jun2011AH-XLDBEurope-position-09 jun2011
AH-XLDBEurope-position-09 jun2011
 
Cyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and BeyondCyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and Beyond
 
towards interoperable archives: the Universal Preprint Service initiative
towards interoperable archives:  the Universal Preprint Service initiativetowards interoperable archives:  the Universal Preprint Service initiative
towards interoperable archives: the Universal Preprint Service initiative
 
High throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and thesesHigh throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and theses
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
 
ContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific LiteratureContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific Literature
 
When will there be a digital revolution in the humanities?
When will there be a digital revolution in the humanities?When will there be a digital revolution in the humanities?
When will there be a digital revolution in the humanities?
 

Plus de petermurrayrust

Omdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital AgeOmdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital Agepetermurrayrust
 
Open Science Principles and Practice
Open Science Principles and PracticeOpen Science Principles and Practice
Open Science Principles and Practicepetermurrayrust
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentationpetermurrayrust
 
Can machines understand the scientific literature?
Can machines understand the scientific literature?Can machines understand the scientific literature?
Can machines understand the scientific literature?petermurrayrust
 
OpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFestOpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFestpetermurrayrust
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentationpetermurrayrust
 
Automatic mining of data from materials science literature
Automatic mining of data from materials science literatureAutomatic mining of data from materials science literature
Automatic mining of data from materials science literaturepetermurrayrust
 
openVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on virusesopenVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on virusespetermurrayrust
 
XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?petermurrayrust
 
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be BraveEarly Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be Bravepetermurrayrust
 
Early Career Reseachers and Open Healthcare
Early Career Reseachers and Open HealthcareEarly Career Reseachers and Open Healthcare
Early Career Reseachers and Open Healthcarepetermurrayrust
 
Scientific search for everyone
Scientific search for everyoneScientific search for everyone
Scientific search for everyonepetermurrayrust
 
Openplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searchingOpenplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searchingpetermurrayrust
 
Extracting science from the archive
Extracting science from the archiveExtracting science from the archive
Extracting science from the archivepetermurrayrust
 
WikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and EverythingWikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and Everythingpetermurrayrust
 
Disrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic ComplexDisrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic Complexpetermurrayrust
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Minepetermurrayrust
 
Young people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge NeocolonialismYoung people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge Neocolonialismpetermurrayrust
 
WikiFactMine: Science for Everyone
WikiFactMine: Science for EveryoneWikiFactMine: Science for Everyone
WikiFactMine: Science for Everyonepetermurrayrust
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017petermurrayrust
 

Plus de petermurrayrust (20)

Omdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital AgeOmdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital Age
 
Open Science Principles and Practice
Open Science Principles and PracticeOpen Science Principles and Practice
Open Science Principles and Practice
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentation
 
Can machines understand the scientific literature?
Machines Can Understand the Scientific Literature

  • 1. Can machines understand the Scientific Literature? Peter Murray-Rust University of Cambridge Open Knowledge Foundation Vilnius University, 2014-01-24, LT
  • 2. Themes • Collaboration with COD/IBT • The Semantic Web • The power and need for Open • Multidisciplinarity • “Artificial Intelligence / Google for Science” • Open, volunteer-based communities
  • 3. OpenStreetMap Built by 1 million volunteers; no central funding
  • 4. History of OSM mapping Vilnius 2009-10 Users donate GPS traces
  • 6. The Semantic Web "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001 CC-BY-SA Images from Wikipedia
  • 7. Artificial Intelligence in science In 1970 chess and chemistry were the sandboxes for AI. Some approaches: • Lookup (Knowledge) • Natural Language Processing (NLP) • Brute force calculation (inc. physical methods) • Tree-pruning and heuristics • Logic (cf. OWL-DL) • Human-machine integration (crowdsourcing) • Computer Vision Domain-specific Turing test: Can a machine pass a first-year chemistry exam?
  • 8. The scientist’s amanuensis • "The bane of my life is doing things I know computers could do for me" (Dan Connolly, W3C) Example: A semantic amanuensis could • Give me a daily digest of mineralogy papers • Extract all the crystal structures from them • Compute physical properties with GULP and NWChem • Compare the results statistically • Preserve and distribute the complete operation • Prepare the results for publication The semantic web is having a personal amanuensis
  • 9. Linked Open Data – the world’s knowledge [LOD cloud diagram; labelled regions: RDF triples; Music, Social, Art, Literature; Knowledge bases; DBPedia; Lib; GOV.uk; Comp; PDB; GOV; Ontologies; BIO] Very little physical science. http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
  • 10. Part of a COD RDF entry The Semantic Web understands this
  • 11. Linked Open data from Wikipedia “Which Rivers flow into the Rhine and are longer than 50 kilometers?” or “Which Skyscrapers in China have more than 50 floors and have been constructed before the year 2000?” Open Crystallography? “Which countries where tropical diseases are endemic have published structures of chiral natural products?” CC-BY-SA from Wikipedia
  • 12. MathML (Mathematics Markup Language): energy of c.c.p. lattice of argon. Human-friendly and machine-friendly forms (4 pages clipped). Automatic! Many editors and tools exist. We used MathWeaver.
  • 13. CML (Chemical Markup Language) Automatic! Human-friendly Machine-friendly
  • 14. Innovation with Componentisation Individual, manual, unreusable, flaky Commodity, standard, reliable, re-usable
  • 15. Current scientific information flow … is broken for data-rich science Non-semantic data PDF Lineprinter output Human input Text files Data extraction difficult and incomplete Human readers
  • 16. Semantic network closes the loop Measurement Computation Semantic Authoring Analysis Community Data available for e-science and reuse Data mined from document
  • 17. The network grows autonomously Human-machine Human-human Machine-human Machine-machine
  • 18. Humans and machines use different languages
  • 19. How a machine reads a chemical thesis nodes are compounds; arrows are reactions
  • 20. We can’t turn a hamburger into a cow But we can now turn PDFs into Science
  • 21. Chemical Computer Vision Raw Mobile photo; problems: Shadows, contrast, noise, skew, clipping
  • 22. Binarization (pixels = 0,1) Irregular edges
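The binarization step on slide 22 can be illustrated with a global threshold. This is only a minimal sketch: the slides do not say how the threshold is chosen, so the value 128, the function name `binarize`, and the tiny sample image are all assumptions made for illustration.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 1 (ink) or 0 (background).

    Assumption: dark pixels are ink, so values below `threshold` become 1.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# Hypothetical 3x4 "photo": a dark diagonal stroke on a light background.
image = [
    [30, 200, 210, 220],
    [190, 40, 205, 215],
    [180, 195, 50, 225],
]
binary = binarize(image)
```

A single global threshold is sensitive to the shadows and uneven contrast listed on slide 21, which is one reason the resulting edges can be irregular.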
  • 23. Hough transform for lines Finds orientation and position (not extent)
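A minimal sketch of the Hough voting idea on slide 23, assuming the usual normal parameterisation rho = x·cos(theta) + y·sin(theta) with one-degree angle steps and integer rho bins. The function name `hough_lines` and the sample points are hypothetical.

```python
import math
from collections import Counter


def hough_lines(points, theta_steps=180):
    """Vote in (theta, rho) space using rho = x*cos(theta) + y*sin(theta)."""
    votes = Counter()
    for x, y in points:
        for step in range(theta_steps):
            theta = math.pi * step / theta_steps
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(step, rho)] += 1
    return votes


# Hypothetical input: 50 ink pixels on the horizontal line y = 2.
points = [(x, 2) for x in range(50)]
(best_step, best_rho), count = hough_lines(points).most_common(1)[0]
```

The winning cell gives the angle of the line's normal (here 90 degrees, i.e. a horizontal line) and its distance from the origin, but nothing about where the segment starts or ends, which matches the slide's "not extent".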
  • 25. Thinning: thick lines to 1-pixel
  • 26. Chemical Optical Character Recognition Small alphabet, clean typefaces, clear boundaries make this relatively tractable. Problems are “I” “O” etc.
  • 29. PROPERTIES (Name-Value-Units-Error) [Annotated text example; spans are labelled N, V, U and VE] Note: CML supports value ranges and errors
  • 30. “nuggets” in a scientific paper: places, project, value ranges, quantity, units, chemical. Humans aren’t designed to mine this …
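The Name-Value-Units-Error structure on slide 29 can be sketched as a small record plus a parser. This is an illustration only: the `Property` class, the `parse_property` function and the "name: value ± error units" text pattern are assumptions, not CML's schema or the authors' extraction code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Property:
    name: str
    value: float
    units: str
    error: Optional[float] = None


# Hypothetical text pattern: "<name>: <value> [± <error>] <units>"
PATTERN = re.compile(
    r"(?P<name>[^:]+):\s*(?P<value>-?\d+(?:\.\d+)?)"
    r"(?:\s*(?:±|\+/-)\s*(?P<error>\d+(?:\.\d+)?))?\s*(?P<units>\S+)"
)


def parse_property(text):
    """Return a Property for text matching the pattern, else None."""
    m = PATTERN.fullmatch(text.strip())
    if m is None:
        return None
    error = float(m.group("error")) if m.group("error") else None
    return Property(m.group("name").strip(), float(m.group("value")),
                    m.group("units"), error)


# Made-up sample strings, not data from the talk.
prop = parse_property("melting point: 123.5 ± 0.5 K")
no_error = parse_property("density: 1.2 g/cm3")
```

Keeping the error optional mirrors the slide's note that values may come with or without a stated uncertainty.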
  • 31. Natural Language Processing Part of speech tagging (Wordnet, Brown Corpus, etc.)
  • 37. Automatic semantic markup of chemistry Could be used for analytical, crystallization, etc.
  • 38. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  • 39. Mathematics CML is being integrated with computable (content) MathML
  • 40. PDF → AMI → HTML. Evolution of ultraviolet vision in the largest avian radiation - the passerines. Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4. Styles, superscripts and diåcritics preserved!
  • 41. PDF → Turdus iliacus, Taeniopygia guttata, Serinus canaria, Lanius excubitor, Melopsittacus undulatus, Pavo cristatus, Sturnus vulgaris, Dolichonyx oryzivorus, Ficedula hypoleuca, Vaccinium myrtillus, Falco tinnunculus, Turdus, Pomatostomus, Leothrix, Amytornis, Acanthisitta, Orthonyx x 2, Malurus, Cnemophilus x 4, Philesturnus x 2, Motacilla x 2, Toxorhampus x 2
  • 42. Typical phylo tree: 60 nodes, complex and minuscule annotation, vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
  • 43. AMI can MEASURE branch lengths! Outputs: NexML, HTML. [Figure: posterior probability labels 0.84, 0.91, 0.93, 0.95; measured values 23.12, 34.54, 37.21, 38.55. Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma. Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae]
  • 46. Long
  • 47. Short
  • 49. The Blue Obelisk – Open Chemistry • Open Data, Open Standards, Open Source • consistent and complementary • non-divisive and fun • CDK • JChempaint • Jmol • JOELib • JUMBO • NMRShiftDB • Octet • Openbabel • QSAR • WWMM • JSpecView • http://www.blueobelisk.org
  • 51. Recommendations for Open Crystallography • Require Open Crystal Data for all publications • Deposition of Open Data in COD • Integrate CIF dictionaries as RDF into Linked Open Data • Integrate COD into Linked Open Data Cloud • CCDC/ICSD to publish RAW author CIFs Openly
  • 53. The network grows autonomously Human-machine Human-human Machine-human Machine-machine
  • 54. Tim Berners-Lee’s Open data http://5stardata.info ★ make your stuff available on the Web (whatever format) under an OPEN license ★★ make it available as structured data (i.e. NOT PDF) ★★★ use non-proprietary formats (e.g., CSV) ★★★★ use URIs to denote things, so that people can point at your stuff ★★★★★ link your data to other data to provide context. Examples marked on the scale: CIFDIC, ACS, IUCr, CRYSTALEYE
  • 55. • statement: "Nitrazepam" "target" "Gamma-aminobutyric-acid receptor subunit alpha-1" • triple: drugbank:DB01595 (compound) drugbank_vocabulary:target (predicate) drugbank:872 (target) . “Compound hasTarget Target”
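The triple on slide 55 can be written out as a subject-predicate-object line. The sketch below expands the slide's prefixed names into full URIs and serialises them in N-Triples style; the namespace URIs are placeholders invented for this example (the slide gives only the prefixes), and `expand` and `to_ntriple` are hypothetical helper names.

```python
# Assumption: these namespace URIs are placeholders, not DrugBank's real ones.
PREFIXES = {
    "drugbank": "http://example.org/drugbank/resource/",
    "drugbank_vocabulary": "http://example.org/drugbank/vocabulary/",
}


def expand(curie):
    """Expand a prefixed name such as 'drugbank:DB01595' to a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def to_ntriple(subject, predicate, obj):
    """Serialise one (subject, predicate, object) triple as an N-Triples line."""
    terms = (subject, predicate, obj)
    return " ".join(f"<{expand(term)}>" for term in terms) + " ."


# Compound hasTarget Target, using the identifiers shown on the slide.
triple = ("drugbank:DB01595", "drugbank_vocabulary:target", "drugbank:872")
line = to_ntriple(*triple)
```

Because every term is a URI, other datasets can point at the same compound or target, which is the fourth and fifth star on slide 54.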
  • 56. Some statistics • 3,000,000 scholarly publications/year => 10,000 / day • ~ $1000 APC / pub, typesetting $10 per page • Subscriptions ~ $10,000,000,000 / year • 20% ?? of current pubs Open or accessible • Article ~ 1 MByte, 15 pp (w/o data, images) • Download and processing ca 1 sec/page • arXiv $7 per article.
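A back-of-envelope check of the figures on slide 56, to show what they imply for mining the whole literature every day. The inputs are the slide's own numbers; the 365-day year, decimal megabytes and strictly serial processing are assumptions of this sketch.

```python
PUBS_PER_YEAR = 3_000_000
PAGES_PER_ARTICLE = 15
SECONDS_PER_PAGE = 1      # download and processing, from the slide
MBYTES_PER_ARTICLE = 1

# About 8,200 per day; the slide rounds this up to 10,000.
pubs_per_day = PUBS_PER_YEAR / 365

# Serial processing time needed to keep up with one day's output.
seconds_per_day = pubs_per_day * PAGES_PER_ARTICLE * SECONDS_PER_PAGE
machine_hours_per_day = seconds_per_day / 3600

# Storage for one year of articles, without data or images.
terabytes_per_year = PUBS_PER_YEAR * MBYTES_PER_ARTICLE / 1_000_000
```

Under these assumptions a day's literature needs roughly 34 machine-hours, so two processes running in parallel would keep pace, and a year of article text fits in about 3 TB.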
  • 57. RCUK, Wellcome, ERC, NSF … require fully OPEN [at Research Data Alliance, we are entering a new “era of open science”, which will be “good for citizens, good for scientists and good for society”. She explicitly highlighted the transformative potential of open access, open data, open software and open educational resources – mentioning the EU’s policy requiring open access to all publications and data resulting from EU funded research. http://blog.okfn.org/2013/03/21/we-are-entering-an-era-of-open-science-says-eu-vp-neeliekroes/
  • 58. Open Definition • “A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” OPEN: PDB; COD, Crystaleye; RSC/ACS/IUCr CIFs; Acta Cryst E; CIF dictionaries. NOT OPEN: CCDC, ICSD; Elsevier/Wiley/Springer CIFs; Acta Cryst ABCD (default).
  • 59. Panton Principles for Open Data in Science. Why? Wanted to avoid the mess in OA • Peter Murray-Rust, Cameron Neylon, Rufus Pollock, John Wilbanks 2008 -> 2010 (launch) at Panton Arms. Launch 2010. Panton Fellowships (2012). [Photo captions: Peter Murray-Rust, John Wilbanks, Jordan Hatcher, Jenny Molloy, Rufus Pollock, Cameron Neylon] “Licence STM Data as CC0”
  • 60. ContentMining Targets • PLOS, BMC (species/phylo): Ross Mounce (Bath) • MDPI (metabolism, molecules): Andy Howlett, Mark Williamson (Cambridge) • Crystallography (PMR, COD) In 2014-04 ALL papers are minable in UK: • Species/phylogenetics (ca 10,000 /year) • Crystallographic recipes • Metabolism
  • 62. Large-scale Mining [Pipeline diagram] CRAWLING: Publisher Site, Publisher-Specific Crawler, Raw content (PDF, XML, HTML, SVG, PNG, DOCX, XLS, CSV, CIF). SEMANTICS: AMI, Semantic Objects, Metadata. STORAGE: CKAN, Science BackingStore (GoogleAPI?, AWS?). USE: Scientific Search Indexes, Current/Daily awareness, Mashups/LOD, New data, Validation / reproducibility, Reformatting.
  • 63. Benefits of ContentMining • Liberation of fulltext data • Liberation of supplemental data (PDF, DOCx) • Normalization of syntax and vocabulary • Integration with Open resources (Wikip/media), Pubchem, ChEBI, ChEMBL • Open non-proprietary search indexes • Validation (self-consistency, against standards, computability, fraud)
  • 64. 10 million spectra published /year
  • 67. Review of the NMR data reported in the Supporting Information in this article evidences instances where some of the spectra were inappropriately edited to remove impurities. A coauthor and former student, Dr. Bruno Anxionnat, has shared with me formal communication in which he states “I would like to take full responsibility for this entire situation. I was in charge of making the SI of my papers and I erased some peaks without telling anybody. All my supervisors (Pr. Cossy, Dr. Gomez Pardo and Dr. Ricci) trusted me and I wasn't dependable. I am the only one who has to be blamed for all that, in any case them. I know my behavior is highly unethical. I am deeply sorry for what I have done and for hurting people….”
  • 68. Some thanks • Jenny Molloy (Oxford), Max Haeussler (UCSD) • Joe Townsend, Nick Day, Jim Downing, Mark Williamson, Peter Corbett, Daniel Lowe and others UCC Cambridge. • Ross Mounce (Bath) • Saulius Grazulis (COD)
  • 69. Take-away messages • Lost/unused STM* data costs 30-100 Billion /yr [1] • Licence: DATA as CCZero and TEXT as CC-BY • Content Mining for DATA is a RIGHT • Apathy is our worst enemy • Trust and empower young people. “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” *Scientific Technical Medical [1] PMR: submission to UK Hargreaves process