SlideShare une entreprise Scribd logo
1  sur  19
Mining Scientific Images 
Peter Murray-Rust, 
and TheContentMine 
WOSP, London, UK, 2014-09-12 
The ContentMine is supported by a grant to PMR as a
Research requires mining the WHOLE 
literature (3000 papers/day) 
• Aggregation of similar objects (phylogenetic 
trees) e.g. bacterial 
• Aggregation of complementary information 
(chemicals and species) 
• Metabolism (EBI and Cambridge) 
• Phytochemistry (Mint taxonomy)
Mint phylogeny working group 
The mint family (Lamiaceae), 
with approximately 236 
genera and 7200 species, is 
the sixth largest family of 
flowering plants, and has 
major economic and cultural 
importance worldwide. 
http://lamiaceae.myspecies.info/content/lamiaceae 
Ross Mounce 
P Murray-Rust 
Collaborators
http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014
PMR is collaborating with the European Bioinformatics 
Institute to liberate all metabolic information from journals
Publishers destroy structured 
information (LaTeX, Word) into PDF … 
• Characters (NOT words or higher structure) 
WORD is simply 4 characters, NO spaces 
• Paths (NOT circles, squares …) “Vectors” 
… They / their APIs then destroy it further 
into Pixels (e.g. PNG  or JPG ) 
Content Mine will read 10,000 PNGs a day and 
try to recover the science.
We can’t turn a hamburger into a cow 
But we can now 
turn PDFs into 
Science 
Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE
UNITS 
TICKS 
QUANTITY 
SCALE 
TITLES 
DATA!! 
2000+ points
Dumb PDF 
CSV 
Semantic 
Spectrum 
Automatic 
extraction 
Smoothing 
Gaussian Filter 
2nd Derivative
Chemical Computer Vision 
Raw Mobile photo; problems: 
Shadows, contrast, noise, skew, clipping
Binarization (pixels = 0,1) 
Irregular edges
Thinning: thick lines to 1-pixel
Chemical Optical Character Recognition 
Small alphabet, clean typefaces, clear boundaries make 
this relatively tractable. Problems are “I” “O” etc.
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home 
AMI reads the complete diagram, 
recognizes the paths and 
generates the molecules. Then 
she creates a stop-fram animation 
showing how the 12 reactions 
lead into each other 
CLICK HERE FOR ANIMATION 
(may be browser dependent) 
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI Demo 
http://www.mdpi.com/2218-1989/2/1/39/pdf 
https://bitbucket.org/AndyHowlett/ami2-poc 
ami2-poc -i example 
-v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor 
May take time to start if not connected to web 
Output in 
./example/target/output/reactionsexample 
Look at: image.g.1.4.svg.reaction0.cml in Avogadro
NEW Bacteria must have a phylogenetic tree 
Note Jaggy and 
broken pixels 
Length 
Weight _________ 
Binomial Name Culture/Strain GENBANK ID 
Evolution 
Rate
Thinning Topology 
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine- 
images-of-phylogenetic-trees-and-more/ for story of extraction 
Serialization 
Newick
Display your own tree 
• Cut and paste… 
• ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182) 
,((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218 
,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n18 
7),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n8 
8,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)) 
)),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n2 
31,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7, 
((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126, 
n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n1 
58,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163 
,n227)),((n53,n131),n159))))))); 
• View with http://www.unc.edu/~bdmorris/treelib-js/demo.html or 
• http://www.trex.uqam.ca/index.php?action=newick&project=trex
Questions and comments 
• Technical and/or scientific, please 
• Politics can wait till Charles Oppenheim 
presentation 
Thanks: 
• Andy Howlett, Dept Chemistry, Cambridge 
• Mark Williamson, Dept Chemistry, Cambridge 
• Ross Mounce, Biology, University of Bath

Contenu connexe

Tendances

SPARC Overview and Update, October 2008
SPARC Overview and Update, October 2008SPARC Overview and Update, October 2008
SPARC Overview and Update, October 2008
Elizabeth Brown
 
Open Access Overview, Faculty Senate Library Committee, 10/21/08
Open Access Overview, Faculty Senate Library Committee, 10/21/08Open Access Overview, Faculty Senate Library Committee, 10/21/08
Open Access Overview, Faculty Senate Library Committee, 10/21/08
Elizabeth Brown
 
An Overview of the OAI Object Reuse and Exchange Interoperability Framework
An Overview of the OAI Object Reuse and Exchange Interoperability FrameworkAn Overview of the OAI Object Reuse and Exchange Interoperability Framework
An Overview of the OAI Object Reuse and Exchange Interoperability Framework
Herbert Van de Sompel
 
DiFiore: JSTOR & Portico: Committed to preserving the scholarly record , Bing...
DiFiore: JSTOR & Portico: Committed to preserving the scholarly record , Bing...DiFiore: JSTOR & Portico: Committed to preserving the scholarly record , Bing...
DiFiore: JSTOR & Portico: Committed to preserving the scholarly record , Bing...
Elizabeth Brown
 

Tendances (20)

ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
Open Access for Early Career Researchers
Open Access for Early Career ResearchersOpen Access for Early Career Researchers
Open Access for Early Career Researchers
 
Text and data mining in UK and France (ADBU - 13 Dec 16)
Text and data mining in UK and France (ADBU - 13 Dec 16)Text and data mining in UK and France (ADBU - 13 Dec 16)
Text and data mining in UK and France (ADBU - 13 Dec 16)
 
IASSIST 2009
IASSIST 2009IASSIST 2009
IASSIST 2009
 
Museum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themMuseum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on them
 
Modern Tools & Rationales for 21st Century Research
Modern Tools & Rationales  for 21st Century ResearchModern Tools & Rationales  for 21st Century Research
Modern Tools & Rationales for 21st Century Research
 
Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | Future
 
Pham yang embl-ebi
Pham yang embl-ebiPham yang embl-ebi
Pham yang embl-ebi
 
EMBL Australia Bioinformatics Resource BioInfoSummer 2016
EMBL Australia Bioinformatics Resource BioInfoSummer 2016EMBL Australia Bioinformatics Resource BioInfoSummer 2016
EMBL Australia Bioinformatics Resource BioInfoSummer 2016
 
High throughput mining of the plant-science literature
High throughput mining of the plant-science literatureHigh throughput mining of the plant-science literature
High throughput mining of the plant-science literature
 
Towards Responsible Content Mining: A Cambridge perspective
Towards Responsible Content Mining: A Cambridge perspectiveTowards Responsible Content Mining: A Cambridge perspective
Towards Responsible Content Mining: A Cambridge perspective
 
SPARC Overview and Update, October 2008
SPARC Overview and Update, October 2008SPARC Overview and Update, October 2008
SPARC Overview and Update, October 2008
 
Reproducible and citable data and models: an introduction.
Reproducible and citable data and models: an introduction.Reproducible and citable data and models: an introduction.
Reproducible and citable data and models: an introduction.
 
Mining the scientific literature for plants and chemistry
Mining the scientific literature for plants and chemistryMining the scientific literature for plants and chemistry
Mining the scientific literature for plants and chemistry
 
Open Access Overview, Faculty Senate Library Committee, 10/21/08
Open Access Overview, Faculty Senate Library Committee, 10/21/08Open Access Overview, Faculty Senate Library Committee, 10/21/08
Open Access Overview, Faculty Senate Library Committee, 10/21/08
 
An Overview of the OAI Object Reuse and Exchange Interoperability Framework
An Overview of the OAI Object Reuse and Exchange Interoperability FrameworkAn Overview of the OAI Object Reuse and Exchange Interoperability Framework
An Overview of the OAI Object Reuse and Exchange Interoperability Framework
 
DiFiore: JSTOR & Portico: Committed to preserving the scholarly record , Bing...
DiFiore: JSTOR & Portico: Committed to preserving the scholarly record , Bing...DiFiore: JSTOR & Portico: Committed to preserving the scholarly record , Bing...
DiFiore: JSTOR & Portico: Committed to preserving the scholarly record , Bing...
 
Demaeyer
DemaeyerDemaeyer
Demaeyer
 
Your research as open science
Your research as open scienceYour research as open science
Your research as open science
 

En vedette

Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
Theju Paul
 

En vedette (20)

Mining Scientific Diagrams for facts
Mining Scientific Diagrams for factsMining Scientific Diagrams for facts
Mining Scientific Diagrams for facts
 
Asking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolismAsking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolism
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
TheContentMine: Mining for Everyone
TheContentMine: Mining for EveryoneTheContentMine: Mining for Everyone
TheContentMine: Mining for Everyone
 
ContentMine (EMBL-EBI Industry Programme)
ContentMine (EMBL-EBI Industry Programme)ContentMine (EMBL-EBI Industry Programme)
ContentMine (EMBL-EBI Industry Programme)
 
Architecture of ContentMine Components contentmine.org
Architecture of ContentMine Components contentmine.orgArchitecture of ContentMine Components contentmine.org
Architecture of ContentMine Components contentmine.org
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literatureAmanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literatureAutomatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literatureHigh throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
Asking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolismAsking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolism
 
Link Analysis
Link AnalysisLink Analysis
Link Analysis
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)
 
Content Mining of Science in Cambridge
Content Mining of Science in CambridgeContent Mining of Science in Cambridge
Content Mining of Science in Cambridge
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
 
Multidimensional data models
Multidimensional data  modelsMultidimensional data  models
Multidimensional data models
 
Link analysis .. Data Mining
Link analysis .. Data MiningLink analysis .. Data Mining
Link analysis .. Data Mining
 
Data warehouse concepts
Data warehouse conceptsData warehouse concepts
Data warehouse concepts
 

Similaire à Mining Scientific Images

Toast 2015 qiime_talk2
Toast 2015 qiime_talk2Toast 2015 qiime_talk2
Toast 2015 qiime_talk2
TOASTworkshop
 

Similaire à Mining Scientific Images (20)

Can machines understand the scientific literature
Can machines understand the scientific literatureCan machines understand the scientific literature
Can machines understand the scientific literature
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and theses
 
ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and theses
 
Toast 2015 qiime_talk2
Toast 2015 qiime_talk2Toast 2015 qiime_talk2
Toast 2015 qiime_talk2
 
Checking, Curating And Qualifying Chemistry
Checking, Curating And Qualifying ChemistryChecking, Curating And Qualifying Chemistry
Checking, Curating And Qualifying Chemistry
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
FAIR data management in biomedicine
FAIR data management  in biomedicineFAIR data management  in biomedicine
FAIR data management in biomedicine
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Mine
 
Mac309 Intro to the module
Mac309 Intro to the moduleMac309 Intro to the module
Mac309 Intro to the module
 
Qualifying Online Information Resources for Chemists
Qualifying Online Information Resources for ChemistsQualifying Online Information Resources for Chemists
Qualifying Online Information Resources for Chemists
 
Biobanks a cornerstone of Research Infrastructure
Biobanks a cornerstone of Research InfrastructureBiobanks a cornerstone of Research Infrastructure
Biobanks a cornerstone of Research Infrastructure
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humans
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humans
 
Scientific search for everyone
Scientific search for everyoneScientific search for everyone
Scientific search for everyone
 
ChemSpider as a Platform for Crowd Participation in Curating Chemistry
ChemSpider as a Platform for Crowd Participation in Curating ChemistryChemSpider as a Platform for Crowd Participation in Curating Chemistry
ChemSpider as a Platform for Crowd Participation in Curating Chemistry
 

Plus de petermurrayrust

Plus de petermurrayrust (20)

Omdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital AgeOmdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital Age
 
Open Science Principles and Practice
Open Science Principles and PracticeOpen Science Principles and Practice
Open Science Principles and Practice
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentation
 
Can machines understand the scientific literature?
Can machines understand the scientific literature?Can machines understand the scientific literature?
Can machines understand the scientific literature?
 
OpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFestOpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFest
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentation
 
Automatic mining of data from materials science literature
Automatic mining of data from materials science literatureAutomatic mining of data from materials science literature
Automatic mining of data from materials science literature
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
openVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on virusesopenVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on viruses
 
XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?
 
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be BraveEarly Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
 
Early Career Reseachers and Open Healthcare
Early Career Reseachers and Open HealthcareEarly Career Reseachers and Open Healthcare
Early Career Reseachers and Open Healthcare
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
Openplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searchingOpenplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searching
 
Extracting science from the archive
Extracting science from the archiveExtracting science from the archive
Extracting science from the archive
 
WikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and EverythingWikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and Everything
 
Disrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic ComplexDisrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic Complex
 
Young people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge NeocolonialismYoung people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge Neocolonialism
 
WikiFactMine: Science for Everyone
WikiFactMine: Science for EveryoneWikiFactMine: Science for Everyone
WikiFactMine: Science for Everyone
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
Silpa
 

Dernier (20)

FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
An introduction on sequence tagged site mapping
An introduction on sequence tagged site mappingAn introduction on sequence tagged site mapping
An introduction on sequence tagged site mapping
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 

Mining Scientific Images

  • 1. Mining Scientific Images Peter Murray-Rust, and TheContentMine WOSP, London, UK, 2014-09-12 The ContentMine is supported by a grant to PMR as a
  • 2. Research requires mining the WHOLE literature (3000 papers/day) • Aggregation of similar objects (phylogenetic trees) e.g. bacterial • Aggregation of complementary information (chemicals and species) • Metabolism (EBI and Cambridge) • Phytochemistry (Mint taxonomy)
  • 3. Mint phylogeny working group The mint family (Lamiaceae), with approximately 236 genera and 7200 species, is the sixth largest family of flowering plants, and has major economic and cultural importance worldwide. http://lamiaceae.myspecies.info/content/lamiaceae Ross Mounce P Murray-Rust Collaborators
  • 5. PMR is collaborating with the European Bioinformatics Institute to liberate all metabolic information from journals
  • 6. Publishers destroy structured information (LaTeX, Word) into PDF … • Characters (NOT words or higher structure) WORD is simply 4 characters, NO spaces • Paths (NOT circles, squares …) “Vectors” … They / their APIs then destroy it further into Pixels (e.g. PNG  or JPG ) Content Mine will read 10,000 PNGs a day and try to recover the science.
  • 7. We can’t turn a hamburger into a cow But we can now turn PDFs into Science Pixel => Path => Shape => Char => Word => Para => Document => SCIENCE
  • 8. UNITS TICKS QUANTITY SCALE TITLES DATA!! 2000+ points
  • 9. Dumb PDF CSV Semantic Spectrum Automatic extraction Smoothing Gaussian Filter 2nd Derivative
  • 10. Chemical Computer Vision Raw Mobile photo; problems: Shadows, contrast, noise, skew, clipping
  • 11. Binarization (pixels = 0,1) Irregular edges
  • 12. Thinning: thick lines to 1-pixel
  • 13. Chemical Optical Character Recognition Small alphabet, clean typefaces, clear boundaries make this relatively tractable. Problems are “I” “O” etc.
  • 14. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-fram animation showing how the 12 reactions lead into each other CLICK HERE FOR ANIMATION (may be browser dependent) Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
  • 15. AMI Demo http://www.mdpi.com/2218-1989/2/1/39/pdf https://bitbucket.org/AndyHowlett/ami2-poc ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor May take time to start if not connected to web Output in ./example/target/output/reactionsexample Look at: image.g.1.4.svg.reaction0.cml in Avogadro
  • 16. NEW Bacteria must have a phylogenetic tree Note Jaggy and broken pixels Length Weight _________ Binomial Name Culture/Strain GENBANK ID Evolution Rate
  • 17. Thinning Topology https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine- images-of-phylogenetic-trees-and-more/ for story of extraction Serialization Newick
  • 18. Display your own tree • Cut and paste… • ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182) ,((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218 ,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n18 7),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n8 8,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)) )),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n2 31,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7, ((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126, n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n1 58,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163 ,n227)),((n53,n131),n159))))))); • View with http://www.unc.edu/~bdmorris/treelib-js/demo.html or • http://www.trex.uqam.ca/index.php?action=newick&project=trex
  • 19. Questions and comments • Technical and/or scientific, please • Politics can wait till Charles Oppenheim presentation Thanks: • Andy Howlett, Dept Chemistry, Cambridge • Mark Williamson, Dept Chemistry, Cambridge • Ross Mounce, Biology, University of Bath