SlideShare une entreprise Scribd logo
1  sur  23
Rule-based Capture/Storage
of Scientific Data from PDF Files
and Export using a
Generic Scientific Data Model
Stuart J. Chalk, Audrey Bartholomew, Bashar Baraz, John Turner
Department of Chemistry, University of North Florida
schalk@unf.edu
CINF Paper 3 – 251st ACS Meeting Spring 2016
#ACSCINFDataSummit
 Data, Data Everywhere
 Extracting Text from PDF Files
 Making Sense of Textual Data
 Rules and Rulesets
 Overcoming Incompatibilities
 A Generic Scientific Data Model
 Building the Website
 Functionality of the Website
 Next Steps
 Future Plans
 Conclusion
Outline
Data, Data Everywhere
 U.S. Gov. “…open and machine readable data…” (5/2013)
 http://www.data.gov/
 NSF's Open Data Inventory (http://www.nsf.gov/data/)
 Expand -- publish additional data assets in the inventory:
 Enrich -- improve the discoverability, management, and
reusability of agency data assets through metadata.
 Open -- provide machine-readable and publically accessible
agency data assets.
 Funding agencies require “Data Management Plans”
 MPS Open Data Workshop - https://mpsopendata.crc.nd.edu/
Data, Data Everywhere
(but interoperable?)
 Joint Declaration of Data Citation Principles
Force 11 - https://www.force11.org/datacitation
 Importance
 Credit and Attribution
 Unique Identification
 Access
 Persistence
 Specificity and Verifiability
 Interoperability and Flexibility
 PDF Files are not meant to be vehicles for research data
sharing – but sadly have become the default
 Force 11 grew out of an initial meeting at UC San Diego in
2011 – Beyond the PDF
https://sites.google.com/site/beyondthepdf/
 PDF files store text in many places, and not in logical
sequence as it appears on the page
 Scanned images can be converted to text using optical
character recognition
 Many tools to get text out of PDF files
Extracting Text from PDF Files
 Software tested
 Adobe acrobat (plain text, accessible text, RTF, XML)
Converted to the newest PDF version
 lapdftext (Location Aware)
https://github.com/GullyAPCBurns/lapdftext
 pdftotext (part of)
Xpdf http://www.foolabs.com/xpdf/home.html
Extracting Text from PDF Files
Extracting Text from PDF Files
Making Sense of Textual Data
 Process the text file one line at a time
 ‘Generic’ rules to make decisions on a line
 May be multiple rules for one line
 Includes instructions to move down or stay on line
 Sets of rules tailored to match the generic layout of the text
 These were linked to datatype and publication
Rules and Rulesets
More sophiticated GetFormula regex (([A-Z][a-zA-Z0-9]{0,2}s)+)
 The data we extract with
the ruleset is saved in a
json string
 But how should it be
saved so that compatible
with other data?
 Different properties
 Different units
 We need a generic data
model to put the data in…
Overcoming Incompatibilities
 What is data?
A value of a measured property with/without unit
…or a series of data
... or an observation (text)
 But on its own its not that useful...
 What’s the context?
 Under what conditions was the data obtained?
 What were you testing?
 What instrument did you use to take the measurement?
 Who did the experiment?
A Generic Scientific Data Model
 How to store all this information?
 Relational database – very structured, data has to fixed schema
 Graph database – unstructured, flexible, non-standardized
 Intermediate solution needed
 Must have some structure so it can be easily searched
 Must be flexible to handle and type of data
 Approach
 Use JSON for linked data (JSON-LD) as format to hold data
 Provide a framework upon which you can hang the data and
metadata
A Generic Scientific Data Model
 JavaScript Object Notation (JSON)
 JSON for Linked Data (JSON-LD)
 W3C Recommentation (https://www.w3.org/TR/json-ld/)
 Specification that allows data/metadata to be stored in
JSON format but translated automatically to RDF
(Resource Description Framework)
 Provides a mechanism to add context to data/metadata
stored in JSON
 Note: Does not provide validation of the data/metadata
 Can be validated using JSON Schema
(http://json-schema.org)
What is JSON for Linked Data?
What is JSON for Linked Data?
{
"@context": {
"name": "http://schema.org/name",
"isAlive": "http://example.org/isAlive",
"age": "http://example.org/age",
"height": "http://schema.org/height",
"@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx"
},
"@id": "",
"name": "Stuart Chalk",
"isAlive": true,
"age": 49,
"height": 188.0
}
What is JSON for Linked Data?
<http://www.unf.edu/chemistry/stuart_chalk.aspx>
<http://example.org/age>
"49"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx>
<http://example.org/isAlive>
"true"^^<http://www.w3.org/2001/XMLSchema#boolean> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx>
<http://schema.org/height>
"188"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx>
<http://schema.org/name>
"Stuart Chalk" . http://json-ld.org/playground/
 CakePHP 2.7 PHP Framework (http://cakephp.org)
 Apache 2.4 webserver (http://httpd.apache.org)
 PHP 5.6 (http://www.php.org)
 MySQL 5.5 (http://www.mysql.com)
 jQuery (JavaScript) (http://jquery.com)
 Bootstrap (http://getbootstrap.com/)
 PHPStorm (https://www.jetbrains.com)
 GitHub (https://github.com)
Building the Website
 Upload PDF
 Assign Ruleset
 Extract data
 Perform QC
 Ingest Data
 View “dataset”
 Mass Process – many files from one publication
 Map the Process – see what happened
Functionality of the Website
Database Schema
Live Demo
 Formalize the system to
 Organize the processing
 Capture any processing current done in scripts
 Publish a paper
 Ingest data in SDM format into triplestore and
perform SPARQL queries on aggregated data
 Aggregate data of the same type on the website and
view, organize, plot together
 Provide option to capture and present equation
based results
Next Steps
Take Home
 ~27,500 PDF pages processed
 ~530,000 pieces of physical property data extracted
 Data can be extracted from PDF files as long as the
 Text is cleaned up
 Context is encoded in the process of extraction
 Data homogeneity can be achieved by using a generic
scientific data model
 More work needed to create robust system to handle data
in any “well structured” format
 schalk@unf.edu
 Phone: 904-620-5311
 Skype: stuartchalk
 LinkedIn/Slidehare: https://www.linkedin.com/in/stuchalk
 ORCID: http://orcid.org/0000-0002-0703-7776
 ResearcherID: http://www.researcherid.com/rid/D-8577-2013
Questions?
Thanks to Michael Klinge and Stefan Scherer
 Text Extraction of 11 Landolt-Börnstein Volumes
 Regulat Expression (Regex) fragments used to develop
rules to detect specific types of data
 PHP Scripts used to apply ‘Rulesets’ to text
 Extracted data and metadata captured in JSON
 JSON data added to MySQL database to support searching
and export in a generic scientific data model (JSON-LD)
 Website developed to search and display data
 ~27,500 PDF pages processed
 ~530,000 pieces of physical property data extracted
ChemExtractor:
Semantic Data from PDF Files

Contenu connexe

Tendances

2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod GmodJun Zhao
 
CDISC2RDF overview with examples
CDISC2RDF overview with examplesCDISC2RDF overview with examples
CDISC2RDF overview with examplesKerstin Forsberg
 
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v12016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1Bruce Kozuma
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1ErhardRahm
 
INSTRUCT - Integrated Structural Biology Infrastructure
INSTRUCT - Integrated Structural Biology InfrastructureINSTRUCT - Integrated Structural Biology Infrastructure
INSTRUCT - Integrated Structural Biology InfrastructureResearch Data Alliance
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkPaul Groth
 
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015Stuart Chalk
 
The OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit ProjectThe OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit ProjectAlexandro Colorado
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Laurent Alquier
 
The Electronic Notebook Ontology
The Electronic Notebook OntologyThe Electronic Notebook Ontology
The Electronic Notebook OntologyStuart Chalk
 
Yosemite part-4 webinar-final
Yosemite part-4 webinar-finalYosemite part-4 webinar-final
Yosemite part-4 webinar-finalDATAVERSITY
 
Web 3 Mark Greaves
Web 3 Mark GreavesWeb 3 Mark Greaves
Web 3 Mark GreavesMediabistro
 
Citation Analysis for the Free, Online Literature
Citation Analysis for the Free, Online LiteratureCitation Analysis for the Free, Online Literature
Citation Analysis for the Free, Online LiteratureBalachandar Radhakrishnan
 
Semantic Web Technologies: A Paradigm for Medical Informatics
Semantic Web Technologies: A Paradigm for Medical InformaticsSemantic Web Technologies: A Paradigm for Medical Informatics
Semantic Web Technologies: A Paradigm for Medical InformaticsChimezie Ogbuji
 
Presentation_euroCRIS_ES
Presentation_euroCRIS_ESPresentation_euroCRIS_ES
Presentation_euroCRIS_ESEd Simons
 
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings Kerstin Forsberg
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareKerstin Forsberg
 

Tendances (20)

Cdao Evolution08
Cdao Evolution08Cdao Evolution08
Cdao Evolution08
 
2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod Gmod
 
CDISC2RDF overview with examples
CDISC2RDF overview with examplesCDISC2RDF overview with examples
CDISC2RDF overview with examples
 
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v12016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1
 
INSTRUCT - Integrated Structural Biology Infrastructure
INSTRUCT - Integrated Structural Biology InfrastructureINSTRUCT - Integrated Structural Biology Infrastructure
INSTRUCT - Integrated Structural Biology Infrastructure
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic Framework
 
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
 
The OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit ProjectThe OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit Project
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
 
Crosslinks
Crosslinks Crosslinks
Crosslinks
 
The Electronic Notebook Ontology
The Electronic Notebook OntologyThe Electronic Notebook Ontology
The Electronic Notebook Ontology
 
Yosemite part-4 webinar-final
Yosemite part-4 webinar-finalYosemite part-4 webinar-final
Yosemite part-4 webinar-final
 
Web 3 Mark Greaves
Web 3 Mark GreavesWeb 3 Mark Greaves
Web 3 Mark Greaves
 
Citation Analysis for the Free, Online Literature
Citation Analysis for the Free, Online LiteratureCitation Analysis for the Free, Online Literature
Citation Analysis for the Free, Online Literature
 
Semantic Web Technologies: A Paradigm for Medical Informatics
Semantic Web Technologies: A Paradigm for Medical InformaticsSemantic Web Technologies: A Paradigm for Medical Informatics
Semantic Web Technologies: A Paradigm for Medical Informatics
 
Presentation_euroCRIS_ES
Presentation_euroCRIS_ESPresentation_euroCRIS_ES
Presentation_euroCRIS_ES
 
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcare
 
The CIARD RINGValeri
The CIARD RINGValeriThe CIARD RINGValeri
The CIARD RINGValeri
 

Similaire à Rule-based Capture/Storage of Scientific Data from PDF Files and Export using a Generic Scientific Data Model

Linking Open Government Data at Scale
Linking Open Government Data at Scale Linking Open Government Data at Scale
Linking Open Government Data at Scale Bernadette Hyland-Wood
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data ModelingVital.AI
 
Data.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked DataData.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked DataMatthew Rowe
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...LEARN Project
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEWShiyong Lu
 
FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout Carole Goble
 
Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...VMware Tanzu
 
Data standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesData standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesvty
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017Jongwook Woo
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshopl_ernest
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Research Data Alliance
 
Data Integration And Visualization
Data Integration And VisualizationData Integration And Visualization
Data Integration And VisualizationIvan Ermilov
 
Enterprise Content Management Migration Best Practices Feat Migrations From...
Enterprise Content Management Migration Best Practices   Feat Migrations From...Enterprise Content Management Migration Best Practices   Feat Migrations From...
Enterprise Content Management Migration Best Practices Feat Migrations From...Alfresco Software
 
Swap For Dummies Rsp 2007 11 29
Swap For Dummies Rsp 2007 11 29Swap For Dummies Rsp 2007 11 29
Swap For Dummies Rsp 2007 11 29Julie Allinson
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemChris Mattmann
 
RO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research ObjectsRO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research ObjectsCarole Goble
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
 

Similaire à Rule-based Capture/Storage of Scientific Data from PDF Files and Export using a Generic Scientific Data Model (20)

Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Linking Open Government Data at Scale
Linking Open Government Data at Scale Linking Open Government Data at Scale
Linking Open Government Data at Scale
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
 
Data.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked DataData.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked Data
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEW
 
FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout
 
Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...
 
Data standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesData standardization process for social sciences and humanities
Data standardization process for social sciences and humanities
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
 
Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...
Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...
Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
 
Linked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter HaaseLinked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter Haase
 
Data Integration And Visualization
Data Integration And VisualizationData Integration And Visualization
Data Integration And Visualization
 
Enterprise Content Management Migration Best Practices Feat Migrations From...
Enterprise Content Management Migration Best Practices   Feat Migrations From...Enterprise Content Management Migration Best Practices   Feat Migrations From...
Enterprise Content Management Migration Best Practices Feat Migrations From...
 
Swap For Dummies Rsp 2007 11 29
Swap For Dummies Rsp 2007 11 29Swap For Dummies Rsp 2007 11 29
Swap For Dummies Rsp 2007 11 29
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT Ecosystem
 
RO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research ObjectsRO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research Objects
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 

Plus de Stuart Chalk

Semantic properties and units
Semantic properties and unitsSemantic properties and units
Semantic properties and unitsStuart Chalk
 
Open semantic chemical structures
Open semantic chemical structuresOpen semantic chemical structures
Open semantic chemical structuresStuart Chalk
 
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...Stuart Chalk
 
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series DataSharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series DataStuart Chalk
 
Bringing Flow injection Analysis to the Semantic Web
Bringing Flow injection Analysis to the Semantic WebBringing Flow injection Analysis to the Semantic Web
Bringing Flow injection Analysis to the Semantic WebStuart Chalk
 
Reactions to the Open Spectral Database
Reactions to the Open Spectral DatabaseReactions to the Open Spectral Database
Reactions to the Open Spectral DatabaseStuart Chalk
 
Building a Standard for Standards: The ChAMP Project
Building a Standard for Standards: The ChAMP ProjectBuilding a Standard for Standards: The ChAMP Project
Building a Standard for Standards: The ChAMP ProjectStuart Chalk
 
A Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSXA Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSXStuart Chalk
 
Overview of the Analytical Information Markup Language (AnIML)
Overview of the Analytical Information Markup Language (AnIML)Overview of the Analytical Information Markup Language (AnIML)
Overview of the Analytical Information Markup Language (AnIML)Stuart Chalk
 
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into EurekaACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into EurekaStuart Chalk
 
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka IntegrationACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka IntegrationStuart Chalk
 
ACS 248th Paper 108 NIST-IUPAC Solubility Data
ACS 248th Paper 108 NIST-IUPAC Solubility DataACS 248th Paper 108 NIST-IUPAC Solubility Data
ACS 248th Paper 108 NIST-IUPAC Solubility DataStuart Chalk
 
ACS 248th Paper 104 ChemData Project
ACS 248th Paper 104 ChemData ProjectACS 248th Paper 104 ChemData Project
ACS 248th Paper 104 ChemData ProjectStuart Chalk
 
ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectStuart Chalk
 
ACS 248th Paper 67 Eureka Collaboration
ACS 248th Paper 67 Eureka CollaborationACS 248th Paper 67 Eureka Collaboration
ACS 248th Paper 67 Eureka CollaborationStuart Chalk
 
247th ACS Meeting: Experiment Markup Language (ExptML)
247th ACS Meeting: Experiment Markup Language (ExptML)247th ACS Meeting: Experiment Markup Language (ExptML)
247th ACS Meeting: Experiment Markup Language (ExptML)Stuart Chalk
 
Liberating Laboratory Data - Eureka
Liberating Laboratory Data - EurekaLiberating Laboratory Data - Eureka
Liberating Laboratory Data - EurekaStuart Chalk
 
Liberating Laboratory Data - AnIML
Liberating Laboratory Data - AnIMLLiberating Laboratory Data - AnIML
Liberating Laboratory Data - AnIMLStuart Chalk
 

Plus de Stuart Chalk (18)

Semantic properties and units
Semantic properties and unitsSemantic properties and units
Semantic properties and units
 
Open semantic chemical structures
Open semantic chemical structuresOpen semantic chemical structures
Open semantic chemical structures
 
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
 
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series DataSharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
 
Bringing Flow injection Analysis to the Semantic Web
Bringing Flow injection Analysis to the Semantic WebBringing Flow injection Analysis to the Semantic Web
Bringing Flow injection Analysis to the Semantic Web
 
Reactions to the Open Spectral Database
Reactions to the Open Spectral DatabaseReactions to the Open Spectral Database
Reactions to the Open Spectral Database
 
Building a Standard for Standards: The ChAMP Project
Building a Standard for Standards: The ChAMP ProjectBuilding a Standard for Standards: The ChAMP Project
Building a Standard for Standards: The ChAMP Project
 
A Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSXA Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSX
 
Overview of the Analytical Information Markup Language (AnIML)
Overview of the Analytical Information Markup Language (AnIML)Overview of the Analytical Information Markup Language (AnIML)
Overview of the Analytical Information Markup Language (AnIML)
 
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into EurekaACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
 
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka IntegrationACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration
 
ACS 248th Paper 108 NIST-IUPAC Solubility Data
ACS 248th Paper 108 NIST-IUPAC Solubility DataACS 248th Paper 108 NIST-IUPAC Solubility Data
ACS 248th Paper 108 NIST-IUPAC Solubility Data
 
ACS 248th Paper 104 ChemData Project
ACS 248th Paper 104 ChemData ProjectACS 248th Paper 104 ChemData Project
ACS 248th Paper 104 ChemData Project
 
ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP Project
 
ACS 248th Paper 67 Eureka Collaboration
ACS 248th Paper 67 Eureka CollaborationACS 248th Paper 67 Eureka Collaboration
ACS 248th Paper 67 Eureka Collaboration
 
247th ACS Meeting: Experiment Markup Language (ExptML)
247th ACS Meeting: Experiment Markup Language (ExptML)247th ACS Meeting: Experiment Markup Language (ExptML)
247th ACS Meeting: Experiment Markup Language (ExptML)
 
Liberating Laboratory Data - Eureka
Liberating Laboratory Data - EurekaLiberating Laboratory Data - Eureka
Liberating Laboratory Data - Eureka
 
Liberating Laboratory Data - AnIML
Liberating Laboratory Data - AnIMLLiberating Laboratory Data - AnIML
Liberating Laboratory Data - AnIML
 

Dernier

FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
An introduction on sequence tagged site mapping
An introduction on sequence tagged site mappingAn introduction on sequence tagged site mapping
An introduction on sequence tagged site mappingadibshanto115
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspectsmuralinath2
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptxSilpa
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsOrtegaSyrineMay
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxDiariAli
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfSumit Kumar yadav
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Silpa
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY1301aanya
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Silpa
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxrohankumarsinghrore1
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Silpa
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curveAreesha Ahmad
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 

Dernier (20)

Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
An introduction on sequence tagged site mapping
An introduction on sequence tagged site mappingAn introduction on sequence tagged site mapping
An introduction on sequence tagged site mapping
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptx
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 

Rule-based Capture/Storage of Scientific Data from PDF Files and Export using a Generic Scientific Data Model

  • 1. Rule-based Capture/Storage of Scientific Data from PDF Files and Export using a Generic Scientific Data Model Stuart J. Chalk, Audrey Bartholomew, Bashar Baraz, John Turner Department of Chemistry, University of North Florida schalk@unf.edu CINF Paper 3 – 251st ACS Meeting Spring 2016 #ACSCINFDataSummit
  • 2.  Data, Data Everywhere  Extracting Text from PDF Files  Making Sense of Textual Data  Rules and Rulesets  Overcoming Incompatibilities  A Generic Scientific Data Model  Building the Website  Functionality of the Website  Next Steps  Future Plans  Conclusion Outline
  • 3. Data, Data Everywhere  U.S. Gov. “…open and machine readable data…” (5/2013)  http://www.data.gov/  NSF's Open Data Inventory (http://www.nsf.gov/data/)  Expand -- publish additional data assets in the inventory:  Enrich -- improve the discoverability, management, and reusability of agency data assets through metadata.  Open -- provide machine-readable and publically accessible agency data assets.  Funding agencies require “Data Management Plans”  MPS Open Data Workshop - https://mpsopendata.crc.nd.edu/
  • 4. Data, Data Everywhere (but interoperable?)  Joint Declaration of Data Citation Principles Force 11 - https://www.force11.org/datacitation  Importance  Credit and Attribution  Unique Identification  Access  Persistence  Specificity and Verifiability  Interoperability and Flexibility
  • 5.  PDF Files are not meant to be vehicles for research data sharing – but sadly have become the default  Force 11 grew out of an initial meeting at UC San Diego in 2011 – Beyond the PDF https://sites.google.com/site/beyondthepdf/  PDF files store text in many places, and not in logical sequence as it appears on the page  Scanned images can be converted to text using optical character recognition  Many tools to get text out of PDF files Extracting Text from PDF Files
  • 6.  Software tested  Adobe acrobat (plain text, accessible text, RTF, XML) Converted to the newest PDF version  lapdftext (Location Aware) https://github.com/GullyAPCBurns/lapdftext  pdftotext (part of) Xpdf http://www.foolabs.com/xpdf/home.html Extracting Text from PDF Files
  • 8. Making Sense of Textual Data
  • 9.  Process the text file one line at a time  ‘Generic’ rules to make decisions on a line  May be multiple rules for one line  Includes instructions to move down or stay on line  Sets of rules tailored to match the generic layout of the text  These were linked to datatype and publication Rules and Rulesets More sophiticated GetFormula regex (([A-Z][a-zA-Z0-9]{0,2}s)+)
  • 10.  The data we extract with the ruleset is saved in a json string  But how should it be saved so that compatible with other data?  Different properties  Different units  We need a generic data model to put the data in… Overcoming Incompatibilities
  • 11.  What is data? A value of a measured property with/without unit …or a series of data ... or an observation (text)  But on its own its not that useful...  What’s the context?  Under what conditions was the data obtained?  What were you testing?  What instrument did you use to take the measurement?  Who did the experiment? A Generic Scientific Data Model
  • 12.  How to store all this information?  Relational database – very structured, data has to fixed schema  Graph database – unstructured, flexible, non-standardized  Intermediate solution needed  Must have some structure so it can be easily searched  Must be flexible to handle and type of data  Approach  Use JSON for linked data (JSON-LD) as format to hold data  Provide a framework upon which you can hang the data and metadata A Generic Scientific Data Model
  • 13.  JavaScript Object Notation (JSON)  JSON for Linked Data (JSON-LD)  W3C Recommentation (https://www.w3.org/TR/json-ld/)  Specification that allows data/metadata to be stored in JSON format but translated automatically to RDF (Resource Description Framework)  Provides a mechanism to add context to data/metadata stored in JSON  Note: Does not provide validation of the data/metadata  Can be validated using JSON Schema (http://json-schema.org) What is JSON for Linked Data?
  • 14. What is JSON for Linked Data? { "@context": { "name": "http://schema.org/name", "isAlive": "http://example.org/isAlive", "age": "http://example.org/age", "height": "http://schema.org/height", "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx" }, "@id": "", "name": "Stuart Chalk", "isAlive": true, "age": 49, "height": 188.0 }
  • 15. What is JSON for Linked Data? <http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/age> "49"^^<http://www.w3.org/2001/XMLSchema#integer> . <http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/isAlive> "true"^^<http://www.w3.org/2001/XMLSchema#boolean> . <http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/height> "188"^^<http://www.w3.org/2001/XMLSchema#integer> . <http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/name> "Stuart Chalk" . http://json-ld.org/playground/
  • 16.  CakePHP 2.7 PHP Framework (http://cakephp.org)  Apache 2.4 webserver (http://httpd.apache.org)  PHP 5.6 (http://www.php.org)  MySQL 5.5 (http://www.mysql.com)  jQuery (JavaScript) (http://jquery.com)  Bootstrap (http://getbootstrap.com/)  PHPStorm (https://www.jetbrains.com)  GitHub (https://github.com) Building the Website
  • 17.  Upload PDF  Assign Ruleset  Extract data  Perform QC  Ingest Data  View “dataset”  Mass Process – many files from one publication  Map the Process – see what happened Functionality of the Website
  • 20.  Formalize the system to  Organize the processing  Capture any processing current done in scripts  Publish a paper  Ingest data in SDM format into triplestore and perform SPARQL queries on aggregated data  Aggregate data of the same type on the website and view, organize, plot together  Provide option to capture and present equation based results Next Steps
  • 21. Take Home  ~27,500 PDF pages processed  ~530,000 pieces of physical property data extracted  Data can be extracted from PDF files as long as the  Text is cleaned up  Context is encoded in the process of extraction  Data homogeneity can be achieved by using a generic scientific data model  More work needed to create robust system to handle data in any “well structured” format
  • 22.  schalk@unf.edu  Phone: 904-620-5311  Skype: stuartchalk  LinkedIn/Slidehare: https://www.linkedin.com/in/stuchalk  ORCID: http://orcid.org/0000-0002-0703-7776  ResearcherID: http://www.researcherid.com/rid/D-8577-2013 Questions? Thanks to Michael Klinge and Stefan Scherer
  • 23.  Text Extraction of 11 Landolt-Börnstein Volumes  Regulat Expression (Regex) fragments used to develop rules to detect specific types of data  PHP Scripts used to apply ‘Rulesets’ to text  Extracted data and metadata captured in JSON  JSON data added to MySQL database to support searching and export in a generic scientific data model (JSON-LD)  Website developed to search and display data  ~27,500 PDF pages processed  ~530,000 pieces of physical property data extracted ChemExtractor: Semantic Data from PDF Files