1
twitter.com/openminted_eu
Petr Knoth
Knowledge Media Institute, The Open University
United Kingdom
Machine accessibility of Open Access
scientific publications from publisher systems
via ResourceSync
2
Research literature contains some of the most
important information we have assembled as a human
species, such as cures for diseases and answers to
many of the challenges the world is facing today.
3
Reading and systematically analysing this
information is beyond human capacities.
4
Why machine accessibility of publications?
• Text and data mining (TDM) can only fulfil its potential if TDM tools can be applied:
• to the widest possible set of publications
• as soon as publications are made available
• Many publication providers => need for interoperability
5
Current state of accessing research literature
• Aggregating publications from:
• Repositories (Green OA)
• Open access journals (Gold OA)
• OAI-PMH
6
Current state of accessing research literature
• Current state:
• ~80 million metadata records
• 8 million full-text records
• ~1.5 million monthly active users
• Different services
• API
• Data dumps
• Recommender system
• Analytics
• …
• But there is an issue here …
7
Problem
Key players do not provide interoperability for machine access to
metadata and content of research papers.
8
The idea of the Publisher Connector
Provide seamless access over non-standard APIs
9
Expertise Directory 1/2
• Contacted publishers to clarify the expected approaches
• Developed code to implement them:
• In most cases the final approach was different from the
suggestions we received.
• Tested the approach for scalability
• Documented the approach, justified why we followed it
(including what did not work) and gave recommendations
to publishers.
The expertise directory is available at: https://github.com/openminted/omtd-publisher-connector-harvester/blob/master/interoperability-layer/interoperability-layer.adoc
10
Expertise Directory 2/2
Example of limitations as described in the Expertise Directory
for Elsevier:
11
The idea of the Publisher Connector
Why not OAI-PMH?
• Slow and very inefficient for big repositories.
• Standardised for metadata
transfer but not for content
transfer.
• Very difficult to represent the
richness of metadata from a
broad range of data providers.
12
The idea of the Publisher Connector
Why ResourceSync?
• Scalable implementation
• Metadata format agnostic
• Made for the Web
13
Integration with OpenMinTeD
Via CORE and the OMTD-SHARE schema.
14
How does it work?
15
Architecture 1/3
• Microservices architecture with a message queue as a
communication channel.
• Discover, Retrieve, Expose (DRE) workflow: Discovery → Retrieval → Expose
16
Architecture 2/3
• Ingestion Services:
• Harvester service: Discovers new resources (publications)
and schedules them for downloading (via the message queue)
• Retriever service: Retrieves scheduled publications from the
queue and downloads them (both metadata and content)
applying an appropriate data source download client for each
publisher.
• Data source download clients: publisher specific methods for
discovery and retrieval + a generic CrossRef API wrapper.
• Exposure service:
• ResourceSync server service: exposes publications according
to the ResourceSync standard.
17
Architecture 3/3
• Message queue module: Interface to a message broker
(RabbitMQ) that is populated with publication events
scheduled for downloading.
• Database module: Stores and keeps track of downloads for
incremental synchronisation.
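The Discover, Retrieve, Expose flow described in the architecture slides can be sketched roughly as follows. This is a minimal illustration only: it swaps RabbitMQ for Python's in-process `queue.Queue`, uses a plain dictionary in place of the database module, and every name in it (`DownloadEvent`, `harvester`, `retriever`, `fake_client`, the example DOIs) is hypothetical rather than taken from the Publisher Connector code.

```python
import queue
from dataclasses import dataclass


@dataclass
class DownloadEvent:
    """A publication scheduled for downloading (hypothetical message shape)."""
    publisher: str
    doi: str


def harvester(publisher, discovered_dois, mq):
    """Discover: put one download event per newly discovered publication on the queue."""
    for doi in discovered_dois:
        mq.put(DownloadEvent(publisher, doi))


def retriever(mq, clients, store):
    """Retrieve: drain the queue, using the download client registered for each publisher."""
    while True:
        try:
            event = mq.get_nowait()
        except queue.Empty:
            break
        store[event.doi] = clients[event.publisher](event.doi)
        mq.task_done()


def fake_client(doi):
    # Stand-in for a publisher-specific download client (assumption, no network access).
    return {"metadata": {"doi": doi}, "content": b"%PDF-placeholder"}


mq = queue.Queue()   # stands in for the RabbitMQ broker
store = {}           # stands in for the database module
harvester("example-publisher", ["10.1234/a", "10.1234/b"], mq)
retriever(mq, {"example-publisher": fake_client}, store)
print(sorted(store))
```

Decoupling the two services through a queue means discovery and retrieval can be scaled or restarted independently, which is presumably the motivation for the message-broker design.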
18
Discovery
• Could be done via the CrossRef TDM API for some publishers
(typically smaller ones, for scalability reasons)
• Filtering by date of publication and for a set of OA licences
• Sitemap crawling for Elsevier
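Filtering by publication date and licence could look roughly like the sketch below. It assumes the public Crossref REST API `works` endpoint with its `filter`, `rows` and `cursor` parameters and the `from-pub-date` and `license.url` filters; whether the Publisher Connector issues exactly this kind of request is not stated in the slides. The function name, the date and the licence URL are illustrative. The code only builds the request URL and does not contact the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_discovery_url(from_date, license_urls, rows=100, cursor="*"):
    """Build a works query restricted by publication date and a set of licence URLs."""
    filters = [f"from-pub-date:{from_date}"]
    # Repeating a filter name expresses alternatives (any of the listed licences).
    filters += [f"license.url:{url}" for url in license_urls]
    params = {"filter": ",".join(filters), "rows": rows, "cursor": cursor}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


url = build_discovery_url(
    "2017-05-01",
    ["http://creativecommons.org/licenses/by/4.0/"],
)
decoded_filter = parse_qs(urlsplit(url).query)["filter"][0]
print(url)
print(decoded_filter)
```

Starting with `cursor="*"` and passing back the cursor value returned with each page is the usual way to page through large result sets with this API.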
19
Retrieval
• Each publisher employs different methods and rules for
downloading an article
20
Exposure
• Scalable implementation of a ResourceSync server: For each
publisher a new ResourceSync “capability” is created for its
metadata and one for its content (PDF). The ResourceSync server
is deployed at http://publisher-connector.core.ac.uk/resourcesync/
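To give a feel for what is exposed, the sketch below builds a small ResourceSync Resource List, which is a Sitemap document extended with `rs:md` elements from the ResourceSync namespace. It is an illustration under assumptions, not the deployed server's code: the function name, the example URI and the placeholder bytes are hypothetical, and a production server would also publish capability lists and change lists.

```python
import hashlib
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)


def build_resource_list(resources, generated_at):
    """Serialise resources as a ResourceSync Resource List (Sitemap-based XML)."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    # Document-level metadata declares which capability this document implements.
    ET.SubElement(urlset, f"{{{RS_NS}}}md",
                  {"capability": "resourcelist", "at": generated_at})
    for res in resources:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = res["uri"]
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = res["lastmod"]
        # Per-resource metadata lets a consumer verify what it downloaded.
        ET.SubElement(url, f"{{{RS_NS}}}md", {
            "hash": "md5:" + hashlib.md5(res["content"]).hexdigest(),
            "length": str(len(res["content"])),
            "type": res["mime_type"],
        })
    return ET.tostring(urlset, encoding="unicode")


xml_doc = build_resource_list(
    [{
        "uri": "http://example.org/content/1.pdf",   # hypothetical resource URI
        "lastmod": "2017-08-01T00:00:00Z",
        "content": b"%PDF-placeholder",
        "mime_type": "application/pdf",
    }],
    generated_at="2017-08-01T00:00:00Z",
)
print(xml_doc)
```

Because each entry carries a last-modified time, a consumer can compare it with its previous copy and fetch only what changed.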
21
How many articles are provided as OA?
Discovery of OA articles (May 2017)
Publisher         Metadata records   TDM-eligible license   OA articles
Springer Nature   10,383,519         1,393,991              438,139
Elsevier          14,988,181         ?                      1,005,768
Frontiers         68,790             68,790                 68,790
OA percentage: 7%; TDM-eligible: 9.7%
22
Volume sizes
As of August 2017
(this is the number exposed via http://publisher-connector.core.ac.uk/resourcesync/)
Total: 1,831,877
Elsevier 1,107,091
Springer 492,462
Frontiers 59,512
PLOS 172,812
23
Scalability analysis
Publisher   Discovery   Metadata + Content (single thread)   Identification (is OA?)
Elsevier    8m 1s       59m                                  instant
Springer    6m 52s      51m                                  n/a
Frontiers   16m 40s     2h 46m                               n/a
* On a sample of 10k documents, averaged over 2 trials
We can reprocess all Elsevier articles single-threaded in about 100h
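The 100h figure can be checked with a quick back-of-the-envelope calculation. The sketch assumes the estimate extrapolates the 59 minutes measured per 10k-document sample (metadata + content, single thread) to the 1,005,768 Elsevier OA articles reported for May 2017; the slides do not say which article count was actually used.

```python
ELSEVIER_OA_ARTICLES = 1_005_768   # OA articles discovered, May 2017 (assumed basis)
SAMPLE_SIZE = 10_000               # documents per timed sample
MINUTES_PER_SAMPLE = 59            # metadata + content retrieval, single thread

hours = ELSEVIER_OA_ARTICLES / SAMPLE_SIZE * MINUTES_PER_SAMPLE / 60
print(f"{hours:.1f} h")  # prints "98.9 h"
```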
24
Contributions
• Content:
• We liberated over 1.8 million open access publications from publishers
and made them available through a seamless layer
• As CORE integrates these papers, we now have over 8 million full-text
papers in CORE (and even more when combined with OpenAIRE).
• Technical:
• First implementation and deployment of ResourceSync that scales to
millions of items.
• ResourceSync solves the problems of aggregating content over OAI-PMH:
faster and more efficient aggregation means fresher data in
aggregators
• More work in this direction upcoming as part of COAR NGR

OSFair2017 training | Machine accessibility of Open Access scientific publications from publisher systems via ResourceSync
