Comparing OAI-PMH and ResourceSync Performance
1. Comparing the Performance of OAI-PMH with ResourceSync
@petrknoth @mart1nkle1n
OR 2019, 06/12/2019, Hamburg, Germany
Comparing the Performance of
OAI-PMH with ResourceSync
Petr Knoth, Matteo Cancellieri
Knowledge Media Institute
The Open University
UK
Martin Klein
Research Library
Los Alamos National Laboratory
USA
2. Comparing the Performance of OAI-PMH with ResourceSync
ResourceSync and repositories
“A single scientific repository is of limited value, real benefits
come from the ability to exchange data within a network …
… interoperability allows us to exploit today's computational
power so that we can aggregate, data mine, create new tools
and services, and generate new knowledge from repository
content.” - COAR
3. Comparing the Performance of OAI-PMH with ResourceSync
ResourceSync and repositories
Protocols for data exchange are the lifeblood of the scholarly communication system
4. Comparing the Performance of OAI-PMH with ResourceSync
Aggregators and ResourceSync
ResourceSync (CORE FastSync)
3rd parties:
- data analysis
- TDM
5. Comparing the Performance of OAI-PMH with ResourceSync
Repository aggregators have large full text collections
core.ac.uk stats:
• 13,117,488 Hosted full texts
• 135,539,113 Metadata records
• ~78m Links to full text
• 15TB of raw plain text
• 4,123 Data providers
6. Comparing the Performance of OAI-PMH with ResourceSync
Many OAI-PMH implementation challenges …
Locating full text URLs in metadata
Restrictions on full text downloading
Sequential nature of OAI-PMH
Failing resumption tokens
Incremental updates
Scalability
Metadata interoperability
Reliability
No content harvesting support
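The sequential nature of OAI-PMH and its dependence on resumption tokens can be illustrated with a minimal harvesting loop. This is only a sketch, not CORE's harvester: `harvest`, `fake_fetch` and the canned pages are hypothetical, and `fetch` stands in for an HTTP GET against a repository's OAI-PMH endpoint. The `verb`, `metadataPrefix` and `resumptionToken` request parameters are those defined by OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix="oai_dc"):
    """Yield record identifiers, one ListRecords page at a time.

    `fetch(params)` is a hypothetical callable that performs the HTTP GET
    and returns the response body as a string.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = (root.findtext(".//" + OAI + "resumptionToken") or "").strip()
        if not token:
            return  # an empty or missing token marks the last page
        # The next request depends on this token, so pages cannot be
        # fetched in parallel; a failing token stalls the whole harvest.
        params = {"verb": "ListRecords", "resumptionToken": token}


def _page(identifier, token):
    """Build a tiny canned ListRecords response (illustrative only)."""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        f"<record><header><identifier>{identifier}</identifier></header></record>"
        f"<resumptionToken>{token}</resumptionToken>"
        "</ListRecords></OAI-PMH>"
    )


PAGES = {
    None: _page("oai:example.org:1", "page-2"),
    "page-2": _page("oai:example.org:2", ""),
}


def fake_fetch(params):
    # Stand-in for the network: look up the page by resumption token.
    return PAGES[params.get("resumptionToken")]


ids = list(harvest(fake_fetch))
print(ids)
```

Each page must be downloaded and parsed before the next request can be built, which is why harvesting time grows with the number of pages rather than with available bandwidth.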
7. Comparing the Performance of OAI-PMH with ResourceSync
Speed of OAI-PMH implementations
10. Comparing the Performance of OAI-PMH with ResourceSync
Aggregators have a lot of usage
• January 2019 – CORE reached over 10M monthly active users for the first time
• 571% increase from January 2018
• core.ac.uk ranks in the top 0.0009% of global websites by usage
11. Comparing the Performance of OAI-PMH with ResourceSync
Aggregator’s challenge
• Stay up to date despite thousands of data providers
• Efficiently expose large amounts of data to many users:
  • Human users
  • Machines (scalability!)
• OAI-PMH implementations can hardly deal with the job:
  • Scalability
  • Metadata inconsistency
  • Support for metadata harvesting only
12. Comparing the Performance of OAI-PMH with ResourceSync
Research question
Is ResourceSync better suited for the job than OAI-PMH?
13. Comparing the Performance of OAI-PMH with ResourceSync
OAI-PMH - Background
http://openarchives.org/pmh/
• Recurrent metadata exchange from a Data Provider to Service Providers
• XML metadata only
• Repository centric
• Devised 1999-2002, prior to REST, prior to dominance of web search engines
14. Comparing the Performance of OAI-PMH with ResourceSync
ResourceSync - Background
http://www.openarchives.org/rs/1.1/resourcesync
• Synchronization of resources from a Source to Destinations
• Web resources, anything with an HTTP URI & representation
• Resource centric
• Devised 2012-2013, leverages key ingredients of web interoperability, existing specifications, existing Search Engine Optimization practice
15. Comparing the Performance of OAI-PMH with ResourceSync
ResourceSync in a Nutshell
16. Comparing the Performance of OAI-PMH with ResourceSync
ResourceSync Capabilities
21. Comparing the Performance of OAI-PMH with ResourceSync
Many to One - Aggregator
22. Comparing the Performance of OAI-PMH with ResourceSync
ResourceSync is based on Sitemaps
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
  …
</urlset>
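Because a Sitemap is plain XML, a harvester can read it with a standard XML parser and use `lastmod` to decide what to fetch. A minimal sketch using Python's standard library follows; the function names `list_resources` and `changed_since` are hypothetical, and the sample document repeats the two entries from the slide.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SM = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
</urlset>"""


def list_resources(xml_text):
    """Return (loc, lastmod) pairs for every <url> entry."""
    root = ET.fromstring(xml_text)
    return [
        (url.findtext("sm:loc", namespaces=SM),
         url.findtext("sm:lastmod", namespaces=SM))
        for url in root.findall("sm:url", SM)
    ]


def changed_since(xml_text, since):
    """Return the locations whose lastmod is later than `since` (UTC)."""
    result = []
    for loc, lastmod in list_resources(xml_text):
        # Assumes the UTC "Z" timestamp form used in the example above.
        stamp = datetime.strptime(lastmod, "%Y-%m-%dT%H:%M:%SZ")
        if stamp.replace(tzinfo=timezone.utc) > since:
            result.append(loc)
    return result


resources = list_resources(SITEMAP)
changed = changed_since(
    SITEMAP, datetime(2013, 1, 2, 13, 30, tzinfo=timezone.utc)
)
print(resources)
print(changed)
```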
23. Comparing the Performance of OAI-PMH with ResourceSync
ResourceSync Resource List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2019-06-11T09:00:00Z"
         completed="2019-06-11T09:00:44Z" />
  <url>
    <loc>http://example.com/res1_metadata.xml</loc>
    <lastmod>2019-06-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="823"
           type="text/xml" />
  </url>
</urlset>
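The `hash` and `length` attributes on `rs:md` let a Destination check that a downloaded resource matches what the Source declared. The sketch below shows one way to do that check; `declared` and `matches` are hypothetical helper names, the entry is built around made-up content rather than the slide's resource, and only a single `algorithm:digest` value is handled.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def declared(url_element):
    """Read the declared hash and length from a <url> entry's rs:md."""
    md = url_element.find("rs:md", NS)
    return md.get("hash"), int(md.get("length"))


def matches(content, declared_hash, declared_length):
    """Check downloaded bytes against the declared length and digest."""
    if len(content) != declared_length:
        return False
    algorithm, _, digest = declared_hash.partition(":")
    # "sha-256" style names are normalised to hashlib's "sha256".
    computed = hashlib.new(algorithm.replace("-", ""), content).hexdigest()
    return computed == digest.lower()


# Illustrative content and a matching Resource List entry built from it.
body = b"<record><title>Example</title></record>"
entry = ET.fromstring(
    '<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
    'xmlns:rs="http://www.openarchives.org/rs/terms/">'
    "<loc>http://example.com/res1_metadata.xml</loc>"
    f'<rs:md hash="md5:{hashlib.md5(body).hexdigest()}" '
    f'length="{len(body)}" type="text/xml"/>'
    "</url>"
)

declared_hash, declared_length = declared(entry)
ok = matches(body, declared_hash, declared_length)
# Same length, different bytes: the digest comparison catches it.
tampered = matches(body.replace(b"Example", b"Exampel"),
                   declared_hash, declared_length)
print(ok, tampered)
```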
24. Comparing the Performance of OAI-PMH with ResourceSync
Resource List with Link
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2019-06-11T09:00:00Z"
         completed="2019-06-11T09:00:44Z" />
  <url>
    <loc>http://example.com/res1_metadata.xml</loc>
    <lastmod>2019-06-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="823"
           type="text/xml" />
    <rs:ln href="http://example.com/res1_content.pdf"
           rel="describes"
           length="8876"
           type="application/pdf" />
  </url>
</urlset>
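With the `rs:ln` element in place, a harvester no longer has to guess where the full text lives: it can read the link straight from the Resource List. The sketch below parses the example above and pairs each metadata record with the resource it describes; `described_resources` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2019-06-11T09:00:00Z"
         completed="2019-06-11T09:00:44Z" />
  <url>
    <loc>http://example.com/res1_metadata.xml</loc>
    <lastmod>2019-06-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="823"
           type="text/xml" />
    <rs:ln href="http://example.com/res1_content.pdf"
           rel="describes"
           length="8876"
           type="application/pdf" />
  </url>
</urlset>"""


def described_resources(xml_text):
    """Return (metadata URL, described URL, media type) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        for link in url.findall("rs:ln", NS):
            # Only follow links that point from metadata to its content.
            if link.get("rel") == "describes":
                triples.append((loc, link.get("href"), link.get("type")))
    return triples


links = described_resources(RESOURCE_LIST)
print(links)
```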
25. Comparing the Performance of OAI-PMH with ResourceSync
ResourceSync Characteristics
• Designed to allow synchronization of resources, not just metadata
• Explicit link between metadata and the described resource
• Not prescriptive about the metadata format
• Web-centric
• Push-based Change Notifications (WebSub)
26. Comparing the Performance of OAI-PMH with ResourceSync
Comparative Analysis
1. Assess the speed of OAI-PMH implementations across repositories
See results on slide #7
27. Comparing the Performance of OAI-PMH with ResourceSync
Comparative Analysis
1. Assess the speed of OAI-PMH implementations across repositories
2. Understand the recall in full-text harvesting
28. Comparing the Performance of OAI-PMH with ResourceSync
Recall of full-text harvesting – the power of the explicit full text link
29. Comparing the Performance of OAI-PMH with ResourceSync
Comparative Analysis
1. Assess the speed of OAI-PMH implementations across repositories
2. Understand the recall in full-text harvesting
3. Evaluate simulated metadata harvesting with ResourceSync implementations for:
a) Standard Mode
• Resources synced via Resource Lists, one resource at a time (per HTTP transaction)
b) Resource Dump Mode
• Resources packaged into a Resource Dump, transferred via one HTTP transaction
c) Batch Mode
• Resources packaged into partial, on-demand Resource Dumps, transferred via multiple HTTP transactions
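The three modes differ mainly in how many HTTP transactions are needed to move the same set of resources. The sketch below is a back-of-the-envelope request count under stated assumptions, not the benchmark behind this talk: `http_transactions` is a hypothetical function, the batch size of 500 is an arbitrary illustration, and fetching the Resource List or dump manifests themselves is ignored.

```python
import math


def http_transactions(n_resources, mode, batch_size=None):
    """Count the transfers needed to move n_resources in a given mode.

    Counts only resource transfers; this is a simplified model and says
    nothing about measured harvesting speed.
    """
    if mode == "standard":
        # One resource per HTTP transaction.
        return n_resources
    if mode == "dump":
        # Everything packaged into a single Resource Dump.
        return 1
    if mode == "batch":
        # Partial, on-demand dumps of at most batch_size resources each.
        if not batch_size or batch_size < 1:
            raise ValueError("batch mode needs a positive batch_size")
        return math.ceil(n_resources / batch_size)
    raise ValueError(f"unknown mode: {mode}")


N = 10_000  # illustrative collection size
counts = {
    "standard": http_transactions(N, "standard"),
    "dump": http_transactions(N, "dump"),
    "batch": http_transactions(N, "batch", batch_size=500),
}
print(counts)
```

Batch mode sits between the two extremes: far fewer round trips than Standard Mode, without requiring the Source to package the whole collection into one dump.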
30. Comparing the Performance of OAI-PMH with ResourceSync
Speed of simulated ResourceSync implementations
32. Comparing the Performance of OAI-PMH with ResourceSync
Why On-Demand Resource Dumps
• Many repositories have hundreds of OAI sets:
  • Cannot materialize (too much data and processing requirements)
  • Cannot rely on Resource List (too slow)
• HATEOAS approach: https://blog.core.ac.uk/2018/03/17/increasing-the-speed-of-harvesting-with-on-demand-resource-dumps/
33. Comparing the Performance of OAI-PMH with ResourceSync
Recommendations for data providers
• Adopt ResourceSync at a platform level (EPrints, DSpace, Fedora, etc.)
• Many considerations:
  • Support Change Lists? Dump? Naming of Capability Lists? On-Demand Dumps? How to link resources? WebSub?
• Guidelines needed!
• Resource List adoption only viable for small providers
• Support for on-demand Resource Dumps needed!
• ResourceSync Client-Server implementation available: https://github.com/resync/resync
• CORE happy to benchmark repository platforms
• LANL working on validator
34. Comparing the Performance of OAI-PMH with ResourceSync
Take-aways
• OAI-PMH implementations vary substantially in terms of number of records downloaded per second
• ResourceSync provides up to 10 times faster harvesting speeds with Resource Dumps
• On-demand Resource Dumps for optimization
  • Not yet part of the standard
• Thanks to resource linking, low recall is less of an issue!