Presentation by Tony Hammond and Michele Pasin to the Linked Science workshop, co-located with the International Semantic Web Conference (ISWC) 2015, on October 12, 2015
2. Who we are
We are both part of Macmillan Science and Education*
- Macmillan S&E is a global STM publisher
- Tony Hammond is Data Architect, Technology
@tonyhammond
- Michele Pasin is Information Architect, Product Office
@lambdaman
* We merged earlier this year (May 2015) with Springer Science+Business Media
to become Springer Nature. We are currently actively engaged in integrating our
businesses.
4. We publish a lot of science! (1845-2015)
http://www.nature.com/developers/hacks/articles/by-year
1.2 million articles in total
5. Why we’re here today: to ask some questions
We have been making semantic data available in RDF models for a number of
years through our data.nature.com portal (2012–2015)
Big questions:
- Is this data of any use to the Linked Science community?
- Should Springer Nature continue to invest in LOD sharing?
More specifically:
- Does the data contain enough items of interest? [Content]
- Are the vocabularies understandable and useful? [Structure]
- Are the data easy to get and to reuse? [Accessibility]
- Which is the preferred access option: dereference, download, or query?
6. Our goals and rationale
- Semantic technologies are a promising way to do enterprise metadata
management at web scale
- Initially used primarily for data publishing / sharing (data.nature.com, 2011)
- Since 2013, a core component of our digital publishing workflow (see ISWC14 paper)
- Contributing to an emerging web of linked science data
- As a major publisher since 1845, ideally positioned to bootstrap a science ‘publications hub’
- Building on the fundamental ties that exist between the actual research works and the
publications that tell their story
18. Datasets
- Articles: 25m records (for 1.2m articles) with metadata such as title, publication, etc., but excluding authors
- Contributors: 11m records (for 2.7m contributors), i.e. the articles’ authors, structured and ordered
but not disambiguated
- Citations: 218m records (for 9.3m citations) – from an earlier release
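To make the shape of these records concrete, here is a minimal sketch in Scala with Apache Jena (the stack named on the next slides) of one article record with one ordered contributor. The example.org namespace and all class/property names below are invented for illustration; they are not the actual data.nature.com vocabulary.

    import org.apache.jena.rdf.model.ModelFactory
    import org.apache.jena.vocabulary.{DC_11, RDF}

    object ArticleRecordSketch extends App {
      // Hypothetical namespace -- NOT the real data.nature.com vocabulary
      val EX = "http://example.org/terms/"
      val model = ModelFactory.createDefaultModel()
      model.setNsPrefix("ex", EX)

      val article     = model.createResource("http://example.org/articles/a1")
      val contributor = model.createResource("http://example.org/contributors/c1")

      article.addProperty(RDF.`type`, model.createResource(EX + "Article"))
      article.addProperty(DC_11.title, "A hypothetical article title")
      article.addProperty(model.createProperty(EX, "hasContributor"), contributor)

      // Contributors are structured and ordered, but not disambiguated:
      // the same person may appear as several distinct contributor records.
      contributor.addProperty(model.createProperty(EX, "position"),
                              model.createTypedLiteral(1))
      contributor.addProperty(model.createProperty(EX, "familyName"), "Doe")

      model.write(System.out, "TURTLE")
    }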
19. Datasets: articles-wikipedia links
How: data extracted using the Wikipedia search API; 51,309 links spanning 145 years
Quality: only ~900 links point to nature.com without a DOI; the rest all use DOIs correctly
Encoding: cito:isCitedBy => wiki URL, foaf:topic => DBpedia URI
http://www.nature.com/developers/hacks/wikilinks
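As a sketch of the encoding above, here is how the two link types could be asserted with Jena. The DOI, Wikipedia page and DBpedia resource are placeholders, not actual records from the dataset; cito:isCitedBy and foaf:topic are the real CiTO and FOAF terms.

    import org.apache.jena.rdf.model.ModelFactory

    object WikiLinksSketch extends App {
      val model = ModelFactory.createDefaultModel()
      val CITO = "http://purl.org/spar/cito/"
      val FOAF = "http://xmlns.com/foaf/0.1/"
      model.setNsPrefix("cito", CITO)
      model.setNsPrefix("foaf", FOAF)

      // Placeholder identifiers -- not an actual record from the dataset
      val article  = model.createResource("http://dx.doi.org/10.1038/XXXXXX")
      val wikiPage = model.createResource("https://en.wikipedia.org/wiki/Example")
      val dbpedia  = model.createResource("http://dbpedia.org/resource/Example")

      // cito:isCitedBy => wiki URL; foaf:topic => DBpedia URI
      article.addProperty(model.createProperty(CITO, "isCitedBy"), wikiPage)
      article.addProperty(model.createProperty(FOAF, "topic"), dbpedia)

      model.write(System.out, "TURTLE")
    }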
20. Data publishing: sources
Sources:
Ontologies (small scale; RDF native)
- mastered as RDF data (Turtle)
- managed in GitHub
- in-memory RDF models built using Apache Jena
- models augmented at build time using SPIN rules
- deployed to MarkLogic as RDF/XML for query
- exported as RDF dataset (Turtle) and as CSV
Documents (large scale; XML native)
- mastered as XML data
- managed in MarkLogic XML database
- data mined from XML documents (1.2m articles) using Scala
- in-memory RDF models built using Apache Jena
- injected as RDF/XML sections into XML documents for query
- exported as RDF dataset (N-Quads)
Organization:
Named graphs – one graph per class
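A minimal sketch of the one-graph-per-class organisation, again with Jena: each class’s triples go into their own named graph in a Dataset, which is then serialised as N-Quads as in the document pipeline above. The graph and term URIs are invented for illustration.

    import org.apache.jena.query.DatasetFactory
    import org.apache.jena.rdf.model.ModelFactory
    import org.apache.jena.riot.{Lang, RDFDataMgr}

    object NamedGraphsSketch extends App {
      val EX = "http://example.org/terms/"

      // One in-memory model per class (illustrative data only)
      val articles     = ModelFactory.createDefaultModel()
      val contributors = ModelFactory.createDefaultModel()

      articles.createResource("http://example.org/articles/a1")
              .addProperty(articles.createProperty(EX, "doi"), "10.1038/XXXXXX")
      contributors.createResource("http://example.org/contributors/c1")
                  .addProperty(contributors.createProperty(EX, "familyName"), "Doe")

      // One named graph per class
      val dataset = DatasetFactory.create()
      dataset.addNamedModel("http://example.org/graphs/articles", articles)
      dataset.addNamedModel("http://example.org/graphs/contributors", contributors)

      // Export the whole dataset as N-Quads
      RDFDataMgr.write(System.out, dataset, Lang.NQUADS)
    }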
26. Next steps
More features:
- Linked data dereference
- Richer dataset descriptions (VoID, PROV, HCLS Profile, etc.)
- SPARQL endpoint? (see the query sketch after this slide)
- JSON-LD API?
More data:
- Adding extra data points (funding info, abstracts, …)
- Revamp citations dataset
- Longer term: extending archive to include Springer content
More feedback:
- User testing around data accessibility
- Surveying communities/users for this data
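On the SPARQL endpoint question above: until an endpoint exists, the downloadable dumps can already be queried locally. A sketch using Jena’s ARQ, where the file name and the ex:doi property are assumptions for illustration, not part of the published vocabulary:

    import org.apache.jena.query.{QueryExecutionFactory, QueryFactory}
    import org.apache.jena.riot.RDFDataMgr

    object LocalQuerySketch extends App {
      // Load a downloaded dump (file name is an assumption)
      val model = RDFDataMgr.loadModel("articles.ttl")

      // Count distinct articles via a hypothetical DOI property
      val query = QueryFactory.create(
        """PREFIX ex: <http://example.org/terms/>
          |SELECT (COUNT(DISTINCT ?article) AS ?n)
          |WHERE { ?article ex:doi ?doi }""".stripMargin)

      val qexec = QueryExecutionFactory.create(query, model)
      try {
        val results = qexec.execSelect()
        while (results.hasNext) println(results.next().get("n"))
      } finally qexec.close()
    }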
27. Looking ahead: how can a publisher make linked science happen?
From a business perspective:
- Finding adequate licensing solutions
- Justifying the effort to publishers
- Who uses this data? What’s the ROI?
From a communities perspective:
- Do we actually know who the users are?
- How do we get more feedback/uptake?
- Should we work more with non-linked-data communities?
Editor's notes
ideally link to online representation
main questions for the presentation
> structure and mappings: accessible enough?
> content: big enough?
> accessibility: need more?
> overall: is this useful? should NPG stop releasing these data and keep using them only for internal purposes?
> data torrents?
slide about vision [1]
slide about vision [2]
The core model is a formal model that defines the key concepts we use for content publishing.
It includes branches that describe the things we publish (publications), the things we use to categorise the things we publish (types) and more abstract concepts to document details of the publication workflow (events).
In designing the Core Ontology, we adhered to three main principles:
Incremental formalization
We started out with a relatively flat model and tested it against our use cases and system architecture, adding structure as more precise requirements became available. The choice of names for classes and properties has also been tested and validated against our target audience and the enterprise use cases.
Cohesiveness
Although we do make some use of public vocabularies such as BIBO and FOAF, in general we decided to keep a minimal commitment to external vocabularies, as that lets us retain more control over our model and create a much more cohesive ontology. This is mainly because our current driver is to support internal applications. In order to facilitate web-scale data integration we have, wherever possible, added mappings to other commonly used vocabularies, e.g. BIBO, FABIO and FOAF, via owl:equivalentClass and owl:equivalentProperty relationships.
Focus on integration
We have primarily focused on building a shared enterprise model, e.g. by getting the core classes and properties right and thus achieving some simple yet fundamental level of data integration. So even though we make use of SPIN rules and some basic inference in the data enrichment phase, we have not yet really taken advantage of the various inference mechanisms that can be built on top of OWL.
Overall, the Core Ontology represents a measured balance between supporting legacy practices (some stretching back over many years) and enabling new requirements (which may only be revealed incrementally). It has been developed and grown within a cross-functional software delivery team. Some of the modelling clearly reflects immediate pragmatic concerns and the 'operational semantics' originating from our specific system architecture, but it is included here to show how we are using this ontology to drive forward our content publishing and discovery processes.
The Core Ontology is mapped to a number of external ontologies. We use owl:equivalentClass and owl:equivalentProperty properties to map our classes (>70 mappings) and properties (>30 mappings), respectively.
This is a work in progress, as we are constantly trying to improve the precision and variety of our mappings. We would encourage any interested party to give us feedback and suggestions about other models we should link to.
> The Subjects Ontology is mapped to the DBpedia and Wikidata datasets and also to the Bio2RDF and MeSH datasets. We use a skos:broadMatch or skos:closeMatch property to map our subjects instances.
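To illustrate the mapping mechanism described in the two notes above, a minimal sketch asserting one owl:equivalentClass mapping and one skos:closeMatch mapping with Jena. The source URIs below are placeholders; the real mappings use our published ontology and subject terms.

    import org.apache.jena.rdf.model.ModelFactory
    import org.apache.jena.vocabulary.{OWL, SKOS}

    object MappingsSketch extends App {
      val model = ModelFactory.createDefaultModel()

      // Placeholder source URIs -- the real mappings use our published terms
      val ourArticle  = model.createResource("http://example.org/ontologies/core/Article")
      val biboArticle = model.createResource("http://purl.org/ontology/bibo/Article")
      val ourSubject  = model.createResource("http://example.org/subjects/biotechnology")
      val dbpSubject  = model.createResource("http://dbpedia.org/resource/Biotechnology")

      // Class-level mapping to an external vocabulary
      ourArticle.addProperty(OWL.equivalentClass, biboArticle)

      // Instance-level mapping for a subject term
      ourSubject.addProperty(SKOS.closeMatch, dbpSubject)

      model.write(System.out, "TURTLE")
    }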