TSO Organograms - Moving Linked Data into production and reaping the benefits - SemTechBiz 2012
1. Moving linked data into production
- and reaping the benefits
Richard Goodwin
SemTechBiz, September 2012
Part of the Williams Lea Group
2. What does TSO do?
Semantic discoverability solutions
Linked Open Data
Breaking data out of silos
Dedicated semantic team with a variety of experience & backgrounds
3. Why Semantic discoverability?
Aggregate relevant data
Extract important information from it
Enrich the content
Add value to the data
Allow linking, re-use and repurposing
4. Organograms
Good use case for Linked Open Data
Automated dissemination process
Making government information accessible and open
6. What was involved?
Using Excel source data
– convert from CSV to RDF using PHP
and XL Wrap
Preview and publish through linked
data API
Creating custom organogram
Visualisation
Supporting distributed publication by
owner organisations
7. Achievements
Publishing in RDF right across
government
– now = 200
– soon = 400+
New data published every 6 months
Humans use Visualisation for
information about government
Machines can pull regularly updated
info to create other resources
8. Challenges
WCM unable to handle RDF
documents
Officials struggling to get sign-off from
ministers
Upload / validation usability issues
Minimise errors that departments are
expected to remedy
Reduce bootstrapping
Exploit value in data set - see
changes over time
9. Challenges – improve the user experience
CSV and validation usability issues
– Apparently inconsistent validation
– Silent errors
– Department uploading ≠ department featured
– Departments see all files clearly marked as senior CSV, junior CSV and RDF
Sign-off from ministers
– Senior management, short of time
10. Challenges – improve the user experience
Organisational quirks
– e.g. some Ministry of Defence (MoD) civil servants report to minister, others of same
grade report elsewhere
Grades need to be more flexible
– e.g. "equivalent to grade X" or accept those parts which are correct and flag the others
Duplicate uploads need flagging
Improve the speed of preview function
11. Solutions – WCM and reliability
Replaced the XL Wrap with CSV2TTL
– a Python-based implementation of CSV to RDF
This supports efficient and reliable publishing of RDF triples from CSV
Early validation takes place in spreadsheet template
Data owners upload the spreadsheet to the preview server for signing-off
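The CSV to RDF step on this slide can be illustrated with a minimal sketch. This is not the CSV2TTL code, which is not shown in the talk; the column names (post_id, title, reports_to), the ex: namespace and the function names are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical namespace, chosen for illustration only.
PREFIXES = "@prefix ex: <http://example.org/organogram/> .\n"


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for a Turtle string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def csv_to_turtle(csv_text: str) -> str:
    """Turn rows with post_id, title and reports_to columns into Turtle triples."""
    lines = [PREFIXES]
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = "ex:post-" + row["post_id"].strip()
        title = escape_literal(row["title"].strip())
        lines.append(f'{subject} ex:title "{title}" .')
        manager = row["reports_to"].strip()
        if manager:
            # Only emit a reporting line when the cell is filled in.
            lines.append(f"{subject} ex:reportsTo ex:post-{manager} .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "post_id,title,reports_to\n1,Permanent Secretary,\n2,Director,1\n"
    print(csv_to_turtle(sample))
```

A production converter would also need to validate identifiers before using them in prefixed names and to map columns onto an agreed vocabulary rather than an example namespace.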
12. Solutions – Usability
The main constraint on our action is the use of the templates from within the Government Secure Intranet
– VBA code inappropriate
Strip out lengthy formulae (hard to maintain)
– Net result no change to file size despite extra features
Provide per cell rather than per row feedback to users
Hide extraneous cells and improve validation rules
Use single-cell lookup point for web application to ascertain validity
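The idea of per-cell feedback plus a single validity flag can be sketched outside Excel. The rules, column names and grade codes below are hypothetical; the real template expresses its checks as spreadsheet validation rules, which are not reproduced here.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical grade codes, for illustration only.
KNOWN_GRADES = {"G1", "G2", "G3"}

Check = Callable[[str], Optional[str]]


def not_empty(value: str) -> Optional[str]:
    """Return an error message if the cell is blank, otherwise None."""
    return None if value.strip() else "value is required"


def valid_grade(value: str) -> Optional[str]:
    """Return an error message if the grade is not in the known set."""
    return None if value.strip() in KNOWN_GRADES else "unknown grade"


# Hypothetical mapping of column name to the checks applied to each cell.
RULES: Dict[str, List[Check]] = {
    "name": [not_empty],
    "grade": [not_empty, valid_grade],
}


def validate_cells(rows: List[Dict[str, str]]) -> Dict[Tuple[int, str], str]:
    """Return one message per failing cell, keyed by (row number, column)."""
    errors: Dict[Tuple[int, str], str] = {}
    for number, row in enumerate(rows, start=1):
        for column, checks in RULES.items():
            for check in checks:
                message = check(row.get(column, ""))
                if message:
                    errors[(number, column)] = message
                    break  # report the first problem found in this cell
    return errors


def summary_flag(errors: Dict[Tuple[int, str], str]) -> str:
    """Single lookup value that a web application could read to decide validity."""
    return "VALID" if not errors else "INVALID"
```

Keying errors by cell rather than by row lets the user see exactly which value to correct, while the summary flag gives the upload application one place to check.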
13. Linked data - increasing value over time
Enables user to
– View the change in the shape of government over time
– Use a slider on Visualisation to show changes
Solution
– Serves all datasets from same iteration into single Knowledge Base (KB), with each different iteration in separate KBs
– Data registry maintains the mapping between <iteration>, <department> and knowledge base, <graph>
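A data registry of this kind can be sketched as a lookup from (iteration, department) to (knowledge base, graph). The class, the KB naming scheme and the example graph URIs are assumptions for illustration; the talk does not describe how the registry is implemented.

```python
from typing import Dict, List, NamedTuple, Tuple


class Location(NamedTuple):
    """Where a department's data for one iteration is stored."""
    knowledge_base: str
    graph: str


class DataRegistry:
    """Maps (iteration, department) to the knowledge base and named graph."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], Location] = {}

    def register(self, iteration: str, department: str, graph: str) -> Location:
        # Assumed naming scheme: one knowledge base per iteration.
        location = Location(knowledge_base=f"kb-{iteration}", graph=graph)
        self._entries[(iteration, department)] = location
        return location

    def locate(self, iteration: str, department: str) -> Location:
        """Raises KeyError if nothing is registered for this pair."""
        return self._entries[(iteration, department)]

    def iterations(self) -> List[str]:
        """Sorted list of iterations, e.g. to populate a time slider."""
        return sorted({iteration for iteration, _ in self._entries})


if __name__ == "__main__":
    registry = DataRegistry()
    registry.register("2011-09", "dept-a", "http://example.org/graph/2011-09/dept-a")
    registry.register("2012-03", "dept-a", "http://example.org/graph/2012-03/dept-a")
    print(registry.locate("2012-03", "dept-a"))
    print(registry.iterations())
```

A visualisation slider could step through iterations() and call locate() to find which knowledge base and graph to query for each position.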
14. OpenUp® Platform
Harvest
– Aggregation of data from web, APIs, databases and files
Enrich
– Extracting useful data and converting to re-usable formats
Store
– Highly scalable database storage and query engine
Publish
– Websites and APIs to reach data users
Automated processes that deliver reliable data
So who are TSO? TSO have been around a long time in one way or another; we are the former publishing operations of Her Majesty's Stationery Office and were privatised in 1996. By virtue of the types of information we process on behalf of government (for example the law, parliamentary information, insolvency information and so on), the publishing processes of data capture, transformation and publication need to be sustainable for the long term. This applies whether the output is print or digital or a mix of both. RDF and web publishing are natural progressions for TSO; we've actually been doing semantic work for the last 5 years or so. Semantic technologies are rapidly becoming a standard part of TSO solutions.
Coming back from the specific to the general, the organogram project is really all about what semantic discoverability is. Fundamentally it's publishing that you can forget about! That is, once it is set up and configured you are hands-off: it happens day in, day out, reliably, on time, in an automated fashion. But it also includes tracking all of the data transformations so that users and publishers can return to those processes if necessary. So provenance plays an important part; in fact my colleague represents legislation.gov.uk on the W3C Provenance Interchange Working Group.
Why do it? Because if people are to invest time and money in building a system on top of published data, they need some level of guarantee that the data will be reliably available, continuously updated and up to date, and that the system will be around for long enough to keep their own system viable.
The benefits include: centralized storage, better quality, standardized machine-readable data, sharing and reusability, lower cost and greater speed.
Pulling all of these OpenUp® platform elements together is the Government Organogram project, which went live in June 2011. What makes it exciting is that it is the first example of every government department publishing RDF. The organograms utilise the Harvester, RDF Store, SPARQL endpoint, Linked Data API and hosting facilities. The harvesting strategy was changed in September 2011; we now accept data uploads from departments.
The work involved:
– capturing data in a simple Excel template
– converting data to CSV and RDF using PHP and XLWrap
– previewing and publishing data through a linked data API and custom organogram visualisation
– supporting distributed publication by owner organisations
"We found detailed guidance, early validation, vocabulary reuse and visualisation are key to data quality, and that Linked Data can support the demands of both re-users and end users. The resulting reference data will be linked to from other government datasets."
We replaced XLWrap with CSV2TTL, a Python-based implementation of CSV to RDF. The new software supports publishing RDF triples from CSV in a more efficient and reliable way. The data owners are required to fill in the spreadsheet following the given template. This is where the early validation takes place. The data owners upload the spreadsheet to the preview server, where previewing and signing-off happen.
The Web Content Management system was unable to handle RDF, so the original plan (publish on the website, register on data.gov.uk, and let the harvester collect the data) was not possible. Data validation was a challenge, with individuals introducing errors and using the system differently. The focus had been on getting the information out rather than figuring out how to display the changes over time.
Government is a huge organisation, which means that inconsistencies inevitably crept in across departments, breaking initial assumptions.
Excel spreadsheet improvements. The public sector and government use the Government Secure Intranet (GSI) as a platform for their IT systems. It would have been ideal to use signed macros, but with 200 sets of users at launch (soon rising to 400), testing on all systems and supporting that number of users was not practical. The aim is to make the process intuitive for users.
Our approach to creating a semantic discoverability solution is to create a suite of infrastructural services on which we or others can develop applications. This has been branded the OpenUp® platform and it has developed over the past 18 months.
The four major elements of OpenUp® services are:
– Harvesting: the ability to get hold of content in a reliable, regular way
– Enrichment: the ability to extract additional information from your data using information extraction techniques
– Storage: holding the data
– Publishing: making that data available on the web, as well as supporting websites for that data if required
The platform is scalable at each level and runs on hardware in our data centres in Norwich and London. (Our hosting is used in particular for government websites, as we have the appropriate security certifications.)