1. Publishing EPA Data as
Linked Data
A brief by
Michael Pendleton
EPA Office of Environmental Information
pendleton.michael@epa.gov
2. What is driving us?
“We’re moving from managing documents
to managing discrete pieces of open data
and content which can be tagged, shared,
secured, mashed up and presented in the
way that is most useful for the consumer
of that information.”
-- Report on Digital Government: Building a 21st Century Platform to
Better Serve the American People
4. Linked Data
What’s It All About?
• Speak the Language of the Web
• Just as you surf web pages, linked data lets you surf
data.
• SOAP was about making the web try to work like
applications; REST was about making applications
work like the web.
• Linked Data is about making your DATA work like the
web.
Slide Credit: David G. Smith, U.S. Environmental Protection Agency,
Aug 16, 2011 presentation
6. Linked Data
Basics
• Tim Berners-Lee: 5-Star model for publishing
data
Slide Credit: David G. Smith, U.S. Environmental Protection Agency
7. • Linked Data is about
publishing and
consuming data
using international
data standards
• Based on 20 year
old idea (the Web)
• A system of linked
information systems
9. Global requirements
• Comprehensively link
legislation & regulations
for more effective
government
• Explain context, source,
version & publication
date with the data itself
• We need global
standards for metadata
10. The mission of the Government Linked
Data (GLD) Working Group is to provide
standards and other information which
help governments around the world
publish their data as effective and usable
Linked Data using Semantic Web
technologies.
13. And now,
Linked Open Data ...
• A proof-of-concept launched 2011 with 5 Star Linked Data
• Publication of 1.3M facilities (FRS) and the substances (SRS)
regulated by the EPA
• TRI program links to 25 years of data on major polluters
• Additional pilots in 2012 incorporating EPA and anonymized
electronic medical records (EMR) data from Sentara
Healthcare
• 5 Star Linked Open Data to be hosted & accessible on an EPA
production Web site in summer 2012
14. Increase re-use by publishing
Linked Data
• Empower users to create their own views of data to
satisfy different applications
• Build a community around the data in which users help
each other to curate and connect as needed
• Skip the supermodel - Leave data in the multiple “best
of breed” systems; wrap and expose on the Web of Data
15. There is a Process
• Identify
• Model
• Name
• Describe
• Convert
• Publish
• Maintain
19. 7 steps to publishing Linked Data
• Identify a dataset others are likely to want to re-use
• Modeling
• Onsite modeling session (half day)
• Linked Data modeling supported by experts
• Validate the model with data owners/stewards
• Publish data on the Web (opendata.epa.gov) per Best Practices
• Produce automated scripts to maintain current data
• Announce Linked Open Data sets *
• Review usage reports to support relevance & user feedback
* Pending EPA Systems Security Plan approval
20. Open Data Platforms
• We’re using Callimachus, a Web
platform for data-driven applications
based on Linked Data principles.
• It is hosted on Amazon EC2 and we
have 24x7x365 data & application
support.
• There are other data platforms; we
selected this one because it is fully
W3C standards compliant, with no vendor
“lock in”
• It’s Open Source (Apache 2.0)
31. Recommendations
• Linked Data promotes goals of transparency &
economic development during times of fiscal
austerity
• Publish in reusable format (RDF family of
standards)
• Use OPEN vs proprietary in data formats
• Define a URI Policy and Strategy
• Use existing best practices and vocabularies;
don’t reinvent the wheel
33. Resources
• VisibleGovernment.ca Website http://visiblegovernment.ca
• Hack, Mash and Peer: Crowdsourcing Government Transparency, Jerry Brito, George
Mason University, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1023485
• Blog on UK Environment Agency Water Quality, see
http://data.southampton.ac.uk/datasets.html
• Southampton Open Data Service, see http://data.southampton.ac.uk/datasets.html
• Blog post on Clean Energy data from Reegle, see
http://blog.semantic-web.at/2012/04/13/reegle-info-linked-open-energy-data-cloud/
• Blog post on Publishing Linked Open Data in Tight Economic Times, 30-Jan-2012,
http://3roundstones.com/2012/01/30/publishing-linked-open-data-makes-good-sense-in-tight-economic-times/
• Blog post on HealthData.gov from US Health & Human Services, 4-June-2012,
http://www.healthdata.gov/blog/welcome-new-healthdatagov
• Blog post on US HHS Domain Challenge 1: Metadata, 2-June-2012,
http://www.healthdata.gov/blog/domain-challenge-1-metadata
34. Coming soon ...
• Best Practices for Publishing Linked Data (Editor’s Draft
20-Apr-2012), see
https://dvcs.w3.org/hg/gld/raw-file/default/bp/index.html
• Linked Data Cookbook, see
http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook
• Linked Data Directory, see http://dir.w3.org
• Attend the 2012 International Open Government Data
Conference co-sponsored by data.gov & The World Bank
10-12 July 2012, Washington DC, see
http://www.data.gov/communities/conference
The report recently published by the White House describes the information, platform and presentation layers of the digital services that agencies are to provide. The EPA joins government authorities around the world that are defining plans based on open data and open APIs.
A lot of people in governments around the world are publishing data on the Web of Data. We’re familiar with portals such as data.gov.uk and data.gov. Often this is in the form of CSV files, but an increasing amount is available as well-modeled Linked Data. We just participated in the International Open Data Conference, which showed that the open (government) data community is really thriving: 450 in-person participants from over 50 countries, 4000 online participants, over 2000 tweets & 162 speakers.
Not all Open Government content is Linked Data, but a growing number of data sets are available as 4-5 star Linked Data. Use of structured data is actively promoted by international standards groups like the W3C and by major search engines such as Google, Yahoo!, Bing and Yandex.
This presentation discusses the increasing number of high value data sets being published by the EPA as 5 star Linked Data. This means data is published on the Web in both human & machine readable formats. A human can read the nicely formatted content AND a machine can find, access and re-use the machine readable format if it is published in the Web’s data exchange format, RDF.
There are a growing number of resources on this topic; several have been authored by EPA’s Linked Data contractors, Dr. David Wood and Bernadette Hyland, and their colleague Dr. Tom Heath. Links to all the projects described in this talk are included at the end of this presentation.
Data formats and standards sometimes sound like alphabet soup to many people. The EPA is a member of the W3C Government Linked Data Working Group. We have a practical focus on removing the friction from the Web publishing process and, specifically, are working to make it easier for government authorities to publish DATA on the WEB.
The GLD working group works with leading academics, and has guests from the private sector & non-profits who define use cases for open government data. They describe the need for government agencies to publish content that describes the relative authority of a piece of data, for example, case law and regulation.
The EPA is a member of the W3C and is active on the Government Linked Data Working Group, along with our colleague George Thomas from HHS. We are one year into the working group’s two year charter. Our mission is ...
The GLD WG is on track to publish Best Practices and Vocabulary Guidance as W3C Recommendations, which are the standards of the World Wide Web. We’ve also produced a Linked Data Directory of projects, products and service providers, and a Government Linked Data Cookbook describing a step by step approach for developers.
So where are we today? The EPA already publishes a huge amount of information as CSV files and through portals like Envirofacts. Unfortunately, that data is often hard to find and lacks context. Furthermore, it’s written from a regulatory perspective. It is not re-usable by other scientists and the public without significant re-structuring.
Our goals are to broaden access to and re-use of this important data that taxpayers have paid us to collect, and to reduce the burden of compliance for regulated entities.
So here is the exact process: Identify the data. Model exemplar records -- what you are going to carry forward. Name all of the NOUNs. Turn the records into URIs. Next, describe RESOURCES with vocabularies. Write a script or process to convert from, say, CSV to RDF. Automate it so it is easy to maintain. This is routinely done in a 30-60 day sprint, with the involvement of the EPA data steward, a project manager and 2 Linked Data experts, part time.
We draw “ball and stick” diagrams that describe how all the data is RELATED to each other. That is all there is to Linked Data: it is a view of data and its relationships to other pieces of information. Other people can come along and add more relationships and information they have.
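The ball-and-stick idea can be sketched in a few lines of Python. This is only an illustration, not the EPA data model: the `ex:` prefix, the facility and substance identifiers, and the predicate names are all hypothetical.

```python
# Each "stick" is a (subject, predicate, object) triple; each "ball" is a node.
# All identifiers below are hypothetical, for illustration only.
epa_triples = {
    ("ex:facility/1001", "ex:name", "Example Chemical Plant"),
    ("ex:facility/1001", "ex:releases", "ex:substance/benzene"),
}

# A third party can add relationships about the same nodes without
# touching the original data: merging two graphs is just a set union.
community_triples = {
    ("ex:facility/1001", "ex:ownedBy", "ex:org/example-holdings"),
    ("ex:substance/benzene", "ex:label", "Benzene"),
}


def merge(*graphs):
    """Return the union of several sets of triples."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Return the sorted (predicate, object) pairs known for one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


combined = merge(epa_triples, community_triples)
for predicate, obj in describe(combined, "ex:facility/1001"):
    print(predicate, "->", obj)
```

Because both parties use the same identifier for the facility, the merged view shows the EPA's statements and the community's statements side by side.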
Then we produce scripts that convert CSV to RDF. These scripts can be run ANYTIME there is an update to the underlying CSV extract from the relational database that today stores the data.
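A conversion script of this kind might look roughly like the sketch below, which turns CSV rows into N-Triples using only the Python standard library. The column names (`id`, `name`, `state`), the base URI and the predicate URIs are assumptions made for the example; they are not taken from the actual FRS, SRS or TRI extracts or from the scripts used on the project.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI and predicates; a real project would take these from
# its URI policy and from the vocabularies chosen during modeling.
BASE = "http://example.org/id/facility/"
NAME = "http://example.org/def/name"
STATE = "http://example.org/def/state"


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text):
    """Convert CSV text with columns id, name, state into N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Mint one URI per record from its identifier column.
        subject = "<" + BASE + quote(row["id"], safe="") + ">"
        name = escape_literal(row["name"])
        state = escape_literal(row["state"])
        lines.append(f'{subject} <{NAME}> "{name}" .')
        lines.append(f'{subject} <{STATE}> "{state}" .')
    return lines


sample = "id,name,state\n1001,Example Chemical Plant,VA\n"
for line in csv_to_ntriples(sample):
    print(line)
```

Since the function depends only on the CSV text, it can be re-run unchanged whenever a fresh extract arrives.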
So let’s review the entire process for producing Linked Open Data, and we’ll show you what the UI looks like next... OEI has followed this process with 3 different data sets of varying complexity, size and data quality. Each data set was published on an interim cloud server on Amazon EC2 with part time involvement by several EPA staff and a couple of contractors within 60 days. See http://usepa.3roundstones.net. We expect the System Security Plan for the production data platform to be approved this summer & we’ll host as much Linked Data as EPA produces.
The data platform landscape is emerging. Data.gov is using Socrata for 1-3 star data. We felt it was important to avoid vendor lock-in from proprietary formats & ETL processes, so we chose a Web Standards compliant, open source platform specializing in 4 & 5 star data. It’s commercially supported and available via the cloud.
Once we had the data modeled and validated with SMEs, we converted it and loaded it into Callimachus. We spent about 1 hour creating templates to view the data in Callimachus. So here is the power of LOD in action -- within one hour, we could view the data, navigate through the data and verify the contents without being a DBA or Java developer!
A designer with CSS skills can help us make it look pretty with a nice CSS theme. Thus, Web developers with HTML, CSS and RDFa / SPARQL skills can create data driven Web applications. No understanding of semantics or deep RDF knowledge is required.
Callimachus’ forms driven interface allows authorized users to modify the underlying triples in the database -- we are round tripping create/modify/delete to a triple store via a Web page!
This is an example of an application that was created in less than 3 days by a Web developer using Callimachus. The data sources included EPA FRS, SRS and TRI Linked Data, spreadsheet data from Abt Associates on corporate ownership (as CSV), and OpenStreetMap content from the Web (Linked Data Cloud).
If you have permissions, you can edit the underlying data stored in the database (an RDF triple store). Several different triple stores are supported by Callimachus. A triple store is effectively just a “library” to Callimachus -- as long as it supports the data standards (RDF, SPARQL), it doesn’t matter which one is used.
Note the fixed name and added comment.
A history of changes is kept. Note the change to the name and the added comment, along with the time/date and name of the user who made the edit.
If you’re interested in the maturity of the RDF family of standards, here is the technology “layer cake”. The data exchange standards are mature and well defined. The world’s leading technology companies are supporting RDF in their products, including Oracle 11g, IBM DB2 and EMC. The world’s leading search engines, including Google, Yahoo!, Bing (Microsoft) and Yandex, are displaying content with RDF (RDFa & RDFa Lite). That is why we’re joining leading governments worldwide to publish our valuable content as LOD.
Your ability to move into the future will be ensured by publishing data to the Web. Use data exchange standards. Define a URI policy, document it and help people to comply. Leverage existing vocabularies. Despite what you think, you are probably talking about many of the same objects as someone else (people, organizations, assets, scientific terms, etc.), so use a shared vocabulary to realize the benefits of Linked Data.
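As a rough illustration of these two recommendations, the sketch below checks URIs against a written-down pattern and maps local column names onto terms from shared vocabularies. The `example.org` URI pattern and the column names are hypothetical, not EPA policy; the FOAF and Dublin Core term URIs are real, widely used vocabulary terms.

```python
import re

# Hypothetical URI policy: every identifier follows
# http://example.org/id/<type>/<local-id>, with a lowercase type
# and a URL-safe local id.
URI_POLICY = re.compile(r"http://example\.org/id/[a-z]+/[A-Za-z0-9_-]+")

# Hypothetical local column names mapped to existing shared vocabulary terms
# (FOAF and Dublin Core Terms) instead of newly invented predicates.
SHARED_TERMS = {
    "org_name": "http://xmlns.com/foaf/0.1/name",
    "homepage": "http://xmlns.com/foaf/0.1/homepage",
    "last_updated": "http://purl.org/dc/terms/modified",
}


def complies(uri):
    """Return True if the whole URI matches the documented pattern."""
    return URI_POLICY.fullmatch(uri) is not None


def predicate_for(column):
    """Return the shared term for a column, or None if it still needs modeling."""
    return SHARED_TERMS.get(column)


print(complies("http://example.org/id/facility/1001"))
print(complies("http://example.org/Facility?id=1001"))
print(predicate_for("org_name"))
```

A check like this can run as part of the conversion step, so that non-compliant URIs and unmapped columns are caught before data is published.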
This presentation is licensed under a Creative Commons BY-SA license, allowing you to share and remix its contents as long as you give us attribution and share alike.