This document summarizes a presentation on implementing digital provenance on the World Wide Web using semantic web technology. The presentation introduces digital provenance, discusses use cases, reviews the current state of the art, and considers tool development approaches. Digital provenance tracks the origin of data and the changes made to it over time, which lets users make trust decisions. The W3C is working to define standards to represent and exchange provenance information. Semantic web ontologies are well suited to capturing complex provenance metadata and linking it to related data. Open source, standards-compliant tools are needed to generate and manage provenance information.
1. Implementing Digital Provenance on the World Wide Web Using Semantic Web Technology
Gregory Joiner*, Douglas Reid
Raytheon BBN Technologies
{gjoiner,dreid}@bbn.com
June 9th, 2011
2. First…Some Administrivia!
• Updated slides are located on SlideShare at:
http://slidesha.re/lqCHWd
• Presentation is not “Technical – Intermediate.”
– I wanted to reach the maximum number of users
– There was not enough time to provide both an overview and
technical instruction.
• Feel free to interrupt me anytime with questions!
3. Goals of this Talk
• Learn what digital provenance is
• Understand why it is important
• Know what is currently being done by whom
• Have a starting point for implementing provenance
in your semantic web applications
• Be passionate about digital provenance!
4. Agenda
• Part 1: An Introduction to Digital Provenance
– What is Digital Provenance
– National Cyber Leap Year Summit
• Part 2: Digital Provenance Use Cases
– Everyday Web Browsing
– Contradictory, Time-Sensitive Information
– Closed Network Provenance
• Part 3: Where Are We Now?
– W3C Provenance Work
– Review of the Current State-of-the-Art
• Part 4: Digital Provenance Tool Development
– Why SemWeb is Perfect for Digital Provenance
– Open Source and Standards Compliance
– Securing Provenance Metadata
– Additional Design Considerations
5. Part 1: AN INTRODUCTION TO DIGITAL PROVENANCE
6. What is Digital Provenance
• Provenance is defined by Webster’s Dictionary as “the
origin or source of something” – mainly pertaining to art
or architectural artifacts
• Digital Provenance is metadata that establishes the
chain-of-custody information needed for users to make
trust decisions about digital data
• Digital Provenance Metadata can describe any type of
electronic data at any granularity level from entire web
sites to single files to even individual assertions within a
webpage or document
7. What is Digital Provenance
Types of Digital Provenance Metadata include:
• Bibliographical Information – Provides a list of all of the sources
behind a document or assertion
• Chain-of-Custody Information – Provides a history of the different
people and/or systems that have handled the document or assertion
• Proof / Justification Information – Documents the logical steps
followed to make an assertion
• Trust Information – Provides a quantifiable metric to measure and
compare the trustworthiness of one document or assertion to
another.
8. National Cyber Leap Year Summit
• Convened in 2009 as a response to
the President’s call to secure the
nation’s cyber infrastructure and
charged with identifying the “game-
changing” technologies needed to
secure cyberspace
• Identified Digital Provenance as
one of those technologies because it
enables the identification,
authentication, and reputation of
entities and objects with appropriate
granularity at many layers of the
protocol hierarchy.
9. Part 2: DIGITAL PROVENANCE USE CASES
10. Everyday Web Browsing
• Scenario: People often rely on the
Internet for advice on important
subjects, such as health or finance, and
frequently make key decisions based on
web content alone. This is especially
true for mobile users who lack the
bandwidth and display room to
investigate the provenance on their own.
• Solution: By dynamically marking the
trustworthiness of web content, users
can quickly determine what data they
can trust so they can make more
informed decisions.
11. Contradictory, Time-Sensitive Information
• Scenario: When breaking news
happens, content re-publishers and end
users are often forced to choose
between contradicting information. For
example, after the tragic shooting in
Arizona in January 2011, some
websites claimed Rep. Giffords was
dead while others properly reported that
she was still alive.
• Solution: By providing a standard way
to view and compare the bibliographical
and chain-of-custody information of the
conflicting articles, users can make an
informed decision on which one to trust.
12. Closed Network Provenance
• Scenario: Even in a closed network,
users frequently have to decide
whether to trust existing content. This is
often the case within the Intelligence
Community and Department of Defense
where certain time-sensitive tasks allow
assumptions to be made that other
tasks cannot. For example, the use of
lethal force against a target requires
more concrete evidence than other,
more reversible actions.
• Solution: By providing analysts with a
complete list of the assumptions and
justifications behind a given assertion,
they can determine whether or not they
can use that assertion in their analysis.
13. Additional Use Cases
• License and Contract Compliance
• Public Policy Conformance
• Assigning Credit and Blame to Information
• Many more were identified by the W3C
Provenance Incubator Group and are located at:
http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases
14. Part 3: WHERE ARE WE NOW?
15. W3C Provenance Work
• Provenance Interchange Working Group
– Chartered through Oct 2012, based on Incubator Group’s findings
– Formed to “support the widespread publication and use of
provenance information of Web documents, data, and resources”
– Will publish Recommendations to define a language (PIL) for
exchanging provenance information among applications
• Provenance Interchange Language (PIL) design goals
– Be applicable to any resource
– Provide a low barrier to entry to facilitate widespread adoption
– Provide a small, extensible core model
– Draw from existing vocabularies and ontologies
• Deliverables
– Conceptual Model, Formal Model, Formal Semantics, Accessing
and Querying Provenance, XML Serialization, Best Practice Cookbook,
Primer
16. W3C’s work (cont.)
• Key Recommendations for PIL
– Standard way to represent, at a minimum, three basic entities
1. A handle (URI) to refer to an object
2. A person/entity that the object is attributed to
3. A processing step done by a person/entity to an object
– Mechanism to access provenance-related information addressed
by other standards
• Licensing information of an object
• Digital signature for the object
• Digital signature for the provenance records
– Standard way for sites to make provenance information about
their content available to other parties in a selective manner, and
for others to access that information
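As a rough illustration of the three basic entities listed above, the following Python sketch models a handle, an attributed agent, and a processing step. The class and field names (`ProvenanceRecord`, `ProcessingStep`, `attributed_to`) and the example URI are hypothetical; they are not taken from PIL or from any W3C draft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ProcessingStep:
    """A processing step done by a person/entity to an object (hypothetical shape)."""
    agent: str       # who performed the step
    action: str      # what was done, e.g. "edited"
    timestamp: str   # ISO 8601 text, kept as a string for simplicity


@dataclass
class ProvenanceRecord:
    """Minimal record covering the three basic entities."""
    handle: str          # URI that refers to the object
    attributed_to: str   # person/entity the object is attributed to
    steps: List[ProcessingStep] = field(default_factory=list)

    def add_step(self, agent: str, action: str, timestamp: str) -> None:
        self.steps.append(ProcessingStep(agent, action, timestamp))

    def agents(self) -> List[str]:
        """Everyone associated with the object, in order, without duplicates."""
        seen: List[str] = []
        for name in [self.attributed_to] + [s.agent for s in self.steps]:
            if name not in seen:
                seen.append(name)
        return seen


# Example data is invented for illustration only.
record = ProvenanceRecord("http://example.org/articles/42", "alice")
record.add_step("bob", "edited", "2011-06-01T10:00:00Z")
record.add_step("alice", "republished", "2011-06-02T09:30:00Z")
print(record.agents())
```

Licensing information and digital signatures are deliberately left out of the record, since the slide treats them as information addressed by other standards.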
17. Review of the Current State-of-the-Art
Representation
• Existing Provenance Vocabularies/Ontologies
– Dublin Core: “Librarian” vocabulary capturing bibliographical information.
– Provenir Ontology: Upper-level ontology for use in SemWeb applications
– Provenance Vocabulary: Captures data using the Linked Data principles
– Proof Markup Language (PML): “Full-Featured” interlingua that describes
basic provenance meta-data plus justification and trust information.
– Others: Changeset Vocabulary, PREMIS, SWAN Provenance Ontology,
Semantic Web Publishing Vocabulary, and WOT Schema
• Concrete mapping specified between existing ontologies
– The Open Provenance Model (OPM) was chosen as a reference
vocabulary since it is a general and broad model that
encompasses many aspects of provenance
– W3C Incubator Group formally encoded the mappings according to Simple
Knowledge Organization System (SKOS) vocabulary
18. Review of the Current State-of-the-Art
Implementation
• News aggregation scenario
– Content tracking (Memetracker, Spinn3r & BlogTracker, influence studies)
– Explicit provenance (trackbacks / pingbacks, Twitter’s Retweet)
– Licensing (Creative Commons, Google Books Rights Registry)
• Disease outbreak scenario
– Data provenance (human-readable changelogs, database research)
– Workflow provenance (Taverna/Pegasus, Inference Web, ZOOM)
– Justification for policy (ad-hoc user effort)
• Business Contract scenario
– Tracking design (VisTrails)
– Computer-aided Design (Design Rationale editor (DRed), IBIS software)
19. State-of-the-Art (cont.)
Gaps
• Content
– No mechanism to refer to the identity/derivation of an information object
– No guidance on granularity for description of complex objects
– No common standard for exposing/expressing provenance information
– No standard for versioning and publishing updates
– No standard to characterize suitability of provenance info for proof
• Management
– No standard for linking provenance between sites
– No guidance on combining existing standards to provide provenance
– No guidance for exposing provenance info on the Web
– No proven approaches to manage scale
– No standard way to ensure only essential non-confidential provenance is
released
20. State-of-the-Art (cont.)
More Gaps
• Use
– No clear understanding of how to relate provenance at different levels of
abstraction
– No general solutions to understand provenance published on the Web
– No standard to enable provenance integration/comparison
– No broadly applicable methodology for making trust judgments based on
provenance when presented with information of varying quality
– No existing mechanism to check compliance with laws, regulations or
contracts
– No means to resolve conflicts in provenance data
21. Part 4: DIGITAL PROVENANCE TOOL DEVELOPMENT
22. Why SemWeb is Perfect for Digital Provenance
• Semantic Web Technologies allow data to be shared and
reused in a manner that is more flexible and
easier to integrate than traditional knowledge representations.
• The Web Ontology Language (OWL) allows deeper
context to be encoded in the digital provenance metadata
which enables the capture of more complex information
in a standard, well specified format.
• With the provenance metadata in a machine-readable
format, powerful automated information processing
can be applied, which can provide additional provenance knowledge.
• By semantically tagging the digital provenance metadata,
it can be dynamically linked to supporting (or
contradicting) information to provide a more complete
chain-of-custody picture.
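To make the linking idea concrete, here is a minimal sketch that stores provenance as subject-predicate-object triples and follows the links to assemble a chain of custody. It uses plain Python tuples rather than an RDF library, and the `ex:` resources and the `ex:derivedFrom` / `ex:handledBy` predicates are invented for illustration; a real application would use terms from a published provenance vocabulary.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical predicate names, not drawn from any real vocabulary.
DERIVED_FROM = "ex:derivedFrom"
HANDLED_BY = "ex:handledBy"

triples: List[Triple] = [
    ("ex:blogPost", DERIVED_FROM, "ex:newsArticle"),
    ("ex:newsArticle", DERIVED_FROM, "ex:pressRelease"),
    ("ex:blogPost", HANDLED_BY, "ex:blogger"),
    ("ex:newsArticle", HANDLED_BY, "ex:reporter"),
    ("ex:pressRelease", HANDLED_BY, "ex:hospital"),
]


def sources(resource: str, graph: List[Triple]) -> List[str]:
    """Follow derivedFrom links transitively, nearest source first."""
    found: List[str] = []
    seen: Set[str] = {resource}
    frontier = [resource]
    while frontier:
        current = frontier.pop(0)
        for s, p, o in graph:
            if s == current and p == DERIVED_FROM and o not in seen:
                seen.add(o)
                found.append(o)
                frontier.append(o)
    return found


def custody_chain(resource: str, graph: List[Triple]) -> List[str]:
    """List the handlers of the resource and of everything it derives from."""
    chain: List[str] = []
    for node in [resource] + sources(resource, graph):
        chain.extend(o for s, p, o in graph if s == node and p == HANDLED_BY)
    return chain


print(custody_chain("ex:blogPost", triples))
```

Because every statement has the same three-part shape, new supporting or contradicting statements can be appended to the list without changing the traversal code.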
23. Why Digital Provenance is Perfect for SemWeb
Provenance helps complete the path to the top of the
Semantic Web layer cake and to TBL’s SemWeb nirvana.
24. Open Source and Standards Compliance
• As explained in the National Cyber Leap Year Summit’s Co-Chairs’
Report, establishing standards early on in the development process
is crucial to achieving the rapid, widespread community acceptance that
is required for any digital provenance tool to be successful.
• Therefore, Digital Provenance tools should comply with and even
inform the emerging W3C standards discussed earlier in this
presentation.
• Furthermore, since digital provenance tools impose an additional
time burden on both content developers and end-users, they should
be available at little to no cost to further encourage acceptance.
25. Securing Provenance Metadata
• Provenance metadata that is not
signed or secured is susceptible
to tampering and therefore
cannot realistically be trusted.
• Confidentiality and integrity
controls that are consistent with
a wide variety of security models
are crucial to creating a
successful digital provenance
solution.
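The integrity half of this point can be sketched with Python's standard library. The example below uses an HMAC (a shared-secret message authentication code) as a stand-in for a true public-key digital signature, which would need a third-party cryptography library. It covers tamper detection only, not confidentiality. The record fields and the key are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize deterministically so equal records always yield equal bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical form of the record."""
    return hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify(record: dict, tag: str, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(record, key), tag)


# Demo key and record are invented for illustration; never hard-code real keys.
key = b"demo-key-not-for-production"
record = {
    "handle": "http://example.org/articles/42",
    "attributed_to": "alice",
    "action": "published",
}
tag = sign(record, key)

# Changing any field after signing makes verification fail.
tampered = dict(record, attributed_to="mallory")

print(verify(record, tag, key))
print(verify(tampered, tag, key))
```

Sorting the keys before hashing matters here: without a canonical serialization, two equivalent records could produce different tags.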
26. Additional Design Considerations
• It is crucial that any digital provenance tool
supports the creation, processing, and
rendering of digital provenance metadata at
all stages of the content creation
lifecycle.
• Since users will require provenance
information at many different levels of detail,
successful digital provenance tools will be
configurable to allow content creators and
users to create and view the metadata at
any granularity level.
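One possible way to support configurable granularity is to tag each provenance entry with a level and filter on it when rendering. The sketch below assumes three levels (site, document, assertion), following the examples given earlier in the talk. The `Granularity`, `Entry`, and `view` names and the sample entries are hypothetical.

```python
from enum import IntEnum
from typing import List, NamedTuple


class Granularity(IntEnum):
    """Coarse-to-fine levels; higher values mean more detail."""
    SITE = 1
    DOCUMENT = 2
    ASSERTION = 3


class Entry(NamedTuple):
    level: Granularity
    target: str
    note: str


def view(entries: List[Entry], max_detail: Granularity) -> List[Entry]:
    """Keep only the entries that are no finer than max_detail."""
    return [e for e in entries if e.level <= max_detail]


# Sample entries are invented for illustration only.
entries = [
    Entry(Granularity.SITE, "http://example.org/", "published by Example Org"),
    Entry(Granularity.DOCUMENT, "http://example.org/report", "written by alice"),
    Entry(Granularity.ASSERTION, "http://example.org/report#claim-3",
          "sourced from a press release"),
]

for entry in view(entries, Granularity.DOCUMENT):
    print(entry.level.name, entry.target, "-", entry.note)
```

A casual reader might request `Granularity.SITE`, while an analyst checking a single claim would request `Granularity.ASSERTION` and see every entry.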
27. Key Takeaways
• Provenance is key to the future success of the Web and
is the final piece of the Semantic Web puzzle.
• The U.S. government has identified digital provenance
as one of the important “game changing” cyber security
technologies.
• Important W3C work is already underway.
• You can start thinking about and incorporating
provenance in your application right now.
28. For More Information
• Authors
– Greg Joiner, gjoiner@bbn.com, 703-284-1259
– Douglas Reid, dreid@bbn.com, 703-284-1291
• National Cyber Leap Year Report
– Co-Chairs Report: http://bit.ly/6NO05g
– Participants’ Ideas Report: http://bit.ly/7HmjQ8
• W3C Provenance Interchange Working Group
– www.w3.org/2011/prov