"At the toolbar (menu, whatever) associated with a document there is a button marked "Oh, yeah?". You press it when you lose that feeling of trust. It says to the Web, 'so how do I know I can trust this information?'. The software then goes directly or indirectly back to metainformation about the document, which suggests a number of reasons."
Tim Berners-Lee, W3C Chair, Web Design Issues, September 1997
Provenance is focused on the description and understanding of where and how data is produced, the actors involved in the production of such data, and the processes by which the data was manipulated and transformed until it arrived to the collection from which it is being accessed. Provenance aims at providing the ability to trace the sources of data, enabling the exploration not just of the relationships between datasets, but also of their authors and affiliations, with the goal of preserving data ownership and establishing a notion of trust based on authenticity and reliability.
The Future Internet poses important challenges for provenance, derived from complex and rich scenarios characterized by the presence of large amounts of data stemming from heterogeneous sources like user communities, services, and things. Such challenges span across technical but also socioeconomic dimensions. The former includes aspects like vocabularies for representing provenance, interoperability and scalability issues, and means to produce, acquire, and reason with provenance in order to provide measures of trust and information quality. However, it is probably in the socieconomic dimension where more significant efforts need to be made as to addressing issues like the role of provenance in the overall picture of the Future Internet, entry barriers preventing the generation of provenance-aware internet content, means required to incentivate the production of such content, and ways to prevent provenance forgery.
In this talk, we provide and overview on provenance and the above mentioned challenges and introduce ongoing work in order to address trust issues from the provenance perspective in the Future Internet. We also link provenance to other relevant aspects for trust discussed in the session, like security, legal frameworks, and economics.
1. iSOCO
Provenance and Trust
José Manuel Gómez Pérez
Fundations of Trust in the Future Internet
Future Internet Assembly
Valencia, 15th April 2010
W3C Provenance Incubator Group
2. Why provenance is important
For the Web Architecture
- "At the toolbar (menu, whatever) associated with a document there is a button
marked "Oh, yeah?". You press it when you lose that feeling of trust. It says to
the Web, 'so how do I know I can trust this information?'. The software then
goes directly or indirectly back to metainformation about the document, which
suggests a number of reasons.“-- Tim Berners-Lee, W3C Chair, Web Design
Issues, September 1997
For Linked Data
- "Provenance is the number one issue we face when publishing government
data as linked data for data.gov.uk” -- John Sheridan, UK National Archives,
data.gov.uk, February 2010
For Science
- "We need a paradigm that makes it simple [...] to perform and publish
reproducible computational research. [...] A Reproducible Research
Environment (RRE) [...] provides computational tools together with the ability
to automatically track the provenance of data, analyses, and results and to
package them (or pointers to persistent versions of them) for redistribution."
W3C Provenance Incubator Group
2
3. Provenance is…
Records of
Sources of information,
including entities and
processes, involved in
producing or delivering an
artifact
History of subsequent
owners (change of custody)
W3C Provenance Incubator Group
3
4. Provenance is…
Evidence of authenticity, integrity,
and information quality
Certifies products of good process
W3C Provenance Incubator Group
4
5. Provenance is…
Valuable
Hard to collect and verify
Necessary to assign credit
…and blame
i.e. to establish
Trust
W3C Provenance Incubator Group
5
6. Provenance representation
Provenance models define
- Types of provenance elements
- Relationships between them
Provenance represented as a graph
- Nodes: Pieces of provenance information (Artifacts,
Processes, Agents)
- Edges: relate provenance elements to each other
Additionally: roles, temporal, dependency, causal
dependency
The Open Provenance
Model (OPM)
W3C Provenance Incubator Group
6
7. Provenance-related vocabularies
General Provenance-specific
DC – Dublin Core Metadata OPM - Open Provenance Model
Terms Provenir
FOAF – Friend of a Friend PRV - Provenance Vocabulary
SIOC – Semantically-Interlinked PML - Proof Markup Language
Online Communities
DC - Dublin Core
SWP – Semantic Web Publishing
PREMIS
vocabulary
WOT – Web Of Trust schema
VOID – VOcabulary of Interlinked
Datasets PAV - SWAN Provenance
Ontology
SWP - Semantic Web Publishing
Vocabulary
CS - Changeset Vocabulary
W3C Provenance Incubator Group
7
8. A snapshot on provenance
Uses Research areas Domains
• Meaningful data • Scientific workflows • Science
integration
• Databases (where and • Open Government and
• Establishing trust when why provenance) Linked Data
information sources are
diverse and of varying • Security (Information • Web search & use
quality e.g. in the Web flow and trust)
• Cultural heritage
• Providing justification to • Justification and
the conclusions of Argumentation in KRR
reasoning processes • Licensing and attribution
• Information Retrieval • Manufacturing
• Attribution (who did (analysis of information
what) sources)
• Enabling comparison
and reproducibility of
processes
W3C Provenance Incubator Group
8
9. The Linked Data paradigm
How can we exploit
Tim Berners Lee, 2006 (Design Issues)
all the available
data? 1. Use URIs to identify things
- Anything, not just documents
2. Use HTTP URIs for people to
Data reuse and remix lookup such names
Common flexible and usable APIs - Globally unique names
Standard vocabularies to - Distributed ownership
describe interlinked datasets 3. Provide useful information in RDF
Tools upon URI resolution
Realize the Semantic Web vision 4. Include RDF links to other URIs
- Enable discovery of related
information
W3C Provenance Incubator Group
9
10. Provenance is key in Linked Data
Data Lineage
- The origins of data
- Related artifacts and actors
Information quality
- Timeliness
- (Semantic) consistency between
datasets
- Stable and meaningful data links
Trust
- Data authenticity
- Reliability
W3C Provenance Incubator Group
10
11. Provenance is key in Open Government
Open Government Initiative (http://www.whitehouse.gov/open)
- Release quickly, improve later
- NSF to develop plan (http://www.nsf.gov/open):
“NSF is developing an Open Goverment Plan, which will serve as the roadmap for our plans
to improve transparency, better integrate public participation and collaboration into our core
mission, and become more innovative and efficient.”
Similar initiative in the UK (data.gov.uk)
"Provenance is the number one issue we face when publishing government data as linked
data for data.gov.uk” -- John Sheridan, UK National Archives
Why is provenance so important
- Government data comes from very diverse data sources
- Varying quality
- Different scope
- Different assumptions
W3C Provenance Incubator Group
11
12. Large amounts of data and processes
In 2006, the size of the digital universe
Source: IDC
was estimated in 161 exabytes
3 million times the information in all
books ever written
In 2010, expected to turn 988
exabytes
…and all this data is potentially
published online
W3C Provenance Incubator Group
12
13. It’s all about producing and consuming information…
Automated means
to evaluate
Information Quality
and Trust are
needed!
W3C Provenance Incubator Group
13
14. Provenance and Trust in the Web: A real-life example
• Two fake web sites
Reusing web data without the means that allow • A fake Wikipedia entry
contrasting its provenance can be harmful, • Fake California public
especially in sensitive domains.
safety phone numbers
• Fake local TV station
The hoax caused a 1000-word tome
on Frankfurter Allgemeine Zeitung…
and public apologies from DPA
Trust on Wikipedia misled DPA
In a provenance-aware world, DPA
would have had means based on
data provenance to evaluate trust
- Bluewater did not exist
- The Berlin Boys do not exist
W3C Provenance Incubator Group
14
15. Trust evaluation: Useful provenance questions
Who created that content
(author/attribution)?
Was the content ever manipulated,
if so by what processes/entities?
Who is providing that content
(repository)?
Can any of the answers to these
questions be verified e.g., through
e-signatures?
W3C Provenance Incubator Group
15
16. Immediate need for provenance at W3C
Trust-driven Others
Web of trust Reasoning
- Making trust judgments - Attribution of assertions from
based on provenance diverse sources
Social trust Linked data
- Attribution, authority, - Use of conflicting data of
propagation varying degrees of quality
Social web Life sciences and e-Science
- Privacy and use policies of at large
sensitive (personal) data - Reproducibility of scientific
results
W3C Provenance Incubator Group
16
17. W3C Provenance Group: Charter and goals of the incubator group
Provide state-of-the-art understanding and develop a
roadmap for development and possible standardization
Articulate requirements for accessing and reasoning about
provenance information
- Develop use cases
Identify issues in provenance that are direct concern to the
Semantic Web
- Articulate relationships with other aspects of Web architecture
Report on state-of-the-art work on provenance
Report on a roadmap for provenance in the Semantic Web
- Identify starting points for provenance representations
- Identifying elements of a provenance architecture that would benefit
from standardization
W3C Provenance Incubator Group
17
18. W3C Provenance Group: Products of the group to date
Group formed in September 2009
- All information is public: http://www.w3.org/2005/Incubator/prov/wiki
Developed a set of key dimensions for provenance (11/09)
- Grouped into three major categories: content, management, use
Developed use cases for provenance (12/09)
- More than 30 use cases, most were improved and curated
Developed requirements for provenance that arise from the
use cases (1/10)
- User requirements: what is the purpose/use of the provenance
information
- Technical requirements: derived from the user requirements
Currently developing state-of-the-art report (expected 6/10)
W3C Provenance Incubator Group
18
19. Scientific and technical challenges of provenance
Provenance information needs to be :
Represented
Captured and recorded
Stored and secured, queried, and reasoned about
Visualized and browsed
W3C Provenance Incubator Group
19
20. Scientific and technical challenges of provenance (II)
Vocabularies for representation of provenance content
- Need representations of processes (workflows), entities,roles, data collections,
meta-assertions, etc.
- The Open Provenance Model (OPM): http://twiki.ipaw.info/bin/view/OPM
Granularity of provenance records
- How much detail is useful, manageable/scalable in practice?
• Size of provenance can be orders of magnitude larger than base data
Provenance evaluation for Information Quality and Trust Management
Evolution and updates
- Shelf timeliness of data
• Determine when data becomes obsolete based on provenance info
- Versioning of data sources
• Relate updates of data based on provenance info
- Reproducibility of processes
Provenance-aware visualization, navigation, and resource consumption
W3C Provenance Incubator Group
20
21. Scientific and technical challenges of Provenance & Trust
Policies based on provenance information
- Association-based policies
• Source is NYT, source cites NYT
• Source is cited in Wikipedia
- Bias-based policies
• Source is an oil company
- Distrust policies
• Source is a blog
Policies may be restricted to a context
- Topic of search, topics of page, tags of page
Trust policies may be shared across users
- Like bookmarks in del.icio.us
W3C Provenance Incubator Group
21
22. Socioeconomic challenges
How to incentivize provenance take-up in the Web
architecture so that content and service providers move
into a provenance-aware paradigm?
- Generation of provenance metadata by content providers
- Consumption of provenance metadata by infrastructure and
applications like search engines, browsers, etc.
Objective evidence of trust in online data or service allows
- Ranking increase in search engines
- Attracting internet traffic
- Increasing automation in data and service consumption
But, can you can trust such provenance information itself?
- Non-forgery of provenance metadata
• Authoritative agencies
W3C Provenance Incubator Group
22
23. José Manuel Gómez-Pérez
R&D Director
T +34913349778
Thanks for
M +34609077103
jmgomez@isoco.com your attention!
iSOCO
Para obtener más información sobre como puede
ayudar a su empresa a optimizar sus negocios digitales y aportar
una solución innovadora, contáctenos en
www. .com
Barcelona Madrid Valencia
Tel +34 93 5677200 +34 91 3349797 +34 96 3467143
Edificio Testa A C/Pedro de Valdivia, 10 Oficina 107
C/ Alcalde Barnils 64-68 28006 Madrid C/ Prof. Beltrán Báguena 4,
St. Cugat del Vallès 46009 Valencia
08174 Barcelona
W3C Provenance Incubator Group
23