Hackathon s pb

Data on the Semantic Web
Peter Mika
Senior Research Scientist
Yahoo! Research

Vague, but exciting… Berners-Lee and the dawn of the Web

-2-

Semantic Web

• Publish information in a way that is easier to process for machines
• Web of Data instead of Web of Documents
• Two main architectural challenges
– A common format for sharing data
– Sharing the meaning of data
• Through social means (shared schemas)
• By using powerful schema languages
• Semantic Web standards from W3C
– Languages (RDF, OWL, RIF)
– Serializations (RDF/XML, RDFa)
– Protocols (SPARQL, HTTP)
• Semantic Web research into knowledge representation and
reasoning, data integration, data quality and many other topics
• Community efforts to publish data and develop schemas
-3-

Resource Description Framework (RDF)

• Each resource (thing, entity) is identified by a URI
– Globally unique identifiers
• RDF represents knowledge as a set of triples
– Each triple is a single fact about the entity (an attribute or a
relationship)
• A set of triples forms an RDF graph

RDF document (diagram), written as triples:
example:roi type foaf:Person
example:roi name “Roi Blanco”
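The two facts in the diagram can be written down as plain (subject, predicate, object) tuples. This is only a minimal sketch: the `http://example.org/` namespace is a hypothetical stand-in for the `example:` prefix, and the code does not distinguish URIs from literals the way a real RDF library would.

```python
# A graph is modelled here as a set of (subject, predicate, object) triples.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace standing in for "example:"

graph = {
    (EX + "roi", RDF + "type", FOAF + "Person"),
    (EX + "roi", FOAF + "name", "Roi Blanco"),
}


def objects(graph, subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


names = objects(graph, EX + "roi", FOAF + "name")
print(names)
```

Because the graph is a set, adding the same fact twice has no effect, which mirrors how RDF graphs behave.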
-4-

Linking across the Web
(Diagram: an RDF graph spanning three sources: Roi’s homepage, the Friend-of-a-Friend ontology, and Yahoo!’s website. Nodes: example:roi, foaf:Person, #roi2, #peter, “Roi Blanco”, “pmika@yahoo-inc.com”. Edge labels: type, name, knows, sameAs, worksWith, email.)
-5-

Vocabularies (ontologies)

• Ontologies are collections of classes and properties used to
describe objects in a particular domain
– OWL (the Web Ontology Language) is the standard ontology
language
– OWL has an RDF serialization: ontologies are part of the
Semantic Web
• Classes can be described by their sub- and superclasses and
required properties
– Class membership in RDF is expressed using the rdf:type
property
– An instance can have multiple classes (types)
– A class can have multiple superclasses
• Properties can be described by their domain, range,
cardinalities, etc.
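A small sketch of how multiple types and multiple superclasses combine: the types of an instance are its asserted rdf:type values plus everything reachable through subclass links. The classes `ex:Researcher` and `ex:Employee` and the instance `ex:roi` are made up for illustration, and this covers only subclass inference, not full OWL reasoning.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical data: ex:Researcher and ex:Employee are invented classes.
triples = {
    ("ex:roi", RDF_TYPE, "ex:Researcher"),
    ("ex:roi", RDF_TYPE, "ex:Employee"),
    ("ex:Researcher", SUBCLASS_OF, "foaf:Person"),
    ("ex:Employee", SUBCLASS_OF, "foaf:Person"),
    ("foaf:Person", SUBCLASS_OF, "foaf:Agent"),
}


def superclasses(triples, cls):
    """All direct and inherited superclasses of cls (transitive closure)."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                todo.append(o)
    return found


def types_of(triples, instance):
    """Asserted types of an instance plus the types implied by subclassing."""
    direct = {o for s, p, o in triples if s == instance and p == RDF_TYPE}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(triples, cls)
    return direct | inferred


roi_types = types_of(triples, "ex:roi")
print(sorted(roi_types))
```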
-7-

Example: schema.org

• Agreement on a shared set of schemas for common types of
web content
– Bing, Google, and Yahoo! as initial supporters
– Similar in intent to sitemaps.org (2006)
• Use a single format to communicate the same information to all
three search engines
• Support for microdata
• schema.org covers areas of interest to all search engines
– Business listings (local), creative works (video), recipes,
reviews
– User defined extensions
• Each search engine continues to develop its products

-8-

Documentation and OWL ontology

-9-

Data on the Web

• Most pages on the Web are generated from structured
data
– Data is stored in relational databases (typically)
– Queried through web forms
– Presented as tables or simply as unstructured text
• The structure and semantics (meaning) of the data are not
directly accessible to search engines
• Two solutions
– Extraction using Information Extraction (IE) techniques
(implicit metadata)
• Supervised vs. unsupervised methods
– Relying on publishers to expose structured data using standard
Semantic Web formats (explicit metadata)
• Particularly interesting for long tail content
- 11 -

Information Extraction methods

• Named Entity Recognition (NER) and Disambiguation
(NED)
– OpenCalais, Zemanta API, DBpedia Spotlight
– Yahoo! Placemaker

• Extraction of structured data from text
– Yago system (demo)
• Exploiting patterns in web page structure
– Dapper
– ScraperWiki
• Extraction from HTML tables
– Google Squared (deprecated)

- 12 -

Publishing and consuming data on the Semantic Web

• Publishing data involves
– Deciding in which format to publish your data
– Deciding which schema (ontology, vocabulary) to use
• OR you can create a new schema and publish it as well

• Multiple ways of publishing RDF data:
1. Linked Data
2. Metadata in HTML
3. SPARQL endpoints
4. Feeds, e.g. OData

Note: you may implement more than one

- 13 -

Option 1: Linked Data

• A web of RDF documents in parallel to the current Web
– Most often implemented as wrappers around databases or APIs
• The four rules of Linked Data:
– Use URIs to identify things.
– Use HTTP URIs so that these things can be referred to and
looked up ("dereferenced") by people and user agents.
– Provide useful information about the thing when its URI is
dereferenced, using standard formats such as RDF/XML.
– Include links to other, related URIs in the exposed data to
improve discovery of other related information on the Web.
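Dereferencing in practice usually means an HTTP GET with an Accept header asking for an RDF format. The sketch below assumes the server supports content negotiation for `application/rdf+xml`; the DBpedia URI is only an example, and `dereference` is defined but not called because it needs network access.

```python
import urllib.request


def build_rdf_request(uri):
    """Build (but do not send) an HTTP GET that asks for RDF/XML."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


def dereference(uri, timeout=10):
    """Fetch the RDF description of a thing. Requires network access."""
    with urllib.request.urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example URI, chosen only for illustration.
request = build_rdf_request("http://dbpedia.org/resource/Budapest")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

A browser asking for `text/html` at the same URI would typically be sent to a human-readable page instead, which is how one URI can serve both people and user agents.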

(Diagram: several linked RDF documents describing the resources #PeterM, #Bud, and #Hun, with the literals “Peter Mika”, “Budapest”, and “2,000,000” and the edge labels label, born, population, and capital-of.)

- 14 -

Option 1: Linked Data

• Advantages:
– No change to the publishing of the HTML documents
– Data can be published by a third party (e.g. DBpedia)
• Disadvantages:
– Web servers need to be configured to properly handle URIs that
identify concepts instead of documents
– Not favored by search engines
• Lack of use cases
• Crawling needs to be changed
• Authority is difficult to determine
• Tools
– Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
– RDB-to-RDF mappers (e.g. D2RQ, Triplify)
– Validators (Vapour)
– Linked Data browsers (many)
- 15 -

Growth of Linked Data

• Community effort to (re)publish open datasets as Linked
Data
– In particular, scientific and government datasets
– see linkeddata.org, the Data Hub

- 16 -

Option 2: Metadata in HTML

• Using microformats, RDFa, Microdata (more later)
• Advantages:
– Data and document are always in sync
– Browser plug-in friendly
– Search engine friendly
– Copy-paste friendly
• Tools:
– Any23 (Anything to Triples)
– RDFaCE
– RDFa Distiller

(Diagram: the sentence “Peter Mika was born in Budapest.” in an HTML page, annotated with an RDF graph over #PeterM, #Bud, and #Hun; literals “Peter Mika”, “Budapest”, “2,000,000”; edge labels label, born, population, capital-of.)
- 17 -

Example: Facebook’s Open Graph Protocol

• RDF vocabulary to be used in conjunction with RDFa
– Simplify the work of developers by restricting the freedom in RDFa
• Activities, Businesses, Groups, Organizations, People, Places,
Products and Entertainment
• Only HTML <head> accepted
• http://opengraphprotocol.org/

<html xmlns:og="http://opengraphprotocol.org/schema/">
  <head>
    <title>The Rock (1996)</title>
    <meta property="og:title" content="The Rock" />
    <meta property="og:type" content="movie" />
    <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
    <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
    …
  </head>
  ...
- 18 -
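On the consuming side, Open Graph properties can be read with an ordinary HTML parser. This is a rough sketch using only the Python standard library: the class name `OpenGraphParser` is made up, and it matches on the literal `og:` prefix instead of resolving the namespace declaration as a full RDFa processor would.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed through this method.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]


PAGE = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
</html>"""

parser = OpenGraphParser()
parser.feed(PAGE)
print(parser.properties)
```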

Current state of metadata on the Web

• 31% of webpages, 5% of domains contain some
metadata
– Analysis of the Bing Crawl (US crawl, January, 2012)
– RDFa is the most common format
• By URL: 25% RDFa, 7% microdata, 9% microformat
• By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
– Adoption is stronger among large publishers
• Especially for RDFa and microdata
• See also
– P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus,
LDOW 2012
– H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured
Data from Two Large Web Corpora, LDOW 2012

- 19 -

Exponential growth in RDFa data

(Chart: percentage of URLs with embedded metadata in various formats)
• Five-fold increase between March 2009 and October 2010
• Another five-fold increase between October 2010 and January 2012
- 20 -

Option 3: SPARQL endpoints

• An API for accessing RDF databases on the Web
– A query language and an HTTP protocol
• Advantages:
– Flexible access: make any query you want
– Also possible to expose a traditional RDBMS via a wrapper
• Disadvantages:
– For the publisher: cost of supporting arbitrary queries
– For the search engine: discovery of SPARQL servers is unsolved
• Tools:
– Triple stores
• Sesame, Jena, OWLIM, Redland, Oracle, Virtuoso, Stardog etc.
– RDB-to-RDF mappers such as D2RQ and Triplify

(Diagram: the RDF graph over #PeterM, #Bud, and #Hun served by the endpoint; literals “Peter Mika”, “Budapest”, “2,000,000”; edge labels label, born, population, capital-of.)
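Since the protocol is plain HTTP, a query can be sent as a GET request with the query text in a `query` parameter. The sketch below only builds the URL and does not contact any server; the DBpedia endpoint and the Budapest query are illustrative choices, not something the slides prescribe.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "https://dbpedia.org/sparql"  # example endpoint, an assumption

QUERY = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Budapest>
      <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}
"""


def sparql_get_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET URL (query sent as 'query=')."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
# Decoding the URL again recovers the exact query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```

A client would normally also send an Accept header such as `application/sparql-results+json` to choose the result format.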

- 21 -

Example: DBpedia

• demo

- 22 -

Crawling the Semantic Web

• Linked Data
– Similar to HTML crawling, but the crawler needs to parse
RDF/XML (and others) to extract URIs to be crawled
– Semantic Sitemap/VoID descriptions
• RDFa
– Same as HTML crawling, but data is extracted after crawling
– Mika et al. Investigating the Semantic Gap through Query Log
Analysis, ISWC 2010.
• SPARQL endpoints
– Endpoints are not linked, need to be discovered by other
means
– Semantic Sitemap/VoID descriptions
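For the Linked Data case, the link-extraction step of a crawler might look like the sketch below. It assumes RDF/XML input with absolute URIs in `rdf:about` and `rdf:resource` attributes, ignores relative URIs and `xml:base`, and uses a made-up `example.org` document.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical RDF/XML document used as crawler input.
DOCUMENT = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/people#roi">
    <foaf:name>Roi Blanco</foaf:name>
    <foaf:knows rdf:resource="http://example.org/people#peter"/>
  </foaf:Person>
</rdf:RDF>"""


def extract_uris(rdf_xml):
    """Collect absolute URIs from rdf:about and rdf:resource attributes."""
    root = ET.fromstring(rdf_xml)
    uris = set()
    for element in root.iter():
        for attribute in ("about", "resource"):
            value = element.get("{%s}%s" % (RDF_NS, attribute))
            if value and value.startswith(("http://", "https://")):
                uris.add(value)
    return sorted(uris)


uris = extract_uris(DOCUMENT)
# Fragments are stripped because the crawler fetches documents, not fragments.
documents = sorted({urldefrag(uri).url for uri in uris})
print(uris, documents)
```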

- 24 -

Data fusion

• Ontology (schema) matching
– Widely studied in Semantic Web research
• ontologymatching.org
• Entity resolution
– Finding links between datasets
– Tools: Silk, LIMES
• Blending
– Merging objects that represent the same real world entity and
reconciling information from multiple sources
• Cleaning
– Google Refine
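Blending can be sketched as grouping identifiers connected by sameAs links and pooling their properties. The records below reuse names from the earlier linking diagram, but which node carries which property, and the sameAs link between example:roi and #roi2, are assumptions made for illustration. Conflicting values are kept side by side instead of being reconciled, and every identifier in a link is assumed to appear in `records`.

```python
def blend(records, same_as):
    """Merge records whose identifiers are connected by sameAs links."""
    parent = {identifier: identifier for identifier in records}

    def find(x):
        # Follow parent pointers to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as:
        parent[find(a)] = find(b)

    merged = {}
    for identifier, properties in records.items():
        target = merged.setdefault(find(identifier), {})
        for key, value in properties.items():
            # Keep every distinct value; reconciliation is left out here.
            target.setdefault(key, set()).add(value)
    return merged


# Assumed example data, loosely based on the linking diagram.
records = {
    "example:roi": {"name": "Roi Blanco"},
    "#roi2": {"name": "Roi Blanco", "worksWith": "#peter"},
    "#peter": {"email": "pmika@yahoo-inc.com"},
}
same_as = [("example:roi", "#roi2")]

merged = blend(records, same_as)
print(merged)
```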

- 25 -

More info

• Ideas for hacks
– http://challenge.semanticweb.org/
– http://iswc2011.semanticweb.org/calls/linked-data-a-thon/
• Book
– Segaran, Evans and Taylor. Programming the Semantic Web.
O’Reilly, 2009.
• More tools
– Exhibit: faceted browsing and other visualizations
– http://www.dajobe.org/talks/200906-semtech-open/
– LOD2 stack (stack.lod2.eu)

- 26 -
