Data on the Semantic Web
Peter Mika
Senior Research Scientist
Yahoo! Research
Vague, but exciting… Berners-Lee and the dawn of the Web




Semantic Web

  • Publish information in a way that is easier to process for machines
  • Web of Data instead of Web of Documents
  • Two main architectural challenges
      – A common format for sharing data
      – Sharing the meaning of data
          • Through social means (shared schemas)
          • By using powerful schema languages
  • Semantic Web standards from W3C
      – Languages (RDF, OWL, RIF)
      – Serializations (RDF/XML, RDFa)
      – Protocols (SPARQL, HTTP)
  • Semantic Web research into knowledge representation and
    reasoning, data integration, data quality and many other topics
  • Community efforts to publish data and develop schemas
Resource Description Framework (RDF)

  • Each resource (thing, entity) is identified by a URI
      – Globally unique identifiers
  • RDF represents knowledge as a set of triples
      – Each triple is a single fact about the entity (an attribute or a
        relationship)
  • A set of triples forms an RDF graph

      [Figure: an RDF document containing two triples about one resource]
          example:roi   type   foaf:Person
          example:roi   name   “Roi Blanco”
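As a rough illustration of the triple model above, the two facts about example:roi can be held in a plain Python set. This is only a sketch: the rdf:type and FOAF namespace URIs are the standard ones, but the expansion of the `example:` prefix to `http://example.org/` and the helper name `objects` are assumptions made for this example.

```python
# An RDF graph modelled as a set of (subject, predicate, object) tuples.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"  # hypothetical expansion of the "example:" prefix

graph = {
    (EX + "roi", RDF_TYPE, FOAF + "Person"),    # a relationship to a class
    (EX + "roi", FOAF + "name", "Roi Blanco"),  # an attribute (literal value)
}


def objects(graph, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(objects(graph, EX + "roi", FOAF + "name"))
```

Because the graph is a set, adding the same fact twice has no effect, which matches the idea that an RDF graph is a set of triples.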
Linking across the Web
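  [Figure: one RDF graph spanning three sources]
   • Roi’s homepage: example:roi, with type foaf:Person and name “Roi Blanco”
   • Yahoo!’s website: #roi2 worksWith #peter; #peter has a type and
     the email “pmika@yahoo-inc.com”
   • Friend-of-a-Friend ontology: the class foaf:Person and the property knows
   • A sameAs link connects the homepage data to the data on Yahoo!’s website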
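One possible way to act on a sameAs link when combining data from two sites is to rewrite every identifier to a single canonical one before taking the union of the triples. The sketch below assumes that the sameAs link in the figure connects #roi2 to example:roi; the function names `canonical` and `merge` are invented for illustration.

```python
def canonical(term, same_as):
    """Follow sameAs links until reaching a term with no further alias."""
    seen = set()
    while term in same_as and term not in seen:
        seen.add(term)
        term = same_as[term]
    return term


def merge(graphs, same_as):
    """Union several graphs, rewriting aliased subjects and objects."""
    merged = set()
    for graph in graphs:
        for s, p, o in graph:
            merged.add((canonical(s, same_as), p, canonical(o, same_as)))
    return merged


homepage = {
    ("example:roi", "type", "foaf:Person"),
    ("example:roi", "name", "Roi Blanco"),
}
yahoo = {
    ("#roi2", "worksWith", "#peter"),
    ("#peter", "email", "pmika@yahoo-inc.com"),
}
same_as = {"#roi2": "example:roi"}  # assumed direction of the sameAs link

merged = merge([homepage, yahoo], same_as)
print(sorted(merged))
```

After merging, the worksWith fact from Yahoo!’s website is attached to the same node as the name from the homepage.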
Vocabularies (ontologies)

   • Ontologies are collections of classes and properties used to
     describe objects in a particular domain
      – OWL (the Web Ontology Language) is the standard ontology
        language
      – OWL has an RDF serialization: ontologies are part of the
        Semantic Web
   • Classes can be described by their sub- and superclasses and
     required properties
      – Class membership in RDF is expressed using the rdf:type
        property
      – An instance can have multiple classes (types)
      – A class can have multiple superclasses
   • Properties can be described by their domain, range,
     cardinalities, etc.
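The points about multiple types and multiple superclasses can be made concrete with a small sketch that computes every class an instance belongs to, given its asserted types and a subclass hierarchy. This is RDFS-style subclass inference only, not a full OWL reasoner. The `ex:` classes are hypothetical; foaf:Person being a subclass of foaf:Agent comes from the FOAF vocabulary.

```python
def all_types(instance, types, subclass_of):
    """Return the asserted types of an instance plus all of their superclasses."""
    result = set()
    stack = list(types.get(instance, ()))
    while stack:
        cls = stack.pop()
        if cls not in result:
            result.add(cls)
            stack.extend(subclass_of.get(cls, ()))
    return result


# An instance can have multiple types (hypothetical ex: classes).
types = {"example:roi": ["ex:Researcher", "ex:Hacker"]}

# A class can have multiple superclasses.
subclass_of = {
    "ex:Researcher": ["foaf:Person", "ex:Employee"],
    "foaf:Person": ["foaf:Agent"],
}

print(sorted(all_types("example:roi", types, subclass_of)))
```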
Example: schema.org

  • Agreement on a shared set of schemas for common types of
    web content
     – Bing, Google, and Yahoo! as initial supporters
     – Similar in intent to sitemaps.org (2006)
         • Use a single format to communicate the same information to all
           three search engines
  • Support for microdata
  • schema.org covers areas of interest to all search engines
     – Business listings (local), creative works (video), recipes,
       reviews
     – User defined extensions
  • Each search engine continues to develop its products


Documentation and OWL ontology




Sources of data
Data on the Web

  • Most web pages on the Web are generated from structured
    data
     – Data is stored in relational databases (typically)
     – Queried through web forms
     – Presented as tables or simply as unstructured text
  • The structure and semantics (meaning) of the data are not
    directly accessible to search engines
  • Two solutions
     – Extraction using Information Extraction (IE) techniques
       (implicit metadata)
         • Supervised vs. unsupervised methods
     – Relying on publishers to expose structured data using standard
       Semantic Web formats (explicit metadata)
         • Particularly interesting for long tail content
Information Extraction methods

   • Named Entity Recognition (NER) and Disambiguation
     (NED)
      – OpenCalais, Zemanta API, DBpedia Spotlight
      – Yahoo! Placemaker

   • Extraction of structured data from text
      – Yago system (demo)
   • Exploiting patterns in web page structure
      – Dapper
      – ScraperWiki
   • Extraction from HTML tables
      – Google Squared (deprecated)


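To give a feel for extraction from HTML tables, here is a minimal sketch using only the Python standard library. It assumes a very regular table: the first row is a header, the first column names the entity, and the remaining header cells can serve as predicates. Real pages are rarely this clean, and none of the tools listed above is claimed to work this way.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_to_triples(html):
    """Turn a header row plus data rows into (subject, predicate, object) triples."""
    parser = TableRows()
    parser.feed(html)
    header, *body = [[cell.strip() for cell in row] for row in parser.rows if row]
    # The first column is treated as the subject of every triple in its row.
    return [(row[0], pred, value)
            for row in body
            for pred, value in zip(header[1:], row[1:])]


html = """<table>
<tr><th>city</th><th>population</th><th>capital-of</th></tr>
<tr><td>Budapest</td><td>2,000,000</td><td>Hungary</td></tr>
</table>"""

triples = table_to_triples(html)
print(triples)
```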
Publishing and consuming data on the Semantic Web

  • Publishing data involves
     – Deciding in which format to publish your data
     – Deciding which schema (ontology, vocabulary) to use
         • OR you can create a new schema and publish it as well


  • Multiple ways of publishing RDF data:
     1. Linked Data
     2. Metadata in HTML
     3. SPARQL endpoints
     4. Feeds, e.g. OData


     Note: you may implement more than one

Option 1: Linked Data

   • A web of RDF documents in parallel to the current Web
      – Most often implemented as wrappers around databases or APIs
   • The four rules of Linked Data:
      – Use URIs to identify things.
      – Use HTTP URIs so that these things can be referred to and
        looked up ("dereferenced") by people and user agents.
      – Provide useful information about the thing when its URI is
        dereferenced, using standard formats such as RDF/XML.
      – Include links to other, related URIs in the exposed data to
        improve discovery of other related information on the Web.

      [Figure: the same small RDF graph repeated across several linked
       documents. Nodes: #PeterM, #Bud, #Hun. Literals: “Peter Mika”,
       “Budapest”, “2,000,000”. Edge labels: label, born, population,
       capital-of.]
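Dereferencing a URI for data usually means an ordinary HTTP GET with an Accept header that asks for an RDF format. The sketch below only builds such a request with the standard library and does not send it. The URI is a made-up example, and `rdf_request` is a name invented here.

```python
from urllib.request import Request


def rdf_request(uri):
    """Build (but do not send) an HTTP GET request asking for RDF/XML."""
    # Fragment identifiers such as #PeterM are never sent to the server,
    # so the request targets the document part of the URI.
    document_uri = uri.split("#", 1)[0]
    return Request(document_uri, headers={"Accept": "application/rdf+xml"})


req = rdf_request("http://example.org/people#PeterM")  # hypothetical URI
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup; a client would then parse the response and follow the links it contains, in line with the fourth rule.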
Option 1: Linked Data

   • Advantages:
      – No change to the publishing of the HTML documents
      – Data can be published by a third party (e.g. DBpedia)
   • Disadvantages:
      – Web servers need to be configured to properly handle URIs that
        identify concepts instead of documents
      – Not favored by search engines
          • Lack of use cases
          • Crawling needs to be changed
          • Authority is difficult to determine
   • Tools
      – Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
      – RDB-to-RDF mappers (e.g. D2RQ, Triplify)
      – Validators (Vapour)
      – Linked Data browsers (many)
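The server configuration issue mentioned under disadvantages is commonly handled with a 303 redirect: a URI that names a concept cannot be served as a document, so the server redirects to a document that describes it, chosen by the Accept header. The routing function below is a sketch of that idea, not a server. The /resource/, /data/ and /page/ path layout is loosely modelled on the one DBpedia uses and should be read as an assumption.

```python
def respond(path, accept):
    """Return (status, location) for a request, using the 303-redirect pattern."""
    prefix = "/resource/"
    if path.startswith(prefix):
        name = path[len(prefix):]
        # Concept URI: redirect to a document describing the concept.
        if "application/rdf+xml" in accept:
            return 303, "/data/" + name + ".rdf"
        return 303, "/page/" + name + ".html"
    # Ordinary document URI: serve it directly.
    return 200, path


print(respond("/resource/Budapest", "application/rdf+xml"))
print(respond("/resource/Budapest", "text/html"))
```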
Growth of Linked Data

   • Community effort to (re)publish open datasets as Linked
     Data
      – In particular, scientific and government datasets
      – See linkeddata.org and the Data Hub




Option 2: Metadata in HTML

  • Using microformats, RDFa, Microdata (more later)
  • Advantages:
      – Data and document are always in sync
      – Browser plug-in friendly
      – Search engine friendly
      – Copy-paste friendly
  • Tools:
      – Any23 (Anything to Triples)
      – RDFaCE
      – RDFa Distiller

      [Figure: the sentence “Peter Mika was born in Budapest.” on a web
       page, shown together with the RDF graph embedded in it. Nodes:
       #PeterM, #Bud, #Hun. Literals: “Peter Mika”, “Budapest”,
       “2,000,000”. Edge labels: label, born, population, capital-of.]
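To illustrate how data can live inside the visible text of a page, the sketch below marks up the sentence from the figure with microdata attributes and pulls the values back out with the standard library parser. The markup is an assumption made for this example (it uses the schema.org Person type with its name and birthPlace properties), and the extractor is deliberately naive: it ignores nested items and non-text values, so it is not a conforming microdata processor.

```python
from html.parser import HTMLParser


class ItemProps(HTMLParser):
    """Collect (itemprop, text) pairs from flat microdata markup."""

    def __init__(self):
        super().__init__()
        self.props = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name:
            self._current = name
            self.props.append((name, ""))

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current:
            name, text = self.props[-1]
            self.props[-1] = (name, text + data)


# Hypothetical markup for the sentence shown in the figure.
page = ('<p itemscope itemtype="http://schema.org/Person">'
        '<span itemprop="name">Peter Mika</span> was born in '
        '<span itemprop="birthPlace">Budapest</span>.</p>')

parser = ItemProps()
parser.feed(page)
print(parser.props)
```

Because the values are the visible text itself, editing the sentence changes the data too, which is the "always in sync" advantage listed above.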
Example: Facebook’s Open Graph Protocol

  • RDF vocabulary to be used in conjunction with RDFa
      – Simplify the work of developers by restricting the freedom in RDFa
  • Activities, Businesses, Groups, Organizations, People, Places,
    Products and Entertainment
  • Only HTML <head> accepted
  • http://opengraphprotocol.org/

 <html xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
    <title>The Rock (1996)</title>
    <meta property="og:title" content="The Rock" />
    <meta property="og:type" content="movie" />
    <meta property="og:url"
          content="http://www.imdb.com/title/tt0117500/" />
    <meta property="og:image"
          content="http://ia.media-imdb.com/images/rock.jpg" /> …
 </head> ...
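Since only `<meta>` elements in the `<head>` are used, reading this markup takes very little code. The following sketch collects the og: properties from the example above with the standard library parser. The class name `OpenGraph` is invented here, and this is not a description of how Facebook itself processes pages.

```python
from html.parser import HTMLParser


class OpenGraph(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.properties[prop] = attrs.get("content")


page = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
</html>"""

parser = OpenGraph()
parser.feed(page)
print(parser.properties)
```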
Current state of metadata on the Web

   • 31% of webpages, 5% of domains contain some
     metadata
       – Analysis of the Bing Crawl (US crawl, January 2012)
       – RDFa is the most common format
           • By URL: 25% RDFa, 7% microdata, 9% microformat
           • By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
       – Adoption is stronger among large publishers
           • Especially for RDFa and microdata
   • See also
       – P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus,
         LDOW 2012
       – H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured
         Data from Two Large Web Corpora, LDOW 2012


Exponential growth in RDFa data

      [Figure: percentage of URLs with embedded metadata in various formats]
          • Five-fold increase between March 2009 and October 2010
          • Another five-fold increase between October 2010 and January 2012
Option 3: SPARQL endpoints

  • An API for accessing RDF databases on the Web
     – A query language and an HTTP protocol
  • Advantages:
     – Flexible access: make any query you want
     – Also possible to expose a traditional RDBMS via a wrapper
  • Disadvantages:
     – For the publisher: cost of supporting arbitrary queries
     – For the search engine: discovery of SPARQL servers is unsolved
  • Tools:
     – Triple stores
         • Sesame, Jena, OWLIM, Redland, Oracle, Virtuoso, Stardog etc.
     – RDB-to-RDF mappers such as D2RQ and Triplify

     [Figure: the example RDF graph held in a single store. Nodes:
      #PeterM, #Bud, #Hun. Literals: “Peter Mika”, “Budapest”,
      “2,000,000”. Edge labels: label, born, population, capital-of.]
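In the SPARQL protocol, a query can be sent as an HTTP GET with the query text in a `query` parameter. The sketch below builds such a URL without contacting any server. The endpoint address and the example.org property URIs are hypothetical, and the query assumes that #PeterM is linked to a city by born and that the city carries the population.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

# Hypothetical URIs standing in for #PeterM, born and population.
QUERY = """
SELECT ?city ?population WHERE {
  <http://example.org/people#PeterM> <http://example.org/born> ?city .
  ?city <http://example.org/population> ?population .
}
"""


def query_url(endpoint, query):
    """Encode a SPARQL query as a GET URL using the protocol's 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = query_url(ENDPOINT, QUERY)
print(url)
```

The two triple patterns form a join across the graph, which is the kind of flexible access that the other publishing options do not offer directly.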
Example: DBpedia

  • demo




Crawling the Semantic Web

  • Linked Data
     – Similar to HTML crawling, but the crawler needs to parse
       RDF/XML (and others) to extract URIs to be crawled
     – Semantic Sitemap/VoID descriptions
  • RDFa
     – Same as HTML crawling, but data is extracted after crawling
     – Mika et al. Investigating the Semantic Gap through Query Log
       Analysis, ISWC 2010.
  • SPARQL endpoints
     – Endpoints are not linked, need to be discovered by other
       means
     – Semantic Sitemap/VoID descriptions


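For the Linked Data case, the crawler's extra step is to pull candidate URIs out of each RDF/XML document it fetches. A minimal sketch with the standard library XML parser follows. It only looks at rdf:about and rdf:resource attributes with absolute HTTP URIs, which is a simplification of what a real RDF parser would find. The sample document and the function name `uris_to_crawl` are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def uris_to_crawl(rdf_xml):
    """Return HTTP URIs found in rdf:about / rdf:resource, without fragments."""
    found = set()
    for element in ET.fromstring(rdf_xml).iter():
        for attr in ("about", "resource"):
            uri = element.get("{%s}%s" % (RDF_NS, attr))
            if uri and uri.startswith("http"):
                # The fragment is dropped: the crawler fetches documents.
                found.add(uri.split("#", 1)[0])
    return sorted(found)


doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:ex="http://example.org/vocab#">
  <rdf:Description rdf:about="http://example.org/people#PeterM">
    <ex:born rdf:resource="http://example.org/places#Bud"/>
  </rdf:Description>
</rdf:RDF>"""

frontier = uris_to_crawl(doc)
print(frontier)
```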
Data fusion

   • Ontology (schema) matching
      – Widely studied in Semantic Web research
          • ontologymatching.org
   • Entity resolution
      – Finding links between datasets
      – Tools: SILK, LIMES
   • Blending
      – Merging objects that represent the same real world entity and
        reconciling information from multiple sources
   • Cleaning
      – Google Refine



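As a toy version of entity resolution, the sketch below proposes sameAs links between two datasets whenever the labels of two entities are nearly identical after normalisation. It uses string similarity from the standard library and is not a description of how Silk or LIMES work. The labels given to #roi2 and #peter and the 0.9 threshold are assumptions.

```python
from difflib import SequenceMatcher


def normalise(label):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def candidate_links(left, right, threshold=0.9):
    """Pair up identifiers whose normalised labels are near-identical."""
    links = []
    for l_id, l_label in left.items():
        for r_id, r_label in right.items():
            score = SequenceMatcher(
                None, normalise(l_label), normalise(r_label)
            ).ratio()
            if score >= threshold:
                links.append((l_id, "sameAs", r_id))
    return links


left = {"example:roi": "Roi Blanco"}
right = {"#roi2": "roi  blanco", "#peter": "Peter Mika"}  # hypothetical labels

links = candidate_links(left, right)
print(links)
```

The resulting links are only candidates. The blending step would then merge the linked objects and reconcile any conflicting values.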
More info

   • Ideas for hacks
      – http://challenge.semanticweb.org/
      – http://iswc2011.semanticweb.org/calls/linked-data-a-thon/
   • Book
      – Segaran, Evans and Taylor. Programming the Semantic Web.
        O’Reilly, 2009.
   • More tools
      – Exhibit: faceted browsing and other visualizations
      – http://www.dajobe.org/talks/200906-semtech-open/
      – LOD2 stack (stack.lod2.eu)





Editor's notes

  1. Facebook invited, but continues to pursue OGP