Being promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using schema.org, has become one of the most important markup languages for the Web. However, deployed Microdata is most often not free from errors, which limits its practical use. In this paper, we use the WebDataCommons corpus of Microdata extracted from more than 250 million web pages for a quantitative analysis of common mistakes in Microdata provision. Since it is unrealistic that data providers will provide clean and correct data, we discuss a set of heuristics that can be applied on the data consumer side to fix many of those mistakes in a post-processing step. We apply those heuristics to provide an improved knowledge base constructed from the raw Microdata extraction.
Microdata in a Nutshell
Adding structured information to web pages
• By marking up contents and entities
Arbitrary vocabularies are possible
• Practically, only schema.org is deployed on a large scale
• Plus its historical predecessor: data-vocabulary.org
Similar to RDFa
Heuristics for Fixing Common Errors in Deployed schema.org Microdata - Meusel and Paulheim - ESWC 2015
<div itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="name">Data and Web Science Group</span>
<span itemprop="addressLocality">Mannheim</span>,
<span itemprop="postalCode">68131</span>
<span itemprop="addressCountry">Germany</span>
</div>
Schema.org in a Nutshell
Vocabulary for marking up entities on web pages
• 675 classes and 965 properties (as of May 2015, release 2.0)
Promoted and consumed by major search engine companies
• Google, Bing, Yahoo!, and Yandex
• Google Rich Snippets
Community-driven evolution and development
Can be used with Microdata and RDFa
• Hardly used together with RDFa (<0.1% of RDFa-using websites [1])
[1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html
Schema.org in a Nutshell – Coverage
Schema.org has incorporated some popular vocabularies, like:
• GoodRelations (2012)
• W3C BibExtend (2014)
• MusicBrainz vocabulary (2015)
• Automotive Ontology (2015)
Microdata with Schema.org in HTML Pages
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580">
<h1> Predator Instinct FG Fußballschuh
</h1>
<div>
<meta content="EUR">
<span
data-sale-price="219.95">219,95</span>
…
</body>
</html>
HTML pages directly embed
markup languages to annotate
items using different vocabularies
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580"
itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name"> Predator Instinct FG Fußballschuh
</h1>
<div itemscope itemtype="http://schema.org/Offer"
itemprop="offers">
<meta itemprop="priceCurrency" content="EUR">
<span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
…
So Far, So Good …
The schema is well explained on the schema.org website
Data providers are supported by validation tools
(e.g. Yandex structured data validator) when deploying
Win-Win for both sides
Plus: Data is (mostly) freely accessible on the Web
… but:
Hundreds of thousands of data providers, most of whom are not schema.org
experts or evangelists
Validators and the schema documentation might help, but nobody is required to use
them
So What Could Possibly Go Wrong?
Usage of wrong namespaces
• http./schema.org
Usage of undefined types
• http://schema.org/Breadcrumb
Usage of undefined properties
• http://schema.org/postID
Confusion of datatype properties and object properties
• _:n1 s:address “Jump Street 21”
Property domain and range violations
• _:n1 a s:Product
_:n1 s:price “for free”
Compiling a Schema.org Dataset
Starting point: all pages in the CommonCrawl that contain
Microdata
What could be (meant to be) schema.org?
• Everything that contains “schema.org” as substring in a namespace
• Everything that contains URIs whose protocol and authority are similar
to “http://schema.org/” (edit distance of 1)
• Filter noise: removing all namespaces that occur only on one website
Final corpus consists of:
6.4 billion triples
extracted from over 217 million pages
belonging to 398,542 data providers
which is 86% of all Microdata in the corpus.
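The candidate selection described on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names are hypothetical, and comparing URI prefixes of nearly the same length as “http://schema.org/” is an assumed way of approximating "protocol and authority".

```python
TARGET = "http://schema.org/"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def is_candidate(uri: str) -> bool:
    """True if the URI could be meant to be schema.org."""
    if "schema.org" in uri:
        return True
    # Assumption: approximate "protocol and authority" by URI prefixes
    # whose length differs from the target by at most one character.
    n = len(TARGET)
    return any(edit_distance(uri[:k], TARGET) <= 1 for k in (n - 1, n, n + 1))


def filter_noise(sites_by_namespace: dict[str, set[str]]) -> set[str]:
    """Keep only namespaces that occur on more than one website."""
    return {ns for ns, sites in sites_by_namespace.items() if len(sites) > 1}
```

For example, `is_candidate("http://shema.org/Product")` is accepted through the edit-distance rule, while a namespace seen on a single website is dropped by `filter_noise`.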
Namespace Violations
More than 98% of the preselected pages use a correct
namespace
Frequent namespace variations:
• http://www.schema.org/
• https://schema.org
• http:/schema.org
• http://SChema.org
Debated!
Undefined Types
Used by around 6% of all data providers
Typical causes:
• Misspellings: http://schema.org/Stores (correct: …/Store)
• Miscapitalization: http://schema.org/localbusiness (correct: …/LocalBusiness)
Comparison to LOD Compliance
• 5.8% of all Microdata documents
• 38.8% of all LOD documents (Hogan et al., 2010)
Undefined Properties
Used by around 4% of all data providers
Typical Causes:
• Miscapitalization: http://schema.org/contentURL (correct: …/contentUrl)
• Close but miss: http://schema.org/currency (correct: …/priceCurrency)
http://schema.org/fax
• Made up: http://schema.org/blogId
http://schema.org/postId
Comparison to LOD Compliance
• 9.7% of all Microdata documents
• 72.4% of all LOD documents (Hogan et al., 2010)
Confusion of Object Properties with Data Properties
i.e. using an object property with a string value
Used by over 56.6% of all data providers
Typical properties:
• http://schema.org/addressCountry
• http://schema.org/manufacturer
• http://schema.org/author
• http://schema.org/brand
Comparison to LOD Compliance
• 24.35% of all Microdata documents
• 8% of all LOD documents (Hogan et al., 2010)
Confusion of Data Properties with Object Properties
i.e. using a data property with a complex object
Used by less than 0.2% of all data providers
Comparison to LOD Compliance
• 0.6% of all Microdata documents
• 2.2% of all LOD documents (Hogan et al., 2010)
Property Domain Violations
i.e. using a property with a subject not included in its domain
Used by 4% of all data providers
Typical violations are mainly shortcuts
• s:price used on s:Product
• s:streetAddress used on s:LocalBusiness
Comparison to LOD Compliance:
• Difficult to compare as semantics are different
• List of schema.org domains is exhaustive
• LOD: open world assumption
Intended paths:
• s:Product → s:Offer → s:price
• s:LocalBusiness → s:PostalAddress → s:streetAddress
Data Property Range Violations
i.e. using a data property with an incompatible literal
Used by 9.6% of all data providers
20 most common violations:
• 13 dates
• 3 URLs
• 2 numbers
• 2 times
Comparison to LOD Compliance:
• 12.06% of all Microdata documents
• 4.6% of all LOD documents (Hogan et al., 2010)
Examples of violating literals: “a month ago”, “2 pieces”, “last week”
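A consumer-side check for such range violations might look like the following sketch. The three expected types and the function name `violates_range` are illustrative assumptions, and the date check only accepts what Python's `date.fromisoformat` accepts, which is narrower than everything schema.org allows for dates.

```python
from datetime import date
from urllib.parse import urlsplit


def _is_date(value: str) -> bool:
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def _is_number(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_url(value: str) -> bool:
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    # Require both a scheme and a host, e.g. "http://example.org/".
    return bool(parts.scheme and parts.netloc)


# Hypothetical mapping from an expected range to its check.
CHECKS = {"Date": _is_date, "Number": _is_number, "URL": _is_url}


def violates_range(expected_type: str, literal: str) -> bool:
    """True if the literal does not fit the expected data type."""
    return not CHECKS[expected_type](literal.strip())
```

With this sketch, `violates_range("Date", "a month ago")` and `violates_range("Number", "2 pieces")` both report a violation, while `violates_range("Date", "2015-05-31")` does not.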
Object Property Range Violations
i.e. using an object property with a type outside its range
Used by 8.6% of all data providers
Typical violations:
• s:mainContentOfPage with s:Blog instead of
s:WebPageElement
Comparison to LOD Compliance
• 3.2% of all Microdata documents
• 2.4% of all LOD documents (Hogan et al., 2010)
Maybe a hint at a missing
hierarchy relation?
Schema.org Compliance Summary
Surprisingly high level of compliance
Providers are often not technology evangelists (unlike in LOD)
• Anybody can start publishing Microdata annotated HTML
Most often higher than for LOD
• Except for the confusion of data and object properties
But still, the number of erroneous pages could prevent data
consumers from making use of the annotated data and understanding
its semantics.
Identifying and Fixing Wrong Namespaces
Main errors due to missing slashes, wrong protocol and
capitalization
Simple rules to handle wrong namespaces
• Removal of www
• Replacement of https by http
• Conversion to lower case
• Adding of missing slashes and removal of prefixes before schema.org
Impact:
• 147 of 148 wrongly spelled namespaces could be fixed
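The rules above can be collapsed into a single rewrite step, sketched below. This is an assumption about how the rules might be combined, not the authors' code: the name `fix_namespace` is hypothetical, and the sketch does not guard against look-alike hosts that merely contain "schema.org".

```python
import re

CANONICAL = "http://schema.org/"

# Everything up to and including "schema.org", plus any slashes after it.
# This covers "www.", "https", wrong capitalization, missing slashes and
# other prefixes in front of schema.org in one pattern.
_NAMESPACE = re.compile(r"^.*?schema\.org/*", flags=re.IGNORECASE)


def fix_namespace(uri: str) -> str:
    """Replace a misspelled schema.org namespace, keeping the local name."""
    match = _NAMESPACE.match(uri)
    if match is None:
        return uri  # nothing recognisable to repair
    return CANONICAL + uri[match.end():]
```

Applied to the variations listed earlier, `http://www.schema.org/Person`, `http:/schema.org/Offer` and `http://SChema.org/Product` all end up in the `http://schema.org/` namespace with their local names unchanged.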
Handling Undefined Types and Properties
Main errors due to wrong capitalization
Heuristic: Ignore capitalization when parsing entities from web
pages, and replace the schema element with the properly
capitalized version
Impact (together with namespace fixes):
• Correct type replacement within 71% of all data providers
• Correct property replacement within 65% of all data providers
• The remaining data providers account for over 70% of all
undefined types and properties; these are
hard-to-detect typos
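The capitalization heuristic amounts to a case-insensitive lookup against the schema. A minimal sketch follows, assuming a tiny hand-picked excerpt of the vocabulary; `TYPES`, `PROPERTIES` and `fix_term` are hypothetical names, and a real consumer would load the full schema.org definition instead.

```python
from typing import Optional

# Small illustrative excerpt of the vocabulary (not the full schema).
TYPES = {"LocalBusiness", "Store", "Product", "Offer"}
PROPERTIES = {"contentUrl", "priceCurrency", "name"}

# Types and properties are indexed separately because they appear in
# different positions in Microdata (itemtype vs. itemprop).
_TYPE_INDEX = {term.lower(): term for term in TYPES}
_PROPERTY_INDEX = {term.lower(): term for term in PROPERTIES}


def fix_term(local_name: str, is_type: bool) -> Optional[str]:
    """Return the properly capitalized term, or None if it is unknown."""
    index = _TYPE_INDEX if is_type else _PROPERTY_INDEX
    return index.get(local_name.lower())
```

Here `fix_term("localbusiness", is_type=True)` yields `LocalBusiness`, whereas a genuine misspelling such as `Stores` returns `None` and stays unresolved, which matches the hard-to-detect typos mentioned above.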
Handling Object Properties with Literal Values
Main objects modeled as literals are s:Organization,
s:Person and s:PostalAddress
Manually inspecting those values for the object
properties s:author, s:creator and s:address
Impact
• The heuristic could replace all misused
object properties on 92,449 data providers
• Might lead to changes in the type distribution
• E.g. 14 million new entities of type
s:PostalAddress
Before:
_:1 s:author “Robert” .
After:
_:1 s:author _:2 .
_:2 a s:Person .
_:2 s:name “Robert” .
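The rewrite in this example can be expressed as a small function over triples. The sketch below makes several assumptions: triples are plain Python tuples, the new blank node label is generated from a counter, and the literal is always attached via `s:name` as in the s:author example. Which type to use for a given property is left to the caller, since the slide only states that the values were inspected manually.

```python
from itertools import count
from typing import Iterator

Triple = tuple[str, str, str]


def lift_literal(triple: Triple, range_type: str,
                 counter: Iterator[int]) -> list[Triple]:
    """Replace a literal object by a new typed node that carries the literal."""
    subject, prop, literal = triple
    node = f"_:gen{next(counter)}"      # hypothetical blank node naming
    return [
        (subject, prop, node),          # the property now points to an entity
        (node, "a", range_type),        # the entity gets the chosen type
        (node, "s:name", literal),      # the literal is kept as its name
    ]
```

Calling `lift_literal(("_:1", "s:author", "Robert"), "s:Person", count(2))` produces the three triples of the "after" state, with `_:gen2` in place of `_:2`.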
Handling Property Domain Violations
Main cause: shortcuts
Example: _:1 s:aggregatedRating “5”, where s:aggregatedRating is not defined for the type of _:1. Which intermediate entity _:2 (type? property?) should carry s:aggregatedRating instead?
Heuristic to find the property R and type T for a domain violation of property s:r on an entity of type s:t, using one of two patterns:
• Pattern 1:
R s:domainIncludes s:t .
R s:rangeIncludes T .
s:r s:domainIncludes T .
• Pattern 2:
R s:rangeIncludes s:t .
R s:domainIncludes T .
s:r s:domainIncludes T .
A fix is applied only if there is one unique solution for only one of the two patterns
Impact:
• 31% of erroneous data providers could be fixed
• No solution or multiple solutions for the rest
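The search for R and T under the two patterns can be sketched as follows. This is an illustration under simplifying assumptions: the schema is given as two plain dictionaries, no subclass reasoning is performed, and the function names and the toy schema in the usage note are hypothetical.

```python
from typing import Optional

Schema = dict[str, set[str]]
Solution = tuple[int, str, str]  # (pattern number, property R, type T)


def find_fixes(r: str, t: str, domain_includes: Schema,
               range_includes: Schema) -> list[Solution]:
    """All (pattern, R, T) combinations that would repair s:r on type s:t."""
    solutions = []
    for T in sorted(domain_includes.get(r, set())):  # s:r s:domainIncludes T
        for R in sorted(domain_includes):
            domains = domain_includes[R]
            ranges = range_includes.get(R, set())
            if t in domains and T in ranges:   # pattern 1: s:t --R--> T
                solutions.append((1, R, T))
            if t in ranges and T in domains:   # pattern 2: T --R--> s:t
                solutions.append((2, R, T))
    return solutions


def unique_fix(r: str, t: str, domain_includes: Schema,
               range_includes: Schema) -> Optional[Solution]:
    """Return a fix only if exactly one solution exists overall."""
    solutions = find_fixes(r, t, domain_includes, range_includes)
    return solutions[0] if len(solutions) == 1 else None
```

With a toy schema where `s:price` has domain `s:Offer` and `s:offers` leads from `s:Product` to `s:Offer`, `unique_fix("s:price", "s:Product", ...)` returns `(1, "s:offers", "s:Offer")`. As soon as a second property also links `s:Product` to `s:Offer`, the result is `None`, which corresponds to the "multiple solutions" case.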
Heuristics Summary
Over 410 million wrong triples could be corrected
Over 700 million missing triples could be added
In total, corrections affected over 115,000 data providers
• ~ 28% of all data providers in the data set
LD4IE Challenge @ ISWC 2015
Learn to annotate entities on HTML pages using already
annotated pages as training set.
Deadline: 2015-07-15
Challenge Page: goo.gl/laF6yl
Contact: Heiko Paulheim
(heiko@dwslab.de)
Good Luck!
Thank you! Questions? Feedback?
Data and more insights can be found at:
http://webdatacommons.org/structureddata/2013-11/stats/fixing_common_errors.html
More interesting datasets and analysis can be found at the
website of WebDataCommons:
http://webdatacommons.org/index.html
Acknowledgement
The extraction and analysis of the datasets were supported
by an AWS in Education Grant.