Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare different versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at different points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) as well as bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time
1. A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time
Robert Meusel, Christian Bizer, and Heiko Paulheim
2. 2
Motivation - LOD Cloud with 1,000 data providers
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time - WIMS 2015
3. 3
Motivation - schema.org MD with 700k data providers
4. 4
Microdata in a Nutshell
Adding structured information to web pages
• By marking up contents and entities
Arbitrary vocabularies are possible
• Practically, only schema.org is deployed on a large scale
• Plus its historical predecessor: data-vocabulary.org
Similar to RDFa
<div itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="name">Data and Web Science Group</span>
<span itemprop="addressLocality">Mannheim</span>,
<span itemprop="postalCode">68131</span>
<span itemprop="addressCountry">Germany</span>
</div>
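The markup above can be read back mechanically. The following is a minimal sketch, not a full Microdata parser: it assumes a single, non-nested item and uses only Python's standard `html.parser`. The names `FlatItemParser`, `extract`, and `SAMPLE` are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# The PostalAddress example from the slide, used as test input.
SAMPLE = """
<div itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="name">Data and Web Science Group</span>
<span itemprop="addressLocality">Mannheim</span>,
<span itemprop="postalCode">68131</span>
<span itemprop="addressCountry">Germany</span>
</div>
"""


class FlatItemParser(HTMLParser):
    """Collects the itemtype and itemprop values of one non-nested item."""

    def __init__(self):
        super().__init__()
        self.itemtype = None      # value of itemtype on the itemscope element
        self.current_prop = None  # itemprop whose text is currently being read
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.itemtype = attrs.get("itemtype")
        if "itemprop" in attrs:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        # Only text inside an itemprop element is recorded.
        if self.current_prop and data.strip():
            self.properties[self.current_prop] = data.strip()

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self.current_prop = None


def extract(html):
    """Return (itemtype, {itemprop: text}) for the markup in html."""
    parser = FlatItemParser()
    parser.feed(html)
    parser.close()
    return parser.itemtype, parser.properties


if __name__ == "__main__":
    print(extract(SAMPLE))
```

Nested items, `content` attributes on `meta` elements, and multiple values per property are deliberately left out of this sketch.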
5. 5
Schema.org in a Nutshell
Vocabulary for marking up entities on web pages
• 675 classes and 965 properties (as of May 2015, release 2.0)
Promoted and consumed by major search engine companies
• Google, Bing, Yahoo!, and Yandex
• Google Rich Snippets
Community-driven evolution and development
6. 6
Schema.org in a Nutshell – Coverage
Schema.org has incorporated some popular vocabularies, like:
• Good Relations (2012)
• W3C BibExtend (2014)
• MusicBrainz vocabulary (2015)
• Automotive Ontology (2015)
7. 7
Microdata with Schema.org in HTML Pages
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580">
<h1> Predator Instinct FG Fußballschuh
</h1>
<div>
<meta content="EUR">
<span
data-sale-price="219.95">219,95</span>
…
</body>
</html>
HTML pages directly embed markup languages to annotate
items using different vocabularies
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580"
itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name"> Predator Instinct FG Fußballschuh
</h1>
<div itemscope itemtype="http://schema.org/Offer"
itemprop="offers">
<meta itemprop="priceCurrency" content="EUR">
<span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node2 <http://schema.org/Offer/price> "219,95"@de .
_:node2 <http://schema.org/Offer/priceCurrency> "EUR" .
…
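The triples above follow a visible pattern: each predicate IRI is the item's type IRI, a slash, and the property name. A minimal sketch of that mapping is given here, assuming an already parsed item represented as a nested dict. The dict layout, the function name `item_to_triples`, and the choice to give the nested Offer its own blank node linked via `offers` are assumptions of this sketch, not a description of the extraction tool used for the crawl. Language tags and literal escaping are omitted.

```python
from itertools import count

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def item_to_triples(item, counter=None):
    """Flatten a nested item dict into (subject, predicate, object) triples."""
    if counter is None:
        counter = count(1)
    subject = f"_:node{next(counter)}"
    triples = [(subject, RDF_TYPE, f"<{item['type']}>")]
    for prop, value in item["properties"].items():
        # Predicate pattern as seen on the slide: <itemtype>/<itemprop>
        predicate = f"<{item['type']}/{prop}>"
        if isinstance(value, dict):
            # A nested item becomes its own blank node, linked from the parent.
            child = item_to_triples(value, counter)
            triples.append((subject, predicate, child[0][0]))
            triples.extend(child)
        else:
            # Assumes the literal contains no quotes or backslashes.
            triples.append((subject, predicate, f'"{value}"'))
    return triples


# Hypothetical parsed form of the product markup shown above.
PRODUCT = {
    "type": "http://schema.org/Product",
    "properties": {
        "name": "Predator Instinct FG Fußballschuh",
        "offers": {
            "type": "http://schema.org/Offer",
            "properties": {"priceCurrency": "EUR", "price": "219,95"},
        },
    },
}

if __name__ == "__main__":
    for s, p, o in item_to_triples(PRODUCT):
        print(s, p, o, ".")
```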
8. 8
Wrap-Up
Semantic annotations are used by more and more websites
Entities on websites become machine-readable and machine-understandable
schema.org together with Microdata is a success story
• Promoted by search engine companies
• Deployed by over 17% of all websites [1] (over 700k data providers)
Usage is more compliant with the schema than, e.g., LOD [2]
[1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html
[2] Meusel and Paulheim, ESWC 2015
9. 9
Digging for Reasons
So, Microdata is deployed more often and is often more
schema-compliant, although there are millions of uncontrolled
providers with different skill sets
But why? Some hypotheses…
• Availability of documentation
• Tool support
• Business incentive
• Schema flexibility
Can we confirm/reject those from looking at the data?
10. 10
A Diachronic Perspective
Versions of schema.org are archived over time
• Plus: there are several crawl releases per year
• i.e., we can look at change over time
If we look at both schema and deployed data, we may observe
• Adoption rates of schema changes
• Data-first changes to the schema
• Convergence or divergence of deployed data
11. 11
A Diachronic Perspective
Three releases of WDC Microdata corpus [1]
• 2012, 2013, and 2014
Versions of schema.org that were valid
• At the beginning of the crawl
• At the end of the crawl
[1] http://webdatacommons.org/structureddata
12. 12
Top-down Adoption
How fast are changes in the schema adopted?
• New classes/properties
• Deprecations
• Domain/range changes
Measuring adoption: challenges
• Different crawls
• Overall growth of deployed schema.org
Measure: normalized usage increase (nui) from i to j:
• nui(s)>1.05: usage of schema element s has increased significantly
• nui(s)<0.95: usage of schema element s has decreased significantly
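The nui formula itself does not appear in this transcript. One plausible reading, given that the measure has to compensate for the overall growth of deployed schema.org, is the ratio of an element's usage share in crawl j to its usage share in crawl i. The sketch below implements that assumed reading. The function names and the counts in the example are hypothetical; only the 1.05 and 0.95 thresholds come from the slide.

```python
def nui(users_i, total_i, users_j, total_j):
    """Assumed reading of normalized usage increase from crawl i to crawl j.

    users_*: number of data providers using schema element s in that crawl
    total_*: number of data providers using schema.org at all in that crawl
    """
    if users_i == 0 or total_i == 0 or total_j == 0:
        raise ValueError("nui is undefined without usage in the earlier crawl")
    share_i = users_i / total_i
    share_j = users_j / total_j
    return share_j / share_i


def classify(value, upper=1.05, lower=0.95):
    """Apply the significance thresholds given on the slide."""
    if value > upper:
        return "increased"
    if value < lower:
        return "decreased"
    return "stable"


if __name__ == "__main__":
    # Hypothetical counts: absolute usage grows from 100 to 150 providers,
    # but the overall corpus doubles, so the normalized usage decreases.
    value = nui(users_i=100, total_i=1000, users_j=150, total_j=2000)
    print(value, classify(value))
```

Under this reading, an element can gain users in absolute terms and still end up with a nui below 0.95 when the overall corpus grows faster.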
13. 13
Top-down Adoption
Adoption of new classes and properties
• Almost half of all introduced classes are never used!
• Similar for new properties
Reasons
• Bulk-addition of vocabularies
• not every term is equally needed
• e.g., medical vocabulary
• Blind spot of our approach
• some terms are mainly for e-mail markup
• e.g., Actions
SURPRISE!
14. 14
Top-down Adoption
Main domains of positive adoption
• Metadata for web content
(schema.org/WebSite has the highest nui)
• Broadcasting (e.g., TV Episodes)
• Questions & Answers
• Postal addresses
Classes featured in Google Rich Snippets
• Still growing at a high level (tens of thousands of data providers)
• But nui(s)<0.95
Yellow Pages
Search Engine Listings
Collaboration with BBC and EBU
Influence of CMS adoption
Q&A Pages, such as Stack Overflow
15. 15
Top-down Adoption
Adoption of domain/range changes
• Again: rather low overall adoption
Adopted well for
• Products (height, width, itemCondition, …)
• Broadcasting domain (episode, actor, ...)
Search Engine Listings
Collaboration with BBC and EBU
16. 16
Top-down Adoption
Adoption of deprecations
• Works well (29 out of 32 have a significantly low nui)
Exceptions
• s:map (← s:hasMap)
• s:maps (← s:hasMap)
For Google Maps (lots of outdated tutorials)
17. 17
Bottom-up Evolution
Martin Luther
• Started the Protestant church
• A success story, too (like schema.org)
• (i.e., 800 million adopters worldwide)
Famous quote:
• “Man muss […] dem gemeinen Mann aufs Maul schauen”
• (roughly:
“You have to listen to the way the common man really speaks.”)
Martin Luther, 1483-1546
Disclaimer: I do not speak for the Protestant church.
18. 18
Bottom-up Evolution
Are new features in the schema first used “unofficially”?
• New classes/properties
• Domain/range changes
Instrument for measurement: ROC curves
• True positives mapped against false positives
• tp: elements used before
• fp: elements not used before
• Ranking by #PLDs
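A ROC curve over a ranking by #PLDs can be sketched as follows. The sketch assumes each candidate element carries a PLD count and a boolean label (for instance, whether an element that was used unofficially later entered the schema). How the labels were assigned in the study is only summarized on the slide, so that labeling, the function names, and the example counts are hypothetical. Ties in the ranking are not treated specially.

```python
def roc_points(candidates):
    """Return ROC points for candidates given as (num_plds, is_positive).

    Candidates are ranked by descending PLD count; walking down the ranking,
    each positive moves the curve up and each negative moves it right.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    positives = sum(1 for _, is_pos in ranked if is_pos)
    negatives = len(ranked) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    # Hypothetical candidates: (number of PLDs, label)
    example = [(500, True), (300, True), (200, False), (50, True), (10, False)]
    curve = roc_points(example)
    print(curve)
    print(auc(curve))
```

An area close to 0.5 would mean the ranking by #PLDs says little about the labels, while values well above 0.5 would indicate an influence of deployed data.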
19. 19
Bottom-up Evolution
Some mild influences are observable
• Stronger for domain/range changes
• especially range changes
• Weaker for new classes/properties
[Figure: curves for classes, properties, domains, and ranges over the periods 2012→2013, 2013→2014, and 2012→2014]
20. 20
Bottom-up Evolution
Extension mechanism
• Allows for user-defined classes/properties
• Those become subclasses implicitly
Analysis over time
• No measurable impact on standard evolution
• “Unofficial” use is likelier than use of the extension mechanism
s:Product/ElectronicProduct
s:price/reducedPrice
21. 21
Overall Convergence
Measuring convergence
• i.e., homogeneity of descriptions of classes
• Example: two instances of s:LocalBusiness
[Figure: two s:LocalBusiness instances (_:1), each with an s:PostalAddress node (_:2); one holds “Birmingham” and “Main Street 24”, the other “Liverpool” and “Church Street 1”]
22. 22
Overall Convergence
Recap
• RDF from Microdata is a set of trees
• i.e., we can enumerate all paths to leaf nodes
(omitting literals)
Example:
[Figure: an s:LocalBusiness instance (_:1) with an s:PostalAddress node (_:2) holding “Liverpool” and “Church Street 1”]
rdf:type-s:LocalBusiness,
s:address-rdf:type-s:PostalAddress,
s:address-s:addressLocality,
s:address-s:streetAddress
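The path enumeration for this example can be sketched as a small recursive walk. The nested-dict representation of the instance and the function names are assumptions made for illustration. Literal values are dropped, as stated on the slide.

```python
def enumerate_paths(item, prefix=()):
    """Return all root-to-leaf paths of an item tree as tuples of labels."""
    # The type statement of every node is a path of its own.
    paths = [prefix + ("rdf:type", item["type"])]
    for prop, value in item["properties"].items():
        if isinstance(value, dict):
            # Nested item: descend, carrying the property as path prefix.
            paths.extend(enumerate_paths(value, prefix + (prop,)))
        else:
            # Literal leaf: keep the property, omit the literal itself.
            paths.append(prefix + (prop,))
    return paths


def format_paths(item):
    """Join each path with '-' as in the example above."""
    return ["-".join(path) for path in enumerate_paths(item)]


# Hypothetical nested-dict form of the Liverpool instance.
LOCAL_BUSINESS = {
    "type": "s:LocalBusiness",
    "properties": {
        "s:address": {
            "type": "s:PostalAddress",
            "properties": {
                "s:addressLocality": "Liverpool",
                "s:streetAddress": "Church Street 1",
            },
        },
    },
}

if __name__ == "__main__":
    for path in format_paths(LOCAL_BUSINESS):
        print(path)
```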
23. 23
Overall Convergence
Using all paths, we can compute the entropy for each class as
Low entropy indicates high homogeneity
We normalize by both the maximum entropy and the total number of paths
• i.e., we use normalized entropy rate as a measure for homogeneity
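The entropy formula is not reproduced in this transcript. A minimal sketch, assuming plain Shannon entropy over the relative frequencies of the paths observed for one class, divided by the maximum entropy for that number of distinct paths, is shown below. The additional normalization by the total number of paths is not specified precisely enough here to reproduce, so it is left out. The function name and the counts are hypothetical.

```python
import math


def normalized_entropy(path_counts):
    """Shannon entropy of a path distribution, scaled to the range [0, 1].

    path_counts maps each distinct path to how often it was observed for
    the instances of one class. 0 means perfectly homogeneous usage,
    1 means all paths are used equally often.
    """
    counts = [c for c in path_counts.values() if c > 0]
    if len(counts) <= 1:
        return 0.0  # a single path carries no uncertainty
    total = sum(counts)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts)
    max_entropy = math.log2(len(counts))
    return entropy / max_entropy


if __name__ == "__main__":
    # Hypothetical path counts for one class in two crawls.
    earlier = {"s:address-s:addressLocality": 50, "s:address-s:streetAddress": 50}
    later = {"s:address-s:addressLocality": 90, "s:address-s:streetAddress": 10}
    print(normalized_entropy(earlier), normalized_entropy(later))
```

A value that drops from one crawl to the next would correspond to the convergence discussed on the following slide.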
24. 24
Overall Convergence
Observations
• Overall entropy decreases over time
Classes with high convergence rates
• WebSite, Blog, …
• Hotel, Restaurant, …
• Product, Offer, …
• Rating, Review
Influence of CMS adoption
Yellow pages
Google Rich Snippets
...all of the above
25. 25
Key Adoption Drivers
Search Engine Optimization
• Web site providers want to be high in Google rankings
• Direct business incentive!
Tool adoption
• Major CMSs use schema.org
Standard Agility
• schema.org: 25 revisions in last three years
• cf. FOAF: six revisions in last eight years
26. 26
Summary
Both directions, top-down adoption and bottom-up evolution, can be
observed
Homogeneity of the deployed schema increases over time
The described empirical, data-driven study reveals valuable insights
into how and why schema.org is a success story
The observed key drivers and obstacles can also help to understand
and analyze the adoption of other standards, e.g., LOD
More fine-grained insights might be revealed by extending
the analysis corpus to the mailing list archive and issue tracker
27. 27
Thank you! Questions? Feedback?
Raw data can be found on the website of WebDataCommons:
http://webdatacommons.org/structureddata/
More interesting datasets and analysis:
http://webdatacommons.org/index.html
Acknowledgement
The extraction and analysis of the datasets was supported
by an AWS in Education Grant.