
Analysing & Improving Learning Resources Markup on the Web

Talk at WWW2017 on LRMI adoption, quality and usage. Full paper here: http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/companion/p283.pdf.


  1. Analysing and Improving Embedded Markup of Learning Resources on the Web. Stefan Dietze, Davide Taibi, Ran Yu, Phil Barker, Mathieu d’Aquin. WWW2017, Digital Learning Track, 05/04/17.
  2. Open Data & Linked Data: structured data about learning resources on the Web?
     • Resource metadata:
       o Standards: LOM, ADL SCORM, IMS LD, etc.
       o Repositories: Open Courseware, Merlot, ARIADNE, etc.
     • Educational(ly relevant) linked data:
       o Vocabularies: BIBO, LOM/RDF, mEducator, etc.
       o Datasets: e.g. LinkedUp Catalog (approx. 50 M resources), http://data.linkededucation.org/linkedup/catalog/
  3. Structured data about learning resources on the Web?
     • The Web: approx. 46,000,000,000,000 (46 trillion) pages indexed by Google.
     • Resource metadata (standards: LOM, ADL SCORM, IMS LD, etc.; repositories: Open Courseware, Merlot, ARIADNE, etc.).
     • Educational(ly relevant) linked data (vocabularies: BIBO, LOM/RDF, mEducator, etc.; datasets: e.g. LinkedUp Catalog, approx. 50 M resources).
  4. Embedded markup data & schema.org
     • Embedded markup (RDFa, Microdata, Microformats) enables interpretation of Web documents (search, retrieval).
     • schema.org vocabulary is used at scale (approx. 700 classes, 1000 predicates) and supported by Yahoo, Yandex, Bing and Google.
     • Adoption on the Web (2016):
       o 38% of 3.2 bn pages
       o 44 bn statements/quads (see Web Data Commons; Meusel & Paulheim [ISWC2014])
     • Same order of magnitude as "the Web" (scale, dynamics).
     • Example Microdata:

       <div itemscope itemtype="http://schema.org/Movie">
         <h1 itemprop="name">Forrest Gump</h1>
         <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
         <span itemprop="genre">Drama</span>
         ...
       </div>

     • Illustrative RDF statements extracted from such markup:
       node1 actor _node-x
       node1 actor Robin Wright
       node1 genre Comedy
       node2 actor T. Hanks
       node2 distributedBy Paramount Pictures
       node3 actor Tom Cruise
       node3 distributedBy Paramount Pictures
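The extraction step above can be sketched in a few lines: walk the HTML, remember the current `itemtype`, and emit a statement for every `itemprop` with text content. This is a minimal standard-library sketch, not the Any23-based pipeline Web Data Commons actually uses (it ignores nested itemscopes, value attributes, and RDFa).

```python
from html.parser import HTMLParser

class MicrodataExtractor(HTMLParser):
    """Collects (item_type, property, value) statements from itemprop text."""
    def __init__(self):
        super().__init__()
        self.statements = []
        self._item_type = None
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:            # a new item starts here
            self._item_type = attrs.get("itemtype")
        if "itemprop" in attrs:             # next text node is this property's value
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.statements.append(
                (self._item_type, self._current_prop, data.strip()))
            self._current_prop = None

doc = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Forrest Gump</h1>
  <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
  <span itemprop="genre">Drama</span>
</div>
"""
extractor = MicrodataExtractor()
extractor.feed(doc)
print(extractor.statements)
```

Running this yields three statements for the Movie item (name, actor, genre), mirroring the node/predicate/literal triples shown on the slide.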
  5. Learning Resources Metadata Initiative (LRMI), http://lrmi.dublincore.net/
     • schema.org extension providing a vocabulary for annotating learning resources.
     • Associates resources (s:CreativeWork, e.g. books, videos) with learning-related attributes (typical age range, learning resource type, educational frameworks, etc.).
     • Maintained by a Dublin Core Metadata Initiative (DCMI) task force on LRMI.
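For illustration, a CreativeWork annotated with LRMI properties might look like the following Microdata fragment (the resource name and values are invented; the property names `learningResourceType`, `typicalAgeRange` and `educationalUse` come from the LRMI specification):

```html
<div itemscope itemtype="http://schema.org/CreativeWork">
  <h1 itemprop="name">Introduction to Fractions</h1>
  <meta itemprop="learningResourceType" content="lesson plan">
  <meta itemprop="typicalAgeRange" content="8-10">
  <meta itemprop="educationalUse" content="group work">
</div>
```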
  6. LRMI: research questions
     • How is LRMI actually being used on the Web?
       o RQ1) Adoption of LRMI terms/patterns and its evolution?
       o RQ2) Distribution across the Web?
       o RQ3) Quality (and how to improve/cleanse/interpret it)?
     • Why is this important?
       o Enables data reuse (knowledge base construction, recommenders, search).
       o Informs vocabulary design (LRMI, schema.org).
  7-8. Data extraction
     • CC: Common Crawl, 2013-2015 (http://commoncrawl.org)
     • WDC: Web Data Commons, 2013-2015: statements/quads extracted from CC (http://webdatacommons.org)
     • LRMI: all quads extracted from WDC/CC which include or co-occur with an LRMI term (according to the LRMI spec)
     • LRMI': extracted from WDC/CC as above, but also considering "common errors" [Meusel et al. 2015]

     |                | 2013                | 2014                | 2015                |
     |----------------|---------------------|---------------------|---------------------|
     | Documents (CC) | 2,224,829,946       | 2,014,175,679       | 1,770,525,212       |
     | URLs (WDC)     | 585,792,337 (26.3%) | 620,151,400 (30.7%) | 541,514,775 (30.5%) |
     | Quads (WDC)    | 17,241,313,916      | 20,484,755,485      | 24,377,132,352      |
     | URLs (LRMI)    | 83,791              | 430,861             | 779,260             |
     | URLs (LRMI')   | 84,098              | 430,895             | 929,573             |
     | Quads (LRMI)   | 9,245,793           | 26,256,833          | 44,108,511          |
     | Quads (LRMI')  | 9,251,553           | 26,258,524          | 69,932,849          |
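The "include or co-occur with an LRMI term" criterion can be sketched as a two-pass filter over N-Quads lines: first collect subjects that carry an LRMI predicate, then keep every quad sharing such a subject. The term list and sample data below are illustrative, and the line parsing is deliberately naive (it would mis-split literals containing whitespace-and-angle-bracket sequences).

```python
# Assumed, abbreviated LRMI term list; the real spec defines more properties.
LRMI_TERMS = {
    "http://schema.org/learningResourceType",
    "http://schema.org/typicalAgeRange",
    "http://schema.org/educationalAlignment",
}

def filter_lrmi_quads(lines):
    # Naive N-Quads split: subject, predicate, rest (object + graph).
    quads = [line.split(None, 2) for line in lines if line.strip()]
    # Pass 1: subjects that carry at least one LRMI predicate.
    lrmi_subjects = {s for s, p, _ in quads if p.strip("<>") in LRMI_TERMS}
    # Pass 2: keep all quads co-occurring with an LRMI term (same subject).
    return [q for q in quads if q[0] in lrmi_subjects]

sample = [
    '_:n1 <http://schema.org/name> "Intro to Fractions" <http://example.org/a> .',
    '_:n1 <http://schema.org/learningResourceType> "lesson" <http://example.org/a> .',
    '_:n2 <http://schema.org/name> "Unrelated page" <http://example.org/b> .',
]
kept = filter_lrmi_quads(sample)
print(len(kept))
```

On the sample input, both quads for `_:n1` are kept (one LRMI quad plus its co-occurring name quad) while the unrelated subject is dropped.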
  9. LRMI distribution across pay-level domains (PLDs)
     • Power-law distribution across approx. 300 PLDs and 4,000 subdomains (2015).
     • The top 10% of contributors provide 98.4% of all quads (2015).
     • Top contributors include sunriseseniorliving.com, simplyfinance.co.uk, menslifestyles.com, audiobooks.com, simplypsychology.org, helles-koepfchen.de, and several adult-content domains (7xxxtube.com, 1amateurporntube.com, virtualpornstars.com).
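The top-decile statistic is straightforward to compute from per-PLD quad counts. The sketch below uses synthetic Zipf-like counts; the 98.4% figure on the slide comes from the real 2015 corpus, so the share computed here will differ.

```python
def top_decile_share(quads_per_pld):
    """Fraction of all quads contributed by the top 10% of PLDs."""
    counts = sorted(quads_per_pld.values(), reverse=True)
    k = max(1, len(counts) // 10)
    return sum(counts[:k]) / sum(counts)

# Synthetic Zipf-like distribution over 300 PLDs (invented numbers).
synthetic = {f"pld{i}": 1_000_000 // (i + 1) for i in range(300)}
share = top_decile_share(synthetic)
print(f"{share:.1%}")
```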
  10. Markup quality (1/2): addressing schema misuse
     • Clustering/classification of unintended uses of LRMI terms:
       o Domain blacklist: recall 96%; roughly 10% of PLDs (0.5% of documents) affected.
       o Clustering of PLDs/resource types (X-Means).
       o Variety of features, in particular related to term adoption.
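One way to act on such clusters is a nearest-centroid classifier over each PLD's LRMI term distribution, flagging domains whose usage pattern resembles the adult-content cluster. The centroids, term set and feature values below are invented for illustration; the paper's actual approach uses X-Means over a richer feature set.

```python
from math import sqrt

# Hypothetical feature order: relative frequency of each term per PLD.
TERMS = ["learningResourceType", "typicalAgeRange", "video", "name"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Invented cluster centroids (NOT from the paper).
CENTROIDS = {
    "learning": [0.4, 0.3, 0.1, 0.2],  # balanced educational term usage
    "adult":    [0.0, 0.7, 0.3, 0.0],  # typicalAgeRange/video dominated
}

def classify(pld_term_freqs):
    """Assign a PLD to the closest centroid by cosine similarity."""
    return max(CENTROIDS, key=lambda c: cosine(pld_term_freqs, CENTROIDS[c]))

print(classify([0.05, 0.6, 0.35, 0.0]))  # prints "adult"
```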
  11. Unintended schema use: term distribution as a clustering feature?
     • Term co-occurrence within markup from top-ranked PLDs ("learning resources in the LRMI sense") vs. term co-occurrence within markup from filtered adult-content PLDs.
  12. Markup quality (2/2): heuristics for fixing frequent errors
     • Heuristics for fixing frequent errors (see Meusel et al., ESWC2015):
       o Wrong namespaces (e.g. "htp:/schema.org"): 501,530 quads in 2015.
       o Undefined types and properties: 1,172,893 quads in 2015.
       o Object properties misused as datatype properties: 10,288,717 quads in 2015.
     • Errors fixed in most PLDs and documents.
     • But: lower error rate in the LRMI corpus than in markup in general (WDC).

     Top-5 undefined types:

     | Rank | Year | Type                 | # Quads | # PLDs |
     |------|------|----------------------|---------|--------|
     | 1    | 2013 | EducationalEvent     | 6004    | 1      |
     | 1    | 2014 | EducationalEvent     | 3047    | 1      |
     | 1    | 2015 | offer                | 100516  | 1      |
     | 2    | 2013 | UserComment          | 20      | 1      |
     | 2    | 2014 | Therapist            | 25      | 1      |
     | 2    | 2015 | headline             | 6724    | 1      |
     | 3    | 2013 | CompetencyObject     | 4       | 1      |
     | 3    | 2014 | UserComment          | 23      | 1      |
     | 3    | 2015 | URL                  | 693     | 1      |
     | 4    | 2013 | Webpage              | 2       | 1      |
     | 4    | 2014 | learningResourceType | 21      | 1      |
     | 4    | 2015 | webpage              | 360     | 1      |
     | 5    | 2013 | about                | 1       | 1      |
     | 5    | 2014 | EducationalEvent     | 19      | 1      |
     | 5    | 2015 | musicrecording       | 296     | 1      |

     • "Strings, not things" (numbers from 2015):
       o 46 million "transversal" quads (i.e. non-hierarchical statements).
       o 64% datatype properties, yet 97% refer to literals (up from 70% in 2013).
     • Issues:
       o Lack of links and controlled vocabularies.
       o Data reuse requires identity resolution.

     Fixed quads/documents/PLDs:

     |         | 2013            | 2014              | 2015              |
     |---------|-----------------|-------------------|-------------------|
     | # quads | 520,815 (5.63%) | 1,601,796 (6.10%) | 6,179,097 (8.84%) |
     | # docs  | 46,382 (55.15%) | 369,772 (85.81%)  | 754,863 (81.21%)  |
     | # PLDs  | 75 (75.76%)     | 154 (67.54%)      | 291 (77.39%)      |
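The namespace-fixing heuristic can be sketched as a regular expression that maps frequent misspellings of the schema.org namespace onto the canonical one. The set of misspellings covered here (missing slashes, "htp"/"htps" typos, https, a leading "www.") is illustrative rather than the exact rule set from Meusel et al.

```python
import re

CANONICAL = "http://schema.org/"

# Matches common broken variants: "htp:/schema.org/", "htps://www.schema.org/",
# "https://schema.org//", etc. (illustrative pattern, not the paper's rule set).
BROKEN_NS = re.compile(r"ht+ps?:/+(?:www\.)?schema\.org/+", re.IGNORECASE)

def fix_namespace(term):
    """Rewrite a term URI so it uses the canonical schema.org namespace."""
    return BROKEN_NS.sub(CANONICAL, term)

print(fix_namespace("htp:/schema.org/Movie"))  # prints http://schema.org/Movie
```

Because such errors tend to be concentrated in a handful of PLDs, a frequency-based pass like this fixes a disproportionate share of the erroneous quads.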
  13. Key findings & implications
     I. Significant growth, but biased term adoption.
       o Growing adoption: 138 M statements in 2016 (48 M in 2015), observable even in a general-purpose crawl (CC).
       o Bias towards simple datatype and generic properties.
       o Implications for data consumption & identity resolution.
     II. Power-law distribution of LRMI markup.
       o Top 10% of contributors provide 98.4% of quads (2015).
       o Enables efficient crawling/extraction of LRMI-specific data (e.g. for building an index or recommender) => focused crawling of the most probable data providers.
     III. Frequent errors.
       o Vast amounts of erroneous statements (80% of PLDs in 2015), yet fewer than in markup in general.
       o Steady increase (total and relative) of errors.
       o Need for data cleansing & fixing: heuristics and frequency-based approaches (erroneous terms usually occur in few PLDs only).
     IV. Unintended use of vocabulary terms.
       o Terms applied in a variety of contexts (e.g. adult content).
       o Not necessarily a schema violation.
       o But: need for further processing (e.g. clustering/classification) when interpreting/using LRMI.
  14. Future work
     • Consumption, reuse & fusion of markup data:
       o Clustering for data cleansing and categorisation (features: e.g. term distribution, PageRank).
       o Supervised data fusion for entity matching and fact verification; related work [ICDE2017, SWJ2017].
       o Augmenting knowledge bases.
     • Vocabulary design:
       o Feed findings into the DCMI task force on LRMI.
       o Bootstrap patterns and terms from actual usage?
       o Wider schema.org question: does this reflect a lack of acceptance of object-object relationships in vocabularies?
     • References:
       o Yu, R., Fetahu, B., Gadiraju, U., Dietze, S.: FuseM: Query-Centric Data Fusion on Structured Web Markup. ICDE 2017. [ICDE2017]
       o Yu, R., Fetahu, B., Gadiraju, U., Lehmberg, O., Ritze, D., Dietze, S.: KnowMore: Knowledge Base Augmentation with Structured Web Markup. Semantic Web Journal, 2017, under review. [SWJ2017]
  15. Contact, data & stats
     • Data: http://lrmi.itd.cnr.it/
     • Contact: @stefandietze | http://stefandietze.net
