Knowledge Graph Engineering

Keynote at Summer School on AI for Industry 4.0 that discusses the benefits of knowledge graphs and some of the challenges when developing Knowledge Graphs. It also gives an overview of some of the tooling that is available to build and maintain Knowledge Graphs.


  1. Knowledge Graph Engineering
     Keynote at Summer School on AI for Industry 4.0
     Armin Haller, Associate Professor, ANU
  2. Knowledge Graphs (KGs)
     “A Knowledge Graph is a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities.” [Hogan et al., 2020]
     • Knowledge graphs are created collaboratively by many users
     • Information can be added in a relatively arbitrary manner as structural constraints are few
     Closed KGs (~2019) [Noy et al., 2019]
       Microsoft   ~2bn entities, ~55bn facts
       Google      ~1bn entities, ~70bn assertions
       Facebook    ~50m entities, ~500m assertions
       eBay        ~1bn triples
       IBM         ~100m entities, 5bn relationships
     Open KGs (April 2021)
       DBpedia     ~4.58m entities, ~9.25GB
       Yago4       ~50m entities, ~18.4GB
       Wikidata    ~93m entities, ~99GB
  3. Knowledge Graphs (KGs)
     • Graphs: a natural way of structuring and presenting knowledge
     • Heterogeneous: knowledge from different sources can be integrated and/or interlinked
     • Schema-later: the schema is often not decided until later, and does not impose integrity constraints
  4. Schema in KGs
     Ontologies as schemas in KGs
     An ontology is an “explicit specification of a conceptualization consisting of a set of objects, and the describable relationships among them” [Gruber, 1993]
     Components of an Ontology
     • Classes: abstract groups (sets) of objects that are defined by properties that all their members share (e.g., Person, Organisation, Event)
     • Attributes: characteristics or parameters that objects (and classes) can have (e.g., date of birth, longitude, latitude, timestamp)
     • Relationships: ways in which classes and individuals can be related to one another (e.g., role, attributed to, observed by)
     • Individuals: concrete objects that are inherent to the domain of discourse, such as specific people, organisations or abstract individuals such as numbers (e.g., g, π)
  5. RDF Knowledge Graphs
     [Figure: ABox (Data) and TBox (Schema) of an RDF knowledge graph. Axis labels: Comprehensive (fewer entities), Limited (many entities), Generic (applies to many), Specific (applies to few). Example labels: Q58043963, Q76, Barack Obama (3,947 axioms), Armin Haller (189 axioms), P361, Q35120, Entity, partOf, minimum no of players, Chess, Person, Q73145133, P1872]
  6. Meta-modelling issues in KGs
     Without enforced (upfront designed) schemas, KGs suffer from, e.g.:
     • Inconsistent modelling of classes/instances
       <Q1412680> <P279> <Q28100368> | <Beef Wellington> <subclass of> <Beef Dish>
       <Q6497852> <P31> <Q28100665> | <Wiener Schnitzel> <instance of> <Veal Dish>
     • Subclassing of disjoint super-classes
       <Q190928> <P279> <Q124282> | <shipyard> <subclass of> <dock>
       <Q190928> <P279> <Q4830453> | <shipyard> <subclass of> <business>
       <Q124282> <P279> <Q7184903> | <shipyard> <subclass of> <abstract object>
       <Q190928> <P279> <Q223557> | <shipyard> <subclass of> <physical object>
     • Instance-of relations between first-order classes
       <Q12156> <P31> <Q12136> | <Malaria> <instance of> <Disease>
       <Q12156> <P279> <Q12136> | <Malaria> <subclass of> <Disease>
     • Redundant/circular inheritance between first-order classes
       <Q18557307> <P279> <Q692536> | <muscle tissue disease> <subclass of> <muscular disease>
       <Q692536> <P279> <Q18557307> | <muscular disease> <subclass of> <muscle tissue disease>
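Two of the issues listed on this slide can be detected mechanically. The following is a minimal sketch in plain Python, assuming triples are held as (subject, predicate, object) tuples of Wikidata identifiers; the function names `mixed_typing` and `subclass_cycles` are illustrative and do not come from the talk.

```python
from collections import defaultdict

SUBCLASS_OF = "P279"  # Wikidata "subclass of"
INSTANCE_OF = "P31"   # Wikidata "instance of"

# Triples copied from the slide.
TRIPLES = [
    ("Q12156", "P31", "Q12136"),        # Malaria instance of Disease
    ("Q12156", "P279", "Q12136"),       # Malaria subclass of Disease
    ("Q18557307", "P279", "Q692536"),   # muscle tissue disease subclass of muscular disease
    ("Q692536", "P279", "Q18557307"),   # muscular disease subclass of muscle tissue disease
    ("Q1412680", "P279", "Q28100368"),  # Beef Wellington subclass of Beef Dish
]


def mixed_typing(triples):
    """Return (subject, object) pairs asserted with both instance-of and subclass-of."""
    inst = {(s, o) for s, p, o in triples if p == INSTANCE_OF}
    sub = {(s, o) for s, p, o in triples if p == SUBCLASS_OF}
    return sorted(inst & sub)


def subclass_cycles(triples):
    """Return classes that can reach themselves by following subclass-of edges."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            parents[s].add(o)
    in_cycle = set()
    for start in list(parents):
        stack, seen = list(parents[start]), set()
        while stack:  # depth-first walk up the subclass hierarchy
            node = stack.pop()
            if node == start:
                in_cycle.add(start)
                break
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
    return sorted(in_cycle)


print(mixed_typing(TRIPLES))
print(subclass_cycles(TRIPLES))
```

On a full Wikidata dump this per-class traversal would be too slow; a real check would more likely run as a query against a triple store.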
  7. Types of Schemas (Ontologies)
     Level of abstraction runs from most general to most specific; reusability runs from highest to lowest.
     • Upper Ontologies, e.g., Cyc, SUMO, DOLCE, BFO
     • Mid-Level Ontologies, e.g., PROV-O, FOAF, ORG, SOSA/SSN, AGRIF
     • Domain Ontologies, e.g., GO, ChEBI, DO, BTO
     • Use-Case Ontologies
     [Haller & Polleres, 2020a]
  8. KG Engineering
     [Figure: the stages KG Creation, KG Linking, KG Curation and KG Usage; activity labels: extract data from existing resources, add instance assertions, add schema assertions]
  9. KG Creation – Develop Schema
     • Top-Down: schema first, data later
     • Bottom-Up: data first, schema later
     • Middle-Out
     [Figure labels: ABox (Data), TBox (Schema)]
  10. KG Creation
      Bottom-Up KG Creation
      • Schema is not defined, and data is added organically and manually using tools such as:
        – OntoWiki [Frischmuth et al., 2015]
        – Semantic MediaWiki [Krötzsch et al., 2006]
        – Wikibase
        – Schímatos [Wright et al., 2020]
      Top-Down KG Creation
      • Schema is created upfront, and existing data is mapped to the schema using languages/tools such as:
        – R2RML
        – SPARQL Generate [Lefrançois et al., 2017]
        – SHACL Rules
        – TARQL
        – NLP/NER from unstructured text
      Middle-Out KG Creation [Sure et al., 2004]
      • Schema is partly defined upfront, with mappings added later when data defines semantics
      • Use case data is provided upfront
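The top-down route, where existing data is mapped onto a schema defined upfront, can be illustrated with a minimal sketch. It is not R2RML, SPARQL Generate or TARQL; it only mimics the idea of a declarative mapping in plain Python. The `http://example.org/` namespace, the `MAPPING` dictionary layout and the `row_to_ntriples` function are assumptions made for illustration.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace of the new KG
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical declarative mapping: how to mint the subject URI,
# which class to assert, and which column feeds which property.
MAPPING = {
    "subject_template": BASE + "person/{id}",
    "class": FOAF + "Person",
    "properties": {"name": FOAF + "name"},
}


def row_to_ntriples(row, mapping):
    """Turn one tabular record (a dict) into N-Triples lines."""
    subject = mapping["subject_template"].format(id=quote(str(row["id"]), safe=""))
    lines = [f"<{subject}> <{RDF_TYPE}> <{mapping['class']}> ."]
    for column, prop in mapping["properties"].items():
        value = row.get(column)
        if value is None or value == "":
            continue  # skip missing cells instead of emitting empty literals
        # Simplified escaping: only backslashes and double quotes are handled.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject}> <{prop}> "{escaped}" .')
    return lines


for line in row_to_ntriples({"id": "42", "name": "Ada"}, MAPPING):
    print(line)
```

Keeping the mapping as data rather than code is the design choice that the listed mapping languages share: the schema side can change without touching the extraction logic.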
  11. KG Curation
      Correctness
      – Evaluation: Accessibility, Accuracy, Consistency, Conciseness, Trustability, Dynamicity, Representationality [Zaveri et al., 2016]
      – Correction: evaluating data quality (SHACL, ShEx)
        • Syntactic errors
        • Semantic errors
      Completeness
      – KG Completion [Paulheim, 2017]: using structural information observed in triples
        • Classification
        • Probabilistic and Statistical Methods
  12. KG Linking
      Linked Data Principles [Berners-Lee, 2006]
      • LDP1: Use URIs as identifiers for things;
      • LDP2: Use HTTP URIs so those identifiers can be dereferenced;
      • LDP3: Return useful information upon dereferencing of those URIs using a standard format (typically, RDF);
      • LDP4: Include links using externally dereferenceable URIs.
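LDP2 and LDP3 together amount to an HTTP request that asks for an RDF serialisation. The sketch below only builds such a request with the standard library and does not send it; the function name `build_rdf_request` and the choice of media types in `RDF_ACCEPT` are assumptions for illustration.

```python
from urllib.parse import urlparse
from urllib.request import Request

# Assumed preference order: Turtle first, RDF/XML as a fallback.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def build_rdf_request(uri):
    """Build (but do not send) a request that dereferences an HTTP URI as RDF."""
    if urlparse(uri).scheme not in ("http", "https"):
        # LDP2: only HTTP(S) URIs can be dereferenced this way.
        raise ValueError(f"not an HTTP URI: {uri}")
    # LDP3: ask the server for a standard RDF format via content negotiation.
    return Request(uri, headers={"Accept": RDF_ACCEPT})


request = build_rdf_request("http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart")
print(request.full_url, request.get_header("Accept"))
```

Passing the returned object to `urllib.request.urlopen` would perform the actual dereferencing; whether a given server honours the `Accept` header depends on the publisher.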
  13. KG Linking
      Linking Issues [Haller et al., 2020b]
      • References to many inaccessible URIs (i.e., broken links) may render a KG largely useless
      • Changes in linked external KGs are out of the control of the KG publisher
      • Previously, there was no definition of what constitutes a “link”, specifically of “internal links” (i.e., links between parts of one coherent KG) and “external links” (i.e., links between different KGs)
        – A triple is a link if it contains a URI in a namespace other than the authoritative namespace URI of the dataset/KG where the triple is defined. [Haller et al., 2020b]
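The link definition on this slide can be read almost literally as a predicate over a triple. The sketch below is a simplification under stated assumptions: terms are plain strings, anything starting with `http://` or `https://` counts as a URI, and "same namespace" is approximated by a string-prefix test. The names `is_uri` and `is_link` are illustrative only.

```python
def is_uri(term):
    """Treat any string starting with an HTTP(S) scheme as a URI (literals are not)."""
    return isinstance(term, str) and term.startswith(("http://", "https://"))


def is_link(triple, authoritative_ns):
    """True if any position of the triple holds a URI outside the authoritative namespace."""
    return any(is_uri(t) and not t.startswith(authoritative_ns) for t in triple)


DBPEDIA = "http://dbpedia.org/"

external = (
    "http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart",
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://www.wikidata.org/entity/Q254",
)
internal = (
    "http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart",
    "http://dbpedia.org/ontology/birthPlace",
    "http://dbpedia.org/resource/Salzburg",
)

print(is_link(external, DBPEDIA), is_link(internal, DBPEDIA))
```

Under this literal reading, a triple whose only foreign URI is its predicate still counts as a link.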
  14. KG Linking – Link Types
      • Ontology links [Haller et al., 2020b]
        – class link: t:[dbo:Person, rdfs:subClassOf, foaf:Person]
        – instance typing link: t:[dbr:Wolfgang_Amadeus_Mozart, rdf:type, foaf:Person]
        – property link: t:[dbr:Wolfgang_Amadeus_Mozart, foaf:name, "Wolfgang Amadeus Mozart"@en]
        – instance role link: t:[dbr:Wolfgang_Amadeus_Mozart, foaf:knows, wd:Q51088] (Antonio Salieri)
      • Instance link: t:[dbr:Wolfgang_Amadeus_Mozart, owl:sameAs, wd:Q254]
  15. KG Linking in the wild
      • Crawl of the LODcloud [Abele et al., 2017] plus historical datasets from the LODcloud that were cached in the LODLaundromat
      • 430 Linked datasets in the resulting corpus, each encoded in HDT, for a total size of 51 GB (3.3bn triples)

                                #       % of total   Available   Available as % of total
      Total # of datasets       1,359   100%
      SPARQL endpoint           459     33.5%        125         9.1%
      Available as download     890     65.4%        226         16.6%

      Characteristic                 Median   Mean
      Number of triples              4,478    17,860,436
      Number of unique subjects      613      1,774,578
      Number of unique predicates    31       65.4%
      Number of unique objects       2,245    5,296,390
  16. KG Linking in the wild (cont’d)
      Class Links (Median 0, Mean 1,299, % above 0: 44%)
        http://vivo.iu.edu               119,538
        http://vivo.scripps.edu          63,128
        http://www.imagesnippets.com     12,874
        http://core.kmi.open.ac.uk       9,143
        http://commons.wikimedia.org     8,258
        http://vivo.psm.edu              8,036
        http://datos.bne.es              2,778
        http://dbpedia.org               1,614
        http://www.productontology.org   1,000
        http://vivoweb.org               84
      Property Links (Median 0, Mean 47, % above 0: 18%)
        http://commons.wikimedia.org     4,995
        http://datos.bne.es              1,255
        http://vivo.iu.edu               510
        http://vivo.psm.edu              481
        http://vivoweb.org               386
        http://vivo.scripps.edu          187
        http://semanticscience.org       168
        http://www.iupac.org             102
        http://dbpedia.org               101
        http://tkm.kiom.re.kr            60
  17. KG Linking in the wild (cont’d)
      Instance Typing Links (Median 206, Mean 1,967,570, % above 0: 97%)
        http://webisa.webdatacommons.org   101,491,507
        http://commons.wikimedia.org       100,022,186
        http://lod.b3kat.de                40,674,519
        http://lod.hebis.de                39,160,423
        http://d-nb.info                   20,096,228
        http://datos.bne.es                7,419,630
        http://data.ordnancesurvey.co.uk   5,653,997
        http://data.europeana.eu           4,987,332
        http://id.loc.gov                  1,570,877
        http://data.bibsys.no              1,440,011
      Instance Links (Median 206, Mean 4,240,890, % above 0: 72%)
        http://ld.zdb-services.de          398,381,851
        http://commons.wikimedia.org       319,988,690
        http://d-nb.info                   14,160,649
        http://data.ordnancesurvey.co.uk   13,277,718
        https://data.gov.cz                3,081,559
        http://core.kmi.open.ac.uk         1,696,618
        http://lod.hebis.de                1,624,579
        http://id.loc.gov                  1,143,545
        http://data.europeana.eu           687,735
        http://spraakbanken.gu.se          451,081
        http://www.imagesnippets.com       214,362
        http://data.coi.cz                 34,277
  18. KG Linking in the wild (cont’d)
      • Selected predicates used in links

                    owl:sameAs   owl:differentFrom   rdfs:seeAlso   owl:AllDifferent
      Median        0            0                   0              0
      Mean          503,859      581                 2,735          0
      % above 0     53%          <1%                 14%            0
      P90%          1,460        0                   1              0

      Top datasets (rank, dataset, #):
      1st   http://commons.wikimedia.org   N/A   40,636,493   103,439   324,659
      2nd   http://ld.zdb-services.de   18,049,155   N/A   http://stitch.cs.vu.nl   N/A
      3rd   http://d-nb.info   17,410,586   N/A   http://data.nobelprize.org   N/A
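The summary rows used on these slides (median, mean, % above 0, P90%) are straightforward to compute from per-dataset link counts. A minimal sketch follows, assuming a nearest-rank definition of the 90th percentile, which may differ from the method used in the underlying study; the sample counts are invented to show how a median of 0 can coexist with a very large mean when a few datasets dominate.

```python
import statistics


def summarize(counts):
    """Summarise per-dataset link counts with the statistics shown on the slides."""
    ordered = sorted(counts)
    n = len(ordered)
    rank = (9 * n + 9) // 10  # ceil(0.9 * n) in integer arithmetic (nearest-rank P90)
    return {
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "pct_above_0": 100 * sum(c > 0 for c in ordered) / n,
        "p90": ordered[rank - 1],
    }


# Invented, heavily skewed sample: most datasets have no links, one has a million.
sample = [0, 0, 0, 0, 0, 0, 2, 5, 40, 1_000_000]
print(summarize(sample))
```

With such skew the mean is driven almost entirely by the largest dataset, which is why the slides report the median and the share of datasets above zero alongside it.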
  19. KG Linking in the wild (cont’d)
      Total Links (Median 416, Mean 6,209,808, % above 0: 96%)
        http://ld.zdb-services.de          421,206,061
        http://commons.wikimedia.org       420,024,129
        http://webisa.webdatacommons.org   101,491,507
        http://lod.hebis.de                40,785,002
        http://lod.b3kat.de                40,677,795
        http://d-nb.info                   34,256,877
        http://data.ordnancesurvey.co.uk   18,931,817
        http://datos.bne.es                7,428,111
        http://data.europeana.eu           5,675,067
        https://data.gov.cz                3,958,043

                     Broken Class URIs                        Broken Property URIs
                     Prefix.cc crawl     LOD corpus           Prefix.cc crawl     LOD corpus
      HTTP Code      #         %         #        %           #       %           #         %
      200            7,175     12.3%     2,579    12.8%       814     44.7%       58,108    40.9%
      301            18,598    31.8%     2,610    12.9%       442     24.3%       1,137     0.8%
      302            4,331     7.4%      925      0.5%        194     10.7%       1,391     1.0%
      303            12,805    21.9%     3,903    19.3%       108     5.9%        5,247     3.7%
      40x            12,054    20.6%     8,664    42.9%       130     7.1%        73,366    51.7%
      50x            66        <0.1%     111      <0.1%       4       <0.1%       362       0.3%
      No response    146,145   5.9%      1,425    7%          129     7.1%        2,332     1.6%
      Total          204,616   100%      20,217   100%        1,821   100%        141,943   100%
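A table of this shape can be produced by bucketing the outcome of each dereferencing attempt under the row labels used above. The sketch below assumes that `40x` and `50x` stand for the whole 4xx and 5xx ranges and that a failed connection is recorded as `None`; the function names `bucket` and `tally` and the extra `other` label are illustrative only.

```python
from collections import Counter


def bucket(status):
    """Map an HTTP status code (None = no response) to a row label of the table."""
    if status is None:
        return "No response"
    if status in (200, 301, 302, 303):
        return str(status)
    if 400 <= status < 500:
        return "40x"
    if 500 <= status < 600:
        return "50x"
    return "other"  # codes the table does not list


def tally(statuses):
    """Return {label: (count, percentage of all checked URIs)}."""
    counts = Counter(bucket(s) for s in statuses)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}


# Invented outcomes for four class URIs.
print(tally([200, 404, 410, None]))
```

The status codes themselves would come from dereferencing each class or property URI, for example with requests built as in the Linked Data Principles discussion.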
  20. KG Linking in the wild – Wikidata
      • Wikidata is by far the largest openly available KG and the only one truly built bottom-up, which is the cause of many modelling errors/inconsistencies
      • It is not part of the LODCloud and was therefore not included in [Haller et al., 2020b]; however, we have since performed the analysis on the 9 March 2020 Wikidata dump (HDT file, 49.4GB compressed)

      Number of triples                  3,381,623,911
      Number of unique subjects          1,327,447,995
      Number of predicates               32,713
      Number of unique objects           2,010,015,636
      Number of shared subject-object    1,173,987,281
      Unique Individuals                 75,261,968
      Class Links                        375,351,770
      Property Links                     2,723,834
        of which sameAs links            2,723,834
      Instance Typing Links              77,479,623
      # of Classes                       1,045,455
      # of Properties                    74,746
      Ratio                              1/14
      # of unique Properties             7,259
  21. KG Linking in the wild (cont’d)
      • Ontologies are reused widely
        – Only a few KGs define their own ontology → a large number of ontologies exist that already cover many domains
      • Ubiquity of broken Class and Property links
        – Alarming number of broken links, i.e., more than half of all class and property URIs
        – Data publishers should consider replicating linked ontologies
      • Lack of Instance Links
        – Many (28% of all) KGs do not use any Instance Links, and owl:sameAs is not particularly popular at all (other than in Wikidata), because:
          1. these links are expensive to establish manually,
          2. they are expensive to maintain, and
          3. even if they exist, there is no incentive to publish them openly.
  22. KG Usage
      • Knowledge Management, Knowledge Discovery
      • Training of ML models with KGs
      • Conversational Agents
        – Q&A
        – Personal Assistants
        – Chatbots
      • Open Data
  23. Building the AGRIF KG
      Australian Government Records Interoperability Framework
      • Address discovery and semantic interoperability needs in Australian Government
      • Combine records/archives/information management with contemporary data science
      • Emphasise business benefit to the creators of information
      • Make sure it does not require an entirely new skillset for everyone involved
      • Build proof-of-concept KG for two use case agencies
  24. Building the AGRIF KG
      [Figure: process steps shown: Develop AGRIF ontology; Map from source metadata to JSON objects; Map from JSON objects to RDFS/OWL; Extract data from unstructured sources using NLP/NER; Adding schema links to external KG; Learning graph shapes from KG; KG Curation (e.g., entity reconciliation); KG Usage]
  25. Building the AGRIF KG – Architecture
      [Figure: architecture components: input files (.pdf, .docx, .msg, .xlsx, .csv, …); Metadata Extractor; NLP/NER-Toolkit; JSON; Document Store (CouchDB); J2RM; RDF; Triple Store (Virtuoso); SHACL Learner; Active Knowledge Graph Completion; Schímatos Platform; KG-I; Protégé; APIs; End User; Domain Expert]
  26. AGRIF KG tools
      • Schema
        – AGRIF Ontology http://reference.data.gov.au/def/ont/agrif
      • Open-source software
        – Metadata Extractor & Loader (MEL)
        – JSON to RDF Mappings (J2RM) [Méndez et al., 2020]
        – SHACLearner [Omran et al., 2020]
        – Schímatos [Wright et al., 2020]
  27. Conclusions
      • Stronger focus on the end user needed
        – Tools/methods needed for creating/maintaining KGs
        – Tools/methods needed to support querying/analysing KG Schemas
      • Improved NLP/NER-based learning techniques needed (distant supervision) that build s-p-o relations from unstructured text [Mintz et al., 2009]
      • Permanent Distributed querying/replication of data/schema
  28. References
      • Hogan, A., et al.: Knowledge Graphs. ACM Computing Surveys (to appear), 2021.
      • Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., Taylor, J.: Industry-scale Knowledge Graphs: Lessons and Challenges. ACM Queue 17(2), 2019.
      • Gruber, T.: A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition 5(2): 199-220, 1993.
      • Frischmuth, P., Martin, M., Tramp, S., Riechert, T., Auer, S.: OntoWiki – An Authoring, Publication and Visualization Interface for the Data Web. Semantic Web 6(3): 215-240, 2015.
      • Krötzsch, M., Vrandečić, D., Völkel, M.: Semantic MediaWiki. The Semantic Web – ISWC 2006.
      • Wright, J., Méndez, S. J. R., Haller, A., Taylor, K., Omran, P. G.: Schímatos: a SHACL-based Web-Form Generator for Knowledge Graph Editing. The Semantic Web – ISWC 2020.
      • Lefrançois, M., Zimmermann, A., Bakerally, N.: A SPARQL Extension for Generating RDF from Heterogeneous Formats. ESWC (1), 2017.
      • Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S.: Quality assessment for linked data: A survey. Semantic Web 7(1): 63-93, 2016.
      • Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8(3): 489-508, 2017.
      • Berners-Lee, T.: Linked Data. W3C Design Issues. URL: http://www.w3.org/DesignIssues/LinkedData.html, 2006.
      • Haller, A., Polleres, A.: Are we better off with just one ontology on the Web? Semantic Web 11(1): 87-99, 2020a.
      • Sure, Y., Staab, S., Studer, R.: On-To-Knowledge Methodology (OTKM). Handbook on Ontologies, pp. 117-132, 2004.
      • Haller, A., Fernández, J. D., Kamdar, M. R., Polleres, A.: What Are Links in Linked Open Data? A Characterization and Evaluation of Links between Knowledge Graphs on the Web. ACM J. Data Inf. Qual. 12(2): 9:1-9:34, 2020b.
      • Abele, A., McCrae, J. P., Buitelaar, P., Jentzsch, A., Cyganiak, R.: Linking open data cloud diagram. URL: http://lod-cloud.net. Insight-Centre, 2017.
      • Méndez, S. J. R., Haller, A., Omran, P. G., Wright, J., Taylor, K.: J2RM: An ontology-based JSON-to-RDF Mapping tool. ISWC (Demos/Industry), 2020.
      • Omran, P. G., Taylor, K., Méndez, S. J. R., Haller, A.: Towards SHACL Learning from Knowledge Graphs. ISWC (Demos/Industry), 2020.
      • Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL ’09), 2009.
