Semantic similarity for faster Knowledge Graph delivery at scale

Knowledge graphs promise a novel platform for better holistic decision making and analytics. Many projects fail to reach their full potential because of the prohibitively high cost of integrating new knowledge from the required information sources.

The talk explains how semantic similarity, computed from graph and text embeddings, supports efficient entity clustering and matching. It demonstrates Random Indexing, the underlying algorithm, which is scalable and easy to understand.

This work is part of the Ontotext Platform, which increases productivity in developing and maintaining large-scale knowledge graphs. The platform enables enterprises to build and operate mission-critical systems for decision support, information discovery and metadata management on top of such graphs.


  1. making sense of text and data
     October 2019, Connected Data London
     Semantic Similarity for Faster Knowledge Graph Delivery at Scale
  2. Why Knowledge Graphs?
     “Cross-industry studies show that on average, less than half of an organization’s structured data is actively used in making decisions—and less than 1% of its unstructured data is analyzed or used at all”
     Source: “What’s Your Data Strategy?”, Leandro DalleMule and Thomas H. Davenport, Harvard Business Review
     Top 5 USA Banks
  3. Presentation Outline
     Enterprise Knowledge Graphs
     Smart Graphs with Embeddings
     Implementing Knowledge Graphs
  4. What is a Knowledge Graph?
     Graph, Semantics, Smart, Alive
  5. Multiple Enterprise Data Management Systems
     KG platforms combine capabilities of several enterprise systems:
     o Master and reference data management
     o Corporate/Enterprise Taxonomy
     o Data warehouse
     o Metadata management
     o Digital asset management
     o Enterprise search
  6. Challenges in Enterprise Semantic Integration

     IMDB
     Type                 Titles
     TV Episodes          4’044’529
     Short film           681’067
     Feature film         516’726
     Video                164’061
     TV series            164’061
     TV movies            126’206
     …                    …
     Total *              5’838’514

     WikiData
     Type                 Titles
     film                 235’707
     silent short film    16’377
     television film      15’345
     short film           11’225
     animated film        3’785
     …                    …
     Total                289’650

     * Later the tests use only 5K crawled datasets
  7. Challenges in Enterprise Semantic Integration
     Multiple levels of inconsistencies:
     o Types: film vs. “TV movie”
     o Meta-data: “science fiction”, “military science fiction” vs. “Sci-Fi”
     o Reference data: “US” vs. “United States”
     o Manually curated cross-links (!) for testing purposes only
  8. A Classical Approach
     o Start with string matching of the titles:
       “Harry Potter and the Deathly Hallows: Part II” vs. “Harry Potter and the Deathly Hallows – Part 2”
       “Perfume: The Story of a Murderer” vs. “Perfume”
       “Pirate Radio” vs. “The Boat That Rocked”
       “Avatar” vs. “Avatar” (4 movies)
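A minimal sketch of the title-matching step, using three of the slide's title pairs. It assumes a simple normalizer (lowercasing, punctuation removal, a small Roman-numeral table) and Python's `difflib.SequenceMatcher`; the names `normalize_title`, `title_similarity` and `ROMAN_TO_DIGIT` are illustrative and not part of the talk.

```python
import re
from difflib import SequenceMatcher

# Illustrative mapping only; a real pipeline would need a fuller normalizer.
ROMAN_TO_DIGIT = {"ii": "2", "iii": "3", "iv": "4"}


def normalize_title(title: str) -> str:
    """Lowercase, turn punctuation into spaces, map a few Roman numerals to digits."""
    # Note: this ASCII-only pattern also drops accented letters.
    text = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return " ".join(ROMAN_TO_DIGIT.get(token, token) for token in text.split())


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


pairs = [
    ("Harry Potter and the Deathly Hallows: Part II",
     "Harry Potter and the Deathly Hallows – Part 2"),
    ("Perfume: The Story of a Murderer", "Perfume"),
    ("Pirate Radio", "The Boat That Rocked"),
]
for left, right in pairs:
    print(f"{title_similarity(left, right):.2f}  {left!r} vs {right!r}")
```

Normalization recovers the first pair, but alternative release titles such as "Pirate Radio" and "The Boat That Rocked" share almost no characters, so string matching alone cannot link them.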
  9. A Classical Approach with extra Rules
     o Add release date matching
       Lose 10% of the matches due to bad dates
     o Ambiguity is greatly reduced, but many ambiguous cases remain:
       tt0238520    16 October 1995    50 min
       tt1125875    11 April 1995      48 min
       tt0238520    23 June 1995       1h 21 min
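The extra rule can be sketched as follows, assuming release dates arrive as strings like "16 October 1995". The `Film` record, `parse_release` and `same_film` are hypothetical helpers, and the title is a placeholder because the slide lists only ids, dates and durations.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Film:
    film_id: str
    title: str
    released: Optional[date]  # None when the source date is missing or malformed


def parse_release(text: str) -> Optional[date]:
    """Parse dates written like '16 October 1995'; return None for bad input."""
    try:
        # %B matches English month names under the default C locale.
        return datetime.strptime(text.strip(), "%d %B %Y").date()
    except ValueError:
        return None


def same_film(a: Film, b: Film) -> bool:
    """Rule: titles agree (case-insensitively) and both release dates agree."""
    if a.title.casefold() != b.title.casefold():
        return False
    if a.released is None or b.released is None:
        return False  # a bad date silently drops an otherwise valid match
    return a.released == b.released


TITLE = "Some Title"  # placeholder, the slide does not give the titles
left = Film("tt0238520", TITLE, parse_release("16 October 1995"))
right = Film("tt1125875", TITLE, parse_release("11 April 1995"))
broken = Film("hypothetical-id", TITLE, parse_release("1995-??-??"))

print(same_film(left, left))    # same record
print(same_film(left, right))   # same title, different dates
print(same_film(left, broken))  # unparsable date
```

The last case illustrates how a strict date rule loses matches whenever one source carries a bad date.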
  10. Presentation Outline
      Enterprise Knowledge Graphs
      Smart Graphs with Embeddings
      Implementing Knowledge Graphs
  11. What is Knowledge Graph Embedding?
      o Predicts similar graph nodes or properties
      o Requires no input training data
      o Mathematical representation of graph nodes as vectors
      [Figure: The Godfather (2h 58m) vs. American Pie (1h 15 min), with axes duration, drama, comedy]
  12. Knowledge Graph Embedding Example
      o For each film include all actors, director, country of origin
      o Vast matrix with entities and literals

      Document-term matrix (rows: documents, columns: terms)
      Terms: [Actor] “Adam LeFevre”, [Actor] “Anthony Anderson”, [Actor] “Mia Farrow”, [Country] “France”, [Country] “US”, [Country] “United states”, [Director] “Luc Besson”, …
      Documents:
        wd:Q550232        1 1 1 1 1
        imdb:tt0344854    1 1 1 1
        ...               …
  13. Random Indexing (RI) Algorithm
      o Reduces the matrix dimension with elemental vectors
        For each term w, calculate a context vector S(w) by summing the elemental (index) vectors of all x appearing in the context of w
      o Light-weight and fast (250K x 1.45M matrix in < 5m)
      o Fast sub-second searches with limited RAM requirements

      Actors
      Movie             Adam LeFevre    Anthony Anderson    Mia Farrow
      wd:Q550232        1               1                   1
      imdb:tt0344854    1               0                   1
      ...               …               …                   …
      (replaced by elemental vectors)
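A minimal sketch of one common Random Indexing variant, built on the two films and three actors from the slide. It assumes that each film gets a sparse random elemental vector, that a term's context vector S(w) is the sum of the elemental vectors of the films it appears in, and that a film vector is the sum of its terms' context vectors. `DIM`, `NONZERO` and all function names are illustrative choices, not values from the talk or from any particular library.

```python
import random

DIM = 64      # reduced dimension, illustrative
NONZERO = 6   # number of +1/-1 entries per elemental vector, illustrative


def elemental_vector(key: str, dim: int = DIM, nonzero: int = NONZERO) -> list[int]:
    """Sparse random vector of 0, +1, -1; seeded by key so runs are repeatable."""
    rng = random.Random(key)
    vec = [0] * dim
    for n, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if n % 2 == 0 else -1
    return vec


def add(a: list[int], b: list[int]) -> list[int]:
    """Element-wise sum of two vectors of equal length."""
    return [x + y for x, y in zip(a, b)]


def train(doc_terms: dict[str, list[str]]):
    """Return (term context vectors, document vectors), all of length DIM."""
    doc_elemental = {doc: elemental_vector(doc) for doc in doc_terms}
    term_vectors: dict[str, list[int]] = {}
    for doc, terms in doc_terms.items():
        for term in terms:
            current = term_vectors.get(term, [0] * DIM)
            term_vectors[term] = add(current, doc_elemental[doc])
    doc_vectors = {}
    for doc, terms in doc_terms.items():
        vec = [0] * DIM
        for term in terms:
            vec = add(vec, term_vectors[term])
        doc_vectors[doc] = vec
    return term_vectors, doc_vectors


films = {
    "wd:Q550232": ["Adam LeFevre", "Anthony Anderson", "Mia Farrow"],
    "imdb:tt0344854": ["Adam LeFevre", "Mia Farrow"],
}
term_vectors, doc_vectors = train(films)
print(len(term_vectors["Adam LeFevre"]), len(doc_vectors["wd:Q550232"]))
```

Every vector has the fixed length `DIM` regardless of how many films or terms exist, which is where the dimension reduction comes from.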
  14. Random Indexing (RI) Algorithm #2
      o Supports similarity searches for:
        Document to Document: similar movies
        Document to Term: specific actor/director
        Term to Term: similar actors/directors
        Term to Document: find movies specific to this actor/director
      o Features all properties of a Vector Space model
      o Partial matching, weights, ranking + context-sensitive semantic search

      Actors
      Movie             Adam LeFevre    Anthony Anderson    Mia Farrow
      wd:Q550232        1               1                   1
      imdb:tt0344854    1               0                   1
      ...               …               …                   …
      (replaced by elemental vectors)
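The search modes all reduce to ranking vectors by similarity. The sketch below assumes cosine similarity and, for brevity, works directly on the rows and columns of the small Actors matrix from the slide rather than on reduced vectors; `cosine` and `nearest` are illustrative names. It covers only the document-to-document and term-to-term modes, since the mixed modes need term and document vectors in one shared space.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: str, vectors: dict[str, list[float]], k: int = 3):
    """Rank every other key by cosine similarity to the query key."""
    scored = [(cosine(vectors[query], vec), key)
              for key, vec in vectors.items() if key != query]
    return sorted(scored, reverse=True)[:k]


actors = ["Adam LeFevre", "Anthony Anderson", "Mia Farrow"]
# Rows of the matrix: one vector per document (film).
documents = {"wd:Q550232": [1, 1, 1], "imdb:tt0344854": [1, 0, 1]}
# Columns of the matrix: one vector per term (actor).
terms = {actor: [row[i] for row in documents.values()]
         for i, actor in enumerate(actors)}

print(nearest("wd:Q550232", documents))  # document to document
print(nearest("Adam LeFevre", terms))    # term to term
```

With Random Indexing, term and document vectors live in the same reduced space, so the same ranking function could also serve the document-to-term and term-to-document modes.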
  15. Presentation Outline
      Enterprise Knowledge Graphs
      Smart Graphs with Embeddings
      Implementing Knowledge Graphs
  16. Reference Software Architecture
      o Easy consumption of data
      o No backend development
      o Flexible data processing tools
      o Standard and open interfaces
      [Diagram labels: KG Consumers, Ontotext Platform, GraphDB, GQL query, GQL mutation, GQL Federation, SPARQL, RDF / Structured data, Similarity Plugin]
  17. Transform CSV to RDF
      o Perform standard ETL tasks
      o Trim spaces, parse numbers and dates
      o Parse IMDB ids from links for testing
      o Map table data to RDF
      o SPARQL over tabular data
      o Split multi-valued fields like “Action|Thriller”
      o Schema-level alignment not yet applied
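A minimal sketch of these ETL steps using only the Python standard library, emitting N-Triples-style lines. The column names (`title`, `genres`, `link`, `duration`), the `http://example.org/imdb/` namespace and the sample row are all hypothetical; the talk's actual tooling maps tabular data with SPARQL, which this example does not reproduce.

```python
import csv
import io
import re

NS = "http://example.org/imdb/"  # hypothetical namespace
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def imdb_id(link: str):
    """Extract an id such as 'tt0344854' from an IMDB title link, else None."""
    match = re.search(r"/title/(tt\d+)", link)
    return match.group(1) if match else None


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def row_to_triples(row: dict[str, str]) -> list[str]:
    """Turn one CSV row into N-Triples-style lines."""
    film = imdb_id(row["link"])
    if film is None:
        return []  # rows without a usable id are skipped
    subject = f"<{NS}{film}>"
    triples = [f"{subject} <{NS}title> {literal(row['title'].strip())} ."]
    for genre in row["genres"].split("|"):  # multi-valued field
        if genre.strip():
            triples.append(f"{subject} <{NS}genre> {literal(genre.strip())} .")
    minutes = row["duration"].strip()
    if minutes.isdigit():
        triples.append(f'{subject} <{NS}duration> "{int(minutes)}"^^<{XSD_INTEGER}> .')
    return triples


# Hypothetical sample row, with stray spaces and a multi-valued genre field.
SAMPLE = (
    "title,genres,link,duration\n"
    "  Example Film ,Action|Thriller,http://www.imdb.com/title/tt0000000/?ref=x, 120 \n"
)
triples = [t for row in csv.DictReader(io.StringIO(SAMPLE)) for t in row_to_triples(row)]
print("\n".join(triples))
```

Splitting the genre field yields one triple per value, which is what lets each genre act as a separate term when the graph is indexed later.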
  18. Similarity Plugin API

      subject           predicate    object
      wd:Q550232        :actor       “Adam LeFevre”
      imdb:tt0344854    :actor       “Adam LeFevre”
      …                 …            …

      o Accepts a graph described by <s, p, o>
      o Indexes any RDF types
      o Works with virtual overlays
      [Diagram labels: “Adam LeFevre”, imdb:tt0344854, wd:Q550232, “Adam LeFevre”, wd:Q2702 964, rdfs:label, wdt:P161, imdb:actor_2_name]
  19. Specify KG Embeddings – Select Predicates
      o Similarity plugin expects triples <s, p, o>
  20. Specify KG Embeddings – Align Schema
      o Set a translation table of the predicates
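Predicate selection and the translation table can be pictured as a rewrite over triples, sketched below. The mapping of `wdt:P161` and `imdb:actor_2_name` to a shared `:actor` predicate follows the predicates named on the Similarity Plugin API slide. The `align` function, the `labels` lookup and the `wd:HYPOTHETICAL_ACTOR` id are assumptions made for illustration, and this is not the plugin's actual configuration syntax.

```python
from typing import Iterable, Iterator, Optional

Triple = tuple[str, str, str]

# Source predicates on the left, the shared predicate used for indexing on the right.
TRANSLATION = {
    "wdt:P161": ":actor",
    "imdb:actor_2_name": ":actor",
}


def align(triples: Iterable[Triple],
          table: dict[str, str] = TRANSLATION,
          labels: Optional[dict[str, str]] = None) -> Iterator[Triple]:
    """Keep only selected predicates, rename them, and resolve objects to labels."""
    names = labels or {}
    for subject, predicate, obj in triples:
        if predicate in table:  # predicates outside the table are not indexed
            yield subject, table[predicate], names.get(obj, obj)


source = [
    ("wd:Q550232", "wdt:P161", "wd:HYPOTHETICAL_ACTOR"),      # hypothetical actor id
    ("wd:Q550232", "rdfs:label", "hypothetical film label"),  # not selected
    ("imdb:tt0344854", "imdb:actor_2_name", "Adam LeFevre"),
]
labels = {"wd:HYPOTHETICAL_ACTOR": "Adam LeFevre"}

aligned = list(align(source, labels=labels))
for triple in aligned:
    print(triple)
```

After alignment both sources describe the actor with the same predicate and literal, so the two film resources share a term when they are indexed.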
  21. Results
      o Find similar RDF resources to “Pirate Radio”
      o Even a limited set of predicates returns acceptable results
      o Important independent alternative for entity matching
  22. Important Design Considerations
      o Prefer RDF over Property Graph
      o Much richer technology ecosystem (schema, dataset, reasoning, strings vs. things)
      o Virtualization versus Consolidation
      o Virtualization works only for simple lookup queries, not for real data integration
      o Push result federation to the GraphQL data consumption layer
      o Integrate Random Indexing in the KG database
      o Push heavy computation as close as possible to the data
      o Choose GraphQL over SPARQL for app developers:
  23. Questions & Answers
