Extracting, Aligning, and
Linking Data to Build
Knowledge Graphs
Craig Knoblock
University of Southern California
Thanks to my collaborators: Pedro Szekely, Linhong Zhu, Majid
Ghasemi-Gol, Mohsen Taheriyan, Minh Pham, and Steve Minton
Goal
USC Information Sciences Institute CC-By 2.0 2
raw, messy, disconnected (hard to query, analyze & visualize)
clean, organized, linked (easy to query, analyze & visualize)
Use Case: Human Trafficking
100 million pages
~ 100 Web sites
help victims
prosecute traffickers
Example: Investigating a Reported Victim
San Diego, where else?
DIG Interface: Find the locations where a
potential victim was advertised
Steps To Build a KG
- Data Acquisition: crawling
- Feature Extraction: extraction
- Feature Alignment: mapping to ontology
- Entity Resolution: entity linking & similarity
- Graph Construction: knowledge graph deployment
- User Interface: query & visualization
Resources shown in the diagram: ElasticSearch, graph DB, schema.org, geonames
Data Acquisition
downloading relevant data
- batch, real-time
- Web pages, Web service, database
- CSV, Excel, XML, JSON
Traditional Web Crawler
(e.g., Nutch, Scrapy)
Web Crawling
24/7
5,000 pages/hour
~100,000,000 pages total
Feature Extraction
from raw sources to structured data
• extraction from text
• extraction from structured Web pages
• extraction of image features
Extraction
Structured Extraction
Automated Extraction
[Minton et al., Inferlink]
• Title
• Description
• Seller
• Post Date
• Expiry Date
• Price
• Location
• Category
• Member Since
• Num Views
• Post ID
Automated Extraction
- input: a pile of pages
- Classify by Templates: pages clustered by template
- Infer Extractor, once per template cluster: extractor
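The slides do not say how pages are classified by template. A minimal sketch, assuming pages generated from the same template share most of their HTML tag paths: each page is reduced to a set of tag paths, and pages are grouped greedily by Jaccard similarity. The names `signature` and `cluster_by_template`, the sample pages, and the 0.8 threshold are illustrative and are not taken from the Inferlink tool.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def signature(html):
    """Structural signature of a page: the set of its tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_by_template(pages, threshold=0.8):
    """Greedy clustering: a page joins the first cluster whose
    representative signature is similar enough, else starts a new one."""
    clusters = []  # list of (representative signature, [page indices])
    for i, html in enumerate(pages):
        sig = signature(html)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((sig, [i]))
    return [members for _, members in clusters]


# Hypothetical pages: two share a layout, the third uses a table layout.
pages = [
    "<html><body><h1>A</h1><div><p>x</p></div></body></html>",
    "<html><body><h1>B</h1><div><p>y</p></div></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
]
clusters = cluster_by_template(pages)
```

Each resulting cluster would then be handed to a separate extractor-inference step, which this sketch does not attempt.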
Unsupervised Extraction Tool
Pretty Good Extractions
 | Want | Extracted
Extra | Jan. 23, 2015 | Jan. 23, 2015 expires Feb
Partial | Jan. 23, 2015 | Jan. 23
Extraction Evaluation
10 websites, 5 pages each
Field | Perfect | Pretty Good
Title | 1.0 (50/50) | 1.0 (50/50)
Desc | .76 (37/49) | .98 (48/49)
Seller | .95 (40/42) | .95 (40/42)
Date | .83 (40/48) | .83 (40/48)
Price | .87 (39/45) | .98 (44/45)
Loc | .51 (23/45) | .84 (38/45)
Cat | .68 (34/50) | .88 (44/50)
Member Since | 1.0 (35/35) | 1.0 (35/35)
Expires | .52 (15/29) | .55 (16/29)
Views | .76 (19/25) | 1.0 (25/25)
ID | .97 (35/36) | 1.0 (36/36)
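The difference between the "Perfect" and "Pretty Good" scores can be sketched as a grading function. This is an interpretation of the Extra and Partial examples above, not the evaluation code used for the slides: an extraction is "extra" if it contains the wanted value plus other text, "partial" if it is a piece of the wanted value, and "pretty good" is assumed to accept perfect, extra, and partial results. The function names and the sample pairs are hypothetical.

```python
def grade_extraction(wanted, extracted):
    """Classify one extracted value against the wanted value."""
    wanted, extracted = wanted.strip(), extracted.strip()
    if extracted == wanted:
        return "perfect"
    if wanted and wanted in extracted:
        return "extra"      # wanted value plus surrounding text
    if extracted and extracted in wanted:
        return "partial"    # only a piece of the wanted value
    return "wrong"


def score(pairs, pretty_good=False):
    """Fraction of (wanted, extracted) pairs that are accepted."""
    accepted = {"perfect", "extra", "partial"} if pretty_good else {"perfect"}
    grades = [grade_extraction(w, e) for w, e in pairs]
    return sum(g in accepted for g in grades) / len(grades)


# The first three pairs mirror the date examples on the slide;
# the last one is a made-up failed extraction.
pairs = [
    ("Jan. 23, 2015", "Jan. 23, 2015"),
    ("Jan. 23, 2015", "Jan. 23, 2015 expires Feb"),
    ("Jan. 23, 2015", "Jan. 23"),
    ("Jan. 23, 2015", ""),
]
perfect_score = score(pairs)
pretty_good_score = score(pairs, pretty_good=True)
```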
Feature Alignment
from multiple schemas to a common domain schema
- CSV, Excel
- Database tables
- Web services
- Extractors
- Nomenclature
- Spelling
Multiple Schemas
Karma: Mapping Data to Ontologies
[Figure: Karma maps services, relational sources, and hierarchical sources to the Schema.org ontology and outputs JSON-LD]
Semantic Labeling
[Pham et al., ISWC’16]
Offer Place Person
name price idname
Offer
Column-1 | Column-2 | Column-3 | Column-4
British Lee-Enfield No 4 MK 2 still … | 1,000 | | 68155c13de2f2532
Cabelas Millenium Revolver in .45 colt | 700 | 1711 Anderson Rd | 12155a1a2938bc1e
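Once the columns carry semantic labels, each row can be emitted as a JSON-LD object using schema.org terms. The sketch below is not Karma's API or output format. It assumes a hand-written mapping in which Column-1 is the offer name, Column-2 the price, Column-3 a street address, and Column-4 an identifier. The function name `row_to_jsonld` is hypothetical.

```python
import json


def row_to_jsonld(row):
    """Map one source row to a schema.org Offer (assumed column mapping)."""
    doc = {
        "@context": "http://schema.org",
        "@type": "Offer",
        "name": row["Column-1"],
        "price": float(row["Column-2"].replace(",", "")),
        "identifier": row["Column-4"],
    }
    # Only attach a Place when the source row actually has an address.
    if row.get("Column-3"):
        doc["availableAtOrFrom"] = {
            "@type": "Place",
            "address": {
                "@type": "PostalAddress",
                "streetAddress": row["Column-3"],
            },
        }
    return doc


rows = [
    {"Column-1": "British Lee-Enfield No 4 MK 2 still …",
     "Column-2": "1,000", "Column-3": "", "Column-4": "68155c13de2f2532"},
    {"Column-1": "Cabelas Millenium Revolver in .45 colt",
     "Column-2": "700", "Column-3": "1711 Anderson Rd",
     "Column-4": "12155a1a2938bc1e"},
]
docs = [row_to_jsonld(r) for r in rows]
serialized = json.dumps(docs[1], indent=2)
```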
Learning Semantic Types
Requirements:
- Learn from a small number of examples
- Distinguish both string and numeric values
- Can be learned quickly and is highly scalable to large numbers of semantic types
Domain Ontology: Person (name, birthdate), Organization (name), City (name), State (name)
 | name | date | city | state | workplace
1 | Fred Collins | Oct 1959 | Seattle | WA | Microsoft
2 | Tina Peterson | May 1980 | New York | NY | Google
Learning Semantic Types
Textual Data:
- Treat each column of data as a document
- Apply TF-IDF Cosine Similarity
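A minimal sketch of the textual case, using only the Python standard library: every labeled column becomes one document, and a new column is ranked against those documents by TF-IDF cosine similarity. The tokenizer, the smoothed IDF formula, the label strings such as `City.name`, and the sample values are assumptions for illustration. The paper's exact weighting may differ.

```python
import math
import re
from collections import Counter


def tokenize(values):
    """Turn all cell values of a column into one bag of lowercase tokens."""
    return [t for v in values for t in re.findall(r"[a-z0-9]+", v.lower())]


def tfidf_vectors(docs):
    """docs: {label: tokens}. Returns ({label: {token: weight}}, idf)."""
    n = len(docs)
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vecs = {
        label: {t: c * idf[t] for t, c in Counter(toks).items()}
        for label, toks in docs.items()
    }
    return vecs, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_labels(labeled_columns, new_values):
    """Rank known semantic labels for an unlabeled column, best first."""
    docs = {label: tokenize(vals) for label, vals in labeled_columns.items()}
    vecs, idf = tfidf_vectors(docs)
    counts = Counter(tokenize(new_values))
    # Tokens never seen in training get weight 0.
    query = {t: c * idf.get(t, 0.0) for t, c in counts.items()}
    scores = {label: cosine(query, vec) for label, vec in vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


labeled = {
    "Person.name": ["Fred Collins", "Tina Peterson"],
    "City.name": ["Seattle", "New York"],
    "Organization.name": ["Microsoft", "Google"],
}
ranked = rank_labels(labeled, ["Seattle", "Boston", "New Haven"])
```

Returning a ranked list rather than a single label matches the way the results are reported later, with top-4 and MRR figures.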
Learning Semantic Types
Numeric Data:
- Apply statistical hypothesis testing to determine which distribution fits best
- Apply Kolmogorov-Smirnov Test
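For numeric columns, the two-sample Kolmogorov-Smirnov statistic is the largest gap between two empirical distribution functions. The sketch below computes only that statistic and picks the labeled column with the smallest gap. It does not compute a p-value, so it is a simplification of the hypothesis test named above. The labels and sample values are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def closest_numeric_type(labeled_columns, new_values):
    """Label whose training values are distributed most like new_values."""
    return min(
        labeled_columns,
        key=lambda label: ks_statistic(labeled_columns[label], new_values),
    )


# Hypothetical training columns and an unlabeled column.
labeled = {
    "Offer.price": [1000, 700, 450, 1200, 300],
    "Person.birthYear": [1959, 1980, 1975, 2001, 1990],
}
unknown = [650, 900, 1100]
best = closest_numeric_type(labeled, unknown)
```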
Features for
Semantic Labeling
• Features
– KS = Kolmogorov-Smirnov
– MW = Mann-Whitney
Combining the Features for
Semantic Labeling
Automatically Assigned Semantic Labels
Offer name | CreativeWork fragment | Offer description | Offer identifier | Offer datePosted | CreativeWork fragment
35 Whelen Handi-Rifle | No Tags | 35 Whelen Handi-rifle. Black synthetic stock/forearm, blued barrel. Text 601-813-7280 …. | 245625390711756 | October 19, 2015 12:43 pm |
Cabelas Millenium Revolver in .45 colt | No Tags | This single action is built to shoot and is a great way for any level of shooter to get involved with a single action. … | 12155a1a2938bc1e | July 11, 2015 5:17 pm | 1711 Anderson Rd
swap stocks | No Tags | want to trade butler creek folding stock for black stock ruger mini stock folder by butler creek will swap even for full rifle stock …. | 5815600fd181fe3b | September 22, 2015 1:05 am | white
streetAddress does not appear in training data -> more similar to noisy data
Results on www.msguntrader.com
number of attributes 19
Correct prediction 16
Correct label is in the top 4 predictions 18
Accuracy 84%
MRR 89%
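Accuracy, top-k rate, and mean reciprocal rank (MRR) are all computed from the rank at which the correct label appears for each attribute. The per-attribute ranks are not given in the slides. The list below is hypothetical, chosen only so that 16 of 19 attributes are ranked first and 18 of 19 fall in the top 4, as reported above.

```python
def accuracy(ranks):
    """Share of attributes whose correct label is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def top_k_rate(ranks, k):
    """Share of attributes whose correct label is within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank; a missing label (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for 19 attributes (not taken from the evaluation).
ranks = [1] * 16 + [2, 3, 6]

acc = accuracy(ranks)
top4 = top_k_rate(ranks, 4)
mrr = mean_reciprocal_rank(ranks)
```

With these assumed ranks, accuracy is 16/19 and MRR is 17/19, which round to the 84% and 89% shown on the slide. Other rank lists could produce the same summary figures.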
Results on Gun Sites
Evaluation Dataset
Average number of attributes 18
Total number of attributes 176
Correct prediction (Accuracy) 56%
Correct label is in the top 4 predictions 89%
MRR 70%
Entity Resolution
merging records that refer to the same entity
techniques to address:
- missing data
- incorrect data
- scale (~100 million records)
Unsupervised Collective Entity Resolution
same victim
same trafficker
Collective Entity Resolution
[Zhu et al, ISWC’16]
Identifying and linking instances of the same real world entity
[Figure: Multi-Type Graph connecting product nodes (Product 1 to Product 5) to their price, description, and manufacturer nodes]
Common Approach: Pairwise Comparisons
Product | Price | Title | Manufacturer
Product 5 | 299 | Quiet Comfort 25 Noise Cancelling Headphone | Bose Electronic
Product 4 | 299, 229 | Bose Noise Cancelling Headphones | Bose
Product 3 | 599 | Dish Washer | Bosch
Product 2 | 292 | Premium Noise Cancelling Headphones | Sony
Product 1 | | Noise Cancelling Headphones | Sony
Similarity measures: Jaro 0.5, distance 0.2, Jaccard 0.3
Acceptance Threshold: 0.8
Challenges for pairwise comparisons:
- Missing Values
- Multiple Values
- Weights
- Unidirectional
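A pairwise matcher of this kind can be sketched as a weighted sum of per-field similarities compared against the acceptance threshold. Several things here are assumptions. The values 0.5, 0.2, and 0.3 are read as weights, and their pairing with manufacturer, price, and title is a guess. `difflib.SequenceMatcher` stands in for Jaro similarity, which the standard library does not provide. Prices are single numbers, so missing and multiple values are not handled.

```python
from difflib import SequenceMatcher

# Assumed pairing of the slide's numbers with fields.
WEIGHTS = {"manufacturer": 0.5, "price": 0.2, "title": 0.3}
THRESHOLD = 0.8


def string_sim(a, b):
    """Character-level similarity in [0, 1] (stand-in for Jaro)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def price_sim(a, b):
    """1 minus the relative price difference."""
    return 1.0 - abs(a - b) / max(a, b)


def jaccard_sim(a, b):
    """Token-set overlap of two titles."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def match_score(r1, r2):
    return (
        WEIGHTS["manufacturer"] * string_sim(r1["manufacturer"], r2["manufacturer"])
        + WEIGHTS["price"] * price_sim(r1["price"], r2["price"])
        + WEIGHTS["title"] * jaccard_sim(r1["title"], r2["title"])
    )


def is_match(r1, r2):
    return match_score(r1, r2) >= THRESHOLD


p5 = {"title": "Quiet Comfort 25 Noise Cancelling Headphone",
      "manufacturer": "Bose Electronic", "price": 299}
p4 = {"title": "Bose Noise Cancelling Headphones",
      "manufacturer": "Bose", "price": 299}  # one of its two listed prices
p3 = {"title": "Dish Washer", "manufacturer": "Bosch", "price": 599}

score_54 = match_score(p5, p4)
score_53 = match_score(p5, p3)
```

Under these assumptions the two headphone records score higher than the headphone and dishwasher pair, but still fall below the 0.8 threshold.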
Graph Summarization: Original Graph
[Figure: multi-type graph of Products 1 to 5 with their price, description, and manufacturer nodes]
Similar Nodes simt(x, y)
[Figure: the same multi-type graph]
Graph Summarization: Super-Nodes
Description nodes:
- Quiet Comfort 25 Noise Cancelling Headphone
- Noise Cancelling Headphones
- Premium Noise Cancelling Headphones
- Dish Washer
- Bose Noise Cancelling Headphones
Super-nodes Ct(x): probability that a node x belongs to each super-node (one matrix Ct for each type)
0.7 0.2 0.1
0.7 0.2 0.1
0.2 0.7 0.1
0.2 0.7 0.1
0.1 0.1 0.8
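The membership matrix can be read as one row per node and one column per super-node, with each row summing to 1. A minimal sketch, assuming a hard assignment that places each node in its most probable super-node. The slides do not say which row belongs to which description node, so rows are referred to by index only.

```python
# Membership matrix C_t from the slide: rows are nodes of one type,
# columns are super-nodes.
C_t = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]


def is_row_stochastic(matrix, tol=1e-9):
    """Every row must be a probability distribution over super-nodes."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def hard_assignment(matrix):
    """Index of the most probable super-node for each node."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


def group_by_super_node(assignment):
    """Map super-node index -> list of node (row) indices."""
    groups = {}
    for node, super_node in enumerate(assignment):
        groups.setdefault(super_node, []).append(node)
    return groups


assignment = hard_assignment(C_t)
groups = group_by_super_node(assignment)
```

Nodes that land in the same group are the candidates for being the same real-world entity.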
Similar Nodes Should Be In The Same Super-Node
[Figure: multi-type graph of Products 1 to 5 with their price, description, and manufacturer nodes]
Super-Links
[Figure: manufacturer nodes (Bose Electronic, Bosch, Bose) and product nodes (Product 3, Product 4, Product 5)]
Predict Links In Original Graph
[Figure: the same manufacturer and product nodes]
Re-Clustering Improves Reconstruction Quality
[Figure: the same manufacturer and product nodes, shown twice]
Comparable Approaches
Pairwise Clustering Unsupervised Supervised
Limes, Ngomo’11 ✔ ✔
SILK, Isele’10 ✔ ✔ ✔
Serf, Benjelloun’10 ✔ ✔
*Commercial, Köpcke’10 ✔ ✔
GraphSum, Riondato’14 ✔ ✔
*AuthorLDA, Bhattacharya’07 ✔ ✔
CoSum (proposed) ✔ ✔
Quality Comparison
P = Precision, R = Recall, F = F-measure
Method | P Author | P Paper | P Product | R Author | R Paper | R Product | F Author | F Paper | F Product
Limes-F | 0.958 | 0.827 | 0.446 | 0.864 | 0.761 | 0.16 | 0.909 | 0.792 | 0.236
Silk-F | 0.846 | 0.877 | 0.459 | 0.986 | 0.756 | 0.348 | 0.91 | 0.812 | 0.395
Gsum | 0.727 | 0.668 | 0.01 | 0.569 | 0.624 | 0.587 | 0.638 | 0.645 | 0.02
CoSum-B | 0.993 | 0.871 | 0.58 | 0.94 | 0.611 | 0.477 | 0.966 | 0.718 | 0.524
Limes-MO | 0.912 | 0.827 | 0.446 | 0.944 | 0.761 | 0.16 | 0.928 | 0.792 | 0.236
Silk-MO | 0.932 | 0.877 | 0.459 | 0.958 | 0.756 | 0.348 | 0.945 | 0.812 | 0.395
Serf | 0.985 | 0.837 | 0.436 | 0.687 | 0.808 | 0.186 | 0.809 | 0.822 | 0.261
CoSum-P | 0.999 | 0.771 | 0.639 | 0.997 | 0.997 | 0.695 | 0.998 | 0.87 | 0.666
Commercial 0.615 0.63 0.622
AuthorLDA 0.995
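The F-measure values follow from precision and recall as their harmonic mean, F = 2PR / (P + R). The short check below recomputes a few Author and Paper entries from the precision and recall figures reported above. The function name is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, dataset) -> (precision, recall) as reported in the comparison.
reported = {
    ("Limes-F", "Author"): (0.958, 0.864),
    ("CoSum-B", "Author"): (0.993, 0.94),
    ("Serf", "Author"): (0.985, 0.687),
    ("CoSum-P", "Author"): (0.999, 0.997),
    ("CoSum-P", "Paper"): (0.771, 0.997),
}
recomputed = {key: round(f_measure(p, r), 3) for key, (p, r) in reported.items()}
```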
Graph Construction
assembling the data for efficient query & analysis
- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics
- bulk loading: massive data imports
- real-time updates: live, changing data
elasticsearch
• Cloud-based search engine
• Based on Apache Lucene
• Horizontal scaling, replication, load balancing
• Blazingly fast!
• Everything is a document
– Documents are JSON objects
– Index what you want to find
– Fields can contain strings, numbers, booleans,
etc.
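To make "everything is a document" concrete, the sketch below builds one JSON document, a bulk-indexing payload, and a query body as plain Python data. No Elasticsearch client is called. The index name `offers`, the field names, and the `isActive` flag are assumptions for illustration and are not the DIG schema. The payload follows the newline-delimited layout of Elasticsearch's bulk API, and the query uses the standard `bool`, `match`, and `range` clauses.

```python
import json

# One document: a JSON object with string, number, and boolean fields.
offer_doc = {
    "identifier": "12155a1a2938bc1e",
    "name": "Cabelas Millenium Revolver in .45 colt",
    "price": 700,
    "datePosted": "2015-07-11T17:17:00",
    "isActive": True,  # hypothetical boolean field
    "availableAtOrFrom": {"streetAddress": "1711 Anderson Rd"},
}


def to_bulk_payload(index, docs):
    """Newline-delimited body: an action line, then the document."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": doc["identifier"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Query body: full-text match on the name, filtered by price.
query_body = {
    "query": {
        "bool": {
            "must": [{"match": {"name": "revolver"}}],
            "filter": [{"range": {"price": {"lte": 1000}}}],
        }
    }
}

payload = to_bulk_payload("offers", [offer_doc])
```

Only the fields included in the document can be searched, which is the point of "index what you want to find".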
Efficient indexing and query
[Figure: entity types AdultService, Offer, Person, Phone, Web Page]
ElasticSearch Data Model
Offers As Roots
Products (AdultService) As Roots
Indexing for High Performance Knowledge Graph Queries
[Figure: average query times in milliseconds under a single-user query load on 1.2 billion triples, comparing a state-of-the-art graph database (RDF) with DIG indexing deployed in ElasticSearch]
DIG Deployment for Human Trafficking
- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)
- District Attorney
- Law Enforcement
- NGOs
DIG Applications
Human Trafficking
large, real users
Material Science Research
70,000 paper abstracts (built in 1 week)
Arms Trafficking
identify illegal sales
Patent Trolls
identifies patent trolls
Predicting Cyber Attacks
combines diverse sources about vulnerabilities,
exploits, etc.
Conclusions
• Presented the end-to-end tool-chain to
build domain-specific knowledge graphs
• Integrates heterogeneous data: web
pages, databases, CSV, web APIs,
images, etc.
• Approach scales to millions of pages and
billions of facts
• Has been used to build real-world
deployed applications

Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 

Extracting, Aligning, and Linking Data to Build Knowledge Graphs

  • 1. Extracting, Aligning, and Linking Data to Build Knowledge Graphs Craig Knoblock University of Southern California Thanks to my collaborators: Pedro Szekely, Linhong Zhu, Majid Ghasemi-Gol, Mohsen Taheriyan, Minh Pham, and Steve Minton
  • 2. Goal: turn raw, messy, disconnected data (hard to query, analyze & visualize) into clean, organized, linked data (easy to query, analyze & visualize). USC Information Sciences Institute, CC-By 2.0
  • 3. Use Case: Human Trafficking. Raw, messy, disconnected data (hard to query, analyze & visualize) versus clean, organized, linked data (easy to query, analyze & visualize)
  • 4. Use Case: Human Trafficking. 100 million pages, ~100 Web sites; help victims, prosecute traffickers
  • 5. Example: Investigating a Reported Victim. San Diego, where else?
  • 6. DIG Interface: Find the locations where a potential victim was advertised
  • 7. Steps To Build a KG. Pipeline stages, in order: Data Acquisition, Feature Extraction, Feature Alignment, Entity Resolution, Graph Construction, User Interface. Diagram labels: crawling, extraction, mapping to ontology, entity linking & similarity, knowledge graph deployment, query & visualization, ElasticSearch, graph DB, schema.org, geonames. Highlighted stage: Data Acquisition
  • 8. Data Acquisition: downloading relevant data, batch or real-time, from Web pages, Web services, databases, CSV, Excel, XML, JSON
  • 9. Traditional Web Crawler (e.g., Nutch, Scrapy)
  • 11. Steps To Build a KG (same pipeline diagram as slide 7)
  • 12. Feature Extraction: from raw sources to structured data. Extraction from text; extraction from structured Web pages; extraction of image features
  • 13. Extraction
  • 15. Automated Extraction [Minton et al., Inferlink]. Fields: Title, Description, Seller, Post Date, Expiry Date, Price, Location, Category, Member Since, Num Views, Post ID
  • 16. Automated Extraction. Input: a pile of pages
  • 17. Automated Extraction. Input: a pile of pages. Classify by templates. Result: pages clustered by template
  • 18. Automated Extraction. Input: a pile of pages. Classify by templates. Result: pages clustered by template. Infer Extractor (shown once per cluster). Result: extractor
  • 19. Unsupervised Extraction Tool
  • 20. Pretty Good Extractions. Extra: want "Jan. 23, 2015", extracted "Jan. 23, 2015 expires Feb". Partial: want "Jan. 23, 2015", extracted "Jan. 23"
  • 21. Extraction Evaluation (10 websites, 5 pages each). Score per field, with counts in parentheses:
      Field          Perfect        Pretty Good
      Title          1.0 (50/50)    1.0 (50/50)
      Desc           .76 (37/49)    .98 (48/49)
      Seller         .95 (40/42)    .95 (40/42)
      Date           .83 (40/48)    .83 (40/48)
      Price          .87 (39/45)    .98 (44/45)
      Loc            .51 (23/45)    .84 (38/45)
      Cat            .68 (34/50)    .88 (44/50)
      Member Since   1.0 (35/35)    1.0 (35/35)
      Expires        .52 (15/29)    .55 (16/29)
      Views          .76 (19/25)    1.0 (25/25)
      ID             .97 (35/36)    1.0 (36/36)
  • 22. Steps To Build a KG (same pipeline diagram as slide 7)
  • 23. Feature Alignment: from multiple schemas to a common domain schema. Multiple schemas: CSV, Excel; database tables; Web services; extractors; nomenclature; spelling
  • 24. Karma: Mapping Data to Ontologies. Diagram labels: services, relational sources, hierarchical sources, Karma, JSON-LD, Schema.org
  • 25. Semantic Labeling [Pham et al., ISWC’16]. Ontology classes: Offer, Place, Person; properties: name, price, id, name. Source columns Column-1 to Column-4, with example rows:
      - British Lee-Enfield No 4 MK 2 still … | 1,000 | 68155c13de2f2532
      - Cabelas Millenium Revolver in .45 colt | 700 | 1711 Anderson Rd | 12155a1a2938bc1e
  • 26. Learning Semantic Types. Requirements: learn from a small number of examples; distinguish both string and numeric values; can be learned quickly and is highly scalable to large numbers of semantic types. Domain ontology classes: Person, Organization, City, State; properties: name, birthdate, name, name, name. Example Person data (name, date, city, state, workplace):
      1. Fred Collins, Oct 1959, Seattle, WA, Microsoft
      2. Tina Peterson, May 1980, New York, NY, Google
  • 27. Learning Semantic Types, Textual Data: treat each column of data as a document; apply TF-IDF cosine similarity
  • 28. Learning Semantic Types, Numeric Data: apply statistical hypothesis testing to determine which distribution fits best; apply the Kolmogorov-Smirnov test
  • 29. Features for Semantic Labeling. KS = Kolmogorov-Smirnov; MW = Mann-Whitney
  • 30. Combining the Features for Semantic Labeling
  • 31. Automatically Assigned Semantic Labels. Assigned labels, in column order: Offer name, CreativeWork fragment, Offer description, Offer identifier, Offer datePosted, CreativeWork fragment. Rows:
      - 35 Whelen Handi-Rifle | No Tags | 35 Whelen Handi-rifle. Black synthetic stock/forearm, blued barrel. Text 601-813-7280 …. | 245625390711756 | October 19, 2015 12:43 pm
      - Cabelas Millenium Revolver in .45 colt | No Tags | This single action is built to shoot and is a great way for any level of shooter to get involved with a single action. … | 12155a1a2938bc1e | July 11, 2015 5:17 pm | 1711 Anderson Rd
      - swap stocks | No Tags | want to trade butler creek folding stock for black stock ruger mini stock folder by butler creek will swap even for full rifle stock …. | 5815600fd181fe3b | September 22, 2015 1:05 am | white
    Note: streetAddress does not appear in training data -> more similar to noisy data
  • 32. Results on www.msguntrader.com. Number of attributes: 19; correct prediction: 16; correct label is in the top 4 predictions: 18; accuracy: 84%; MRR: 89%
  • 33. Results on Gun Sites Evaluation Dataset. Average number of attributes: 18; total number of attributes: 176; correct prediction (accuracy): 56%; correct label is in the top 4 predictions: 89%; MRR: 70%
  • 34. Steps To Build a KG (same pipeline diagram as slide 7)
  • 35. Entity Resolution: merging records that refer to the same entity. Techniques to address: missing data; incorrect data; scale (~100 million records)
  • 36. Unsupervised Collective Entity Resolution
  • 37. Unsupervised Collective Entity Resolution: same victim, same trafficker
  • 38. Collective Entity Resolution [Zhu et al., ISWC’16]: identifying and linking instances of the same real-world entity. Figure: multi-type graph with node types product, manufacturer, description, price. Products: Product 1 to Product 5. Manufacturers: Bose Electronic, Sony, Bosch, Bose. Descriptions: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones. Prices: 292, 599, 229, 299
  • 39. Collective Entity Resolution [Zhu et al., ISWC’16]: identifying and linking instances of the same real-world entity (same multi-type graph as slide 38)
  • 40. Common Approach: Pairwise Comparisons. Table of Products 1 to 5 with columns Price, Title, Manufacturer; rows as listed:
      - 299 | Quiet Comfort 25 Noise Cancelling Headphone | Bose Electronic
      - 299, 229 | Bose Noise Cancelling Headphones | Bose
      - 599 | Dish Washer | Bosch
      - 292 | Premium Noise Cancelling Headphones | Sony
      - (no price) | Noise Cancelling Headphones | Sony
    Measures shown: Jaro 0.5, distance 0.2, Jaccard 0.3. Acceptance threshold: 0.8
  • 41. Missing Values (same product table and measures as slide 40)
  • 42. Multiple Values (same product table and measures as slide 40)
  • 43. Weights (same product table and measures as slide 40; weights 0.5, 0.2, 0.3)
  • 44. Unidirectional (same product table and measures as slide 40; weights 0.5, 0.2, 0.3)
  • 45. Graph Summarization: Original Graph. Figure: multi-type graph with node types product, manufacturer, description, price, showing Products 1 to 5; manufacturers Bose Electronic, Sony, Bosch, Bose; the five product descriptions; prices 292, 599, 229, 299
  • 47. Graph Summarization: Super-Nodes (figure shows the same product graph as slide 45)
  • 48. Super-nodes. Ct(x): probability that a node x belongs to each super-node; one matrix Ct for each type. Nodes listed: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones. Values, in groups of three: (0.7, 0.2, 0.1), (0.7, 0.2, 0.1), (0.2, 0.7, 0.1), (0.2, 0.7, 0.1), (0.1, 0.1, 0.8)
  • 49. Similar Nodes Should Be In The Same Super-Node. Nodes shown: Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Quiet Comfort 25 Noise Cancelling Headphone; Bose Noise Cancelling Headphones
  • 53. Predict Links In Original Graph. Nodes shown: Bose Electronic, Bosch, Bose, Product 3, Product 4, Product 5
  • 54. Predict Links In Original Graph. Nodes shown: Bose Electronic, Bosch, Bose, Product 3, Product 4, Product 5
  • 56. Comparable Approaches. Columns: Pairwise, Clustering, Unsupervised, Supervised. Rows: Limes, Ngomo’11 ✔ ✔; SILK, Isele’10 ✔ ✔ ✔; Serf, Benjelloun’10 ✔ ✔; *Commercial, Köpcke’10 ✔ ✔; GraphSum, Riondato’14 ✔ ✔; *AuthorLDA, Bhattacharya’07 ✔ ✔; CoSum (proposed) ✔ ✔
  • 57. Quality Comparison
      Method      Precision               Recall                  F-measure
                  Author  Paper  Product  Author  Paper  Product  Author  Paper  Product
      Limes-F     0.958   0.827  0.446    0.864   0.761  0.16     0.909   0.792  0.236
      Silk-F      0.846   0.877  0.459    0.986   0.756  0.348    0.91    0.812  0.395
      Gsum        0.727   0.668  0.01     0.569   0.624  0.587    0.638   0.645  0.02
      CoSum-B     0.993   0.871  0.58     0.94    0.611  0.477    0.966   0.718  0.524
      Limes-MO    0.912   0.827  0.446    0.944   0.761  0.16     0.928   0.792  0.236
      Silk-MO     0.932   0.877  0.459    0.958   0.756  0.348    0.945   0.812  0.395
      Serf        0.985   0.837  0.436    0.687   0.808  0.186    0.809   0.822  0.261
      CoSum-P     0.999   0.771  0.639    0.997   0.997  0.695    0.998   0.87   0.666
    Partial rows: Commercial 0.615, 0.63, 0.622; AuthorLDA 0.995
  • 58. Steps To Build a KG (same pipeline diagram as slide 7)
  • 59. Graph Construction: assembling the data for efficient query & analysis. ElasticSearch: scalable, efficient query; graph databases: network analytics; NoSQL: scalable analytics; bulk loading: massive data imports; real-time updates: live, changing data
  • 60. elasticsearch: cloud-based search engine; based on Apache Lucene; horizontal scaling, replication, load balancing; blazingly fast! Everything is a document: documents are JSON objects; index what you want to find; fields can contain strings, numbers, booleans, etc.
  • 61.
  • 62. ElasticSearch Data Model: efficient indexing and query. Entities shown: Adult Service, Offer, Person, Phone, Web Page
  • 65. Indexing for High Performance Knowledge Graph Queries. Chart of average query times in milliseconds (single-user query load, 1.2 billion triples), comparing a state-of-the-art graph database (RDF) with DIG indexing deployed in ElasticSearch
  • 66. Steps To Build a KG (same pipeline diagram as slide 7)
  • 67.
  • 68. DIG Deployment for Human Trafficking: 100 million Web pages; live updates (~5,000 pages/hour); ElasticSearch database (7 nodes); Hadoop workflows (20 nodes). Users: District Attorney, law enforcement, NGOs
  • 69. DIG Applications. Human Trafficking: large, real users. Material Science Research: 70,000 paper abstracts (built in 1 week). Arms Trafficking: identify illegal sales. Patent Trolls: identifies patent trolls. Predicting Cyber Attacks: combines diverse sources about vulnerabilities, exploits, etc.
  • 70. Conclusions: presented the end-to-end tool-chain to build domain-specific knowledge graphs; integrates heterogeneous data (web pages, databases, CSV, web APIs, images, etc.); the approach scales to millions of pages and billions of facts; has been used to build real-world deployed applications
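
The pairwise-comparison approach named on slide 40 can be sketched roughly as follows. This is an illustration under stated assumptions, not the presenters' implementation: only the weights (0.5, 0.2, 0.3) and the acceptance threshold (0.8) come from the slide. The record layout, the pairing of each weight with an attribute, the price-distance formula, and the use of `difflib` as a stand-in for Jaro are all assumptions made for this sketch.

```python
from difflib import SequenceMatcher

# Weights and threshold as shown on slide 40. Which weight belongs to
# which attribute is an assumption of this sketch.
WEIGHTS = {"manufacturer": 0.5, "price": 0.2, "title": 0.3}
THRESHOLD = 0.8


def string_similarity(a: str, b: str) -> float:
    """Stand-in for Jaro: difflib's similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def price_similarity(a: float, b: float) -> float:
    """1 minus the relative price difference, clipped to [0, 1] (assumed)."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(a, b))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of the per-attribute similarities."""
    return (
        WEIGHTS["manufacturer"]
        * string_similarity(r1["manufacturer"], r2["manufacturer"])
        + WEIGHTS["price"] * price_similarity(r1["price"], r2["price"])
        + WEIGHTS["title"] * jaccard(r1["title"], r2["title"])
    )


def is_match(r1: dict, r2: dict) -> bool:
    """Accept the pair when the weighted score reaches the threshold."""
    return record_similarity(r1, r2) >= THRESHOLD


if __name__ == "__main__":
    # Illustrative records built from values that appear on the slides.
    headphones = {"manufacturer": "Bose", "price": 299.0,
                  "title": "Bose Noise Cancelling Headphones"}
    dishwasher = {"manufacturer": "Bosch", "price": 599.0,
                  "title": "Dish Washer"}
    print(record_similarity(headphones, dishwasher),
          is_match(headphones, dishwasher))
```

The sketch requires every attribute to be present and single-valued, which is exactly where slides 41 and 42 (missing values, multiple values) point out that a fixed pairwise scheme runs into trouble.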

Editor's notes

  1. Karma offers suggestions on how to do the mapping
  2. Tokenize values in a given labeled column into pure alphabetic, numeric, and symbol tokens. Extract features from the tokens and the column name, and associate them with the column’s semantic type
  3. Why is linking significant in this domain? Slide shows why.
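
The tokenization step in note 2 can be illustrated with a small sketch. It splits each value of a column into alphabetic, numeric, and symbol tokens and reports the share of each kind. The function names, the regular expression, and the choice of token-kind proportions as features are assumptions made for illustration; the notes do not say which features are actually extracted.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# One named group per token kind: runs of ASCII letters, runs of digits,
# or a single other non-space character (treated here as a symbol).
TOKEN_RE = re.compile(
    r"(?P<alpha>[A-Za-z]+)|(?P<numeric>[0-9]+)|(?P<symbol>[^A-Za-z0-9\s])"
)

KINDS = ("alpha", "numeric", "symbol")


def tokenize(value: str) -> List[Tuple[str, str]]:
    """Return (kind, token) pairs for one cell value."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(value)]


def column_features(values: Iterable[str]) -> Dict[str, float]:
    """Share of each token kind across all values of a column."""
    counts = Counter(kind for v in values for kind, _ in tokenize(v))
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in KINDS}


if __name__ == "__main__":
    # Date values as they appear on slide 31.
    dates = ["October 19, 2015 12:43 pm", "July 11, 2015 5:17 pm"]
    print(tokenize(dates[1]))
    print(column_features(dates))
```

Features like these could then be stored together with the column name under the column's semantic type, so that a new, unlabeled column can be compared against them.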