Graphinder Semantic Search
Relational Keyword Search over Data Graphs
Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma
Researcher: www.sites.google.com/site/kimducthanh
Co-Founder: www.graphinder.com
Agenda
• Introduction
• Graphinder: Overview
• Keyword Query Translation
• Keyword Query Result Ranking
• Keyword Query Rewriting
– Suggesting correct and meaningful queries
– Auto-complete as user types
INTRODUCTION
Motivation: lots of structured data
Semantic Search: use information about entities and
relationships explicitly given in structured data to provide
relevant answers for complex questions asked using
intuitive interfaces
“singles written by freddie, who is
member of the band queen”
“single written by freddie queen”

MusicBrainz
<x, type, Single>
<Freddie Mercury, writer, x>
<Freddie Mercury, type, Artist>
<Freddie Mercury, member, Queen>
<Queen, type, Band>

DBpedia
<x, type, Single>
<x, writtenBy, Freddy>

Links
<Freddy, same-as, Freddie Mercury>

[Figure: example data graph with nodes Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971 and single, and the edge label writer]
Entity Semantic Search: find relevant entity, return
structured data summary, facts, related entities
Relational Semantic Search: find relevant entities
involved in a relationship, return entity summaries…
Semantic Search Problem: understand user inputs as
entities and relationships and find relevant answers

“single written by freddie queen”
“singles written by freddie, who is
member of the band queen”
[Figure: example data graph with nodes Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971 and single, and the edge label writer]

Query Translation: What are possible
connections (schema-level) between
recognized entities and relationships?
1)
<x, type, Single>
<Freddie Mercury, writer, x>
<Freddie Mercury, member, Queen>
2)
….
Query Answering: What are actual
connections (data-level) between
recognized entities and relationships?
1)
<Liar Liar, type, Single>
<Freddie Mercury, writer, Liar Liar>
<Freddie Mercury, member, Queen>
2)
…
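The difference between the two steps can be sketched in a few lines of Python: query translation yields triple patterns with a variable, and query answering binds that variable against data-level triples. This is a minimal illustration, not Graphinder's implementation; the `?x` variable syntax, the function names and the toy triples (which use the node label Liar from the example graph) are assumptions made for the example.

```python
# Toy data graph as (subject, predicate, object) triples (illustrative only).
DATA = [
    ("Liar", "type", "Single"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Freddie Mercury", "member", "Queen"),
    ("Freddie Mercury", "type", "Artist"),
    ("Queen", "type", "Band"),
    ("Brian May", "member", "Queen"),
]

def is_var(term):
    """Variables are written with a leading '?' (an assumed convention)."""
    return term.startswith("?")

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None if impossible."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, t) != t:  # variable already bound to another value
                return None
        elif p != t:                       # constant does not match
            return None
    return new

def answer(patterns, data):
    """Evaluate a conjunction of triple patterns by nested-loop matching."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in data
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings

# Translated (schema-level) query 1) from the slide, with x as a variable.
QUERY = [
    ("?x", "type", "Single"),
    ("Freddie Mercury", "writer", "?x"),
    ("Freddie Mercury", "member", "Queen"),
]
results = answer(QUERY, DATA)
print(results)
```

Each returned binding corresponds to one data-level answer; a real system would use indexes and join algorithms instead of nested loops.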
Relational Semantic Search at Facebook: recognizes entities and
relationships via LMs, uses manually specified template (grammar) to
find possible connections between them and computes answers via
resulting translated queries
“my friends, who is member of queen”
Parse tree (non-terminal: matched text ⇒ semantics):
[start]: my friends, who is member of [id:Queen1] ⇒ friends(x,me), member(x,Queen1)
  [user-head]: my friends ⇒ friends(x,me)
  [user-filter]: who is member of [id:1] ⇒ member(x,Queen1)
    [who]: who ⇒ -
    [member-vp]: is member of [id:1] ⇒ member(x,Queen1)
      [member-of-v]: is member of ⇒ member()
      {band}: [id:Queen1] ⇒ Queen1
Leaves (query tokens): friends, member, queen


Grammar: set of production rules,
capturing all possible connections,
i.e. the search space of all parse trees
[start] → [users]
[users] → my friends          friends(x, me)
[…] → is member of [bands]    member(x, $1)
[bands] → {band}              $1
…
Grammar-based Query Translation:
which combination of production
rules results in a parse tree that
connects the recognized entities and
relationships?
OVERVIEW
Graphinder Semantic Search: a translation-based approach
for relational keyword search over data graphs

[Figure: example data graph with nodes Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971 and single, and the edge label writer]


Sem. Auto-completion

Query Translation
- Entity + Relationships
- Multi-source
- Domain-independent
- Low manual effort
Graphinder: selected publications
• On-demand, domain-independent, relational keyword search over data graphs
– Structure index for data graphs (TKDE13b)
– Top-k exploration of translation candidates (ICDE09)
– Index-based materialization of graphs (CIKM11a)
– Ranking results using structured relevance model (SRM) (CIKM11b)

• Multi-source
– Deduplication using inferred type information: TYPifier (ICDE13),
TYPimatch (WSDM13)
– On-the-fly deduplication using SRM (WWW11)
– Ranking with deduplication (ISWC13)
– Routing keyword queries to relevant data graphs (TKDE13a)
– Hermes: keyword search over heterogeneous data graphs
(SIGMOD09)

• Semantic auto-completion
– Computing valid query rewrites for given keywords (VLDB14)
QUERY TRANSLATION
0) Query Translation: constructing pseudo schema graph
representing all possible connections between data elements
• Structure index for data graph: nodes are groups of data elements that share the same structure pattern
• Parameters: structure pattern with edge labels L and paths of maximum length n
• Pseudo schema
– Node groups all instances that have the same set of properties
– Structure pattern: all properties, i.e., all outgoing paths with n = 1, L = all edge labels
• Algorithm:
– Start with one single partition/node representing all instances
– Split until all nodes are “stable”, i.e., all contained instances share the same structure pattern
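
For the pseudo-schema setting (n = 1, L = all edge labels), the stable partition can be illustrated by grouping instances on their set of outgoing edge labels. The sketch below computes that partition directly instead of splitting iteratively; the function name and the toy triples are assumptions for illustration only.

```python
from collections import defaultdict

# Illustrative triples; not taken verbatim from any dataset.
TRIPLES = [
    ("Freddie Mercury", "writer", "Liar"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "writer", "Liar"),
    ("Brian May", "member", "Queen"),
    ("Queen", "producer", "Liar"),
    ("Queen Elizabeth 1", "marital status", "single"),
]

def pseudo_schema(triples):
    """Group nodes by their set of outgoing edge labels (n = 1, L = all labels)."""
    props = defaultdict(set)
    nodes = set()
    for s, p, o in triples:
        props[s].add(p)
        nodes.update((s, o))

    # One pseudo-schema node per distinct property set.
    groups = defaultdict(set)
    for n in nodes:
        groups[frozenset(props.get(n, ()))].add(n)

    # Lift every data edge to an edge between the groups of its endpoints.
    group_of = {n: key for key, members in groups.items() for n in members}
    edges = {(group_of[s], p, group_of[o]) for s, p, o in triples}
    return dict(groups), edges

groups, edges = pseudo_schema(TRIPLES)
for key, members in groups.items():
    print(sorted(key), "->", sorted(members))
```

Nodes without outgoing edges (here Liar and single) end up in one group keyed by the empty property set, which plays the role of a generic value node.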

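[Figure: example data graph (nodes Single, Artist, Person, Queen, Freddie Mercury, Brian May, Queen Elizabeth 1, Liar, single; edges writer, member) next to the derived pseudo schema (nodes Artist, Thing12, Single, Person, Value2; edges producer, writer, marital status)]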
1) Query Translation: constructing search space
representing all possible interpretations of query keywords
“written by freddie queen single”
[Figure: keyword matching elements from the data index (Freddie Mercury, Queen, Queen Elizabeth 1, single) and the schema index (Single, writer), attached to the pseudo schema elements Artist, Band, Single, member, producer, writer and marital status]

Keyword Interpretation: use inverted
index and LM-based ranking function to
return relevant schema and data
elements
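
Keyword interpretation can be pictured as a lookup in an inverted index over element labels. The sketch below is a simplified stand-in: it uses exact token matching and a crude score (one over the label length, i.e. the unsmoothed unigram probability of the keyword in the label) instead of the LM-based ranking function mentioned above. The element lists, names and scoring are assumptions for illustration.

```python
from collections import defaultdict

# Illustrative schema and data element labels.
ELEMENTS = {
    "data": ["Freddie Mercury", "Queen", "Queen Elizabeth 1", "single"],
    "schema": ["Single", "writer", "member", "producer", "marital status"],
}

def build_index(elements):
    """Map every lower-cased label token to the (kind, label) pairs containing it."""
    index = defaultdict(set)
    for kind, labels in elements.items():
        for label in labels:
            for token in label.lower().split():
                index[token].add((kind, label))
    return index

def interpret(keyword, index):
    """Return matching elements, best first. Score = 1 / number of label tokens."""
    hits = index.get(keyword.lower(), set())
    scored = [(1.0 / len(label.split()), kind, label) for kind, label in hits]
    return sorted(scored, key=lambda h: (-h[0], h[1], h[2]))

INDEX = build_index(ELEMENTS)
for kw in ["freddie", "queen", "single"]:
    print(kw, "->", interpret(kw, INDEX))
```

Note that the keyword "single" matches both a data element (the literal) and a schema element (the class), which is exactly the ambiguity the search space has to represent.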

[Figure, continued: pseudo schema nodes Person and Literal with the data elements Queen Elizabeth 1 and single]


Search Space Construction: augment
pseudo schema with query-specific
keyword matching elements
• All possible connections of predicates
applicable to recognized query
keywords
Top-k Subgraph Exploration
Result Retrieval & Ranking
2) Query Translation: score-directed algorithm for finding
top-k subgraphs connecting keyword matching elements
“written by freddie queen single”

[Figure: search space with nodes Artist, Band, Single, Person and Literal, the data elements Freddie Mercury, Queen, Queen Elizabeth 1 and single, and the edges member, producer, writer and marital status]

<x, type, Single>
<Queen, producer, x>
<Freddie Mercury, writer, x>
<Queen, type, Band>
<Freddie Mercury, type, Artist>

Algorithm: score-directed top-k Steiner graph search
• Start: explore all distinct paths starting from keyword elements
• Every iteration
– One-step expansion of the current path with the highest score
– When a connecting element is found, merge paths and add the resulting graph to the candidate list
• Top-k termination: the lowest score in the candidate list > the highest possible score that can be achieved with paths in the queues yet to be explored
• Termination: all paths of maximum length d have been explored
• Final step: mapping rules to translate the Steiner graph into a structured query
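
The exploration loop can be sketched as a best-first search. The version below is a simplification under stated assumptions, not the published algorithm: it uses costs (lower cost standing in for higher score), keeps only the cheapest path per keyword and node instead of all distinct paths, treats only nodes as keyword elements, and scores a candidate by the sum of its path costs. All names and the toy search space are hypothetical.

```python
import heapq

def top_k_subgraphs(edges, keyword_nodes, k=1, max_len=3):
    """Best-first exploration from keyword elements; returns up to k (cost, edge set)."""
    adj = {}
    for u, label, v, cost in edges:
        edge = (u, label, v)
        adj.setdefault(u, []).append((v, edge, cost))
        adj.setdefault(v, []).append((u, edge, cost))  # explore in both directions

    # One queue entry per path: (cost, keyword index, end node, edges on the path).
    queue = [(0.0, i, n, ()) for i, n in enumerate(keyword_nodes)]
    heapq.heapify(queue)
    best = {}        # (keyword index, node) -> (cost, path) of the cheapest path
    candidates = []  # (total cost, frozenset of edges), sorted by cost
    seen = set()

    while queue:
        cost, origin, node, path = heapq.heappop(queue)
        # Top-k termination: no unexplored path can lead to a cheaper candidate.
        if len(candidates) >= k and candidates[k - 1][0] < cost:
            break
        if (origin, node) in best:
            continue
        best[(origin, node)] = (cost, path)

        # A connecting element is a node reached from every keyword element.
        reached = [best.get((i, node)) for i in range(len(keyword_nodes))]
        if all(r is not None for r in reached):
            graph = frozenset(e for _, p in reached for e in p)
            if graph not in seen:
                seen.add(graph)
                candidates.append((sum(c for c, _ in reached), graph))
                candidates.sort(key=lambda cand: cand[0])

        # One-step expansion, bounded by the maximum path length d (max_len).
        if len(path) < max_len:
            for nxt, edge, step in adj.get(node, []):
                heapq.heappush(queue, (cost + step, origin, nxt, path + (edge,)))
    return candidates[:k]

# Toy search space: (source, edge label, target, cost); costs are made up.
EDGES = [
    ("Freddie Mercury", "writer", "Single", 1.0),
    ("Freddie Mercury", "member", "Queen", 1.0),
    ("Queen", "producer", "Single", 2.0),
]
result = top_k_subgraphs(EDGES, ["Freddie Mercury", "Queen", "Single"], k=1)
print(result)
```

The early exit is valid here because costs are non-negative and a candidate is only created when a path is popped, so every later candidate costs at least as much as the path currently at the head of the queue.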
RESULT RANKING
Ranking Using Structured LMs: Keyword query is short and
ambiguous, while structured data provide rich structure
information: ranking based on LMs capturing both content and
structure

• Structured LMs for
structured results r
• Structured LM for queries
using structured pseudo-relevant feedback results FR
(relevance model)
• Compute distance between
query and result LMs

$$RM_r(v) = P(v \mid r)$$

$$RM_{F_R}(v) = P(v \mid F_R)$$

$$Score(r) = \sum_{v \in V} RM_{F_R}(v) \, \log RM_r(v)$$
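
As a rough illustration of this score, the sketch below builds unigram models for a set of feedback results and for two candidates and compares them with the sum above. It ignores structure (one model per result rather than per edge), and the additive smoothing with parameter `mu`, the function names and the toy texts are assumptions introduced for the example.

```python
import math
from collections import Counter

def language_model(texts, vocab, mu=0.1):
    """Unigram model over `vocab` with additive smoothing (smoothing is an assumption)."""
    counts = Counter(w for text in texts for w in text.lower().split())
    total = sum(counts.values())
    return {v: (counts[v] + mu) / (total + mu * len(vocab)) for v in vocab}

def score(query_model, result_model):
    """Score(r) = sum over v of RM_FR(v) * log RM_r(v); higher means closer."""
    return sum(p * math.log(result_model[v]) for v, p in query_model.items())

# Toy pseudo-relevant feedback results and candidates (illustrative only).
feedback = ["freddie mercury queen", "brian may queen"]
candidates = {"r1": ["freddie mercury"], "r2": ["bank raid protest"]}

all_texts = feedback + [t for texts in candidates.values() for t in texts]
vocab = {w for text in all_texts for w in text.split()}

query_model = language_model(feedback, vocab)
scores = {name: score(query_model, language_model(texts, vocab))
          for name, texts in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))
```

The candidate sharing more probability mass with the feedback results receives the higher (less negative) score.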
Relevance Models
Query: freddie queen
[Figure: the query retrieves feedback documents F; their terms (Mercury, Brian, May, Protest, Raid, Clash, Bank, West) form the query model, which is compared with the term distributions of the candidate documents]
• Term probabilities of the query model are based on the feedback documents
• Ranking behaves like similarity search between pseudo-relevant feedback documents and corpus documents
Structured Relevance Models
Query: queen single
[Figure: the query retrieves feedback results F from the structured data; their terms (Mercury, Brian, May, Protest, Raid, Clash, Bank, West) form the query model, which is compared with the term distributions of the candidate results]
• Term probabilities of the query model are based on pseudo-relevant structured data
• Ranking behaves like similarity search between pseudo-relevant structured results and structured result candidates
Ranking: construct edge-specific query model for each unique e
from feedback resources FR, edge-specific model for every
candidate r, and finally, compute distance
Formula annotations:
- For all resources r in FR
- Probability of observing term v in the value of property e of resource r
- Importance of resource r w.r.t. the query

v          RMname   RMcomment   RMx
Mercury    .091     .01         …
Brian      .082     .01         …
Champion   .081     .02         …
Protest    .001     .042        …
Raid       .006     .014        …
…          …        …           …

v          RMname   RMcomment   RMx
Mercury    .073     .01         …
Brian      .052     .01         …
…          …        …           …
QUERY REWRITING
Query Rewriting: find syntactically and semantically valid
rewrites to suggest as user types
single from freddy mercury que
[Figure: data index matches Freddie Mercury, Queen Elizabeth 1, Queen and single; schema index matches writer and Single]

Benefits:
- Higher selectivity of query terms (quality)
- Reduced number of query terms (efficiency)
- Better search experience…
[Figure: data index matches Freddie Mercury and Queen; schema index matches writer and Single]

Challenges: many rewrite candidates, some are
semantically not “valid” in the relational setting
single (marital status) writer “freddie mercury” queen
(the queen of UK)

Token rewriting via syntactic distance
1) single from freddie mercury queen
…
Token rewriting via semantic distance
1) single writer freddie mercury queen
…
Query segmentation
1) single writer “freddie mercury” queen
…

Keyword Interpretation:
- Imprecise / fuzzy matching
- Match every keyword
Keyword / Key Phrase Interpretation:
- Precise matching
- Match keywords and key phrases
Search Space Construction
Result Retrieval & Ranking
Probabilistic Model for Query Rewriting: the rank of a
query rewrite (suggestion) S is based on the
probability of observing S in the data, given the query
Based on Bayes' Theorem

The probability that users write spelling errors / semantically related queries is independent of data D

single writer freddy mercury que

1) single writer freddie mercury queen
2) single writer freddrick mercury monarch
3) song writer freddrick mercury head of state

Constant
given query Q
and data D

[Figure: example data graph with nodes Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971 and single, and the edge label writer]

Token Rewriting: S is ranked high when the probability that query Q can be observed in S is high

Query Segmentation: S is ranked high when the probability that S can be observed in the data D is high
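
The formula on this slide did not survive extraction. Read together, its annotations (Bayes' theorem, a rewriting probability that is independent of D, and a term that is constant given Q and D) suggest the following decomposition; this is a reconstruction under that reading, not necessarily the slide's exact notation.

```latex
\begin{align*}
P(S \mid Q, D) &= \frac{P(Q \mid S, D)\, P(S \mid D)}{P(Q \mid D)} \\
               &\propto P(Q \mid S)\, P(S \mid D)
\end{align*}
```

Here P(Q | S) is the token rewriting model, P(S | D) is the query segmentation model, and P(Q | D) can be dropped for ranking because it is the same for every suggestion S.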
Token Rewriting
• Modeling token rewriting P(Q|S)

Split: |
Concatenate: +

• Independence assumption

• Modeling syntactic and semantic differences

P(q|t): is high when q is
syntactically and
semantically close to t

single writer freddy mercury que
1) single writer “freddie mercury” queen
2) single writer “freddrick mercury” monarch
3) single writer “freddrick mercury” head of state

single | writer | freddie + mercury | queen
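
To make the token rewriting model concrete, the sketch below scores a rewrite as a product of per-token terms, following the independence assumption. It covers only syntactic closeness, via Levenshtein distance with an exponential decay; that functional form, the parameter `lam` and the one-to-one token alignment (no split or concatenate) are assumptions for illustration, and the scores are unnormalized rather than true probabilities.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def token_score(q, t, lam=1.0):
    """Unnormalized stand-in for P(q|t): decays with the edit distance."""
    return math.exp(-lam * edit_distance(q, t))

def rewrite_score(query_tokens, rewrite_tokens):
    """Product over aligned tokens (independence assumption)."""
    if len(query_tokens) != len(rewrite_tokens):
        raise ValueError("this sketch only handles one-to-one token alignments")
    return math.prod(token_score(q, t) for q, t in zip(query_tokens, rewrite_tokens))

Q = "single writer freddy mercury que".split()
S1 = "single writer freddie mercury queen".split()
S2 = "single writer freddrick mercury monarch".split()
print(rewrite_score(Q, S1), rewrite_score(Q, S2))
```

A semantic component (for example, relatedness of "queen" and "monarch") would have to be added to `token_score` to reflect the second half of the slide's definition.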
Query Segmentation
• Modeling query segmentation P(S|D)
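single writer freddie mercury que

For each next token: α = concatenate? α = split?
where $P_D(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \ldots \alpha_{i-1} t_i)$ stands for $P(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \ldots \alpha_{i-1} t_i, D)$.

• Nth order Markov assumption

[Figure: partial segmentation "single writer freddie" shown against the example data graph with nodes Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971 and single, and the edge label writer]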
Estimating Probability of Segmentation
• Maximum likelihood estimation (MLE)

where C(ti…tj) denotes the count of occurrences of the token sequence ti…tj
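
The estimation formula itself is missing from this transcript. As a reminder of what count-based maximum likelihood estimation looks like for token sequences in general, the sketch below uses the textbook n-gram estimate C(context + token) / C(context); it is not claimed to be the slide's formula, and the toy label corpus and function names are assumptions.

```python
from collections import Counter

def ngram_counts(sequences, max_n):
    """Count every contiguous token sequence of length 1..max_n."""
    counts = Counter()
    for seq in sequences:
        for n in range(1, max_n + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return counts

def mle(counts, context, token):
    """Textbook MLE: C(context followed by token) / C(context)."""
    denom = counts[tuple(context)]
    return counts[tuple(context) + (token,)] / denom if denom else 0.0

# Hypothetical labels of data elements, tokenized.
corpus = [["freddie", "mercury"], ["freddie", "mercury"], ["freddie", "king"]]
counts = ngram_counts(corpus, max_n=2)
print(mle(counts, ["freddie"], "mercury"))
```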

Segmentation in structured data setting
• Concatenate two segments si and sj when they co-occur in the data
• Split when si and sj are connected (si ↭ sj), i.e., when the two data elements ni and nj mentioning si and sj are connected in the data

single writer freddie mercury queen
[Figure: partial segmentation "single writer freddie", with the choice α = concatenate or α = split for the next token, shown against the example data graph with nodes Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971 and single, and the edge label writer]
Estimating Probability of Segmentation Case 1: previous
segment si has length equal or more than context N
• Two cases: (1) l(si) ≥ N; (2) l(si) < N
• (1) When the previously induced segment si has length equal or
more than N, i.e. l(si) ≥ N, it suffices to focus on si (N) to predict
the next action αi on ti+1
[Figure: the segment “freddie j. mercury” and the next token “queen”, shown twice]

• Estimation of probability

where C(st) denotes the count of co-occurrences of the sequence st in D and
C(s ↭ t) is the count of all occurrences of token t connected to segment s
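
The two counts can be computed on a toy graph as follows. How they are combined into a probability is not recoverable from this transcript, so the normalization used here (each count divided by their sum) is purely an assumption for illustration, as are the node ids, labels and function names.

```python
# Hypothetical data elements (id -> label) and undirected connections between them.
LABELS = {"n1": "freddie mercury", "n2": "queen", "n3": "queen elizabeth 1", "n4": "liar"}
EDGES = {("n1", "n2"), ("n1", "n4")}

def contains(label, phrase):
    """True if the token sequence `phrase` occurs contiguously in `label`."""
    tokens, p = label.split(), phrase.split()
    return any(tokens[i:i + len(p)] == p for i in range(len(tokens) - len(p) + 1))

def count_cooccur(segment, token):
    """C(st): data elements whose label contains segment s directly followed by t."""
    return sum(contains(label, f"{segment} {token}") for label in LABELS.values())

def count_connected(segment, token):
    """C(s ↭ t): connected element pairs where one mentions s and the other t."""
    return sum(1 for a, b in EDGES for x, y in ((a, b), (b, a))
               if contains(LABELS[x], segment) and contains(LABELS[y], token))

def action_probabilities(segment, token):
    """Assumed normalization: each count divided by the sum of both counts."""
    c_cat, c_split = count_cooccur(segment, token), count_connected(segment, token)
    total = c_cat + c_split
    if total == 0:
        return {"concatenate": 0.0, "split": 0.0}
    return {"concatenate": c_cat / total, "split": c_split / total}

print(action_probabilities("freddie", "mercury"))
print(action_probabilities("freddie mercury", "queen"))
```

On this toy data, "mercury" is concatenated to "freddie" because the two tokens co-occur in one label, while "queen" is split from "freddie mercury" because the two are mentioned by connected elements.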
Estimating Probability of Segmentation Case 2: previous
segment si has length less than context N
• (2) When the previous segment si has length less than N, i.e. l(si) <
N, the action αi on the next token ti+1 depends on si and Pi(N), the
set of segments that precede si that together with si, contains at
most N tokens in total, i.e.,
[Figure: the token sequence single, writer, freddie, mercury, shown twice]

• Estimation of probability

where C(P ↭ s) denotes the count of all occurrences of the segment s
connected to all segments in P
EXPERIMENTAL RESULTS &
CONCLUSIONS
• Graphinder, a relational keyword search approach for suggesting query completions, translating queries and ranking results
• Keyword translation performance
– Query translation and index-based approaches are at least one order of magnitude faster than online in-memory search (bidirectional)
– Query translation is comparable with index-based approaches, but needs less space
• Keyword translation result quality
– According to a recent benchmark, our ranking consistently outperforms all existing ranking systems in precision, recall and MAP (10% - 30% improvement)
• Effect of query rewriting
– Better user experience
– Improves efficiency by reducing the number of query terms
– Improves quality / selectivity of query terms
– …depends on complexity of queries and underlying keyword search engine
• Tight integration of query suggestion and translation
• From research prototypes to Graphinder, a powerful, flexible, low upfront-cost semantic search system
Thanks!

Tran Duc Thanh
tran.du.th@gmail.com
http://sites.google.com/site/kimducthanh/
References (1)
– [VLDB14] Yongtao Ma, Thanh Tran
Probabilistic Query Rewriting for Efficient and Effective Keyword Search on
Graph Data
In International Conference on Very Large Data Bases (VLDB'14). Hangzhou,
China, September, 2014
– [ISWC13] Daniel Herzig, Roi Blanco, Peter Mika and Thanh Tran
Federated Entity Search Using On-the-Fly Consolidation
In International Semantic Web Conference (ISWC'13). Sydney, Australia, October,
2013
– [ICDE13] Yongtao Ma, Thanh Tran
TYPifier: Inferring the Type Semantics of Structured Data
In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April,
2013
– [WSDM13] Yongtao Ma, Thanh Tran
TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for
Heterogeneous Web Data Integration
In International Conference on Web Search and Data Mining (WSDM'13). Rome,
Italy, February, 2013
– [TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph
Managing Structured and Semi-structured RDF Data Using Structure Indexes
In Transactions on Knowledge and Data Engineering journal.
– [TKDE12b] Thanh Tran, Lei Zhang
Keyword Query Routing
In Transactions on Knowledge and Data Engineering journal.
References (2)
– [WWW12] Daniel Herzig, Thanh Tran
Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon,
France, April, 2012
– [CIKM11a] Günter Ladwig, Thanh Tran
Index Structures and Top-k Join Algorithms for Native Keyword Search Databases
In Proceedings of 20th ACM Conference on Information and Knowledge
Management (CIKM'11). Glasgow, UK, October, 2011
– [CIKM11b] Veli Bicer, Thanh Tran
Ranking Support for Keyword Search on Structured Data using Relevance Models
In Proceedings of 20th ACM Conference on Information and Knowledge
Management (CIKM'11). Glasgow, UK, October, 2011
– [SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey
Pound, Henry S. Thompson, Thanh Tran Duc
Repeatable and Reliable Search System Evaluation using Crowdsourcing
In Proceedings of 34th Annual International ACM SIGIR Conference (SIGIR'11),
Beijing, China, July, 2011
– [ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano
Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF
In Proceedings of the 25th International Conference on Data Engineering (ICDE'09).
Shanghai, China, March 2009
– [SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun,
Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer
Hermes: A Travel through Semantics in the Data Web
In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July, 2009
BACKUP

More Related Content

Similar to Graphinder semantic search

FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczFOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
Ioan Toma
 
Data.Mining.C.8(Ii).Web Mining 570802461
Data.Mining.C.8(Ii).Web Mining 570802461Data.Mining.C.8(Ii).Web Mining 570802461
Data.Mining.C.8(Ii).Web Mining 570802461
Margaret Wang
 

Similar to Graphinder semantic search (20)

Big data search
Big data search Big data search
Big data search
 
Summarizing Semantic Data
Summarizing Semantic DataSummarizing Semantic Data
Summarizing Semantic Data
 
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczFOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
 
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
 
Stack_Overflow-Network_Graph
Stack_Overflow-Network_GraphStack_Overflow-Network_Graph
Stack_Overflow-Network_Graph
 
Social (1)
Social (1)Social (1)
Social (1)
 
NLP & DBpedia
 NLP & DBpedia NLP & DBpedia
NLP & DBpedia
 
Fast, Lenient, and Accurate – Building Personalized Instant Search Experience...
Fast, Lenient, and Accurate – Building Personalized Instant Search Experience...Fast, Lenient, and Accurate – Building Personalized Instant Search Experience...
Fast, Lenient, and Accurate – Building Personalized Instant Search Experience...
 
Data mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sirData mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sir
 
BoTLRet: A Template-based Linked Data Information Retrieval
 BoTLRet: A Template-based Linked Data Information Retrieval BoTLRet: A Template-based Linked Data Information Retrieval
BoTLRet: A Template-based Linked Data Information Retrieval
 
Web mining
Web miningWeb mining
Web mining
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
 
B 4 gravty
B 4 gravtyB 4 gravty
B 4 gravty
 
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014:  Social Network Benchmark (SNB) Graph GeneratorFOSDEM 2014:  Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator
 
Data.Mining.C.8(Ii).Web Mining 570802461
Data.Mining.C.8(Ii).Web Mining 570802461Data.Mining.C.8(Ii).Web Mining 570802461
Data.Mining.C.8(Ii).Web Mining 570802461
 
The 2nd graph database in sv meetup
The 2nd graph database in sv meetupThe 2nd graph database in sv meetup
The 2nd graph database in sv meetup
 
Domain Identification for Linked Open Data
Domain Identification for Linked Open DataDomain Identification for Linked Open Data
Domain Identification for Linked Open Data
 
Bootstrapping Recommendations with Neo4j
Bootstrapping Recommendations with Neo4jBootstrapping Recommendations with Neo4j
Bootstrapping Recommendations with Neo4j
 
DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...
DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...
DaCENA Personalized Exploration of Knowledge Graphs Within a Context. Seminar...
 
DBtrends Semantics 2016
DBtrends Semantics 2016DBtrends Semantics 2016
DBtrends Semantics 2016
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Graphinder semantic search

  • 1. Graphinder Semantic Search Relational Keyword Search over Data Graphs Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma Researcher: www.sites.google.com/site/kimducthanh Co-Founder: www.graphinder.com
  • 2. Agenda • • • • • Introduction Graphinder: Overview Keyword Query Translation Keyword Query Result Ranking Keyword Query Rewriting – Suggesting correct and meaningful queries – Auto-complete as user types
  • 4. Motivation: lots of structured data
  • 5. Semantic Search: use information about entities and relationships explicitly given in structured data to provide relevant answers for complex questions asked using intuitive interfaces “singles written by freddie, who is member of the band queen” “single written by freddie queen” MusicBrainz Single Artist Queen Person Queen Elizabeth 1 <x, type, Single> <Freddie Mercury, writer, x> <Freddie Mercury, type, Artist> <Freddie Mercury, member, Queen> <Queen, type, Band> DBpedia Freddie Mercury Brian May writer Liar 1971 single <x, type, Single> <x, wrritenBy, Freddy> Links <Freddy, same-as, Freddy Mercury>
  • 6. Entity Semantic Search: find relevant entity, return structured data summary, facts, related entities
  • 7. Relational Semantic Search: find relevant entities involved in a relationship, return entity summaries…
  • 8. Semantic Search Problem: understand user inputs as entities and relationships and find relevant answers “single written by freddie queen” “singles written by freddie, who is member of the band queen” Single Artist Queen Freddie Mercury Brian May writer Person Queen Elizabeth 1 Liar 1971 single Query Translation: What are possible connections (schema-level) between recognized entities and relationships? 1) <x, type, Single> <Freddie Mercury, writer, x> <Freddie Mercury, member, Queen> 2) …. Query Answering: What are actual connections (data-level) between recognized entities and relationships? 1) <Liar Liar, type, Single> <Freddie Mercury, writer, Liar Liar> <Freddie Mercury, member, Queen> 2) …
  • 9. Relational Semantic Search at Facebook: recognizes entities and relationships via LMs, uses manually specified template (grammar) to find possible connections between them and computes answers via resulting translated queries “my friends, who is member of queen” [start] my friends, who is member of [id:Queen1] friends(x,me), member(x,Queen1) [user-head] my friends friends(x,me) [user-filter] who is member of [id:1] member(x,Queen1) [who] who - [member-vp] is member of [id:1] member(x,Queen1) [member-of-v] is member of member() friends member {band} [id:Queen1] Queen1 queen Grammar: set of production rules, capturing all possible connections, i.e. the search space of all parse trees [start]  [users] [users]  my friends friends(x, me) […]  is member of [bands] member(x, $1) [bands]  {band} $1 … Grammar-based Query Translation: which combination of production rules results in a parse tree that connects the recognized entities and relationships?
  • 11. Graphinder Semantic Search: a translation-based approach for relational keyword search over data graphs Single Artist Person Queen Queen Elizabeth 1 Freddie Mercury Brian May Liar 1971 single writer Sem. Auto-completion Query Translation - Entity + Relationships - Multi-source - Domain-independent - Low manual effort
  • 12. Graphinder: selected publications • On-demand, domain-independent, relational keyword search over data graphs – – – – Structure index for data graphs (TKDE13b) Top-k exploration of translation candidates (ICDE09) Index-based materialization of graphs (CIKM11a) Ranking results using structured relevance model (SRM) (CIKM11b) • Multi-source – Deduplication using inferred type information: TYPifier (ICDE13), TYPimatch (WSDM13) – On-the-fly deduplication using SRM (WWW11) – Ranking with deduplication (ISWC13) – Routing keyword queries to relevant data graphs (TKDE13a) – Hermes: keyword search over heterogeneous data graphs (SIGMOD09) • Semantic auto-completion – Computing valid query rewrites for given keywords (VLDB14)
  • 14. 0) Query Translation: constructing pseudo schema graph representing all possible connections between data elements • • • Structure index for data graph: nodes are groups of data elements that are share same structure pattern Parameters: structure pattern with edge labels L and paths of maximum length n Pseudo schema – Node groups all instances that have same set of properties – structure pattern: all properties, i.e. all outgoing paths with n = 1, L = all edge labels • Algorithm: – Start with one single partition/node representing all instances – Spit until all nodes are “stable”, i.e., all contained instances share same structure pattern Single Artist Queen Freddie Mercury Brian May Person Queen Elizabeth 1 Liar single writer member Artist producer Thing12 writer Single marital status Person Value2
  • 15. 1) Query Translation: constructing search space representing all possible interpretations of query keywords “written by freddie queen single” Freddie Mercury Queen Elizabeth 1 Artist Freddie Mercury producer Band Queen Data Index single writer member Queen Single Single Schema Index marital status writer Keyword Interpretation: use inverted index and LM-based ranking function to return relevant schema and data elements Person Literal Queen Elizabeth 1 single Search Space Construction: augment pseudo schema with query-specific keyword matching elements • All possible connections of predicates applicable to recognized query keywords Top-k Subgraph Exploration Result Retrieval & Ranking
  • 16. 2) Query Translation: score-directed algorithm for finding top-k subgraphs connecting keyword matching elements “written by freddie queen single” member Artist Freddie Mercury • • • • • • producer Band Queen marital status writer Single Person Literal Queen Elizabeth 1 single <x, type, Single> <Queen, producer, x> <Freddie Mercury, writer, x> <Queen, type, Band> <Freddy Mercury, type, Artist> Algorithm: score-directed top-k Steiner graph search Start: explore all distinct paths starting from keyword elements Every iteration • One step expansion of current path with highest score • When connecting element found, merge paths and add resulting graph to list Top-k termination: lowest score of the candidate list > highest possible score that can achieved with paths in the queues yet to be explored Termination: all paths of maximum length d have been explored Final step: mapping rules to translate Steiner graph to structured query
  • 18. Ranking Using Structured LMs: a keyword query is short and ambiguous, while structured data provide rich structure information, so ranking is based on LMs capturing both content and structure. Structured LMs for structured results r: $RM_r(v) = P(v \mid r)$. Structured LM for queries using structured pseudo-relevant feedback results $F_R$ (relevance model): $RM_{F_R}(v) = P(v \mid F_R)$. Compute the distance between query and result LMs: $Score(r) = \sum_{v \in V} RM_{F_R}(v) \log RM_r(v)$.
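A small numeric sketch may help to see how such a score orders candidates. It assumes the score is the sum over the vocabulary of RM_FR(v) · log RM_r(v), so that a higher value means the result model is closer to the query model. The `epsilon` floor for unseen terms and all probabilities below are illustrative assumptions.

```python
import math

def score(query_model, result_model, epsilon=1e-9):
    """Sum over query-model terms of RM_FR(v) * log RM_r(v).

    epsilon is an assumed floor for terms missing from the result
    model, so that log() stays defined.
    """
    return sum(p * math.log(result_model.get(v, epsilon))
               for v, p in query_model.items())

# Hypothetical query model built from feedback results.
rm_fr = {"mercury": 0.5, "brian": 0.3, "protest": 0.2}
# Two hypothetical candidate result models.
models = {
    "a": {"mercury": 0.6, "brian": 0.3, "protest": 0.1},
    "b": {"mercury": 0.1, "brian": 0.1, "protest": 0.8},
}
ranked = sorted(models, key=lambda name: score(rm_fr, models[name]),
                reverse=True)
```

Candidate "a" places most of its probability mass on the same terms as the query model and is therefore ranked first.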
  • 19. Relevance Models: term probabilities of the query model are based on documents. Ranking behaves like a similarity search between pseudo-relevant feedback documents and corpus documents. [Figure: query “freddie queen”, feedback documents F and candidate documents, each with a term distribution over Mercury, Brian, May, Protest, Raid, Clash, Bank, West.]
  • 20. Structured Relevance Models: term probabilities of the query model are based on pseudo-relevant structured data. Ranking behaves like a similarity search between pseudo-relevant structured results and structured result candidates. [Figure: query “queen single”, structured feedback results F and structured candidate results, each with a term distribution over Mercury, Brian, May, Protest, Raid, Clash, Bank, West.]
  • 21. Ranking: construct an edge-specific query model for each unique e from the feedback resources FR, an edge-specific model for every candidate r, and finally compute the distance. The query model sums, over all resources r in FR, the probability of observing term v in the value of property e of resource r, weighted by the importance of resource r w.r.t. the query. Example term probabilities shown on the slide:
      v          RMname   RMcomment   RMx
      Mercury    .091     .01         …
      Brian      .082     .01         …
      Champion   .081     .02         …
      Protest    .001     .042        …
      Raid       .006     .014        …
      …          …        …           …

      v          RMname   RMcomment   RMx
      Mercury    .073     .01         …
      Brian      .052     .01         …
      …          …        …           …
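The weighted sum behind the edge-specific query model can be sketched as below. This is a minimal sketch, assuming the per-resource term probabilities and the importance weights are already available as dictionaries; the function name `edge_query_model` and all numbers are hypothetical.

```python
def edge_query_model(feedback, importance, edge):
    """RM_e(v): sum over r in F_R of P(v | r, e) * importance(r).

    feedback: dict resource -> dict edge -> dict term -> probability.
    importance: dict resource -> weight of r w.r.t. the query
                (assumed to sum to 1 over the feedback resources).
    """
    model = {}
    for resource, edges in feedback.items():
        for term, p in edges.get(edge, {}).items():
            model[term] = model.get(term, 0.0) + p * importance[resource]
    return model

# Hypothetical feedback resources with one property, "name".
feedback = {
    "r1": {"name": {"mercury": 1.0}},
    "r2": {"name": {"brian": 0.5, "may": 0.5}},
}
importance = {"r1": 0.6, "r2": 0.4}
rm_name = edge_query_model(feedback, importance, "name")
```

One such model is built per property, which is what the RMname, RMcomment and RMx columns on the slide stand for.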
  • 23. Query Rewriting: find syntactically and semantically valid rewrites to suggest as the user types, e.g. for “single from freddy mercury que”. Benefits: higher selectivity of query terms (quality); reduced number of query terms (efficiency); better search experience… Challenges: many rewrite candidates, some of which are semantically not “valid” in the relational setting, e.g. single (marital status), writer, “freddie mercury”, queen (the queen of UK). Rewriting steps: token rewriting via syntactic distance (1) single from freddie mercury queen …); token rewriting via semantic distance (1) single writer freddie mercury queen …); query segmentation (1) single writer “freddie mercury” queen …). Without rewriting, Keyword Interpretation uses imprecise / fuzzy matching and matches every keyword; with rewriting, Keyword / Key Phrase Interpretation uses precise matching and matches keywords and key phrases. Both variants continue with Search Space Construction and Result Retrieval & Ranking. [Figure: matches from the data index and schema index for both variants (Freddie Mercury, Queen Elizabeth 1, Queen, single, writer, Single versus Freddie Mercury, Queen, writer, Single).]
  • 24. Probabilistic Model for Query Rewriting: the rank of a query rewrite (suggestion) S is based on the probability of observing S in the data, given the query. Based on Bayes' Theorem: the probability that users write spelling errors / a semantically related query is independent of the data D, and the remaining term is constant given query Q and data D. Token Rewriting: S is ranked high when the probability that query Q can be observed in S is high. Query Segmentation: S is ranked high when the probability that S can be observed in the data D is high. Example: for “single writer freddy mercury que”, candidates are 1) single writer freddie mercury queen 2) single writer freddrick mercury monarch 3) song writer freddrick mercury head of state. [Figure: data graph with Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971 and edges writer, single.]
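The Bayes step described in words above can be written out as a worked equation. This is a reconstruction under the stated assumptions (Q depends on the data only through S, and P(Q | D) does not depend on S), not a copy of the slide's formula.

```latex
P(S \mid Q, D)
  = \frac{P(Q \mid S, D)\, P(S \mid D)}{P(Q \mid D)}
  = \frac{P(Q \mid S)\, P(S \mid D)}{P(Q \mid D)}
  \;\propto\; P(Q \mid S)\, P(S \mid D)
```

The first factor is the token rewriting model P(Q|S) and the second is the query segmentation model P(S|D), which matches how the two following slides are split.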
  • 25. Token Rewriting: modeling token rewriting P(Q|S), with an independence assumption over tokens. Modeling syntactic and semantic differences: P(q|t) is high when q is syntactically and semantically close to t. Notation: split is written |, concatenate is written +. Example: for “single writer freddy mercury que”, rewrites are 1) single writer “freddie mercury” queen 2) single writer “freddrick mercury” monarch 3) single writer “freddrick mercury” head of state; the first corresponds to single | writer | freddie + mercury | queen.
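Under the independence assumption, P(Q|S) factorizes into a product of per-token terms P(q|t). The sketch below illustrates only the syntactic part and uses the similarity ratio of Python's `difflib.SequenceMatcher` as an unnormalized stand-in for P(q|t); that choice, the one-to-one token alignment, and the function names are assumptions, and a semantic component (e.g. relating "que" to "monarch") is not modeled here.

```python
from difflib import SequenceMatcher

def p_token(q, t):
    """Stand-in for P(q | t): string similarity in [0, 1]."""
    return SequenceMatcher(None, q, t).ratio()

def p_query_given_rewrite(query_tokens, rewrite_tokens):
    """Product of per-token scores, assuming tokens are rewritten
    independently and aligned one-to-one."""
    prob = 1.0
    for q, t in zip(query_tokens, rewrite_tokens):
        prob *= p_token(q, t)
    return prob

query = ["single", "writer", "freddy", "mercury", "que"]
rewrite_1 = ["single", "writer", "freddie", "mercury", "queen"]
rewrite_2 = ["single", "writer", "freddrick", "mercury", "monarch"]
score_1 = p_query_given_rewrite(query, rewrite_1)
score_2 = p_query_given_rewrite(query, rewrite_2)
```

With a purely syntactic measure the first rewrite scores higher than the second, since "que" shares no characters with "monarch".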
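  • 26. Query Segmentation: modeling query segmentation P(S|D). For each pair of adjacent tokens in “single writer freddie mercury que”, decide the action α: concatenate or split. Here $P_D(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \dots \alpha_{i-1} t_i)$ stands for $P(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \dots \alpha_{i-1} t_i, D)$. An Nth-order Markov assumption limits the context. [Figure: data graph with Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971 and edges writer, single.]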
  • 27. Estimating Probability of Segmentation: maximum likelihood estimation (MLE), where $C(t_i \dots t_j)$ denotes the count of occurrences of the token sequence $t_i \dots t_j$. Segmentation in the structured data setting: concatenate two segments $s_i$ and $s_j$ when they co-occur in the data; split when $s_i$ and $s_j$ are connected ($s_i$ ↭ $s_j$), i.e., when the two data elements $n_i$ and $n_j$ mentioning $s_i$ and $s_j$ are connected in the data. Example: for “single writer freddie mercury queen”, decide α = concatenate or α = split after “single writer freddie”. [Figure: data graph with Single, Artist, Person, Queen, Freddie Mercury, Brian May, Queen Elizabeth 1, Liar, 1971 and edges writer, single.]
  • 28. Estimating Probability of Segmentation, Case 1: the previous segment $s_i$ has length equal to or greater than the context N. Two cases are distinguished: (1) $l(s_i) \geq N$; (2) $l(s_i) < N$. (1) When the previously induced segment $s_i$ has length $l(s_i) \geq N$, it suffices to focus on $s_i(N)$ to predict the next action $\alpha_i$ on $t_{i+1}$. Example: “freddie j. mercury” followed by “queen”. Estimation of probability: C(st) denotes the count of co-occurrences of the sequence st in D, and C(s ↭ t) is the count of all occurrences of token t connected to segment s.
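To make the role of the two counts concrete, the sketch below turns them into probabilities for the two actions. The exact estimator is not given in the text, so the normalization used here (each count divided by the sum of both) is an assumption for illustration only, as are the function name and the example counts.

```python
def segmentation_probs(count_concat, count_connected):
    """Assumed MLE-style estimate of the next action.

    count_concat:    C(st), co-occurrences of the sequence "s t" in D.
    count_connected: C(s <-> t), occurrences of t connected to s in D.
    """
    total = count_concat + count_connected
    if total == 0:
        # No evidence in the data for either action.
        return {"concatenate": 0.0, "split": 0.0}
    return {
        "concatenate": count_concat / total,
        "split": count_connected / total,
    }

# Hypothetical counts: "freddie j. mercury queen" never occurs as one
# phrase, but "queen" is connected to "freddie j. mercury" four times.
probs = segmentation_probs(count_concat=0, count_connected=4)
```

In this toy case the evidence favours a split before "queen", in line with treating “freddie j. mercury” and “queen” as separate segments.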
  • 29. Estimating Probability of Segmentation, Case 2: the previous segment $s_i$ has length less than the context N. (2) When $l(s_i) < N$, the action $\alpha_i$ on the next token $t_{i+1}$ depends on $s_i$ and $P_i(N)$, the set of segments preceding $s_i$ that, together with $s_i$, contain at most N tokens in total. Example: “single writer freddie mercury”. Estimation of probability: C(P ↭ s) denotes the count of all occurrences of the segment s connected to all segments in P.
  • 31. Graphinder: a relational keyword search approach for suggesting query completions, translating queries and ranking results. Keyword translation performance: query translation and index-based approaches are at least one order of magnitude faster than online in-memory (bidirectional) search; query translation is comparable with index-based approaches but needs less space. Keyword translation result quality: according to a recent benchmark, our ranking consistently outperforms all existing ranking systems in precision, recall and MAP (10% - 30% improvement). Effect of query rewriting: better user experience; improves efficiency by reducing the number of query terms; improves quality / selectivity of query terms; the effect depends on the complexity of queries and the underlying keyword search engine. Tight integration of query suggestion and translation. From research prototypes to Graphinder, a powerful, flexible, low upfront-cost semantic search system.
  • 33. References (1) – [VLDB14] Yongtao Ma, Thanh Tran Probabilistic Query Rewriting for Efficient and Effective Keyword Search on Graph Data In International Conference on Very Large Data Bases (VLDB'14). Hangzhou, China, September, 2014 – [ISWC13] Daniel Herzig, Roi Blanco, Peter Mika and Thanh Tran Federated Entity Search Using On-the-Fly Consolidation In International Semantic Web Conference (ISWC'13). Sydney, Australia, October, 2013 – [ICDE13] Yongtao Ma, Thanh Tran TYPifier: Inferring the Type Semantics of Structured Data In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April, 2013 – [WSDM13] Yongtao Ma, Thanh Tran TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for Heterogeneous Web Data Integration In International Conference on Web Search and Data Mining (WSDM'13). Rome, Italy, February, 2013 – [TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph Managing Structured and Semi-structured RDF Data Using Structure Indexes In Transactions on Knowledge and Data Engineering journal. – [TKDE12b] Thanh Tran, Lei Zhang Keyword Query Routing In Transactions on Knowledge and Data Engineering journal.
  • 34. References (2) – [WWW12] Daniel Herzig, Thanh Tran Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon, France, April, 2012 – [CIKM11a] Günter Ladwig, Thanh Tran Index Structures and Top-k Join Algorithms for Native Keyword Search Databases In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October, 2011 – [CIKM11b] Veli Bicer, Thanh Tran Ranking Support for Keyword Search on Structured Data using Relevance Models In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October, 2011 – [SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran Duc Repeatable and Reliable Search System Evaluation using Crowdsourcing In Proceedings of 34th Annual International ACM SIGIR Conference (SIGIR'11), Beijing, China, July, 2011 – [ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF In Proceedings of the 25th International Conference on Data Engineering (ICDE'09). Shanghai, China, March 2009 – [SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun, Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer Hermes: A Travel through Semantics in the Data Web In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July, 2009

Editor's Notes

  1. Construct the query model from structured data elements that are close to the query. Index resources in the data graph, where resources are treated as documents and attributes and attribute values are indexed as document terms. Use a standard inverted index implementation and IR search engine to retrieve resources for a given keyword query; an initial run of the query yields the F results.
  2. Query model: the probability of terms in the query model is estimated using the F resources. Intuitively, the probability of a term is estimated as the probability of observing that term in the F resources (based on the probability of observing the term in the e-value of r, and the probability of e), weighted by the importance of that resource: a resource is more important if query terms are more likely to be observed in that resource, compared to other resources in F. Edge-specific resource model: the probability of observing term v in the e-value of r, smoothed with the probability of observing term v in all values of r. The score of a resource is calculated based on the cross-entropy of the edge-specific RM and the edge-specific ResM, aggregated over every e; alpha allows control over the importance of edges. Instead of single entities, ranking complex graphs comprising multiple entities, called Joined Result Tuples (JRTs): model complex results as a geometric mean of the entity models. Ranking aggregated JRTs: the cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResMs. The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k algorithms). A language model is constructed for every attribute of the resource to capture the probability of a word being observed via repeated sampling from the content of a specific attribute of r. Lambda controls the weight of the edge-specific attribute: a small value means less emphasis on the terms of the attribute and more emphasis on the terms of the entire resource (terms in all attributes). Pe is the probability of observing a word v in the edge-specific attribute a; P* is the probability of observing a word v in all attributes of r. Consider the co-occurrences of a word and query words in the content of a specific attribute a. The sampling process we implement is i.i.d. sampling: query words and w are i.i.d. sampled from a unigram distribution a, i.e. representing the content of the specific attribute a, then sample v from a, and then sample k times query words from a distribution representing the content of all attributes of r