HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing

Maribel Acosta, Elena Simperl, Fabian Flöck, Maria-Esther Vidal
[Figure: triple pattern (dbr:Bad_Hair, dbp:producer, ?x)]

Motivation (1)
Due to the semi-structured nature of RDF, incomplete values cannot be easily detected.

Motivation (2)
Retrieve movies that have producers and have been filmed in New York City by Universal Pictures.
SELECT DISTINCT ?movie WHERE {
  ?movie rdf:type schema.org:Movie .
  ?movie dbp:producer ?producer .
  ?movie dct:subject dbc:Universal_Pictures_film .
  ?movie dct:subject dbc:Films_shot_in_New_York_City .
}
Result: 39 movies in DBpedia (v. 2015-04).

Motivation (2)
Retrieve movies that have producers and have been filmed in New York City by Universal Pictures.
SELECT DISTINCT ?movie WHERE {
  ?movie rdf:type schema.org:Movie .
  ?movie dbp:producer ?producer .
  ?movie dct:subject dbc:Universal_Pictures_film .
  ?movie dct:subject dbc:Films_shot_in_New_York_City .
}
Result: 46 movies satisfy the rdf:type and dct:subject patterns; 7 of them have no producers and are therefore missing from the answer (v. 2015-04).

Motivation
Movies (shot in NYC by Universal Pictures) with no producers in DBpedia (v. 2015-04):
•  dbr:Legal_Eagles
•  dbr:Wanderlust
•  dbr:Barney’s_Version_(film)
•  dbr:Non_Stop_(film)
•  dbr:The_Wolf_of_Wall_Street_(2013_film) (Leonardo DiCaprio is a producer)
•  dbr:Broadway_Love
•  dbr:Trainwreck_(film)
[Figure: movie posters. All images licensed under fair use via Wikipedia.]

Problem Definition
Given an RDF data set D and a SPARQL query Q against D, consider D* the virtual data set that contains all the data that should be in D.
P1) Identifying portions of Q that yield missing values
P2) Resolving missing values
Example:
[[(?movie, dbp:producer, ?producer)]]_D ⊂ [[(?movie, dbp:producer, ?producer)]]_D*
µ = {movie → dbr:The_Wolf_of_Wall_Street_(2013_film), producer → dbr:Leonardo_DiCaprio}
µ ∉ [[(?movie, dbp:producer, ?producer)]]_D (does not belong to DBpedia)
µ ∈ [[(?movie, dbp:producer, ?producer)]]_D* (should belong to DBpedia)

OUR APPROACH: HARE

HARE
•  A hybrid machine/human SPARQL query engine that enhances the size of query answers.
•  Based on a novel RDF completeness model, HARE implements query optimization and execution techniques for:
–  P1) Identifying portions of queries that yield missing values.
•  HARE resorts to microtask crowdsourcing for:
–  P2) Resolving missing values.

HARE Architecture
[Figure: HARE architecture.
Input: SPARQL query Q and threshold τ. Output: results for Q.
Components: RDF Completeness Model, Query Optimizer, Query Engine, Microtask Manager, and Crowd Knowledge (CKB+, CKB–, CKB~).
Labeled flows: query plan; RDF data from the RDF data set (LOD Cloud); crowdsourcing triple patterns; tasks for the crowd; human input; aggregated human input; bindings from the crowd.]

RDF Completeness Model (1)
Movies have producers (e.g. dbr:The_Interpreter).
[Figure: RDF graph in which dbr:The_Interpreter, dbr:Tower_Heist, and dbr:Bad_Hair have rdf:type schema.org:Movie. dbr:The_Interpreter has the dbp:producer values dbr:Eric_Fellner, dbr:Tim_Bevan, and dbr:Kevin_Misher; other dbp:producer values are shown as unknown (?).]

①  Predicate multiplicity of an RDF resource
Number of different objects that a resource has for a certain predicate.
Example: dbr:The_Interpreter has the dbp:producer objects dbr:Eric_Fellner, dbr:Tim_Bevan, and dbr:Kevin_Misher, so
MD(dbr:The_Interpreter | dbp:producer) = 3
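
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `predicate_multiplicity` and the toy data set `D` are assumptions made for this example, not part of HARE.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def predicate_multiplicity(triples: Iterable[Triple], subject: str, predicate: str) -> int:
    """Count the distinct objects that `subject` has for `predicate`."""
    return len({o for s, p, o in triples if s == subject and p == predicate})


# Toy data set mirroring the example on the slide (hypothetical excerpt).
D = [
    ("dbr:The_Interpreter", "dbp:producer", "dbr:Eric_Fellner"),
    ("dbr:The_Interpreter", "dbp:producer", "dbr:Tim_Bevan"),
    ("dbr:The_Interpreter", "dbp:producer", "dbr:Kevin_Misher"),
    ("dbr:The_Interpreter", "rdf:type", "schema.org:Movie"),
    ("dbr:Bad_Hair", "rdf:type", "schema.org:Movie"),
]

md_interpreter = predicate_multiplicity(D, "dbr:The_Interpreter", "dbp:producer")
md_bad_hair = predicate_multiplicity(D, "dbr:Bad_Hair", "dbp:producer")
```

With this toy data, `md_interpreter` is 3 and `md_bad_hair` is 0, matching the values used in the running example.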

②  Aggregated predicate multiplicity of a class
Given a predicate, the median number of distinct objects that the resources belonging to a class have for that predicate.
AMD(schema.org:Movie | dbp:producer) = 3
MD(dbr:The_Interpreter | dbp:producer) = 3
MD(dbr:Legal_Eagles | dbp:producer) = 2
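
A minimal sketch of the median-based aggregation follows. The function name, the placeholder producers `ex:producer_a` and `ex:producer_b`, and the three-movie data set are assumptions for illustration; on this tiny sample the median is 2, whereas the value 3 on the slide is computed over the whole class.

```python
from statistics import median


def aggregated_multiplicity(triples, cls, predicate):
    """Median, over the resources typed as `cls`, of their distinct-object counts for `predicate`."""
    members = sorted({s for s, p, o in triples if p == "rdf:type" and o == cls})
    if not members:
        return 0
    counts = [
        len({o for s, p, o in triples if s == m and p == predicate})
        for m in members
    ]
    return median(counts)


# Hypothetical sample: three movies with 3, 2, and 0 producers.
D = [
    ("dbr:The_Interpreter", "rdf:type", "schema.org:Movie"),
    ("dbr:Legal_Eagles", "rdf:type", "schema.org:Movie"),
    ("dbr:Bad_Hair", "rdf:type", "schema.org:Movie"),
    ("dbr:The_Interpreter", "dbp:producer", "dbr:Eric_Fellner"),
    ("dbr:The_Interpreter", "dbp:producer", "dbr:Tim_Bevan"),
    ("dbr:The_Interpreter", "dbp:producer", "dbr:Kevin_Misher"),
    ("dbr:Legal_Eagles", "dbp:producer", "ex:producer_a"),
    ("dbr:Legal_Eagles", "dbp:producer", "ex:producer_b"),
]

amd_movie = aggregated_multiplicity(D, "schema.org:Movie", "dbp:producer")
```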

③  Completeness of an RDF resource (with respect to a predicate)
Given a predicate, the completeness of an RDF resource is determined by the aggregated predicate multiplicity of the classes that it belongs to.
CompD(dbr:The_Interpreter | dbp:producer) = 3/3
CompD(dbr:Legal_Eagles | dbp:producer) = 2/3
CompD(dbr:Bad_Hair | dbp:producer) = 0/3
The numerator is computed in ① and the denominator in ②.
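
The three ratios above can be reproduced with a short sketch. The function name `completeness` is hypothetical, and the handling of a zero denominator is an assumption of this example, since the slide does not cover that case.

```python
def completeness(md: int, amd: float) -> float:
    """Ratio of a resource's predicate multiplicity (①) to the class-level aggregate (②)."""
    if amd == 0:
        # Assumption (not stated on the slide): if resources of the class
        # typically have no value for the predicate, treat the resource as complete.
        return 1.0
    return md / amd


examples = {
    "dbr:The_Interpreter": completeness(3, 3),  # 3/3
    "dbr:Legal_Eagles": completeness(2, 3),     # 2/3
    "dbr:Bad_Hair": completeness(0, 3),         # 0/3
}
```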

Crowd Knowledge
•  The knowledge collected from the crowd is captured in three knowledge bases:
CKB = (CKB+, CKB–, CKB~)
•  CKB+, CKB–, CKB~ are fuzzy sets over RDF data composed of 4-tuples of the form:
(subject, predicate, object, membership_degree)
where (subject, predicate, object) is an RDF triple.

Crowd Knowledge
Types of Crowd Knowledge Bases
•  CKB+: “Brian Grazer is a producer of Tower Heist.”
(dbr:Tower_Heist, dbp:producer, dbr:Brian_Grazer, 0.9)
•  CKB–: “Tower Heist does not have a producer.”
(dbr:Tower_Heist, dbp:producer, _:o1, 0.05)
•  CKB~: “I am not sure if Bad Hair has a producer.”
(dbr:Bad_Hair, dbp:producer, _:o2, 0.78)
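
One way to picture the three knowledge bases is as lists of 4-tuples. The sketch below is an assumption about representation only: the class names `FuzzyTriple` and `CrowdKnowledge` and their field names are invented for this example and are not taken from HARE.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple


class FuzzyTriple(NamedTuple):
    """An RDF triple plus its membership degree in a fuzzy set."""
    subject: str
    predicate: str
    obj: str
    membership_degree: float


@dataclass
class CrowdKnowledge:
    positive: List[FuzzyTriple] = field(default_factory=list)  # CKB+
    negative: List[FuzzyTriple] = field(default_factory=list)  # CKB-
    unknown: List[FuzzyTriple] = field(default_factory=list)   # CKB~


# The three statements from the slide, stored in their knowledge bases.
ckb = CrowdKnowledge()
ckb.positive.append(FuzzyTriple("dbr:Tower_Heist", "dbp:producer", "dbr:Brian_Grazer", 0.9))
ckb.negative.append(FuzzyTriple("dbr:Tower_Heist", "dbp:producer", "_:o1", 0.05))
ckb.unknown.append(FuzzyTriple("dbr:Bad_Hair", "dbp:producer", "_:o2", 0.78))
```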

Crowd Knowledge
Types of Crowd Knowledge Bases
•  CKB+: “Brian Grazer is a producer of Tower Heist.”
•  CKB–: “Tower Heist does not have a producer.”
•  CKB~: “I am not sure if Bad Hair has a producer.”
The statements in CKB+ and CKB– about Tower Heist illustrate contradiction; the statement in CKB~ illustrates uncertainty.

Crowd Knowledge
Measuring Contradiction
•  Contradiction occurs when triples with the same subject and predicate belong to CKB+ and CKB–.
•  It is measured as follows, here with the membership degrees 0.9 (CKB+) and 0.05 (CKB–) of the running example:
Contradiction(dbr:Tower_Heist | dbp:producer) = 1 – |0.9 – 0.05| = 0.15
•  Contradiction values close to 0.0 indicate high consensus.
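
The worked example can be checked with a small sketch. The slide only shows one membership degree per knowledge base; averaging several degrees and rejecting empty inputs are assumptions of this example.

```python
def contradiction(pos_degrees, neg_degrees):
    """1 - |avg(CKB+ degrees) - avg(CKB- degrees)| for one (subject, predicate) pair."""
    if not pos_degrees or not neg_degrees:
        # The slide does not say what happens without entries on both sides,
        # so this sketch refuses to guess.
        raise ValueError("need at least one degree from CKB+ and one from CKB-")
    avg_pos = sum(pos_degrees) / len(pos_degrees)
    avg_neg = sum(neg_degrees) / len(neg_degrees)
    return 1 - abs(avg_pos - avg_neg)


# Running example: 0.9 in CKB+ and 0.05 in CKB- for (dbr:Tower_Heist, dbp:producer).
value = contradiction([0.9], [0.05])
```

Up to floating-point rounding, `value` equals the 0.15 shown on the slide.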

Crowd Knowledge
Measuring Uncertainty
•  When a triple belongs to CKB~, the value of the triple object is unknown or uncertain.
•  Uncertainty is measured as follows:
Uncertainty(dbr:Bad_Hair | dbp:producer) = avg({0.78}) = 0.78
•  Uncertainty values close to 1.0 indicate that the crowd has shown itself to be unknowledgeable about the fact being vetted.
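
A sketch of the averaging step follows. Returning 0.0 when CKB~ has no matching entry is consistent with the value the deck reports for dbr:Tower_Heist; the function name and tuple layout are assumptions of this example.

```python
def uncertainty(ckb_unknown, subject, predicate):
    """Average membership degree of the CKB~ entries for (subject, predicate)."""
    degrees = [m for s, p, _o, m in ckb_unknown if s == subject and p == predicate]
    return sum(degrees) / len(degrees) if degrees else 0.0


# CKB~ from the running example.
ckb_unknown = [("dbr:Bad_Hair", "dbp:producer", "_:o2", 0.78)]

u_bad_hair = uncertainty(ckb_unknown, "dbr:Bad_Hair", "dbp:producer")
u_tower_heist = uncertainty(ckb_unknown, "dbr:Tower_Heist", "dbp:producer")
```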

Query Optimizer (1)
•  Heuristic-based optimizer that decomposes the basic graph patterns (BGPs) of a SPARQL query into two subsets:
–  SQD: triple patterns executed against the data set D,
–  SQCROWD: triple patterns to be crowdsourced.

Query Optimizer (2)
•  Given a SPARQL query Q:
–  Triple patterns in Q with variables in both the subject position and the object position are added to SQCROWD.
–  The rest of the triple patterns in Q are added to SQD.
SELECT DISTINCT ?movie WHERE {
  ?movie rdf:type schema.org:Movie .                       (t1, SQD)
  ?movie dbp:producer ?producer .                          (t2, SQCROWD)
  ?movie dct:subject dbc:Universal_Pictures_film .         (t3, SQD)
  ?movie dct:subject dbc:Films_shot_in_New_York_City .     (t4, SQD)
}
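
The decomposition heuristic can be sketched as below. Representing triple patterns as string tuples and marking variables with a leading `?` are conventions chosen for this example; the function names are hypothetical.

```python
def is_variable(term: str) -> bool:
    """SPARQL variables are written with a leading '?' in this sketch."""
    return term.startswith("?")


def decompose(bgp):
    """Split a BGP into (SQD, SQCROWD) following the heuristic on the slide."""
    sq_d, sq_crowd = [], []
    for tp in bgp:
        s, _p, o = tp
        if is_variable(s) and is_variable(o):
            sq_crowd.append(tp)  # variables in subject and object position
        else:
            sq_d.append(tp)      # everything else runs against the data set
    return sq_d, sq_crowd


t1 = ("?movie", "rdf:type", "schema.org:Movie")
t2 = ("?movie", "dbp:producer", "?producer")
t3 = ("?movie", "dct:subject", "dbc:Universal_Pictures_film")
t4 = ("?movie", "dct:subject", "dbc:Films_shot_in_New_York_City")

sq_d, sq_crowd = decompose([t1, t2, t3, t4])
```

For the running example this puts t2 in SQCROWD and t1, t3, t4 in SQD.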

Query Optimizer (3)
•  The optimizer builds a query plan TQ for query Q.
•  Triple patterns from SQD are grouped into star-shaped sub-queries in a bushy tree [Vidal et al.].
•  Triple patterns in SQCROWD are added to the plan TQ in a left-linear fashion.
[Figure: plan TQ for the running example. The SQD patterns t1, t3, and t4 are grouped together, and the SQCROWD pattern t2 is added to them.]
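
The plan construction can be approximated with the sketch below. It is a simplification under stated assumptions: star-shaped sub-queries are formed by grouping SQD patterns on their subject, stars are combined left to right rather than in a true bushy tree, and the nested-tuple plan format with the tags `STAR`, `CROWD`, and `JOIN` is invented for this example.

```python
from collections import defaultdict


def build_plan(sq_d, sq_crowd):
    """Group SQD patterns into stars by subject, then append crowd patterns left-linearly."""
    stars = defaultdict(list)
    for tp in sq_d:
        stars[tp[0]].append(tp)  # same subject -> same star-shaped sub-query

    plan = None
    for star in stars.values():
        node = ("STAR", tuple(star))
        plan = node if plan is None else ("JOIN", plan, node)
    for tp in sq_crowd:
        node = ("CROWD", tp)
        plan = node if plan is None else ("JOIN", plan, node)
    return plan


t1 = ("?movie", "rdf:type", "schema.org:Movie")
t2 = ("?movie", "dbp:producer", "?producer")
t3 = ("?movie", "dct:subject", "dbc:Universal_Pictures_film")
t4 = ("?movie", "dct:subject", "dbc:Films_shot_in_New_York_City")

plan = build_plan([t1, t3, t4], [t2])
```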

Query Engine (1)
•  Executes the query plan TQ.
•  Sub-queries that are part of SQD (here t1, t3, t4) are executed against the data set:
Ω = {{movie → dbr:Tower_Heist}, {movie → dbr:Legal_Eagles}, …}
•  For each mapping contained in Ω, the engine instantiates the triple patterns in SQCROWD.
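
Instantiating a triple pattern with a mapping amounts to substituting the bound variables. The sketch below assumes mappings are dictionaries keyed by variable name without the `?`, as in the Ω notation above; the function name is hypothetical.

```python
def instantiate(triple_pattern, mapping):
    """Replace each bound variable in the pattern with its value; leave the rest untouched."""
    return tuple(
        mapping.get(term[1:], term) if term.startswith("?") else term
        for term in triple_pattern
    )


omega = [{"movie": "dbr:Tower_Heist"}, {"movie": "dbr:Legal_Eagles"}]
t2 = ("?movie", "dbp:producer", "?producer")

instantiated = [instantiate(t2, mu) for mu in omega]
```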

Query Engine (2)
Example of an Iteration
•  The engine processes {movie → dbr:Tower_Heist}.
•  Following the running example:
Comp(dbr:Tower_Heist | dbp:producer) = 1/3 = 0.33
Contradiction(dbr:Tower_Heist | dbp:producer) = 0.15 (from CKB+ and CKB–)
Uncertainty(dbr:Tower_Heist | dbp:producer) = 0.0 (CKB~ contains only (dbr:Bad_Hair, dbp:producer, _:o2, 0.78))

Query Engine (3)
Example of an Iteration
•  The algorithm computes the probability of crowdsourcing the triple pattern (dbr:Tower_Heist, dbp:producer, ?producer):
P(CROWD | μ(s), p) = α (1 – 0.33) + (1 – α) min{0.15, 1 – 0.0} = 0.41
Here (1 – 0.33) is the estimated incompleteness and min{0.15, 1 – 0.0} is the crowd reliability.
•  α is a score weight between 0.0 and 1.0 (0.5 in the example).
•  If P(CROWD | μ(s), p) is greater than a user threshold τ, then the algorithm crowdsources the triple pattern (μ(s), p, o).
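
The computation above can be reproduced with the sketch below. Reading 0.33, 0.15, and 0.0 as the completeness, contradiction, and uncertainty values of the previous slide is an interpretation of the example; the function names and the threshold value 0.3 are assumptions made for illustration.

```python
def crowd_probability(comp, contradiction, uncertainty, alpha=0.5):
    """alpha * estimated incompleteness + (1 - alpha) * crowd reliability."""
    incompleteness = 1 - comp
    reliability = min(contradiction, 1 - uncertainty)
    return alpha * incompleteness + (1 - alpha) * reliability


def should_crowdsource(probability, tau):
    """Crowdsource only when the probability exceeds the user threshold tau."""
    return probability > tau


# Values for (dbr:Tower_Heist, dbp:producer) from the running example.
p = crowd_probability(comp=1 / 3, contradiction=0.15, uncertainty=0.0, alpha=0.5)
decision = should_crowdsource(p, tau=0.3)  # tau = 0.3 is a made-up threshold
```

Rounded to two decimals, `p` is 0.41, so with the hypothetical threshold the pattern would be sent to the crowd.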

Query Engine (4)
•  The engine combines mappings obtained from the data set D and mappings from the crowd stored in CKB+.
•  The query evaluation terminates when all the sub-queries are executed.
Theorem 1: The HARE query engine does not increase the time complexity of executing a SPARQL query.

Microtask Manager (1)
•  Receives triple patterns to crowdsource, for example:
(dbr:Tower_Heist, dbp:producer, ?p)
•  Creates human tasks.
•  Submits tasks to the crowdsourcing platform.

HARE exploits the semantics encoded in RDF resources, using the values of:
dbr:Tower_Heist, rdfs:label
dbp:producer, rdfs:label
dbr:Tower_Heist, foaf:depiction
dbr:Tower_Heist, dbo:abstract
dbr:Tower_Heist, foaf:primaryTopic
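
As a rough illustration of how labels could turn a triple pattern into a human-readable task, consider the sketch below. The question template, the function name `render_question`, and the `labels` dictionary are all hypothetical; the slides do not show the actual task interface.

```python
def render_question(triple_pattern, labels):
    """Build a question from the rdfs:label values of the subject and predicate."""
    subject, predicate, _obj = triple_pattern
    # Fall back to the raw identifier when no label is available.
    subject_label = labels.get(subject, subject)
    predicate_label = labels.get(predicate, predicate)
    return f"What is the {predicate_label} of {subject_label}?"


# Hypothetical rdfs:label values for the resources in the example pattern.
labels = {"dbr:Tower_Heist": "Tower Heist", "dbp:producer": "producer"}

question = render_question(("dbr:Tower_Heist", "dbp:producer", "?p"), labels)
```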

EXPERIMENTAL STUDY

Experimental Set-Up
•  Benchmark: 50 queries against DBpedia (v. 2014).
–  Ten queries in each of five knowledge domains: History, Life Sciences, Movies, Music, and Sports.
•  Implementation details:
–  HARE is implemented in Python 2.7.6.
–  CrowdFlower is used as the crowdsourcing platform.
•  Crowdsourcing configuration:
–  Four different RDF triples per task, 0.07 US$ per task.
–  At least three judgments were collected per task.
•  Total RDF triple patterns crowdsourced: 502
•  Total answers collected from the crowd: 1,609

Results: Size of Query Answer (1)
Metric: Number of answers when queries are executed.
[Figure: bar charts of #Answers per query (Q1 to Q10), split into Data Set Answers and Crowd Answers, for Sports, Music, and Life Sciences.]
Ranges shown with the panels, in slide order: 1.25 – 2.00, 1.50 – 2.00, 1.08 – 1.92.
HARE identifies sub-queries that produce incomplete answers.
Crowdsourcing is a feasible solution to resolve missing values.

Results: Size of Query Answer (2)
Metric: Number of answers when queries are executed.
[Figure: bar charts of #Answers per query (Q1 to Q10), split into Data Set Answers and Crowd Answers, for Movies and History.]
Ranges shown with the panels, in slide order: 1.05 – 3.13, 1.10 – 1.89.
HARE identifies sub-queries that produce incomplete answers.
Crowdsourcing is a feasible solution to resolve missing values.

Results: Crowd Response Time (1)
Metric: Elapsed time since the first task until the last answer is retrieved.
[Figure: judgments completed (%) over time (min) per query (Q1 to Q10), for Sports, Music, and Life Sciences.]
Judgments completed at the 12th minute, in slide order: 77%, 82%, 97%.
At the 12th minute after the first task is submitted, the crowd produces at least 75% of the answers.

Results: Crowd Response Time (2)
Metric: Elapsed time since the first task until the last answer is retrieved.
[Figure: judgments completed (%) over time (min) per query (Q1 to Q10), for Movies and History.]
Judgments completed at the 12th minute, in slide order: 98%, 75%.
At the 12th minute after the first task is submitted, the crowd produces at least 75% of the answers.

Results: Quality of Crowd Answers
Metric: A true positive is a mapping that belongs to the query answer.
Recall
Query  Sports  Music  Life Sciences  Movies  History
Q1     1.00    1.00   0.67           0.88    1.00
Q2     1.00    1.00   1.00           0.96    1.00
Q3     1.00    1.00   0.89           0.79    0.67
Q4     0.55    0.67   1.00           1.00    0.96
Q5     0.86    0.67   1.00           1.00    0.95
Q6     0.69    0.83   1.00           1.00    0.96
Q7     1.00    0.63   0.71           1.00    0.57
Q8     1.00    0.67   0.88           0.94    0.72
Q9     0.46    0.73   1.00           1.00    0.64
Q10    0.92    0.49   1.00           1.00    0.95
Avg    0.85    0.77   0.91           0.96    0.84
Precision
Query  Sports  Music  Life Sciences  Movies  History
Q1     1.00    1.00   1.00           0.47    1.00
Q2     1.00    0.29   1.00           1.00    1.00
Q3     1.00    1.00   1.00           1.00    1.00
Q4     0.83    1.00   1.00           1.00    1.00
Q5     1.00    0.86   1.00           1.00    1.00
Q6     1.00    1.00   1.00           1.00    0.96
Q7     1.00    1.00   1.00           1.00    0.84
Q8     1.00    1.00   1.00           1.00    0.78
Q9     1.00    1.00   1.00           1.00    0.92
Q10    1.00    1.00   1.00           1.00    0.98
Avg    0.98    0.91   1.00           0.95    0.95
The crowd exhibits heterogeneous performance within domains.
This supports the importance of HARE's triple-based approach.
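
For reference, precision and recall over sets of mappings can be computed as in the sketch below. The sets `crowd` and `gold` are made-up placeholders, not data from the study, and returning 0.0 for an empty set is a convention chosen for this example.

```python
def precision_recall(crowd_mappings, gold_mappings):
    """Standard precision and recall, counting a crowd mapping in the gold answer as a true positive."""
    true_positives = len(crowd_mappings & gold_mappings)
    precision = true_positives / len(crowd_mappings) if crowd_mappings else 0.0
    recall = true_positives / len(gold_mappings) if gold_mappings else 0.0
    return precision, recall


# Made-up example: the crowd returns three mappings, two of which are correct,
# and the gold-standard answer contains four mappings.
crowd = {"m1", "m2", "m3"}
gold = {"m1", "m2", "m4", "m5"}

precision, recall = precision_recall(crowd, gold)
```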

RELATED WORK

Summary of Related Work
Human/computer query processing architectures, by how crowdsourcing is specified:
•  Manual specification:
–  CrowdDB [Franklin et al.]: Tables, columns
–  Deco [Park and Widom]: Rules
–  Qurk [Marcus et al.]: Microtask I/O
•  Automatically:
–  HARE relies on the RDF graph and crowd knowledge to resort to crowdsourcing.

Summary of Related Work
Crowdsourcing in other contexts of Data Management (SPARQL- or RDF-based)
•  HARE (SPARQL query processing): Resorts to crowdsourcing to complete missing values in RDF data sets.
•  OASSIS [Amsterdamer et al.] (recommendation system): Mines crowdsourced patterns specified in a SPARQL-like language.
•  KATARA [Chu et al.] (tabular data cleansing): Compares tabular data against RDF data sets via crowdsourced mappings.

CONCLUSIONS & FUTURE WORK

Conclusions
•  HARE: Hybrid query engine against RDF data sets.
•  Supports microtasks to enhance query answers on-the-fly.
•  Experimental results confirmed that:
–  Size of query answer: 3.13 times
–  Crowd response time (12th min.): 98%
–  Accuracy: 0.84 – 0.96
Future work
•  Study further approaches to capture crowd reliability.
•  Consider other quality dimensions on the knowledge collected from the crowd.

References
•  [Amsterdamer et al.] Y. Amsterdamer, S. B. Davidson, T. Milo, S. Novgorodov, and A. Somech. OASSIS: query driven crowd mining. In SIGMOD, pages 589–600, 2014.
•  [Chu et al.] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247–1261, 2015.
•  [Marcus et al.] A. Marcus, D. R. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. PVLDB, 6(2):109–120, 2012.
•  [Park and Widom] H. Park and J. Widom. Query optimization over crowdsourced data. PVLDB, 6(10):781–792, 2013.
•  [Vidal et al.] M.E. Vidal, E. Ruckhaus, T. Lampo, A. Martínez, J. Sierra, and A. Polleres. Efficiently joining group patterns in SPARQL queries. In ESWC, pages 228–242, 2010.
