semlavssws2015

Parallel Data Loading during Querying Deep Web and Linked Open Data with SPARQL
Pauline Folz (1,2), Gabriela Montoya (1,3), Hala Skaf-Molli (1), Pascal Molli (1) and Maria-Esther Vidal (4)
(1) LINA, Nantes University, France
(2) Nantes Métropole, Direction Recherche, Innovation et Enseignement Supérieur
(3) Centre National de la Recherche Scientifique (CNRS), France
(4) Universidad Simón Bolívar, Venezuela
SSWS2015@ISWC2015

Querying Linked Open Data with SPARQL
• Who in the Semantic Web Community knows a well known person?
SELECT DISTINCT *
WHERE {
  ?P foaf:member ?C .
  ?C rdfs:label "Semantic Web" .
  ?P foaf:knows ?WKP .
  ?WKP foaf:name ?N .
  FILTER(?N = "Barack Obama")
}
• LOD data sources alone: no results.

Querying Deep Web and Linked Open Data with SPARQL
• The same query, evaluated over LOD data sources together with Deep Web data sources, returns results.
P. Folz, G. Montoya, H. Skaf-Molli, P. Molli, and M.-E. Vidal. SemLAV: Querying Deep Web and Linked Open Data with SPARQL. Demo, ESWC 2014, Revised Selected Papers, pages 332–337, 2014.
Video available at: https://www.youtube.com/watch?v=z7w31f-ybuQ

SemLAV: Local-As-View Mediation for SPARQL
G. Montoya, L. D. Ibáñez, H. Skaf-Molli, P. Molli, and M.-E. Vidal. SemLAV: Local-As-View Mediation for SPARQL. Transactions on Large-Scale Data- and Knowledge-Centered Systems, LNCS, Vol. 8420, pages 33–58, 2014.
Query:
Q(P,C,WKP,N) :- member(P,C), label(C,"Semantic Web"), knows(P,WKP), name(WKP,"Barack Obama")
LAV mappings:
v1(P,A,I,C,L) :- made(P,A), affiliation(P,I), member(P,C), label(C,L)
v2(A,T,P,N,C) :- title(A,T), made(P,A), name(P,N), member(P,C)
v3(P,N,R,M) :- name(P,N), name(R,M), knows(P,R)
v4(P,N,G,R,C) :- name(P,N), gender(P,G), knows(P,R), member(P,C)
v5(P,N,R,C,L) :- name(P,N), knows(P,R), member(P,C), label(C,L)

Compute Buckets
Buckets (one column per query sub-goal, listing the views that can answer it):
member(P,C)      label(C,L)       knows(P,WKP)     name(WKP,N)
v1(P,A,I,C,L)    v1(P,A,I,C,L)    v3(P,N,R,M)      v2(A,T,P,N,C)
v2(A,T,P,N,C)    v5(P,N,R,C,L)    v4(P,N,G,R,C)    v3(P,N,R,M)
v4(P,N,G,R,C)    -                v5(P,N,R,C,L)    v4(P,N,G,R,C)
v5(P,N,R,C,L)    -                -                v5(P,N,R,C,L)
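The bucket step can be sketched in a few lines of Python. This is a simplified illustration, not the SemLAV implementation: it matches sub-goals by predicate name only and ignores variable mappings, which a full bucket algorithm would also check. The names `QUERY`, `VIEWS` and `compute_buckets` are made up for this sketch.

```python
# Query sub-goals and view bodies from the slides, reduced to predicate names.
QUERY = ["member", "label", "knows", "name"]
VIEWS = {
    "v1": ["made", "affiliation", "member", "label"],
    "v2": ["title", "made", "name", "member"],
    "v3": ["name", "name", "knows"],
    "v4": ["name", "gender", "knows", "member"],
    "v5": ["name", "knows", "member", "label"],
}


def compute_buckets(query, views):
    """Map each query sub-goal to the views whose body mentions its predicate.

    Simplification: only predicate names are compared; variable mappings
    are not checked.
    """
    return {goal: [v for v, body in views.items() if goal in body] for goal in query}


buckets = compute_buckets(QUERY, VIEWS)
for goal, bucket in buckets.items():
    print(goal, bucket)
```

On this example the four buckets contain 4, 2, 3 and 4 views respectively.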

Bottleneck of LAV approach
• A LAV mediator relies on a query rewriter to translate a mediator query into a union of queries against the views.
• In the worst case, the number of candidate rewritings is (M × |V|)^N, where N is the number of query sub-goals, M is the maximal number of sub-goals in a view, and V is the set of views.
– For the simple query example: 96 candidate rewritings
– For a more complex query: millions of rewritings
• Problems:
– Cannot execute all rewritings
– Cannot guess which rewritings could produce results
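The two counts above can be checked with a short calculation. The bucket sizes are read off the Compute Buckets slide; the variable names are only for this sketch.

```python
from math import prod

# Number of views in each bucket of the running example.
bucket_sizes = {"member": 4, "label": 2, "knows": 3, "name": 4}

# Candidate rewritings: pick one view per query sub-goal.
candidates = prod(bucket_sizes.values())

# Worst-case bound (M * |V|) ** N for the same example.
N = 4  # query sub-goals
M = 4  # largest number of sub-goals in a view (v1, v2, v4 and v5 have 4)
V = 5  # number of views
worst_case = (M * V) ** N

print(candidates)  # 96
print(worst_case)  # 160000
```

Even for this four-sub-goal query the bound is already 160,000, which illustrates why enumerating rewritings does not scale.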

SemLAV Approach
• Do not generate rewritings
• Materialize relevant views and execute the original query
– Problem: there may be no time, or no space, to materialize all views
• Materialization order matters:
– Need to decide which views to materialize first
– We decide according to the number of "covered rewritings"

Ranking Relevant Views
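Buckets with views sorted by the number of query sub-goals they cover:
member(P,C)      label(C,L)       knows(P,WKP)     name(WKP,N)
v5(P,N,R,C,L)    v5(P,N,R,C,L)    v5(P,N,R,C,L)    v5(P,N,R,C,L)
v4(P,N,G,R,C)    v1(P,A,I,C,L)    v4(P,N,G,R,C)    v4(P,N,G,R,C)
v1(P,A,I,C,L)    -                v3(P,N,R,M)      v2(A,T,P,N,C)
v2(A,T,P,N,C)    -                -                v3(P,N,R,M)
Sub-goals covered per view: v5 = 4, v4 = 3, v1 = 2, v2 = 2.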

Materialization Order Matters
k = number of included views; Vk = set of included views.
      SemLAV ranking                                  Random order
k     Included views (Vk)   # Covered rewritings      Included views (Vk)   # Covered rewritings
1     v5                    1 × 1 × 1 × 1 = 1         v1                    1 × 1 × 0 × 0 = 0
2     v5, v4                2 × 1 × 2 × 2 = 8         v1, v2                2 × 1 × 0 × 1 = 0
3     v5, v4, v1            3 × 2 × 2 × 2 = 24        v1, v2, v3            2 × 1 × 1 × 2 = 4
4     v5, v4, v1, v3        3 × 2 × 3 × 3 = 54        v1, v2, v3, v4        3 × 1 × 2 × 3 = 18
5     v5, v4, v1, v3, v2    4 × 2 × 3 × 4 = 96        v1, v2, v3, v4, v5    4 × 2 × 3 × 4 = 96
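The covered-rewriting counts in this comparison are products of the number of included views per bucket, which is easy to recompute. The sketch below also includes a greedy ordering that, on this example, happens to reproduce the order shown for SemLAV; it is an illustration only, and the actual ranking algorithm is the one described in the cited SemLAV paper. All function names are hypothetical.

```python
from math import prod

# Buckets of the running example: query sub-goal -> views that can answer it.
BUCKETS = {
    "member": {"v1", "v2", "v4", "v5"},
    "label": {"v1", "v5"},
    "knows": {"v3", "v4", "v5"},
    "name": {"v2", "v3", "v4", "v5"},
}


def covered_rewritings(included):
    """Number of rewritings that use only the included views."""
    included = set(included)
    return prod(len(bucket & included) for bucket in BUCKETS.values())


def coverage_by_prefix(order):
    """Covered rewritings after materializing the first k views, for k = 1..n."""
    return [covered_rewritings(order[:k]) for k in range(1, len(order) + 1)]


def greedy_order(views):
    """Illustrative greedy ranking: always add the view that covers the most rewritings."""
    remaining, order = sorted(views), []
    while remaining:
        best = max(remaining, key=lambda v: covered_rewritings(order + [v]))
        order.append(best)
        remaining.remove(best)
    return order


print(coverage_by_prefix(["v5", "v4", "v1", "v3", "v2"]))  # [1, 8, 24, 54, 96]
print(coverage_by_prefix(["v1", "v2", "v3", "v4", "v5"]))  # [0, 0, 4, 18, 96]
print(greedy_order(["v1", "v2", "v3", "v4", "v5"]))        # ['v5', 'v4', 'v1', 'v3', 'v2']
```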

Query processing over materialized views
[Figure: views v4, v1, v5, v2 and v3]

So SemLAV Works!
[Figure: Number of answers produced by SemLAV and by randomly selected views during two minutes.]

Drawbacks of SemLAV
• Blocking execution strategy:
– Views are contacted one by one, in order.
– If v5 is huge, every other view has to wait.
• This impacts the performance of SemLAV:
– Throughput
– Time for first answer
– Total time
[Figure: views v1, v5, v4, v2 and v3]

View Loading and Query Execution
[Figure: sequential loading versus parallel loading of views v1 to v5]
• A pool of 3 threads downloads views in parallel.
• When v1 is loaded, the query is executed:
– Do we get more answers, sooner?
– But the number of triples grows much faster than with sequential loading.
• Loading data in parallel requires managing concurrent insertions into the integrated RDF graph.
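As a rough illustration of parallel loading, the sketch below uses a pool of 3 worker threads that insert views into one shared graph. Everything here is an assumption made for the example: the views are small in-memory lists instead of remote sources, the graph is a plain Python set, and a single lock serialises insertions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for remote views: each view is a list of 50 triples.
VIEWS = {
    f"v{i}": [(f"s{i}_{j}", "p", f"o{j}") for j in range(50)] for i in range(1, 6)
}

graph = set()                  # integrated RDF graph, modelled as a grow-only set
graph_lock = threading.Lock()  # protects concurrent insertions


def load_view(name):
    """Simulate downloading one view and inserting its triples into the graph."""
    triples = VIEWS[name]      # a real wrapper would fetch these over the network
    with graph_lock:
        graph.update(triples)
    return name, len(triples)


# Views are submitted in ranking order; at most 3 are loaded at the same time.
with ThreadPoolExecutor(max_workers=3) as pool:
    loaded = dict(pool.map(load_view, ["v5", "v4", "v1", "v3", "v2"]))

print(len(graph), loaded)
```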

Concurrency Management
• Parallel insertion into a grow-only graph is a lock-free problem.
• However, existing RDF stores are designed for insert/delete/transaction workloads.
• Hence, RDF stores poorly support parallel materialization of views (a dedicated RDF store is needed).

Parallel SemLAV (PS): Concurrency Model
– We simulated a Single-Reader/Multiple-Writers (SRMW) strategy on top of Jena.
– Each view is divided into n blocks of 100 triples.
[Figure: views v5, v1, v2, v4 and v3, each split into blocks of 100 triples]
• Could we get better performance with just that?
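The block-wise insertion can be sketched as follows. This is not the Jena-based implementation from the talk: it is a minimal Python model in which several writer threads and one reader share a single mutex, and writers hold it only for the duration of one block of 100 triples. The class and function names are invented for the example.

```python
import threading

BLOCK_SIZE = 100  # block size stated on the slide


def blocks(triples, size=BLOCK_SIZE):
    """Split a view into consecutive blocks of at most `size` triples."""
    for start in range(0, len(triples), size):
        yield triples[start:start + size]


class SharedGraph:
    """Grow-only graph shared by several writers and a single reader."""

    def __init__(self):
        self._triples = set()
        self._lock = threading.Lock()

    def insert_block(self, block):
        with self._lock:  # held for one block only
            self._triples.update(block)

    def snapshot(self):
        with self._lock:  # the reader sees whole blocks, never a partial one
            return frozenset(self._triples)


def writer(graph, triples):
    for block in blocks(triples):
        graph.insert_block(block)


graph = SharedGraph()
views = [[(f"v{i}", "p", j) for j in range(250)] for i in range(3)]
threads = [threading.Thread(target=writer, args=(graph, v)) for v in views]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(graph.snapshot()))
```

Releasing the lock between blocks gives the reader a chance to run the query while large views are still being loaded.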

When to execute the query?
• Why wait until a view is loaded to execute the query? Other simple strategies are possible; which one is the best?
• Be careful:
– more query executions -> less loading
– fewer query executions -> more time before the first results
• We define four execution strategies:
– View dependent (PS), Time dependent (PS-TDC), Data dependent (PS-DDC), Two-phase execution (PS-DDC-ASK), (PS-TDC-ASK)

View Dependent Criterion (PS)
• The query engine is woken up after a new view is completely loaded.
[Figure: loading of views v5, v1, v2, v4 and v3]

Time Dependent Criterion (PS-TDC)
• The query engine is woken up after a period of time t
– if t is n milliseconds, the query is executed every n milliseconds
[Figure: timeline of loading views v5, v1, v2, v4 and v3, with time marks at 0, n, 2n, 3n and 4n]
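A time-dependent trigger can be modelled with a timed wait on an event. In this sketch, which is only an illustration, one writer thread fills a set of triples, `run_query` is a stand-in that records the graph size instead of evaluating SPARQL, and the period is 10 ms rather than a realistic value.

```python
import threading
import time


def query_every(period_s, loading_done, run_query):
    """Run run_query every period_s seconds until loading is done, then once more."""
    executions = 0
    # Event.wait returns False on timeout, True once the event has been set.
    while not loading_done.wait(timeout=period_s):
        run_query()
        executions += 1
    run_query()  # final execution over the complete graph
    return executions + 1


graph, lock = set(), threading.Lock()
loading_done = threading.Event()
answers = []


def writer():
    for i in range(20):
        with lock:
            graph.add(("s", "p", i))
        time.sleep(0.002)  # simulated download latency
    loading_done.set()


def run_query():
    with lock:
        answers.append(len(graph))  # stand-in for evaluating the query


t = threading.Thread(target=writer)
t.start()
runs = query_every(0.01, loading_done, run_query)
t.join()
print(runs, answers)
```

The number of intermediate executions depends on timing, but the last execution always sees the complete graph.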

Data Dependent Criterion (PS-DDC)
• The query engine is woken up after a certain number n of triples have been inserted into the integrated RDF graph by the writers.
[Figure: loading of views v5, v1, v2, v4 and v3, with data size marks at 0, n, 2n, 3n and 4n]
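A data-dependent trigger only needs a counter of newly inserted triples. The sketch below is single-threaded for clarity and uses an invented class name; in a concurrent setting the counter and the graph would have to be protected as well.

```python
class DataDependentTrigger:
    """Run the query each time `threshold` new triples have been inserted."""

    def __init__(self, threshold, run_query):
        self.threshold = threshold
        self.run_query = run_query
        self.graph = set()
        self._since_last = 0  # new triples inserted since the last execution

    def insert(self, triples):
        for triple in triples:
            if triple in self.graph:
                continue  # duplicates do not count as new data
            self.graph.add(triple)
            self._since_last += 1
            if self._since_last >= self.threshold:
                self._since_last = 0
                self.run_query(self.graph)


sizes_at_execution = []
trigger = DataDependentTrigger(500, lambda g: sizes_at_execution.append(len(g)))
trigger.insert(("s", "p", i) for i in range(1200))
print(sizes_at_execution)  # [500, 1000]
```

With a threshold of 500 and 1,200 inserted triples, the query runs when the graph holds 500 and 1,000 triples; a final execution after loading ends would still be needed to cover the last 200.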

Two-phase Criterion (PS-DDC-ASK) and (PS-TDC-ASK)
• The first phase performs an ASK query to check for new results; if the answer is yes, the second phase runs.
• The second phase executes the original query
– triggered either by time (PS-TDC-ASK) or by data (PS-DDC-ASK).
[Figure: loading of views v5, v1, v2, v4 and v3, with ASK -> NO, ASK -> NO, ASK -> YES]
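The two phases can be mimicked without an RDF engine. In this sketch the graph is a set of tuples, `ask` is a stand-in for a SPARQL ASK query that stops at the first match, and `select` is a stand-in for the full query; the pattern (someone who knows a person named "Barack Obama") is a reduced version of the running example. How the real system detects that results are new is not modelled here.

```python
def matches(graph):
    """Lazily yield persons P with (P, knows, W) and (W, name, 'Barack Obama')."""
    named = {s for (s, p, o) in graph if p == "name" and o == "Barack Obama"}
    return (s for (s, p, o) in graph if p == "knows" and o in named)


def ask(graph):
    """Phase 1: is there at least one answer? Stops at the first match."""
    return next(matches(graph), None) is not None


def select(graph):
    """Phase 2: evaluate the full query."""
    return sorted(set(matches(graph)))


def two_phase(graph, log):
    if not ask(graph):
        log.append("ASK -> NO")
        return []
    log.append("ASK -> YES")
    return select(graph)


log = []
graph = {("alice", "knows", "bob")}
first = two_phase(graph, log)   # the name triple has not been loaded yet
graph.add(("bob", "name", "Barack Obama"))
second = two_phase(graph, log)  # now the full query is worth running
print(log, first, second)
```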

Experimental Evaluation
• Implement and compare with SemLAV:
– Berlin Benchmark: 10,000,736 triples
– 16 queries (out of 18), 510 views
– Linux server with 128 GB of memory and 124 processors; 20 GB of RAM are allocated for the experiments.
• For parallel SemLAV (PS)
– Threads are executed in parallel to download views
– Different numbers of threads: 5, 10 and 20
– More information in the paper and on the project website: https://sites.goole.com/site/sematiclav

Results of BSBM using the View Dependent Criterion (PS)

Results of BSBM using the Time Dependent Criterion (PS-TDC). Queries are executed every 500 ms.

Results of BSBM using the Data Dependent Criterion (PS-DDC). Queries are executed after every insertion of 500 triples.

Results of BSBM using the PS-DDC-ASK strategy. Queries are executed whenever 500 triples are inserted into the integrated RDF graph.

Better total time for parallel SemLAV, but no dominant strategy.

Better throughput for parallel SemLAV, but no dominant strategy.

Time for first answer

Conclusion and Future Work
• Parallel processing of SPARQL queries using LAV views.
• The new execution strategies outperform SemLAV in terms of throughput and total time.
• There is a trade-off between throughput and time for first answer.
• In the future:
– Build a grow-only RDF store to better support parallel loading
– Incremental evaluation of the query relying on view updates…
