FedBench: A Benchmark Suite for Federated Semantic Data Processing

Michael Schmidt (1), Olaf Görlitz (2), Peter Haase (1), Günter Ladwig (3),
Andreas Schwarte (1), Thanh Tran (3)

[Affiliation logos (1), (2), (3)]

10th Intl. Semantic Web Conference, Oct 26, 2011, Bonn

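Linked Data Evaluation Strategies

[Figure: two architectures side by side.
Centralized Linked Data Processing: a query is evaluated against a central repository into which the RDF data sources have been loaded.
Federated Linked Data Processing: a query is evaluated by a federation layer over local repositories, SPARQL endpoints, and dynamic HTTP lookups, each backed by its own RDF data source.]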

Centralized vs. Federated Approaches

Centralized Processing
•  Data periodically crawled, gathered, and updated
•  High reliability and controllability
•  Inflexible set of data sources
•  Comprehensive knowledge about data, useful for query optimization

Federated Processing
•  Use of original data sources ensures that data is always "up-to-date"
•  No control over federation members
•  Ad-hoc integration of remote sources
•  Requires careful optimization, but also offers opportunities (parallelization)

Key Observations
(1)  Both centralized and federated Linked Data processing have practical use cases
(2)  Radically different requirements, challenges, and characteristics

Benchmarking Linked Data Evaluation

[Figure: the centralized and federated architectures from "Linked Data Evaluation Strategies", annotated with the available benchmarks.]
•  Centralized Linked Data Processing: BSBM, LUBM, SP2Bench, ...
•  Federated Linked Data Processing: so far no benchmarks proposed

Challenges in Federated Linked Data Benchmarking:
Heterogeneity of Use Cases

Data level
•  (D1) Physical Distribution
    -  Local vs. remote
•  (D2) Data Access Interfaces
    -  Native repository
    -  SPARQL Endpoint
    -  Linked Data (HTTP)
•  (D3) Knowledge about Data Source Existence
•  (D4) Data Statistics

Query level
•  (Q1) Query Language
    -  Expressiveness
    -  Complexity
•  (Q2) Result Completeness
•  (Q3) Ranking
•  Various other characteristics
    -  Join types
    -  Result size
    -  ...

Need for a flexible benchmark suite rather
than a "one-size-fits-all" benchmark scenario!

FedBench Components (ctd)

Data Sets

•  Vary in structuredness, domain, size, etc.
•  Grouped in collections

Data Collections

•  Cross-Domain Collection
•  Life Science Collection
•  SP2Bench Data Collection
    -  Synthetic data
    -  Split into sub-datasets according to types


Queries

•  Operate on the data collections
•  Logically grouped

Example Query

List all US presidents including their party and associated news.

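PREFIX rdf:         <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:         <http://www.w3.org/2002/07/owl#>
PREFIX dbpedia:     <http://dbpedia.org/resource/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX nytimes:     <http://data.nytimes.com/elements/>

SELECT ?pres ?party ?page
WHERE {
  ?pres rdf:type dbpedia-owl:President .
  ?pres dbpedia-owl:nationality dbpedia:United_States .
  ?pres dbpedia-owl:party ?party .
  ?x nytimes:topicPage ?page .
  ?x owl:sameAs ?pres
}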
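
One way to picture what a federation layer has to do with this query is to decide, for each triple pattern, which sources could answer it. The sketch below is a minimal, hypothetical illustration and not the method of any system evaluated in this talk: it assumes two endpoints (labelled `dbpedia` and `nytimes`) and a hand-written catalog of which predicate prefixes each one serves.

```python
# Hypothetical source-selection sketch: map each triple pattern to candidate
# endpoints by the prefix of its predicate. The catalog below is an assumption
# made for illustration only.

PATTERNS = [
    ("?pres", "rdf:type", "dbpedia-owl:President"),
    ("?pres", "dbpedia-owl:nationality", "dbpedia:United_States"),
    ("?pres", "dbpedia-owl:party", "?party"),
    ("?x", "nytimes:topicPage", "?page"),
    ("?x", "owl:sameAs", "?pres"),
]

# Assumed catalog: predicate prefix -> endpoints that may hold such triples.
CATALOG = {
    "rdf": ["dbpedia", "nytimes"],
    "owl": ["dbpedia", "nytimes"],
    "dbpedia-owl": ["dbpedia"],
    "nytimes": ["nytimes"],
}


def select_sources(patterns, catalog):
    """Return {pattern: [endpoints]} based on the predicate's prefix."""
    plan = {}
    for pattern in patterns:
        prefix = pattern[1].split(":", 1)[0]
        plan[pattern] = catalog.get(prefix, [])
    return plan


plan = select_sources(PATTERNS, CATALOG)
for pattern, sources in plan.items():
    print(" ".join(pattern), "->", sources)
```

In this toy catalog, patterns with generic predicates such as `rdf:type` or `owl:sameAs` cannot be pinned to a single source, so they would have to be sent to every endpoint.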

Queries

•  Partially taken from prototype systems, partially designed to capture challenges in federated query processing
•  Four sets of queries
    -  Life Science
        -  Life Science query set (full SPARQL): 7 queries (LS)
    -  Cross Domain
        -  Cross Domain query set (full SPARQL): 7 queries (CD)
        -  Linked Data query set (BGPs): 11 queries (LD)
    -  SP2Bench
        -  SP2Bench query set (full SPARQL): 14 queries (SP)
•  Focus on different functional aspects
    -  General federated query processing requirements
    -  Pure Linked Data processing

Queries

Operators: A – AND, U – UNION, O – OPTIONAL, F – FILTER
Solution Modifiers: Or – ORDER BY, D – DISTINCT, L – LIMIT, Of – OFFSET


Benchmark Driver
•  Executes FedBench in a unified way
•  Java, open source → easily adjustable and extensible

Evaluation Framework
•  Parameterizable benchmark driver
•  Implemented in Java using the Sesame framework
•  Highly customizable via config files
    -  Data and query sets
    -  Number of runs, timeouts
    -  Deployment method of data sets
    -  Metrics (loading time, evaluation time, #requests)
•  Highly extensible, which makes it easy to connect new systems on demand
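
The configurable options listed above can be pictured as a small settings object. The sketch below is written in Python for brevity (the actual driver is Java), and every key name, default value, and the `key = value` file syntax are assumptions made for illustration; they are not FedBench's real configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkConfig:
    """Hypothetical settings mirroring the options named on the slide."""
    data_collection: str = "cross-domain"
    query_set: str = "CD"
    runs: int = 3
    timeout_seconds: int = 600
    deployment: str = "sparql-endpoint"
    metrics: list = field(default_factory=lambda: [
        "loading_time", "evaluation_time", "requests"])


def parse_config(text):
    """Parse simple 'key = value' lines (assumed syntax) into a config."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    defaults = BenchmarkConfig()
    metrics = values.get("metrics")
    return BenchmarkConfig(
        data_collection=values.get("data_collection", defaults.data_collection),
        query_set=values.get("query_set", defaults.query_set),
        runs=int(values.get("runs", defaults.runs)),
        timeout_seconds=int(values.get("timeout_seconds", defaults.timeout_seconds)),
        deployment=values.get("deployment", defaults.deployment),
        metrics=[m.strip() for m in metrics.split(",")] if metrics else defaults.metrics,
    )


example = """
# hypothetical benchmark run
data_collection = life-science
query_set = LS
runs = 3
timeout_seconds = 600
metrics = evaluation_time, requests
"""
config = parse_config(example)
print(config)
```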


[Figure: FedBench components: Data Sets, Queries, Benchmark Driver, Benchmark Results (CSV, RDF), Linked Data Publishing]

Linked Data Publishing
•  Wiki-based platform
•  Publishing and discussion of benchmark results

Evaluation
•  Goal: demonstrate the practicability and flexibility of the benchmark
    -  Cover a variety of scenarios
    -  Assess first state-of-the-art results
    -  Identify weaknesses and strengths of systems
•  Measures
    -  Query evaluation time
    -  Number of requests sent to remote sources
•  Hardware
    -  HP ProLiant DL360 server (iLO 2)
    -  4-core CPU at 2000 MHz
    -  64-bit Windows Server 2008, running 64-bit JVM 1.6.0_22
    -  32 GB RAM (20 GB for the federation mediator, the rest distributed among federation members)

Evaluation: Scenario A

•  "Centralized vs. Federated" query processing
    -  Scenario A1: Centralized processing
        -  Sesame 2.3.1
    -  Scenario A2: Local federation
        -  Sesame 2.3.1 + AliBaba
    -  Scenario A3: SPARQL Endpoint federation (HTTP)
        -  Sesame 2.3.1 + AliBaba
        -  SPLENDID from WeST
•  10 min timeout per query
•  Average over three runs (after warm-up phase)
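
The measurement protocol (warm-up, three timed runs, per-query timeout) can be sketched as a small harness. This is an illustrative Python sketch, not the FedBench driver: `measure` and `run_once` are hypothetical names, and the single warm-up run is an assumption, since the slide does not say how long the warm-up phase is.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def run_once(pool, query_fn, timeout_s):
    """Run query_fn once; return elapsed seconds, or None on timeout."""
    start = time.perf_counter()
    try:
        pool.submit(query_fn).result(timeout=timeout_s)
    except TimeoutError:
        return None
    return time.perf_counter() - start


def measure(query_fn, runs=3, warmup=1, timeout_s=600.0):
    """Average evaluation time over `runs` timed runs after `warmup` runs.

    Returns None as soon as any run exceeds the timeout. The worker thread
    is not killed on timeout; the pool still waits for it when closing.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(warmup):
            if run_once(pool, query_fn, timeout_s) is None:
                return None
        timings = []
        for _ in range(runs):
            elapsed = run_once(pool, query_fn, timeout_s)
            if elapsed is None:
                return None
            timings.append(elapsed)
    return sum(timings) / len(timings)


# Stand-in for a real query evaluation call.
average = measure(lambda: time.sleep(0.01), runs=3, warmup=1, timeout_s=600.0)
print(f"average evaluation time: {average:.3f} s")
```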

Scenario A: Life Science Queries
Data size: 50M triples in total

#Requests to Endpoints            LS1     LS2     LS3     LS4     LS5     LS6     LS7
Endpoint Federation (AliBaba)     13      61      (410)   21k     17k     (130)   (876)
Endpoint Federation (SPLENDID)    2       49      9       10      4778    322     4889

Evaluation: Scenario B
•  Scenario B: Linked Data query set on CD collection
    -  Bottom-up approach
    -  Top-down approach
    -  Mixed approach
•  Local CumulusRDF Linked Data server
•  Systems: dedicated prototype implementations*
•  Major findings
    -  Top-down approach most performant
    -  Mixed approach competitive, bringing the merits of earlier result reporting

* G. Ladwig, T. Tran: Linked Data Query Processing Strategies. In Proc. ISWC, 2010.

Summary: Central Findings

•  Effective join ordering is often impossible when no intelligent source selection strategy is given
•  In such cases: often a very high number of requests (10^4+), caused by the iterative, nested-loop evaluation strategy of AliBaba
•  Limited capabilities of Sesame to deal with parallelization cause problems (locking issues)
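
To see why an iterative, nested-loop strategy drives up the request count, consider a toy model in which every triple-pattern lookup costs one remote request. The sketch below is a simplified illustration under that assumption; `CountingEndpoint` and the synthetic data are hypothetical and do not reproduce AliBaba's actual implementation.

```python
class CountingEndpoint:
    """In-memory stand-in for a remote SPARQL endpoint that counts requests."""

    def __init__(self, triples):
        self.triples = triples
        self.requests = 0

    def match(self, s=None, p=None, o=None):
        """Answer one triple-pattern request; None acts as a variable."""
        self.requests += 1
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]


def nested_loop_join(left, right):
    """Join (?pres party ?party) on `left` with (?x sameAs ?pres) on `right`."""
    results = []
    for pres, _, party in left.match(p="party"):         # one request in total
        for x, _, _ in right.match(p="sameAs", o=pres):  # one request per binding
            results.append((pres, party, x))
    return results


# Synthetic data: 1000 bindings on the left, 100 matching links on the right.
left = CountingEndpoint(
    [(f"pres{i}", "party", f"party{i % 2}") for i in range(1000)])
right = CountingEndpoint(
    [(f"topic{i}", "sameAs", f"pres{i}") for i in range(0, 1000, 10)])

rows = nested_loop_join(left, right)
print(len(rows), "results;", left.requests + right.requests, "requests")
```

In this model the number of requests grows linearly with the number of intermediate bindings, so ten thousand bindings on the left side would already mean more than ten thousand requests to the right endpoint.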

In the following talk:
FedX – a federated query processing system that tackles these issues!

Conclusion

•  Benchmark flexible enough to cover a wide range of semantic data use cases/applications
•  Evaluation reveals severe deficiencies of today's approaches
•  Upcoming tasks/future work
    -  General SPARQL 1.1 extensions
    -  SPARQL 1.1 federation extensions
    -  Distributed reasoning
•  Laid out as a community project: you are invited to contribute your own data & queries!

Questions?

http://code.google.com/p/fbench/
