SlideShare une entreprise Scribd logo
1  sur  31
1
Big Linked Data ETL Benchmark
on Cloud Commodity Hardware
iMinds – Ghent University
Dieter De Witte, Laurens De Vocht,
Ruben Verborgh, Erik Mannens, Rik Van de Walle
Ontoforce
Kenny Knecht, Filip Pattyn, Hans Constandt
2
Introduction
Approach
Benchmark
Results
Conclusions & Next Steps
2
3
Introduction
Approach
Benchmark
Results
Conclusions & Next Steps
3
4
Introduction
 Facilitate development of semantic federated query engine
close the (semantic) analytics gap in life sciences.
 The query engine drives an exploratory search application: DisQover
 Approach to federated querying by implementing ETL pipeline
indexes the user views in advance.
 Combine Linked Open Data with private and licensed (proprietary) data
discovery of biomedical data
new insights in medicine development.
5
DisQover: which data?
6
 Ensure minimal knowledge about data linking or annotation is
required
to explore and find results.
 Write SPARQL directly
detailed knowledge of the predicates is required
might require first exploring to determine the URIs.
 Scaling out to more data
 Search queries are complex because search spans two distinct
domains:
1. the ‘space’ of clinical studies;
2. ‘drugs/chemicals’.
Challenges
7
Introduction
Approach
Benchmark
Results
Conclusions & Next Steps
7
8
Approach
How to do federated search with
minimal latency for end-user?
Which RDF stores support the
infrastructure?
What aspects should the design
of a reusable benchmark take
into account?
9
The scaling-out approach relies on low-end commodity
hardware but uses many nodes in a distributed system:
1. Specialized scalable RDF stores, the focus of this work;
2. Translating SPARQL and RDF to existing NoSQL stores;
3. Translating SPARQL and RDF to existing Big Data approaches
such as MapReduce, Impala, Apache Spark;
4. Distributing the data in physically separated SPARQL endpoints
over the Semantic Web, using federated querying techniques
to resolve complex questions.
Note: Compression (in-memory) is an alternative for distribution.
RDF datasets can be compressed (e.g. “Header Dictionary
Triples” – HDT).
Scaling out: techniques
10
ETL in instead of direct querying
Direct ETL
11
 Typical DisQover queries introduce much query latency when directly
federated.
 Facets consist of multiple separate SPARQL queries and serve both as
filter and as dashboard.
 Data integration in DisQover:
Facets filter across all data originating from multiple different
sources.
Why?
12
Introduction
Approach
Benchmark
Results
Conclusions & Next Steps
12
13
ETL
Design of benchmark focus:
 ETL part needs to be optimally cost efficient.
 SPARQL queries for indexes maximally aligned with
front-end.
 What is are the tradeoffs for each RDF store?
Benchmark
14
 What is the most cost-effective storage solution to support Linked Data
applications that need to be able to deal with heavy ETL query
workloads?
 Which performance trade-offs do storage solutions offer in terms of
scalability?
 What is the impact of different query types (templates)?
 Is there a difference in performance between the stores based on the
structural properties of the queries?
Note: not taken into account implicitly derived facts, inference or reasoning.
Questions the benchmark answers
15
WatDiv provides stress testing tools for SPARQL
existing benchmarks not always suitable for testing systems in
diverse queries and varied workloads:
 generic benchmark + not application specific;
 covers a broad spectrum
result cardinality
triple-pattern selectivity
ensured through the data and query generation method;
 Benchmark is repeatable with different dataset sizes or numbers of
queries.
Data and Query Generation
16
The RDF store should be capable of serving in a production environment with
Linked Data in Life Sciences.
The initial selection was made by choosing stores with:
• a high adoption/popularity as defined by DB-Engines.com ranking for RDF
stores;
• enterprise support;
• support for distributed deployment;
• full SPARQL 1.1 compliance.
The four stores we selected all comply with these constraints.
Note: The names of two stores we tested could not be disclosed.
They are being referred to as Enterprise Store I and II (ESI and ESII)
RDF Store Selection
17
The benchmark process consists of a data loading phase, followed by
running the SPARQL benchmarker:
1. The data is loaded in compressed format (gzip).
2. The benchmarker runs in multi-threaded mode (8 threads),
runs a set of 2000 queries multiple times.
3. These runs consists of at least one warm-up run which is not
counted.
4. In order to obtain robust results the tail results (most extreme) are
discarded before calculating average query runtimes.
5. The benchmarker generates a CSV file containing the run times
and response times etc. of all queries which we visualized.
Process
18
Query Driver
“SPARQL Query Benchmarker” is a general purpose API and CLI that is
designed primarily for testing remote SPARQL servers.
By default operations are run in a random order to avoid the system under
test (SUT) learning the pattern of operations.
Hardware
Executed all benchmarks on the Amazon Web Services (AWS) Elastic
Compute Cloud (EC2) and Simple Storage Solutions (S3).
Used the default (commercial) deployments of the SUT for the results to
be reproducible:
 both the hardware and the machine images can be easily acquired.
 more generally, cloud deployments offer the advantage of not
requiring dedicated on-premises hardware.
Infrastructure
19
Introduction
Approach
Benchmark
Results
Conclusions & Next Steps
19
20
Cost
Scalability
Behavior (Different Query Types)
Errors and Time-outs
Results
21
CostCost
22
Scalability: 0.01 B – 0.1 B – 1 B
23
Scalability: 1B
300
24
Behavior: different query types
S FL
Combinations of those
C
C
25
Behavior: different query types
26
Errors and time-outs
Every runtime > 300s is a time-out.
If the run-time reaches a maximum of < 300s we detect an internal set time-
out.
This was in particular the case voor ESII (3 nodes)
60
27
Scalability: 1B revisited
60
ESII-3 still outperforms ESII-1
when looking at queries that
did not time-out
28
Issues in the followed approach
 Choose for virtual machine images in the cloud (AWS) for
reproducibility;
but cloud solutions might not always be best suited for production.
 The results of different benchmark studies might depend on many
(hidden) configuration factors leading to different or even
contradicting results.
 The difference in performance between the stores might be attributed
to the use of commodity hardware in the cloud.
 Differences partially attributed to the quality of the recommended
configuration parameters as provided by the virtual machine images.
29
Introduction
Approach
Benchmark
Results
Conclusions & Next Steps
29
30
Conclusions & Next steps
 Compared enterprise RDF stores
default configuration
without the intervention of enterprise support.
 Run stores in their optimal configuration (reflecting a production
setting)
with more instances (> 3).
 Repeat the benchmark with DisQover data and queries.
 Create overview of RDF solutions for different
use cases, configurations and real-world (life science) datasets.
 Investigate whether the WatDiv results are confirmed when running the
benchmark with other queries and data.
 Release tools for repeating the benchmark with new storage solutions.
31
Contact Details
laurens.devocht@ugent.beE-MAIL:
@laurens_d_vTWITTER:
SLIDES: slideshare.net/laurensdv

Contenu connexe

Tendances

Tendances (10)

Jeevananthan_Informatica
Jeevananthan_InformaticaJeevananthan_Informatica
Jeevananthan_Informatica
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
 
What is ETL testing & how to enforce it in Data Wharehouse
What is ETL testing & how to enforce it in Data WharehouseWhat is ETL testing & how to enforce it in Data Wharehouse
What is ETL testing & how to enforce it in Data Wharehouse
 
TimesTen Overview
TimesTen OverviewTimesTen Overview
TimesTen Overview
 
(ATS4-APP05) What's new in Isentris 4.0SP1
(ATS4-APP05) What's new in Isentris 4.0SP1(ATS4-APP05) What's new in Isentris 4.0SP1
(ATS4-APP05) What's new in Isentris 4.0SP1
 
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
 
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
 
ESG Lab Report - Catalogic Software DPX
ESG Lab Report - Catalogic Software DPXESG Lab Report - Catalogic Software DPX
ESG Lab Report - Catalogic Software DPX
 
Building Efficient Software with Property Based Testing
Building Efficient Software with Property Based TestingBuilding Efficient Software with Property Based Testing
Building Efficient Software with Property Based Testing
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process
 

En vedette

Using Triple Pattern Fragments To Enable Streaming of Top-k Shortest Paths vi...
Using Triple Pattern Fragments To Enable Streaming of Top-k Shortest Paths vi...Using Triple Pattern Fragments To Enable Streaming of Top-k Shortest Paths vi...
Using Triple Pattern Fragments To Enable Streaming of Top-k Shortest Paths vi...
Laurens De Vocht
 
Researcher Profiling based on Semantic Analysis in Social Networks
Researcher Profiling based on Semantic Analysis in Social NetworksResearcher Profiling based on Semantic Analysis in Social Networks
Researcher Profiling based on Semantic Analysis in Social Networks
Laurens De Vocht
 
OSLO: Open Standards for Linked Organizations
OSLO: Open Standards for Linked OrganizationsOSLO: Open Standards for Linked Organizations
OSLO: Open Standards for Linked Organizations
Laurens De Vocht
 

En vedette (20)

Big Data Lessons from the Cloud
Big Data Lessons from the CloudBig Data Lessons from the Cloud
Big Data Lessons from the Cloud
 
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
Strata 2014-tdunning-anomaly-detection-140211162923-phpapp01
 
Aligning Web Collaboration Tools with Research Data for Scholars
Aligning Web Collaboration Tools with Research Data for ScholarsAligning Web Collaboration Tools with Research Data for Scholars
Aligning Web Collaboration Tools with Research Data for Scholars
 
The DataTank, RML and Domain Modelling
The DataTank, RML and Domain ModellingThe DataTank, RML and Domain Modelling
The DataTank, RML and Domain Modelling
 
Using Triple Pattern Fragments To Enable Streaming of Top-k Shortest Paths vi...
Using Triple Pattern Fragments To Enable Streaming of Top-k Shortest Paths vi...Using Triple Pattern Fragments To Enable Streaming of Top-k Shortest Paths vi...
Using Triple Pattern Fragments To Enable Streaming of Top-k Shortest Paths vi...
 
Discovering Meaningful Connections between Resources in the Web of Data
Discovering Meaningful Connections between Resources in the Web of DataDiscovering Meaningful Connections between Resources in the Web of Data
Discovering Meaningful Connections between Resources in the Web of Data
 
Providing Interchangeable Open Data to Accelerate Development of Sustainable ...
Providing Interchangeable Open Data to Accelerate Development of Sustainable ...Providing Interchangeable Open Data to Accelerate Development of Sustainable ...
Providing Interchangeable Open Data to Accelerate Development of Sustainable ...
 
A Framework Concept for Profiling Researchers on Twitter using the Web of Data
A Framework Concept for Profiling Researchers on Twitter using the Web of DataA Framework Concept for Profiling Researchers on Twitter using the Web of Data
A Framework Concept for Profiling Researchers on Twitter using the Web of Data
 
Talend for big_data_intorduction
Talend for big_data_intorductionTalend for big_data_intorduction
Talend for big_data_intorduction
 
Meet David - ETL / Informatica Consultant
Meet David - ETL / Informatica ConsultantMeet David - ETL / Informatica Consultant
Meet David - ETL / Informatica Consultant
 
Etl with talend (data integeration)
Etl with talend (data integeration)Etl with talend (data integeration)
Etl with talend (data integeration)
 
A Visual Exploration Workflow as Enabler for the Exploitation of Linked Open ...
A Visual Exploration Workflow as Enabler for the Exploitation of Linked Open ...A Visual Exploration Workflow as Enabler for the Exploitation of Linked Open ...
A Visual Exploration Workflow as Enabler for the Exploitation of Linked Open ...
 
Researcher Profiling based on Semantic Analysis in Social Networks
Researcher Profiling based on Semantic Analysis in Social NetworksResearcher Profiling based on Semantic Analysis in Social Networks
Researcher Profiling based on Semantic Analysis in Social Networks
 
Talend Community Use Group Bristol: Preparing your business for mastering dat...
Talend Community Use Group Bristol: Preparing your business for mastering dat...Talend Community Use Group Bristol: Preparing your business for mastering dat...
Talend Community Use Group Bristol: Preparing your business for mastering dat...
 
Benchmarking the Effectiveness of Associating Chains of Links for Exploratory...
Benchmarking the Effectiveness of Associating Chains of Links for Exploratory...Benchmarking the Effectiveness of Associating Chains of Links for Exploratory...
Benchmarking the Effectiveness of Associating Chains of Links for Exploratory...
 
Effect of Heuristics on Serendipity in Path-Based Storytelling with Linked Data
Effect of Heuristics on Serendipity in Path-Based Storytelling with Linked DataEffect of Heuristics on Serendipity in Path-Based Storytelling with Linked Data
Effect of Heuristics on Serendipity in Path-Based Storytelling with Linked Data
 
OSLO: Open Standards for Linked Organizations
OSLO: Open Standards for Linked OrganizationsOSLO: Open Standards for Linked Organizations
OSLO: Open Standards for Linked Organizations
 
vBACD July 2012 - Apache Hadoop, Now and Beyond
vBACD July 2012 - Apache Hadoop, Now and BeyondvBACD July 2012 - Apache Hadoop, Now and Beyond
vBACD July 2012 - Apache Hadoop, Now and Beyond
 
Talend winter 2017 overview webinar
Talend winter 2017 overview webinarTalend winter 2017 overview webinar
Talend winter 2017 overview webinar
 
Présentation de Talend Winter 2017
Présentation de Talend Winter 2017 Présentation de Talend Winter 2017
Présentation de Talend Winter 2017
 

Similaire à Big Linked Data ETL Benchmark on Cloud Commodity Hardware

1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part i
sqlserver.co.il
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
RTTS
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks
 
1 SDEV 460 – Homework 4 Input Validation and Busine
1  SDEV 460 – Homework 4 Input Validation and Busine1  SDEV 460 – Homework 4 Input Validation and Busine
1 SDEV 460 – Homework 4 Input Validation and Busine
VannaJoy20
 

Similaire à Big Linked Data ETL Benchmark on Cloud Commodity Hardware (20)

Building High Performance MySQL Query Systems and Analytic Applications
Building High Performance MySQL Query Systems and Analytic ApplicationsBuilding High Performance MySQL Query Systems and Analytic Applications
Building High Performance MySQL Query Systems and Analytic Applications
 
Building High Performance MySql Query Systems And Analytic Applications
Building High Performance MySql Query Systems And Analytic ApplicationsBuilding High Performance MySql Query Systems And Analytic Applications
Building High Performance MySql Query Systems And Analytic Applications
 
1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part i
 
Streaming is a Detail
Streaming is a DetailStreaming is a Detail
Streaming is a Detail
 
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query set
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
 
MIGRATION OF AN OLTP SYSTEM FROM ORACLE TO MYSQL AND COMPARATIVE PERFORMANCE ...
MIGRATION OF AN OLTP SYSTEM FROM ORACLE TO MYSQL AND COMPARATIVE PERFORMANCE ...MIGRATION OF AN OLTP SYSTEM FROM ORACLE TO MYSQL AND COMPARATIVE PERFORMANCE ...
MIGRATION OF AN OLTP SYSTEM FROM ORACLE TO MYSQL AND COMPARATIVE PERFORMANCE ...
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
Iod session 3423 analytics patterns of expertise, the fast path to amazing ...
Iod session 3423   analytics patterns of expertise, the fast path to amazing ...Iod session 3423   analytics patterns of expertise, the fast path to amazing ...
Iod session 3423 analytics patterns of expertise, the fast path to amazing ...
 
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMsScaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
 
Global Azure Bootcamp 2017 - Why I love S2D for MSSQL on Azure
Global Azure Bootcamp 2017 - Why I love S2D for MSSQL on AzureGlobal Azure Bootcamp 2017 - Why I love S2D for MSSQL on Azure
Global Azure Bootcamp 2017 - Why I love S2D for MSSQL on Azure
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
optimizing_ceph_flash
optimizing_ceph_flashoptimizing_ceph_flash
optimizing_ceph_flash
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
Get a clearer picture of potential cloud performance by looking beyond SPECra...
Get a clearer picture of potential cloud performance by looking beyond SPECra...Get a clearer picture of potential cloud performance by looking beyond SPECra...
Get a clearer picture of potential cloud performance by looking beyond SPECra...
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
 
1 SDEV 460 – Homework 4 Input Validation and Busine
1  SDEV 460 – Homework 4 Input Validation and Busine1  SDEV 460 – Homework 4 Input Validation and Busine
1 SDEV 460 – Homework 4 Input Validation and Busine
 

Dernier

Dernier (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Big Linked Data ETL Benchmark on Cloud Commodity Hardware

  • 1. 1 Big Linked Data ETL Benchmark on Cloud Commodity Hardware iMinds – Ghent University Dieter De Witte, Laurens De Vocht, Ruben Verborgh, Erik Mannens, Rik Van de Walle Ontoforce Kenny Knecht, Filip Pattyn, Hans Constandt
  • 4. 4 Introduction  Facilitate development of semantic federated query engine close the (semantic) analytics gap in life sciences.  The query engine drives an exploratory search application: DisQover  Approach to federated querying by implementing ETL pipeline indexes the user views in advance.  Combine Linked Open Data with private and licensed (proprietary) data discovery of biomedical data new insights in medicine development.
  • 6. 6  Ensure minimal knowledge about data linking or annotation is required to explore and find results.  Write SPARQL directly detailed knowledge of the predicates is required might require first exploring to determine the URIs.  Scaling out to more data  Search queries are complex because search spans two distinct domains: 1. the ‘space’ of clinical studies; 2. ‘drugs/chemicals’. Challenges
  • 8. 8 Approach How to do federated search with minimal latency for end-user? Which RDF stores support the infrastructure? What aspects should the design of a reusable benchmark take into account?
  • 9. 9 The scaling-out approach relies on low-end commodity hardware but uses many nodes in a distributed system: 1. Specialized scalable RDF stores, the focus of this work; 2. Translating SPARQL and RDF to existing NoSQL stores; 3. Translating SPARQL and RDF to existing Big Data approaches such as MapReduce, Impala, Apache Spark; 4. Distributing the data in physically separated SPARQL endpoints over the Semantic Web, using federated querying techniques to resolve complex questions. Note: Compression (in-memory) is an alternative for distribution. RDF datasets can be compressed (e.g. “Header Dictionary Triples” – HDT). Scaling out: techniques
  • 10. 10 ETL in instead of direct querying Direct ETL
  • 11. 11  Typical DisQover queries introduce much query latency when directly federated.  Facets consist of multiple separate SPARQL queries and serve both as filter and as dashboard.  Data integration in DisQover: Facets filter across all data originating from multiple different sources. Why?
  • 13. 13 ETL Design of benchmark focus:  ETL part needs to be optimally cost efficient.  SPARQL queries for indexes maximally aligned with front-end.  What is are the tradeoffs for each RDF store? Benchmark
  • 14. 14  What is the most cost-effective storage solution to support Linked Data applications that need to be able to deal with heavy ETL query workloads?  Which performance trade-offs do storage solutions offer in terms of scalability?  What is the impact of different query types (templates)?  Is there a difference in performance between the stores based on the structural properties of the queries? Note: not taken into account implicitly derived facts, inference or reasoning. Questions the benchmark answers
  • 15. 15 WatDiv provides stress testing tools for SPARQL existing benchmarks not always suitable for testing systems in diverse queries and varied workloads:  generic benchmark + not application specific;  covers a broad spectrum result cardinality triple-pattern selectivity ensured through the data and query generation method;  Benchmark is repeatable with different dataset sizes or numbers of queries. Data and Query Generation
  • 16. 16 The RDF store should be capable of serving in a production environment with Linked Data in Life Sciences. The initial selection was made by choosing stores with: • a high adoption/popularity as defined by DB-Engines.com ranking for RDF stores; • enterprise support; • support for distributed deployment; • full SPARQL 1.1 compliance. The four stores we selected all comply with these constraints. Note: The names of two stores we tested could not be disclosed. They are being referred to as Enterprise Store I and II (ESI and ESII) RDF Store Selection
  • 17. 17 The benchmark process consists of a data loading phase, followed by running the SPARQL benchmarker: 1. The data is loaded in compressed format (gzip). 2. The benchmarker runs in multi-threaded mode (8 threads), runs a set of 2000 queries multiple times. 3. These runs consists of at least one warm-up run which is not counted. 4. In order to obtain robust results the tail results (most extreme) are discarded before calculating average query runtimes. 5. The benchmarker generates a CSV file containing the run times and response times etc. of all queries which we visualized. Process
  • 18. 18 Query Driver “SPARQL Query Benchmarker” is a general purpose API and CLI that is designed primarily for testing remote SPARQL servers. By default operations are run in a random order to avoid the system under test (SUT) learning the pattern of operations. Hardware Executed all benchmarks on the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and Simple Storage Solutions (S3). Used the default (commercial) deployments of the SUT for the results to be reproducible:  both the hardware and the machine images can be easily acquired.  more generally, cloud deployments offer the advantage of not requiring dedicated on-premises hardware. Infrastructure
  • 20. 20 Cost Scalability Behavior (Different Query Types) Errors and Time-outs Results
  • 22. 22 Scalability: 0.01 B – 0.1 B – 1 B
  • 24. 24 Behavior: different query types S FL Combinations of those C C
  • 26. 26 Errors and time-outs Every runtime > 300s is a time-out. If the run-time reaches a maximum of < 300s we detect an internal set time- out. This was in particular the case voor ESII (3 nodes) 60
  • 27. 27 Scalability: 1B revisited 60 ESII-3 still outperforms ESII-1 when looking at queries that did not time-out
  • 28. 28 Issues in the followed approach  Choose for virtual machine images in the cloud (AWS) for reproducibility; but cloud solutions might not always be best suited for production.  The results of different benchmark studies might depend on many (hidden) configuration factors leading to different or even contradicting results.  The difference in performance between the stores might be attributed to the use of commodity hardware in the cloud.  Differences partially attributed to the quality of the recommended configuration parameters as provided by the virtual machine images.
  • 30. 30 Conclusions & Next steps  Compared enterprise RDF stores default configuration without the intervention of enterprise support.  Run stores in their optimal configuration (reflecting a production setting) with more instances (> 3).  Repeat the benchmark with DisQover data and queries.  Create overview of RDF solutions for different use cases, configurations and real-world (life science) datasets.  Investigate whether the WatDiv results are confirmed when running the benchmark with other queries and data.  Release tools for repeating the benchmark with new storage solutions.

Notes de l'éditeur

  1. 1. Title slide 2. Problem: What is the problem that you are addressing and why the problem is important? Who will benefit if you succeed? Who should care? 3. State of the art: Why is the problem difficult? What have others tried to do? 4. Research questions and hypothesis: What is the object of your study? What is the hypothesis that you will test? 5. Preliminary results: Do you have any preliminary results that demonstrate that your approach is promising 6. Your approach: What is the main idea behind your approach? The key innovation? 7. Evaluation plan: How do you plan to test your hypothesis? What will you measure? What will you compare to? 8. Reflections: Provide an argument, based either on common knowledge or on evidence that you have accumulated, the your approach is likely to succeed. 9. Lessons Learned: Summarize the lessons that you have learned so far. Discuss the positive and negative results that you have observed.
  2. 1. Title slide 2. Problem: What is the problem that you are addressing and why the problem is important? Who will benefit if you succeed? Who should care? 3. State of the art: Why is the problem difficult? What have others tried to do? 4. Research questions and hypothesis: What is the object of your study? What is the hypothesis that you will test? 5. Preliminary results: Do you have any preliminary results that demonstrate that your approach is promising 6. Your approach: What is the main idea behind your approach? The key innovation? 7. Evaluation plan: How do you plan to test your hypothesis? What will you measure? What will you compare to? 8. Reflections: Provide an argument, based either on common knowledge or on evidence that you have accumulated, the your approach is likely to succeed. 9. Lessons Learned: Summarize the lessons that you have learned so far. Discuss the positive and negative results that you have observed.
  3. 1. Title slide 2. Problem: What is the problem that you are addressing and why the problem is important? Who will benefit if you succeed? Who should care? 3. State of the art: Why is the problem difficult? What have others tried to do? 4. Research questions and hypothesis: What is the object of your study? What is the hypothesis that you will test? 5. Preliminary results: Do you have any preliminary results that demonstrate that your approach is promising 6. Your approach: What is the main idea behind your approach? The key innovation? 7. Evaluation plan: How do you plan to test your hypothesis? What will you measure? What will you compare to? 8. Reflections: Provide an argument, based either on common knowledge or on evidence that you have accumulated, the your approach is likely to succeed. 9. Lessons Learned: Summarize the lessons that you have learned so far. Discuss the positive and negative results that you have observed.
  4. 1. Title slide 2. Problem: What is the problem that you are addressing and why the problem is important? Who will benefit if you succeed? Who should care? 3. State of the art: Why is the problem difficult? What have others tried to do? 4. Research questions and hypothesis: What is the object of your study? What is the hypothesis that you will test? 5. Preliminary results: Do you have any preliminary results that demonstrate that your approach is promising 6. Your approach: What is the main idea behind your approach? The key innovation? 7. Evaluation plan: How do you plan to test your hypothesis? What will you measure? What will you compare to? 8. Reflections: Provide an argument, based either on common knowledge or on evidence that you have accumulated, the your approach is likely to succeed. 9. Lessons Learned: Summarize the lessons that you have learned so far. Discuss the positive and negative results that you have observed.
  5. 1. Title slide 2. Problem: What is the problem that you are addressing and why the problem is important? Who will benefit if you succeed? Who should care? 3. State of the art: Why is the problem difficult? What have others tried to do? 4. Research questions and hypothesis: What is the object of your study? What is the hypothesis that you will test? 5. Preliminary results: Do you have any preliminary results that demonstrate that your approach is promising 6. Your approach: What is the main idea behind your approach? The key innovation? 7. Evaluation plan: How do you plan to test your hypothesis? What will you measure? What will you compare to? 8. Reflections: Provide an argument, based either on common knowledge or on evidence that you have accumulated, the your approach is likely to succeed. 9. Lessons Learned: Summarize the lessons that you have learned so far. Discuss the positive and negative results that you have observed.
  6. There is no clear second place. Whereas ESI performs well on small datasets, Blazegraph shows better results for larger datasets. ESII’s performance on a single instance benchmark is worse than the others, but its claim of being highly scalable is confirmed in a configuration with three instances where it performs significantly better. All data stores have acceptible results for 10 and 100 million triples and the choice to go by one or the other could depend on additional features each of the stores has to offer such as support for full-text indexing, support for linked data fragment interfaces or superior automatic inferencing. For the larger datasets Virtuoso should be the first choice as a single instance solution. The initial results in a distributed setup with ESII shows promising results in terms of scaling out, proving that this store’s power might only be revealed in large multi-instance benchmarks.
  7. 1. Title slide 2. Problem: What is the problem that you are addressing and why the problem is important? Who will benefit if you succeed? Who should care? 3. State of the art: Why is the problem difficult? What have others tried to do? 4. Research questions and hypothesis: What is the object of your study? What is the hypothesis that you will test? 5. Preliminary results: Do you have any preliminary results that demonstrate that your approach is promising 6. Your approach: What is the main idea behind your approach? The key innovation? 7. Evaluation plan: How do you plan to test your hypothesis? What will you measure? What will you compare to? 8. Reflections: Provide an argument, based either on common knowledge or on evidence that you have accumulated, the your approach is likely to succeed. 9. Lessons Learned: Summarize the lessons that you have learned so far. Discuss the positive and negative results that you have observed.