Efficient Distributed
In-Memory Processing of
RDF Datasets
Gezim Sejdiu
PhD Colloquium, Bonn 29.09.2020
Supervisor: Prof. Dr. Jens Lehmann
Introduction
Large-Scale RDF Dataset Statistics
Quality Assessment of RDF Datasets at Scale
Scalable RDF Querying
Use Cases and Applications
Conclusion & Future Directions
Outline
2
Introduction
Get me there!
3
No single definition
Extremely large data sets that may be analysed computationally to
reveal patterns, trends, and associations, especially relating to human
behaviour and interactions
Big data is a term for data sets that are so large or complex that
traditional data processing application software is inadequate to deal
with them
What is Big Data?
4
Its relevance is increasing drastically and Big Data Analytics is an
emerging field to explore
Why is ‘Big Data’ so important?
5
https://trends.google.com/trends/explore?date=all&q=%22big%20data%22
6
7
© Sorpresa meme on Memegen
Big Data Europe (BDE) Platform
8
https://github.com/big-data-europe
Architecture (figure), layers from top to bottom:
- App Layer: Traffic Forecast, Satellite Image Analysis, Real-time Stream Monitoring, ...
- Platform Layer: Spark, Flink, Semantic Layer (Ontario, SANSA, Semagrow), Kafka, ...
- Data Layer: Hadoop, NOSQL Store, Cassandra, Elasticsearch, RDF Store, ...
- Resource Management Layer (Swarm)
- Hardware Layer: Premises, Cloud (AWS, GCP, MS Azure, …)
- Support Layer (spanning all layers): Init Daemon, GUIs, Monitor
Fast and general-purpose cluster computing engine
Apache Spark
9
Spark stack (figure):
- Core APIs & Libraries: Spark SQL & DataFrames, Spark Streaming, MLlib (machine learning), GraphX (graph processing)
- Spark Core Engine (RDD)
- Deploy: Local (single JVM), Cluster (Standalone, Mesos, YARN), Containers (docker-compose)
Allows for massively parallel processing of
collections of records
- RDD - Resilient Distributed Dataset
- DataFrame - Conceptually a table
- Dataset - Unified access to data as objects
and/or tables
Heterogeneity aka Variety
Key Observation From BDE
10
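Figure: a Customer at the centre of heterogeneous data sources
- Categories: Banking, Finance, Our Known History, Purchase, Entertain, Gaming, Social Media
- Example sources: VISA, CHASE, SAP, IBM, NORDSTROM, Amazon, LOWES, NETFLIX, HULU, NFb NETWORK, Zynga, XBOX 360, Facebook, Pinterest, Twitter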
Modelling entities and their relationships
The RDF (Resource Description Framework) model
Knowledge Graphs
11
Example graph (figure):
- DPDHL --full name--> Deutsche Post DHL Group
- DPDHL --industry--> Logistics
- Logistics --label--> Logistik
- DPDHL --headquarters--> PostTower
- PostTower --located in--> Bonn
Modelling entities and their relationships
Analysis: finding underlying structure of the graph e.g. to predict
unknown relationships
Examples: Google Knowledge Graph, DBpedia, Facebook, YAGO,
Twitter, LinkedIn, MS Academic Graph, IBM Graph, WikiData
Knowledge Graphs
12
Knowledge Graphs are everywhere
13
Entity Search and Summarization
Discovering Related Entities
Tasks that are hard to solve on single machines (>1 TB memory
consumption):
- Querying and processing LinkedGeoData
- Dataset statistics and quality assessment of the LOD Cloud
- Vandalism and outlier detection in DBpedia
- Inference on life science data (e.g. UniProt, EggNOG, StringDB)
- Clustering of DBpedia data
- Large-scale enrichment and link prediction for e.g. DBpedia →
LinkedGeoData
Why Distributed RDF Data Processing?
14
Main Research Question
Is it possible to process large-scale RDF
datasets efficiently and effectively?
15
RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
RQ2: Can we scale RDF dataset quality assessment horizontally?
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
Research Questions
16
RC1: A Scalable Distributed Approach for Computation of RDF Dataset
Statistics
RC2: A Scalable Framework for Quality Assessment of RDF Datasets
RC3: A Scalable Framework for SPARQL Evaluation of Large RDF Data
Contributions
17
SANSA
Scalable Semantic Analytics Stack
18
SANSA [1] is a data flow processing engine that provides data
distribution and fault tolerance for distributed computation over
large-scale RDF datasets
SANSA includes several libraries:
- Read / Write RDF / OWL library
- Querying library
- Inference library
- ML library
SANSA
19
SANSA stack (figure):
- Core APIs & Libraries: Knowledge Distribution & Representation, Querying, Inference, Machine Learning
- Deploy: Local, Cluster (Standalone, Resource manager), BigDataEurope
Large-Scale RDF Dataset
Statistics
A Scalable Distributed Approach for
Computation of RDF Dataset Statistics [2]
21
To obtain an overview of the Web of Data, it is important to gather
statistical information describing the internal structure of datasets
This process is both data-intensive and compute-intensive, and it is a
challenge to develop fast and efficient algorithms that can handle
large-scale RDF datasets
No existing approach for RDF computes these statistical criteria
and scales to large datasets
Motivation
22
A statistical criterion C is a triple C = (F, D, P), where:
- F is a SPARQL filter condition
- D is a derived dataset from the main dataset (RDD of triples) after
applying F
- P is a post-processing operation on the data structure D
RDDs are in-memory collections of records that can be operated on in
parallel on large clusters
- We use RDDs to represent RDF triples
Approach
23
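The triple C = (F, D, P) can be sketched in plain Python, with a list standing in for the RDD of triples. The `Criterion` class, the sample triples and the "used classes" criterion are illustrative assumptions for this sketch, not the DistLODStats implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
RDF_TYPE = "rdf:type"


@dataclass
class Criterion:
    """Hypothetical container for a statistical criterion C = (F, D, P)."""
    filter_fn: Callable[[Triple], bool]            # F: filter condition
    derive_fn: Callable[[Iterable[Triple]], list]  # builds D from the filtered triples
    post_fn: Callable[[list], object]              # P: post-processing on D

    def evaluate(self, triples: Iterable[Triple]):
        filtered = (t for t in triples if self.filter_fn(t))
        derived = self.derive_fn(filtered)
        return self.post_fn(derived)


# Illustrative criterion: usage count of each class (objects of rdf:type).
used_classes = Criterion(
    filter_fn=lambda t: t[1] == RDF_TYPE,
    derive_fn=lambda ts: [t[2] for t in ts],
    post_fn=lambda classes: Counter(classes).most_common(),
)

# Made-up sample data, standing in for an RDD of triples.
triples: List[Triple] = [
    ("ex:Joy", RDF_TYPE, "ex:Person"),
    ("ex:Car1", RDF_TYPE, "ex:Car"),
    ("ex:Ann", RDF_TYPE, "ex:Person"),
    ("ex:Joy", "ex:owns", "ex:Car1"),
]

result = used_classes.evaluate(triples)
print(result)
```

On a cluster, the same three steps would plausibly map to a filter over the RDD, a map to the derived values, and an aggregation, so each criterion stays a short composition of parallel operations.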
Architecture Overview
24
Experimental Setup
- Cluster configuration
- 6 machines (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
(32 Cores), 128 GB RAM, 12 TB SATA RAID-5
- Spark-2.2.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8
- Datasets (all in nt format)
Evaluation
25
Dataset          #nr. of triples   size (GB)
LinkedGeoData    1,292,933,812     191.17
DBpedia en       812,545,486       114.4
DBpedia de       336,714,883       48.6
DBpedia fr       340,849,556       49.77
BSBM 2GB         8,289,484         2
BSBM 20GB        81,980,472        20
BSBM 200GB       817,774,057       200
Distributed Processing on Large-Scale Datasets
* e) = c) / d) - 1
Evaluation
26
Runtime (in hours)
                 LODStats                DistLODStats
Dataset          a) files   b) bigfile   c) local   d) cluster   e) speedup ratio
LinkedGeoData    n/a        n/a          36.65      4.37         7.4x
DBpedia_en       24.63      fail         25.34      2.97         7.6x
DBpedia_de       n/a        n/a          10.34      1.2          7.3x
DBpedia_fr       n/a        n/a          10.49      1.27         7.3x
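As a worked check of column e), the speedup ratio can be recomputed from the local and cluster runtimes. The sketch assumes the ratio is the local runtime divided by the cluster runtime, minus one, which is the reading consistent with the reported values; because the tabulated runtimes are rounded, the recomputed figures may differ from the table in the last digit.

```python
# Runtimes in hours, copied from the table above: (c) local, d) cluster).
runtimes = {
    "LinkedGeoData": (36.65, 4.37),
    "DBpedia_en": (25.34, 2.97),
    "DBpedia_de": (10.34, 1.2),
    "DBpedia_fr": (10.49, 1.27),
}


def speedup_ratio(local_hours: float, cluster_hours: float) -> float:
    """Assumed definition: how many times faster the cluster run is, minus one."""
    return local_hours / cluster_hours - 1


ratios = {name: speedup_ratio(c, d) for name, (c, d) in runtimes.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```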
Performance evaluation of DistLODStats
Evaluation
27
Figures: node scalability (BSBM-50GB); sizeup scalability
Quality Assessment of RDF
Datasets at Scale
A Scalable Framework for Quality Assessment of
RDF Datasets [3]
29
Assessing data quality is of paramount importance to judge its fitness
for a particular use case
Existing solutions cannot evaluate data quality metrics on medium- to
large-scale datasets
→ This is precisely where such metrics matter most
Motivation
30
Quality Assessment Pattern (QAP)
- A reusable template to implement and design scalable quality
metrics
Approach
31
Quality Metric (QM) := Action | (QM OP Action)
OP := ∗ | − | / | +
Action := Count(Transformation)
Transformation := Rule(Filter) | (Transformation BOP Transformation)
Filter := getPredicates∼?p | getSubjects∼?s | getObjects∼?o | getDistinct(Filter)
| Filter or Filter | Filter && Filter
Rule := isURI(Filter) | isIRI(Filter) | isInternal(Filter) | isLiteral(Filter)
| !isBroken(Filter) | hasPredicateP | hasLicenceAssociated(Filter)
| hasLicenceIndications(Filter) | isExternal(Filter) | hasType(Filter)
| isLabeled(Filter)
BOP := ∩ | ∪
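To make the grammar concrete, the following minimal sketch composes a filter, a rule and a count action into a metric of the form Action OP Action. All function names, the literal/IRI heuristics and the sample triples are assumptions made for illustration; the metric itself is arbitrary and only shows how the pieces combine.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


# Filters: select one position of a triple (?s and ?o in the grammar).
def get_subject(t: Triple) -> str:
    return t[0]


def get_object(t: Triple) -> str:
    return t[2]


# Rules: conditional criteria applied to what a filter returns.
# Assumption for this sketch: literals are written in double quotes
# and blank nodes start with "_:".
def is_literal(term: str) -> bool:
    return term.startswith('"')


def is_iri(term: str) -> bool:
    return not is_literal(term) and not term.startswith("_:")


def rule(check: Callable[[str], bool], flt: Callable[[Triple], str]):
    """Transformation := Rule(Filter), returned as a predicate over triples."""
    return lambda t: check(flt(t))


def count(transformation: Callable[[Triple], bool], triples: Iterable[Triple]) -> int:
    """Action := Count(Transformation)."""
    return sum(1 for t in triples if transformation(t))


# Made-up sample data.
triples: List[Triple] = [
    ("ex:Joy", "ex:owns", "ex:Car1"),
    ("ex:Joy", "rdfs:label", '"Joy"'),
    ("ex:Car1", "ex:madeIn", "ex:Ingolstadt"),
    ("_:b0", "rdfs:label", '"anonymous"'),
]

# Quality Metric := Action OP Action, here with OP = "/".
literal_objects = count(rule(is_literal, get_object), triples)
iri_subjects = count(rule(is_iri, get_subject), triples)
metric = literal_objects / iri_subjects
print(literal_objects, iri_subjects, metric)
```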
Architecture Overview
32
Architecture (figure):
- Definition: define quality dimensions; define quality metrics, threshold and other configurations
- Quality assessment (SANSA Engine): RDF Data → Data Ingestion → Distributed Data Structures → QAP → Results
- Analyse: SANSA-Notebooks, Data Quality Vocabulary (DQV)
Experimental Setup
- Cluster configuration
- 7 machines (1 master, 6 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
(32 Cores), 128 GB RAM, 12 TB SATA RAID-5, Spark-2.4.0, Hadoop 2.8.0, Scala
2.11.11 and Java 8
Local mode: single instance of the cluster
- Datasets (all in .nt format)
Evaluation
33
Dataset          #nr. of triples   size (GB)
LinkedGeoData    1,292,933,812     191.17
DBpedia en       812,545,486       114.4
DBpedia de       336,714,883       48.6
DBpedia fr       340,849,556       49.77
BSBM 2GB         8,289,484         2
BSBM 20GB        81,980,472        20
BSBM 200GB       817,774,057       200
Evaluation
34
Runtime (in minutes)
                 Luzzu                   DistQualityAssessment
Dataset          a) single   b) joint    c) local   d) cluster
Large-scale
LinkedGeoData    Fail        Fail        446.9      7.79
DBpedia_en       Fail        Fail        274.31     1.99
DBpedia_de       Fail        Fail        61.4       0.46
DBpedia_fr       Fail        Fail        195.3      0.38
BSBM_200GB       Fail        Fail        454.46     7.27
Small to medium
BSBM_0.01GB      2.64        2.65        0.04       0.42
BSBM_0.05GB      16.38       15.39       0.05       0.46
BSBM_0.1GB       40.59       37.94       0.06       0.44
BSBM_0.5GB       459.19      468.64      0.15       0.48
BSBM_1GB         1454.16     1532.95     0.4        0.56
BSBM_2GB         Timeout     Timeout     3.19       0.62
BSBM_10GB        Timeout     Timeout     29.44      0.52
BSBM_20GB        Fail        Fail        34.32      0.75
Performance evaluation of DistQualityAssessment
Evaluation
35
Figures: node scalability (BSBM-200GB); sizeup scalability
Scalable RDF Querying
Sparklify: A Scalable Software Component for
Efficient evaluation of SPARQL queries over
distributed RDF datasets* [4]
37
* Joint work with Claus Stadler, a PhD student at the University of Leipzig.
Existing solutions are limited to simple RDF constructs
Hence they do not exploit the full potential of the knowledge, i.e. RDF
terms
Can we re-use existing Ontology-Based Data Access (OBDA) tooling to
facilitate running SPARQL queries on RDF kept in Apache Spark?
Motivation
38
Sparklify: Architecture Overview
39
Architecture (figure): RDF Data → SANSA Engine: RDF Layer (Data Ingestion, Partitioning) → Query Layer (Sparklifying: Sparqlify, Views) → Distributed Data Structures → Results
Query rewriting (figure): SPARQL query → SPARQL Algebra Expression Tree (AET) → Normalize AET → SQL
SPARQL:
    SELECT ?s ?w WHERE {
      ?s a dbp:Person .
      ?s ex:workPage ?w .
    }
View definition:
    Prefix dbp:<http://dbpedia.org/ontology/>
    Prefix ex:<http://ex.org/>
    Create View view_person As
      Construct {
        ?s a dbp:Person .
        ?s ex:workPage ?w .
      }
      With
        ?s = uri('http://mydomain.org/person', ?id)
        ?w = uri(?work_page)
      Constrain
        ?w prefix "http://my-organization.org/user/"
      From
        person;
SQL:
    SELECT id, work_page
    FROM view_person ;
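The idea behind the view-based rewriting can be illustrated with a small sketch: the SPARQL query is answered by a SQL query over the relational data, and the rows are then turned back into RDF terms with the view's term constructors. The sketch uses Python's built-in sqlite3 rather than Spark SQL, the SQL is written by hand rather than produced by Sparqlify, the table rows are made up, and `uri(...)` is assumed to concatenate its arguments.

```python
import sqlite3

# Hypothetical relational data behind the view (table rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id TEXT, work_page TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [
        ("1", "http://my-organization.org/user/alice"),
        ("2", "http://my-organization.org/user/bob"),
    ],
)


# Term constructors mirroring the view's With clause:
#   ?s = uri('http://mydomain.org/person', ?id)
#   ?w = uri(?work_page)
def make_subject(person_id: str) -> str:
    return "http://mydomain.org/person" + person_id


def make_work_page(work_page: str) -> str:
    return work_page


# Both ?s and ?w come from the same row of the view, so a hand-written
# SQL equivalent of the SPARQL query needs no join.
sql = "SELECT id, work_page FROM person ORDER BY id"

bindings = [
    {"s": make_subject(pid), "w": make_work_page(page)}
    for pid, page in conn.execute(sql)
]
for b in bindings:
    print(b["s"], b["w"])
```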
Experimental Setup
- Cluster configuration
- 7 nodes (1 master, 6 workers), each with Intel(R) Xeon(R) CPU E5-2620 v4 @
2.10GHz (32 cores), 128 GB RAM, 12 TB SATA RAID-5, connected via a Gigabit network
- Each experiment was executed 3 times and the results averaged
- Datasets (all in .nt format)
Evaluation
40
Dataset        #nr. of triples   size (GB)
LUBM 1K        138,280,374       24
LUBM 5K        690,895,862       116
LUBM 10K       1,381,692,508     232
WatDiv 10M     10,916,457        1.5
WatDiv 100M    108,997,714       15
WatDiv 1B      1,099,208,068     150
Evaluation
41
Runtime (s) (mean)
                     SPARQLGX-SDE   Sparklify
Dataset      Query   a) total       b) partitioning   c) querying   d) total
Watdiv-10M   QC      103.24         134.81            61            195.84
Watdiv-10M   QF      157.8          236.06            107.33        349.51
Watdiv-10M   QL      102.51         241.24            134           370.3
Watdiv-10M   QS      131.16         237.12            108.56        346
Watdiv-1B    QC      partial fail   778.62            2043.66       2829.56
Watdiv-1B    QF      6734.68        1295.3            2576.52       3871.97
Watdiv-1B    QL      2575.72        1275.22           610.66        1886.73
Watdiv-1B    QS      4841.85        1290.72           1552.05       2845.3
Evaluation
42
Runtime (s) (mean), LUBM-10K
         SPARQLGX-SDE   Sparklify
Query    a) total       b) partitioning   c) querying   d) total
Q1       1056.83        627.72            718.11        1346.8
Q2       fail           595.76            fail          n/a
Q3       1038.62        615.95            648.63        1267.37
Q4       2761.11        632.93            1670.18       2303.18
Q5       1026.94        641.53            564.13        1206.67
Q6       537.65         695.74            267.48        963.62
Q7       2080.67        630.44            1331.13       1967.25
Q8       2636.12        639.93            1647.57       2288.48
Q9       3124.52        583.86            2126.03       2711.24
Q10      1002.56        593.68            693.73        1287.71
Q11      1023.32        594.41            522.24        1118.58
Q12      2027.59        576.31            1088.25       1665.87
Q13      1007.39        626.57            6.66          633.26
Q14      526.15         633.39            258.32        891.89
Performance evaluation of Sparklify
Evaluation
43
Figures: node scalability (WatDiv 100M); sizeup scalability
Sparklify vs SPARQLGX-SDE per query type performance on WatDiv
100M
Evaluation
44
Query types: QS (star pattern), QL (linear pattern), QF (snowflake pattern), QC (complex pattern)
Scalable RDF Querying
Towards A Scalable Semantic-based Distributed
Approach for SPARQL query evaluation [5]
45
Are existing solutions, e.g. those using property tables, more effective
because they reduce the number of necessary joins and unions?
What happens when not all subjects in a cluster use all properties?
- Wide property tables may be very sparse, containing many NULL
values, and thus impose a large storage overhead
How about a flatter approach, i.e. partitioning by subject-based
grouping (e.g. all entities which are associated with a unique
subject)?
Motivation
46
Semantic-Based: Architecture Overview
47
Architecture (figure): RDF Data → SANSA Engine: RDF Layer (Data Ingestion, Partitioning) → Query Layer (Semantic: map, map) → Distributed Data Structures → Results
SPARQL:
    SELECT ?p WHERE {
      ?p :owns ?c .
      ?c :madeIn ?Ingolstadt .
    }
Input triples:
    Joy :owns Car1
    Joy :livesIn Bonn
    Car1 :typeOf Car
    Car1 :madeBy Audi
    Car1 :madeIn Ingolstadt
    Bonn :cityOf Germany
    Audi :memberOf Volkswagen
    Ingolstadt :cityOf Germany
Semantic-based partitions (one line per subject):
    Joy :owns Car1 :livesIn Bonn
    Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
    Bonn :cityOf Germany
    Audi :memberOf Volkswagen
    Ingolstadt :cityOf Germany
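The subject-based grouping shown in the example can be sketched in a few lines of plain Python, with a dictionary standing in for the distributed partitions. The function names are hypothetical, and the query helper treats Ingolstadt as a constant, which is an assumption about the intent of the slide's query rather than something the slide states.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Example triples from the slide.
triples: List[Triple] = [
    ("Joy", ":owns", "Car1"),
    ("Joy", ":livesIn", "Bonn"),
    ("Car1", ":typeOf", "Car"),
    ("Car1", ":madeBy", "Audi"),
    ("Car1", ":madeIn", "Ingolstadt"),
    ("Bonn", ":cityOf", "Germany"),
    ("Audi", ":memberOf", "Volkswagen"),
    ("Ingolstadt", ":cityOf", "Germany"),
]


def partition_by_subject(data: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group all (predicate, object) pairs under their subject."""
    partitions: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for s, p, o in data:
        partitions[s].append((p, o))
    return dict(partitions)


def owners_of_things_made_in(parts, place: str) -> List[str]:
    """Evaluate: ?p :owns ?c . ?c :madeIn <place> (place treated as a constant)."""
    made_in_place = {s for s, pos in parts.items() if (":madeIn", place) in pos}
    return [
        s
        for s, pos in parts.items()
        for p, o in pos
        if p == ":owns" and o in made_in_place
    ]


partitions = partition_by_subject(triples)
owners = owners_of_things_made_in(partitions, "Ingolstadt")
print(partitions["Car1"])
print(owners)
```

Because every partition entry carries all properties of one subject, each triple pattern with the same subject is checked within a single entry and no NULL-padded wide table is needed.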
Experimental Setup
- Cluster configuration
- 6 machines (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
(32 Cores), 128 GB RAM, 12 TB SATA RAID-5, Spark-2.4.0, Hadoop 2.8.0, Scala
2.11.11 and Java 8
- Datasets (all in nt format)
- Distributed SPARQL query evaluators we compare with:
- SHARD, SPARQLGX-SDE, and Sparklify
Evaluation
48
Dataset        #nr. of triples   size (GB)
LUBM 1K        138,280,374       24
LUBM 2K        276,349,040       49
LUBM 3K        414,493,296       70
WatDiv 10M     10,916,457        1.5
WatDiv 100M    108,997,714       15
Evaluation
49
Runtime (s) (mean)
Dataset       Queries   SHARD   SPARQLGX-SDE   SANSA.Sparklify   SANSA.Semantic
Watdiv-10M    C3        n/a     38.79          72.94             90.48
Watdiv-10M    F3        n/a     38.41          74.69             n/a
Watdiv-10M    L3        n/a     21.05          73.16             72.84
Watdiv-10M    S3        n/a     26.27          70.1              79.7
Watdiv-100M   C3        n/a     181.51         96.59             300.82
Watdiv-100M   F3        n/a     162.86         91.2              n/a
Watdiv-100M   L3        n/a     84.09          82.17             189.89
Watdiv-100M   S3        n/a     123.6          93.02             176.2
Evaluation
50
Runtime (s) (mean), LUBM-1K
Queries   SHARD    SPARQLGX-SDE   SANSA.Sparklify   SANSA.Semantic
Q1        774.93   103.74         103.57            226.21
Q2        fail     fail           3348.51           329.69
Q3        772.55   126.31         107.25            235.31
Q4        988.28   182.52         111.89            294.8
Q5        771.69   101.05         100.37            226.21
Q6        fail     73.05          100.72            207.06
Q7        fail     160.94         113.03            277.08
Q8        fail     179.56         114.83            309.39
Q9        fail     204.62         114.25            326.29
Q10       780.05   106.26         110.18            232.72
Q11       783.2    112.23         105.13            231.36
Q12       fail     159.65         105.86            283.53
Q13       778.16   100.06         90.87             220.28
Q14       688.44   74.64          100.58            204.43
Performance evaluation of Semantic-based approach
Evaluation
51
Figures: node scalability (LUBM-1K); sizeup scalability
Powered By
Projects and organizations using our proposed
approaches
52
53
<https://aleth.io/>
Blockchain – Alethio Use Case
Alethio is using SANSA to perform large-scale batch analytics, e.g. computing the asset turnover for sets of accounts, attack pattern frequencies and opcode usage statistics. SANSA was run on a 100-node cluster with 400 cores
<https://www.big-data-europe.eu/>
Big Data Platform – BDE
SANSA is used for computing statistics over logs within the BDE platform. BDE uses the Mu Swarm Logger service to detect Docker events and convert their representation to RDF. To generate visualisations of log statistics, BDE then calls DistLODStats from SANSA-Notebooks
<http://slipo.eu/>
Categorizing Areas of Interests (AOI)
SLIPO focuses on designing efficient pipelines dealing with large semantic datasets of POIs. In this project, Sparklify is used through the SANSA query layer to refine, filter and select the relevant POIs which are needed by the pipelines
10+ more use cases: http://sansa-stack.net/powered-by/
Powered By
The Hubs and Authorities Transaction
Network Analysis
54
Pipeline (figure): Amazon S3 buckets → EthOn RDF triples → SANSA Engine (data ingestion, data partition, querying (SPARQL)) → Hubs & Authorities entities → PageRank, Connected Components → Top Accounts, Hubs & Authorities, Wallet Exchange behavior
Data visualization using the Databricks notebooks or SANSA notebooks
More than 18,000,000,000 facts*
*https://medium.com/alethio/ethereum-linked-data-b72e6283812f
Analyze game performance and customer behaviors at scale
Profiting from Kitties on Ethereum
55
Pipe different clustering algorithms at once
Scalable Integration of Big POI Data
56
Pipeline components (figure): RDF POI Data, Preprocessing, SPARQL Filtering, Word Embedding, Semantic Clustering, Geo Clustering
POI_ID   Cat1   Cat2
1        0      1
2        1      0
3        0      1
4        1      1
Conclusion and Future
Directions
57
RQ1: How can we efficiently explore the structure of large-scale RDF
datasets?
- First algorithm for computing RDF dataset statistics at scale using
Apache Spark
- An analysis of the complexity of the computational steps and the
data exchange between nodes in the cluster
- Integrated the approach into the SANSA framework
- A REST Interface for triggering RDF statistics calculation
Review of the Contributions
58
RQ2: Can we scale RDF dataset quality assessment horizontally?
- A Quality Assessment Pattern QAP to characterize scalable quality
metrics
- A distributed (open source) implementation of quality metrics using
Apache Spark
- Analysis of the complexity of the metric evaluation
- Evaluation of our approach, empirically demonstrating its superiority
over a previous centralized approach
- Integrated the approach into the SANSA framework
Review of the Contributions
59
RQ3: Can distributed RDF datasets be queried efficiently and
effectively?
- A novel approach for vertical partitioning including RDF terms and a
scalable query system (Sparklify) using a SPARQL-to-SQL rewriter on
top of Apache Spark
- A scalable semantic-based partitioning and semantic-based query
engine (SANSA.Semantic) on top of Apache Spark
- Empirical evaluation of the proposed approaches against
state-of-the-art engines
- Integrated the approaches into the SANSA framework
Review of the Contributions
60
Large-scale RDF Dataset Statistics
- Our approach is purely batch processing, in which the data chunks
are normally very large; we therefore plan to investigate additional
techniques for lowering the network overhead and I/O footprint, e.g.
HDT compression
- Near real-time computation of RDF dataset statistics using Spark
Streaming
Limitations and Future Directions
61
Assessment of RDF Datasets at Scale
- Intelligent partitioning strategies and dependency analysis
in order to evaluate multiple metrics simultaneously
- Real-time interactive quality assessment of large-scale RDF data
using Spark Streaming
- A declarative plugin using Quality Metric Language (QML), with the
ability to express, customize and enhance quality metrics
- Quality Assessment As a Service
- Quality check over LODStats
Limitations and Future Directions
62
Scalable RDF Querying
- Combine OBDA tools with dictionary encoding of RDF terms as
integers and evaluate the effects
- Extend our parser to support more SPARQL fragments and add
statistics to the query engine while evaluating queries
- Investigate the re-ordering of the BGPs and evaluate the effects on
query execution time
- Consider other management operations, i.e. additions, updates and
deletions, e.g. Delta Lake as an alternative storage layer
that brings ACID transactions to RDF data management solutions
Limitations and Future Directions
63
Adaptive Distributed RDF Querying
- Optimize index structures and distribute data based on anticipated
query workloads of particular inference or ML algorithms
Efficient Recommendation System for RDF Partitioners
- A recommender to suggest the “best partitioner” for our SPARQL
query evaluators based on the structure of the data (statistics)
A Powerful Benchmarking Suite
Limitations and Future Directions
64
With the increasing amount of RDF data, processing large-scale RDF
datasets constantly faces challenges
We have shown the benefits of using distributed computing frameworks
for scalable and efficient processing of RDF datasets
Future research work can build upon the contributions presented in
this thesis for comprehensive, scalable processing of RDF datasets
The main contributions of this thesis have been integrated within the
SANSA framework, making an impact on the semantic web community
Closing Remarks
65
66
@Gezim_Sejdiu
https://gezimsejdiu.github.io/
That’s all folks
>> SANSA: https://github.com/SANSA-Stack
[1]. Distributed Semantic Analytics using the SANSA Stack. Jens Lehmann; Gezim Sejdiu; Lorenz Bühmann; Patrick Westphal; Claus Stadler; Ivan Ermilov; Simon Bin; Nilesh Chakraborty; Muhammad Saleem; Axel-Cyrille Ngonga Ngomo; and Hajira Jabeen. In Proceedings of 16th International Semantic Web Conference - Resources Track (ISWC'2017), 2017.
[2]. DistLODStats: Distributed Computation of RDF Dataset Statistics. Gezim Sejdiu; Ivan Ermilov; Jens Lehmann; and Mohamed Nadjib Mami. In Proceedings of 17th International Semantic Web Conference, 2018.
[3]. A Scalable Framework for Quality Assessment of RDF Datasets. Gezim Sejdiu; Anisa Rula; Jens Lehmann; and Hajira Jabeen. In Proceedings of 18th International Semantic Web Conference, 2019.
[4]. Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets. Claus Stadler; Gezim Sejdiu; Damien Graux; and Jens Lehmann. In Proceedings of 18th International Semantic Web Conference, 2019.
[5]. Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation. Gezim Sejdiu; Damien Graux; Imran Khan; Ioanna Lytra; Hajira Jabeen; and Jens Lehmann. In 15th International Conference on Semantic Systems (SEMANTiCS), 2019.
References
67
Backup slides
68
SPARQL is a standard query language for retrieving and manipulating
RDF data
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?hq ?location
WHERE {
dbr:Deutsche_Post foaf:name ?name.
dbr:Deutsche_Post dbo:location ?hq.
?hq foaf:name ?location.
}
Querying Knowledge Graphs
69
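How such a query is answered can be sketched with a tiny basic graph pattern matcher: each triple pattern is matched against the data and the variable bindings are joined pattern by pattern. The triples below are made up for illustration and are not taken from DBpedia; the nested-loop evaluation is a deliberately naive stand-in for a real SPARQL engine.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Made-up triples for illustration; they are not taken from DBpedia.
graph: List[Triple] = [
    ("dbr:Deutsche_Post", "foaf:name", "Deutsche Post DHL Group"),
    ("dbr:Deutsche_Post", "dbo:location", "dbr:Post_Tower"),
    ("dbr:Post_Tower", "foaf:name", "Post Tower"),
]

# The three triple patterns of the query; terms starting with "?" are variables.
patterns: List[Triple] = [
    ("dbr:Deutsche_Post", "foaf:name", "?name"),
    ("dbr:Deutsche_Post", "dbo:location", "?hq"),
    ("?hq", "foaf:name", "?location"),
]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its existing binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(bgp: List[Triple], data: List[Triple]) -> List[Binding]:
    """Join the patterns one by one (nested-loop join over all triples)."""
    bindings: List[Binding] = [{}]
    for pattern in bgp:
        next_bindings: List[Binding] = []
        for binding in bindings:
            for triple in data:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


solutions = evaluate(patterns, graph)
print(solutions)
```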
Over the last years, the size of the Semantic Web has increased and
several large-scale datasets were published
As of March 2019:
- ~10,000 datasets openly available online using Semantic Web standards
- plus many datasets RDFized and kept private
Motivation
70
Source: LOD-Cloud (http://lod-cloud.net/)
Speedup Ratio and Efficiency of DistLODStats
Evaluation
71
Overall Breakdown of DistLODStats by Criterion Analysis (log scale)
Evaluation
72
STATisfy: A REST Interface for DistLODStats
73
Architecture (figure): Collaborative Analytics Services, Marketplace → REST Server → SANSA DistLODStats
Deployment: BigDataEurope, Local, Cluster (Standalone, Resource manager) with a Master and Worker 1, Worker 2, ..., Worker n
QAP: consists of transformations and actions
- Transformation: a rule set or a union/intersection of transformations
- Rule: defines conditional criteria for a triple, e.g. isIRI()
- Filter: retrieves a subset of an RDF triple, e.g. getPredicates
- Shortcuts ?s, ?p, ?o are frequently used for filters
- Action: maps a triple set to a numerical value, e.g. count(r)
Quality Assessment Patterns (QAPs)
74
Metric: External Linkage
Transformation τ:
    r_1 = isIRI(?s) ∩ internal(?s) ∩ isIRI(?o) ∩ external(?o)
    r_2 = isIRI(?s) ∩ external(?s) ∩ isIRI(?o) ∩ internal(?o)
    r_3 = r_1 ∪ r_2
Action α:
    α_1 = count(r_3)
    α_2 = count(triples)
    α = α_1 / α_2
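The External Linkage pattern can be traced on a handful of triples. This is a minimal sketch: the dataset namespace, the sample triples and the string-prefix tests for "IRI", "internal" and "external" are assumptions made for illustration, and Python sets stand in for distributed collections.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Assumption for this sketch: a resource is "internal" if its IRI starts with
# the dataset's own namespace; the namespace and triples below are made up.
DATASET_NS = "http://example.org/"


def is_iri(term: str) -> bool:
    return term.startswith("http://") or term.startswith("https://")


def internal(term: str) -> bool:
    return term.startswith(DATASET_NS)


def external(term: str) -> bool:
    return is_iri(term) and not internal(term)


triples: List[Triple] = [
    ("http://example.org/Bonn", "http://www.w3.org/2002/07/owl#sameAs", "http://dbpedia.org/resource/Bonn"),
    ("http://example.org/Bonn", "http://example.org/cityOf", "http://example.org/Germany"),
    ("http://example.org/Bonn", "http://www.w3.org/2000/01/rdf-schema#label", "Bonn"),
    ("http://dbpedia.org/resource/Audi", "http://example.org/owns", "http://example.org/Car1"),
]

# r_1: internal subject linked to an external object.
r1 = {t for t in triples if is_iri(t[0]) and internal(t[0]) and is_iri(t[2]) and external(t[2])}
# r_2: external subject linked to an internal object.
r2 = {t for t in triples if is_iri(t[0]) and external(t[0]) and is_iri(t[2]) and internal(t[2])}
r3 = r1 | r2  # r_3 = r_1 ∪ r_2

alpha_1 = len(r3)       # count(r_3)
alpha_2 = len(triples)  # count(triples)
external_linkage = alpha_1 / alpha_2
print(external_linkage)
```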
Overall analysis of DistQualityAssessment by metric in the cluster mode
(log scale)
Evaluation
75
Overall analysis of queries on LUBM-1K dataset (cluster mode) using
Semantic-based approach
Evaluation
76

 

Similaire à Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva (20)

disertation
disertationdisertation
disertation
 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Sharding
 
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONSDATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
BDAModule-1.pptx
BDAModule-1.pptxBDAModule-1.pptx
BDAModule-1.pptx
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop Technology
 
Information processing architectures
Information processing architecturesInformation processing architectures
Information processing architectures
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
 
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking AlgorithmPerformance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
 
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBench
 

Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva

  • 1. Efficient Distributed In-Memory Processing of RDF Datasets. Gezim Sejdiu, PhD Colloquium, Bonn, 29.09.2020. Supervisor: Prof. Dr. Jens Lehmann
  • 2. Outline: Introduction; Large-Scale RDF Dataset Statistics; Quality Assessment of RDF Datasets at Scale; Scalable RDF Querying; Use Cases and Applications; Conclusion & Future Directions
  • 4. What is Big Data? There is no single definition.
      - Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions
      - Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them
  • 5. Why is 'Big Data' so important? Its relevance is increasing drastically, and Big Data Analytics is an emerging field to explore. Source: https://trends.google.com/trends/explore?date=all&q=%22big%20data%22
  • 8. Big Data Europe (BDE) Platform (https://github.com/big-data-europe)
      - Support Layer: Init Daemon, GUIs, Monitor
      - App Layer: Traffic Forecast, Satellite Image Analysis, Real-time Stream Monitoring, ...
      - Platform Layer: Spark, Flink, Kafka, Semantic Layer (Ontario, SANSA, Semagrow), ...
      - Resource Management Layer (Swarm)
      - Hardware Layer: Premises, Cloud (AWS, GCP, MS Azure, …)
      - Data Layer: Hadoop, NoSQL Store, Cassandra, Elasticsearch, RDF Store, ...
  • 9. Apache Spark: a fast, general-purpose cluster computing engine
      - Core APIs & libraries on top of the Spark Core Engine (RDD): Spark SQL & DataFrames, Spark Streaming, MLlib (machine learning), GraphX (graph processing)
      - Deploy: Local (single JVM), Cluster (Standalone, Mesos, YARN), Containers (docker-compose)
      - Allows for massively parallel processing of collections of records:
        - RDD: Resilient Distributed Dataset
        - DataFrame: conceptually a table
        - Dataset: unified access to data as objects and/or tables
  • 10. Key Observation From BDE: Heterogeneity aka Variety. [Figure: customer data spread over many sources, grouped into Banking/Finance, Purchase, Entertainment, Gaming, Social Media and "Our Known History"; example sources shown include VISA, CHASE, SAP, IBM, NORDSTROM, Amazon, LOWES, NETFLIX, HULU, Zynga, XBOX 360, Facebook, Pinterest and Twitter]
  • 11. Knowledge Graphs: modelling entities and their relationships with the RDF (Resource Description Framework) model. [Figure: example graph in which DPDHL has the full name "Deutsche Post DHL Group", has the industry Logistics (label "Logistik"), and has the headquarters PostTower, which is located in Bonn]
  • 12. Knowledge Graphs: modelling entities and their relationships
      - Analysis: finding the underlying structure of the graph, e.g. to predict unknown relationships
      - Examples: Google Knowledge Graph, DBpedia, Facebook, YAGO, Twitter, LinkedIn, MS Academic Graph, IBM Graph, WikiData
  • 13. Knowledge Graphs are everywhere: Entity Search and Summarization; Discovering Related Entities
  • 14. Why Distributed RDF Data Processing? Tasks that are hard to solve on single machines (>1 TB memory consumption):
      - Querying and processing LinkedGeoData
      - Dataset statistics and quality assessment of the LOD Cloud
      - Vandalism and outlier detection in DBpedia
      - Inference on life science data (e.g. UniProt, EggNOG, StringDB)
      - Clustering of DBpedia data
      - Large-scale enrichment and link prediction, e.g. DBpedia → LinkedGeoData
  • 15. Main Research Question: Is it possible to process large-scale RDF datasets efficiently and effectively?
  • 16. Research Questions
      - RQ1: How can we efficiently explore the structure of large-scale RDF datasets?
      - RQ2: Can we scale RDF dataset quality assessment horizontally?
      - RQ3: Can distributed RDF datasets be queried efficiently and effectively?
  • 17. Contributions
      - RC1: A Scalable Distributed Approach for Computation of RDF Dataset Statistics
      - RC2: A Scalable Framework for Quality Assessment of RDF Datasets
      - RC3: A Scalable Framework for SPARQL Evaluation of Large RDF Data
  • 19. SANSA [1] is a data flow processing engine that provides data distribution and fault tolerance for distributed computation over large-scale RDF datasets. SANSA includes several libraries:
      - Read / Write RDF / OWL library
      - Querying library
      - Inference library
      - ML library
      [Figure: SANSA stack within BigDataEurope, with Knowledge Distribution & Representation, Querying, Inference and Machine Learning layers as core APIs and libraries; deployment either local or on a cluster (standalone or with a resource manager)]
  • 21. Large-Scale RDF Dataset Statistics: A Scalable Distributed Approach for Computation of RDF Dataset Statistics [2]
  • 22. Motivation
      - To obtain an overview of the Web of Data, it is important to gather statistical information describing the internal structure of datasets
      - This process is both data-intensive and computing-intensive, and it is a challenge to develop fast and efficient algorithms that can handle large-scale RDF datasets
      - No existing approach for RDF computes these statistical criteria and scales to large datasets
  • 23. Approach
      - A statistical criterion C is a triple C = (F, D, P), where:
        - F is a SPARQL filter condition
        - D is the dataset derived from the main dataset (an RDD of triples) by applying F
        - P is a post-processing operation on the data structure D
      - RDDs are in-memory collections of records that can be operated on in parallel on large clusters; we use RDDs to represent RDF triples
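The criterion C = (F, D, P) can be illustrated with a minimal sketch, assuming a plain Python list in place of a Spark RDD. The names `Triple`, `evaluate_criterion`, `is_type_triple` and `count_classes` are hypothetical and not part of SANSA or DistLODStats; the example criterion (class usage counts) is chosen only for illustration. On Spark, the filter step would roughly correspond to `filter` on an RDD and the post-processing to a map followed by a per-key reduction.

```python
from collections import Counter
from typing import Callable, List, NamedTuple


class Triple(NamedTuple):
    s: str
    p: str
    o: str


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def evaluate_criterion(
    triples: List[Triple],
    filter_fn: Callable[[Triple], bool],       # F: filter condition
    post_fn: Callable[[List[Triple]], object],  # P: post-processing
):
    """Apply F to obtain the derived dataset D, then apply P to D."""
    derived = [t for t in triples if filter_fn(t)]  # D
    return post_fn(derived)


def is_type_triple(t: Triple) -> bool:
    # F: keep only rdf:type statements
    return t.p == RDF_TYPE


def count_classes(derived: List[Triple]) -> Counter:
    # P: count how often each class is used as the object of rdf:type
    return Counter(t.o for t in derived)


triples = [
    Triple("http://example.org/joy", RDF_TYPE, "http://example.org/Person"),
    Triple("http://example.org/ann", RDF_TYPE, "http://example.org/Person"),
    Triple("http://example.org/car1", RDF_TYPE, "http://example.org/Car"),
    Triple("http://example.org/joy", "http://example.org/owns", "http://example.org/car1"),
]

class_usage = evaluate_criterion(triples, is_type_triple, count_classes)
```

Keeping F and P as separate functions mirrors the definition: a new criterion only needs a new filter and a new post-processing step, while the evaluation loop stays the same.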
  • 25. Evaluation: Experimental Setup
      - Cluster configuration: 6 machines (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 Cores), 128 GB RAM, 12 TB SATA RAID-5
      - Software: Spark 2.2.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8
      - Datasets (all in .nt format):
        Dataset          Number of triples   Size (GB)
        LinkedGeoData    1,292,933,812       191.17
        DBpedia en       812,545,486         114.4
        DBpedia de       336,714,883         48.6
        DBpedia fr       340,849,556         49.77
        BSBM 2GB         8,289,484           2
        BSBM 20GB        81,980,472          20
        BSBM 200GB       817,774,057         200
  • 26. Evaluation: Distributed Processing on Large-Scale Datasets. Runtime (in hours):
                         LODStats                DistLODStats
        Dataset          a) files   b) bigfile   c) local   d) cluster   e) speedup ratio*
        LinkedGeoData    n/a        n/a          36.65      4.37         7.4x
        DBpedia_en       24.63      fail         25.34      2.97         7.6x
        DBpedia_de       n/a        n/a          10.34      1.2          7.3x
        DBpedia_fr       n/a        n/a          10.49      1.27         7.3x
        * e) = c) / d) - 1
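As a worked check of the speedup column, the sketch below computes the ratio as local runtime divided by cluster runtime, minus one, which reproduces the reported values for the two rows used here. The function name `speedup_ratio` is hypothetical; the runtimes are copied from the table.

```python
def speedup_ratio(local_hours: float, cluster_hours: float) -> float:
    """Relative speedup of the cluster run over the local run."""
    return local_hours / cluster_hours - 1


# (local, cluster) runtimes in hours, taken from the DistLODStats columns
runtimes = {
    "LinkedGeoData": (36.65, 4.37),
    "DBpedia_fr": (10.49, 1.27),
}

ratios = {
    name: round(speedup_ratio(local, cluster), 1)
    for name, (local, cluster) in runtimes.items()
}
```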
  • 27. Evaluation: Performance evaluation of DistLODStats. [Figures: node scalability (BSBM-50GB) and sizeup scalability]
  • 29. Quality Assessment of RDF Datasets at Scale: A Scalable Framework for Quality Assessment of RDF Datasets [3]
  • 30. Motivation
      - Assessing data quality is of paramount importance to judge its fitness for a particular use case
      - Existing solutions cannot evaluate data quality metrics on medium to large-scale datasets, which is where they matter most
  • 31. Approach: Quality Assessment Pattern (QAP), a reusable template to implement and design scalable quality metrics
        Quality Metric (QM) := Action | (QM OP Action)
        OP                  := ∗ | − | / | +
        Action              := Count(Transformation)
        Transformation      := Rule(Filter) | (Transformation BOP Transformation)
        Filter              := getPredicates∼?p | getSubjects∼?s | getObjects∼?o | getDistinct(Filter) | Filter or Filter | Filter && Filter
        Rule                := isURI(Filter) | isIRI(Filter) | isInternal(Filter) | isLiteral(Filter) | !isBroken(Filter) | hasPredicateP | hasLicenceAssociated(Filter) | hasLicenceIndications(Filter) | isExternal(Filter) | hasType(Filter) | isLabeled(Filter)
        BOP                 := ∩ | ∪
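The grammar above composes a metric from filters, rules and counts. A minimal sketch of that composition, assuming plain Python lists instead of distributed data structures: the metric shown, `Count(isLiteral(getObjects)) / Count(isURI(getSubjects))`, is an illustrative combination of grammar elements and not one of the framework's actual metrics, and all function names are hypothetical. The literal and URI checks are deliberately naive string tests.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    s: str
    p: str
    o: str


# Filters: project one position out of every triple
def get_subjects(triples: List[Triple]) -> List[str]:
    return [t.s for t in triples]


def get_objects(triples: List[Triple]) -> List[str]:
    return [t.o for t in triples]


# Rules: keep only the terms that satisfy a condition (naive checks)
def is_uri(terms: List[str]) -> List[str]:
    return [x for x in terms if x.startswith(("http://", "https://"))]


def is_literal(terms: List[str]) -> List[str]:
    return [x for x in terms if x.startswith('"')]


# Action: count the result of a transformation
def count(terms: List[str]) -> int:
    return len(terms)


def literals_per_uri_subject(triples: List[Triple]) -> float:
    """QM := Count(isLiteral(getObjects)) / Count(isURI(getSubjects))."""
    numerator = count(is_literal(get_objects(triples)))
    denominator = count(is_uri(get_subjects(triples)))
    return numerator / denominator if denominator else 0.0


triples = [
    Triple("http://example.org/joy", "http://example.org/name", '"Joy"'),
    Triple("http://example.org/joy", "http://example.org/livesIn", "http://example.org/Bonn"),
    Triple("http://example.org/car1", "http://example.org/label", '"Car 1"'),
    Triple("http://example.org/car1", "http://example.org/madeBy", "http://example.org/Audi"),
]

metric_value = literals_per_uri_subject(triples)
```

Because every metric reduces to counts over filtered collections, each step maps onto operations that can run per partition, which is presumably what makes the pattern suitable for horizontal scaling.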
  • 32. Architecture Overview
      - Definition: define quality dimensions; define quality metrics, thresholds and other configurations
      - [Figure: RDF data is ingested into the SANSA Engine as distributed data structures; quality assessment applies the QAPs; the results are represented with the Data Quality Vocabulary (DQV) and analysed in SANSA-Notebooks]
  • 33. Evaluation: Experimental Setup
      - Cluster configuration: 7 machines (1 master, 6 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 Cores), 128 GB RAM, 12 TB SATA RAID-5
      - Software: Spark 2.4.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8
      - Local mode: single instance of the cluster
      - Datasets (all in .nt format):
        Dataset          Number of triples   Size (GB)
        LinkedGeoData    1,292,933,812       191.17
        DBpedia en       812,545,486         114.4
        DBpedia de       336,714,883         48.6
        DBpedia fr       340,849,556         49.77
        BSBM 2GB         8,289,484           2
        BSBM 20GB        81,980,472          20
        BSBM 200GB       817,774,057         200
  • 34. Evaluation: Runtime (in minutes)
                         Luzzu                   DistQualityAssessment
        Dataset          a) single   b) joint    c) local   d) cluster
        Large-scale:
        LinkedGeoData    Fail        Fail        446.9      7.79
        DBpedia_en       Fail        Fail        274.31     1.99
        DBpedia_de       Fail        Fail        61.4       0.46
        DBpedia_fr       Fail        Fail        195.3      0.38
        BSBM_200GB       Fail        Fail        454.46     7.27
        Small to medium:
        BSBM_0.01GB      2.64        2.65        0.04       0.42
        BSBM_0.05GB      16.38       15.39       0.05       0.46
        BSBM_0.1GB       40.59       37.94       0.06       0.44
        BSBM_0.5GB       459.19      468.64      0.15       0.48
        BSBM_1GB         1454.16     1532.95     0.4        0.56
        BSBM_2GB         Timeout     Timeout     3.19       0.62
        BSBM_10GB        Timeout     Timeout     29.44      0.52
        BSBM_20GB        Fail        Fail        34.32      0.75
  • 35. Evaluation: Performance evaluation of DistQualityAssessment. [Figures: node scalability (BSBM-200GB) and sizeup scalability]
  • 37. Scalable RDF Querying. Sparklify: A Scalable Software Component for Efficient Evaluation of SPARQL Queries over Distributed RDF Datasets [4] (joint work with Claus Stadler, a PhD student at the University of Leipzig)
  • 38. Motivation
      - Existing solutions are limited to simple RDF constructs only, so they do not exploit the full potential of the knowledge, i.e. RDF terms
      - Can we reuse existing Ontology-Based Data Access (OBDA) tooling to facilitate running SPARQL queries on RDF kept in Apache Spark?
  • 39. Sparklify: Architecture Overview
      - SANSA Engine: RDF Layer (Data Ingestion, Partitioning) and Query Layer (Sparklifying, Views) over Distributed Data Structures, producing Results from RDF data
      - Query rewriting with Sparqlify: SPARQL query → SPARQL Algebra Expression Tree (AET) → Normalize AET → SQL
      - Example SPARQL query:
        SELECT ?s ?w WHERE { ?s a dbp:Person . ?s ex:workPage ?w . }
      - Example view definition:
        Prefix dbp:<http://dbpedia.org/ontology/>
        Prefix ex:<http://ex.org/>
        Create View view_person As
          Construct { ?s a dbp:Person . ?s ex:workPage ?w . }
          With ?s = uri('http://mydomain.org/person', ?id)
               ?w = uri(?work_page)
          Constrain ?w prefix "http://my-organization.org/user/"
          From person;
      - Resulting SQL:
        SELECT id, work_page FROM view_person ;
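To make the idea of answering a SPARQL query with SQL over partitioned RDF concrete, here is a minimal sketch, assuming one two-column table per predicate (a common vertical partitioning layout) and Python's built-in `sqlite3` as a stand-in for Spark SQL. The SQL is written by hand for the example query above; it is not the output of Sparqlify, and the table names `rdf_type` and `ex_workpage` and the sample data are hypothetical.

```python
import sqlite3

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
WORK_PAGE = "http://ex.org/workPage"
PERSON = "http://dbpedia.org/ontology/Person"

triples = [
    ("http://mydomain.org/person1", RDF_TYPE, PERSON),
    ("http://mydomain.org/person1", WORK_PAGE, "http://my-organization.org/user/1"),
    ("http://mydomain.org/person2", RDF_TYPE, PERSON),
    ("http://mydomain.org/city1", RDF_TYPE, "http://dbpedia.org/ontology/City"),
]

conn = sqlite3.connect(":memory:")

# Vertical partitioning: one (subject, object) table per predicate
tables = {RDF_TYPE: "rdf_type", WORK_PAGE: "ex_workpage"}
for table in tables.values():
    conn.execute(f"CREATE TABLE {table} (s TEXT, o TEXT)")
for s, p, o in triples:
    conn.execute(f"INSERT INTO {tables[p]} VALUES (?, ?)", (s, o))

# Hand-written SQL for:
#   SELECT ?s ?w WHERE { ?s a dbp:Person . ?s ex:workPage ?w . }
# Each triple pattern becomes a table access; the shared variable ?s
# becomes the join condition.
sql = """
SELECT t.s AS s, w.o AS w
FROM rdf_type AS t
JOIN ex_workpage AS w ON t.s = w.s
WHERE t.o = ?
"""
rows = conn.execute(sql, (PERSON,)).fetchall()
```

Only `person1` has both a type and a work page, so it is the only binding returned; `person2` is dropped by the join, as a SPARQL basic graph pattern would require.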
  • 40. Evaluation: Experimental Setup
      - Cluster configuration: 7 nodes (1 master, 6 workers), each with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 Cores), 128 GB RAM, 12 TB SATA RAID-5, connected via a Gigabit network
      - Each experiment was executed 3 times and the results were averaged
      - Datasets (all in .nt format):
        Dataset          Number of triples   Size (GB)
        LUBM 1K          138,280,374         24
        LUBM 5K          690,895,862         116
        LUBM 10K         1,381,692,508       232
        WatDiv 10M       10,916,457          1.5
        WatDiv 100M      108,997,714         15
        WatDiv 1B        1,099,208,068       150
  • 41. Evaluation: Runtime (s) (mean)
                 SPARQLGX-SDE    Sparklify
        Query    a) total        b) partitioning   c) querying   d) total
        Watdiv-10M:
        QC       103.24          134.81            61            195.84
        QF       157.8           236.06            107.33        349.51
        QL       102.51          241.24            134           370.3
        QS       131.16          237.12            108.56        346
        Watdiv-1B:
        QC       partial fail    778.62            2043.66       2829.56
        QF       6734.68         1295.3            2576.52       3871.97
        QL       2575.72         1275.22           610.66        1886.73
        QS       4841.85         1290.72           1552.05       2845.3
  • 42. Evaluation: Runtime (s) (mean), LUBM-10K
                 SPARQLGX-SDE    Sparklify
        Query    a) total        b) partitioning   c) querying   d) total
        Q1       1056.83         627.72            718.11        1346.8
        Q2       fail            595.76            fail          n/a
        Q3       1038.62         615.95            648.63        1267.37
        Q4       2761.11         632.93            1670.18       2303.18
        Q5       1026.94         641.53            564.13        1206.67
        Q6       537.65          695.74            267.48        963.62
        Q7       2080.67         630.44            1331.13       1967.25
        Q8       2636.12         639.93            1647.57       2288.48
        Q9       3124.52         583.86            2126.03       2711.24
        Q10      1002.56         593.68            693.73        1287.71
        Q11      1023.32         594.41            522.24        1118.58
        Q12      2027.59         576.31            1088.25       1665.87
        Q13      1007.39         626.57            6.66          633.26
        Q14      526.15          633.39            258.32        891.89
  • 43. Evaluation: Performance evaluation of Sparklify. [Figures: node scalability (WatDiv 100M) and sizeup scalability]
  • 44. Evaluation: Sparklify vs SPARQLGX-SDE, per query type performance on WatDiv 100M. [Figure] Query types: QS: Star pattern, QL: Linear pattern, QF: Snowflake, QC: Complex pattern
  • 45. Scalable RDF Querying: Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation [5]
  • 46. Motivation
      - Are existing solutions more effective, e.g. those using property tables, which reduce the number of necessary joins and unions?
      - What happens when not all subjects in a cluster use all properties? Wide property tables may be very sparse, containing many NULL values, and thus impose a large storage overhead
      - How about a flatter approach, i.e. partitioning into subject-based groups (e.g. all entities associated with a unique subject)?
  • 47. Semantic-Based: Architecture Overview
      - SANSA Engine: RDF Layer (Data Ingestion, Partitioning) and Query Layer (Semantic, map) over Distributed Data Structures, producing Results from RDF data
      - Example SPARQL query:
        SELECT ?p WHERE { ?p :owns ?c . ?c :madeIn ?Ingolstadt . }
      - Input triples:
        Joy :owns Car1
        Joy :livesIn Bonn
        Car1 :typeOf Car
        Car1 :madeBy Audi
        Car1 :madeIn Ingolstadt
        Bonn :cityOf Germany
        Audi :memeberOf Volkswagen
        Ingolstadt :cityOf Germany
      - After semantic (subject-based) partitioning:
        Joy :owns Car1 :livesIn Bonn
        Car1 :typeOf Car :madeBy Audi :madeIn Ingolstadt
        Bonn :cityOf Germany
        Audi :memeberOf Volkswagen
        Ingolstadt :cityOf Germany
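The subject-based grouping shown on this slide can be sketched as follows, assuming a plain Python dictionary in place of a distributed data structure. The function names are hypothetical and this is not SANSA.Semantic's implementation. The query helper reads the object of `:madeIn` as the constant `Ingolstadt`, which is an assumption: as written on the slide it is a variable (`?Ingolstadt`). Predicate names are copied from the slide as-is.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

triples = [
    ("Joy", ":owns", "Car1"),
    ("Joy", ":livesIn", "Bonn"),
    ("Car1", ":typeOf", "Car"),
    ("Car1", ":madeBy", "Audi"),
    ("Car1", ":madeIn", "Ingolstadt"),
    ("Bonn", ":cityOf", "Germany"),
    ("Audi", ":memeberOf", "Volkswagen"),
    ("Ingolstadt", ":cityOf", "Germany"),
]

Partitions = Dict[str, List[Tuple[str, str]]]


def partition_by_subject(data) -> Partitions:
    """Group all (predicate, object) pairs under their subject."""
    groups = defaultdict(list)
    for s, p, o in data:
        groups[s].append((p, o))
    return dict(groups)


def owners_of_things_made_in(partitions: Partitions, place: str) -> List[str]:
    """Answer: ?p :owns ?c . ?c :madeIn <place> ."""
    owners = []
    for subject, pairs in partitions.items():
        for predicate, obj in pairs:
            # Follow ?c to its own subject group and check :madeIn there
            if predicate == ":owns" and (":madeIn", place) in partitions.get(obj, []):
                owners.append(subject)
    return owners


partitions = partition_by_subject(triples)
owners = owners_of_things_made_in(partitions, "Ingolstadt")
```

All statements about one subject sit together, so star-shaped patterns need no join at all; only the hop from `?p` to `?c` requires a lookup in another group.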
  • 48. Evaluation: Experimental Setup
      - Cluster configuration: 6 machines (1 master, 5 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 Cores), 128 GB RAM, 12 TB SATA RAID-5
      - Software: Spark 2.4.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8
      - Distributed SPARQL query evaluators we compare with: SHARD, SPARQLGX-SDE, and Sparklify
      - Datasets (all in .nt format):
        Dataset          Number of triples   Size (GB)
        LUBM 1K          138,280,374         24
        LUBM 2K          276,349,040         49
        LUBM 3K          414,493,296         70
        WatDiv 10M       10,916,457          1.5
        WatDiv 100M      108,997,714         15
  • 49. Evaluation: Runtime (s) (mean)
        Query    SHARD    SPARQLGX-SDE   SANSA.Sparklify   SANSA.Semantic
        Watdiv-10M:
        C3       n/a      38.79          72.94             90.48
        F3       n/a      38.41          74.69             n/a
        L3       n/a      21.05          73.16             72.84
        S3       n/a      26.27          70.1              79.7
        Watdiv-100M:
        C3       n/a      181.51         96.59             300.82
        F3       n/a      162.86         91.2              n/a
        L3       n/a      84.09          82.17             189.89
        S3       n/a      123.6          93.02             176.2
  • 50. Evaluation: Runtime (s) (mean), LUBM-1K
        Query    SHARD    SPARQLGX-SDE   SANSA.Sparklify   SANSA.Semantic
        Q1       774.93   103.74         103.57            226.21
        Q2       fail     fail           3348.51           329.69
        Q3       772.55   126.31         107.25            235.31
        Q4       988.28   182.52         111.89            294.8
        Q5       771.69   101.05         100.37            226.21
        Q6       fail     73.05          100.72            207.06
        Q7       fail     160.94         113.03            277.08
        Q8       fail     179.56         114.83            309.39
        Q9       fail     204.62         114.25            326.29
        Q10      780.05   106.26         110.18            232.72
        Q11      783.2    112.23         105.13            231.36
        Q12      fail     159.65         105.86            283.53
        Q13      778.16   100.06         90.87             220.28
        Q14      688.44   74.64          100.58            204.43
  • 51. Evaluation: Performance evaluation of the Semantic-based approach. [Figures: node scalability (LUBM-1K) and sizeup scalability]
  • 52. Powered By: projects and organizations using our proposed approaches
  • 53. Powered By
      - Blockchain, Alethio use case (https://aleth.io/): Alethio is using SANSA to perform large-scale batch analytics, e.g. computing the asset turnover for sets of accounts, computing attack pattern frequencies and Opcode usage statistics. SANSA was run on a 100-node cluster with 400 cores
      - Big Data Platform, BDE (https://www.big-data-europe.eu/): SANSA is used for computing statistics over logs within the BDE platform. BDE uses the Mu Swarm Logger service to detect Docker events and convert their representation to RDF. To generate visualisations of log statistics, BDE then calls DistLODStats from SANSA-Notebooks
      - Categorizing Areas of Interest (AOI), SLIPO (http://slipo.eu/): SLIPO focuses on designing efficient pipelines dealing with large semantic datasets of POIs. In this project, Sparklify is used through the SANSA query layer to refine, filter and select the relevant POIs needed by the pipelines
      - 10+ more use cases: http://sansa-stack.net/powered-by/
  • 54. The Hubs and Authorities Transaction Network Analysis
      - More than 18,000,000,000 facts (https://medium.com/alethio/ethereum-linked-data-b72e6283812f)
      - [Figure: EthOn RDF triples from Amazon S3 buckets pass through the SANSA Engine (data ingestion, data partition, SPARQL querying); Hubs & Authorities, PageRank and Connected Components are computed to identify top accounts, hubs and authorities, and wallet exchange behavior; data visualization uses Databricks notebooks or SANSA notebooks]
  • 55. Profiting from Kitties on Ethereum: analyze game performance and customer behaviors at scale
  • 56. Scalable Integration of Big POI Data: pipe different clustering algorithms at once
      - [Figure: pipeline over RDF POI data with pre-processing, SPARQL filtering, word embedding, semantic clustering and geo clustering]
      - Example POI-by-category matrix:
        POI_ID   Cat1   Cat2
        1        0      1
        2        1      0
        3        0      1
        4        1      1
  • 58. RQ1: How can we efficiently explore the structure of large-scale RDF datasets? - First algorithm for computing RDF dataset statistics at scale using Apache Spark - An analysis of the complexity of the computational steps and the data exchange between nodes in the cluster - Integrated the approach into the SANSA framework - A REST Interface for triggering RDF statistics calculation Review of the Contributions 58
  • 59. RQ2: Can we scale RDF dataset quality assessment horizontally? - A Quality Assessment Pattern (QAP) to characterize scalable quality metrics - A distributed (open source) implementation of quality metrics using Apache Spark - Analysis of the complexity of the metric evaluation - An evaluation of our approach that demonstrates empirically its superiority over a previous centralized approach - Integrated the approach into the SANSA framework Review of the Contributions 59
  • 60. RQ3: Can distributed RDF datasets be queried efficiently and effectively? - A novel approach for vertical partitioning including RDF terms and a scalable query system (Sparklify) using a SPARQL-to-SQL rewriter on top of Apache Spark - A scalable semantic-based partitioning and semantic-based query engine (SANSA.Semantic) on top of Apache Spark - An empirical evaluation of the proposed approaches against state-of-the-art engines - Integrated the approaches into the SANSA framework Review of the Contributions 60
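The idea behind vertical partitioning can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Sparklify's implementation: it groups triples by predicate only, whereas the slide mentions a partitioning that also takes RDF terms into account. The function name `vertical_partition` and the sample triples are assumptions made for this example.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one
    two-column (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


# Illustrative sample data (hypothetical, not taken from a real dataset dump).
triples = [
    ("dbr:Deutsche_Post", "foaf:name", "Deutsche Post"),
    ("dbr:Deutsche_Post", "dbo:location", "dbr:Bonn"),
    ("dbr:Bonn", "foaf:name", "Bonn"),
]

tables = vertical_partition(triples)
for predicate, rows in tables.items():
    print(predicate, rows)
```

A query that touches only `foaf:name` then needs to scan one small table instead of the whole triple set, which is what makes the layout attractive for a SPARQL-to-SQL rewriter.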
  • 61. Large-scale RDF Dataset Statistics - Our approach is purely batch processing, in which the data chunks are normally very large; we therefore plan to investigate additional techniques for lowering the network overhead and I/O footprint, e.g. HDT compression - Near real-time computation of RDF dataset statistics using Spark Streaming Limitations and Future Directions 61
  • 62. Assessment of RDF Datasets at Scale - Intelligent partitioning strategies and dependency analysis in order to evaluate multiple metrics simultaneously - Real-time interactive quality assessment of large-scale RDF data using Spark Streaming - A declarative plugin using Quality Metric Language (QML), with the ability to express, customize and enhance quality metrics - Quality Assessment As a Service - Quality check over LODStats Limitations and Future Directions 62
  • 63. Scalable RDF Querying - Combine OBDA tools with dictionary encoding of RDF terms as integers and evaluate the effects - Extend our parser to support more SPARQL fragments and add statistics to the query engine while evaluating queries - Investigate the re-ordering of the BGPs and evaluate the effects on query execution time - Consider other management operations, i.e. additions, updates, and deletions, e.g. DeltaLake as an alternative storage layer that brings ACID transactions to RDF data management solutions Limitations and Future Directions 63
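Dictionary encoding of RDF terms as integers, named above as a future direction, can be illustrated as follows. This is a minimal in-memory sketch under the assumption that IDs are assigned in order of first appearance; the function name `encode_triples` and the sample terms are hypothetical and do not come from SANSA.

```python
def encode_triples(triples):
    """Replace every RDF term by a compact integer ID.

    Returns the encoded triples and the reverse dictionary
    needed to decode them again."""
    term_to_id = {}
    encoded = []
    for triple in triples:
        # setdefault assigns the next free ID only to unseen terms.
        ids = tuple(term_to_id.setdefault(term, len(term_to_id)) for term in triple)
        encoded.append(ids)
    id_to_term = {i: term for term, i in term_to_id.items()}
    return encoded, id_to_term


def decode_triples(encoded, id_to_term):
    """Map integer triples back to their original terms."""
    return [tuple(id_to_term[i] for i in ids) for ids in encoded]


# Illustrative sample data (hypothetical).
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:alice"),
]
encoded, id_to_term = encode_triples(triples)
print(encoded)
print(decode_triples(encoded, id_to_term) == triples)
```

Joins and comparisons then operate on fixed-size integers rather than long IRI strings; the cost is one extra lookup when results are turned back into terms.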
  • 64. Adaptive Distributed RDF Querying - Optimize index structures and distribute data based on anticipated query workloads of particular inference or ML algorithms Efficient Recommendation System for RDF Partitioners - A recommender to suggest the “best partitioner” for our SPARQL query evaluators based on the structure of the data (statistics) A Powerful Benchmarking Suite Limitations and Future Directions 64
  • 65. With the increasing amount of RDF data, processing large-scale RDF datasets constantly faces challenges. We have shown the benefits of using distributed computing frameworks for scalable and efficient processing of RDF datasets. Future research can build upon the contributions presented in this thesis for comprehensive, scalable processing of RDF datasets. The main contributions of this thesis have been integrated into the SANSA framework, making an impact on the semantic web community. Closing Remarks 65
  • 67. [1]. Distributed Semantic Analytics using the SANSA Stack. Jens Lehmann; Gezim Sejdiu; Lorenz Bühmann; Patrick Westphal; Claus Stadler; Ivan Ermilov; Simon Bin; Nilesh Chakraborty; Muhammad Saleem; Axel-Cyrille Ngonga Ngomo; and Hajira Jabeen. In Proceedings of 16th International Semantic Web Conference - Resources Track (ISWC'2017), 2017. [2]. DistLODStats: Distributed Computation of RDF Dataset Statistics. Gezim Sejdiu; Ivan Ermilov; Jens Lehmann; and Mohamed Nadjib Mami. In Proceedings of 17th International Semantic Web Conference, 2018. [3]. A Scalable Framework for Quality Assessment of RDF Datasets. Gezim Sejdiu; Anisa Rula; Jens Lehmann; and Hajira Jabeen. In Proceedings of 18th International Semantic Web Conference, 2019. [4]. Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets. Claus Stadler; Gezim Sejdiu; Damien Graux; and Jens Lehmann. In Proceedings of 18th International Semantic Web Conference, 2019. [5]. Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation. Gezim Sejdiu; Damien Graux; Imran Khan; Ioanna Lytra; Hajira Jabeen; and Jens Lehmann. In 15th International Conference on Semantic Systems (SEMANTiCS), 2019. References 67
  • 69. SPARQL is a standard query language for retrieving and manipulating RDF data: PREFIX dbr: <http://dbpedia.org/resource/> PREFIX dbo: <http://dbpedia.org/ontology/> PREFIX foaf: <http://xmlns.com/foaf/0.1/> SELECT ?name ?hq ?location WHERE { dbr:Deutsche_Post foaf:name ?name . dbr:Deutsche_Post dbo:location ?hq . ?hq foaf:name ?location . } Querying Knowledge Graphs 69
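To make the semantics of the query above concrete, the following sketch evaluates its three triple patterns against a tiny in-memory triple list using a naive nested-loop join. It is an illustration of basic graph pattern matching only, not how a SPARQL engine or SANSA evaluates queries; the function name `match_bgp` and the sample triples are assumptions for this example and are not actual DBpedia output.

```python
def match_bgp(patterns, triples):
    """Return every variable binding that satisfies all triple patterns.

    Terms starting with '?' are variables; everything else must match
    the corresponding triple position exactly."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # Bind the variable, or check an existing binding.
                        if candidate.setdefault(term, value) != value:
                            break
                    elif term != value:
                        break
                else:
                    extended.append(candidate)
        solutions = extended
    return solutions


# Illustrative sample data (hypothetical).
triples = [
    ("dbr:Deutsche_Post", "foaf:name", "Deutsche Post"),
    ("dbr:Deutsche_Post", "dbo:location", "dbr:Bonn"),
    ("dbr:Bonn", "foaf:name", "Bonn"),
]
patterns = [
    ("dbr:Deutsche_Post", "foaf:name", "?name"),
    ("dbr:Deutsche_Post", "dbo:location", "?hq"),
    ("?hq", "foaf:name", "?location"),
]
results = match_bgp(patterns, triples)
print(results)
```

The shared variable `?hq` is what joins the second and third patterns: a triple only extends a partial solution if it agrees with the value already bound.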
  • 70. Over the last years, the size of the Semantic Web has increased and several large-scale datasets have been published. As of March 2019, ~10,000 datasets are openly available online using Semantic Web standards, plus many datasets RDFized and kept private Motivation 70 Source: LOD-Cloud (http://lod-cloud.net/)
  • 71. Speedup Ratio and Efficiency of DistLODStats Evaluation 71
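The slide does not spell out how the two quantities are computed. A common convention, assumed here, is speedup S = T_1 / T_N (runtime on one worker divided by runtime on N workers) and efficiency E = S / N. The sketch below applies these definitions; the function names and the timings are made up for illustration and are not results from the evaluation.

```python
def speedup(t_baseline, t_parallel):
    """Speedup ratio: baseline runtime divided by parallel runtime."""
    return t_baseline / t_parallel


def efficiency(t_baseline, t_parallel, workers):
    """Efficiency: speedup normalised by the number of workers."""
    return speedup(t_baseline, t_parallel) / workers


# Hypothetical timings in seconds, not measurements from the thesis.
t_one_worker = 100.0
t_five_workers = 25.0

print(speedup(t_one_worker, t_five_workers))
print(efficiency(t_one_worker, t_five_workers, workers=5))
```

An efficiency close to 1 means the added workers are almost fully used; values well below 1 point to communication or coordination overhead.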
  • 72. Overall Breakdown of DistLODStats by Criterion Analysis (log scale) Evaluation 72
  • 73. STATisfy: A REST Interface for DistLODStats 73 [Architecture diagram labels: CollaborativeAnalyticsServices; Marketplace; REST Server; BigDataEurope; Local; Cluster; Standalone; Resource manager; Master; Worker 1, Worker 2, Worker n; SANSA DistLODStats]
  • 74. QAP: consists of transformations and actions - Transformation: a rule set or a union/intersection of transformations - Rule: defines a conditional criterion for a triple, e.g. isIRI() - Filter: retrieves a subset of an RDF triple, e.g. getPredicates - Shortcuts ?s, ?p, ?o are frequently used for filters - Action: maps a triple set to a numerical value, e.g. count(r) Quality Assessment Patterns (QAPs) 74 Metric: External Linkage; Transformation τ: r_1 = isIRI(?s)∩internal(?s)∩isIRI(?o)∩external(?o), r_2 = isIRI(?s)∩external(?s)∩isIRI(?o)∩internal(?o), r_3 = r_1∪r_2; Action α: α_1 = count(r_3), α_2 = count(triples), α = α_1/α_2
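The External Linkage pattern from the slide can be traced with a small single-machine sketch. It follows the rules r_1, r_2, r_3 and the two counting actions as given, but the helper definitions are assumptions: an IRI is taken to be any term starting with `http://` or `https://`, and "internal" is taken to mean that the IRI starts with the dataset's base namespace. The function names and the sample triples are hypothetical and do not reflect the SANSA API.

```python
def is_iri(term):
    """Assumption: a term is an IRI if it looks like an HTTP(S) URL."""
    return term.startswith(("http://", "https://"))


def internal(term, base):
    """Assumption: internal IRIs live under the dataset's base namespace."""
    return is_iri(term) and term.startswith(base)


def external(term, base):
    return is_iri(term) and not term.startswith(base)


def external_linkage(triples, base):
    """alpha = count(r_3) / count(triples), with r_3 = r_1 union r_2."""
    def r1(t):  # internal subject linked to an external object
        return internal(t[0], base) and external(t[2], base)

    def r2(t):  # external subject linked to an internal object
        return external(t[0], base) and internal(t[2], base)

    r3 = [t for t in triples if r1(t) or r2(t)]
    return len(r3) / len(triples) if triples else 0.0


# Illustrative sample data (hypothetical).
base = "http://example.org/"
triples = [
    ("http://example.org/a", "http://example.org/p", "http://other.org/x"),
    ("http://example.org/a", "http://example.org/p", "http://example.org/b"),
    ("http://other.org/y", "http://example.org/p", "http://example.org/b"),
    ("http://example.org/a", "http://example.org/name", "Alice"),
]
score = external_linkage(triples, base)
print(score)
```

In a Spark setting the list comprehension would correspond to a filter transformation and `len` to a count action, which is the split between transformations and actions that the pattern describes.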
  • 75. Overall analysis of DistQualityAssessment by metric in the cluster mode (log scale) Evaluation 75
  • 76. Overall analysis of queries on LUBM-1K dataset (cluster mode) using Semantic-based approach Evaluation 76