The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with SANSA
Gezim Sejdiu, DBpedia Workshop @LDAC2020
~$ whoami
Data Engineer @DPDHL, PhD Student @SDA_Research | @UniBonn, SANSA Contributor & Open Source Enthusiast
#BigData #SemanticWeb
https://gezimsejdiu.github.io/
Education
- PhD Student (finishing) @University of Bonn
- MSc in Computer Engineering, Uni of Prishtina, Kosovo
- BSc in Computer Science, Uni of Prishtina, Kosovo
Experience
- Data Engineer @DPDHL | 2019 - present
- Research Scientist @UniBonn/@SDA | 2016 - 2019
- Guest Researcher @UniLeipzig | 2015 - 2016
- System Analyst @KEDS, Kosovo | 2009 - 2015
- Software Developer @EXPIK, Kosovo | 2008 - 2009
Big Data
Intro
What is Big Data?
No single definition:
- Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions
- Data sets that are so large or complex that traditional data processing application software is inadequate to deal with them
Every day, 2.5 quintillion bytes of data are created, so much that 90% of the data in the world today has been created in the last two years alone
It is not only about data collection or data querying; it is about learning from this tremendous amount of data for informed decision making
Why is "Big Data" so important?
Its relevance is increasing drastically, and Big Data Analytics is an emerging field to explore
https://trends.google.com/trends/explore?date=all&q=%22big%20data%22
Big Data Ecosystem
File system: HDFS, NFS
Resource manager: Mesos, YARN
Coordination: ZooKeeper
Data Acquisition: Apache Flume, Apache Sqoop
Data Stores: MongoDB, Cassandra, HBase, Project Voldemort
Data Processing
● Frameworks: Hadoop MapReduce, Apache Spark, Apache Storm, Apache Flink
● Tools: Apache Pig, Apache Hive
● Libraries: SparkR, Apache Mahout, MLlib, etc.
Data Integration
● Message Passing: Apache Kafka
● Managing data heterogeneity: SemaGrow, Strabon
Operational Frameworks
● Monitoring: Apache Ambari
Big Data Europe (BDE) Platform
https://github.com/big-data-europe
Key Observation From BDE
Heterogeneity aka Variety
Smart Big Data
Intro to Knowledge Graphs
Knowledge Graphs
Modelling entities and their relationships
The RDF (Resource Description Framework) model
Example graph (edges as read from the slide's figure):
- DPDHL --full name--> Deutsche Post DHL Group
- DPDHL --industry--> Logistics
- Logistics --label--> Logistik
- DPDHL --headquarters--> PostTower
- PostTower --located in--> Bonn
Analysis: finding the underlying structure of the graph, e.g. to predict unknown relationships
Examples: Google Knowledge Graph, DBpedia, Facebook, YAGO, Twitter, LinkedIn, MS Academic Graph, IBM Graph, Wikidata
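The RDF model from the Knowledge Graphs slides can be sketched as a set of (subject, predicate, object) triples. The following is a minimal illustration in plain Scala, not SANSA or Jena code: the `Triple` case class and the helpers `objectsOf` and `follow` are hypothetical names, and the edges are an assumed reading of the DPDHL example figure.

```scala
// Minimal sketch: an RDF-style graph as a set of (subject, predicate, object) triples.
// Names and edges are taken from the example figure; all types here are illustrative.
object KnowledgeGraphSketch {
  final case class Triple(s: String, p: String, o: String)

  val graph: Set[Triple] = Set(
    Triple("DPDHL", "full name", "Deutsche Post DHL Group"),
    Triple("DPDHL", "industry", "Logistics"),
    Triple("Logistics", "label", "Logistik"),
    Triple("DPDHL", "headquarters", "PostTower"),
    Triple("PostTower", "located in", "Bonn")
  )

  // All objects linked to `subject` via `predicate`.
  def objectsOf(subject: String, predicate: String): Set[String] =
    graph.filter(t => t.s == subject && t.p == predicate).map(_.o)

  // Follow a path of predicates starting from one node.
  def follow(start: String, path: List[String]): Set[String] =
    path.foldLeft(Set(start)) { (nodes, p) => nodes.flatMap(n => objectsOf(n, p)) }

  def main(args: Array[String]): Unit = {
    // In which city is the DPDHL headquarters located?
    println(follow("DPDHL", List("headquarters", "located in")))
  }
}
```

Following relationships is then a chain of lookups over the same triple set, which is the property that graph analysis such as link prediction builds on.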
Knowledge Graphs are everywhere
Entity Search and Summarization
Discovering Related Entities
SANSA
Scalable Semantic Analytics Stack
Motivation
Over the last years, the size of the Semantic Web has increased and several large-scale datasets have been published
> As of March 2019: ~10,000 datasets openly available online using Semantic Web standards, plus many datasets RDFized and kept private
Source: LOD-Cloud (http://lod-cloud.net/)
Dealing with such amounts of data makes many tasks hard to solve on a single machine
- Vocabulary Reuse: find a suitable vocabulary for your dataset
- Coverage Analysis: does the dataset contain the necessary information?
- Privacy Analysis: does the dataset contain sensitive information?
- Entity Linking: which datasets are good candidates for interlinking?
Why Distributed RDF Data Processing?
Tasks that are hard to solve on single machines (>1 TB memory consumption):
- Querying and processing LinkedGeoData
- Dataset statistics and quality assessment of the LOD Cloud
- Vandalism and outlier detection in DBpedia
- Inference on life science data (e.g. UniProt, EggNOG, StringDB)
- Clustering of DBpedia data
- Large-scale enrichment and link prediction, e.g. DBpedia → LinkedGeoData
SANSA Stack Vision
Why combine Big Data and the Semantic Web (SW)?
Aspect | "Big Data" Processing (Spark/Flink) | Semantic Technology Stack
Data Integration | Manual pre-processing | Partially automated, standardised
Modelling | Simple (often flat feature vectors) | Expressive
Support for data exchange | Limited (heterogeneous formats with limited schema information) | Yes (RDF & OWL W3C Recommendations)
Business value | Direct | Indirect
Horizontally scalable | Yes | No
Idea: combine advantages of both worlds
SANSA
SANSA is a data flow processing engine that provides data distribution and fault tolerance for distributed computation over large-scale RDF datasets
SANSA includes several libraries:
- Read / Write RDF / OWL library
- Querying library
- Inference library
- ML library
[Figure: SANSA stack. Core APIs & Libraries: Knowledge Distribution & Representation, Querying, Inference, Machine Learning. Deploy: Local (standalone), Cluster (resource manager), BigDataEurope]
Knowledge Representation (KR) Layer
Ingest RDF data in different formats using Jena API style interfaces
Represent data in multiple formats (e.g. RDD, DataFrames, GraphX, Tensors)
Allow transformation among these formats
Compute RDF dataset statistics and apply quality assessment in a distributed manner
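As a rough illustration of the kind of dataset statistics mentioned for the KR layer, the sketch below counts property usage and distinct subjects over an in-memory collection. It is plain Scala with hypothetical names (`Triple`, `propertyUsage`, `distinctSubjects`) and does not use the SANSA or DistLODStats API. A distributed version would follow the same map-then-aggregate-per-key shape, for example with Spark's `map` and `reduceByKey` on an RDD.

```scala
// Minimal sketch of two simple RDF dataset statistics over an in-memory collection.
// All names are illustrative assumptions, not SANSA identifiers.
object DatasetStatsSketch {
  final case class Triple(s: String, p: String, o: String)

  // Property usage: number of triples per predicate (group by key, then count).
  def propertyUsage(triples: Seq[Triple]): Map[String, Int] =
    triples.groupBy(_.p).map { case (p, ts) => p -> ts.size }

  // Number of distinct subjects in the dataset.
  def distinctSubjects(triples: Seq[Triple]): Int =
    triples.map(_.s).distinct.size

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Triple("ex:a", "rdf:type", "ex:Person"),
      Triple("ex:a", "ex:knows", "ex:b"),
      Triple("ex:b", "rdf:type", "ex:Person")
    )
    println(propertyUsage(data))
    println(distinctSubjects(data))
  }
}
```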
Query Layer
Make generic queries efficient and fast using:
- Intelligent indexing
- Splitting strategies
- Distributed storage
SPARQL query engine evaluation (SPARQL-to-SQL approaches, virtual views, direct mapping)
Provision of a W3C SPARQL compliant endpoint
Query Layer - the Sparklify approach
[Figure: Sparklify architecture. RDF data passes through the SANSA Engine's RDF Layer (data ingestion, partitioning); the Query Layer (Sparklifying) uses Sparqlify views over the distributed data structures. A SPARQL query becomes a SPARQL Algebra Expression Tree (AET), the AET is normalized, and SQL is generated to produce the results.]
SPARQL query:
SELECT ?s ?w WHERE {
  ?s a dbp:Person .
  ?s ex:workPage ?w .
}
Sparqlify view definition:
Prefix dbp: <http://dbpedia.org/ontology/>
Prefix ex: <http://ex.org/>
Create View view_person As
  Construct {
    ?s a dbp:Person .
    ?s ex:workPage ?w .
  }
  With
    ?s = uri('http://mydomain.org/person', ?id)
    ?w = uri(?work_page)
  Constrain
    ?w prefix "http://my-organization.org/user/"
  From
    person;
Resulting SQL:
SELECT id, work_page
FROM view_person;
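To make the query on this slide concrete, here is a small sketch that evaluates its two triple patterns as a join on `?s` over an in-memory triple collection. This is not how Sparklify works internally (it rewrites SPARQL to SQL over Spark); the code only illustrates, under assumed names (`Triple`, `personsWithWorkPage`) and with prefixed names as plain strings, what the basic graph pattern asks for.

```scala
// Minimal sketch: evaluating
//   SELECT ?s ?w WHERE { ?s a dbp:Person . ?s ex:workPage ?w . }
// as a join of two triple patterns over in-memory data. Illustrative names only.
object BgpJoinSketch {
  final case class Triple(s: String, p: String, o: String)

  def personsWithWorkPage(triples: Seq[Triple]): Seq[(String, String)] = {
    // Pattern 1: ?s a dbp:Person  ("a" is shorthand for rdf:type)
    val persons: Set[String] =
      triples.filter(t => t.p == "rdf:type" && t.o == "dbp:Person").map(_.s).toSet
    // Pattern 2: ?s ex:workPage ?w, joined with pattern 1 on the shared variable ?s
    triples
      .filter(t => t.p == "ex:workPage" && persons.contains(t.s))
      .map(t => (t.s, t.o))
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Triple("ex:p1", "rdf:type", "dbp:Person"),
      Triple("ex:p1", "ex:workPage", "http://my-organization.org/user/p1"),
      Triple("ex:org", "ex:workPage", "http://ex.org/org")
    )
    personsWithWorkPage(data).foreach(println)
  }
}
```

In the SQL produced from the view, the same join disappears because both patterns map to columns of a single `person` table row.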
Inference Layer
W3C Standards for Modelling: RDFS and OWL
Parallel in-memory inference via rule-based forward chaining
Beyond state of the art: dynamically build a rule dependency graph for a rule set
→ Adjustable performance levels
[Figure: RDFS rule dependency graph (simplified)]
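Rule-based forward chaining can be sketched as applying rules repeatedly until no new triples appear. The example below is a minimal, single-machine illustration with two RDFS entailment rules (rdfs9 and rdfs11); it is not SANSA's inference engine and it skips the rule dependency graph. The names `Triple`, `step` and `materialize` are hypothetical.

```scala
// Minimal sketch of forward chaining to a fixpoint with two RDFS rules.
// Illustrative only: no distribution, no rule dependency analysis.
object ForwardChainingSketch {
  final case class Triple(s: String, p: String, o: String)

  val Type = "rdf:type"
  val SubClassOf = "rdfs:subClassOf"

  // One round of rule application over the current set of triples.
  def step(triples: Set[Triple]): Set[Triple] = {
    val subClass = triples.filter(_.p == SubClassOf)
    // rdfs11: (A subClassOf B), (B subClassOf C) => (A subClassOf C)
    val rdfs11 = for {
      ab <- subClass
      bc <- subClass if ab.o == bc.s
    } yield Triple(ab.s, SubClassOf, bc.o)
    // rdfs9: (x type A), (A subClassOf B) => (x type B)
    val rdfs9 = for {
      xa <- triples if xa.p == Type
      ab <- subClass if xa.o == ab.s
    } yield Triple(xa.s, Type, ab.o)
    triples ++ rdfs11 ++ rdfs9
  }

  // Apply the rules until a round derives nothing new.
  @annotation.tailrec
  def materialize(triples: Set[Triple]): Set[Triple] = {
    val next = step(triples)
    if (next.size == triples.size) triples else materialize(next)
  }

  def main(args: Array[String]): Unit = {
    val input = Set(
      Triple("ex:alice", Type, "ex:Student"),
      Triple("ex:Student", SubClassOf, "ex:Person"),
      Triple("ex:Person", SubClassOf, "ex:Agent")
    )
    materialize(input).foreach(println)
  }
}
```

A rule dependency graph, as mentioned on the slide, would record that the output of rdfs11 can feed rdfs9, so an engine can order or skip rule applications instead of re-running every rule in every round.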
Machine Learning Layer
Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics
Algorithms:
- Knowledge graph embeddings, e.g. for KB completion and link prediction
- Graph clustering
  - Power Iteration, BorderFlow, link-based
  - Modularity-based clustering
- Semantic decision trees (in progress)
- Outlier detection
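Show me the Code
// Imports are assumed from the SANSA 0.7 package layout; `spark` (a SparkSession),
// `input`, `k` and `maxIterations` are expected to be defined beforehand
import org.apache.jena.riot.Lang
import net.sansa_stack.rdf.spark.io._
import net.sansa_stack.query.spark.query._
import net.sansa_stack.ml.spark.clustering._
// Read RDF files into a Spark RDD (of triples)
val triples = spark.rdf(Lang.NTRIPLES)(input)
// Define the SPARQL query
val sparqlQuery = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
// Evaluate the SPARQL query over Spark. Returns a DataFrame of triples
val result = triples.sparql(sparqlQuery)
// Cluster the result set with the Power Iteration Clustering (PIC) algorithm
val cluster = result
  .cluster(ClusteringAlgorithm.RDFGraphPowerIterationClustering)
  .setK(k)
  .setMaxIterations(maxIterations)
  .run()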
Interactive SANSA in your Browser
Powered by SANSA
Blockchain – Alethio Use Case <https://aleth.io/>
Alethio uses SANSA to perform large-scale batch analytics, e.g. computing the asset turnover for sets of accounts, attack pattern frequencies and opcode usage statistics. SANSA was run on a 100-node cluster with 400 cores.
Big Data Platform – BDE <https://www.big-data-europe.eu/>
BDE uses the Mu Swarm Logger service to detect Docker events and convert their representation to RDF. SANSA is used to compute statistics over these logs within the BDE platform: to generate visualisations of log statistics, BDE calls DistLODStats from SANSA-Notebooks.
Transparency and Compliance – SPIRIT <https://www.specialprivacy.eu/>
SANSA is used to analyse log information about personal data processing and sharing that line-of-business applications output on a continuous basis, and to present the information to the user via the SPIRIT dashboard.
Towards a European Data Space <http://boost40.eu/>
SANSA is used to cover heterogeneity between stakeholders and data providers for better and more efficient data processing, data management and data analytics.
10+ more use cases: http://sansa-stack.net/powered-by/
SANSA Pulse
SANSA 0.7 released in Jan 2020; releases every 6 months
Apache Open Source License
Project activity:
- Contributors (at least one commit): 17
- Commits per day: 7.3
- Commits previous year: 2675
- GitHub stars (all repos): 271
Conclusions and Next steps
SANSA = the only comprehensive, open source RDF processing and analysis stack for distributed in-memory computing
Combines distributed in-memory computing and analytics (Apache Spark & Apache Flink) with the Semantic Web technology stack
Next steps
- Support for SPARQL 1.1 (Query Layer) via Ontop integration
- Backward chaining and better evaluation (Inference Layer)
- More algorithms and definition of ML pipelines (ML Layer)
Thank you
@Gezim_Sejdiu
https://gezimsejdiu.github.io/
SANSA Semantic Analytics Stack
https://github.com/SANSA-Stack
● SANSA-RDF
● SANSA-OWL
● SANSA-Query
● SANSA-Inference
● SANSA-ML
● SANSA-Examples
● SANSA-Notebooks
● SANSA Demo
Associated Publications (as of January 2020)
1. Distributed Semantic Analytics using the SANSA Stack. Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen. In Proceedings of 16th International Semantic Web Conference – Resources Track (ISWC’2017), 2017.
2. The Tale of Sansa Spark. Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen. In Proceedings of 16th International Semantic Web Conference, Poster & Demos, 2017.
3. DistLODStats: Distributed Computation of RDF Dataset Statistics. Gezim Sejdiu, Ivan Ermilov, Jens Lehmann and Mohamed Nadjib Mami. In Proceedings of 17th International Semantic Web Conference, 2018.
4. STATisfy Me: What are my Stats? Gezim Sejdiu, Ivan Ermilov, Jens Lehmann and Mohamed Nadjib Mami. In Proceedings of 17th International Semantic Web Conference, Poster & Demos, 2018.
5. Profiting from Kitties on Ethereum: Leveraging Blockchain RDF with SANSA. Damien Graux, Gezim Sejdiu, Hajira Jabeen, Jens Lehmann, Danning Sui, Dominik Muhs and Johannes Pfeffer. In 14th International Conference on Semantic Systems, Poster & Demos, 2018.
6. SPIRIT: A Semantic Transparency and Compliance Stack. Patrick Westphal, Javier Fernández, Sabrina Kirrane and Jens Lehmann. In 14th International Conference on Semantic Systems, Poster & Demos, 2018.
7. Divided we stand out! Forging Cohorts fOr Numeric Outlier Detection in large scale knowledge graphs (CONOD). Hajira Jabeen, Rajjat Dadwal, Gezim Sejdiu and Jens Lehmann. In 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW’2018), 2018.
8. Clustering Pipelines of large RDF POI Data. Rajjat Dadwal, Damien Graux, Gezim Sejdiu, Hajira Jabeen and Jens Lehmann. In Proceedings of 16th Extended Semantic Web Conference (ESWC 2019), Poster & Demos, 2019.
9. Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets. Claus Stadler, Gezim Sejdiu, Damien Graux and Jens Lehmann. In Proceedings of 18th International Semantic Web Conference, 2019.
10. A Scalable Framework for Quality Assessment of RDF Datasets. Gezim Sejdiu, Anisa Rula, Jens Lehmann and Hajira Jabeen. In Proceedings of 18th International Semantic Web Conference, 2019.
11. Squerall: Virtual Ontology-Based Access to Heterogeneous and Large Data Sources. Mohamed Nadjib Mami, Damien Graux, Simon Scerri, Hajira Jabeen, Sören Auer and Jens Lehmann. In Proceedings of 18th International Semantic Web Conference, 2019.
12. Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation. Gezim Sejdiu, Damien Graux, Imran Khan, Ioanna Lytra, Hajira Jabeen and Jens Lehmann. In 15th International Conference on Semantic Systems (SEMANTiCS), 2019.
13. Querying large-scale RDF datasets using the SANSA framework. Claus Stadler, Gezim Sejdiu, Damien Graux and Jens Lehmann. In Proceedings of 18th International Semantic Web Conference (ISWC), Poster & Demos, 2019.
14. The Hubs and Authorities Transaction Network Analysis using the SANSA framework. Danning Sui, Gezim Sejdiu, Damien Graux and Jens Lehmann. In 15th International Conference on Semantic Systems (SEMANTiCS), Poster & Demos, 2019.
15. DISE: A Distributed in-Memory SPARQL Processing Engine over Tensor Data. Hajira Jabeen, Eskender Haziiev, Gezim Sejdiu and Jens Lehmann. In 14th IEEE International Conference on Semantic Computing (ICSC'20), 2020.
Backup slides
Big Data Dimensions
- Volume: the size of the data
- Velocity: the speed at which the data is generated
- Variety: the different types of data
- Veracity: the trustworthiness of the data in terms of accuracy
- Value: just having Big Data is of no use unless we can turn it into value
Big Data Analytics
Big data is more real-time in nature than traditional data warehousing (DW) applications
Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data applications
Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data applications
BDE impact
A Scalable Framework for Quality Assessment of RDF Datasets
Assessing data quality is of paramount importance to judge its fitness for a particular use case
- Availability
- Completeness
- Consistency
- Interlinking
Contribution: provision of a software framework for quality assessment of large-scale RDF datasets
Runtime (in minutes)
Dataset | Luzzu a) single | Luzzu b) joint | DistQualityAssessment c) local | DistQualityAssessment d) cluster
LinkedGeoData | Fail | Fail | 446.90 | 7.79
DBpedia_en | Fail | Fail | 274.31 | 1.99
DBpedia_de | Fail | Fail | 61.40 | 0.46
DBpedia_fr | Fail | Fail | 195.30 | 0.38
BSBM_0.01GB | 2.64 | 2.65 | 0.04 | 0.42
BSBM_0.05GB | 16.38 | 15.39 | 0.05 | 0.46
BSBM_0.1GB | 40.59 | 37.94 | 0.06 | 0.44
BSBM_0.5GB | 459.19 | 468.64 | 0.15 | 0.48
BSBM_1GB | 1454.16 | 1532.95 | 0.40 | 0.56
BSBM_2GB | Timeout | Timeout | 3.19 | 0.62
BSBM_10GB | Timeout | Timeout | 29.44 | 0.52
BSBM_20GB | Fail | Fail | 34.32 | 0.75
BSBM_200GB | Fail | Fail | 454.46 | 7.27
Cluster configuration: 7 machines (1 master, 6 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 cores), 128 GB RAM, 12 TB SATA RAID-5, Spark 2.4.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8. Local mode: single instance of the cluster [10]
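One way to read the runtime table is as a speedup of cluster mode over local mode for DistQualityAssessment. The sketch below does that arithmetic for a few rows copied from the table; the object and function names are illustrative, and the row selection is arbitrary.

```scala
// Minimal sketch: speedup of d) cluster over c) local, using runtimes (minutes) from the table.
object SpeedupSketch {
  // (dataset, c) local runtime, d) cluster runtime)
  val runtimes: Seq[(String, Double, Double)] = Seq(
    ("LinkedGeoData", 446.90, 7.79),
    ("DBpedia_en", 274.31, 1.99),
    ("BSBM_200GB", 454.46, 7.27),
    ("BSBM_0.01GB", 0.04, 0.42)
  )

  // Speedup > 1 means the cluster run was faster than the local run.
  def speedup(local: Double, cluster: Double): Double = local / cluster

  def main(args: Array[String]): Unit = {
    runtimes.foreach { case (name, local, cluster) =>
      println(f"$name%-14s ${speedup(local, cluster)}%.1fx")
    }
  }
}
```

For the large datasets the ratio is well above 1, while for the smallest BSBM dataset it falls below 1, that is, the cluster run is slower than the local run for that row.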

Contenu connexe

Tendances

Das Semantische Daten Web für Unternehmen
Das Semantische Daten Web für UnternehmenDas Semantische Daten Web für Unternehmen
Das Semantische Daten Web für Unternehmen
Sören Auer
 

Tendances (20)

51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine Learning
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representations
 
Virtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDFVirtuoso -- The Prometheus of RDF
Virtuoso -- The Prometheus of RDF
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 
External CV support in Dataverse 5.7
External CV support in Dataverse 5.7External CV support in Dataverse 5.7
External CV support in Dataverse 5.7
 
CLARIAH CMDI use case and flexible metadata schemes
CLARIAH CMDI use case and flexible metadata schemesCLARIAH CMDI use case and flexible metadata schemes
CLARIAH CMDI use case and flexible metadata schemes
 
The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
 
Providing Linked Data
Providing Linked DataProviding Linked Data
Providing Linked Data
 
Classification of Big Data Use Cases by different Facets
Classification of Big Data Use Cases by different FacetsClassification of Big Data Use Cases by different Facets
Classification of Big Data Use Cases by different Facets
 
Building Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsBuilding Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 steps
 
Das Semantische Daten Web für Unternehmen
Das Semantische Daten Web für UnternehmenDas Semantische Daten Web für Unternehmen
Das Semantische Daten Web für Unternehmen
 
LDOW2015 Position Talk and Discussion
LDOW2015 Position Talk and DiscussionLDOW2015 Position Talk and Discussion
LDOW2015 Position Talk and Discussion
 
RDF data clustering
RDF data clusteringRDF data clustering
RDF data clustering
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
 

Similaire à The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with SANSA @LDAC Workshop 2020 Talk

Database Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wiDatabase Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wi
OllieShoresna
 
Wed roman tut_open_datapub
Wed roman tut_open_datapubWed roman tut_open_datapub
Wed roman tut_open_datapub
eswcsummerschool
 

Similaire à The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with SANSA @LDAC Workshop 2020 Talk (20)

Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and Ontario
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in Azure
 
Database Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wiDatabase Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wi
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Wed roman tut_open_datapub
Wed roman tut_open_datapubWed roman tut_open_datapub
Wed roman tut_open_datapub
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open Platform
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - Panorays
 
Sigma EE: Reaping low-hanging fruits in RDF-based data integration
Sigma EE: Reaping low-hanging fruits in RDF-based data integrationSigma EE: Reaping low-hanging fruits in RDF-based data integration
Sigma EE: Reaping low-hanging fruits in RDF-based data integration
 
Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1) Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1)
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان دادهمعرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
 
Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data
 

Dernier

Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
PODOCARPUS...........................pptx
PODOCARPUS...........................pptxPODOCARPUS...........................pptx
PODOCARPUS...........................pptx
Cherry
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cherry
 
Lipids: types, structure and important functions.
Lipids: types, structure and important functions.Lipids: types, structure and important functions.
Lipids: types, structure and important functions.
Cherry
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
ANSARKHAN96
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
Cherry
 

Dernier (20)

Method of Quantifying interactions and its types
Method of Quantifying interactions and its typesMethod of Quantifying interactions and its types
Method of Quantifying interactions and its types
 
Cot curve, melting temperature, unique and repetitive DNA
Cot curve, melting temperature, unique and repetitive DNACot curve, melting temperature, unique and repetitive DNA
Cot curve, melting temperature, unique and repetitive DNA
 
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
 
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
 
PODOCARPUS...........................pptx
PODOCARPUS...........................pptxPODOCARPUS...........................pptx
PODOCARPUS...........................pptx
 
GBSN - Microbiology (Unit 5) Concept of isolation
GBSN - Microbiology (Unit 5) Concept of isolationGBSN - Microbiology (Unit 5) Concept of isolation
GBSN - Microbiology (Unit 5) Concept of isolation
 
GBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) MetabolismGBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) Metabolism
 
X-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
X-rays from a Central “Exhaust Vent” of the Galactic Center ChimneyX-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
X-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Site specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdfSite specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdf
 
Genome organization in virus,bacteria and eukaryotes.pptx
Genome organization in virus,bacteria and eukaryotes.pptxGenome organization in virus,bacteria and eukaryotes.pptx
Genome organization in virus,bacteria and eukaryotes.pptx
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
Precision Silviculture and Silviculture practices of bamboo.pptx
Precision Silviculture and Silviculture practices of bamboo.pptxPrecision Silviculture and Silviculture practices of bamboo.pptx
Precision Silviculture and Silviculture practices of bamboo.pptx
 
EU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdfEU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdf
 
Lipids: types, structure and important functions.
Lipids: types, structure and important functions.Lipids: types, structure and important functions.
Lipids: types, structure and important functions.
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 

The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with SANSA @LDAC Workshop 2020 Talk

  • 1. The Best of Both Worlds Unlocking the Power of (big) Knowledge Graphs with SANSA Gezim Sejdiu, DBpedia Workshop @LDAC2020
  • 2. Education - PhD Student (finishing) @University of Bonn - Msc in Computer Engineering, Uni of Prishtina, Kosovo - Bsc in Computer Science, Uni of Prishtina, Kosovo Experience - Data Engineer @DPDHL|2019 - present - Research Scientist @UniBonn/@SDA|2016 - 2019 - Guest Researcher @UniLeipzig|2015 - 2016 - System Analyst @KEDS, Kosovo|2009 - 2015 - Software Developer @EXPIK, Kosovo|2008 - 2009 ~$ whoami Data Engineer @DPDHL, PhD Student @SDA_Research | @UniBonn, SANSA Contributor & Open Source Enthusiast #BigData #SemanticWeb https://gezimsejdiu.github.io/ 2
  • 4. No single definition. Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions. Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. What is Big Data? 4
  • 5. Every day, 2.5 quintillion bytes of data are created, so much that 90% of the data in the world today has been created in the last two years alone. It is not only about data collection or data querying; it is about learning from this tremendous amount of data for informed decision making. What is Big Data? 5
  • 6. Its relevance is increasing drastically, and Big Data Analytics is an emerging field to explore. Why is 'Big Data' so important? https://trends.google.com/trends/explore?date=all&q=%22big%20data%22 6
  • 9. Big Data Ecosystem
      - File system: HDFS, NFS
      - Resource manager: Mesos, YARN
      - Coordination: ZooKeeper
      - Data acquisition: Apache Flume, Apache Sqoop
      - Data stores: MongoDB, Cassandra, HBase, Project Voldemort
      - Data processing, frameworks: Hadoop MapReduce, Apache Spark, Apache Storm, Apache Flink
      - Data processing, tools: Apache Pig, Apache Hive
      - Data processing, libraries: SparkR, Apache Mahout, MLlib, etc.
      - Data integration, message passing: Apache Kafka
      - Data integration, managing data heterogeneity: SemaGrow, Strabon
      - Operational frameworks, monitoring: Apache Ambari
  • 10. Big Data Europe (BDE) Platform https://github.com/big-data-europe 10
  • 11. Heterogeneity aka Variety Key Observation From BDE 11
  • 12. Smart Big Data Intro to Knowledge Graphs 12
  • 13. Knowledge Graphs: modelling entities and their relationships with the RDF (Resource Description Framework) model. Example graph around the entity DPDHL, given as edge label → value: full name → "Deutsche Post DHL Group"; industry → Logistics; label → "Logistik"; headquarters → PostTower; located in → Bonn. 13
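The RDF model on this slide can be sketched as plain subject-predicate-object triples. The following is a minimal, library-free Scala sketch, not SANSA or Jena code. The `ex:` names, the `Triple` class and the helper functions are hypothetical, and it assumes that the label "Logistik" belongs to the Logistics node and that "located in" connects PostTower to Bonn, which the flattened figure does not state explicitly.

```scala
// A minimal sketch of the RDF triple model using only the Scala standard library.
// All "ex:" identifiers are hypothetical names chosen for illustration.
final case class Triple(subject: String, predicate: String, obj: String)

object KnowledgeGraphSketch {
  // The facts from the slide, written as subject-predicate-object triples.
  // Assumption: "label" attaches to Logistics and "located in" to PostTower.
  val graph: Seq[Triple] = Seq(
    Triple("ex:DPDHL", "ex:fullName", "\"Deutsche Post DHL Group\""),
    Triple("ex:DPDHL", "ex:industry", "ex:Logistics"),
    Triple("ex:Logistics", "rdfs:label", "\"Logistik\""),
    Triple("ex:DPDHL", "ex:headquarters", "ex:PostTower"),
    Triple("ex:PostTower", "ex:locatedIn", "ex:Bonn")
  )

  // All objects reachable from `subject` through a single `predicate` edge.
  def objectsOf(subject: String, predicate: String): Seq[String] =
    graph.collect { case Triple(`subject`, `predicate`, o) => o }

  // Follow a path of predicates, e.g. headquarters, then locatedIn.
  def follow(start: String, path: Seq[String]): Seq[String] =
    path.foldLeft(Seq(start)) { (nodes, p) => nodes.flatMap(n => objectsOf(n, p)) }

  def main(args: Array[String]): Unit =
    println(follow("ex:DPDHL", Seq("ex:headquarters", "ex:locatedIn")))
}
```

Following the two-edge path from `ex:DPDHL` answers "in which city is the headquarters located?", which is the kind of relationship traversal a knowledge graph is meant to support.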
  • 14. Modelling entities and their relationships Analysis: finding underlying structure of the graph e.g. to predict unknown relationships Examples: Google Knowledge Graph, DBpedia, Facebook, YAGO, Twitter, LinkedIn, MS Academic Graph, IBM Graph, WikiData Knowledge Graphs 14
  • 15. Knowledge Graphs are everywhere Entity Search and Summarization Discovering Related Entities 15
  • 17. Over the last years, the size of the Semantic Web has increased and several large-scale datasets were published. As of March 2019: ~10,000 datasets openly available online using Semantic Web standards, plus many datasets RDFized and kept private. Motivation Source: LOD-Cloud (http://lod-cloud.net/ ) 17
  • 18. Dealing with such an amount of data makes many tasks hard to solve on a single machine: - Vocabulary reuse: find a suitable vocabulary for your dataset - Coverage analysis: does the dataset contain the necessary information? - Privacy analysis: does the dataset contain sensitive information? - Entity linking: which datasets are good candidates for interlinking? Motivation 18
  • 19. Tasks that are hard to solve on single machines (>1 TB memory consumption): - Querying and processing LinkedGeoData - Dataset statistics and quality assessment of the LOD Cloud - Vandalism and outlier detection in DBpedia - Inference on life science data (e.g. UniProt, EggNOG, StringDB) - Clustering of DBpedia data - Large-scale enrichment and link prediction for e.g. DBpedia → LinkedGeoData Why Distributed RDF Data Processing? 19
  • 21. Why combine Big Data and the Semantic Web (SW)?
      Aspect | "Big Data" processing (Spark/Flink) | Semantic technology stack
      Data integration | Manual pre-processing | Partially automated, standardised
      Modelling | Simple (often flat feature vectors) | Expressive
      Support for data exchange | Limited (heterogeneous formats with limited schema information) | Yes (RDF & OWL W3C Recommendations)
      Business value | Direct | Indirect
      Horizontally scalable | Yes | No
      Idea: combine the advantages of both worlds
  • 22. SANSA is a data flow processing engine that provides data distribution and fault tolerance for distributed computation over large-scale RDF datasets. SANSA includes several libraries: - Read / Write RDF / OWL library - Querying library - Inference library - ML library [Figure: SANSA stack. Labels: core APIs & libraries (Knowledge Distribution & Representation, Querying, Inference, Machine Learning); deploy (local, cluster: standalone, resource manager); BigDataEurope.] SANSA 22
  • 23. Ingest RDF data in different formats using Jena API style interfaces Represent data in multiple formats - (e.g. RDD, Data Frames, GraphX, Tensors) Allow transformation among these formats Compute RDF dataset statistics and apply quality assessment in a distributed manner Knowledge Representation (KR) Layer 23
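To make "compute RDF dataset statistics" concrete, here is a small sketch of two typical statistics: how often each property is used and how many distinct subjects occur. It uses ordinary Scala collections as a stand-in for a distributed collection and is not SANSA's API. The `Triple` class, the object name and the sample data are hypothetical. On Spark, the same shape would typically be a map to `(predicate, 1)` pairs followed by `reduceByKey`.

```scala
// Library-free sketch of simple RDF dataset statistics.
// Scala collections stand in for a distributed collection here.
final case class Triple(subject: String, predicate: String, obj: String)

object DatasetStatsSketch {
  // Hypothetical sample data for illustration only.
  val sample: Seq[Triple] = Seq(
    Triple("ex:Bonn", "rdf:type", "ex:City"),
    Triple("ex:Leipzig", "rdf:type", "ex:City"),
    Triple("ex:Bonn", "rdfs:label", "\"Bonn\"")
  )

  // Property usage: number of triples per predicate.
  def propertyUsage(triples: Seq[Triple]): Map[String, Int] =
    triples.groupBy(_.predicate).map { case (p, ts) => p -> ts.size }

  // Number of distinct subjects in the dataset.
  def distinctSubjects(triples: Seq[Triple]): Int =
    triples.map(_.subject).distinct.size

  def main(args: Array[String]): Unit = {
    println(propertyUsage(sample))
    println(distinctSubjects(sample))
  }
}
```

Both statistics are per-key aggregations, which is why they can be computed partition by partition and merged when the data does not fit on one machine.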
  • 24. To make generic queries efficient and fast using: - Intelligent indexing - Splitting strategies - Distributed Storage SPARQL query engine evaluation (SPARQL-to-SQL approaches, Virtual Views, direct mapping) Provision of W3C SPARQL compliant endpoint Query Layer 24
  • 25. Query Layer - the Sparklify approach
      [Figure: Sparklify architecture. Labels: RDF data; SANSA engine with RDF layer (data ingestion, partitioning) and query layer (sparklifying, views); distributed data structures; Sparqlify (SPARQL query, SPARQL Algebra Expression Tree (AET), normalize AET, SQL); results.]
      SPARQL query:
        SELECT ?s ?w WHERE {
          ?s a dbp:Person .
          ?s ex:workPage ?w .
        }
      View definition:
        Prefix dbp:<http://dbpedia.org/ontology/>
        Prefix ex:<http://ex.org/>
        Create View view_person As
          Construct {
            ?s a dbp:Person .
            ?s ex:workPage ?w .
          }
          With
            ?s = uri('http://mydomain.org/person', ?id)
            ?w = uri(?work_page)
          Constrain
            ?w prefix "http://my-organization.org/user/"
          From person;
      SQL:
        SELECT id, work_page FROM view_person ;
  • 26. W3C Standards for Modelling: RDFS and OWL Parallel in-memory inference via rule-based forward chaining Beyond state of the art: dynamically build a rule dependency graph for a rule set → Adjustable performance levels Inference Layer 26
  • 27. Inference Layer RDFS rule dependency graph (simplified) 27
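Rule-based forward chaining can be illustrated with two RDFS entailment rules: rdfs11 (transitivity of `rdfs:subClassOf`) and rdfs9 (type inheritance along `rdfs:subClassOf`). The sketch below applies both rules repeatedly until no new triple appears. It runs on a single machine with Scala sets and does not build the rule dependency graph mentioned on the slides; all names (`Triple`, `ForwardChainingSketch`, the `ex:` resources) are hypothetical.

```scala
// Naive forward chaining over two RDFS rules, using only the standard library.
final case class Triple(s: String, p: String, o: String)

object ForwardChainingSketch {
  val SubClassOf = "rdfs:subClassOf"
  val Type = "rdf:type"

  // One round of rule application: triples derivable from `kb` in a single step.
  def step(kb: Set[Triple]): Set[Triple] = {
    val subClass = kb.filter(_.p == SubClassOf)
    // rdfs11: (a subClassOf b), (b subClassOf c) => (a subClassOf c)
    val transitive = for {
      ab <- subClass
      bc <- subClass if ab.o == bc.s
    } yield Triple(ab.s, SubClassOf, bc.o)
    // rdfs9: (x type a), (a subClassOf b) => (x type b)
    val inherited = for {
      xa <- kb if xa.p == Type
      ab <- subClass if xa.o == ab.s
    } yield Triple(xa.s, Type, ab.o)
    transitive ++ inherited
  }

  // Apply the rules until a fixpoint is reached (no new triples are produced).
  @scala.annotation.tailrec
  def saturate(kb: Set[Triple]): Set[Triple] = {
    val next = kb ++ step(kb)
    if (next.size == kb.size) kb else saturate(next)
  }

  def main(args: Array[String]): Unit = {
    val kb = Set(
      Triple("ex:Bonn", Type, "ex:City"),
      Triple("ex:City", SubClassOf, "ex:Place"),
      Triple("ex:Place", SubClassOf, "ex:Thing")
    )
    saturate(kb).foreach(println)
  }
}
```

A rule dependency graph would record that the output of rdfs11 can feed rdfs9, so an engine could order or skip rule applications instead of re-running every rule in every round as this sketch does.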
  • 28. Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics Algorithms: - Knowledge graph embeddings for e.g. KB completion, link prediction - Graph Clustering - Power Iteration, BorderFlow, Link based - Modularity based clustering - Semantic Decision trees (in progress) - Outlier detection Machine Learning Layer 28
  • 30. Show me the Code 30
  • 31. Show me the Code
        // Read RDF files into a Spark RDD (of triples)
        val triples = spark.rdf(Lang.NTRIPLES)(input)

        // Define a SPARQL query
        val sparqlQuery = "SELECT * WHERE {?s ?p ?o} LIMIT 10"

        // Evaluate the SPARQL query over Spark; returns a DataFrame of triples
        val result = triples.sparql(sparqlQuery)

        // Cluster the result set via the Power Iteration Clustering (PIC) algorithm
        val cluster = result.cluster(ClusteringAlgorithm.RDFGraphPowerIterationClustering)
          .setK(k).setMaxIterations(maxIterations).run()
  • 32. Interactive SANSA in your Browser 32
  • 33. Powered by SANSA
      - Blockchain, Alethio use case <https://aleth.io/>: Alethio uses SANSA to perform large-scale batch analytics, e.g. computing the asset turnover for sets of accounts, attack pattern frequencies, and Opcode usage statistics. SANSA was run on a 100-node cluster with 400 cores.
      - Big Data Platform, BDE <https://www.big-data-europe.eu/>: SANSA is used for computing statistics over logs within the BDE platform. BDE uses the Mu Swarm Logger service to detect Docker events and convert their representation to RDF. To generate visualisations of log statistics, BDE then calls DistLODStats from SANSA-Notebooks.
      - Transparency and Compliance, SPIRIT <https://www.specialprivacy.eu/>: SANSA is used to analyse log information concerning personal data processing and sharing that is output from line-of-business applications on a continuous basis, and to present the information to the user via the SPIRIT dashboard.
      - Towards a European Data Space <http://boost40.eu/>: SANSA is used to cover heterogeneity between stakeholders and data providers, for better and more efficient data processing, data management and data analytics.
      - 10+ more use cases: http://sansa-stack.net/powered-by/
  • 34. SANSA 0.7 in Jan 2020, releases every 6 months Apache Open Source License Project activity: - Contributors (at least one commit): 17 - Commits per day: 7.3 - Commits previous year: 2675 - Github stars (all repos): 271 SANSA Pulse 34
  • 35. SANSA = only comprehensive, open source RDF processing and analysis stack for distributed in-memory computing Combines distributed in-memory computing and analytics (Apache Spark & Apache Flink) with Semantic Web technology stack Next steps - Support for SPARQL 1.1 (Query Layer) via Ontop integration - Backward chaining and better evaluation (Inference Layer) - More algorithms and definition of ML pipelines (ML Layer) Conclusions and Next steps 35
  • 36. Thank you 36 @Gezim_Sejdiu https://gezimsejdiu.github.io/ SANSA Semantic Analytics Stack https://github.com/SANSA-Stack ● SANSA-RDF ● SANSA-OWL ● SANSA-Query ● SANSA-Inference ● SANSA-ML ● SANSA-Examples ● SANSA-Notebooks ● SANSA Demo
  • 37. Associated Publications (as of January 2020)
      1. Distributed Semantic Analytics using the SANSA Stack. Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo, and Hajira Jabeen. In Proceedings of 16th International Semantic Web Conference, Resources Track (ISWC’2017), 2017.
      2. The Tale of Sansa Spark. Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo, and Hajira Jabeen. In Proceedings of 16th International Semantic Web Conference, Poster & Demos, 2017.
      3. DistLODStats: Distributed Computation of RDF Dataset Statistics. Gezim Sejdiu, Ivan Ermilov, Jens Lehmann, and Mohamed Nadjib Mami. In Proceedings of 17th International Semantic Web Conference, 2018.
      4. STATisfy Me: What are my Stats? Gezim Sejdiu, Ivan Ermilov, Jens Lehmann, and Mohamed Nadjib Mami. In Proceedings of 17th International Semantic Web Conference, Poster & Demos, 2018.
      5. Profiting from Kitties on Ethereum: Leveraging Blockchain RDF with SANSA. Damien Graux, Gezim Sejdiu, Hajira Jabeen, Jens Lehmann, Danning Sui, Dominik Muhs, and Johannes Pfeffer. In 14th International Conference on Semantic Systems, Poster & Demos, 2018.
  • 38. Associated Publications (as of January 2020)
      6. SPIRIT: A Semantic Transparency and Compliance Stack. Patrick Westphal, Javier Fernández, Sabrina Kirrane, and Jens Lehmann. In 14th International Conference on Semantic Systems, Poster & Demos, 2018.
      7. Divided we stand out! Forging Cohorts fOr Numeric Outlier Detection in large scale knowledge graphs (CONOD). Hajira Jabeen, Rajjat Dadwal, Gezim Sejdiu, and Jens Lehmann. In 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW’2018), 2018.
      8. Clustering Pipelines of large RDF POI Data. Rajjat Dadwal, Damien Graux, Gezim Sejdiu, Hajira Jabeen, and Jens Lehmann. In Proceedings of 16th Extended Semantic Web Conference (ESWC 2019), Poster & Demos, 2019.
      9. Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets. Claus Stadler, Gezim Sejdiu, Damien Graux, and Jens Lehmann. In Proceedings of 18th International Semantic Web Conference, 2019.
      10. A Scalable Framework for Quality Assessment of RDF Datasets. Gezim Sejdiu, Anisa Rula, Jens Lehmann, and Hajira Jabeen. In Proceedings of 18th International Semantic Web Conference, 2019.
      11. Squerall: Virtual Ontology-Based Access to Heterogeneous and Large Data Sources. Mohamed Nadjib Mami, Damien Graux, Simon Scerri, Hajira Jabeen, Sören Auer, and Jens Lehmann. In Proceedings of 18th International Semantic Web Conference, 2019.
  • 39. Associated Publications (as of January 2020)
      12. Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation. Gezim Sejdiu, Damien Graux, Imran Khan, Ioanna Lytra, Hajira Jabeen, and Jens Lehmann. In 15th International Conference on Semantic Systems (SEMANTiCS), 2019.
      13. Querying large-scale RDF datasets using the SANSA framework. Claus Stadler, Gezim Sejdiu, Damien Graux, and Jens Lehmann. In Proceedings of 18th International Semantic Web Conference (ISWC), Poster & Demos, 2019.
      14. The Hubs and Authorities Transaction Network Analysis using the SANSA framework. Danning Sui, Gezim Sejdiu, Damien Graux, and Jens Lehmann. In 15th International Conference on Semantic Systems (SEMANTiCS), Poster & Demos, 2019.
      15. DISE: A Distributed in-Memory SPARQL Processing Engine over Tensor Data. Hajira Jabeen, Eskender Haziiev, Gezim Sejdiu, and Jens Lehmann. In 14th IEEE International Conference on Semantic Computing (ICSC'20), 2020.
  • 46. Big Data Dimensions
      - Volume: the size of the data
      - Velocity: the speed at which the data is generated
      - Variety: the different types of data
      - Veracity: the trustworthiness of the data in terms of accuracy
      - Value: just having Big Data is of no use unless we can turn it into value
  • 47. Big data is more real-time in nature than traditional data warehousing (DW) applications Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data applications Shared nothing, massively parallel processing, scale out architectures are well-suited for big data applications Big Data Analytics 47
  • 49. Assessing data quality is of paramount importance to judge its fitness for a particular use case - Availability - Completeness - Consistency - Interlinking Contribution: provision of a software framework for quality assessment of large-scale RDF datasets A Scalable Framework for Quality Assessment of RDF Datasets 49
  • 50. A Scalable Framework for Quality Assessment of RDF Datasets
      Runtime (in minutes). Luzzu: a) single, b) joint. DistQualityAssessment: c) local, d) cluster.
      Dataset | a) single | b) joint | c) local | d) cluster
      LinkedGeoData | Fail | Fail | 446.90 | 7.79
      DBpedia_en | Fail | Fail | 274.31 | 1.99
      DBpedia_de | Fail | Fail | 61.40 | 0.46
      DBpedia_fr | Fail | Fail | 195.30 | 0.38
      BSBM_0.01GB | 2.64 | 2.65 | 0.04 | 0.42
      BSBM_0.05GB | 16.38 | 15.39 | 0.05 | 0.46
      BSBM_0.1GB | 40.59 | 37.94 | 0.06 | 0.44
      BSBM_0.5GB | 459.19 | 468.64 | 0.15 | 0.48
      BSBM_1GB | 1454.16 | 1532.95 | 0.40 | 0.56
      BSBM_2GB | Timeout | Timeout | 3.19 | 0.62
      BSBM_10GB | Timeout | Timeout | 29.44 | 0.52
      BSBM_20GB | Fail | Fail | 34.32 | 0.75
      BSBM_200GB | Fail | Fail | 454.46 | 7.27
      Cluster configuration: 7 machines (1 master, 6 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 cores), 128 GB RAM, 12 TB SATA RAID-5, Spark 2.4.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8. Local mode: single instance of the cluster [10].
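One way to read the runtime table is as a speedup of cluster mode over local mode, computed as local runtime divided by cluster runtime. The sketch below does this arithmetic for a few rows copied from the table; the object and function names are hypothetical and chosen only for illustration.

```scala
// Speedup of DistQualityAssessment cluster mode over local mode,
// using runtimes (in minutes) copied from the table above.
object SpeedupSketch {
  // (dataset, local runtime, cluster runtime)
  val runtimes: Seq[(String, Double, Double)] = Seq(
    ("LinkedGeoData", 446.90, 7.79),
    ("DBpedia_en", 274.31, 1.99),
    ("BSBM_0.01GB", 0.04, 0.42),
    ("BSBM_200GB", 454.46, 7.27)
  )

  // A value above 1 means cluster mode was faster than local mode.
  def speedup(local: Double, cluster: Double): Double = local / cluster

  def main(args: Array[String]): Unit =
    runtimes.foreach { case (name, local, cluster) =>
      println(f"$name%-14s ${speedup(local, cluster)}%.1fx")
    }
}
```

For the large datasets the ratio is well above 1, while for the smallest BSBM dataset it is below 1: local mode takes 0.04 minutes against 0.42 minutes on the cluster, so distribution only pays off once the data is large enough.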