Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies.
A major and still unsolved challenge that research faces today is to perform scalable analysis of large-scale knowledge graphs in order to facilitate applications like link prediction, knowledge base completion, and question answering.
Most machine learning approaches that scale horizontally (i.e. can be executed in a distributed environment) work on simpler feature-vector-based input rather than on more expressive knowledge structures.
On the other hand, learning methods that exploit expressive structures, e.g. Statistical Relational Learning and Inductive Logic Programming approaches, usually do not scale well to very large knowledge bases owing to their computational complexity.
This talk gives an overview of the ongoing Semantic Analytics Stack (SANSA) project, which aims to bridge this research gap by creating an out-of-the-box library for scalable, in-memory, structured learning.
4. What is Big Data?
There is no single definition. Two common ones:
- "Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions."
- "Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them."
5. What is Big Data?
Every day, 2.5 quintillion bytes of data are created - so much that 90% of the data in the world today has been created in the last two years alone.
It is not only about collecting or querying data; it is about learning from this tremendous amount of data for informed decision making.
6. Why is 'Big Data' so important?
Its relevance is increasing drastically, and Big Data Analytics is an emerging field to explore.
https://trends.google.com/trends/explore?date=all&q=%22big%20data%22
13. Knowledge Graphs
Modelling entities and their relationships: the RDF (Resource Description Framework) model.
Example graph (from the diagram): the entity DPDHL has the full name "Deutsche Post DHL Group", belongs to the industry Logistics (German label "Logistik"), and has its headquarters at the PostTower, which is located in Bonn.
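Written out, the example graph above is just a set of (subject, predicate, object) statements. A minimal sketch in plain Scala; the Triple case class and the string identifiers are illustrative, not SANSA's actual data model:

```scala
// An RDF statement: subject, predicate, object.
case class Triple(s: String, p: String, o: String)

// The DPDHL example graph from the slide as a set of triples.
val graph = Seq(
  Triple("DPDHL", "full name", "Deutsche Post DHL Group"),
  Triple("DPDHL", "industry", "Logistics"),
  Triple("Logistics", "label", "Logistik"),
  Triple("DPDHL", "headquarters", "PostTower"),
  Triple("PostTower", "located in", "Bonn")
)

// Traversing the graph is just filtering triples by subject or predicate.
val aboutDPDHL = graph.filter(_.s == "DPDHL")
println(aboutDPDHL.map(t => s"${t.p} -> ${t.o}").mkString("; "))
```
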
14. Knowledge Graphs
Modelling entities and their relationships.
Analysis: finding the underlying structure of the graph, e.g. to predict unknown relationships.
Examples: Google Knowledge Graph, DBpedia, Facebook, YAGO, Twitter, LinkedIn, MS Academic Graph, IBM Graph, WikiData.
15. Knowledge Graphs are everywhere
- Entity Search and Summarization
- Discovering Related Entities
17. Motivation
Over the last years, the size of the Semantic Web has increased and several large-scale datasets have been published.
As of March 2019: ~10,000 datasets openly available online using Semantic Web standards, plus many more datasets RDFized and kept private.
Source: LOD-Cloud (http://lod-cloud.net/)
18. Motivation
Dealing with such amounts of data makes many tasks hard to solve on single machines:
- Vocabulary Reuse: find a suitable vocabulary for your dataset
- Coverage Analysis: does the dataset contain the necessary information?
- Privacy Analysis: does the dataset contain sensitive information?
- Entity Linking: which datasets are good candidates for interlinking?
19. Why Distributed RDF Data Processing?
Tasks that are hard to solve on single machines (>1 TB memory consumption):
- Querying and processing LinkedGeoData
- Dataset statistics and quality assessment of the LOD Cloud
- Vandalism and outlier detection in DBpedia
- Inference on life science data (e.g. UniProt, EggNOG, StringDB)
- Clustering of DBpedia data
- Large-scale enrichment and link prediction, e.g. DBpedia → LinkedGeoData
21. Why combine Big Data and SW?
Comparison of "Big Data" Processing (Spark/Flink) vs. the Semantic Technology Stack:
- Data Integration: manual pre-processing vs. partially automated, standardised
- Modelling: simple (often flat feature vectors) vs. expressive
- Support for data exchange: limited (heterogeneous formats with limited schema information) vs. yes (RDF & OWL W3C Recommendations)
- Business value: direct vs. indirect
- Horizontally scalable: yes vs. no
Idea: combine the advantages of both worlds.
22. SANSA
SANSA is a data flow processing engine that provides data distribution and fault tolerance for distributed computation over large-scale RDF datasets.
SANSA includes several libraries:
- Read / Write RDF / OWL library
- Querying library
- Inference library
- ML library
(Architecture diagram: core APIs & libraries - Knowledge Distribution & Representation, Querying, Inference and Machine Learning - deployable on local, standalone or cluster resource managers; developed in the context of BigDataEurope.)
23. Knowledge Representation (KR) Layer
- Ingest RDF data in different formats using Jena API style interfaces
- Represent data in multiple formats (e.g. RDD, Data Frames, GraphX, Tensors)
- Allow transformation among these formats
- Compute RDF dataset statistics and apply quality assessment in a distributed manner
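As a flavour of what a distributed dataset statistic looks like, here is a predicate-usage count sketched with plain Scala collections; the same map/group/count shape carries over to Spark RDD operations (map plus reduceByKey). The Triple case class and the toy data are illustrative only:

```scala
case class Triple(s: String, p: String, o: String)

// A tiny toy dataset standing in for a large distributed one.
val triples = Seq(
  Triple("ex:a", "rdf:type", "ex:Person"),
  Triple("ex:a", "ex:name", "\"Alice\""),
  Triple("ex:b", "rdf:type", "ex:Person"),
  Triple("ex:b", "ex:name", "\"Bob\"")
)

// One of the simplest dataset statistics: how often each predicate occurs.
// On a Spark RDD the same computation is map(t => (t.p, 1)).reduceByKey(_ + _).
val predicateUsage: Map[String, Int] =
  triples.groupBy(_.p).map { case (p, ts) => (p, ts.size) }

println(predicateUsage) // e.g. Map(rdf:type -> 2, ex:name -> 2)
```
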
24. Query Layer
Make generic queries efficient and fast using:
- Intelligent indexing
- Splitting strategies
- Distributed storage
SPARQL query engine evaluation (SPARQL-to-SQL approaches, virtual views, direct mapping).
Provision of a W3C SPARQL-compliant endpoint.
25. Query Layer - the Sparklify approach
The SANSA engine ingests and partitions RDF data (RDF layer), then answers SPARQL queries by rewriting them against Sparqlify views over distributed data structures (Sparklifying) and returning the results.

Example SPARQL query:

  SELECT ?s ?w WHERE {
    ?s a dbp:Person .
    ?s ex:workPage ?w .
  }

Corresponding Sparqlify view definition:

  Prefix dbp:<http://dbpedia.org/ontology/>
  Prefix ex:<http://ex.org/>
  Create View view_person As
    Construct {
      ?s a dbp:Person .
      ?s ex:workPage ?w .
    }
    With
      ?s = uri('http://mydomain.org/person', ?id)
      ?w = uri(?work_page)
    Constrain
      ?w prefix "http://my-organization.org/user/"
    From
      person;

Resulting SQL:

  SELECT id, work_page
  FROM view_person;

Processing steps: parse the SPARQL query into a SPARQL Algebra Expression Tree (AET), normalize the AET, then translate it into SQL.
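To illustrate the SPARQL-to-SQL idea in miniature (a toy rewriter, not Sparqlify's actual algorithm), assume a vertically partitioned layout where each predicate is stored as a two-column table (s, o); a subject variable shared between two triple patterns then becomes a SQL join. All table and helper names here are hypothetical:

```scala
// A triple pattern with a fixed predicate and variables for subject/object.
case class Pattern(sVar: String, predicate: String, oVar: String)

// Assume vertical partitioning: one table per predicate with columns (s, o).
// Deriving table names from predicates like this is purely illustrative.
def tableOf(predicate: String): String =
  predicate.replaceAll("[^A-Za-z0-9]", "_")

// Rewrite a two-pattern basic graph pattern joined on a shared subject variable.
def toSql(p1: Pattern, p2: Pattern): String = {
  require(p1.sVar == p2.sVar, "this sketch only joins on a shared subject")
  s"SELECT t1.s AS ${p1.sVar}, t1.o AS ${p1.oVar}, t2.o AS ${p2.oVar} " +
  s"FROM ${tableOf(p1.predicate)} t1 " +
  s"JOIN ${tableOf(p2.predicate)} t2 ON t1.s = t2.s"
}

val sql = toSql(Pattern("s", "rdf:type", "t"), Pattern("s", "ex:workPage", "w"))
println(sql)
```
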
26. Inference Layer
- W3C standards for modelling: RDFS and OWL
- Parallel in-memory inference via rule-based forward chaining
- Beyond the state of the art: dynamically build a rule dependency graph for a rule set → adjustable performance levels
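Rule-based forward chaining can be sketched as a fixed-point computation: keep applying rules until no new triples are derived. A minimal plain-Scala example for a single RDFS rule, the transitivity of rdfs:subClassOf (rdfs11); SANSA's distributed implementation is of course more involved:

```scala
case class Triple(s: String, p: String, o: String)

val subClassOf = "rdfs:subClassOf"

// Apply the rule until no new triples are derived (fixed point).
// rdfs11: (a subClassOf b), (b subClassOf c) => (a subClassOf c)
def forwardChain(triples: Set[Triple]): Set[Triple] = {
  val derived = for {
    t1 <- triples if t1.p == subClassOf
    t2 <- triples if t2.p == subClassOf && t2.s == t1.o
  } yield Triple(t1.s, subClassOf, t2.o)
  val next = triples ++ derived
  if (next == triples) triples else forwardChain(next)
}

val asserted = Set(
  Triple("ex:Student", subClassOf, "ex:Person"),
  Triple("ex:Person", subClassOf, "ex:Agent")
)
val closure = forwardChain(asserted)
println(closure.size) // 3: the transitive edge Student -> Agent is derived
```
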
28. Machine Learning Layer
Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics.
Algorithms:
- Knowledge graph embeddings, e.g. for KB completion and link prediction
- Graph clustering: Power Iteration, BorderFlow, link-based, modularity-based clustering
- Semantic decision trees (in progress)
- Outlier detection
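Power Iteration, listed above, is at its core the repeated multiplication of a vector by an affinity matrix, which converges to the dominant eigenvector; Power Iteration Clustering uses truncated iterations of this kind as a low-dimensional embedding for clustering. A minimal dense sketch (not SANSA's distributed implementation):

```scala
// Dominant eigenvector of a small symmetric matrix via power iteration.
def powerIteration(m: Array[Array[Double]], steps: Int): Array[Double] = {
  var v = Array.tabulate(m.length)(_ + 1.0) // arbitrary non-zero start vector
  for (_ <- 1 to steps) {
    // Multiply matrix by vector, then normalize to unit length.
    val w = m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
    val norm = math.sqrt(w.map(x => x * x).sum)
    v = w.map(_ / norm)
  }
  v
}

// For [[2,1],[1,2]] the dominant eigenvector is (1,1)/sqrt(2), eigenvalue 3.
val v = powerIteration(Array(Array(2.0, 1.0), Array(1.0, 2.0)), 30)
println(v.mkString(", "))
```
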
31. Show me the Code

// Read an RDF file into a Spark RDD of triples
val triples = spark.rdf(Lang.NTRIPLES)(input)

// Define a SPARQL query
val sparqlQuery = "SELECT * WHERE {?s ?p ?o} LIMIT 10"

// Evaluate the SPARQL query over Spark; returns a DataFrame of triples
val result = triples.sparql(sparqlQuery)

// Cluster the result set via the Power Iteration Clustering (PIC) algorithm
val cluster = result
  .cluster(ClusteringAlgorithm.RDFGraphPowerIterationClustering)
  .setK(k)
  .setMaxIterations(maxIterations)
  .run()
33. Powered by SANSA

Blockchain - Alethio (https://aleth.io/)
Alethio is using SANSA to perform large-scale batch analytics, e.g. computing the asset turnover for sets of accounts, attack pattern frequencies, and Opcode usage statistics. SANSA was run on a 100-node cluster with 400 cores.

Big Data Platform - BDE (https://www.big-data-europe.eu/)
BDE uses the Mu Swarm Logger service to detect Docker events and convert their representation to RDF. SANSA is used for computing statistics over those logs within the BDE platform; to generate visualisations of the log statistics, BDE calls DistLODStats from SANSA-Notebooks.

Transparency and Compliance - SPIRIT (https://www.specialprivacy.eu/)
SANSA is used to analyse log information concerning personal data processing and sharing that is output from line-of-business applications on a continuous basis, and to present the information to the user via the SPIRIT dashboard.

Towards a European Data Space (http://boost40.eu/)
SANSA is used for covering heterogeneity between stakeholders and data providers for better and more efficient data processing, data management and data analytics.

10+ more use cases: http://sansa-stack.net/powered-by/
34. SANSA Pulse
SANSA 0.7 released in January 2020; releases every 6 months.
Apache open source license.
Project activity:
- Contributors (at least one commit): 17
- Commits per day: 7.3
- Commits in the previous year: 2675
- GitHub stars (all repos): 271
35. Conclusions and Next steps
SANSA is the only comprehensive, open-source RDF processing and analysis stack for distributed in-memory computing. It combines distributed in-memory computing and analytics (Apache Spark & Apache Flink) with the Semantic Web technology stack.
Next steps:
- Support for SPARQL 1.1 (Query Layer) via Ontop integration
- Backward chaining and better evaluation (Inference Layer)
- More algorithms and definition of ML pipelines (ML Layer)
37. Associated Publications (as of January 2020)
1. Distributed Semantic Analytics using the SANSA Stack. Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen. In Proceedings of the 16th International Semantic Web Conference - Resources Track (ISWC 2017), 2017.
2. The Tale of Sansa Spark. Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen. In Proceedings of the 16th International Semantic Web Conference (ISWC 2017), Posters & Demos, 2017.
3. DistLODStats: Distributed Computation of RDF Dataset Statistics. Gezim Sejdiu, Ivan Ermilov, Jens Lehmann and Mohamed Nadjib Mami. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018), 2018.
4. STATisfy Me: What are my Stats?. Gezim Sejdiu, Ivan Ermilov, Jens Lehmann and Mohamed Nadjib Mami. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018), Posters & Demos, 2018.
5. Profiting from Kitties on Ethereum: Leveraging Blockchain RDF with SANSA. Damien Graux, Gezim Sejdiu, Hajira Jabeen, Jens Lehmann, Danning Sui, Dominik Muhs and Johannes Pfeffer. In the 14th International Conference on Semantic Systems (SEMANTiCS 2018), Posters & Demos, 2018.
38. Associated Publications (continued)
6. SPIRIT: A Semantic Transparency and Compliance Stack. Patrick Westphal, Javier Fernández, Sabrina Kirrane and Jens Lehmann. In the 14th International Conference on Semantic Systems (SEMANTiCS 2018), Posters & Demos, 2018.
7. Divided we stand out! Forging Cohorts fOr Numeric Outlier Detection in large scale knowledge graphs (CONOD). Hajira Jabeen, Rajjat Dadwal, Gezim Sejdiu and Jens Lehmann. In the 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW 2018), 2018.
8. Clustering Pipelines of large RDF POI Data. Rajjat Dadwal, Damien Graux, Gezim Sejdiu, Hajira Jabeen and Jens Lehmann. In Proceedings of the 16th Extended Semantic Web Conference (ESWC 2019), Posters & Demos, 2019.
9. Sparklify: A Scalable Software Component for Efficient Evaluation of SPARQL Queries over Distributed RDF Datasets. Claus Stadler, Gezim Sejdiu, Damien Graux and Jens Lehmann. In Proceedings of the 18th International Semantic Web Conference (ISWC 2019), 2019.
10. A Scalable Framework for Quality Assessment of RDF Datasets. Gezim Sejdiu, Anisa Rula, Jens Lehmann and Hajira Jabeen. In Proceedings of the 18th International Semantic Web Conference (ISWC 2019), 2019.
11. Squerall: Virtual Ontology-Based Access to Heterogeneous and Large Data Sources. Mohamed Nadjib Mami, Damien Graux, Simon Scerri, Hajira Jabeen, Sören Auer and Jens Lehmann. In Proceedings of the 18th International Semantic Web Conference (ISWC 2019), 2019.
39. Associated Publications (continued)
12. Towards A Scalable Semantic-based Distributed Approach for SPARQL Query Evaluation. Gezim Sejdiu, Damien Graux, Imran Khan, Ioanna Lytra, Hajira Jabeen and Jens Lehmann. In the 15th International Conference on Semantic Systems (SEMANTiCS 2019), 2019.
13. Querying Large-scale RDF Datasets Using the SANSA Framework. Claus Stadler, Gezim Sejdiu, Damien Graux and Jens Lehmann. In Proceedings of the 18th International Semantic Web Conference (ISWC 2019), Posters & Demos, 2019.
14. The Hubs and Authorities Transaction Network Analysis Using the SANSA Framework. Danning Sui, Gezim Sejdiu, Damien Graux and Jens Lehmann. In the 15th International Conference on Semantic Systems (SEMANTiCS 2019), Posters & Demos, 2019.
15. DISE: A Distributed in-Memory SPARQL Processing Engine over Tensor Data. Hajira Jabeen, Eskender Haziiev, Gezim Sejdiu and Jens Lehmann. In the 14th IEEE International Conference on Semantic Computing (ICSC 2020), 2020.
42. Big Data Dimensions
Big Data is commonly characterised along five dimensions (the "5 Vs"):
- Volume: the size of the data
- Velocity: the speed at which the data is generated
- Variety: the different types of data
- Veracity: the trustworthiness of the data in terms of accuracy
- Value: just having Big Data is of no use unless we can turn it into value
47. Big Data Analytics
- Big data is more real-time in nature than traditional data warehousing (DW) applications
- Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data applications
- Shared-nothing, massively parallel, scale-out architectures are well-suited for big data applications
49. A Scalable Framework for Quality Assessment of RDF Datasets
Assessing data quality is of paramount importance to judge its fitness for a particular use case, e.g. along dimensions such as:
- Availability
- Completeness
- Consistency
- Interlinking
Contribution: a software framework for quality assessment of large-scale RDF datasets.
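Many quality metrics reduce to ratios computed over the triples of a dataset. A toy sketch of one such metric, the fraction of subjects carrying an rdfs:label; the metric choice and the data here are illustrative only, not the framework's actual metric set:

```scala
case class Triple(s: String, p: String, o: String)

// A tiny toy dataset: one labelled subject, one unlabelled subject.
val triples = Seq(
  Triple("ex:a", "rdfs:label", "\"Alice\""),
  Triple("ex:a", "rdf:type", "ex:Person"),
  Triple("ex:b", "rdf:type", "ex:Person")
)

// Toy completeness metric: fraction of subjects that have an rdfs:label.
val subjects = triples.map(_.s).distinct
val labelled = triples.filter(_.p == "rdfs:label").map(_.s).distinct
val completeness = labelled.size.toDouble / subjects.size

println(f"label completeness: $completeness%.2f") // label completeness: 0.50
```
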
50. A Scalable Framework for Quality Assessment of RDF Datasets

Runtime (in minutes):

Dataset        | Luzzu a) single | Luzzu b) joint | DistQA c) local | DistQA d) cluster
LinkedGeoData  | Fail            | Fail           | 446.90          | 7.79
DBpedia_en     | Fail            | Fail           | 274.31          | 1.99
DBpedia_de     | Fail            | Fail           | 61.40           | 0.46
DBpedia_fr     | Fail            | Fail           | 195.30          | 0.38
BSBM_0.01GB    | 2.64            | 2.65           | 0.04            | 0.42
BSBM_0.05GB    | 16.38           | 15.39          | 0.05            | 0.46
BSBM_0.1GB     | 40.59           | 37.94          | 0.06            | 0.44
BSBM_0.5GB     | 459.19          | 468.64         | 0.15            | 0.48
BSBM_1GB       | 1454.16         | 1532.95        | 0.40            | 0.56
BSBM_2GB       | Timeout         | Timeout        | 3.19            | 0.62
BSBM_10GB      | Timeout         | Timeout        | 29.44           | 0.52
BSBM_20GB      | Fail            | Fail           | 34.32           | 0.75
BSBM_200GB     | Fail            | Fail           | 454.46          | 7.27

Cluster configuration: 7 machines (1 master, 6 workers), each with an Intel Xeon E5-2620 v4 @ 2.10 GHz (32 cores), 128 GB RAM and 12 TB SATA RAID-5, running Spark 2.4.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8. Local mode: a single instance of the cluster. [10]