Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies.
A major and still unsolved challenge that research faces today is to perform scalable analysis of large-scale knowledge graphs in order to facilitate applications like link prediction, knowledge base completion, and question answering.
Most machine learning approaches that scale horizontally (i.e. can be executed in a distributed environment) work on simpler feature-vector-based input rather than on more expressive knowledge structures.
On the other hand, learning methods that exploit expressive structures, e.g. Statistical Relational Learning and Inductive Logic Programming approaches, usually do not scale well to very large knowledge bases owing to their computational complexity.
This talk gives an overview of the ongoing Semantic Analytics Stack (SANSA) project, which aims to bridge this research gap by creating an out-of-the-box library for scalable, in-memory, structured learning.
4. What is Big Data?
There is no single definition. Two common ones:
- "Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions."
- "Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them."
5. What is Big Data?
Every day, 2.5 quintillion bytes of data are created - so much that 90% of the data in the world today has been created in the last two years alone.
It is not only about collecting or querying data; it is about learning from this tremendous amount of data for informed decision making.
6. Why is 'Big Data' so important?
Its relevance is increasing drastically, and Big Data Analytics is an emerging field to explore.
https://trends.google.com/trends/explore?date=all&q=%22big%20data%22
13. Knowledge Graphs
Modelling entities and their relationships: the RDF (Resource Description Framework) model.
Example graph (from the diagram): the entity DPDHL has the full name "Deutsche Post DHL Group", belongs to the industry Logistics (German label "Logistik"), and has its headquarters at the PostTower, which is located in Bonn.
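Written out, the example graph above is just a set of (subject, predicate, object) statements. A minimal sketch in plain Scala; the Triple case class and the string identifiers are illustrative, not SANSA's actual data model:

```scala
// An RDF statement: subject, predicate, object.
case class Triple(s: String, p: String, o: String)

// The DPDHL example graph from the slide as a set of triples.
val graph = Seq(
  Triple("DPDHL", "full name", "Deutsche Post DHL Group"),
  Triple("DPDHL", "industry", "Logistics"),
  Triple("Logistics", "label", "Logistik"),
  Triple("DPDHL", "headquarters", "PostTower"),
  Triple("PostTower", "located in", "Bonn")
)

// Traversing the graph is just filtering triples by subject or predicate.
val aboutDPDHL = graph.filter(_.s == "DPDHL")
println(aboutDPDHL.map(t => s"${t.p} -> ${t.o}").mkString("; "))
```
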
14. Knowledge Graphs
Modelling entities and their relationships.
Analysis: finding the underlying structure of the graph, e.g. to predict unknown relationships.
Examples: Google Knowledge Graph, DBpedia, Facebook, YAGO, Twitter, LinkedIn, MS Academic Graph, IBM Graph, WikiData.
15. Knowledge Graphs are everywhere
- Entity Search and Summarization
- Discovering Related Entities
17. Motivation
Over the last years, the size of the Semantic Web has increased and several large-scale datasets have been published.
As of March 2019: ~10,000 datasets openly available online using Semantic Web standards, plus many more datasets RDFized and kept private.
Source: LOD-Cloud (http://lod-cloud.net/)
18. Motivation
Dealing with such amounts of data makes many tasks hard to solve on single machines:
- Vocabulary Reuse: find a suitable vocabulary for your dataset
- Coverage Analysis: does the dataset contain the necessary information?
- Privacy Analysis: does the dataset contain sensitive information?
- Entity Linking: which datasets are good candidates for interlinking?
19. Why Distributed RDF Data Processing?
Tasks that are hard to solve on single machines (>1 TB memory consumption):
- Querying and processing LinkedGeoData
- Dataset statistics and quality assessment of the LOD Cloud
- Vandalism and outlier detection in DBpedia
- Inference on life science data (e.g. UniProt, EggNOG, StringDB)
- Clustering of DBpedia data
- Large-scale enrichment and link prediction, e.g. DBpedia → LinkedGeoData
21. Why combine Big Data and SW?
Comparison of "Big Data" Processing (Spark/Flink) vs. the Semantic Technology Stack:
- Data Integration: manual pre-processing vs. partially automated, standardised
- Modelling: simple (often flat feature vectors) vs. expressive
- Support for data exchange: limited (heterogeneous formats with limited schema information) vs. yes (RDF & OWL W3C Recommendations)
- Business value: direct vs. indirect
- Horizontally scalable: yes vs. no
Idea: combine the advantages of both worlds.
22. SANSA
SANSA is a data flow processing engine that provides data distribution and fault tolerance for distributed computation over large-scale RDF datasets.
SANSA includes several libraries:
- Read / Write RDF / OWL library
- Querying library
- Inference library
- ML library
(Architecture diagram: core APIs & libraries - Knowledge Distribution & Representation, Querying, Inference and Machine Learning - deployable on local, standalone or cluster resource managers; developed in the context of BigDataEurope.)
23. Knowledge Representation (KR) Layer
- Ingest RDF data in different formats using Jena API style interfaces
- Represent data in multiple formats (e.g. RDD, Data Frames, GraphX, Tensors)
- Allow transformation among these formats
- Compute RDF dataset statistics and apply quality assessment in a distributed manner
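As a flavour of what a distributed dataset statistic looks like, here is a predicate-usage count sketched with plain Scala collections; the same map/group/count shape carries over to Spark RDD operations (map plus reduceByKey). The Triple case class and the toy data are illustrative only:

```scala
case class Triple(s: String, p: String, o: String)

// A tiny toy dataset standing in for a large distributed one.
val triples = Seq(
  Triple("ex:a", "rdf:type", "ex:Person"),
  Triple("ex:a", "ex:name", "\"Alice\""),
  Triple("ex:b", "rdf:type", "ex:Person"),
  Triple("ex:b", "ex:name", "\"Bob\"")
)

// One of the simplest dataset statistics: how often each predicate occurs.
// On a Spark RDD the same computation is map(t => (t.p, 1)).reduceByKey(_ + _).
val predicateUsage: Map[String, Int] =
  triples.groupBy(_.p).map { case (p, ts) => (p, ts.size) }

println(predicateUsage) // e.g. Map(rdf:type -> 2, ex:name -> 2)
```
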
24. Query Layer
Make generic queries efficient and fast using:
- Intelligent indexing
- Splitting strategies
- Distributed storage
SPARQL query engine evaluation (SPARQL-to-SQL approaches, virtual views, direct mapping).
Provision of a W3C SPARQL-compliant endpoint.
25. Query Layer - the Sparklify approach
The SANSA engine ingests and partitions RDF data (RDF layer), then answers SPARQL queries by rewriting them against Sparqlify views over distributed data structures (Sparklifying) and returning the results.

Example SPARQL query:

  SELECT ?s ?w WHERE {
    ?s a dbp:Person .
    ?s ex:workPage ?w .
  }

Corresponding Sparqlify view definition:

  Prefix dbp:<http://dbpedia.org/ontology/>
  Prefix ex:<http://ex.org/>
  Create View view_person As
    Construct {
      ?s a dbp:Person .
      ?s ex:workPage ?w .
    }
    With
      ?s = uri('http://mydomain.org/person', ?id)
      ?w = uri(?work_page)
    Constrain
      ?w prefix "http://my-organization.org/user/"
    From
      person;

Resulting SQL:

  SELECT id, work_page
  FROM view_person;

Processing steps: parse the SPARQL query into a SPARQL Algebra Expression Tree (AET), normalize the AET, then translate it into SQL.
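To illustrate the SPARQL-to-SQL idea in miniature (a toy rewriter, not Sparqlify's actual algorithm), assume a vertically partitioned layout where each predicate is stored as a two-column table (s, o); a subject variable shared between two triple patterns then becomes a SQL join. All table and helper names here are hypothetical:

```scala
// A triple pattern with a fixed predicate and variables for subject/object.
case class Pattern(sVar: String, predicate: String, oVar: String)

// Assume vertical partitioning: one table per predicate with columns (s, o).
// Deriving table names from predicates like this is purely illustrative.
def tableOf(predicate: String): String =
  predicate.replaceAll("[^A-Za-z0-9]", "_")

// Rewrite a two-pattern basic graph pattern joined on a shared subject variable.
def toSql(p1: Pattern, p2: Pattern): String = {
  require(p1.sVar == p2.sVar, "this sketch only joins on a shared subject")
  s"SELECT t1.s AS ${p1.sVar}, t1.o AS ${p1.oVar}, t2.o AS ${p2.oVar} " +
  s"FROM ${tableOf(p1.predicate)} t1 " +
  s"JOIN ${tableOf(p2.predicate)} t2 ON t1.s = t2.s"
}

val sql = toSql(Pattern("s", "rdf:type", "t"), Pattern("s", "ex:workPage", "w"))
println(sql)
```
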
26. Inference Layer
- W3C standards for modelling: RDFS and OWL
- Parallel in-memory inference via rule-based forward chaining
- Beyond the state of the art: dynamically build a rule dependency graph for a rule set → adjustable performance levels
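Rule-based forward chaining can be sketched as a fixed-point computation: keep applying rules until no new triples are derived. A minimal plain-Scala example for a single RDFS rule, the transitivity of rdfs:subClassOf (rdfs11); SANSA's distributed implementation is of course more involved:

```scala
case class Triple(s: String, p: String, o: String)

val subClassOf = "rdfs:subClassOf"

// Apply the rule until no new triples are derived (fixed point).
// rdfs11: (a subClassOf b), (b subClassOf c) => (a subClassOf c)
def forwardChain(triples: Set[Triple]): Set[Triple] = {
  val derived = for {
    t1 <- triples if t1.p == subClassOf
    t2 <- triples if t2.p == subClassOf && t2.s == t1.o
  } yield Triple(t1.s, subClassOf, t2.o)
  val next = triples ++ derived
  if (next == triples) triples else forwardChain(next)
}

val asserted = Set(
  Triple("ex:Student", subClassOf, "ex:Person"),
  Triple("ex:Person", subClassOf, "ex:Agent")
)
val closure = forwardChain(asserted)
println(closure.size) // 3: the transitive edge Student -> Agent is derived
```
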
28. Machine Learning Layer
Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics.
Algorithms:
- Knowledge graph embeddings, e.g. for KB completion and link prediction
- Graph clustering: Power Iteration, BorderFlow, link-based, modularity-based clustering
- Semantic decision trees (in progress)
- Outlier detection
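Power Iteration, listed above, is at its core the repeated multiplication of a vector by an affinity matrix, which converges to the dominant eigenvector; Power Iteration Clustering uses truncated iterations of this kind as a low-dimensional embedding for clustering. A minimal dense sketch (not SANSA's distributed implementation):

```scala
// Dominant eigenvector of a small symmetric matrix via power iteration.
def powerIteration(m: Array[Array[Double]], steps: Int): Array[Double] = {
  var v = Array.tabulate(m.length)(_ + 1.0) // arbitrary non-zero start vector
  for (_ <- 1 to steps) {
    // Multiply matrix by vector, then normalize to unit length.
    val w = m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
    val norm = math.sqrt(w.map(x => x * x).sum)
    v = w.map(_ / norm)
  }
  v
}

// For [[2,1],[1,2]] the dominant eigenvector is (1,1)/sqrt(2), eigenvalue 3.
val v = powerIteration(Array(Array(2.0, 1.0), Array(1.0, 2.0)), 30)
println(v.mkString(", "))
```
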
31. Show me the Code

// Read an RDF file into a Spark RDD of triples
val triples = spark.rdf(Lang.NTRIPLES)(input)

// Define a SPARQL query
val sparqlQuery = "SELECT * WHERE {?s ?p ?o} LIMIT 10"

// Evaluate the SPARQL query over Spark; returns a DataFrame of triples
val result = triples.sparql(sparqlQuery)

// Cluster the result set via the Power Iteration Clustering (PIC) algorithm
val cluster = result
  .cluster(ClusteringAlgorithm.RDFGraphPowerIterationClustering)
  .setK(k)
  .setMaxIterations(maxIterations)
  .run()
33. Powered by SANSA

Blockchain - Alethio (https://aleth.io/)
Alethio is using SANSA to perform large-scale batch analytics, e.g. computing the asset turnover for sets of accounts, attack pattern frequencies, and Opcode usage statistics. SANSA was run on a 100-node cluster with 400 cores.

Big Data Platform - BDE (https://www.big-data-europe.eu/)
BDE uses the Mu Swarm Logger service to detect Docker events and convert their representation to RDF. SANSA is used for computing statistics over those logs within the BDE platform; to generate visualisations of the log statistics, BDE calls DistLODStats from SANSA-Notebooks.

Transparency and Compliance - SPIRIT (https://www.specialprivacy.eu/)
SANSA is used to analyse log information concerning personal data processing and sharing that is output from line-of-business applications on a continuous basis, and to present the information to the user via the SPIRIT dashboard.

Towards a European Data Space (http://boost40.eu/)
SANSA is used for covering heterogeneity between stakeholders and data providers for better and more efficient data processing, data management and data analytics.

10+ more use cases: http://sansa-stack.net/powered-by/
34. SANSA Pulse
SANSA 0.7 released in January 2020; releases every 6 months.
Apache open source license.
Project activity:
- Contributors (at least one commit): 17
- Commits per day: 7.3
- Commits in the previous year: 2675
- GitHub stars (all repos): 271
35. Conclusions and Next steps
SANSA is the only comprehensive, open-source RDF processing and analysis stack for distributed in-memory computing. It combines distributed in-memory computing and analytics (Apache Spark & Apache Flink) with the Semantic Web technology stack.
Next steps:
- Support for SPARQL 1.1 (Query Layer) via Ontop integration
- Backward chaining and better evaluation (Inference Layer)
- More algorithms and definition of ML pipelines (ML Layer)
37. Associated Publications (as of January 2020)
1. Distributed Semantic Analytics using the SANSA Stack. Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen. In Proceedings of the 16th International Semantic Web Conference - Resources Track (ISWC 2017), 2017.
2. The Tale of Sansa Spark. Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen. In Proceedings of the 16th International Semantic Web Conference (ISWC 2017), Posters & Demos, 2017.
3. DistLODStats: Distributed Computation of RDF Dataset Statistics. Gezim Sejdiu, Ivan Ermilov, Jens Lehmann and Mohamed Nadjib Mami. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018), 2018.
4. STATisfy Me: What are my Stats?. Gezim Sejdiu, Ivan Ermilov, Jens Lehmann and Mohamed Nadjib Mami. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018), Posters & Demos, 2018.
5. Profiting from Kitties on Ethereum: Leveraging Blockchain RDF with SANSA. Damien Graux, Gezim Sejdiu, Hajira Jabeen, Jens Lehmann, Danning Sui, Dominik Muhs and Johannes Pfeffer. In the 14th International Conference on Semantic Systems (SEMANTiCS 2018), Posters & Demos, 2018.
38. Associated Publications (continued)
6. SPIRIT: A Semantic Transparency and Compliance Stack. Patrick Westphal, Javier Fernández, Sabrina Kirrane and Jens Lehmann. In the 14th International Conference on Semantic Systems (SEMANTiCS 2018), Posters & Demos, 2018.
7. Divided we stand out! Forging Cohorts fOr Numeric Outlier Detection in large scale knowledge graphs (CONOD). Hajira Jabeen, Rajjat Dadwal, Gezim Sejdiu and Jens Lehmann. In the 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW 2018), 2018.
8. Clustering Pipelines of large RDF POI Data. Rajjat Dadwal, Damien Graux, Gezim Sejdiu, Hajira Jabeen and Jens Lehmann. In Proceedings of the 16th Extended Semantic Web Conference (ESWC 2019), Posters & Demos, 2019.
9. Sparklify: A Scalable Software Component for Efficient Evaluation of SPARQL Queries over Distributed RDF Datasets. Claus Stadler, Gezim Sejdiu, Damien Graux and Jens Lehmann. In Proceedings of the 18th International Semantic Web Conference (ISWC 2019), 2019.
10. A Scalable Framework for Quality Assessment of RDF Datasets. Gezim Sejdiu, Anisa Rula, Jens Lehmann and Hajira Jabeen. In Proceedings of the 18th International Semantic Web Conference (ISWC 2019), 2019.
11. Squerall: Virtual Ontology-Based Access to Heterogeneous and Large Data Sources. Mohamed Nadjib Mami, Damien Graux, Simon Scerri, Hajira Jabeen, Sören Auer and Jens Lehmann. In Proceedings of the 18th International Semantic Web Conference (ISWC 2019), 2019.
39. Associated Publications (continued)
12. Towards A Scalable Semantic-based Distributed Approach for SPARQL Query Evaluation. Gezim Sejdiu, Damien Graux, Imran Khan, Ioanna Lytra, Hajira Jabeen and Jens Lehmann. In the 15th International Conference on Semantic Systems (SEMANTiCS 2019), 2019.
13. Querying Large-scale RDF Datasets Using the SANSA Framework. Claus Stadler, Gezim Sejdiu, Damien Graux and Jens Lehmann. In Proceedings of the 18th International Semantic Web Conference (ISWC 2019), Posters & Demos, 2019.
14. The Hubs and Authorities Transaction Network Analysis Using the SANSA Framework. Danning Sui, Gezim Sejdiu, Damien Graux and Jens Lehmann. In the 15th International Conference on Semantic Systems (SEMANTiCS 2019), Posters & Demos, 2019.
15. DISE: A Distributed in-Memory SPARQL Processing Engine over Tensor Data. Hajira Jabeen, Eskender Haziiev, Gezim Sejdiu and Jens Lehmann. In the 14th IEEE International Conference on Semantic Computing (ICSC 2020), 2020.
42. Big Data Dimensions
Big Data is commonly characterised along five dimensions (the "5 Vs"):
- Volume: the size of the data
- Velocity: the speed at which the data is generated
- Variety: the different types of data
- Veracity: the trustworthiness of the data in terms of accuracy
- Value: just having Big Data is of no use unless we can turn it into value
47. Big Data Analytics
- Big data is more real-time in nature than traditional data warehousing (DW) applications
- Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data applications
- Shared-nothing, massively parallel, scale-out architectures are well-suited for big data applications
49. A Scalable Framework for Quality Assessment of RDF Datasets
Assessing data quality is of paramount importance to judge its fitness for a particular use case, e.g. along dimensions such as:
- Availability
- Completeness
- Consistency
- Interlinking
Contribution: a software framework for quality assessment of large-scale RDF datasets.
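Many quality metrics reduce to ratios computed over the triples of a dataset. A toy sketch of one such metric, the fraction of subjects carrying an rdfs:label; the metric choice and the data here are illustrative only, not the framework's actual metric set:

```scala
case class Triple(s: String, p: String, o: String)

// A tiny toy dataset: one labelled subject, one unlabelled subject.
val triples = Seq(
  Triple("ex:a", "rdfs:label", "\"Alice\""),
  Triple("ex:a", "rdf:type", "ex:Person"),
  Triple("ex:b", "rdf:type", "ex:Person")
)

// Toy completeness metric: fraction of subjects that have an rdfs:label.
val subjects = triples.map(_.s).distinct
val labelled = triples.filter(_.p == "rdfs:label").map(_.s).distinct
val completeness = labelled.size.toDouble / subjects.size

println(f"label completeness: $completeness%.2f") // label completeness: 0.50
```
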
50. A Scalable Framework for Quality Assessment of RDF Datasets

Runtime (in minutes):

Dataset        | Luzzu a) single | Luzzu b) joint | DistQA c) local | DistQA d) cluster
LinkedGeoData  | Fail            | Fail           | 446.90          | 7.79
DBpedia_en     | Fail            | Fail           | 274.31          | 1.99
DBpedia_de     | Fail            | Fail           | 61.40           | 0.46
DBpedia_fr     | Fail            | Fail           | 195.30          | 0.38
BSBM_0.01GB    | 2.64            | 2.65           | 0.04            | 0.42
BSBM_0.05GB    | 16.38           | 15.39          | 0.05            | 0.46
BSBM_0.1GB     | 40.59           | 37.94          | 0.06            | 0.44
BSBM_0.5GB     | 459.19          | 468.64         | 0.15            | 0.48
BSBM_1GB       | 1454.16         | 1532.95        | 0.40            | 0.56
BSBM_2GB       | Timeout         | Timeout        | 3.19            | 0.62
BSBM_10GB      | Timeout         | Timeout        | 29.44           | 0.52
BSBM_20GB      | Fail            | Fail           | 34.32           | 0.75
BSBM_200GB     | Fail            | Fail           | 454.46          | 7.27

Cluster configuration: 7 machines (1 master, 6 workers), each with an Intel Xeon E5-2620 v4 @ 2.10 GHz (32 cores), 128 GB RAM and 12 TB SATA RAID-5, running Spark 2.4.0, Hadoop 2.8.0, Scala 2.11.11 and Java 8. Local mode: a single instance of the cluster. [10]