Big linked geospatial data tools in ExtremeEarth
1. ExtremeEarth
From Copernicus Big Data
to Extreme Earth Analytics
This project has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No 825258.
2. 12 September 2019
Earth Observation Φ-week, ESA-ESRIN, Frascati, Italy
Manolis Koubarakis, Konstantina Bereta, Dimitris Bilidas, Angelos
Charalabidis, Konstantinos Giannousis, Theofilos Ioannidis, Vangelis
Karkaletsis, Stasinos Konstantopoulos, Despina-Athanasia Pantazi,
George Stamoulis
Big Linked Geospatial Data Tools in ExtremeEarth
3. 3
Linked Open Data
● The vision of linked open data is to go from a Web of
documents to a Web of data:
○ Unlock open data dormant in their silos.
○ Make it available on the Web using Semantic Web
Technologies (HTTP, URIs, RDF, SPARQL)
○ Interlink it with other open data (e.g., from the
European data portal)
4. 4
Big Linked Geospatial Data
● Information and knowledge extracted from EO data is voluminous: 1 PB of
Sentinel data may contain >750×10³ products, which will result in >450 TB of
information and knowledge (e.g., classes of objects).
● >106 PB of data in the Copernicus Open Access Hub
● ExtremeEarth will develop tools for transforming, integrating, querying and
performing geospatial analytics over the big information and knowledge that will
be mined from Copernicus data and other auxiliary data sources using the
deep learning techniques in the project’s pipeline.
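As a sanity check on these figures, the implied volume of derived knowledge per product is easy to compute. This is a back-of-envelope sketch using the slide's numbers, not a measurement:

```python
# Back-of-envelope check of the per-product knowledge volume implied by
# the slide's figures (assumed inputs, not measured values).
products_per_pb = 750e3   # >750x10^3 products in 1 PB of Sentinel data
derived_tb = 450.0        # >450 TB of extracted information/knowledge

mb_per_product = derived_tb / products_per_pb * 1e6  # TB -> MB
print(f"~{mb_per_product:.0f} MB of derived knowledge per product")
# -> ~600 MB per product
```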
5. 5
Overview of tools
● GeoTriples - Publishing geospatial data as RDF graphs
● JedAI - Interlinking framework
● Strabon - A state-of-the-art spatiotemporal RDF store
● Semagrow - A federated SPARQL query processor
● Geographica - A benchmark for big linked geospatial tools
7. 7
GeoTriples-Spark
● GeoTriples-Spark is a new implementation of GeoTriples, capable of transforming
big geospatial data into RDF.
● We extended GeoTriples to run on top of Hops and Apache Spark.
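The core transformation can be sketched as follows. This is an illustrative stand-in for the output GeoTriples produces (the real tool is driven by R2RML/RML mapping files); the namespaces and record layout here are hypothetical. On Spark, a function like this would be applied to each input partition, e.g. via flatMap:

```python
# Minimal sketch of the transformation GeoTriples performs: mapping
# records with a geometry column to RDF triples whose geometries are
# GeoSPARQL WKT literals. Namespaces and field names are illustrative,
# not the actual GeoTriples mapping vocabulary.

GEO = "http://www.opengis.net/ont/geosparql#"
EX = "http://example.org/"   # hypothetical namespace

def record_to_triples(record):
    """Map one record {id, name, wkt} to N-Triples lines."""
    s = f'<{EX}feature/{record["id"]}>'
    geom = f'<{EX}geometry/{record["id"]}>'
    return [
        f'{s} <{EX}name> "{record["name"]}" .',
        f'{s} <{GEO}hasGeometry> {geom} .',
        f'{geom} <{GEO}asWKT> "{record["wkt"]}"^^<{GEO}wktLiteral> .',
    ]

triples = record_to_triples(
    {"id": 42, "name": "Frascati", "wkt": "POINT(12.6802 41.8058)"})
for t in triples:
    print(t)
```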
10. 10
JedAI Interlinking Framework
● Based on meta-blocking with the ability to discover geospatial relationships
among resources in geospatial RDF stores
● Geospatial interlinking is also implemented in the Silk and Radon tools
● Adaptation of all JedAI algorithms, including meta-blocking, to Apache Spark
for massive parallelization
● Integration of state-of-the-art, massively parallel geospatial interlinking
methods in JedAI
● Development of novel geospatial interlinking methods based on meta-blocking
https://jedai.scify.org/
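The blocking idea behind scalable geospatial interlinking can be illustrated with a toy grid-based scheme: each geometry's bounding box is indexed into equi-width grid cells, and only entities that share a cell become candidate pairs. This is a simplified sketch of the general technique, not JedAI's actual algorithms or API:

```python
# Toy grid blocking for geospatial interlinking: index bounding boxes
# into equi-width grid cells, then compare only entities sharing a cell.
from collections import defaultdict
from itertools import product

def cells(bbox, size=1.0):
    """Grid cells overlapped by bbox = (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bbox
    xs = range(int(min_x // size), int(max_x // size) + 1)
    ys = range(int(min_y // size), int(max_y // size) + 1)
    return {(x, y) for x, y in product(xs, ys)}

def candidate_pairs(source, target, size=1.0):
    """Pairs (source_id, target_id) whose bboxes share a grid cell."""
    index = defaultdict(set)
    for sid, bbox in source.items():
        for c in cells(bbox, size):
            index[c].add(sid)
    pairs = set()
    for tid, bbox in target.items():
        for c in cells(bbox, size):
            for sid in index[c]:
                pairs.add((sid, tid))
    return pairs

src = {"a": (0.1, 0.1, 0.4, 0.4), "b": (5.0, 5.0, 5.5, 5.5)}
tgt = {"x": (0.2, 0.2, 0.3, 0.3), "y": (9.0, 9.0, 9.2, 9.2)}
print(candidate_pairs(src, tgt))  # only ("a", "x") shares a cell
```

Meta-blocking then prunes this candidate set further by weighting blocks and discarding low-utility comparisons, which is what makes the massive parallelization on Spark pay off.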
11. 11
Strabon - A state-of-the-art spatiotemporal RDF
store
http://strabon.di.uoa.gr
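For illustration, the following shows the kind of GeoSPARQL query Strabon answers: selecting features whose WKT geometry falls within a given polygon. The prefixes and the geof:sfWithin function come from the OGC GeoSPARQL standard; the commented endpoint URL is a placeholder, not a real deployment:

```python
# An example of the kind of GeoSPARQL query Strabon evaluates: features
# whose geometry lies within a polygon. Prefixes and geof:sfWithin are
# from the OGC GeoSPARQL standard; dataset and endpoint are hypothetical.
query = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?feature ?wkt
WHERE {
  ?feature geo:hasGeometry ?geom .
  ?geom geo:asWKT ?wkt .
  FILTER (geof:sfWithin(?wkt,
    "POLYGON((12 41, 13 41, 13 42, 12 42, 12 41))"^^geo:wktLiteral))
}
"""

# Posting the query to a Strabon endpoint (URL is a placeholder):
# import urllib.request, urllib.parse
# data = urllib.parse.urlencode({"query": query}).encode()
# urllib.request.urlopen("http://localhost:8080/strabon/Query", data)
print(query)
```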
12. 12
Strabon on the cloud
● The objective is to develop a new cloud-based Strabon implementation. The
results of this task will make Strabon, currently the state-of-the-art open-source geospatial
RDF store, the first distributed GeoSPARQL engine for big geospatial data and extreme
geospatial analytics.
● Design and implement a horizontally scalable RDF store up to the PB scale. This means that
the components of the system should have the following capabilities and characteristics:
1. Ingest/bulk import RDF data (e.g., N-Triples, Turtle, TriG, N-Quads, JSON-LD,
RDF/XML) that will be provided in a Hadoop-based distributed filesystem
2. Store the imported data in a distributed data store (e.g., HBase or Accumulo)
3. Create spatial indexes in addition to all standard indexing schemes
4. Provide the interfaces to query the stored data with SPARQL v1.1 and export the
results in various formats
● Two candidate approaches:
○ GeoSPARK
○ GeoMesa
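Step 1 of this pipeline can be sketched with a minimal N-Triples parser that turns input lines into (subject, predicate, object) tuples; on the cluster, such a function would be mapped over file partitions (e.g. with Spark). The parser below handles only simple IRI and literal forms, not the full N-Triples grammar:

```python
# Sketch of ingestion step 1: parsing N-Triples lines into
# (subject, predicate, object) tuples before loading them into a
# distributed store. Handles only simple IRIs and plain/typed/tagged
# literals, not the full N-Triples grammar (no blank nodes, escapes).
import re

TRIPLE = re.compile(
    r'^(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>|"[^"]*"(?:\^\^<[^>]*>|@\w+)?)\s*\.$')

def parse_ntriples(lines):
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = TRIPLE.match(line)
        if m:
            yield m.groups()

data = [
    '<http://ex.org/a> <http://ex.org/p> "hello" .',
    '<http://ex.org/a> <http://ex.org/q> <http://ex.org/b> .',
]
print(list(parse_ntriples(data)))
```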
14. 14
Strabon on the cloud (3)
● Use the GeoSPARK library:
○ Rich spatial functionality on Apache Spark
○ Supports both RDD and SQL/DataFrame APIs
○ Easy integration with the HOPS platform
● Strabon with GeoSPARK has been tested on Hops.site
○ Datasets from Geographica benchmark with 100M and 500M triples
○ 3 queries of varying selectivity
○ Same storage schema (tables) as in centralized Strabon PostGIS storage
○ 64 executors, 6 GB executor memory, 2 executor cores, 24G master memory, 8 master
cores
15. 15
Experiments (times in seconds)

Query | Dataset | Strabon (cold cache) | Strabon (warm cache) | GeoSPARK on HOPS
SC1   | 100M    |  284                 |  224                 |  54
SC1   | 500M    | 1008                 |  975                 | 102
SC2   | 100M    |   87                 |   10                 |  46
SC2   | 500M    | 2785                 |  297                 | 348
SC3   | 100M    |   83                 |    9                 |  60
SC3   | 500M    | 2005                 |  214                 | 324

(-) Centralized Strabon performs better on warm cache for SC2 and SC3, even
for the 500M dataset
(+) GeoSPARK seems to achieve better scalability. Needs further
experiments with larger datasets
16. 16
Improving the Storage Schema for RDF Data
● Storage schema used in the experiments is not optimized for cloud
processing
● Extended Vertical Partitioning schema from S2RDF seems promising
○ computes semijoins between RDF predicates depending on selectivity
○ caches the semijoin results
● Implementing a data loader in HOPS that reads RDF data and creates
Parquet files according to Extended Vertical Partitioning
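The idea can be illustrated with an in-memory toy: vertical partitioning keeps one (subject, object) table per predicate, and ExtVP pre-computes semijoin reductions of these tables so that joins touch fewer rows. Real S2RDF/ExtVP materialises the reductions as Parquet tables on the distributed filesystem; the dicts below are only a sketch:

```python
# Toy illustration of vertical partitioning and the semijoin reductions
# behind S2RDF's Extended Vertical Partitioning (ExtVP). Real ExtVP
# stores these reductions as Parquet tables; here they are plain dicts.
from collections import defaultdict

triples = [
    ("s1", "p", "o1"), ("s2", "p", "o2"),
    ("s1", "q", "o3"), ("s3", "q", "o4"),
]

# Vertical partitioning: one (subject, object) table per predicate.
vp = defaultdict(list)
for s, p, o in triples:
    vp[p].append((s, o))

def extvp_ss(p1, p2):
    """ExtVP subject-subject reduction: rows of VP[p1] whose subject
    also appears as a subject in VP[p2]."""
    subjects_p2 = {s for s, _ in vp[p2]}
    return [(s, o) for s, o in vp[p1] if s in subjects_p2]

# Only s1 has both predicates, so a join on ?s scans one row, not two:
print(extvp_ss("p", "q"))
```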
17. 17
Semagrow - A federated SPARQL query
processor
http://semagrow.github.io/
● Semagrow exposes a single SPARQL endpoint that federates a number of heterogeneous data sources.
● Targets the federation of heterogeneous and independently provided data sources. In other words,
Semagrow aims to offer the most efficient distributed querying solution that can be achieved without
controlling the way data is distributed between sources and, in general, without having the responsibility to
centrally manage the data sources of the federation.
● Integrate multiple data sources at query-time.
○ No need to ingest the whole dataset in order to integrate.
○ All queries return updated answers.
○ Easily add and remove a data source from the federation without affecting the application.
● Data sources can be heterogeneous.
● Semagrow can handle different systems with different query expressivity. Currently supports:
■ SPARQL 1.0 and 1.1
■ CassandraQL
● Objective in ExtremeEarth: Integrate geospatial operations in multiple big geospatial data sources.
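What a federated processor does at query time can be sketched as follows: the same pattern is forwarded to each member endpoint and the answers are merged, without centralising any data. The endpoints here are in-memory stand-ins for remote SPARQL services, not Semagrow's actual API:

```python
# Minimal sketch of query-time federation: forward a triple pattern to
# each member endpoint and merge the answers, never centralising data.
# Endpoints are in-memory stand-ins for remote SPARQL services.

def match(triple, pattern):
    """A pattern is a triple with None for unbound positions."""
    return all(p is None or t == p for t, p in zip(triple, pattern))

def endpoint_a(pattern):
    data = [("s1", "type", "River"), ("s2", "type", "Lake")]
    return [t for t in data if match(t, pattern)]

def endpoint_b(pattern):
    data = [("s3", "type", "River")]
    return [t for t in data if match(t, pattern)]

def federated_query(pattern, endpoints):
    results = []
    for ep in endpoints:
        results.extend(ep(pattern))   # in practice: HTTP SPARQL calls
    return results

print(federated_query((None, "type", "River"), [endpoint_a, endpoint_b]))
```

A real federation engine additionally estimates which sources can answer which patterns, so irrelevant endpoints are never contacted; that source-selection step is omitted here.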
18. 18
Geographica - A benchmark for big linked
geospatial tools
● Geographica 2 is a recently published version of the Geographica benchmark for evaluating
geospatial RDF stores. It improves on the original by adding more workloads, extending the
existing ones and evaluating more RDF stores. It tests the efficiency of primitive spatial
functions in RDF stores and the performance of single-node RDF stores in real use-case
scenarios; a more detailed evaluation is performed using a synthetic workload, and finally the
scalability of the RDF stores is revealed with the scalability workload.
● Geographica will be further extended with data sources and scenarios from the two use cases of
ExtremeEarth, so that it can be used to evaluate the performance of the transformation,
interlinking, querying and integration systems in cloud environments.
● http://geographica2.di.uoa.gr/
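The cold/warm-cache methodology used in the experiments earlier in this deck can be sketched with a small timing harness: the first execution of a query gives the cold time, and repeated executions are averaged for the warm time. The query runner below is a placeholder for a call to a real RDF store:

```python
# Sketch of the cold/warm-cache timing methodology used by benchmarks
# such as Geographica. The query runner is a stand-in for posting the
# query to a real SPARQL endpoint.
import time

def run_query(q):
    # placeholder for a real endpoint call; simulate some work
    return sum(i * i for i in range(10000))

def benchmark(q, warm_runs=3):
    t0 = time.perf_counter()
    run_query(q)
    cold = time.perf_counter() - t0      # first run: cold cache

    warm_times = []
    for _ in range(warm_runs):           # repeated runs: warm cache
        t0 = time.perf_counter()
        run_query(q)
        warm_times.append(time.perf_counter() - t0)
    return cold, sum(warm_times) / warm_runs

cold, warm = benchmark("SELECT * WHERE { ?s ?p ?o }")
print(f"cold: {cold:.6f}s, warm: {warm:.6f}s")
```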