Discovering Related Data Sources in Data Portals


Peter Haase

Slides from my presentation at the 1st International Workshop on Semantic Statistics, Sydney, Oct 22, 2013




1. Discovering Related Data Sources in Data Portals
Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm
1st International Workshop on Semantic Statistics, Sydney, Oct 22, 2013

2. Potential of Open (Statistics) Data
[Image label: WORLD BANK]

3. fluidOps Open Data Portal
• Data collection
• Integration of major open data catalogs
• Automated provisioning of 10,000s of data sets
• Portal for search and exploration of data sets
• Rich metadata based on open standards
• Both descriptive and structural metadata
• Integrated querying across interlinked data sets
• Easy-to-use queries against multiple data sets
• Using federation technologies
• Self-service UI
• Custom queries and visualizations
• Widgets, dashboarding, etc.
[Image label: WORLD BANK]

5. Finding Related Data Sets
• Many information needs require analysis of multiple data sets
• Example: Compare and correlate GDP, population and public debt of countries over time
• Task of finding related data sets
• Identify data sets that are similar, but complementary
• To support queries across multiple data sets, e.g., in the form of joins and unions
• Inspiration: Finding related tables
• Entity complement: same attributes, complementing entities
• Schema complement: same entities, complementing attributes
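The two notions of complement on this slide can be illustrated with a minimal Python sketch. It assumes toy tables represented as dictionaries that map an entity to its attribute values. The helper names (`attributes`, `entity_complement`, `schema_complement`, `union`, `join`) and all values are hypothetical placeholders, and the strict set-equality tests are a simplification, not the authors' definition.

```python
# Toy table: entity -> {attribute: value}. All values are placeholders, not real data.
Table = dict[str, dict[str, float]]


def attributes(table: Table) -> set[str]:
    """All attribute names that occur in any row of the table."""
    return {attr for row in table.values() for attr in row}


def entity_complement(a: Table, b: Table) -> bool:
    """Same attributes, and b contributes at least one entity that a lacks."""
    return attributes(a) == attributes(b) and bool(set(b) - set(a))


def schema_complement(a: Table, b: Table) -> bool:
    """Same entities, and b contributes at least one attribute that a lacks."""
    return set(a) == set(b) and bool(attributes(b) - attributes(a))


def union(a: Table, b: Table) -> Table:
    """Entity complements combine by union: rows of both tables."""
    return {**a, **b}


def join(a: Table, b: Table) -> Table:
    """Schema complements combine by join: merge rows of shared entities."""
    return {e: {**a[e], **b[e]} for e in a.keys() & b.keys()}


gdp_europe = {"Germany": {"gdp": 1.0}, "France": {"gdp": 2.0}}
gdp_asia = {"Japan": {"gdp": 3.0}}
population_europe = {"Germany": {"population": 10.0}, "France": {"population": 20.0}}

more_entities = union(gdp_europe, gdp_asia)            # three rows, one attribute
more_attributes = join(gdp_europe, population_europe)  # two rows, two attributes
```

In this reading, an entity complement extends a table with more rows (a union), while a schema complement extends it with more columns (a join).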

6. Finding Related Data Sources via Related Entities
• Data Model: Data source is a set of multiple RDF graphs
• Intuition: if data sources contain similar entities, they are somehow related
• Approach:
1. Entity Extraction
2. Entity Similarity
3. Entity Clustering
[Figure labels: Entities; Cluster 1, Cluster 2; Source 1, Source 2, Source 3; "Related?!"]

7. Related Entities (2)
1. Entity Extraction
– Sample over entities in data graphs in D
– For each entity, crawl its surrounding sub-graph [1]
2. Entity Similarity
– Define dissimilarity measure between two entities based on kernel functions
– Compare entity structure and literals via different kernels [2,3]
3. Entity Clustering
– Apply k-means clustering to discover similar entities [4]
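The three steps above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: triples are plain tuples, every subject is treated as an entity (no sampling), a simple set-intersection kernel over (predicate, object) pairs stands in for the graph kernels of [2,3], and a basic kernel k-means with farthest-first seeding stands in for the large-scale scheme of [4]. All function names and the toy triples are hypothetical.

```python
import math

Triple = tuple[str, str, str]


def crawl_subgraph(triples: list[Triple], entity: str, depth: int = 2) -> set[Triple]:
    """Step 1: collect triples reachable from `entity` within `depth` outgoing hops."""
    frontier, seen, sub = {entity}, {entity}, set()
    for _ in range(depth):
        reached = set()
        for s, p, o in triples:
            if s in frontier:
                sub.add((s, p, o))
                if o not in seen:
                    reached.add(o)
        seen |= reached
        frontier = reached
    return sub


def features(sub: set[Triple]) -> frozenset:
    """Describe an entity by the (predicate, object) pairs in its sub-graph."""
    return frozenset((p, o) for _, p, o in sub)


def kernel(a: frozenset, b: frozenset) -> int:
    """Step 2: intersection kernel, i.e. an inner product of binary feature vectors."""
    return len(a & b)


def kernel_kmeans(K: list[list[float]], k: int, max_iter: int = 20) -> list[int]:
    """Step 3: basic kernel k-means on a precomputed kernel matrix K."""
    n = len(K)

    def d2(i: int, j: int) -> float:
        # Squared kernel-induced distance between points i and j.
        return K[i][i] + K[j][j] - 2 * K[i][j]

    seeds = [0]  # farthest-first seeding keeps the sketch deterministic
    while len(seeds) < k:
        seeds.append(max(range(n), key=lambda i: min(d2(i, s) for s in seeds)))
    labels = [min(range(k), key=lambda c: d2(i, seeds[c])) for i in range(n)]

    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]

        def dist(i: int, c: int) -> float:
            # Squared distance from point i to the implicit centroid of cluster c.
            m = members[c]
            if not m:
                return math.inf
            cross = sum(K[i][j] for j in m) / len(m)
            within = sum(K[j][l] for j in m for l in m) / len(m) ** 2
            return K[i][i] - 2 * cross + within

        new_labels = [min(range(k), key=lambda c: dist(i, c)) for i in range(n)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical toy triples from four imaginary sources.
triples = [
    ("wb:DE", "rdf:type", "Country"), ("wb:DE", "indicator", "GDP"),
    ("es:DE", "rdf:type", "Country"), ("es:DE", "indicator", "GDP"),
    ("es:DE", "unit", "EUR"),
    ("x:Berlin", "rdf:type", "City"), ("x:Berlin", "indicator", "Population"),
    ("y:Paris", "rdf:type", "City"), ("y:Paris", "indicator", "Population"),
]
entities = sorted({s for s, _, _ in triples})
feats = [features(crawl_subgraph(triples, e)) for e in entities]
K = [[kernel(a, b) for b in feats] for a in feats]
labels = kernel_kmeans(K, k=2)
```

Because the clustering only needs the kernel matrix, the intersection kernel could be swapped for any other positive semi-definite kernel without touching `kernel_kmeans`.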

8. Contextualization Score
• Contextualization score for data source D'' given D': ec(D''|D') and sc(D''|D')
• Entity complement score
• Schema complement score

10. Search for Gross Domestic Product

11.

12. Querying the Data Set

13. Visualizing the Results

14. Queries Across Related Data Sets
• Query for GDP of Germany
• Union of results from
• World Bank: GDP (current US$) (up to 2010)
• Eurostat: GDP at Market Prices (including projected values until 2014)
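A union across the two sources can be sketched as follows. This is only an illustration: the `Observation` record, the `union_results` helper, the years chosen, the `projected` flag on the last row, and every numeric value are hypothetical placeholders, not real GDP figures and not the portal's query interface.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    year: int
    value: float          # placeholder number, not a real GDP figure
    source: str
    projected: bool = False


def union_results(*series: list[Observation]) -> list[Observation]:
    """UNION semantics: keep every row from every source, ordered by year and source."""
    return sorted((obs for s in series for obs in s), key=lambda o: (o.year, o.source))


# Placeholder rows standing in for the "GDP of Germany" results of each source.
worldbank = [
    Observation(2009, 1.0, "Worldbank"),
    Observation(2010, 1.1, "Worldbank"),
]
eurostat = [
    Observation(2010, 2.0, "Eurostat"),
    Observation(2013, 2.3, "Eurostat"),
    Observation(2014, 2.4, "Eurostat", projected=True),
]

combined = union_results(worldbank, eurostat)
for obs in combined:
    flag = " (projected)" if obs.projected else ""
    print(f"{obs.year}  {obs.source:<10} {obs.value}{flag}")
```

Each row keeps its source tag instead of being merged by year, since the two indicators are reported in different terms (current US$ versus market prices) and overlapping years would otherwise silently mix them.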

15. Queries Across Related Data Sets
[Figure labels: Data from World Bank; Data from Eurostat]

16. Summary and Outlook
• Techniques for finding related data sets
– Based on finding related entities
• Implementation available in open data portal
• Outlook
– Finding relevant related data sources for a given information need
– End user interfaces for formulating queries across data sets (see Optique project)
– Operators for combining data cubes
– Interactive visualization and exploration of combined data cubes (see OpenCube project)

17. References
[1] G. A. Grimnes, P. Edwards, and A. Preece. Instance based clustering of semantic web resources. In ESWC, 2008.
[2] U. Lösch, S. Bloehdorn, and A. Rettinger. Graph kernels for RDF data. In ESWC, 2012.
[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.
[4] R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In Pattern Recognition, 2002.
