1. Discovering Related Data Sources in Data Portals
Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm
1st International Workshop on Semantic Statistics
Sydney, Oct 22, 2013
3. fluidOps Open Data Portal
• Data collection
• Integration of major open data catalogs
• Automated provisioning of 10,000s of data sets
• Portal for search and exploration of data sets
• Rich metadata based on open standards
• Both descriptive and structural metadata
• Integrated querying across interlinked data sets
• Easy-to-use queries against multiple data sets
• Using federation technologies
• Self-service UI
• Custom queries and visualizations
• Widgets, dashboarding, etc.
5. Finding Related Data Sets
• Many information needs require analysis of multiple data sets
• Example: Compare and correlate GDP, population and public debt of countries over time
• Task of finding related data sets
• Identify data sets that are similar, but complementary
• To support queries across multiple data sets, e.g. in the form of joins and unions
• Inspiration: Finding related tables
• Entity complement: same attributes, complementing entities
• Schema complement: same entities, complementing attributes
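
The two complement types can be illustrated with a minimal sketch, assuming toy tables keyed by country. The table contents and the helper names `entity_complement` and `schema_complement` are hypothetical and only serve to show why an entity complement suggests a union and a schema complement suggests a join.

```python
# Toy tables of the form {entity: {attribute: value}}.
# All values are placeholders, not real statistics.
gdp_europe = {"Germany": {"gdp": 1.0}, "France": {"gdp": 2.0}}
gdp_asia = {"Japan": {"gdp": 3.0}, "India": {"gdp": 4.0}}
population_europe = {"Germany": {"population": 10}, "France": {"population": 20}}


def entity_complement(a, b):
    """Union: same attributes, additional entities (more rows)."""
    merged = dict(a)
    for entity, attrs in b.items():
        merged.setdefault(entity, attrs)
    return merged


def schema_complement(a, b):
    """Join: same entities, additional attributes (more columns)."""
    return {e: {**a[e], **b[e]} for e in a.keys() & b.keys()}


more_rows = entity_complement(gdp_europe, gdp_asia)
more_columns = schema_complement(gdp_europe, population_europe)
print(sorted(more_rows))
print(more_columns["Germany"])
```
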
6. Finding Related Data Sources via Related Entities
• Data Model: Data source is a set of multiple RDF graphs
• Intuition: if data sources contain similar entities, they are somehow related
• Approach:
1. Entity Extraction
2. Entity Similarity
3. Entity Clustering
[Figure: entities of Source 1, Source 2 and Source 3, grouped into Cluster 1 and Cluster 2. Related?!]
7. Related Entities (2)
1. Entity Extraction
– Sample over entities in data graphs in D
– For each entity crawl its surrounding sub-graph [1]
2. Entity Similarity
– Define dissimilarity measure between two entities based on kernel functions
– Compare entity structure and literals via different kernels [2,3]
3. Entity Clustering
– Apply k-means clustering to discover similar entities [4]
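
The three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the triples are made up, the intersection kernel is a simple stand-in for the graph and literal kernels of [2,3], and the farthest-point seeding is a choice made only to keep the sketch deterministic.

```python
import math
from collections import deque

# Toy RDF-like triples (subject, predicate, object); all names are made up.
TRIPLES = [
    ("ex:de", "rdf:type", "ex:Country"),
    ("ex:de", "ex:currency", "ex:eur"),
    ("ex:fr", "rdf:type", "ex:Country"),
    ("ex:fr", "ex:currency", "ex:eur"),
    ("ex:eur", "rdf:type", "ex:Currency"),
    ("ex:gdp", "rdf:type", "ex:Indicator"),
    ("ex:debt", "rdf:type", "ex:Indicator"),
]


def extract_subgraph(entity, triples, max_depth=2):
    """Step 1: crawl outgoing edges around an entity, breadth-first."""
    edges, seen, queue = set(), {entity}, deque([(entity, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for s, p, o in triples:
            if s == node:
                edges.add((depth, p, o))  # hop count, predicate, object
                if o not in seen:
                    seen.add(o)
                    queue.append((o, depth + 1))
    return frozenset(edges)


def kernel(a, b):
    """Step 2a: intersection kernel (dot product of indicator vectors)."""
    return len(a & b)


def dissimilarity(a, b):
    """Step 2b: distance induced by the kernel in its feature space."""
    return math.sqrt(max(0.0, kernel(a, a) - 2 * kernel(a, b) + kernel(b, b)))


def kernel_kmeans(items, k, iterations=10):
    """Step 3: k-means in kernel space; returns one cluster label per item."""
    n = len(items)
    K = [[kernel(a, b) for b in items] for a in items]

    def dist2(i, j):
        return K[i][i] - 2 * K[i][j] + K[j][j]

    def to_centroid(i, members):
        # Squared feature-space distance from item i to a cluster mean.
        m = len(members)
        cross = sum(K[i][j] for j in members) / m
        within = sum(K[j][l] for j in members for l in members) / (m * m)
        return K[i][i] - 2 * cross + within

    # Deterministic farthest-point seeding (an assumption of this sketch).
    seeds = [0]
    while len(seeds) < k:
        seeds.append(max(range(n), key=lambda i: min(dist2(i, s) for s in seeds)))
    labels = [min(range(k), key=lambda c: dist2(i, seeds[c])) for i in range(n)]

    for _ in range(iterations):
        clusters = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        candidates = [c for c in range(k) if clusters[c]]
        new_labels = [
            min(candidates, key=lambda c: to_centroid(i, clusters[c]))
            for i in range(n)
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


entities = ["ex:de", "ex:fr", "ex:gdp", "ex:debt"]
features = [extract_subgraph(e, TRIPLES) for e in entities]
labels = kernel_kmeans(features, k=2)
print(dict(zip(entities, labels)))
```

In this toy run the two countries share one cluster and the two indicators share the other. Data sources whose entities land in the same clusters would then be candidates for being related.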
8. Contextualization Score
• Contextualization score for data source D’’ given D’: ec(D’’|D’) and sc(D’’|D’)
• Entity complement score
• Schema complement score
14. Queries Across Related Data Sets
• Query for GDP of Germany
• Union of results from
• Worldbank: GDP (current US$) (up to 2010)
• Eurostat: GDP at Market Prices (including projected values until 2014)
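
A union over two related data sets might look like the following minimal sketch. The observation values are placeholders rather than real statistics, and `union_results` is a hypothetical helper; the sketch keeps the source of each row and does not reconcile differing indicator definitions.

```python
# Placeholder observations (year, value); the numbers are NOT real statistics.
worldbank_gdp = [(2009, 1.0), (2010, 1.1)]  # coverage ends in 2010
eurostat_gdp = [(2010, 1.2), (2011, 1.3), (2014, 1.6)]  # includes projections


def union_results(**sources):
    """Union observations from several sources, keeping provenance."""
    rows = [
        (year, value, name)
        for name, observations in sources.items()
        for year, value in observations
    ]
    return sorted(rows)


rows = union_results(worldbank=worldbank_gdp, eurostat=eurostat_gdp)
for year, value, source in rows:
    print(year, value, source)
```

Keeping the source name on every row lets a user see where the two series overlap (here, 2010) and where only one source contributes values.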
16. Summary and Outlook
• Techniques for finding related data sets
– Based on finding related entities
• Implementation available in open data portal
• Outlook
– Finding relevant related data sources for a given information need
– End user interfaces for formulating queries across data sets (see Optique project)
– Operators for combining data cubes
– Interactive visualization and exploration of combined data cubes (see OpenCube project)
17. References
[1] G. A. Grimnes, P. Edwards, and A. Preece. Instance based clustering of semantic web resources. In ESWC, 2008.
[2] U. Lösch, S. Bloehdorn, and A. Rettinger. Graph kernels for RDF data. In ESWC, 2012.
[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.
[4] R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In Pattern Recognition, 2002.