A talk given at the EDBT/ICDT 2010 conference. For more details, visit the project website at http://img.cs.manchester.ac.uk/dataspaces/dataspaces.html
1. Feedback-Based Annotation, Selection and Refinement of
Schema Mappings for Dataspaces
Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury,
Alvaro A. A. Fernandes, and Cornelia Hedeler
EDBT/ICDT 2010
2. Data Integration
[Figure: a scientist poses the query "What are the available proteins of the Fruit Fly?" against an integration schema, which schema mappings relate to the sources PedroDB, PepSeeker, Pride and GPMDB.]
3. Towards Pay-as-you-go Data Integration
Data Integration
– Setting up a data integration system requires significant upfront effort.
– The specification of schema mappings has proved to be time and resource consuming: it requires deep knowledge of the sources to be integrated as well as the user's requirements.
Dataspaces: a Pay-as-you-go Data Integration [Franklin et al. 2005]
– Reduce the up-front cost required to set up a data integration system: provide some services immediately.
– Gradually improve the services provided by the system through interaction with end users in a pay-as-you-go fashion.
M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.
4. Pay-as-you-go Data Integration
[Figure: a scientist poses the query "What are the available proteins of the Fruit Fly?" against an integration schema. Labels: Integration Schema, Bootstrap, Dataspaces, Mappings. Sources: PedroDB, PepSeeker, Pride, GPMDB.]
Objective of the present work: investigate pay-as-you-go annotation, selection, and refinement of schema mappings.
5. Pay-as-you-go Data Integration
We assume that the integration schema and the source schemas are relational,
and that the schema mappings that define the extent of each relation r in the
integration schema are global-as-view mappings of the form:
m = ⟨r,qs⟩
where qs is a relational query over the source schemas.
A relation in the integration schema can be associated with multiple
candidate mappings: We consider a setting in which multiple matching
mechanisms can be used, each of which could give rise to multiple mapping
candidates for populating the same relation of the integration schema.
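A minimal sketch of how such candidate mappings might be represented. The class name `Mapping`, the SQL strings and the source relation `ProteinHit` are hypothetical and serve only as illustration; they are not taken from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """A global-as-view mapping m = <r, qs>."""
    relation: str      # r: a relation of the integration schema
    source_query: str  # qs: a relational query over the source schemas


# Hypothetical candidates, e.g. proposed by different matching mechanisms.
candidates = [
    Mapping("Protein", "SELECT accession, name, gene FROM ProteinEntry"),
    Mapping("Protein", "SELECT acc, name, gene FROM ProteinHit"),
]

# Group the candidates by the integration relation they populate.
by_relation = defaultdict(list)
for m in candidates:
    by_relation[m.relation].append(m)

print(len(by_relation["Protein"]))
```

Keeping all candidates per relation, rather than picking one upfront, is what lets later feedback decide between them.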
6. Outline
User Feedback
Annotation of Schema Mappings
Selection of Schema Mappings Based on User Requirements
Refinement of Schema Mappings
7. User Feedback
Query: What are the available fruit fly proteins?
[Figure: table of query results with a feedback column; the feedback marks shown are ✔, ✖, ✖, ✔.]
8. User Feedback (cont.)
Let m be a candidate mapping, and UF a set of feedback instances supplied by the user:
tp(m,UF): the tuples that are expected by the user and that are retrieved by the mapping m.
fp(m,UF): the tuples that are not expected by the user and that are retrieved by the mapping m.
fn(m,UF): the tuples that are expected by the user and are not retrieved by the mapping m.
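The three sets above can be sketched with plain set operations. This is an illustrative reading of the definitions, assuming feedback is given as two sets of tuples (expected and not expected); the function name `classify` and the sample accessions are hypothetical.

```python
def classify(retrieved, expected, unexpected):
    """Return (tp, fp, fn) for one mapping m given user feedback UF.

    retrieved:  set of tuples returned by the mapping m
    expected:   tuples the user marked as expected
    unexpected: tuples the user marked as not expected
    """
    tp = retrieved & expected      # expected and retrieved
    fp = retrieved & unexpected    # not expected but retrieved
    fn = expected - retrieved      # expected but not retrieved
    return tp, fp, fn


# Hypothetical feedback on protein accessions.
retrieved = {"P1", "P2", "P3", "P4"}
expected = {"P1", "P4", "P5"}
unexpected = {"P2"}

tp, fp, fn = classify(retrieved, expected, unexpected)
print(sorted(tp), sorted(fp), sorted(fn))
```

Note that "P3" is retrieved but has no feedback yet, so it falls into none of the three sets.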
9. Outline
User Feedback
Annotation of Schema Mappings
Selection of Schema Mappings Based on User Requirements
Refinement of Schema Mappings
10. Annotating Mappings
Using a simple annotation scheme, a schema mapping can be annotated as:
Correct
Incorrect
The set of schema mappings is likely to be incomplete, and, therefore, we may end up annotating all mappings as incorrect.
Because of this, we use a less stringent scheme for mapping annotation.
11. Annotating Mappings (cont.)
Instead, we use and adapt the notions of precision and recall used in information retrieval to measure the quality of a mapping.
Precision, Recall, F measure: [formulas missing from the extracted text]
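The formulas on this slide did not survive extraction. As a hedged sketch, the standard information-retrieval definitions applied to the sizes of tp(m,UF), fp(m,UF) and fn(m,UF) would look as follows. The balanced F measure (harmonic mean) and the convention of returning 0.0 for an empty denominator are assumptions; the talk may use a weighted variant.

```python
def precision(tp, fp):
    """|tp| / (|tp| + |fp|); 0.0 when nothing with feedback was retrieved."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """|tp| / (|tp| + |fn|); 0.0 when no expected tuple is known."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall (assumed balanced form)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts: 2 true positives, 1 false positive, 1 false negative.
p = precision(2, 1)
r = recall(2, 1)
print(p, r, f_measure(p, r))
```

Because the counts come only from tuples the user has commented on, these values are relative to the feedback collected so far and change as more feedback arrives.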
12. Mapping Annotation: Validation
Questions:
– How much user feedback is required for approximating the real precision and recall, i.e., those based on complete knowledge of the expected results?
– Does the pay-as-you-go philosophy hold?
13. Mapping Annotation: Validation (cont.)
Experiment:
Data:
– Two datasets: the Mondial geographical database and the Amalgam data integration benchmark.
– Candidate schema mappings: created using the IBM InfoSphere Data Architect.
Process: we applied the following two-step process for multiple iterations.
1. Generate a sample of feedback instances.
2. Compute the relative precision and recall of the candidate mappings given cumulative feedback.
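The two-step process can be sketched as a small simulation. Everything here is hypothetical (the tuple universe, the ground truth, the batch size of 10 and the fixed random seed); it only illustrates how estimates computed from cumulative feedback converge to the values obtained with complete knowledge.

```python
import random


def estimate(retrieved, feedback):
    """Precision and recall of a mapping relative to the feedback seen so far.

    feedback maps a tuple to True (expected) or False (not expected).
    """
    tp = sum(1 for t, ok in feedback.items() if ok and t in retrieved)
    fp = sum(1 for t, ok in feedback.items() if not ok and t in retrieved)
    fn = sum(1 for t, ok in feedback.items() if ok and t not in retrieved)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


# Hypothetical ground truth: tuples 0..99 exist, the user expects 0..59.
universe = list(range(100))
truth = {t: t < 60 for t in universe}
retrieved = set(range(40, 80))          # what one candidate mapping returns

rng = random.Random(0)                  # fixed seed for repeatability
order = universe[:]
rng.shuffle(order)

feedback = {}                           # cumulative feedback
history = []
for i in range(0, len(order), 10):
    for t in order[i:i + 10]:           # step 1: sample 10 feedback instances
        feedback[t] = truth[t]
    history.append(estimate(retrieved, feedback))  # step 2: recompute

print(history[-1])
```

The last entry of `history` uses feedback on every tuple, so it equals the real precision and recall; earlier entries show how close a partial sample gets.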
16. Outline
User Feedback
Annotation of Schema Mappings
Selection of Schema Mappings Based on User Requirements
Refinement of Schema Mappings
17. Mapping Selection
Mapping selection should be tailored to meet user requirements.
We use a selection method that aims to maximise the recall such that the precision of the results is higher than a given precision threshold.
We cast this selection problem as a search problem that aims to maximise the following utility function: [formula missing from the extracted text]
D. A. Menascé and V. Dubey. Utility-based QoS brokering in service oriented architectures. In ICWS, pages 422–430. IEEE CS, 2007.
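The utility function itself is not recoverable from the slide, so the following sketch does not reproduce it. It only illustrates the stated goal, maximising recall subject to a precision threshold, by exhaustive search over subsets of candidate mappings whose results are unioned. The function name `select`, the sample data, and treating the threshold as inclusive are all assumptions.

```python
from itertools import combinations


def select(mappings, expected, unexpected, threshold):
    """Pick the subset of mappings whose union has the highest recall
    among subsets whose precision is at least `threshold`.

    mappings: dict name -> set of tuples retrieved by that mapping
    """
    best, best_recall = (), -1.0
    names = sorted(mappings)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            retrieved = set().union(*(mappings[n] for n in subset))
            tp = len(retrieved & expected)
            fp = len(retrieved & unexpected)
            fn = len(expected - retrieved)
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            if p >= threshold and r > best_recall:
                best, best_recall = subset, r
    return best, best_recall


# Hypothetical candidate mappings and feedback.
mappings = {"m1": {"a", "b"}, "m2": {"c", "x"}, "m3": {"d", "x", "y"}}
expected = {"a", "b", "c", "d"}
unexpected = {"x", "y"}

chosen, chosen_recall = select(mappings, expected, unexpected, threshold=0.7)
print(chosen, chosen_recall)
```

Exhaustive enumeration is exponential in the number of candidates, which is one reason to treat selection as a search problem guided by a utility function instead.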
19. Mapping Selection: Precision
Do we meet the precision requirement, i.e., is the precision threshold set by the user respected?
23. Outline
User Feedback
Annotation of Schema Mappings
Selection of Schema Mappings Based on User Requirements
Refinement of Schema Mappings
24. Mapping Refinement
We distinguish two kinds of refinement:
Mapping refinement that seeks to reduce the number of false positives
A candidate mapping is refined by modifying a source query so that the number of false positives it returns is reduced.
Mapping refinement that aims to increase the number of true positives
A candidate mapping m is refined by modifying a source query so that the number of true positives it returns is increased.
25. Mapping Refinement: Example
User request: "I want fruit fly proteins"
Integration schema: Protein(accession, name, gene)
m = <Protein, ProteinEntry>
Source schema: [relations shown in the original figure]
26. Mapping Refinement: The Space of Solutions
The space of solutions is composed of the mappings that can be constructed out of the candidate mappings. Specifically, by:
i. Joining the source query of a candidate mapping.
ii. Augmenting the source query of a candidate mapping with a selection condition.
iii. Relaxing the selection condition of the source query of a candidate mapping.
iv. Combining the source queries of two or more mappings using union, difference and intersection.
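Operators (ii) to (iv) above can be illustrated on query results modelled as Python sets of rows. This is a sketch under assumptions: the relation contents, the `organism` column and the helper names are hypothetical, and operator (i), the join, is left out for brevity.

```python
# Hypothetical source relations with rows (accession, name, organism).
protein_entry = {
    ("P1", "Adh", "Drosophila melanogaster"),
    ("P2", "Adh", "Homo sapiens"),
    ("P3", "eve", "Drosophila melanogaster"),
}
protein_hit = {
    ("P3", "eve", "Drosophila melanogaster"),
    ("P4", "hb", "Drosophila melanogaster"),
}


def select_where(rows, condition):
    """Apply a selection condition to the result of a source query."""
    return {row for row in rows if condition(row)}


def is_fly(row):
    return row[2] == "Drosophila melanogaster"


def is_adh(row):
    return row[1] == "Adh"


# (ii) Augment a source query with a selection condition.
strict = select_where(protein_entry, lambda row: is_fly(row) and is_adh(row))

# (iii) Relax the selection condition by dropping one conjunct.
relaxed = select_where(protein_entry, is_fly)

# (iv) Combine the source queries of two mappings.
combined = relaxed | select_where(protein_hit, is_fly)   # union
shared = relaxed & protein_hit                           # intersection
only_entry = relaxed - protein_hit                       # difference

print(sorted(row[0] for row in combined))
```

Adding a condition can only shrink a result (fewer false positives), while relaxing a condition or taking a union can only grow it (more true positives), which matches the two kinds of refinement distinguished in the talk.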
15/04/2009 Khalid
27. Exploring the Space of Solutions
The space of mappings that can be obtained by refinement is potentially large.
A search algorithm that explores the whole space of the possible mappings may not be able to find a solution in a bounded time.
In the context of the present work, we used an evolutionary algorithm for exploring the space of mappings that can be obtained by refinement.
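The slides do not describe the evolutionary algorithm in detail, so the following is only a generic sketch of the idea, not the authors' algorithm. It assumes that an individual is a bit mask choosing which candidate query results to union, that mutation flips one bit, that fitness is the balanced F measure against the feedback, and that the fittest individuals always survive. All names and data are hypothetical.

```python
import random


def f_measure(retrieved, expected, unexpected):
    """Balanced F measure, written as 2tp / (2tp + fp + fn)."""
    tp = len(retrieved & expected)
    fp = len(retrieved & unexpected)
    fn = len(expected - retrieved)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def evolve(blocks, expected, unexpected, generations=30, size=6, seed=0):
    rng = random.Random(seed)

    def fitness(mask):
        chosen = [b for b, keep in zip(blocks, mask) if keep]
        return f_measure(set().union(*chosen), expected, unexpected)

    # Initial population: each candidate on its own (no refinement yet).
    n = len(blocks)
    population = [tuple(i == j for j in range(n)) for i in range(n)]
    for _ in range(generations):
        children = []
        for _ in range(size):
            parent = rng.choice(population)
            i = rng.randrange(n)                      # mutate: flip one bit
            children.append(parent[:i] + (not parent[i],) + parent[i + 1:])
        # Elitist survival: keep the `size` fittest distinct individuals.
        pool = set(population + children)
        population = sorted(pool, key=fitness, reverse=True)[:size]
    best = population[0]
    return best, fitness(best)


# Hypothetical candidate query results and feedback.
blocks = [{"a", "b"}, {"c", "x"}, {"d"}, {"x", "y"}]
expected = {"a", "b", "c", "d"}
unexpected = {"x", "y"}

best, score = evolve(blocks, expected, unexpected)
print(best, score)
```

Because the best individual is never discarded, the score cannot fall below that of the best initial candidate; the search trades completeness for a bounded number of fitness evaluations.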
29. Mapping Refinement: Validation
Question: Can mapping refinement improve the quality of initial candidate mappings, and, if so, at what cost, i.e., what is the amount of user feedback required?
Experiment: To answer the above question we applied the following process for multiple iterations.
1) Generate a sample of feedback instances.
2) Annotate the set of candidate mappings.
3) Refine candidate mappings using the RefineMappings algorithm.
31. Conclusions
Pay-as-you-go Annotation of Schema Mappings
We showed how schema mappings can be incrementally annotated based on feedback supplied by end users.
We also showed through an evaluation exercise that the more feedback the user supplies, the better the quality of the computed mapping annotations.
Application: Selection and Refinement of Schema Mappings in Dataspaces
Mapping annotations computed from user feedback are used as input for the selection and refinement of schema mappings.
The evaluation exercises also showed that mapping refinement is more cost-effective in the first feedback iterations.