20090411

Integrating Web Query Results:
Holistic Schema Matching
1

CIKM’08

Shui-Lung Chuang
Kevin Chen-Chuan Chang

Yen-Ling Lin
2009/04/13

26 pages

Outline
2

 Introduction
 Approach
 Framework
 Algorithm
 Experiments


Introduction
3


Introduction
4


Introduction –
Schema Matching on Query Results
6

 Data fields are the basic units processed by matching.
 A data field can be viewed as a label plus a set of values.
 We lack explicit and complete schema information. e.g.
 To address these challenges, we exploit two opportunities that arise
in the context of integrating query results:
1) We often need to integrate multiple sources, and useful effects
naturally occur when cross-referencing many sources.
2) Although no schema-based constraint is available, there are useful
regularities that can be observed across many sources. These
regularities, treated as observed domain constraints, are very
helpful for matching discovery.


Introduction - Approach
7

 The enrichment occurs at three levels:
1. The content of a field
2. The kinds of fields
3. The constraints of fields
 With all the above enrichment, we learn a more
complete schema to describe the whole input data.
 This learned schema can then help us make
further matches.


Framework – Problem Statement
8

 Suppose A={a1,a2,…} for the book source. For source
S1, the fields X1 = (x11,x12,…,x17) can be assigned
the matching Y1 = (a1,a2,…,a7).
 Matching means discovering the assignment of
the groups in A to the fields of each source:

Ys = (ys1,…,ysls), where each ysi ∊ A is the group to which
source field xsi ∊ Xs is assigned.


Framework
Matching as Domain Schema Discovery
9

 Let the domain schema be M=(A, B)
 A: the set of domain fields
 B: the statistical constraints
 For each source Ss:
1) Project M onto a source schema Ms = (Ys, Vs)
   1) Ys: a subset of A to be the fields of source Ss
   2) Vs: a set of constraints instantiated from B
2) Construct the source instances Xs
3) Vs → Us, Ys → Xs: Is = (Xs, Us)
4) Output: Xs


Framework
10

 This procedure of data generation can be conceptually
sketched as:

 M=(A, B) where A={a1,…,a11} and B={first(a1):.67,
first(a2):.33, pos≻(a2, a3):1}
 M1=(Y1,V1) where Y1={a1,…,a5,a7,a8} and V1={first(a1):.67,
first(a2):.33, pos≻(a2, a3):1}
 We generate data using source schema M1.
 Map Y1 as X1, e.g., a2 is mapped as x12
 first(a1) in V1 is rewritten as first(x11) in U1, pos≻(a2,a3) as pos≻(x12,x13)
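
The renaming step in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout `(predicate, elements, confidence)`, the helper names `map_fields` and `rewrite_constraints`, and the ASCII spelling `pos>` for pos≻ are hypothetical. The confidences are copied unchanged, which is a simplification; the slides do not say how they are instantiated for a single source.

```python
# Source schema M1 = (Y1, V1) from the slide; the tuple layout is an assumption.
Y1 = ["a1", "a2", "a3", "a4", "a5", "a7", "a8"]
V1 = [
    ("first", ("a1",), 0.67),
    ("first", ("a2",), 0.33),
    ("pos>", ("a2", "a3"), 1.0),
]


def map_fields(source_id, domain_fields):
    """Map the i-th domain field of a source schema to the source field x<s><i>."""
    return {a: f"x{source_id}{i}" for i, a in enumerate(domain_fields, start=1)}


def rewrite_constraints(constraints, mapping):
    """Rename the elements of each constraint from domain fields to source fields."""
    return [
        (pred, tuple(mapping[e] for e in elems), conf)
        for pred, elems, conf in constraints
    ]


mapping = map_fields(1, Y1)               # Y1 -> X1
U1 = rewrite_constraints(V1, mapping)     # V1 -> U1
```

Here `mapping["a2"]` is `"x12"`, and the constraint on (a2, a3) becomes a constraint on (x12, x13), as in the slide.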


Framework
11

 Let the data observed from source Ss be Is = (Xs, Us).
 Given the matching Y = {Ys : s ∊ S}, learning the best
domain schema can be described as a probabilistic
optimization expression:

$$M^* = \arg\max_{M} \prod_{s \in S} p(I_s \mid Y_s, M)$$

 Similarly, if the domain schema M is given, the best
matching $Y^* = \{Y_s^* : s \in S\}$ can be discovered, again using
statistical techniques to find the most likely
assignment of domain fields to the fields of each
source:

$$Y_s^* = \arg\max_{Y_s} p(I_s \mid Y_s, M) \quad \text{for each } s \in S$$

Framework
12

 Suppose X1={x11,x12,x13} and X2={x21,x22}.
Suppose we have one predicate function to check: first.
Then I1=(X1,U1) where U1={first(x11):1}, and I2=(X2,U2)
where U2={first(x21):1}.
 Suppose Y1={a1,a2,a3} and Y2={a2,a3}.
Construct M1=(Y1,V1) with V1={first(a1):1}, and M2=(Y2,V2)
with V2={first(a2):1}.
 first(a1) holds for M1 but not for M2, so first(a1) has
confidence 0.5; likewise, first(a2) holds for M2 but not for M1.
Combining source schemas M1 and M2, the domain schema becomes
M=(A, B) where A={a1, a2, a3} and B={first(a1):.5,
first(a2):.5}.
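
The arithmetic of this example can be reproduced with a short Python sketch. It assumes that the confidence of a constraint in B is its confidence summed over the source schemas and divided by the number of sources, which matches the 0.5 values above but is not stated as a general rule in the slides. The function name `combine_source_schemas` and the dictionary layout are hypothetical.

```python
from collections import defaultdict


def combine_source_schemas(source_schemas):
    """Merge source schemas (Ys, Vs) into a domain schema (A, B).

    Each Vs maps (predicate, elements) -> confidence. A constraint that is
    absent from a source counts as confidence 0 there (an assumption).
    """
    fields = []
    totals = defaultdict(float)
    for ys, vs in source_schemas:
        for a in ys:
            if a not in fields:          # keep first-seen order
                fields.append(a)
        for key, conf in vs.items():
            totals[key] += conf
    n = len(source_schemas)
    constraints = {key: total / n for key, total in totals.items()}
    return fields, constraints


M1 = (["a1", "a2", "a3"], {("first", ("a1",)): 1.0})
M2 = (["a2", "a3"], {("first", ("a2",)): 1.0})
A, B = combine_source_schemas([M1, M2])
```

With the two source schemas of the example, `A` contains a1, a2, a3 and both `first` constraints receive confidence 0.5.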


Framework Formulation and Overview
13

 Field Model
 A field model a is a statistical model specifying how to generate
instances.
 A field model a is a function that accepts an instance z and
produces p(z|a), indicating the likelihood that z is an instance
produced by the field model a.
 Statistical Constraint
 A statistical constraint b is written as f(e):c, where
f is a predicate name, e is the vector of elements, and c is a
confidence value in the range [0,1].


Framework Formulation and Overview
14

 Overall, our framework translates the problem of instance-
based matching into a schema-discovery problem.
 With such a strategy, we leverage not only the data instances
but also the regularities observed from the data in a principled
way.


Algorithm
15

 To solve our matching problem, we need to discover
either an optimal matching Y* or an optimal schema
M*.
 If one of them is obtained, the other can be derived.
 The basic idea is to start with an initial guess of the
matching Y and iteratively improve it using the
schema M that is derived from the current
estimate of Y.
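
The alternation between matching and schema can be sketched as a small fixed-point loop. Everything concrete below is a toy assumption made for illustration: the field "model" is just the fraction of numeric values in a group (the slides use HMM field models), constraints are ignored, and the names `init_match`, `learn_schema`, `schema_match` and `iterate_matching` are hypothetical stand-ins, not the authors' implementation.

```python
def numeric_fraction(values):
    """Fraction of values that look like plain decimal numbers."""
    return sum(v.replace(".", "", 1).isdigit() for v in values) / len(values)


def init_match(sources):
    """Toy initial guess: the i-th field of every source goes to group a<i>."""
    return {
        field: f"a{pos}"
        for fields in sources.values()
        for pos, field in enumerate(fields, start=1)
    }


def learn_schema(sources, matching):
    """Toy schema: each group's model is the mean numeric fraction of its fields."""
    members = {}
    for fields in sources.values():
        for field, values in fields.items():
            members.setdefault(matching[field], []).append(numeric_fraction(values))
    return {group: sum(fracs) / len(fracs) for group, fracs in members.items()}


def schema_match(sources, schema):
    """Assign every field to the group whose model is closest to the field."""
    matching = {}
    for fields in sources.values():
        for field, values in fields.items():
            frac = numeric_fraction(values)
            matching[field] = min(schema, key=lambda g: abs(schema[g] - frac))
    return matching


def iterate_matching(sources, max_iters=10):
    """Alternate schema learning and matching until the matching stops changing."""
    matching = init_match(sources)
    for _ in range(max_iters):
        schema = learn_schema(sources, matching)
        improved = schema_match(sources, schema)
        if improved == matching:
            break
        matching = improved
    return matching, schema


# Source s2 lists price before author, so the positional initial guess is wrong for it.
sources = {
    "s1": {"x11": ["Tolkien", "Rowling"], "x12": ["10.50", "12.00"]},
    "s2": {"x21": ["9.99", "10.50"], "x22": ["King", "Rowling"]},
    "s3": {"x31": ["Orwell", "Tolkien"], "x32": ["7.25", "12.00"]},
}
matching, schema = iterate_matching(sources)
```

In this toy run the schema learned from the imperfect initial guess is still good enough to move x21 and x22 to the other group, after which the matching no longer changes.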


Algorithm
16

 InitMatch
 This function generates an initial matching, which serves as the
starting point for the iterations.
 EnumRelations
 We need to identify the constraints occurring in the input data.
 Predicate Function
f(i1,…,ik, X)
i1,…,ik: the elements whose satisfaction of the predicate f is
checked; X is the original data.
True: the input satisfies the predicate
False: otherwise
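
As an illustration of predicate functions and of enumerating the relations that hold in the data, here is a minimal Python sketch. It assumes that X is simply the list of a source's fields in display order, and that pos≻(i, j) means "i appears before j"; both readings, as well as the names `pos_before`, `PREDICATES` and `enum_relations`, are assumptions made for this example.

```python
from itertools import permutations


def first(i, X):
    """True if element i is the first field in X."""
    return bool(X) and X[0] == i


def pos_before(i, j, X):
    """True if element i appears before element j in X (assumed meaning of pos≻)."""
    return X.index(i) < X.index(j)


# Predicate name -> (arity, function); a hypothetical registry.
PREDICATES = {"first": (1, first), "pos>": (2, pos_before)}


def enum_relations(X):
    """List every (predicate, elements) pair that is satisfied by the data X."""
    relations = []
    for name, (arity, fn) in PREDICATES.items():
        for elems in permutations(X, arity):
            if fn(*elems, X):
                relations.append((name, elems))
    return relations


X1 = ["x11", "x12", "x13"]
relations = enum_relations(X1)
```

For the three-field source above, the enumeration yields first(x11) and the three ordered pairs (x11, x12), (x11, x13), (x12, x13).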


Algorithm
17

 LearnSchema – From matching to schema
 The aim is to construct a schema based on a given matching.
 First, group the matched source fields together.
 Each group is trained as a field model.
 Each field model is a 2-state HMM.
 Learning an HMM a from a set of instances and computing
the probability p(z|a) for a given instance z follow the
standard HMM training and inference algorithms.
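
For the inference half, p(z|a) can be computed with the standard forward algorithm. The sketch below is a plain-Python version for a discrete HMM; the two-state parameters and the token classes `"W"` (word) and `"D"` (digits) are invented for illustration and are not taken from the slides. Training (for example with Baum-Welch) is not shown.

```python
def hmm_likelihood(z, start, trans, emit):
    """Forward algorithm: return p(z | a) for a discrete HMM a.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s]     : dict mapping a symbol to its emission probability in state s
    """
    if not z:
        return 1.0
    n_states = len(start)
    # alpha[s] = probability of the symbols seen so far, ending in state s
    alpha = [start[s] * emit[s].get(z[0], 0.0) for s in range(n_states)]
    for symbol in z[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states))
            * emit[s].get(symbol, 0.0)
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical 2-state model: state 0 emits words, state 1 emits digits.
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [{"W": 1.0}, {"D": 1.0}]

p_wd = hmm_likelihood(["W", "D"], start, trans, emit)
p_wwd = hmm_likelihood(["W", "W", "D"], start, trans, emit)
p_dw = hmm_likelihood(["D", "W"], start, trans, emit)
```

Under these made-up parameters, a word followed by digits is likely, a longer word run is less likely, and digits followed by a word has probability zero.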


Algorithm
18

 SchemaMatch – From Schema to Matching
 Given the domain schema, matching becomes labeling the
elements of sources with the appropriate domain fields.
 For each hj ∈ Vs with the corresponding bj ∈ B, let their
constraint be fj(yi1,…,yik). We define

qi, j (a ) z (a ) (a ) p (h j | b j ) ( yl ) ( yl )
i i l l
i1 ,..., i k , y i a l i1 ,.., i k , l i

qi ( a ) z qi, j (a )
j

The most likely value for each yi is thus:

$$y_i^* = \arg\max_{a \in A} q_i(a)$$


Algorithm
19

 MetaMatch:
 Adopt the F-measure to measure consistency:

$$F_{i,j} = \frac{2 R_{i,j} P_{i,j}}{R_{i,j} + P_{i,j}}$$

 For two matchings m1 and m2, using m1 as testee and m2 as
tester:

$$F(m_1, m_2) = \sum_{i \in m_2} \frac{n_i}{n} \max_{j \in m_1} \{F_{i,j}\}$$

 Let the candidates generated during this process be C and
the n matchings be R={r1,…,rn}. The final matching is obtained
as:

$$m^* = \arg\max_{m \in C} \sum_{r \in R} F(m, r)$$

 InitMatch aims to guess an initial matching, to be
the starting point of the iterative computation.
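
The MetaMatch consistency measure can be sketched in Python as follows. The slides do not define R and P, so this sketch assumes the usual clustering definitions: for a tester group i and a testee group j, recall is the overlap divided by the size of i and precision is the overlap divided by the size of j. A matching is represented as a list of sets of source fields; this layout and the function names are hypothetical.

```python
def group_f(tester_group, testee_group):
    """F-measure between one tester group and one testee group."""
    overlap = len(tester_group & testee_group)
    if overlap == 0:
        return 0.0
    recall = overlap / len(tester_group)
    precision = overlap / len(testee_group)
    return 2 * recall * precision / (recall + precision)


def matching_f(m1, m2):
    """F(m1, m2) with m1 as testee and m2 as tester.

    Each tester group is weighted by its share n_i / n of all fields and
    scored by its best-matching testee group.
    """
    n = sum(len(group) for group in m2)
    return sum(
        len(gi) / n * max(group_f(gi, gj) for gj in m1)
        for gi in m2
    )


def meta_match(candidates, runs):
    """Pick the candidate that agrees most with all the run results."""
    return max(candidates, key=lambda m: sum(matching_f(m, r) for r in runs))


good = [{"x11", "x21"}, {"x12", "x22"}]
bad = [{"x11", "x22"}, {"x12", "x21"}]
best = meta_match([good, bad], [good, good, bad])
```

In this small example two of the three runs agree with `good`, so it accumulates the higher total F-measure and is selected.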


Algorithm
20

 HoliMatch’s algorithm


Experiments
21

 Data set
 Four domains

 For each domain, collect 10 sources


Experiments
22

 Comparison Methods
 PairMatch: adopts a corpus-based approach

 ClusMatch:

 ChainMatch: e.g., 1-2-3-4

 ProgMatch: e.g., becoming (((1-2)-3)-4)

 InitMatch: an extension of pairwise matching

 HoliMatch

 Performance
 The matching accuracy is measured using F-measure.

 Given the result matching m and the correct matching c, the
F-measure is F(m, c), indicating how close m is to c.


Experiments
23

 Matching on Correct Extraction Data
 Matchers

Iterations



Experiments
24

 Sources


Experiments
25

 Pairwise


Experiments
26

 Matching on Real Extraction Data
