6. Introduction – Schema Matching on Query Results
Data fields are the basic units processed by matching.
A data field can be viewed as a label plus a set of values.
We lack explicit and complete schema information.
To address these challenges, we exploit two opportunities that arise when integrating query results:
1) First, we often need to integrate multiple sources. Some useful effects occur naturally when cross-referencing many sources.
2) Second, although no schema-based constraints are available, useful regularities can be observed across many sources. These regularities, treated as observed domain constraints, are very helpful for discovering matchings.
26 pages
7. Introduction - Approach
The enrichment occurs at three levels:
1. The content of a field
2. The kinds of fields
3. The constraints of fields
With all of the above enrichment, we learn a more
complete schema to describe the whole input data.
This learned schema in turn helps us discover
further matchings.
8. Framework – Problem Statement
Suppose A = {a1, a2, …} is the set of groups for the book domain. For source
S1, the fields X1 = (x11, x12, …, x17) can be assigned
the matching Y1 = (a1, a2, …, a7).
Matching means discovering, for each source, the assignment of
the groups in A to its fields:
Ys = (ys1, …, ysls), where each ysi ∊ A is the group to which
source field xsi ∊ Xs is assigned.
9. Framework
Matching as Domain Schema Discovery
Let the domain schema be M = (A, B)
A: the set of domain fields
B: the statistical constraints
For each source Ss:
1) It projects M onto a source schema Ms = (Ys, Vs)
   1) Ys: a subset of A to be the fields of source Ss
   2) Vs: a set of constraints instantiated from B
2) Construct the source instances Xs
3) Vs → Us, Ys → Xs: Is = (Xs, Us)
4) Output: Xs
10. Framework
Matching as Domain Schema Discovery
This data-generation procedure can be sketched conceptually
as follows:
M = (A, B), where A = {a1, …, a11} and B = {first(a1):.67,
first(a2):.33, pos≻(a2, a3):1}
M1 = (Y1, V1), where Y1 = {a1, …, a5, a7, a8} and V1 = {first(a1):.67,
first(a2):.33, pos≻(a2, a3):1}
We generate data using the source schema M1.
Map Y1 to X1; e.g., a2 is mapped to x12.
first(a1) in V1 is rewritten as first(x11) in U1, and pos≻(a2, a3) as pos≻(x12, x13).
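The projection and rewriting steps in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding of constraints, the helper names `project` and `rewrite`, and the choice to carry the confidences over unchanged are not taken from the slides.

```python
# Domain constraints B as {(predicate, domain-field args): confidence}.
B = {
    ("first", ("a1",)): 0.67,
    ("first", ("a2",)): 0.33,
    ("pos>", ("a2", "a3")): 1.0,
}
# Fields chosen for source S1, in source order (Y1 from the example).
Y1 = ["a1", "a2", "a3", "a4", "a5", "a7", "a8"]


def project(constraints, fields):
    """Keep the constraints whose arguments all occur in `fields` (gives V_s)."""
    chosen = set(fields)
    return {key: c for key, c in constraints.items() if set(key[1]) <= chosen}


def rewrite(constraints, fields, source_id):
    """Rename domain fields to positional source fields (V_s -> U_s)."""
    rename = {a: f"x{source_id}{j}" for j, a in enumerate(fields, start=1)}
    return {
        (f, tuple(rename[a] for a in args)): c
        for (f, args), c in constraints.items()
    }


V1 = project(B, Y1)
U1 = rewrite(V1, Y1, source_id=1)
print(U1)
```

Because a6 is not in Y1, a7 is renamed to x16 in this sketch; only the positions within the source matter.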
11. Framework
Matching as Domain Schema Discovery
Let the data observed from source Ss be Is= (Xs, Us).
Given the matching Y={Ys: s ∊S}, learning the best
domain schema can be described as a probabilistic
optimization expression:
\[ M^{*} = \arg\max_{M} \prod_{s \in S} p(I_s \mid Y_s, M) \]
Similarly, if the domain schema M is given, the best
matching Y* = {Y*s : s ∊ S} can be discovered, again using
statistical techniques to find the most likely
assignment of domain fields to the fields of each
source:
\[ Y_s^{*} = \arg\max_{Y_s} p(I_s \mid Y_s, M) \quad \text{for each } s \in S \]
12. Framework
Matching as Domain Schema Discovery
Suppose X1 = {x11, x12, x13} and X2 = {x21, x22}.
Suppose we have one predicate function to check: first.
Then I1 = (X1, U1), where U1 = {first(x11):1}, and I2 = (X2, U2),
where U2 = {first(x21):1}.
Suppose Y1 = {a1, a2, a3} and Y2 = {a2, a3}.
Construct M1 = (Y1, V1) with V1 = {first(a1):1}, and M2 = (Y2, V2)
with V2 = {first(a2):1}.
first(a1) holds for M1 but not for M2, so first(a1) has
confidence 0.5; likewise, first(a2) holds for M2 but not for M1.
Combining the source schemas M1 and M2, the domain schema becomes
M = (A, B), where A = {a1, a2, a3} and B = {first(a1):.5,
first(a2):.5}.
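The arithmetic of this example can be reproduced with a short Python sketch. It assumes that the confidence of a constraint is the fraction of source schemas in which it holds, which is consistent with the 0.5 values above but is not spelled out on the slide; the function name `combine` and the tuple encoding are hypothetical.

```python
from collections import Counter


def combine(source_schemas):
    """Merge source schemas (Y_s, V_s) into a domain schema (A, B)."""
    fields = set()
    counts = Counter()
    for ys, vs in source_schemas:
        fields.update(ys)
        counts.update(vs)  # one vote per source in which the constraint holds
    n = len(source_schemas)
    confidences = {constraint: k / n for constraint, k in counts.items()}
    return fields, confidences


M1 = ({"a1", "a2", "a3"}, {("first", ("a1",))})
M2 = ({"a2", "a3"}, {("first", ("a2",))})
A, B = combine([M1, M2])
print(sorted(A), B)
```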
13. Framework Formulation and Overview
Field Model
A field model a is a statistical model specifying how to generate
instances.
A field model a is a function that accepts an instance z and
produces p(z|a), the likelihood that z is an instance
produced by the field model a.
Statistical Constraint
A statistical constraint b is written as f(e):c
f is a predicate name, e is the vector of elements, and c is a confidence
value in the range [0,1].
14. Framework Formulation and Overview
Overall, our framework translates the problem of instance-
based matching into a schema-discovery problem.
With such a strategy, we leverage not only the data instances
but also the regularities observed from the data in a principled
way.
15. Algorithm
To solve our matching problem, we need to discover
either an optimal matching Y* or an optimal schema
M*.
If one of them is obtained, the other can be derived.
The basic idea is to start from an initial guess of the
matching Y and iteratively improve it using the
schema M derived from the current
estimate of Y.
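The alternation between matching and schema can be sketched as a generic loop. This is a minimal sketch, not the authors' implementation: the three steps are passed in as functions, the stopping rule (stop when the matching no longer changes) and `max_iters` are assumptions, and the toy functions below are hypothetical stand-ins that label 1-D values by nearest group mean.

```python
def iterate_matching(data, init_match, learn_schema, schema_match, max_iters=20):
    """Alternate between schema learning and matching until Y stops changing."""
    matching = init_match(data)
    schema = learn_schema(data, matching)
    for _ in range(max_iters):
        new_matching = schema_match(data, schema)
        if new_matching == matching:  # assumed stopping rule
            break
        matching = new_matching
        schema = learn_schema(data, matching)
    return matching, schema


# Toy stand-ins (hypothetical): two groups, "schema" = the two group means.
def toy_init(values):
    return [i % 2 for i in range(len(values))]


def toy_learn(values, labels):
    means = []
    for g in (0, 1):
        members = [v for v, l in zip(values, labels) if l == g]
        means.append(sum(members) / len(members) if members else 0.0)
    return means


def toy_match(values, means):
    return [min((0, 1), key=lambda g: abs(v - means[g])) for v in values]


labels, means = iterate_matching([1.0, 1.2, 9.0, 9.5], toy_init, toy_learn, toy_match)
print(labels)
```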
16. Algorithm
InitMatch
This function generates an initial matching to serve as the starting
point for the iterations.
EnumRelations
We need to identify the constraints occurring in the input data.
Predicate Function
f(i1, …, ik, X)
i1, …, ik: the elements to be checked against the
predicate f; X is the original data.
True: the input satisfies the predicate
False: otherwise
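A predicate function and the enumeration step might look as follows in Python. The semantics given here to `first` and to the ordering predicate, the representation of X as a list of records (each an ordered list of field ids), and the helper names are all assumptions made for illustration; the slides do not define them.

```python
from itertools import permutations


def first(i, X):
    """True if field i leads every record in which it appears (assumed semantics)."""
    rows = [r for r in X if i in r]
    return bool(rows) and all(r[0] == i for r in rows)


def pos_before(i, j, X):
    """True if field i precedes field j whenever both appear (assumed semantics)."""
    rows = [r for r in X if i in r and j in r]
    return bool(rows) and all(r.index(i) < r.index(j) for r in rows)


# Predicate name -> (function, number of elements it checks).
PREDICATES = {"first": (first, 1), "pos>": (pos_before, 2)}


def enum_relations(X):
    """List every (predicate name, element tuple) that holds on the data X."""
    fields = sorted({f for r in X for f in r})
    found = []
    for name, (pred, arity) in PREDICATES.items():
        for args in permutations(fields, arity):
            if pred(*args, X):
                found.append((name, args))
    return found


X1 = [["x11", "x12", "x13"], ["x11", "x13"]]
relations = enum_relations(X1)
print(relations)
```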
17. Algorithm
LearnSchema – From matching to schema
The aim is to construct a schema from a given matching.
First, group the matched source fields together.
Each group is then used to train a field model,
which is modeled as a 2-state HMM.
Learning an HMM a from a set of instances and computing
the probability p(z|a) for a given instance z follow the
standard HMM training and inference algorithms.
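For the inference half, p(z|a) can be computed with the standard forward algorithm. The sketch below assumes discrete emissions over token classes and a non-empty instance z; the parameter values and the token classes "word" and "num" are made up for illustration, and training (e.g. Baum-Welch) is not shown.

```python
def hmm_likelihood(z, start, trans, emit):
    """Forward algorithm: p(z | a) for a discrete-emission HMM (z non-empty)."""
    states = range(len(start))
    # alpha[s] = p(z[0..t], state at time t is s)
    alpha = [start[s] * emit[s].get(z[0], 0.0) for s in states]
    for symbol in z[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in states) * emit[s].get(symbol, 0.0)
            for s in states
        ]
    return sum(alpha)


# Hypothetical 2-state field model over two token classes.
start = [0.8, 0.2]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [{"word": 0.9, "num": 0.1}, {"word": 0.2, "num": 0.8}]

p = hmm_likelihood(["word", "num"], start, trans, emit)
print(p)
```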
18. Algorithm
SchemaMatch – From Schema to Matching
Given the domain schema, matching becomes labeling the
elements of sources with the appropriate domain fields.
For each hj ∊ Vs with the corresponding bj ∊ B, let their
constraint be fj(yi1, …, yik). We define
qi, j (a ) z (a ) (a ) p (h j | b j ) ( yl ) ( yl )
i i l l
i1 ,..., i k , y i a l i1 ,.., i k , l i
qi ( a ) z qi, j (a )
j
The most likely value for each yi is thus:
\[ y_i^{*} = \arg\max_{a \in A} q_i(a) \]
19. Algorithm
MetaMatch:
Adopt the F-measure to measure the consistency.
\[ F_{i,j} = \frac{2 R_{i,j} P_{i,j}}{R_{i,j} + P_{i,j}} \]
For two matchings m1 and m2, using m1 as the testee and m2 as
the tester,
\[ F(m_1, m_2) = \sum_{i \in m_2} \frac{n_i}{n} \max_{j \in m_1} \{ F_{i,j} \} \]
Let the candidates generated during this process be C and
the n matchings be R = {r1, …, rn}. The final matching is obtained
as:
\[ m^{*} = \arg\max_{m \in C} \sum_{r \in R} F(m, r) \]
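The consistency measure and the final selection can be sketched in Python as below. A matching is represented as a list of groups (sets of source-field ids). The definitions of recall and precision as overlap relative to the tester group and the testee group, and the reading of n_i as the size of tester group i and n as the total size, are the usual clustering conventions and are assumed here rather than taken from the slide.

```python
def pair_f(tester_group, testee_group):
    """F_{i,j} = 2RP / (R + P) between two groups of source fields."""
    overlap = len(tester_group & testee_group)
    if overlap == 0:
        return 0.0
    recall = overlap / len(tester_group)      # assumed definition of R_{i,j}
    precision = overlap / len(testee_group)   # assumed definition of P_{i,j}
    return 2 * recall * precision / (recall + precision)


def matching_f(testee, tester):
    """F(m1, m2): size-weighted best F over the groups of the tester m2."""
    n = sum(len(g) for g in tester)
    return sum(len(gi) / n * max(pair_f(gi, gj) for gj in testee) for gi in tester)


def meta_match(candidates, results):
    """Pick the candidate with the highest total F against all matchings."""
    return max(candidates, key=lambda m: sum(matching_f(m, r) for r in results))


m1 = [{"x11", "x21"}, {"x12", "x22"}]
m2 = [{"x11", "x21", "x12"}, {"x22"}]
best = meta_match([m1, m2], [m1, m1, m2])
print(matching_f(m1, m2), best)
```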
InitMatch aims to guess an initial matching to serve as
the starting point of the iterative computation.
21. Experiments
Data set
Four domains
For each domain, collect 10 sources
22. Experiments
Comparison Methods
PairMatch: adopts a corpus-based approach
ClusMatch:
ChainMatch: e.g., 1-2-3-4
ProgMatch: e.g., (((1-2)-3)-4)
InitMatch: an extension of pairwise matching
HoliMatch
Performance
The matching accuracy is measured using the F-measure.
Given the result matching m and the correct matching c, the
F-measure is F(m, c), indicating how close m is to c.
23. Experiments
Matching on Correct Extraction Data
[Figure: Matchers; Iterations]
24. Experiments
Matching on Correct Extraction Data
[Figure: Sources]
25. Experiments
Matching on Correct Extraction Data
[Figure: Pairwise]
26. Experiments
Matching on Real Extraction Data