On Leveraging Crowdsourcing Techniques for Schema Matching Networks

1
Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Karl Aberer
École Polytechnique Fédérale de Lausanne, Switzerland
Zoltán Miklós
Université de Rennes 1, IRISA, France
DASFAA 2013, Part II, LNCS 7826, pp. 139 – 154, 2013

2
Database schema matching is an active research field:
Surveys: [1], [2]
Applications: data transformation, data migration, data alignment, …
Automatic Matching Tools: COMA++, AMC, OpenII, Falcon, …
Schema matching is the task of establishing correspondences that connect related attributes in two (independently developed) database schemas.
[Figure: two example schemas, SA and SB, with correspondences drawn between related attributes (labels shown: BirthName, BirthDate, Address)]
[1] Rahm, E. et al. "A Survey of Approaches to Automatic Schema Matching". The VLDB Journal, 2001
[2] Bernstein, P.A. et al. "Generic Schema Matching, Ten Years Later". PVLDB, 2011

3
Automatic schema matchers will (sometimes) fail to identify the correct correspondences
There is a need for post-matching reconciliation through human input
This effort is the "real cost" for the company
Schemas do not appear alone; they are part of a matching network
The network-level consistency constraints are very important for business users

4
Real‐world scenario: a repository of schemas in the same domain
Schema matching network: connect schemas by pair‐wise matchings
Network‐level consistency constraints
Automatic tools produce incorrect correspondences → need validation by humans

7
DASFAA'2013, BDA'2013: On Leveraging Crowdsourcing Techniques for Schema Matching Networks
ER'2013: Minimizing Human Effort in Reconciling Match Networks
CoopIS'2013: Collaborative Schema Matching Reconciliation
ICDE'2014: Pay-as-you-go Reconciliation in Schema Matching Networks

8
"Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers." - Wikipedia
Our context: employ many workers (users) to validate the same correspondences and combine their answers.
Surveys: [1], [2]
A wide range of applications (e.g. CrowdSearch) have been developed on top of more than 70 crowdsourcing platforms (e.g. Amazon Mechanical Turk).
Our contribution:
Define network-level constraints in a schema matching network
Design questions for workers to validate correspondences
Leverage network-level constraints to reduce user effort
[1] E. Law et al. "Human Computation". Morgan & Claypool Publishers, 2011
[2] A. Doan et al. "Crowdsourcing Systems on the World-Wide Web". CACM, 2011

11
Three elements of questions:
Asking object: correspondence
Possible choices: simple YES/NO question
Support information: alternatives, constraint satisfactions, constraint violations

12
User feedback:

User   Question   Answer
U1     C          Yes
U2     C          Yes
U3     C          No

User quality:

User   Reliability
U1     r1
U2     r2
U3     r3

where r1 = Pr(C = true | U1 = yes) = Pr(C = false | U1 = no)

Answer aggregation: a probabilistic model (*) takes the user feedback and the user reliabilities, estimates Pr(C), and computes <a, e>, the aggregated value and its error rate.

Corr   Aggregation   Error Rate
C      True          0.19

(*) Majority Voting, Expectation Maximization, ...
See full paper for details
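
The aggregation step can be illustrated with a minimal sketch. It is not the model from the full paper: it assumes independent workers, a uniform prior on C, and a symmetric reliability r_i interpreted as the probability that worker U_i answers correctly. The function name `aggregate` and the reliability values used below are hypothetical, so the sketch does not reproduce the 0.19 shown in the table.

```python
from math import prod

def aggregate(answers, reliabilities, prior=0.5):
    """Return (aggregated_value, error_rate) for one correspondence.

    answers: list of bool, True meaning "Yes".
    reliabilities: list of floats in (0, 1), one per worker.
    Assumes independent workers with symmetric reliability (illustrative only).
    """
    # Likelihood of the observed answers if the correspondence is true / false.
    like_true = prod(r if a else 1 - r for a, r in zip(answers, reliabilities))
    like_false = prod(1 - r if a else r for a, r in zip(answers, reliabilities))
    # Bayes' rule gives the posterior probability that the correspondence holds.
    p_true = prior * like_true / (prior * like_true + (1 - prior) * like_false)
    value = p_true >= 0.5
    error_rate = 1 - max(p_true, 1 - p_true)
    return value, error_rate

# U1 = Yes, U2 = Yes, U3 = No with hypothetical reliabilities.
value, error_rate = aggregate([True, True, False], [0.9, 0.8, 0.7])
print(value, round(error_rate, 3))
```

Under these assumptions the error rate is simply the posterior mass on the value that was not chosen, which matches the <a, e> pair described on the slide.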

13
To achieve higher accuracy, we need more answers → cost-accuracy trade-off
[Figure: cost-accuracy plot for worker reliability r = 0.6, with the goal marked]
Solution: leverage constraints to reduce the error rate
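
The trade-off can be made concrete with a small sketch. It assumes plain majority voting over n independent workers who are each correct with probability r = 0.6; this is an illustrative stand-in, not necessarily the aggregation used for the slide's figure, and `majority_accuracy` is a hypothetical helper name.

```python
from math import comb

def majority_accuracy(n, r):
    """Probability that a majority of n independent workers is correct.

    Each worker is correct with probability r; n is assumed to be odd
    so that ties cannot occur.
    """
    return sum(
        comb(n, k) * r**k * (1 - r) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Error rate of the majority answer as more (paid) answers are collected.
for n in (1, 3, 5, 11, 21, 41):
    print(n, round(1 - majority_accuracy(n, 0.6), 3))
```

With low-reliability workers the error rate falls slowly as n grows, which is why each extra point of accuracy costs many additional answers.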

14
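Idea: correspondences support each other if they satisfy a constraint
1-1 constraint: ONE source attribute matches only ONE target attribute
Example: attribute a of schema S has two candidate correspondences, ab1 and ab2, to attributes b1 and b2 of schema T, with
Pr(ab1 = true) = 0.8
Pr(ab2 = false) = 0.6

ab1   ab2   Prob   1-1 constraint
T     T     0.32   not satisfied
T     F     0.48   satisfied
F     T     0.08   satisfied
F     F     0.12   satisfied

By independence, e.g. Pr(ab1 = T, ab2 = F) = 0.8 x 0.6 = 0.48

\[
\Pr(ab_2 = \mathit{false} \mid \gamma_{1\text{-}1}) = \frac{0.48 + 0.12}{0.48 + 0.08 + 0.12} \approx 0.88
\]

                     Corr   Aggregation   Error Rate
Without constraint   ab2    False         0.4  (*)
With constraint      ab2    False         0.12 (**)

0.4 > 0.12: the constraint lowers the error rate.
(*) Error Rate = 1 - Pr(ab2 = false)   (**) Error Rate = 1 - Pr(ab2 = false | γ1-1)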
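
The 1-1 example can be checked numerically. The sketch below enumerates the four truth assignments, drops the one that violates the 1-1 constraint, and renormalizes; it assumes the two correspondences are independent a priori, as on the slide. The function name is hypothetical.

```python
from itertools import product

def pr_ab2_false_given_one_to_one(p_ab1_true, p_ab2_true):
    """Pr(ab2 = false | 1-1 constraint), assuming prior independence."""
    satisfied = 0.0
    satisfied_and_ab2_false = 0.0
    for ab1, ab2 in product([True, False], repeat=2):
        prob = (p_ab1_true if ab1 else 1 - p_ab1_true) * (
            p_ab2_true if ab2 else 1 - p_ab2_true
        )
        if ab1 and ab2:
            # Both correspondences true: attribute a would match two targets.
            continue
        satisfied += prob
        if not ab2:
            satisfied_and_ab2_false += prob
    return satisfied_and_ab2_false / satisfied

# Pr(ab1 = true) = 0.8 and Pr(ab2 = true) = 1 - 0.6 = 0.4, as in the example.
p = pr_ab2_false_given_one_to_one(0.8, 0.4)
print(round(p, 2), round(1 - p, 2))  # prints 0.88 0.12
```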

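15
Circle constraint: a sequence of correspondences forms a closed circle
Δ: probability of compensating errors along the circle (*)
Example: attributes a (schema S1), b (schema S2), and c (schema S3) are connected by the correspondences ab, bc, and ac, with
Pr(ab = T) = 0.8, Pr(bc = T) = 0.8, Pr(ac = T) = 0.8

ab   bc   ac   Prob    Pr(constraint satisfied)
T    T    T    0.512   1.0
T    T    F    0.128   0.0
T    F    T    0.128   0.0
T    F    F    0.032   Δ
F    T    T    0.128   0.0
F    T    F    0.032   Δ
F    F    T    0.032   Δ
F    F    F    0.008   Δ

By independence, e.g. Pr(ab = T, bc = T, ac = T) = 0.8 x 0.8 x 0.8 = 0.512

\[
\Pr(ab = T \mid \gamma_{\mathit{circle}}) = \frac{0.512 + \Delta \times 0.032}{0.512 + 3 \times \Delta \times 0.032 + \Delta \times 0.008} \approx 0.973 \quad \text{with } \Delta = 0.2
\]

                     Corr   Aggregation   Error Rate
Without constraint   ab     True          0.2   (**)
With constraint      ab     True          0.027 (***)

0.2 > 0.027: the constraint lowers the error rate.
(**) Error Rate = 1 - Pr(ab = T)   (***) Error Rate = 1 - Pr(ab = T | γcircle)
(*) Cudré-Mauroux et al. "Probabilistic Message Passing in Peer Data Management Systems". ICDE, 2006.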
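
The circle example can be reproduced with a short sketch. It weights each truth assignment by the probability that the circle constraint is satisfied: 1 when all three correspondences are true, 0 when exactly one is false, and Δ when two or more are false (errors that compensate each other). Prior independence is assumed, and the function names are hypothetical.

```python
from itertools import product

def constraint_weight(assignment, delta):
    """Probability that the circle constraint holds for a truth assignment."""
    n_false = sum(1 for value in assignment if not value)
    if n_false == 0:
        return 1.0
    if n_false == 1:
        return 0.0  # a single wrong correspondence breaks the circle
    return delta  # several errors may compensate each other

def pr_first_true_given_circle(probs, delta):
    """Pr(first correspondence = T | circle constraint), assuming independence."""
    numerator = 0.0
    denominator = 0.0
    for assignment in product([True, False], repeat=len(probs)):
        prob = 1.0
        for value, p in zip(assignment, probs):
            prob *= p if value else 1 - p
        weighted = constraint_weight(assignment, delta) * prob
        denominator += weighted
        if assignment[0]:
            numerator += weighted
    return numerator / denominator

# ab, bc, ac all have prior probability 0.8; delta = 0.2 as in the example.
p = pr_first_true_given_circle([0.8, 0.8, 0.8], delta=0.2)
print(round(p, 3), round(1 - p, 3))  # prints 0.973 0.027
```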

16
Settings:
Real‐world schemas. Use ground truth to simulate users/workers.
Error threshold = 0.1: make a decision when the error rate < 0.1; otherwise, continue to ask users.
Metric: Cost =
Observation: Cost (With Constraints) < Cost (Without Constraints)
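
The stopping rule in these settings can be sketched as a loop that keeps asking simulated workers until the error rate drops below the threshold. This is an illustration only: it assumes a uniform prior, independent workers with one shared symmetric reliability, and it counts the number of answers as a stand-in for cost, since the slide's cost formula is not reproduced here. All function names and the reliability value 0.8 are hypothetical.

```python
import random

def ask_until_confident(answers, reliability, threshold=0.1, prior=0.5):
    """Consume answers until the error rate is below the threshold.

    Returns (decision, error_rate, number_of_answers_used).
    """
    p_true = prior
    asked = 0
    for answer in answers:
        asked += 1
        like_true = reliability if answer else 1 - reliability
        like_false = 1 - reliability if answer else reliability
        # Sequential Bayesian update of Pr(correspondence is true).
        p_true = p_true * like_true / (p_true * like_true + (1 - p_true) * like_false)
        if 1 - max(p_true, 1 - p_true) < threshold:
            break
    return p_true >= 0.5, 1 - max(p_true, 1 - p_true), asked

def simulated_answers(truth, reliability, rng, max_answers=1000):
    """Yield answers that agree with the ground truth with probability `reliability`."""
    for _ in range(max_answers):
        yield truth if rng.random() < reliability else not truth

rng = random.Random(0)  # fixed seed so the simulation is repeatable
decision, error_rate, cost = ask_until_confident(
    simulated_answers(True, 0.8, rng), reliability=0.8
)
print(decision, round(error_rate, 3), cost)
```

Any mechanism that lowers the error rate per answer, such as the constraints above, lets this loop stop earlier and therefore lowers the number of answers that must be paid for.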

17
We model a crowdsourcing process for schema matching networks.
We address optimization goals: minimize monetary cost, maximize accuracy (minimize error rate).
We design a variety of questions with different support information.
We leverage consistency constraints → reduce error rate → reduce the monetary cost.
