Integrating Web Query Results: Holistic Schema Matching

CIKM’08
Shui-Lung Chuang
Kevin Chen-Chuan Chang

Yen-Ling Lin
2009/04/13

26 pages
Outline

 - Introduction
 - Approach
 - Framework
 - Algorithm
 - Experiments

Introduction - Schema Matching on Query Results

 - Data fields are the basic units that matching operates on.
 - A data field can be viewed as a label plus a set of values.
 - Query results lack explicit and complete schema information.
 - To overcome these challenges, we exploit two opportunities that arise when integrating query results:
    1) We often need to integrate multiple sources, and useful effects occur naturally when many sources are cross-referenced.
    2) Although no schema-based constraints are available, useful regularities can be observed across many sources. Treated as observed domain constraints, these regularities are very helpful for discovering matchings.

Introduction - Approach

 - The enrichment occurs at three levels:
    1. The content of a field
    2. The kinds of fields
    3. The constraints of fields
 - With all of the above enrichment, we learn a more complete schema that describes the whole input data.
 - This learned schema can then guide further matching.

Framework - Problem Statement

 - Suppose A = {a1, a2, …} for the book source. For source S1, the fields X1 = (x11, x12, …, x17) can be assigned the matching Y1 = (a1, a2, …, a7).
 - Matching means discovering, for each source, the assignment of the groups in A to that source's fields:
   Ys = (ys1, …, ysls), where each ysi ∊ A is the group to which source field xsi ∊ Xs is assigned.

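As a minimal sketch of the notation above, a matching can be stored as one tuple of group names per source, aligned position by position with that source's fields. The field labels, the two sources, and the helper `group_fields` are made-up placeholders for illustration, not data or code from the paper. Inverting the matching collects the source fields assigned to each group:

```python
from collections import defaultdict

# Hypothetical source fields X_s (labels only) and matchings Y_s, aligned by position.
fields = {
    "S1": ("x11", "x12", "x13"),
    "S2": ("x21", "x22"),
}
matching = {
    "S1": ("a1", "a2", "a3"),
    "S2": ("a2", "a3"),
}


def group_fields(fields, matching):
    """Invert a matching: map each group in A to the source fields assigned to it."""
    groups = defaultdict(list)
    for source, xs in fields.items():
        ys = matching[source]
        if len(xs) != len(ys):
            raise ValueError(f"{source}: expected exactly one group per field")
        for x, y in zip(xs, ys):
            groups[y].append(x)
    return dict(groups)


groups = group_fields(fields, matching)
```

Here `groups["a2"]` holds the fields from both sources that were matched to group a2, which is the view a later schema-learning step would work from.
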
Framework - Matching as Domain Schema Discovery

 - Let the domain schema be M = (A, B):
    - A: the set of domain fields
    - B: the statistical constraints
 - Each source Ss generates its data as follows:
    1) Project M onto a source schema Ms = (Ys, Vs):
        a) Ys: a subset of A that serves as the fields of source Ss
        b) Vs: a set of constraints instantiated from B
    2) Construct the source instances Xs.
    3) Vs → Us, Ys → Xs, giving Is = (Xs, Us).
    4) Output: Xs

Framework - Matching as Domain Schema Discovery

 - The data-generation procedure can be sketched conceptually with an example:
 - M = (A, B), where A = {a1, …, a11} and B = {first(a1): .67, first(a2): .33, pos≻(a2, a3): 1}
    - M1 = (Y1, V1), where Y1 = {a1, …, a5, a7, a8} and V1 = {first(a1): .67, first(a2): .33, pos≻(a2, a3): 1}
    - We generate data using source schema M1:
        - Map Y1 to X1; e.g., a2 is mapped to x1,2.
        - first(a1) in V1 is rewritten as first(x11) in U1, and pos≻(a2, a3) as pos≻(x12, x13).

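The projection and rewriting steps of this example can be sketched in a few lines. This is an illustration under assumptions, not the authors' procedure: constraints are stored as a dictionary keyed by (predicate, elements), a constraint is kept for a source when all of its elements occur in Ys, and the confidence values are simply copied, since the slide does not spell out how instantiation treats them. The function names `project` and `rewrite` and the key `"pos>"` (standing in for pos≻) are hypothetical.

```python
def project(domain_constraints, source_fields):
    """Keep the constraints of B whose elements all occur in Y_s (giving V_s)."""
    return {
        (pred, elems): conf
        for (pred, elems), conf in domain_constraints.items()
        if all(e in source_fields for e in elems)
    }


def rewrite(source_constraints, source_fields, s):
    """Rewrite V_s over domain fields as U_s over source fields x_{s,i}."""
    # The i-th field of Y_s becomes x_{s,i}; positions are 1-based as on the slide.
    name = {a: f"x{s},{i}" for i, a in enumerate(source_fields, start=1)}
    return {
        (pred, tuple(name[e] for e in elems)): conf
        for (pred, elems), conf in source_constraints.items()
    }


B = {
    ("first", ("a1",)): 0.67,
    ("first", ("a2",)): 0.33,
    ("pos>", ("a2", "a3")): 1.0,
}
Y1 = ("a1", "a2", "a3", "a4", "a5", "a7", "a8")

V1 = project(B, Y1)
U1 = rewrite(V1, Y1, s=1)
```

With these inputs, `U1` contains `("first", ("x1,1",))` and `("pos>", ("x1,2", "x1,3"))`, mirroring the rewriting of first(a1) and pos≻(a2, a3) described above.
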
Framework - Matching as Domain Schema Discovery

 - Let the data observed from source Ss be Is = (Xs, Us).
 - Given the matching Y = {Ys : s ∊ S}, learning the best domain schema can be written as a probabilistic optimization:

   \[ M^{*} = \arg\max_{M} \prod_{s \in S} p(I_s \mid Y_s, M) \]

 - Similarly, if the domain schema M is given, the best matching Y* = {Ys* : s ∊ S} can be discovered, again using statistical techniques to find the most likely assignment of domain fields to the fields of each source:

   \[ Y_s^{*} = \arg\max_{Y_s} p(I_s \mid Y_s, M) \quad \text{for each } s \in S \]

Framework - Matching as Domain Schema Discovery

 - Suppose X1 = {x11, x12, x13} and X2 = {x21, x22}, and that we have a single predicate function to check: first. Then I1 = (X1, U1), where U1 = {first(x11): 1}, and I2 = (X2, U2), where U2 = {first(x21): 1}.
 - Suppose Y1 = {a1, a2, a3} and Y2 = {a2, a3}. We construct M1 = (Y1, V1) with V1 = {first(a1): 1}, and M2 = (Y2, V2) with V2 = {first(a2): 1}.
 - first(a1) holds for M1 but not for M2, so it has confidence 0.5; likewise, first(a2) holds for M2 but not for M1. Combining the source schemas M1 and M2, the domain schema becomes M = (A, B), where A = {a1, a2, a3} and B = {first(a1): .5, first(a2): .5}.

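The arithmetic of this example can be reproduced with a short sketch. It assumes that the confidence of a constraint is the fraction of source schemas in which it holds; that reading matches the 0.5 values above, but the paper may define confidence more generally. The function name `combine` and the string encoding of constraints are hypothetical.

```python
from collections import Counter


def combine(source_schemas):
    """Combine source schemas (Y_s, V_s) into a domain schema M = (A, B)."""
    A = set()
    holds = Counter()
    for Y, V in source_schemas:
        A.update(Y)
        holds.update(V)  # V is the set of constraints that hold in this source
    n = len(source_schemas)
    # Assumed definition: confidence = share of sources in which the constraint holds.
    B = {constraint: count / n for constraint, count in holds.items()}
    return A, B


M1 = (("a1", "a2", "a3"), {"first(a1)"})
M2 = (("a2", "a3"), {"first(a2)"})

A, B = combine([M1, M2])
```
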
Framework Formulation and Overview

 - Field model
    - A field model a is a statistical model that specifies how instances of a field are generated.
    - Equivalently, a is a function that accepts an instance z and returns p(z | a), the likelihood that z was produced by the field model a.
 - Statistical constraint
    - A statistical constraint b is written as f(e):c, where f is a predicate name, e is the vector of elements, and c is a confidence value in the range [0, 1].

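To make the two definitions concrete, here is a minimal sketch of both as data structures. The class names are hypothetical, and `UnigramFieldModel` is a deliberately simple stand-in (a smoothed character unigram model), chosen only to show the interface "instance z in, p(z | a) out"; it is not the model used in the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class StatisticalConstraint:
    predicate: str     # f, the predicate name
    elements: tuple    # e, the vector of elements
    confidence: float  # c, a value in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")

    def __str__(self):
        return f"{self.predicate}({', '.join(self.elements)}):{self.confidence:g}"


class UnigramFieldModel:
    """Toy field model: p(z | a) under a smoothed character unigram model."""

    def __init__(self, instances, alphabet_size=256):
        self.counts = Counter("".join(instances))
        self.total = sum(self.counts.values())
        self.alphabet_size = alphabet_size

    def likelihood(self, z):
        p = 1.0
        for ch in z:
            # Add-one smoothing keeps unseen characters at non-zero probability.
            p *= (self.counts[ch] + 1) / (self.total + self.alphabet_size)
        return p


constraint = StatisticalConstraint("first", ("a1",), 0.67)
price_model = UnigramFieldModel(["$12.99", "$8.50"])
```

A model trained on price-like strings should then score `"$9.99"` higher than an unrelated string of the same length such as `"hello"`.
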
Framework Formulation and Overview

 - Overall, our framework translates the problem of instance-based matching into a schema-discovery problem.
 - With this strategy, we leverage, in a principled way, not only the data instances but also the regularities observed in the data.

Algorithm

 - To solve the matching problem, we need to discover either an optimal matching Y* or an optimal schema M*.
 - Once one of them is obtained, the other can be derived.
 - The basic idea is to start from an initial guess of the matching Y and iteratively improve it using the schema M derived from the current estimate of Y.

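The alternation described above can be sketched as a generic loop. Everything here is an assumption made for illustration: the function `holistic_match`, the stopping rule (stop when the matching no longer changes, or after `max_iters` rounds), and the toy callbacks, which treat each field as a single number and a "schema" as the mean value per group. The loop only shows the control flow and does not include any selection among intermediate candidates.

```python
def holistic_match(sources, init_match, learn_schema, schema_match, max_iters=20):
    """Alternate between learning a schema and re-matching until the matching is stable."""
    matching = init_match(sources)
    schema = learn_schema(sources, matching)
    for _ in range(max_iters):
        new_matching = schema_match(sources, schema)
        if new_matching == matching:
            break
        matching = new_matching
        schema = learn_schema(sources, matching)
    return matching, schema


def toy_init(sources):
    # Deliberately poor initial guess: the two fields of S2 are swapped.
    return {"S1": ("a1", "a2"), "S2": ("a2", "a1")}


def toy_learn(sources, matching):
    # Toy "schema": the mean value of the fields assigned to each group.
    values = {}
    for s, xs in sources.items():
        for x, y in zip(xs, matching[s]):
            values.setdefault(y, []).append(x)
    return {g: sum(v) / len(v) for g, v in values.items()}


def toy_match(sources, schema):
    # Assign each field to the group whose mean is closest.
    return {
        s: tuple(min(schema, key=lambda g: abs(schema[g] - x)) for x in xs)
        for s, xs in sources.items()
    }


sources = {"S1": (1.0, 9.0), "S2": (3.0, 10.0)}
matching, schema = holistic_match(sources, toy_init, toy_learn, toy_match)
```

In this toy run the swapped guess for S2 is corrected after one round, and the second round confirms that the matching has stabilized.
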
Algorithm

 - InitMatch
    - Generates an initial matching, which serves as the starting point for the iterations.
 - EnumRelations
    - Identifies the constraints that occur in the input data.
    - Predicate function: f(i1, …, ik, X)
        - i1, …, ik: the elements whose satisfaction of the predicate f is checked; X is the original data.
        - True: the input satisfies the predicate
        - False: otherwise

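A minimal sketch of predicate functions and their enumeration follows. The semantics are assumptions: first(i) is read as "field i is the first field of the source" and pos≻(i, j) as "field i appears before field j", with X reduced to the ordered list of a source's fields. The names `pos_before` and `enum_relations` are hypothetical.

```python
from itertools import permutations


def first(i, X):
    """True if field i is the first field in the source's field order X."""
    return len(X) > 0 and X[0] == i


def pos_before(i, j, X):
    """True if field i appears before field j in X (assumed reading of pos≻)."""
    return i in X and j in X and X.index(i) < X.index(j)


def enum_relations(X, unary, binary):
    """Enumerate every (predicate name, elements) pair that holds on X."""
    found = set()
    for name, f in unary.items():
        found.update((name, (i,)) for i in X if f(i, X))
    for name, f in binary.items():
        found.update((name, (i, j)) for i, j in permutations(X, 2) if f(i, j, X))
    return found


X1 = ["x11", "x12", "x13"]
relations = enum_relations(X1, {"first": first}, {"pos>": pos_before})
```
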
Algorithm

 - LearnSchema: from matching to schema
    - Aims to construct a schema from a given matching.
    - First, the matched source fields are grouped together.
    - Each group is then trained as a field model, which is modeled as a 2-state HMM.
    - Learning an HMM a from a set of instances, and computing the probability p(z | a) of a given instance z, follow the standard HMM training and inference algorithms.

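For the inference half, p(z | a) can be computed with the standard forward algorithm. The sketch below covers inference only, with hand-set parameters rather than trained ones (training would typically use Baum-Welch). The parameter values, the symbol classes, and the `floor` used as crude smoothing for unseen symbols are all assumptions for illustration; the slides do not say what the two states represent.

```python
def forward_likelihood(z, start, trans, emit, floor=1e-6):
    """Return p(z | a) for observation sequence z under a discrete HMM (forward algorithm)."""
    n_states = len(start)
    if not z:
        return 1.0
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s].get(z[0], floor) for s in range(n_states)]
    for symbol in z[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states))
            * emit[s].get(symbol, floor)
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical 2-state model for a price-like field over symbol classes:
# "$" = currency sign, "d" = digit, "." = decimal point.
start = [0.9, 0.1]
trans = [[0.1, 0.9], [0.0, 1.0]]
emit = [{"$": 0.9, "d": 0.1}, {"d": 0.8, ".": 0.2}]

p_price = forward_likelihood(["$", "d", "d"], start, trans, emit)
p_other = forward_likelihood(["d", "$", "d"], start, trans, emit)
```

Under these hand-set parameters, the sequence that starts with a currency sign is far more likely than the one with the sign in the middle, which is the kind of signal a field model contributes to matching.
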
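Algorithm

 - SchemaMatch: from schema to matching
    - Given the domain schema, matching becomes labeling the elements of the sources with the appropriate domain fields.
    - For each hj ∈ Vs with corresponding bj ∈ B, let their constraint be fj(yi1, …, yik). We define

      \[ q_{i,j}(a) = z \, [\,\cdot\,]_i(a) \, [\,\cdot\,]_i(a) \sum_{i_1,\dots,i_k :\, y_i = a} p(h_j \mid b_j) \prod_{l \in \{i_1,\dots,i_k\},\, l \neq i} [\,\cdot\,]_l(y_l) \, [\,\cdot\,]_l(y_l) \]

      \[ q_i(a) = z \, [\,\cdot\,]_{j} \; q_{i,j}(a) \]

      where each [·] marks a symbol that is illegible in the source slide: two per-field factors in the first equation and the operator over j in the second.
    - The most likely value for each yi is thus:

      \[ y_i^{*} = \arg\max_{a \in A} q_i(a) \]
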
Algorithm

 - MetaMatch
    - Adopts the F-measure to measure the consistency between matchings:

      \[ F_{i,j} = \frac{2 R_{i,j} P_{i,j}}{R_{i,j} + P_{i,j}} \]

    - For two matchings m1 and m2, using m1 as testee and m2 as tester:

      \[ F(m_1, m_2) = \sum_{i \in m_2} \frac{n_i}{n} \max_{j \in m_1} \{ F_{i,j} \} \]

    - Let the candidates generated during this process be C and the n matchings be R = {r1, …, rn}. The final matching is obtained as:

      \[ m^{*} = \arg\max_{m \in C} \sum_{r \in R} F(m, r) \]

 - InitMatch guesses an initial matching, which serves as the starting point of the iterative computation.

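The consistency measure and the final selection can be sketched as follows. The slide does not define R, P, ni, or n, so this sketch assumes the usual clustering F-measure reading: a matching is a list of groups (sets of source fields), and for tester group i and testee group j, recall is |i ∩ j| / |i|, precision is |i ∩ j| / |j|, ni = |i|, and n is the total size of the tester groups. The function names are hypothetical, and both matchings are assumed to be non-empty.

```python
def f_score(tester_group, testee_group):
    """F-measure between one tester group and one testee group (sets of fields)."""
    overlap = len(tester_group & testee_group)
    if overlap == 0:
        return 0.0
    recall = overlap / len(tester_group)
    precision = overlap / len(testee_group)
    return 2 * recall * precision / (recall + precision)


def matching_f(testee, tester):
    """F(m1, m2) with m1 as testee and m2 as tester; each matching is a list of sets."""
    n = sum(len(group) for group in tester)
    return sum(
        len(group) / n * max(f_score(group, other) for other in testee)
        for group in tester
    )


def meta_match(candidates, references):
    """Pick the candidate that is most consistent with all reference matchings."""
    return max(candidates, key=lambda m: sum(matching_f(m, r) for r in references))


reference = [{"x11", "x21"}, {"x12", "x22"}]
good = [{"x11", "x21"}, {"x12", "x22"}]
bad = [{"x11", "x22"}, {"x12", "x21"}]

best = meta_match([bad, good], [reference])
```

Here `good` agrees with the reference exactly and scores 1.0, while `bad` splits every reference group in half and scores 0.5, so `meta_match` returns `good`.
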
Algorithm

 - HoliMatch's algorithm

Experiments

 - Data set
    - Four domains
    - For each domain, 10 sources were collected.

Experiments

 - Comparison methods
    - PairMatch: adopts a corpus-based approach
    - ClusMatch
    - ChainMatch: e.g., 1-2-3-4
    - ProgMatch: e.g., (((1-2)-3)-4)
    - InitMatch: an extension of pairwise matching
    - HoliMatch
 - Performance
    - Matching accuracy is measured using the F-measure.
    - Given the result matching m and the correct matching c, the F-measure is F(m, c), indicating how close m is to c.

Experiments

 - Matching on Correct Extraction Data: Matchers; Iterations; Sources; Pairwise
 - Matching on Real Extraction Data


20090411

  • 1. Integrating Web Query Results: Holistic Schema Matching (CIKM’08; Shui-Lung Chuang, Kevin Chen-Chuan Chang; Yen-Ling Lin, 2009/04/13)
  • 2. Outline: Introduction, Approach, Framework, Algorithm, Experiments
  • 3. Introduction
  • 4. Introduction
  • 6. Introduction – Schema Matching on Query Results
      - Data fields are the basic units processed by matching.
      - A data field can be viewed as a label plus a set of values.
      - We lack explicit and complete schema information.
      - To overcome these challenges, we exploit two opportunities that arise when integrating query results:
          1) We often need to integrate multiple sources, and useful effects naturally occur when cross-referencing many sources.
          2) Although no schema-based constraints are available, useful regularities can be observed across many sources. These regularities, treated as observed domain constraints, are very helpful for discovering matchings.
  • 7. Introduction - Approach
      - The enrichment occurs at three levels:
          1. The content of a field
          2. The kinds of fields
          3. The constraints of fields
      - With all of the above enrichment, we learn a more complete schema to describe the whole input data.
      - This learned schema can then help us make further matchings.
  • 8. Framework – Problem Statement
      - Suppose A={a1,a2,…} for the book source. For source S1, the fields X1 = (x11,x12,…,x17) can be assigned the matching Y1 = (a1,a2,…,a7).
      - Matching is actually discovering the assignment of the groups in A to the fields of each source: Ys = (ys1,…,ysls), where each ysi ∊ A is the group to which source field xsi ∊ Xs is assigned.
  • 9. Framework – Matching as Domain Schema Discovery
      - Let the domain schema be M=(A, B)
          - A: the set of domain fields
          - B: the statistical constraints
      - For each source Ss:
          1) Project M onto a source schema Ms = (Ys, Vs)
              1) Ys: a subset of A to be the fields of source Ss
              2) Vs: a set of constraints instantiated from B
          2) Construct the source instances Xs
          3) Vs → Us, Ys → Xs: Is = (Xs, Us)
          4) Output: Xs
  • 10. Framework – Matching as Domain Schema Discovery
      - This procedure of data generation can be conceptually sketched as follows:
      - M=(A, B) where A={a1,…,a11} and B={first(a1):.67, first(a2):.33, pos≻(a2, a3):1}
      - M1=(Y1,V1) where Y1={a1,…,a5,a7,a8} and V1={first(a1):.67, first(a2):.33, pos≻(a2, a3):1}
      - We generate data using source schema M1.
      - Map Y1 as X1, e.g., a2 is mapped as x12.
      - first(a1) in V1 is rewritten as first(x11) in U1, and pos≻(a2,a3) as pos≻(x12,x13).
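
The projection and rewriting step on slide 10 can be illustrated with a minimal sketch. The function name `instantiate_source`, the tuple encoding of constraints, and the `pos>` spelling of the pos≻ predicate are assumptions made for illustration, not part of the original slides.

```python
def instantiate_source(source_id, Y, B):
    """Project domain constraints B onto a source whose ordered fields are Y.

    Hypothetical helper: constraints are encoded as (predicate, args) -> confidence.
    """
    # Map the k-th domain field in Y to the source field x<source_id><k>.
    field_map = {a: f"x{source_id}{k}" for k, a in enumerate(Y, start=1)}
    # V: keep only the constraints whose fields all occur in this source.
    V = {c: conf for c, conf in B.items() if all(a in field_map for a in c[1])}
    # U: rewrite each constraint over domain fields into one over source fields.
    U = {
        (pred, tuple(field_map[a] for a in args)): conf
        for (pred, args), conf in V.items()
    }
    return field_map, V, U


B = {
    ("first", ("a1",)): 0.67,
    ("first", ("a2",)): 0.33,
    ("pos>", ("a2", "a3")): 1.0,
}
Y1 = ["a1", "a2", "a3", "a4", "a5", "a7", "a8"]
X1, V1, U1 = instantiate_source(1, Y1, B)
```

Here `X1["a2"]` is `"x12"`, and the constraint on (a2, a3) becomes a constraint on (x12, x13), mirroring the rewriting described on the slide.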
  • 11. Framework – Matching as Domain Schema Discovery
      - Let the data observed from source Ss be Is = (Xs, Us).
      - Given the matching Y={Ys: s ∊ S}, learning the best domain schema can be described as a probabilistic optimization expression:
          $M^* = \arg\max_{M} \prod_{s \in S} p(I_s \mid Y_s, M)$
      - Similarly, if the domain schema M is given, the best matching $Y^* = \{Y_s^* : s \in S\}$ can be discovered, again using statistical techniques to find the most likely assignment of domain fields to the fields of each source:
          $Y_s^* = \arg\max_{Y_s} p(I_s \mid Y_s, M)$ for each s ∊ S
  • 12. Framework – Matching as Domain Schema Discovery
      - Suppose X1={x11,x12,x13} and X2={x21,x22}, and suppose we have one predicate function to check: first. Then I1=(X1,U1) where U1={first(x11):1}, and I2=(X2,U2) where U2={first(x21):1}.
      - Suppose Y1={a1,a2,a3} and Y2={a2,a3}. Construct M1=(Y1,V1) with V1={first(a1):1}, and M2=(Y2,V2) with V2={first(a2):1}.
      - first(a1) holds for M1 but not for M2, so first(a1) has confidence 0.5; likewise, first(a2) holds only for M2 and has confidence 0.5. Combining source schemas M1 and M2, the domain schema becomes M=(A, B) where A={a1, a2, a3} and B={first(a1):.5, first(a2):.5}.
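
The arithmetic of the slide 12 example can be reproduced with a short sketch. It assumes that a constraint's confidence in B is the average of its per-source confidences over all sources (counting 0 where it is absent), which matches the 0.5 values above; the function name `combine_source_schemas` is hypothetical.

```python
from collections import defaultdict


def combine_source_schemas(source_schemas):
    """Merge source schemas (Y_s, V_s) into a domain schema (A, B).

    Assumption: confidence in B = sum of per-source confidences / number of sources.
    """
    fields = set()
    totals = defaultdict(float)
    for y_s, v_s in source_schemas:
        fields.update(y_s)
        for constraint, conf in v_s.items():
            totals[constraint] += conf
    n = len(source_schemas)
    constraints = {c: total / n for c, total in totals.items()}
    return fields, constraints


M1 = ({"a1", "a2", "a3"}, {"first(a1)": 1.0})
M2 = ({"a2", "a3"}, {"first(a2)": 1.0})
A, B = combine_source_schemas([M1, M2])
```

With the two source schemas from the slide, `A` is `{"a1", "a2", "a3"}` and both constraints in `B` receive confidence 0.5.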
  • 13. Framework – Formulation and Overview
      - Field Model
          - A field model a is a statistical model specifying how to generate instances.
          - A field model a is a function that accepts an instance z and produces p(z|a), indicating the likelihood that z is an instance produced by the field model a.
      - Statistical Constraint
          - A statistical constraint b is written as f(e):c, where f is a predicate name, e is the vector of elements, and c is a confidence value in the range [0,1].
  • 14. Framework – Formulation and Overview
      - Overall, our framework translates the problem of instance-based matching into a schema-discovery problem.
      - With such a strategy, we leverage not only the data instances but also the regularities observed from the data in a principled way.
  • 15. Algorithm
      - To solve our matching problem, we need to discover either an optimal matching Y* or an optimal schema M*.
      - If one of them is obtained, the other can be derived.
      - The basic idea is to start from an initial guess of the matching Y and iteratively improve it using the schema M that is derived from the current estimate of Y.
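
The alternation between matching and schema described on slide 15 can be sketched as a simple loop. This is a toy illustration, not the paper's method: the field model is replaced by a plain set of values per group, and matching by Jaccard overlap. The names `learn_schema`, `schema_match`, and `iterate_matching` and the sample data are all hypothetical.

```python
def learn_schema(fields, matching):
    """Toy stand-in for LearnSchema: pool the values of each matched group."""
    schema = {}
    for field_id, group in matching.items():
        schema.setdefault(group, set()).update(fields[field_id])
    return schema


def schema_match(fields, schema):
    """Toy stand-in for SchemaMatch: pick the group with the largest Jaccard overlap."""
    def jaccard(u, v):
        union = u | v
        return len(u & v) / len(union) if union else 0.0

    return {
        field_id: max(sorted(schema), key=lambda g: jaccard(values, schema[g]))
        for field_id, values in fields.items()
    }


def iterate_matching(fields, init_matching, max_iters=10):
    """Alternate schema learning and matching until the matching stops changing."""
    matching = dict(init_matching)
    for _ in range(max_iters):
        schema = learn_schema(fields, matching)
        new_matching = schema_match(fields, schema)
        if new_matching == matching:
            break
        matching = new_matching
    return matching


fields = {
    "s1.f1": {"tolkien", "rowling"},
    "s1.f2": {"9.99", "12.50"},
    "s2.f1": {"rowling", "tolkien", "orwell"},
    "s2.f2": {"12.50", "7.00"},
}
# The initial guess wrongly puts s2.f1 in group a2.
init = {"s1.f1": "a1", "s1.f2": "a2", "s2.f1": "a2", "s2.f2": "a2"}
result = iterate_matching(fields, init)
```

In this toy run the mistaken initial assignment of `s2.f1` is corrected after one iteration, and the loop stops once the matching is stable.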
  • 16. Algorithm
      - InitMatch
          - This function generates an initial matching, which serves as the starting point for the iterations.
      - EnumRelations
          - We need to identify the constraints occurring in the input data.
          - Predicate function f(i1,…,ik, X): i1,…,ik indicate which elements to check against the predicate f, and X is the original data.
          - True: the input satisfies the predicate
          - False: otherwise
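
A predicate function of the form f(i1,…,ik, X) and the enumeration of constraints that hold in the data can be sketched as follows. This sketch assumes X is the ordered list of a source's fields, and it reads pos≻(a, b) as "a appears before b"; the names `precedes`, `PREDICATES`, and `enum_relations` are hypothetical.

```python
from itertools import permutations


def first(a, X):
    """True if element a is the first field in X."""
    return bool(X) and X[0] == a


def precedes(a, b, X):
    """True if element a appears before element b in X (assumed meaning of pos≻)."""
    return a in X and b in X and X.index(a) < X.index(b)


# Predicate name -> (function, number of elements it checks).
PREDICATES = {"first": (first, 1), "precedes": (precedes, 2)}


def enum_relations(X):
    """Return every (predicate name, elements) pair that holds in X."""
    held = set()
    for name, (f, arity) in PREDICATES.items():
        for args in permutations(X, arity):
            if f(*args, X):
                held.add((name, args))
    return held


X1 = ["x11", "x12", "x13"]
relations = enum_relations(X1)
```

For the three-field source above, the enumeration finds one `first` relation and three `precedes` relations.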
  • 17. Algorithm
      - LearnSchema – From matching to schema
      - The aim is to construct a schema based on a given matching.
      - First, group the matched source fields together.
      - Each group is trained as a field model.
      - Each field model is a 2-state HMM. Learning an HMM a from a set of instances, and computing the probability p(z|a) for a given instance z, follow the standard HMM training and inference algorithms.
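
Computing p(z|a) for a 2-state HMM can be done with the standard forward algorithm, sketched below. The parameter values and the token classes `"word"` and `"num"` are made up for illustration; training (for example with Baum-Welch) is not shown.

```python
from itertools import product


def hmm_likelihood(obs, start, trans, emit):
    """Forward algorithm: probability of the observation sequence under the HMM.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    if not obs:
        return 1.0
    states = range(len(start))
    # alpha[s] = probability of the prefix seen so far, ending in state s.
    alpha = [start[s] * emit[s].get(obs[0], 0.0) for s in states]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in states) * emit[s].get(symbol, 0.0)
            for s in states
        ]
    return sum(alpha)


# Hypothetical 2-state model over two token classes.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [{"word": 0.9, "num": 0.1}, {"word": 0.2, "num": 0.8}]

p_single = hmm_likelihood(["word"], start, trans, emit)
# Sanity check: likelihoods of all length-3 sequences should sum to 1.
total = sum(
    hmm_likelihood(list(z), start, trans, emit)
    for z in product(["word", "num"], repeat=3)
)
```

A field instance z would first have to be tokenized into such symbol classes; how the original work does this is not stated on the slide.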
  • 18. Algorithm
      - SchemaMatch – From schema to matching
      - Given the domain schema, matching becomes labeling the elements of sources with the appropriate domain fields.
      - For each hj ∈ Vs with the corresponding bj ∈ B, let their constraint be fj(yi1,…,yik). For every candidate field a ∈ A, a per-constraint score qi,j(a) is computed from p(hj | bj), aggregated over the assignments of yi1,…,yik with yi = a, together with a per-element term for each l ∈ {i1,…,ik}, l ≠ i. The scores qi,j(a) over all j are then combined, with a factor z, into qi(a).
      - The most likely value for each yi is thus: $y_i^* = \arg\max_{a \in A} q_i(a)$
  • 19. Algorithm
      - MetaMatch
      - Adopt the F-measure to measure consistency:
          $F_{i,j} = \frac{2 R_{i,j} P_{i,j}}{R_{i,j} + P_{i,j}}$
      - For two matchings m1 and m2, using m1 as testee and m2 as tester:
          $F(m_1, m_2) = \sum_{i \in m_2} \frac{n_i}{n} \max_{j \in m_1} \{F_{i,j}\}$
      - Let the candidates generated during this process be C and the n matchings be R={r1,…,rn}. The final matching is obtained as:
          $m^* = \arg\max_{m \in C} \sum_{r \in R} F(m, r)$
      - InitMatch aims to guess an initial matching, to be the starting point of the iterative computation.
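
The consistency measure and the final selection on slide 19 can be sketched as follows. The slide does not define R and P, so this sketch assumes the usual set-overlap definitions (recall = overlap / tester group size, precision = overlap / testee group size), assumes the weighted sum runs over the tester's groups, and represents a matching as a non-empty list of non-empty groups of field ids. The function names are hypothetical.

```python
def pair_f(group_i, group_j):
    """F-measure between a tester group i and a testee group j."""
    overlap = len(group_i & group_j)
    if overlap == 0:
        return 0.0
    recall = overlap / len(group_i)
    precision = overlap / len(group_j)
    return 2 * recall * precision / (recall + precision)


def matching_f(testee, tester):
    """F(m1, m2): size-weighted best F of each tester group against the testee."""
    n = sum(len(g) for g in tester)
    return sum(
        len(gi) / n * max(pair_f(gi, gj) for gj in testee)
        for gi in tester
    )


def meta_match(candidates, runs):
    """Pick the candidate matching most consistent with all matchings in runs."""
    return max(candidates, key=lambda m: sum(matching_f(m, r) for r in runs))


c = [frozenset({1, 2, 3}), frozenset({4, 5})]
m = [frozenset({1, 2}), frozenset({3, 4, 5})]
score = matching_f(m, c)
best = meta_match([m, c], [c, c, m])
```

The same `matching_f` can also serve as the accuracy measure F(m, c) against a correct matching c, as used in the experiments.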
  • 20. Algorithm
      - HoliMatch’s algorithm
  • 21. Experiments
      - Data set
          - Four domains
          - For each domain, 10 sources are collected
  • 22. Experiments
      - Comparison Methods
          - PairMatch: adopts a corpus-based approach
          - ClusMatch
          - ChainMatch: e.g., 1-2-3-4
          - ProgMatch: e.g., (((1-2)-3)-4)
          - InitMatch: an extension of pairwise matching
          - HoliMatch
      - Performance
          - The matching accuracy is measured using the F-measure.
          - Given the result matching m and the correct matching c, the F-measure is F(m, c), indicating how close m is to c.
  • 23. Experiments
      - Matching on Correct Extraction Data: Matchers, Iterations
  • 24. Experiments
      - Matching on Correct Extraction Data: Sources
  • 25. Experiments
      - Matching on Correct Extraction Data: Pairwise
  • 26. Experiments
      - Matching on Real Extraction Data