Complex Matching of RDF Datatype Properties

Paper about complex matching of RDF datatype properties.

DEXA Conference, 2013, Prague, Czech Republic.


Complex Matching of RDF Datatype Properties

  1. Complex Matching of RDF Datatype Properties
     Bernardo Pereira Nunes (1,2), Alexander Mera (1), Marco Antonio Casanova (1), Besnik Fetahu (2), Luiz André P. Paes Leme (3), Stefan Dietze (2)
     1) Department of Informatics, PUC-Rio; 2) L3S Research Center, Leibniz University Hannover; 3) Computer Science Institute, Fluminense Federal University
     DEXA 2013 – Prague, Czech Republic
  2. Outline
     • Introduction
     • Motivation
     • Related Work
     • Schema Matching Principles
     • Our approach:
       • Phase 1) Estimated Mutual Information (EMI)
       • Phase 2) Genetic Programming (GP)
     • Evaluation
     • Results
     • Conclusions
     Besnik Fetahu, DEXA 2013 – Prague, Czech Republic
  3. Introduction
     • Data Integration
       • Combine different data sources into a unified view of the data
     • Originally driven by large organizations:
       • Merging company databases after acquisitions
     • Currently driven by new Web trends such as:
       • Improvement of Web-based search
       • Proliferation of Web applications
       • e-business
     • Examples: momondo.de, semantic search, price-watching sites, etc.
  4. Introduction
     • Challenges
       • Heterogeneous data
       • Different data formats
       • Data quality (data impurities, corrupted information)
       • Scalability
       • Adaptability
       • Cost
  5. Introduction
     • Initiatives to address data integration problems
       • Linked Data Principles
       • Ontology Alignment Initiatives (OAI)
       • Schema Matching tools
  6. Motivation
     • Given two schemas S and T, a matching from S to T exists when an element e of S is mapped to an element e’ of T by some expression that relates the two elements.
  7. Related Work
     • Methods
       • RiMOM, iMAP, S-Match, DSSim, ATOM, etc.
       • Schema-based approach
       • Instance-based approach
       • Hybrid approach
     • Cardinality
       • 1:1
       • 1:n
       • n:m
     Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.
  8. Cardinality
     • Simple match
       • 1:1 – direct matching
       • Example: ISBN "0-671-72287-5" ↔ ISBN "0-671-72287-5"
     • Complex match
       • 1:1 / n:1 (mapping functions)
       • Example: Fullname "William Shakespeare" ↔ Firstname "William", Last name "Shakespeare", related by split(fullname) and concatenate(f,l)
  9. Our approach
     • Two-phase approach:
       • Estimated Mutual Information
         • Suggests 1:1 and 1:n mappings
         • Serves as a filtering step (filters out data properties that have no mutual information)
         • Reduces the search space for the next phase (speeds up the process)
       • Genetic Programming
         • Automatic process for creating mapping functions
         • Reduces the cost of traversing the search space
  10. Estimated Mutual Information (EMI)
      • EMI Matrix
        • p = (p1, …, pu) and q = (q1, …, qv) are two lists of sets (i.e. sets of datatype properties)
        • Similarity measures shown: Cosine Similarity, Jaccard Index, …
  11. Estimated Mutual Information (EMI)
      • Computing the mutual information:
        • Cosine Similarity
          • Simple matches: William Shakespeare → William Shakespeare
        • Jaccard Similarity
          • Simple and complex matches: William → William Shakespeare
  12. Genetic Programming (GP)
      • Genetic programming is an automated method for creating and evolving programs that solve a problem.
      • A solution is represented by a tree whose nodes are labeled with functions (concatenate, split, sum) or with values (strings, numbers, etc.).
      • New individuals are generated by applying genetic operations to the current population of individuals.
      • The individuals that should breed are selected by an evolutionary process.
  13. Genetic Programming (GP)
      • GP Functions:
        • Crossover
          • The act of swapping gene values between two potential solutions, simulating the "mating" of the two solutions.
        • Mutation
          • The act of randomly altering the value of a gene in a potential solution.
        • Reproduction
          • The act of making a copy of a potential solution.
  14. Genetic Programming (GP)
      • Fitness function
        • Levenshtein similarity function for string values
        • KL-divergence measure for numeric values
      • Different measures are applied because datatype property values can share many common values (such as 0), which can lead to a wrong match. Thus, we use KL divergence to measure the probability that two sets are the same.
  15. An Example of Implementation
  16. An Example of Implementation
      Phase 1 – Co-occurrence matrix
      1. Difference between Cosine/Jaccard similarity metrics.
  17. An Example of Implementation
      Phase 1 – EMI matrix
      2. Possible matchings
  18. An Example of Implementation
  19. An Example of Implementation
      [Figure: expression trees over the properties Address, Number, Complement and Neighborhood, combined with +, illustrating crossover, mutation and reproduction.]
  20. An Example of Implementation
      [Figure labels: Correct; Repetitive and Incorrect; mutation.]
  21. Evaluation
      • Datasets
        • “Personal Information” dataset lists information about people
        • “Real Estate” dataset lists information about houses for sale
        • “Inventory” dataset describes product inventories
      With the exception of the “Personal Information” dataset, which is withheld for privacy reasons, the datasets are available at: http://pages.cs.wisc.edu/ anhai/wisc-si-archive/domains/
  22. Results
  23. Results
  24. Conclusions
      • Complex schema matching approach
        • Simple + complex matching:
          • Estimated Mutual Information + Genetic Programming
      • Reduced search space for matching properties
      • Adaptive to variations of 1:1 and n:1 matching instances
      • High accuracy and coverage of the generated matches
      27/08/13, Ricardo Kawase
  25. Questions? Thank you!
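
The Phase 1 filtering step described in the EMI slides can be sketched in Python as follows. The slides do not reproduce the EMI formula, so this sketch assumes the standard mutual-information estimate over a co-occurrence matrix m: EMI[i][j] = (m[i][j] / M) * log(M * m[i][j] / (r[i] * c[j])), where M is the sum of all entries and r[i], c[j] are the row and column sums. Building m from the Jaccard index of word-token sets is also an assumption, and all function names and sample values are hypothetical.

```python
import math

def tokens(values):
    """Lowercase word tokens of all values of one property (assumed tokenisation)."""
    return {tok.lower() for value in values for tok in value.split()}

def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cooccurrence(source, target):
    """Matrix m where m[i][j] compares source property i with target property j."""
    return [[jaccard(tokens(sv), tokens(tv)) for tv in target.values()]
            for sv in source.values()]

def emi_matrix(m):
    """Assumed EMI estimate: (m_ij / M) * log(M * m_ij / (row_i * col_j))."""
    total = sum(map(sum, m))
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    return [[(mij / total) * math.log(total * mij / (row_sums[i] * col_sums[j]))
             if mij > 0 else 0.0
             for j, mij in enumerate(row)]
            for i, row in enumerate(m)]

def candidate_matches(source, target):
    """Property pairs with positive EMI; everything else is filtered out."""
    emi = emi_matrix(cooccurrence(source, target))
    return [(s, t)
            for i, s in enumerate(source)
            for j, t in enumerate(target)
            if emi[i][j] > 0]

# Hypothetical instance values for two small schemas.
source = {
    "fullname": ["William Shakespeare", "Jane Austen"],
    "isbn": ["0-671-72287-5"],
}
target = {
    "firstname": ["William", "Jane"],
    "lastname": ["Shakespeare", "Austen"],
    "isbn": ["0-671-72287-5"],
}

matches = candidate_matches(source, target)
```

In this toy run, fullname is paired with both firstname and lastname (a 1:n candidate) and isbn is paired with isbn (a 1:1 candidate); the remaining pairs share no tokens and are dropped before Phase 2.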

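
The Phase 2 fitness evaluation for string values can be illustrated with a second sketch. It assumes that Levenshtein similarity is normalised as 1 - distance / length of the longer string, that fitness is the mean similarity over aligned instances, and that a candidate mapping is a small expression tree of tuples. These choices, the tuple encoding and the sample records are hypothetical; the sketch only scores hand-written candidates and does not implement crossover, mutation, reproduction or the KL-divergence measure for numeric values.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def evaluate(tree, record):
    """Evaluate an expression tree against one record of source values."""
    kind = tree[0]
    if kind == "prop":      # leaf: value of a source property
        return record[tree[1]]
    if kind == "const":     # leaf: literal string
        return tree[1]
    if kind == "concat":    # inner node: concatenate both subtrees
        return evaluate(tree[1], record) + evaluate(tree[2], record)
    raise ValueError(f"unknown node type: {kind}")

def fitness(tree, records, target_prop):
    """Mean similarity between the tree's output and the target value."""
    scores = [similarity(evaluate(tree, r), r[target_prop]) for r in records]
    return sum(scores) / len(scores)

# Hypothetical aligned instances: source properties plus the expected target value.
records = [
    {"firstname": "William", "lastname": "Shakespeare", "fullname": "William Shakespeare"},
    {"firstname": "Jane", "lastname": "Austen", "fullname": "Jane Austen"},
]

first = ("prop", "firstname")
last = ("prop", "lastname")
candidates = [
    first,                                            # firstname only
    ("concat", first, last),                          # missing separator
    ("concat", first, ("concat", ("const", " "), last)),
]

best = max(candidates, key=lambda t: fitness(t, records, "fullname"))
```

Here the third candidate reproduces fullname exactly and therefore scores highest, while the candidate without the separator comes close but is penalised by one edit per record.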