SlideShare une entreprise Scribd logo
1  sur  28
Télécharger pour lire hors ligne
Performance Analysis of Algorithms to
   Reason about XML Keys

   Emir Mu˜oz Jim´nez
          n      e
   University of Santiago de Chile
   Yahoo! Labs LatAm

   Joint work with F. Ferrarotti, S. Hartmann, S. Link, M. Marin

   DEXA 2012 @ Vienna, Austria, 3rd September 2012


Introduction RW & Motivation XML Keys Results Conclusion           1/24
Contribution



          An efficient implementation of an algorithm that decides the
          implication problem for a tractable and expressive class of
          XML keys.
          Performance Analysis:
                 Implication problem over large sets of XML Keys.
                 Non-redundant covers of XML keys and validation of XML
                 documents.
          Our experiments show that reasoning about expressive notions
          of XML keys can be done efficiently in practice and scales well.




Introduction RW & Motivation XML Keys Results Conclusion                   2/24
Outline



   1 Introduction


   2 Related Work & Motivation


   3 Reasoning about XML Keys


   4 Results


   5 Conclusion




Introduction RW & Motivation XML Keys Results Conclusion   3/24
Introduction



   Keys
          XML: de-facto standard for Web data exchange and
          integration.
          XML ↔ data storage ↔ data processing.
          Both industry and academia have long since recognized the
          importance of keys in XML data management.
          Several notions of XML keys have been proposed. Most
          influential is due to Buneman et al. who defined keys on the
          basis of an XML tree model similar to the one suggested by
          DOM and XPath.



Introduction RW & Motivation XML Keys Results Conclusion               4/24
Introduction
                    XML Tree Example (1/2)



   Example (An XML tree representing an XML document)
                                                                                     db
                                                                                     E
                            project                                                                                       project
                              E                                                                  pname                        E
   pname                                                                                                                                               team
                                                        team                                                                                                           tname
                                                   E                    tname                      A                                              E
     A
                                                                          A     Riders          MediaLow
                                                                                                               employee                                                    A   Coyotes
   Phobia           employee           employee                                                                                      employee
            rol                       rol                       rol                  employee          rol                          rol                        rol                  employee
                        E                       E                               E                                  E                              E                            E
            A                         A                          A                                       A                          A                           A
       Team                 name    Team               name Team                    name            Team               name        Team               name    Team                 name
      Member                       Member                      Leader                              Member                         Member                      Leader
                        E                          E                          E                                    E                              E                            E

            fname              lname fname               lname fname                  lname              fname            lname      fname              lname fname                  lname
                    E         E              E          E                E           E                         E          E                E            E              E             E


                    S         S              S          S                S           S                         S          S                S            S              S             S
                  Dexter Cooper             John       Smith          William       Bell                     Dexter Cooper              William        Bell          Brook         Davis




Introduction RW & Motivation XML Keys Results Conclusion                                                                                                                                       5/24
Introduction
           XML Tree Example (2/2)




   Some Constraints: (gathered as business rules)

   (a) A project node is identified by pname, no matter where the
       project node appears in the document.
   (b) A team node can be identified by tname relatively to a
       project node.
   (c) Within any given subtree rooted at team, an employee node
       is identified by name.




Introduction RW & Motivation XML Keys Results Conclusion           6/24
Introduction
           XML Tree Example (2/2)




   Some Constraints: (gathered as business rules)

   (a) A project node is identified by pname, no matter where the
       project node appears in the document. Absolute key
   (b) A team node can be identified by tname relatively to a
       project node. Relative key
   (c) Within any given subtree rooted at team, an employee node
       is identified by name. Relative key




Introduction RW & Motivation XML Keys Results Conclusion           6/24
Related Work & Motivation



          Keys are important in many perennial tasks in database
          management.
          The hope is that keys will turn out to be equally beneficial for
          XML.
          In practice, expressive yet tractable notions of XML keys have
          been ignored so far.
          We initiate an empirical study of the fragment of XML keys
          with nonempty sets of simple key paths.
                 One of the most fundamental questions on keys is:
                 Can we decide if a new key holds given a set of known keys?
                 (Logical Implication Problem.)



Introduction RW & Motivation XML Keys Results Conclusion                       7/24
Related Work & Motivation
           Key implication example

   Some Constraints: (gathered as business rules)

   (a) A project node is identified by pname, no matter where the
       project node appears in the document.
   (b) A team node can be identified by tname relatively to a
       project node.
   (c) Within any given subtree rooted at team, an employee node
       is identified by name.

   Suppose, the consideration of a further key:
   (d) A project node can be identified in the document by its child
       nodes pname and team.



Introduction RW & Motivation XML Keys Results Conclusion              8/24
Related Work & Motivation
           Key implication example

   Some Constraints: (gathered as business rules)

   (a) A project node is identified by pname, no matter where the
       project node appears in the document.
   (b) A team node can be identified by tname relatively to a
       project node.
   (c) Within any given subtree rooted at team, an employee node
       is identified by name.

   Suppose, the consideration of a further key:
   (d) A project node can be identified in the document by its child
       nodes pname and team. key (a) actually implies (d)!!
       because (d) is a superkey of (a).

Introduction RW & Motivation XML Keys Results Conclusion              8/24
Keys for XML
                    Definitions (1/2)


   Definition (Value equality)
   u is value equal to v (u =v v) if the trees rooted at u and v are
   isomorphic by an isomorphism that is the identity on string values.
   (e.g., element nodes employee for “Dexter Cooper”)
                                                                                   db
                                                                                   E
                            project                                                                                     project
                              E                                                                pname                        E
   pname                                                                                                                                             team
                                                      team                                                                                                           tname
                                                 E                    tname                      A                                              E
     A
                                                                        A     Riders          MediaLow
                                                                                                             employee                                                    A   Coyotes
   Phobia           employee           employee                                                                                    employee
            rol                       rol                     rol                  employee          rol                          rol                        rol                  employee
                        E                       E                             E                                  E                              E                            E
            A                         A                        A                                       A                          A                           A
       Team                 name Team                name Team                    name            Team               name        Team               name    Team                 name
      Member                    Member                       Leader                              Member                         Member                      Leader
                        E                        E                          E                                    E                              E                            E

            fname              lname fname             lname fname                  lname              fname            lname      fname              lname fname                  lname
                    E         E            E          E                E           E                         E          E                E            E              E             E


                    S         S            S          S                S           S                         S          S                S            S              S             S
                  Dexter Cooper           John       Smith          William       Bell                     Dexter Cooper              William        Bell          Brook         Davis




Introduction RW & Motivation XML Keys Results Conclusion                                                                                                                                     9/24
Keys for XML
           Definitions (2/2)

   Some Constraints: (in class K of XML keys)

   (a) A project node is identified by pname, no matter where the
       project node appears in the document.
                (ε, (project , {pname}))
   (b) A team node can be identified by tname relatively to a
       project node.
                (project , (team, {tname}))
   (c) Within any give subtree rooted at team, an employee node is
       identified by name.
                ( ∗ .team, (employee, {name}))
   (d) A project node can be identified by its child nodes pname
       and team.
                (ε, (project , {pname, team}))

Introduction RW & Motivation XML Keys Results Conclusion             10/24
Deciding XML Key Implication
           Notions



          We say that Σ implies ϕ, denoted by Σ |= ϕ, iff every finite
          XML tree T that satisfies all σ ∈ Σ also satisfies ϕ.
          Implication Problem for a class C of XML keys: given any
          Σ ∪ {ϕ} in C, decide whether Σ |= ϕ.
          e.g., let be Σ = {(a), (b), (c)} and ϕ = (d), then Σ |= ϕ is
          true.
          e.g., let be Σ = {(a), (b)} and ϕ = (c), then Σ |= ϕ is false.
          Hartmann & Link characterize the implication problem for the
          class K of XML keys in terms of the of reachability problem
          for fixed nodes in a suitable digraph.



Introduction RW & Motivation XML Keys Results Conclusion                   11/24
Deciding XML Key Implication
           Mini-trees and Witness Graphs (Hartmann & Link, 2009)

                                                       ϕ = (ε, (public. ∗ .project , {pname.S , year .S }))
                            rϕ = q ϕ
               p              1111
                              0000
                              1111
                              0000               (1)
                              1111
                              0000                              qϕ
                              1111
                              0000
                              1111
                              0000
                           rϕ1111
                            ′ 0000               (2)                 E   db
                              1111
                              0000
                              1111
                              0000
                              1111
                              0000
                            ′
                                                                 ′
                                                                v1   E   public                         ❦
                                                                                                       qϕ E   db
                   ′
                           v1   E    public
               p
                                                                 ′
                            ′                                   v2 E     national                        E    public
                           v2   E   national
                                                                 ′
                            ′                                   qϕ E     project                         E    national
                           qϕ   E    project     (3)
                                                                               ′
                                                         w1                   w1
                                                                pname                                   ′❦
                                    11111
                                    00000 ϕ
                                                            E                 E    year                qϕ E   project
              r10000
               ϕ1111
                1111
                0000                11111
                                    00000r
                                    11111
                                    00000 2
                   1111
                   0000
                   1111
                   0000             11111
                                    00000              xϕ                          xϕ
                                    11111
                                    00000  ′            1   S                 S     2      pname   E           E       year
      p1      w1                          w1
                       E    pname     E   year
                                                 p2         ×                 ×
                                                                                                   S           S
              xϕ
               1       S              S   xϕ
                                           2                                                       ×           ×

                                (a) Mini tree TΣ,ϕ                                        (b) Witness graph GΣ,ϕ


Introduction RW & Motivation XML Keys Results Conclusion                                                                      12/24
An Efficient Implementation
           Data structures and implementation

          Suitable data structures for Mini-tree & Witness-graph.

                                                                       L
                                                                  Node 0        nodeEle
                             qϕ   E   db
                                                      vertexEle
                                                                      (db)
                                                                                Node 1
                                                                                edgeEle

                                  E   public                      Node 1        nodeEle
                                                      vertexEle
                                                                    (public)
                                                                                Node 2
                                                                                edgeEle
                                  E   national
                                                                  Node 2        nodeEle
                                                      vertexEle
                                                                   (national)
                              ′                                                 Node 3    Node 0
                             qϕ   E   project                                   edgeEle   edgeEle

                                                                  Node 3        nodeEle
                 pname                                vertexEle
                         E             E       year                 (project)
                                                                                Node 4    Node 6    Node 1

                                                                     .....      edgeEle   edgeEle   edgeEle

                         S             S
                                                                  Node 7

                         ×             ×                               (S)


                (c) Witness graph GΣ,ϕ                             (d) Adj. list for GΣ,ϕ


Introduction RW & Motivation XML Keys Results Conclusion                                                      13/24
XML Key Reasoning for Document Validation
           Applications




          Fast algorithms for the validation of XML documents against
          keys are crucial to ensure the consistency and semantic
          correctness of data stored or exchanged between applications.
          We use our implementation of the implication problem to
          compute non-redundant covers for sets of XML keys.
          Which in turn can be used to significantly speed up the
          process of XML document validation against sets of XML
          keys.




Introduction RW & Motivation XML Keys Results Conclusion                  14/24
XML Key Reasoning for Document Validation
           Cover Sets for XML Keys

   Fact
   Σ is non-redundant if there is no key ψ ∈ Σ such that
   Σ − {ψ} |= ψ.


   Algorithm 1: Non-redundant Cover for XML keys
   Input : Finite set Σ of XML keys
   Output: A non-redundant cover for Σ
1 Θ = Σ;
2 foreach key ψ ∈ Σ do
3      if Θ − {ψ} |= ψ then
4           Θ = Θ − {ψ};

5 return Θ;
6
   This set can be computed in O(|Σ| × (max{|ψ| : ψ ∈ Σ})2 ) time.

Introduction RW & Motivation XML Keys Results Conclusion             15/24
Experimental Results
           Performance Analysis


          We analyze:
            (i) The scalability of the implication problem.
           (ii) The viability of computing non-redundant cover sets to speed
                up the validation of XML documents against XML keys.
          For (i) we generated large sets of XML keys in the following
          two systematic ways:
             1. using a manually defined set of 5 to 10 XML keys as seeds, we
                computed new implied keys by successively applying the
                inference rules from the axiomatization presented by
                Hartmann & Link (2009).
             2. we defined some non-implied XML keys, adding witness edges
                                                ′
                keeping qϕ not reachable from qϕ .
          For (ii) we generated sets of keys following the same strategy.


Introduction RW & Motivation XML Keys Results Conclusion                       16/24
Experimental Results – Datasets (1/2)




                                   Table: XML Documents.
          Doc ID   Document               No. of       No. of       Size    Max.    Average
                                         Elements     Attributes            Depth    Depth
          Doc1     321gone.xml              311            0       23 KB      5     3.76527
          Doc2     yahoo.xml                342            0       24 KB      5     3.76608
          Doc3     dblp.xml               29,494        3,247      1.6 MB     6     2.90228
          Doc4     nasa.xml              476,646       56,317      23 MB      8     5.58314
          Doc5     SigmodRecord.xml       11,526        3,737      476 KB     6     5.14107
          Doc6     mondial-3.0.xml        22,423       47,423       1 MB      5     3.59274




Introduction RW & Motivation XML Keys Results Conclusion                                      17/24
Experimental Results – Datasets (2/2)
   SigmodRecord (seed keys)
   (ε, (issue, {volume.S, number.S}))
   (ε, ( ∗ .issue, {volume.S, number.S}))                                      Interaction rule
   (issue, (articles, {article.title.S}))                               (Q, (Q′ , {P.P1 , . . . , P.Pk })),

   (issue.articles, (article, {title.S}))                                (Q.Q′ , (P, {P1 , . . . , Pk }))

   (issue, (articles.article, {initP age.S, endP age.S}))                (Q, (Q′ .P, {P1 , . . . , Pk }))
   (ε, (issue.articles.article.authors.author, {position}))      (issue, (articles.article, {title.S}))
   Sigmod.xml
      <SigmodRecord>
         <issue>
            <volume>13</volume>
            <number>2</number>
            <articles>
               <article>
                  <title>Deadlock Detection is Cheap.</title>
                  <initPage>19</initPage>
                  <endPage>34</endPage>
                  <authors>
                      <author position="00">Rakesh Agrawal</author>
                      <author position="01">Michael J. Carey</author>
                      <author position="02">David J. DeWitt</author>
                  </authors>
               </article> ...
            </articles>
         </issue> ...
      </SigmodRecord>


Introduction RW & Motivation XML Keys Results Conclusion                                                      18/24
Experimental Results
            Strategies to generate implied and non-implied XML keys

                                                           Non-implied keys
       If we apply the interaction rule to
                                                                            E   db
   -   (issue, (articles, {article.title.S}))
                                                                       qϕ   E   conference
   -   (issue.articles, (article, {title.S}))
                                                                            E   issue
       We derived the implied key
   -   (issue, (articles.article, {title.S}))                               E   ℓ0 = content


       We applied the interaction,                                          E   articles

       context-target, subnodes, context-path
                                                                            E   article
       containment, target-path containment,
                                                                        ′
       subnodes-epsilon and prefix-epsilon rules                        qϕ   E   author

       whenever possible.                                      first   E         E      last


                                                                       S         S




Introduction RW & Motivation XML Keys Results Conclusion                                       19/24
Results for Implication Problem
                           Deciding implication of XML keys


                                                                                       2.5
                           abs-abs                                                                mix-rel
                 1.8        abs-rel                                                              wildcard
                            rel-abs
                 1.6         rel-rel
                           mix-abs                                                      2
                           mix-rel
     Time [ms]




                                                                          Time [ms]
                 1.4

                 1.2
                                                                                       1.5
                  1

                 0.8
                                                                                        1

                 0.6

                 0.4                                                                   0.5

                 0.2

                  0                                                                     0
                       0          20   40   60     80   100   120   140                      0        20    40   60     80   100   120   140
                                                 Size of Σ                                                            Size of Σ

      (e) XML key implication, all cases.                                             (f) Effect of wildcards presence.

   Figure: Performance of the Algorithm for the Implication of XML Keys
   (Σ |= ϕ).




Introduction RW & Motivation XML Keys Results Conclusion                                                                                       20/24
Results for Validation Problem
           Document Validation (1/2)


        XML doc.    Key Set                  Time[ms]
       321gone &    Processed Keys: 23         3.458                        35000                   >30min   >25min 1.88min     5min
                                                                                       Full-set
          yahoo     Discarded keys: 15                                               Cover-set
       (Doc1 & 2)   Cover set: 8 keys                                       30000

          DBLP      Processed Keys: 36        12.757
                                                                            25000




                                                        Running Time [ms]
         (Doc3)     Discarded keys: 24
                    Cover set: 12 keys                                      20000
          nasa      Processed Keys: 35         9.23
         (Doc4)     Discarded keys: 28                                      15000

                    Cover set size: 7 keys
                                                                            10000
        Sigmod      Processed Keys: 24        5.294
        Record      Discarded Keys: 19                                       5000
        (Doc5)      Cover set: 5 keys                                               952ms 1329ms
        mondial     Processed Keys: 26        4.342                             0
                                                                                       Doc1       Doc2    Doc3   Doc4    Doc5    Doc6
        (Doc6)      Discarded Keys: 16
                                                                                                         XML documents
                    Cover set: 10



     (a) Non-redundant Cover Sets.                                          (b) Validation Against Cover Sets.
   Figure: Non-redundant Cover Sets of XML keys and Validation of XML
   Documents

Introduction RW & Motivation XML Keys Results Conclusion                                                                                21/24
Results for Validation Problem
           Document Validation (2/2)




          There are some very expensive XPath queries.
          SigmodRecord
                 (issue, (articles.article, {title.S})) - there are 1503 nodes
                 in the XPath query “//issue/articles/article”.
                 (ε, (issue.articles.article.authors.author, {S})) - there are
                 3737 nodes in the XPath query
                 “//issue/articles/article/authors/author”.




Introduction RW & Motivation XML Keys Results Conclusion                         22/24
Main conclusion
   This work was motivated by two objectives:
      1 To demonstrate that there are expressive classes of XML keys that can be
          reasoned about efficiently.
      2 To show that our observations on the problem of deciding implication has
        immediate consequences for other perennial task in XML database
        management.
  [1.1] We studied a fragment of XML keys with nonempty sets of simple key
        paths.
[1.1.1] Presenting an efficient implementation for the implication problem.
[1.1.2] Showing through experiments that the proposed algorithm runs fast in
        practice and scales well.
  [2.1] We studied the problem of validating an XML document against a set of
        XML keys.
  [2.2] Presenting an optimization method for this validation via the
        computation of non-redundant cover sets of XML keys.
  [2.3] Our experiments show that enormous time savings can be achieved in
        practice.

Introduction RW & Motivation XML Keys Results Conclusion                           23/24
Questions?




                                            THANKS!

                                              Emir Mu˜oz Jim´nez– emir@emunoz.org
                                                     n      e

Introduction RW & Motivation XML Keys Results Conclusion                            24/24
Axiomatization


                                                               ∗
Let be P L from the grammar Q → ℓ | ε | Q.Q |
where ℓ ∈ L is a label, ε is the empty word, “.” is the concat
operator, and “ ∗ ” is the wildcard.
For nodes v and v ′ of an XML tree T , the value intersection
of v[[Q]] and v ′ [ Q]] is given by
v[[Q]] ∩v v ′ [ Q]] = {(w, w′ ) | w ∈ v[[Q]], w′ ∈ v ′ [ Q]], w =v w′ }
We define semantic closure by: Σ∗ = {ϕ ∈ C | Σ |= ϕ}
and also syntactic closure by: Σ+ = {ϕ | Σ ⊢ℜ ϕ}
                                ℜ
A set of rules ℜ is sound (complete) if Σ+ ⊆ Σ∗ (Σ∗ ⊆ Σ+ )
                                         ℜ             ℜ
A sound and complete set of rules is called axiomatization



                                                                          1/2
Deciding XML Key Implication
         Mini-trees and Witness Graphs (2/2)



  Theorem (Hartman & Link, 2009)

  Let Σ ∪ {ϕ} be a finite set of keys in the class K. We have Σ |= ϕ
                                       ′
  if and only if qϕ is reachable from qϕ in GΣ,ϕ .



  Algorithm 2: XML key implication in K
  Input : finite set of XML keys Σ ∪ {ϕ} in K
  Output: yes, if Σ |= ϕ; no, otherwise
1 Construct GΣ,ϕ for Σ and ϕ;
                           ′
2 if qϕ is reachable from qϕ in G then return yes; else return no; end if




                                                                            2/2

Contenu connexe

Plus de Emir Muñoz

Web Intelligence - 2010
Web Intelligence - 2010Web Intelligence - 2010
Web Intelligence - 2010Emir Muñoz
 
μRaptor: A DOM-based system with appetite for hCard elements
μRaptor: A DOM-based system with appetite for hCard elementsμRaptor: A DOM-based system with appetite for hCard elements
μRaptor: A DOM-based system with appetite for hCard elementsEmir Muñoz
 
Learning Content Patterns from Linked Data
Learning Content Patterns from Linked DataLearning Content Patterns from Linked Data
Learning Content Patterns from Linked DataEmir Muñoz
 
Claves XML: Una Implementación de Algoritmos de Implicación y Validación
Claves XML: Una Implementación de Algoritmos de Implicación y ValidaciónClaves XML: Una Implementación de Algoritmos de Implicación y Validación
Claves XML: Una Implementación de Algoritmos de Implicación y ValidaciónEmir Muñoz
 
Using Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's TablesUsing Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's TablesEmir Muñoz
 
Reading Group 2014
Reading Group 2014Reading Group 2014
Reading Group 2014Emir Muñoz
 
Soft Cardinality Constraints on XML Data
Soft Cardinality Constraints on XML DataSoft Cardinality Constraints on XML Data
Soft Cardinality Constraints on XML DataEmir Muñoz
 
DRETa: Extracting RDF From Wikitables
DRETa: Extracting RDF From WikitablesDRETa: Extracting RDF From Wikitables
DRETa: Extracting RDF From WikitablesEmir Muñoz
 

Plus de Emir Muñoz (8)

Web Intelligence - 2010
Web Intelligence - 2010Web Intelligence - 2010
Web Intelligence - 2010
 
μRaptor: A DOM-based system with appetite for hCard elements
μRaptor: A DOM-based system with appetite for hCard elementsμRaptor: A DOM-based system with appetite for hCard elements
μRaptor: A DOM-based system with appetite for hCard elements
 
Learning Content Patterns from Linked Data
Learning Content Patterns from Linked DataLearning Content Patterns from Linked Data
Learning Content Patterns from Linked Data
 
Claves XML: Una Implementación de Algoritmos de Implicación y Validación
Claves XML: Una Implementación de Algoritmos de Implicación y ValidaciónClaves XML: Una Implementación de Algoritmos de Implicación y Validación
Claves XML: Una Implementación de Algoritmos de Implicación y Validación
 
Using Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's TablesUsing Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's Tables
 
Reading Group 2014
Reading Group 2014Reading Group 2014
Reading Group 2014
 
Soft Cardinality Constraints on XML Data
Soft Cardinality Constraints on XML DataSoft Cardinality Constraints on XML Data
Soft Cardinality Constraints on XML Data
 
DRETa: Extracting RDF From Wikitables
DRETa: Extracting RDF From WikitablesDRETa: Extracting RDF From Wikitables
DRETa: Extracting RDF From Wikitables
 

DEXA 2012 Talk

  • 1. Performance Analysis of Algorithms to Reason about XML Keys Emir Mu˜oz Jim´nez n e University of Santiago de Chile Yahoo! Labs LatAm Joint work with F. Ferrarotti, S. Hartmann, S. Link, M. Marin DEXA 2012 @ Vienna, Austria, 3rd September 2012 Introduction RW & Motivation XML Keys Results Conclusion 1/24
  • 2. Contribution An efficient implementation of an algorithm that decides the implication problem for a tractable and expressive class of XML keys. Performance Analysis: Implication problem over large sets of XML Keys. Non-redundant covers of XML keys and validation of XML documents. Our experiments show that reasoning about expressive notions of XML keys can be done efficiently in practice and scales well. Introduction RW & Motivation XML Keys Results Conclusion 2/24
  • 3. Outline 1 Introduction 2 Related Work & Motivation 3 Reasoning about XML Keys 4 Results 5 Conclusion Introduction RW & Motivation XML Keys Results Conclusion 3/24
  • 4. Introduction Keys XML: de-facto standard for Web data exchange and integration. XML ↔ data storage ↔ data processing. Both industry and academia have long since recognized the importance of keys in XML data management. Several notions of XML keys have been proposed. Most influential is due to Buneman et al. who defined keys on the basis of an XML tree model similar to the one suggested by DOM and XPath. Introduction RW & Motivation XML Keys Results Conclusion 4/24
  • 5. Introduction XML Tree Example (1/2) Example (An XML tree representing an XML document) db E project project E pname E pname team team tname E tname A E A A Riders MediaLow employee A Coyotes Phobia employee employee employee rol rol rol employee rol rol rol employee E E E E E E A A A A A A Team name Team name Team name Team name Team name Team name Member Member Leader Member Member Leader E E E E E E fname lname fname lname fname lname fname lname fname lname fname lname E E E E E E E E E E E E S S S S S S S S S S S S Dexter Cooper John Smith William Bell Dexter Cooper William Bell Brook Davis Introduction RW & Motivation XML Keys Results Conclusion 5/24
  • 6. Introduction XML Tree Example (2/2) Some Constraints: (gathered as business rules) (a) A project node is identified by pname, no matter where the project node appears in the document. (b) A team node can be identified by tname relatively to a project node. (c) Within any given subtree rooted at team, an employee node is identified by name. Introduction RW & Motivation XML Keys Results Conclusion 6/24
  • 7. Introduction XML Tree Example (2/2) Some Constraints: (gathered as business rules) (a) A project node is identified by pname, no matter where the project node appears in the document. Absolute key (b) A team node can be identified by tname relatively to a project node. Relative key (c) Within any given subtree rooted at team, an employee node is identified by name. Relative key Introduction RW & Motivation XML Keys Results Conclusion 6/24
  • 8. Related Work & Motivation Keys are important in many perennial tasks in database management. The hope is that keys will turn out to be equally beneficial for XML. In practice, expressive yet tractable notions of XML keys have been ignored so far. We initiate an empirical study of the fragment of XML keys with nonempty sets of simple key paths. One of the most fundamental questions on keys is: Can we decide if a new key holds given a set of known keys? (Logical Implication Problem.) Introduction RW & Motivation XML Keys Results Conclusion 7/24
  • 9. Related Work & Motivation Key implication example Some Constraints: (gathered as business rules) (a) A project node is identified by pname, no matter where the project node appears in the document. (b) A team node can be identified by tname relatively to a project node. (c) Within any given subtree rooted at team, an employee node is identified by name. Suppose, the consideration of a further key: (d) A project node can be identified in the document by its child nodes pname and team. Introduction RW & Motivation XML Keys Results Conclusion 8/24
  • 10. Related Work & Motivation Key implication example Some Constraints: (gathered as business rules) (a) A project node is identified by pname, no matter where the project node appears in the document. (b) A team node can be identified by tname relatively to a project node. (c) Within any given subtree rooted at team, an employee node is identified by name. Suppose, the consideration of a further key: (d) A project node can be identified in the document by its child nodes pname and team. key (a) actually implies (d)!! because (d) is a superkey of (a). Introduction RW & Motivation XML Keys Results Conclusion 8/24
  • 11. Keys for XML Definitions (1/2) Definition (Value equality) u is value equal to v (u =v v) if the trees rooted at u and v are isomorphic by an isomorphism that is the identity on string values. (e.g., element nodes employee for “Dexter Cooper”) db E project project E pname E pname team team tname E tname A E A A Riders MediaLow employee A Coyotes Phobia employee employee employee rol rol rol employee rol rol rol employee E E E E E E A A A A A A Team name Team name Team name Team name Team name Team name Member Member Leader Member Member Leader E E E E E E fname lname fname lname fname lname fname lname fname lname fname lname E E E E E E E E E E E E S S S S S S S S S S S S Dexter Cooper John Smith William Bell Dexter Cooper William Bell Brook Davis Introduction RW & Motivation XML Keys Results Conclusion 9/24
  • 12. Keys for XML Definitions (2/2) Some Constraints: (in class K of XML keys) (a) A project node is identified by pname, no matter where the project node appears in the document. (ε, (project , {pname})) (b) A team node can be identified by tname relatively to a project node. (project , (team, {tname})) (c) Within any give subtree rooted at team, an employee node is identified by name. ( ∗ .team, (employee, {name})) (d) A project node can be identified by its child nodes pname and team. (ε, (project , {pname, team})) Introduction RW & Motivation XML Keys Results Conclusion 10/24
  • 13. Deciding XML Key Implication Notions We say that Σ implies ϕ, denoted by Σ |= ϕ, iff every finite XML tree T that satisfies all σ ∈ Σ also satisfies ϕ. Implication Problem for a class C of XML keys: given any Σ ∪ {ϕ} in C, decide whether Σ |= ϕ. e.g., let be Σ = {(a), (b), (c)} and ϕ = (d), then Σ |= ϕ is true. e.g., let be Σ = {(a), (b)} and ϕ = (c), then Σ |= ϕ is false. Hartmann & Link characterize the implication problem for the class K of XML keys in terms of the of reachability problem for fixed nodes in a suitable digraph. Introduction RW & Motivation XML Keys Results Conclusion 11/24
  • 14. Deciding XML Key Implication Mini-trees and Witness Graphs (Hartmann & Link, 2009) ϕ = (ε, (public. ∗ .project , {pname.S , year .S })) rϕ = q ϕ p 1111 0000 1111 0000 (1) 1111 0000 qϕ 1111 0000 1111 0000 rϕ1111 ′ 0000 (2) E db 1111 0000 1111 0000 1111 0000 ′ ′ v1 E public ❦ qϕ E db ′ v1 E public p ′ ′ v2 E national E public v2 E national ′ ′ qϕ E project E national qϕ E project (3) ′ w1 w1 pname ′❦ 11111 00000 ϕ E E year qϕ E project r10000 ϕ1111 1111 0000 11111 00000r 11111 00000 2 1111 0000 1111 0000 11111 00000 xϕ xϕ 11111 00000 ′ 1 S S 2 pname E E year p1 w1 w1 E pname E year p2 × × S S xϕ 1 S S xϕ 2 × × (a) Mini tree TΣ,ϕ (b) Witness graph GΣ,ϕ Introduction RW & Motivation XML Keys Results Conclusion 12/24
  • 15. An Efficient Implementation Data structures and implementation Suitable data structures for Mini-tree & Witness-graph. L Node 0 nodeEle qϕ E db vertexEle (db) Node 1 edgeEle E public Node 1 nodeEle vertexEle (public) Node 2 edgeEle E national Node 2 nodeEle vertexEle (national) ′ Node 3 Node 0 qϕ E project edgeEle edgeEle Node 3 nodeEle pname vertexEle E E year (project) Node 4 Node 6 Node 1 ..... edgeEle edgeEle edgeEle S S Node 7 × × (S) (c) Witness graph GΣ,ϕ (d) Adj. list for GΣ,ϕ Introduction RW & Motivation XML Keys Results Conclusion 13/24
  • 16. XML Key Reasoning for Document Validation Applications Fast algorithms for the validation of XML documents against keys are crucial to ensure the consistency and semantic correctness of data stored or exchanged between applications. We use our implementation of the implication problem to compute non-redundant covers for sets of XML keys. Which in turn can be used to significantly speed up the process of XML document validation against sets of XML keys. Introduction RW & Motivation XML Keys Results Conclusion 14/24
  • 17. XML Key Reasoning for Document Validation Cover Sets for XML Keys Fact Σ is non-redundant if there is no key ψ ∈ Σ such that Σ − {ψ} |= ψ. Algorithm 1: Non-redundant Cover for XML keys Input : Finite set Σ of XML keys Output: A non-redundant cover for Σ 1 Θ = Σ; 2 foreach key ψ ∈ Σ do 3 if Θ − {ψ} |= ψ then 4 Θ = Θ − {ψ}; 5 return Θ; 6 This set can be computed in O(|Σ| × (max{|ψ| : ψ ∈ Σ})2 ) time. Introduction RW & Motivation XML Keys Results Conclusion 15/24
  • 18. Experimental Results Performance Analysis We analyze: (i) The scalability of the implication problem. (ii) The viability of computing non-redundant cover sets to speed up the validation of XML documents against XML keys. For (i) we generated large sets of XML keys in the following two systematic ways: 1. using a manually defined set of 5 to 10 XML keys as seeds, we computed new implied keys by successively applying the inference rules from the axiomatization presented by Hartmann & Link (2009). 2. we defined some non-implied XML keys, adding witness edges ′ keeping qϕ not reachable from qϕ . For (ii) we generated sets of keys following the same strategy. Introduction RW & Motivation XML Keys Results Conclusion 16/24
  • 19. Experimental Results – Datasets (1/2) Table: XML Documents. Doc ID Document No. of No. of Size Max. Average Elements Attributes Depth Depth Doc1 321gone.xml 311 0 23 KB 5 3.76527 Doc2 yahoo.xml 342 0 24 KB 5 3.76608 Doc3 dblp.xml 29,494 3,247 1.6 MB 6 2.90228 Doc4 nasa.xml 476,646 56,317 23 MB 8 5.58314 Doc5 SigmodRecord.xml 11,526 3,737 476 KB 6 5.14107 Doc6 mondial-3.0.xml 22,423 47,423 1 MB 5 3.59274 Introduction RW & Motivation XML Keys Results Conclusion 17/24
  • 20. Experimental Results – Datasets (2/2) SigmodRecord (seed keys) (ε, (issue, {volume.S, number.S})) (ε, ( ∗ .issue, {volume.S, number.S})) Interaction rule (issue, (articles, {article.title.S})) (Q, (Q′ , {P.P1 , . . . , P.Pk })), (issue.articles, (article, {title.S})) (Q.Q′ , (P, {P1 , . . . , Pk })) (issue, (articles.article, {initP age.S, endP age.S})) (Q, (Q′ .P, {P1 , . . . , Pk })) (ε, (issue.articles.article.authors.author, {position})) (issue, (articles.article, {title.S})) Sigmod.xml <SigmodRecord> <issue> <volume>13</volume> <number>2</number> <articles> <article> <title>Deadlock Detection is Cheap.</title> <initPage>19</initPage> <endPage>34</endPage> <authors> <author position="00">Rakesh Agrawal</author> <author position="01">Michael J. Carey</author> <author position="02">David J. DeWitt</author> </authors> </article> ... </articles> </issue> ... </SigmodRecord> Introduction RW & Motivation XML Keys Results Conclusion 18/24
  • 21. Experimental Results Strategies to generate implied and non-implied XML keys Non-implied keys If we apply the interaction rule to E db - (issue, (articles, {article.title.S})) qϕ E conference - (issue.articles, (article, {title.S})) E issue We derived the implied key - (issue, (articles.article, {title.S})) E ℓ0 = content We applied the interaction, E articles context-target, subnodes, context-path E article containment, target-path containment, ′ subnodes-epsilon and prefix-epsilon rules qϕ E author whenever possible. first E E last S S Introduction RW & Motivation XML Keys Results Conclusion 19/24
  • 22. Results for Implication Problem Deciding implication of XML keys 2.5 abs-abs mix-rel 1.8 abs-rel wildcard rel-abs 1.6 rel-rel mix-abs 2 mix-rel Time [ms] Time [ms] 1.4 1.2 1.5 1 0.8 1 0.6 0.4 0.5 0.2 0 0 0 20 40 60 80 100 120 140 0 20 40 60 80 100 120 140 Size of Σ Size of Σ (e) XML key implication, all cases. (f) Effect of wildcards presence. Figure: Performance of the Algorithm for the Implication of XML Keys (Σ |= ϕ). Introduction RW & Motivation XML Keys Results Conclusion 20/24
  • 23. Results for Validation Problem Document Validation (1/2) XML doc. Key Set Time[ms] 321gone & Processed Keys: 23 3.458 35000 >30min >25min 1.88min 5min Full-set yahoo Discarded keys: 15 Cover-set (Doc1 & 2) Cover set: 8 keys 30000 DBLP Processed Keys: 36 12.757 25000 Running Time [ms] (Doc3) Discarded keys: 24 Cover set: 12 keys 20000 nasa Processed Keys: 35 9.23 (Doc4) Discarded keys: 28 15000 Cover set size: 7 keys 10000 Sigmod Processed Keys: 24 5.294 Record Discarded Keys: 19 5000 (Doc5) Cover set: 5 keys 952ms 1329ms mondial Processed Keys: 26 4.342 0 Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 (Doc6) Discarded Keys: 16 XML documents Cover set: 10 (a) Non-redundant Cover Sets. (b) Validation Against Cover Sets. Figure: Non-redundant Cover Sets of XML keys and Validation of XML Documents Introduction RW & Motivation XML Keys Results Conclusion 21/24
  • 24. Results for Validation Problem Document Validation (2/2) There are some very expensive XPath queries. SigmodRecord (issue, (articles.article, {title.S})) - there are 1503 nodes in the XPath query “//issue/articles/article”. (ε, (issue.articles.article.authors.author, {S})) - there are 3737 nodes in the XPath query “//issue/articles/article/authors/author”. Introduction RW & Motivation XML Keys Results Conclusion 22/24
  • 25. Main conclusion This work was motivated by two objectives: 1 To demonstrate that there are expressive classes of XML keys that can be reasoned about efficiently. 2 To show that our observations on the problem of deciding implication has immediate consequences for other perennial task in XML database management. [1.1] We studied a fragment of XML keys with nonempty sets of simple key paths. [1.1.1] Presenting an efficient implementation for the implication problem. [1.1.2] Showing through experiments that the proposed algorithm runs fast in practice and scales well. [2.1] We studied the problem of validating an XML document against a set of XML keys. [2.2] Presenting an optimization method for this validation via the computation of non-redundant cover sets of XML keys. [2.3] Our experiments show that enormous time savings can be achieved in practice. Introduction RW & Motivation XML Keys Results Conclusion 23/24
  • 26. Questions? THANKS! Emir Mu˜oz Jim´nez– emir@emunoz.org n e Introduction RW & Motivation XML Keys Results Conclusion 24/24
  • 27. Axiomatization ∗ Let be P L from the grammar Q → ℓ | ε | Q.Q | where ℓ ∈ L is a label, ε is the empty word, “.” is the concat operator, and “ ∗ ” is the wildcard. For nodes v and v ′ of an XML tree T , the value intersection of v[[Q]] and v ′ [ Q]] is given by v[[Q]] ∩v v ′ [ Q]] = {(w, w′ ) | w ∈ v[[Q]], w′ ∈ v ′ [ Q]], w =v w′ } We define semantic closure by: Σ∗ = {ϕ ∈ C | Σ |= ϕ} and also syntactic closure by: Σ+ = {ϕ | Σ ⊢ℜ ϕ} ℜ A set of rules ℜ is sound (complete) if Σ+ ⊆ Σ∗ (Σ∗ ⊆ Σ+ ) ℜ ℜ A sound and complete set of rules is called axiomatization 1/2
  • 28. Deciding XML Key Implication Mini-trees and Witness Graphs (2/2) Theorem (Hartman & Link, 2009) Let Σ ∪ {ϕ} be a finite set of keys in the class K. We have Σ |= ϕ ′ if and only if qϕ is reachable from qϕ in GΣ,ϕ . Algorithm 2: XML key implication in K Input : finite set of XML keys Σ ∪ {ϕ} in K Output: yes, if Σ |= ϕ; no, otherwise 1 Construct GΣ,ϕ for Σ and ϕ; ′ 2 if qϕ is reachable from qϕ in G then return yes; else return no; end if 2/2