SlideShare une entreprise Scribd logo
1  sur  15
Télécharger pour lire hors ligne
Design Problem

                      Distributed Database Systems
                                 Fall 2012
                                                                                                  Design problem of distributed systems: Making decisions about
                                                                                                  the placement of data and programs across the sites of a computer
           Distributed Database Design                                                            network as well as possibly designing the network itself.
                                         SL02                                                     In DDBMS, the distribution of applications involves
                                                                                                           Distribution of the DDBMS software
                                                                                                           Distribution of applications that run on the database
                                                                                                  Distribution of applications will not be considered in the following;
       Design problem                                                                             instead the distribution of data is studied.
       Design strategies (top-down, bottom-up)
       Fragmentation (horizontal, vertical)
       Allocation and replication of fragments, optimality, heuristics


 DDBS12, SL02                                 1/60                                 ¨
                                                                               M. Bohlen    DDBS12, SL02                                 2/60                            ¨
                                                                                                                                                                     M. Bohlen


Framework of Distribution                                                                  Design Strategies/1
       Dimension for the analysis of distributed systems
                Level of sharing: no sharing, data sharing, data + program sharing
                Behavior of access patterns: static, dynamic
                Level of knowledge on access pattern behavior: no information,
                partial information, complete information
                                                                                                  Top-down approach
                                                                                                           Designing systems from scratch
                                                                                                           Homogeneous systems
                                                                                                  Bottom-up approach
                                                                                                           The databases already exist at a number of sites
                                                                                                           The databases should be connected to solve common tasks




       Distributed database design should be considered within this
       general framework.
 DDBS12, SL02                                 3/60                                 ¨
                                                                               M. Bohlen    DDBS12, SL02                                 4/60                            ¨
                                                                                                                                                                     M. Bohlen
Design Strategies/2                                   Design Strategies/3
       Top-down design strategy
                                                             Distribution design is the central part of the design in DDBMSs
                                                             (the other tasks are similar to traditional databases).
                                                             Objective: Design the LCSs by distributing the data over the sites.
                                                             Two main aspects have to be designed carefully:
                                                                      Fragmentation: Relation may be divided into a number of
                                                                      sub-relations, which are distributed
                                                                      Allocation and replication: Each fragment is stored at site(s) with
                                                                      ”optimal” distribution
                                                             Distribution design issues
                                                                      Why fragment at all?
                                                                      How to fragment?
                                                                      How much to fragment?
                                                                      How to test correctness?
                                                                      How to allocate?



 DDBS12, SL02                      5/60       ¨
                                          M. Bohlen    DDBS12, SL02                                  6/60                                  ¨
                                                                                                                                       M. Bohlen


Design Strategies/4                                   Fragmentation/1
       Bottom-up design strategy                             What is a reasonable unit of distribution? Relation or fragment of
                                                             relation?
                                                             Relations as unit of distribution:
                                                                      If the relation is not replicated, we get a high volume of remote data
                                                                      accesses.
                                                                      If the relation is replicated, we get unnecessary replications, which
                                                                      cause problems in executing updates and waste disk space
                                                                      Might be an OK solution, if queries need all the data in the relation
                                                                      and data stays only at the sites that use the data
                                                             Fragments of relation as unit of distribution:
                                                                      Application views are usually subsets of relations
                                                                      Thus, locality of accesses of applications is defined on subsets of
                                                                      relations
                                                                      Permits a number of transactions to execute concurrently, since they
                                                                      will access different portions of a relation
                                                                      Parallel execution of a single query (intra-query concurrency)
                                                                      However, semantic data control (especially integrity enforcement) is
                                                                      more difficult
                                                        ⇒ Fragments of relations are (usually) appropriate unit of distribution.
 DDBS12, SL02                      7/60       ¨
                                          M. Bohlen    DDBS12, SL02                                  8/60                                  ¨
                                                                                                                                       M. Bohlen
Fragmentation/2                                                                               Fragmentation/3
                                                                                                     Types of Fragmentation
                                                                                                              Horizontal: partitions a relation along its tuples
       Fragmentation aims to improve:                                                                         Vertical: partitions a relation along its attributes
                                                                                                              Mixed/hybrid: a combination of horizontal and vertical fragmentation
                Reliability
                Performance
                Balanced storage capacity and costs
                Communication costs
                Security
       The following information is used to decide fragmentation:
                Quantitative information: cardinality of relations, frequency of queries,                                     (a) Horizontal Fragmentation
                site where query is run, selectivity of the queries, etc.
                Qualitative information: predicates in queries, types of access of
                data, read/write, etc.


                                                                                                       (b) Vertical Fragmentation                       (c) Mixed Fragmentation



 DDBS12, SL02                                  9/60                                   ¨
                                                                                  M. Bohlen    DDBS12, SL02                                 10/60                                     ¨
                                                                                                                                                                                  M. Bohlen


Fragmentation/4                                                                               Fragmentation/5
       Example of database instance
                                                                                                     Example (contd.): Horizontal fragmentation of PROJ relation
                                                                                                              PROJ1: projects with budgets less than 200 000
                                                                                                              PROJ2: projects with budgets greater than or equal to 200 000




                                     Data
 DDBS12, SL02                                  11/60                                  ¨
                                                                                  M. Bohlen    DDBS12, SL02                                 12/60                                     ¨
                                                                                                                                                                                  M. Bohlen
Fragmentation/6                                                                            Correctness Rules of Fragmentation
       Example (contd.): Vertical fragmentation of PROJ relation
                PROJ1: information about project budgets                                          Completeness
                PROJ2: information about project names and locations                                       Decomposition of relation R into fragments R1 , R2 , . . . , Rn is complete
                                                                                                           iff each data item in R can also be found in some Ri .
                                                                                                  Reconstruction
                                                                                                           If relation R is decomposed into fragments R1 , R2 , . . . , Rn , then there
                                                                                                           should exist some relational operator that reconstructs R from its
                                                                                                           fragments, i.e., R = R1 . . . Rn
                                                                                                                Union to combine horizontal fragments
                                                                                                                Join to combine vertical fragments
                                                                                                  Disjointness
                                                                                                           If relation R is decomposed into fragments R1 , R2 , . . . , Rn and data
                                                                                                           item di appears in fragment Rj , then di should not appear in any other
                                                                                                           fragment Rk , k j (exception: primary key attribute for vertical
                                                                                                           fragmentation)
                                                                                                                For horizontal fragmentation, data item is a tuple
                                                                                                                For vertical fragmentation, data item is an attribute



 DDBS12, SL02                                13/60                                 ¨
                                                                               M. Bohlen    DDBS12, SL02                                    14/60                                    ¨
                                                                                                                                                                                 M. Bohlen


Idea of Horizontal Fragmentation                                                           Information Requirements/1

                                                                                           Database information:
                                                                                                  Links between relations (a link models a 1:N relationship between
       Intuition behind horizontal fragmentation                                                  relations that are related to each other by an equality join)
                Every site should hold all information that is used to query the site
                The information at the site should be fragmented so the queries of the
                site run faster
       Horizontal fragmentation is defined as selection operation, σp (R )
       Example:
                                    σBUDGET <200K (PROJ )
                                    σBUDGET ≥200K (PROJ )
                                                                                                                                  Links (one to many)
                                                                                                  Cardinality of relations: card(R)



 DDBS12, SL02                                15/60                                 ¨
                                                                               M. Bohlen    DDBS12, SL02                                    16/60                                    ¨
                                                                                                                                                                                 M. Bohlen
Information Requirements/2                                                         02.1               Information Requirements/3                                                  02.2




Application information:
       Simple predicates used in queries
                Relation R [A1 , A2 , ..., An ]
                Ai θvi a simple predicate (θ ∈ {=, <, ≤, >, ≥, }, vi ∈ Di , Di is the                 Application information (cnt’d):
                domain of Ai )                                                                               Minterm selectivities
                For relation R we define Pr = {p1 , p2 , ..., pm }                                                     The number of tuples of the relation that would be accessed by a user
                Example: PNAME = ”Maintenance”, BUDGET < 200000                                                       query, which is specified according to a given minterm predicate mi .
       Minterm predicates                                                                                    Access frequencies
                minterm predicates are conjunctions of simple predicates                                              The frequency with which a user application qi accesses data.
                Relation R, Pr = {p1 , p2 , ..., pm }
                M = {m1 , m2 , ..., mr } such that

                             M = {mi | mi = ∧pj ∈Pr pj ∗}, 1 ≤ j ≤ m, 1 ≤ i ≤ z

                where pj ∗ = pj or pj ∗ = ¬pj



 DDBS12, SL02                                    17/60                                        ¨
                                                                                          M. Bohlen    DDBS12, SL02                                18/60                                     ¨
                                                                                                                                                                                         M. Bohlen


Horizontal Fragmentation/1                                                                            Horizontal Fragmentation/2


                                                                                                      A horizontal fragment Ri of relation R consists of all the tuples of R that
We write Pr to denote the complete and minimal set of simple                                          satisfy a predicate Fj :
predicates generated from Pr:
                                                                                                                                       Rj = σFj (R ), 1 ≤ j ≤ w
       The set of predicates is complete if and only if any two tuples in the
       same fragment are referenced with the same probability by any                                  where Fj is a selection formula, which is (preferably) a minterm predicate.
       application
       The set of predicates is minimal if and only if each simple predicate                          Computing the horizontal fragmentation:
       is relevant for determining the fragmentation and for each pair of                               1. Determine Pr (80/20 rule)
       fragments there is at least one query that accesses the fragments                                2. Compute Pr’
       differently
                                                                                                        3. Determine M
                                                                                                        4. Minimize M (eliminating impossible minterms)




 DDBS12, SL02                                    19/60                                        ¨
                                                                                          M. Bohlen    DDBS12, SL02                                20/60                                     ¨
                                                                                                                                                                                         M. Bohlen
Horizontal Fragmentation/3                                                                Horizontal Fragmentation/4


       Example: Fragmentation of the PROJ relation
                Consider the following query: Find the name and budget of projects
                given their location.                                                            If access is only according to the location, the above set of
                The query is issued at all three locations                                       predicates is complete
                Fragmentation based on LOC, using the set of predicates                                   i.e., in each fragment PROJi each tuple has the same probability of
                {LOC = ‘Montreal ‘, LOC = ‘NewYork ‘, LOC = ‘Paris ‘}                                     being accessed
                                                                                                 If there is a second query/application that accesses only those
       PROJ1 = σLOC =‘Montreal ‘ (PROJ )
                                                PROJ2 = σLOC =‘NewYork ‘ (PROJ )                 project tuples where the budget is less than $200K, the set of
       PNO PNAME BUDGET LOC
                                                PNO PNAME BUDGET                 LOC             predicates is not complete.
                                                 P2 DB Develop. 135000 New York                           P2 in PROJ2 has higher probability to be accessed
        P1 Instrument. 150000 Montreal
                                                 P3 CAD/CAM 250000 New York
       PROJ3 = σLOC =‘Paris ‘ (PROJ )
       PNO    PNAME          BUDGET LOC
        P4 Maintenance 310000 Paris




 DDBS12, SL02                                21/60                                ¨
                                                                              M. Bohlen    DDBS12, SL02                                       22/60                                            ¨
                                                                                                                                                                                           M. Bohlen


Horizontal Fragmentation/5                                                                Horizontal Fragmentation/6                                                                02.3




       Example (contd.):                                                                         Example (contd.): Now, PROJi , i = 1, 2, 3 will be split in two
          Add BUDGET < 200K and BUDGET ≥ 200K to the set of predicates                           fragments
                to make it complete.
                ⇒ {LOC = ‘Montreal ‘, LOC = ‘NewYork ‘, LOC = ‘Paris ‘,                          PROJ1 = σLOC =‘Montr ‘∧BDGT <200K (PROJ )            PROJ2 = σLOC =‘NY ‘∧BDGT <200K (PROJ )
                    BUDGET ≥ 200K , BUDGET < 200K } is a complete set                            PNO PNAME BUDGET LOC                                 PNO PNAME BUDGET               LOC
                Minterms to fragment the relation are given as follows:                           P1 Instrumnt 150000 Montreal                         P2 DB Devel 135000 New York

                                                                                                 PROJ3 = σLOC =‘Paris ‘∧BDGT ≥200K (PROJ )            PROJ2 = σLOC =‘NY ‘∧BDGT ≥200K (PROJ )
                             (LOC = ‘Montreal ‘) ∧ (BUDGET ≤ 200K )
                                                                                                 PNO    PNAME          BUDGET LOC                     PNO PNAME BUDGET                LOC
                             (LOC = ‘Montreal ‘) ∧ (BUDGET > 200K )                               P4 Maintenance 310000 Paris                          P3 CAD/CAM 250000 New York
                             (LOC = ‘NewYork ‘) ∧ (BUDGET ≤ 200K )
                             (LOC = ‘NewYork ‘) ∧ (BUDGET > 200K )                                        Note that the following fragments are empty:
                             (LOC = ‘Paris ‘) ∧ (BUDGET ≤ 200K )                                               σLOC =‘Paris ‘∧BDGT <200K (PROJ )
                                                                                                               σLOC =‘Montreal ‘∧BDGT ≥200K (PROJ )
                             (LOC = ‘Paris ‘) ∧ (BUDGET > 200K )




 DDBS12, SL02                                23/60                                ¨
                                                                              M. Bohlen    DDBS12, SL02                                       24/60                                            ¨
                                                                                                                                                                                           M. Bohlen
Derived Horizontal Fragmentation/1                                       02.5               Derived Horizontal Fragmentation/2

       Defined on a member (target) relation of a link according to a
       selection operation specified on its owner (source).
                                                                                                   Given a link L where owner (L ) = S and member (L ) = R, the
                Each link is an equijoin.
                                                                                                   derived horizontal fragments of R are defined as
                Fragments of member relation can be defined with a semijoin on
                fragments of owner relation.                                                                                   Ri = R   |><   Si , 1 ≤ i ≤ w

                                                                                                   where w is the maximum number of fragments that will be defined
                                                                                                   on R and
                                                                                                                            Si = σFi (S )
                                                                                                   where Fi is the formula according to which the primary horizontal
                                                                                                   fragment Si is defined.




 DDBS12, SL02                               25/60                                   ¨
                                                                                M. Bohlen    DDBS12, SL02                                 26/60                           ¨
                                                                                                                                                                      M. Bohlen


Derived Horizontal Fragmentation/3                                       02.7               Vertical Fragmentation/1
                                                                                                   Objective of vertical fragmentation is to partition a relation into a set
                                                                                                   of smaller relations so that many of the applications will run on only
                                                                                                   one fragment.
       Given link L 1 where                                                                        Vertical fragmentation of a relation R produces fragments
           owner (L 1) = PAY and                                                                   R1 , R2 , . . . , each of which contains a subset of R’s attributes.
           member (L 1) = EMP                                                                      Vertical fragmentation is defined using the projection operation of
       and                                                                                         the relational algebra:
           PAY 1 = σSAL ≤30000 (PAY )                                                                                               πA1 ,A2 ,...,An (R )
           PAY 2 = σSAL >30000 (PAY )
       we get
           EMP1 = EMP |>< PAY 1                                                                    Example:
           EMP2 = EMP |>< PAY 2
                                                                                                                         PROJ1 = πPNO ,BUDGET (PROJ )
                                                                                                                         PROJ2 = πPNO ,PNAME ,LOC (PROJ )
                                                                                                   Vertical fragmentation has also been studied for (centralized) DBMS
                                                                                                            Smaller relations, and hence less page accesses
                                                                                                            e.g., MONET system
 DDBS12, SL02                               27/60                                   ¨
                                                                                M. Bohlen    DDBS12, SL02                                 28/60                           ¨
                                                                                                                                                                      M. Bohlen
Vertical Fragmentation/2                                                                     Vertical Fragmentation/3



                                                                                                    Two types of heuristics for vertical fragmentation exist:
       Vertical fragmentation is more complicated than horizontal
                                                                                                             Grouping: assign each attribute to one fragment, and at each step,
       fragmentation
                                                                                                             join some of the fragments until some criteria is satisfied.
                In horizontal partitioning: for n simple predicates, the number of
                                                                                                                  Bottom-up approach
                possible minterms is 2n ; some of them can be ruled out by existing
                                                                                                             Splitting: starts with a relation and decides on beneficial partitionings
                implications/constraints.
                                                                                                             based on the access behaviour of applications to the attributes.
                In vertical partitioning: for m non-primary key attributes, the number
                of possible fragments is equal to B (m) (= the mth Bell number), i.e.,                            Top-down approach
                the number of partitions of a set with m members.                                                 Results in non-overlapping fragments
                     For large numbers, B (m) ≈ mm (e.g., B (15) = 109 )                                     Optimal solution is probably closer to the full relation than to a set of
                                                                                                             small relations with only one attribute




 DDBS12, SL02                                   29/60                                ¨
                                                                                 M. Bohlen    DDBS12, SL02                                     30/60                                 ¨
                                                                                                                                                                                 M. Bohlen


Vertical Fragmentation/4                                                                     Vertical Fragmentation/5



                                                                                                    Given are the user queries/applications Q = (q1 , . . . , qq ) that will
       Application information: The major information required as input                             run on relation R (A1 , . . . , An )
       for vertical fragmentation is related to applications (queries)                              Attribute Usage Matrix: Denotes which query uses which attribute:
                Since vertical fragmentation places in one fragment those attributes
                usually accessed together, there is a need for some measure that                                                                 1     iff qi uses Aj
                would define more precisely the notion of “togetherness”, i.e., how                                          use (qi , Aj ) =
                                                                                                                                                 0       otherwise
                closely related the attributes are.
                This information is obtained from queries and collected in the
                Attribute Usage Matrix and Attribute Affinity Matrix.                                         The use (qi , •) vectors for each application are easy to define if the
                                                                                                             designer knows the applications that willl run on the DB (consider
                                                                                                             also the 80-20 rule)




 DDBS12, SL02                                   31/60                                ¨
                                                                                 M. Bohlen    DDBS12, SL02                                     32/60                                 ¨
                                                                                                                                                                                 M. Bohlen
Vertical Fragmentation/6                                                      02.8               Vertical Fragmentation/7
       Example: Consider relation PROJ (PNO , PNAME , BUDGET , LOC )
       and queries:

        q1 = SELECT BUDGET FROM PROJ WHERE PNO=Value                                                    Attribute Affinity Matrix: Denotes the frequency of two attributes Ai
        q2 = SELECT PNAME,BUDGET FROM PROJ                                                              and Aj with respect to a set of queries Q = (q1 , . . . , qn ):
        q3 = SELECT PNAME FROM PROJ WHERE LOC=Value
                                                                                                                      aff (Ai , Aj ) =                         (             ref l (qk ) ∗ acc l (qk ))
        q4 = SELECT SUM(BUDGET) FROM PROJ WHERE LOC =Value                                                                                   use (q ,A )=1,
                                                                                                                                         k : use (qk ,Ai )=1       sites l
                                                                                                                                                   k j
       Abbreviations:
       A1 = PNO , A2 = PNAME , A3 = BUDGET , A4 = LOC                                                   where
       Attribute Usage Matrix                                                                                    ref l (qk ) is the cost (= number of accesses to (Ai , Aj )) of query qK at
                                                                                                                 site l
                                                                                                                 acc l (qk ) is the frequency of query qk at site l




 DDBS12, SL02                                  33/60                                     ¨
                                                                                     M. Bohlen    DDBS12, SL02                                             34/60                                              ¨
                                                                                                                                                                                                          M. Bohlen


Vertical Fragmentation/8                                                                         Vertical Fragmentation/9
       Example (contd.): Let the cost of each query be ref l (qk ) = 1, and
       the frequency acc l (qk ) of the queries be as follows:
                        Site1             Site2             Site3
                        acc1 (q1 ) = 15   acc2 (q1 ) = 20   acc3 (q1 ) = 10                             Idea: Take the attribute affinity matrix (AA) and reorganize the
                        acc1 (q2 ) = 5    acc2 (q2 ) = 0    acc3 (q2 ) = 0                              attribute orders to form clusters where the attributes in each cluster
                        acc1 (q3 ) = 25   acc2 (q3 ) = 25   acc3 (q3 ) = 25                             demonstrate high affinity to one another.
                        acc1 (q4 ) = 3    acc2 (q4 ) = 0    acc3 (q4 ) = 0                              Bond energy algorithm (BEA) has been suggested for that
                                                                                                        purpose for several reasons:
       Attribute affinity matrix aff (Ai , Aj ) =                                                                 It is designed specifically to determine groups of similar items as
                                                                                                                 opposed to a linear ordering of the items.
                                                                                                                 The final groupings are insensitive to the order in which items are
                                                                                                                 presented.
                                                                                                                 The computation time is reasonable (O (n2 ), where n is the number of
                                                                                                                 attributes)


                e.g., aff (A1 , A3 ) = 1 =1 3=1 acc l (qk ) =
                                         k     l
                acc 1 (q1 ) + acc 2 (q1 ) + acc 3 (q1 ) = 45
                (q1 is the only query to access both A1 and A3 )
 DDBS12, SL02                                  35/60                                     ¨
                                                                                     M. Bohlen    DDBS12, SL02                                             36/60                                              ¨
                                                                                                                                                                                                          M. Bohlen
Vertical Fragmentation/10                                                                                    Vertical Fragmentation/11
                                                                                                                    Example (contd.): Cluster Affinity Matrix CA after running BEA



       BEA:
                Input: AA matrix
                Output: Clustered AA matrix (CA)
                Permutation is done in such a way to maximize the following global
                affinity mesaure (affinity of Ai and Aj with their neighbors):
                                n    n
                                                                                                                             Elements with similar values are grouped together, and two clusters
                                                                                                                             can be identified
                        AM =               aff(Ai , Aj )[aff(Ai , Aj −1 ) + aff(Ai , Aj +1 ) +
                                                                                                                             An additional partitioning algorithm is needed to identify the clusters
                               i =1 j =1
                                                                                                                             in CA
                                                        aff(Ai −1 , Aj ) + aff(Ai +1 , Aj )]                                        Usually more clusters and more than one candidate partitioning, thus
                                                                                                                                    additional steps are needed to select the best clustering.
                                                                                                                             The resulting fragmentation after partitioning (PNO is added in
                                                                                                                             PROJ2 explicilty as key):
                                                                                                                                                    PROJ1 = {PNO , BUDGET }
                                                                                                                                                    PROJ2 = {PNO , PNAME , LOC }
 DDBS12, SL02                                        37/60                                           ¨
                                                                                                 M. Bohlen    DDBS12, SL02                                       38/60                                 ¨
                                                                                                                                                                                                   M. Bohlen


BEA Algorithm/1                                                                                              BEA Algorithm/2
                                                                                                             In order to determine the best placement we define the contribution of a
                                                                                                             placement.
                                                                                                                    Contribution of a placement:
       Input: The AA matrix
       Output: Clustered affinity matrix CA which is a permutation of AA                                             cont (Ai , Ak , Aj ) =
                                                                                                                             2 ∗ bond (Ai , Ak )+
       Initialization: Place and fix one of the columns of AA in CA (we                                                       2 ∗ bond (Ak , Aj )−
       choose column 1 in our example)                                                                                       2 ∗ bond (Ai , Aj )
       Iteration: Place the remaining n-i columns in the remaining i+1
       positions in the CA matrix. For each column choose the placement
       that makes the largest contribution to the global affinity measure.                                           Bond between a pair of attributes is defined as:
       Row order: Order the rows according to the column ordering.
                                                                                                                    bond (Ax , Ay ) =
                                                                                                                                n
                                                                                                                                     aff (Az , Ax ) ∗ aff (Az , Ay )
                                                                                                                              z =1


 DDBS12, SL02                                        39/60                                           ¨
                                                                                                 M. Bohlen    DDBS12, SL02                                       40/60                                 ¨
                                                                                                                                                                                                   M. Bohlen
BEA Algorithm/3                                                                   BEA Algorithm/4
                                                                                         Ordering (0-3-1) :
Consider the following AA matrix and the corresponding CA matrix where                   cont (A 0, A 3, A 1)
A1 and A2 have been placed.
                                                                                                  = 2 ∗ bond (A 0, A 3) + 2 ∗ bond (A 3, A 1)–2 ∗ bond (A 0, A 1)
                A1 A2 A3 A4                       A1 A2                                           = 2 ∗ 0 + 2 ∗ 4410 – 2 ∗ 0 = 8820
        A1      45 0 45 0                  A1     45 0
                                                                                         Ordering (1-3-2) :
   AA = A2      0 80 5 75             CA = A2     0 80
        A3      45 5 53 3                  A3     45 5                                   cont (A 1, A 3, A 2)
        A4      0 75 3 78                  A4     0 75                                            = 2 ∗ bond (A 1, A 3) + 2 ∗ bond (A 3, A 2)–2 ∗ bond (A 1, A 2)
                                                                                                  = 2 ∗ 4410 + 2 ∗ 890 – 2 ∗ 225 = 10150
bond (A3 , A1 ) = 45 ∗ 45 + 5 ∗ 0 + 53 ∗ 45 + 3 ∗ 0 = 4410                               Ordering (2-3-4) :
bond (A3 , A2 ) = 45 ∗ 0 + 5 ∗ 80 + 53 ∗ 5 + 3 ∗ 75 = 890
                                                                                         cont (A 2, A 3, A 4)
                                                                                                  = 2 ∗ bond (A 2, A 3) + 2 ∗ bond (A 3, A 4)–2 ∗ bond (A 2, A 4)
The next step is to place A3 .                                                                    = 1780


 DDBS12, SL02                        41/60                                ¨
                                                                      M. Bohlen    DDBS12, SL02                                  42/60                              ¨
                                                                                                                                                                M. Bohlen


BEA Algorithm/5                                                02.9               BEA Algorithm/6
After adding A3 the CA matrix has the form
                                                                                         The final step is to divide a set of clustered attributes {A1 , ..., An } into
                A1 A3 A2                                                                 two sets {A1 , ..., Ai } and {Ai +1 , ..., An } such that the costs of queries
                45 45 0                                                                  that access both sets are minimized.
      CA =      0  5 80                                                                  The appropriate split point on the diagonal must be determined:
                45 53 5
                0  3 75                                                                                                     A1    A3     A2   A4
                                                                                                                     A1     45    45      0   0
After adding A4 and ordering rows the CA matrix has the form                                                         A3     45    53      5   3
                                                                                                                     A2     0      5     80   75
                  A1 A3 A2 A4                                                                                        A4     0      3     75   78
           A1     45 45 0  0
                                                                                         This is an optimization problem: the optimal point on the diagonal
      CA = A3     45 53 5  3
                                                                                         must be determined.
           A2     0  5 80 75
                                                                                         Note: Generalizations are needed to deal with
           A4     0  3 75 78
                                                                                                  partitions located in the middle of the matrix
                                                                                                  multiple split points


 DDBS12, SL02                        43/60                                ¨
                                                                      M. Bohlen    DDBS12, SL02                                  44/60                              ¨
                                                                                                                                                                M. Bohlen
BEA Algorithm/7                                                                                       Correctness of Vertical Fragmentation

                                                                                                             Relation R is decomposed into fragments R1 , R2 , . . . , Rn
       AQ (qi ) = {Aj | use (qi , Aj ) = 1}                           attributes accessed by qi                  e.g., PROJ = {PNO , BUDGET , PNAME , LOC } into
                                                                                                                 PROJ1 = {PNO , BUDGET } and PROJ2 = {PNO , PNAME , LOC }
       TQ = {qi | AQ (qi ) ⊆ TA }                                queries that access TA only
       BQ = {qi | AQ (qi ) ⊆ BA }                               queries that access BA only                  Completeness
       OQ = Q − {TQ ∪ BQ }                                   queries that access TA and BA                            Guaranteed by the partitioning algortihm, which assigns each
                                                                                                                      attribute in A to one partition
       CQ =     q i ∈Q   ∀Sj   refj (qi ) ∗ accj (qi )                       cost of all queries             Reconstruction
       CTQ =                          refj (qi ) ∗ accj (qi )                cost of TQ queries                       Join to reconstruct vertical fragments
                   qi ∈TQ      ∀Sj
                                                                                                                      R = R1 · · · Rn = PROJ1 PROJ2
       CBQ =       qi ∈BQ      ∀Sj    refj (qi ) ∗ accj (qi )               cost of BQ queries
                                                                                                             Disjointness
       COQ =        qi ∈OQ      ∀Sj   refj (qi ) ∗ accj (qi )             cost of other queries                       Attributes have to be disjoint in VF. Two cases are distinguished:
                                                                                                                           If tuple IDs are used, the fragments are really disjoint
       CTQ ∗ CBQ − COQ 2                                                              maximize                             Otherwise, key attributes are replicated automatically by the system
                                                                                                                           e.g., PNO in the above example



 DDBS12, SL02                                        45/60                                    ¨
                                                                                          M. Bohlen    DDBS12, SL02                                   46/60                                     ¨
                                                                                                                                                                                            M. Bohlen


Mixed Fragmentation                                                                                   Replication and Allocation


       In most cases simple horizontal or vertical fragmentation of a DB                                     Replication: Which fragements shall be stored as multiple copies?
       schema will not be sufficient to satisfy the requirements of the                                                Complete Replication
       applications.                                                                                                       Complete copy of the database is maintained in each site
       Mixed fragmentation (hybrid fragmentation): Consists of a                                                      Selective Replication
       horizontal fragment followed by a vertical fragmentation, or a vertical                                             Selected fragments are replicated in some sites
       fragmentation followed by a horizontal fragmentation                                                  Allocation: On which sites to store the various fragments?
       Fragmentation is defined using the selection and projection                                                     Centralized
                                                                                                                           Consists of a single DB and DBMS stored at one site with users
       operations of relational algebra:
                                                                                                                           distributed across the network
                                                                                                                      Partitioned
                                              σp (πA1 ,...,An (R ))
                                                                                                                           Database is partitioned into disjoint fragments, each fragment assigned
                                              πA1 ,...,An (σp (R ))                                                        to one site




 DDBS12, SL02                                        47/60                                    ¨
                                                                                          M. Bohlen    DDBS12, SL02                                   48/60                                     ¨
                                                                                                                                                                                            M. Bohlen
Replication/1                                                                             Replication/2

                                                                                                 Comparison of replication alternatives


       Replicated DB
                fully replicated: each fragment at each site
                partially replicated: each fragment at some of the sites
       Non-replicated DB (= partitioned DB)
                partitioned: each fragment resides at only one site
       Rule of thumb:
                  read only queries
                If                  ≥ 1, then replication is advantageous, otherwise
                   update queries
                replication may cause problems




 DDBS12, SL02                                   49/60                             ¨
                                                                              M. Bohlen    DDBS12, SL02                                    50/60                                      ¨
                                                                                                                                                                                  M. Bohlen


Fragment Allocation/1                                                                     Fragment Allocation/2

       Fragment allocation problem                                                               Required information
                Given are:                                                                                Database Information
                – fragments F = {F1 , F2 , ..., Fn }                                                           selectivity of fragments
                – network sites S = {S1 , S2 , ..., Sm }                                                       size of a fragment
                – and applications Q = {q1 , q2 , ..., ql }                                               Application Information
                Find: the ”optimal” distribution of F to S                                                     RRij : number of read accesses of a query qi to a fragment Fj
                                                                                                               URij : number of update accesses of query qi to a fragment Fj
       Optimality                                                                                              uij : a matrix indicating which queries updates which fragments,
                Minimal cost                                                                                   rij : a similar matrix for retrievals
                                                                                                               originating site of each query
                     Communication + storage + processing (read and update)
                     Cost in terms of time (usually)                                                      Site Information
                Performance                                                                                    USCk : unit cost of storing data at a site Sk
                                                                                                               LPCk : cost of processing one unit of data at a site Sk
                     Response time and/or throughput
                                                                                                          Network Information
                Constraints
                                                                                                               communication cost/frame between two sites
                     Per site constraints (storage and processing)
                                                                                                               frame size



 DDBS12, SL02                                   51/60                             ¨
                                                                              M. Bohlen    DDBS12, SL02                                    52/60                                      ¨
                                                                                                                                                                                  M. Bohlen
Fragment Allocation/3                                                                            Fragment Allocation/4
                                                                                                        The total cost function has two components: storage and query
       We discuss an allocation model which attempts to                                                 processing.
                minimize the total cost of processing and storage
                meet certain response time restrictions                                                                             TOC =                     STCjk +            QPCi
       General Form:                                                                                                                            Sk ∈S Fj ∈F             q i ∈Q
                                           min(Total Cost)

                                                                                                                 Storage cost of fragment Fj at site Sk :
                subject to
                     response time constraint                                                                                                  STCjk = USCk ∗ size (Fj ) ∗ xij
                     storage constraint
                     processing constraint                                                                       where USCk is the unit storage cost at site k
       Decision variable xij
                                                                                                                 Query processing cost for a query qi is composed of two
                                                                                                                 components:
                                 1   if fragment Fi is stored at site Sj
                         xij =                                                                                        composed of processing cost (PC) and transmission cost (TC)
                                 0               otherwise
                                                                                                                                                    QPCi = PCi + TCi

 DDBS12, SL02                                          53/60                             ¨
                                                                                     M. Bohlen    DDBS12, SL02                                            54/60                              ¨
                                                                                                                                                                                         M. Bohlen


Fragment Allocation/5                                                                            Fragment Allocation/6
       Processing cost is a sum of three components:                                                    The transmission cost is composed of two components:
                access cost (AC), integrity contraint cost (IE), concurency control cost                         Cost of processing updates (TCU) and cost of processing retrievals
                (CC)                                                                                             (TCR)

                                           PCi = ACi + IEi + CCi                                                                                   TCi = TCUi + TCRi

                Access cost:                                                                                     Cost of updates:
                                                                                                                      Inform all the sites that have replicas + a short confirmation message
                                 ACi =                 (URij + RRij ) ∗ xij ∗ LPCk                                    back
                                         sk ∈S Fj ∈F

                                                                                                                 TCUi =                     uij ∗ (update message cost + acknowledgment cost)
                where LPCk is the unit process cost at site k
                                                                                                                          Sk ∈S Fj ∈F
                Integrity and concurrency costs:
                     Can be similarly computed, though depends on the specific constraints                        Retrieval cost:
                                                                                                                      Send retrieval request to all sites that have a copy of fragments that are
       Note: ACi assumes that processing a query involves decomposing                                                 needed + sending back the results from these sites to the originating
       it into a set of subqueries, each of which works on a fragment, ...,                                           site.
                This is a very simplistic model
                Does not take into consideration different query costs depending on                              TCRi =            min (xjk ∗ (cost retrieval request + cost sending back result))
                                                                                                                                   S k ∈S
                                                                                                                          F j ∈F
                the operator or different algorithms that are applied
 DDBS12, SL02                                          55/60                             ¨
                                                                                     M. Bohlen    DDBS12, SL02                                            56/60                              ¨
                                                                                                                                                                                         M. Bohlen
Fragment Allocation/7                                                                            Fragment Allocation/8

                                                                                                        Solution Methods
       Modeling the constraints                                                                                  The complexity of this allocation model/problem is NP-complete
                                                                                                                 Correspondence between the allocation problem and similar
                Response time constraint for a query qi
                                                                                                                 problems in other areas
                      execution time of qi ≤ max. allowable response time for qi                                      Plant location problem in operations research
                                                                                                                      Knapsack problem
                Storage constraints for a site Sk                                                                     Network flow problem
                                                                                                                 Hence, solutions from these areas can be re-used
                              storage requirement of Fj at Sk ≤ storage capacity of Sk                           Use different heuristics to reduce the search space
                    F j ∈F                                                                                            Assume that all candidate partitionings have been determined together
                                                                                                                      with their associated costs and benefits in terms of query processing.
                Processing constraints for a site Sk                                                                  The problem is then reduced to find the optimal partitioning and
                                                                                                                      placement for each relation
                             processing load of qi at site Sk ≤ processing capacity ofSk                              Ignore replication at the first step and find an optimal non-replicated
                  qi ∈Q                                                                                               solution
                                                                                                                      Replication is then handeled in a second step on top of the previous
                                                                                                                      non-replicated solution.



 DDBS12, SL02                                      57/60                                 ¨
                                                                                     M. Bohlen    DDBS12, SL02                                   58/60                                   ¨
                                                                                                                                                                                     M. Bohlen


Conclusion/1                                                                                     Conclusion/2

       Distributed design decides on the placement of (parts of the) data
       and programs across the sites of a computer network
       There are two key technical questions: fragmentation and
       allocation/replication of data                                                                   Allocation/Replication of data
       Horizontal fragmentation is defined via the selection operation                                            Type of replication: no replication, partial replication, full replication
       σp (R )                                                                                                   Optimal allocation/replication modelled as a cost function under a set
                                                                                                                 of constraints
                Rewrites the queries of each site in the conjunctive normal form and
                                                                                                                 The complexity of the problem is NP-complete
                finds a minimal and complete set of conjunctions to determine
                                                                                                                 Use of different heuristics to reduce the complexity
                fragmentation
       Vertical fragmentation via the projection operation πA (R )
                Computes the attribute affinity matrix and groups “similar” attributes
                together
       Mixed fragmentation is a combination of both approaches



 DDBS12, SL02                                      59/60                                 ¨
                                                                                     M. Bohlen    DDBS12, SL02                                   60/60                                   ¨
                                                                                                                                                                                     M. Bohlen

Contenu connexe

En vedette (6)

8 drived horizontal fragmentation
8  drived horizontal fragmentation8  drived horizontal fragmentation
8 drived horizontal fragmentation
 
Database fragmentation
Database fragmentationDatabase fragmentation
Database fragmentation
 
3 design
3 design3 design
3 design
 
Database, 3 Distribution Design
Database, 3 Distribution DesignDatabase, 3 Distribution Design
Database, 3 Distribution Design
 
Lecture10
Lecture10Lecture10
Lecture10
 
Fragmentation and types of fragmentation in Distributed Database
Fragmentation and types of fragmentation in Distributed DatabaseFragmentation and types of fragmentation in Distributed Database
Fragmentation and types of fragmentation in Distributed Database
 

Similaire à Sl02 2x2 (1)

Similaire à Sl02 2x2 (1) (15)

Csc 303
Csc 303 Csc 303
Csc 303
 
DDBMS Paper with Solution
DDBMS Paper with SolutionDDBMS Paper with Solution
DDBMS Paper with Solution
 
Distributed DBMS - Unit 3 - Distributed DBMS Architecture
Distributed DBMS - Unit 3 - Distributed DBMS ArchitectureDistributed DBMS - Unit 3 - Distributed DBMS Architecture
Distributed DBMS - Unit 3 - Distributed DBMS Architecture
 
Distributed databases and dbm ss
Distributed databases and dbm ssDistributed databases and dbm ss
Distributed databases and dbm ss
 
Database Development Strategies
Database Development StrategiesDatabase Development Strategies
Database Development Strategies
 
Distributed Database System
Distributed Database SystemDistributed Database System
Distributed Database System
 
Chapter25
Chapter25Chapter25
Chapter25
 
Lecture 1 ddbms
Lecture 1 ddbmsLecture 1 ddbms
Lecture 1 ddbms
 
Adbms 23 distributed database design
Adbms 23 distributed database designAdbms 23 distributed database design
Adbms 23 distributed database design
 
Design1
Design1Design1
Design1
 
5. oose design new copy
5. oose design new   copy5. oose design new   copy
5. oose design new copy
 
ditributed databases
ditributed databasesditributed databases
ditributed databases
 
Distributed Databases
Distributed DatabasesDistributed Databases
Distributed Databases
 
Domain driven design - Part I
Domain driven design - Part IDomain driven design - Part I
Domain driven design - Part I
 
Design concepts and design principles
Design concepts and design principlesDesign concepts and design principles
Design concepts and design principles
 

Plus de Prasanta Paul

ICard-Prasanta Kumar Paul
ICard-Prasanta Kumar PaulICard-Prasanta Kumar Paul
ICard-Prasanta Kumar PaulPrasanta Paul
 
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...Prasanta Paul
 
Techno India Ramgarh
Techno India RamgarhTechno India Ramgarh
Techno India RamgarhPrasanta Paul
 

Plus de Prasanta Paul (8)

Aicte-List
Aicte-List Aicte-List
Aicte-List
 
ICard-Prasanta Kumar Paul
ICard-Prasanta Kumar PaulICard-Prasanta Kumar Paul
ICard-Prasanta Kumar Paul
 
Prasanta paul
Prasanta paulPrasanta paul
Prasanta paul
 
Prasanta Kumar Paul
Prasanta Kumar PaulPrasanta Kumar Paul
Prasanta Kumar Paul
 
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
Ppt for paper id 696 a review of hybrid data mining algorithm for big data mi...
 
Techno India Ramgarh
Techno India RamgarhTechno India Ramgarh
Techno India Ramgarh
 
ICardPrasanta
ICardPrasantaICardPrasanta
ICardPrasanta
 
Coding
CodingCoding
Coding
 

Dernier

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 

Dernier (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Sl02 2x2 (1)

  • 1. Design Problem Distributed Database Systems Fall 2012 Design problem of distributed systems: Making decisions about the placement of data and programs across the sites of a computer Distributed Database Design network as well as possibly designing the network itself. SL02 In DDBMS, the distribution of applications involves Distribution of the DDBMS software Distribution of applications that run on the database Distribution of applications will not be considered in the following; Design problem instead the distribution of data is studied. Design strategies (top-down, bottom-up) Fragmentation (horizontal, vertical) Allocation and replication of fragments, optimality, heuristics DDBS12, SL02 1/60 ¨ M. Bohlen DDBS12, SL02 2/60 ¨ M. Bohlen Framework of Distribution Design Strategies/1 Dimension for the analysis of distributed systems Level of sharing: no sharing, data sharing, data + program sharing Behavior of access patterns: static, dynamic Level of knowledge on access pattern behavior: no information, partial information, complete information Top-down approach Designing systems from scratch Homogeneous systems Bottom-up approach The databases already exist at a number of sites The databases should be connected to solve common tasks Distributed database design should be considered within this general framework. DDBS12, SL02 3/60 ¨ M. Bohlen DDBS12, SL02 4/60 ¨ M. Bohlen
  • 2. Design Strategies/2 Design Strategies/3 Top-down design strategy Distribution design is the central part of the design in DDBMSs (the other tasks are similar to traditional databases). Objective: Design the LCSs by distributing the data over the sites. Two main aspects have to be designed carefully: Fragmentation: Relation may be divided into a number of sub-relations, which are distributed Allocation and replication: Each fragment is stored at site(s) with ”optimal” distribution Distribution design issues Why fragment at all? How to fragment? How much to fragment? How to test correctness? How to allocate? DDBS12, SL02 5/60 ¨ M. Bohlen DDBS12, SL02 6/60 ¨ M. Bohlen Design Strategies/4 Fragmentation/1 Bottom-up design strategy What is a reasonable unit of distribution? Relation or fragment of relation? Relations as unit of distribution: If the relation is not replicated, we get a high volume of remote data accesses. If the relation is replicated, we get unnecessary replications, which cause problems in executing updates and waste disk space Might be an OK solution, if queries need all the data in the relation and data stays only at the sites that use the data Fragments of relation as unit of distribution: Application views are usually subsets of relations Thus, locality of accesses of applications is defined on subsets of relations Permits a number of transactions to execute concurrently, since they will access different portions of a relation Parallel execution of a single query (intra-query concurrency) However, semantic data control (especially integrity enforcement) is more difficult ⇒ Fragments of relations are (usually) appropriate unit of distribution. DDBS12, SL02 7/60 ¨ M. Bohlen DDBS12, SL02 8/60 ¨ M. Bohlen
  • 3. Fragmentation/2 Fragmentation/3 Types of Fragmentation Horizontal: partitions a relation along its tuples Fragmentation aims to improve: Vertical: partitions a relation along its attributes Mixed/hybrid: a combination of horizontal and vertical fragmentation Reliability Performance Balanced storage capacity and costs Communication costs Security The following information is used to decide fragmentation: Quantitative information: cardinality of relations, frequency of queries, (a) Horizontal Fragmentation site where query is run, selectivity of the queries, etc. Qualitative information: predicates in queries, types of access of data, read/write, etc. (b) Vertical Fragmentation (c) Mixed Fragmentation DDBS12, SL02 9/60 ¨ M. Bohlen DDBS12, SL02 10/60 ¨ M. Bohlen Fragmentation/4 Fragmentation/5 Example of database instance Example (contd.): Horizontal fragmentation of PROJ relation PROJ1: projects with budgets less than 200 000 PROJ2: projects with budgets greater than or equal to 200 000 Data DDBS12, SL02 11/60 ¨ M. Bohlen DDBS12, SL02 12/60 ¨ M. Bohlen
  • 4. Fragmentation/6 Correctness Rules of Fragmentation Example (contd.): Vertical fragmentation of PROJ relation PROJ1: information about project budgets Completeness PROJ2: information about project names and locations Decomposition of relation R into fragments R1 , R2 , . . . , Rn is complete iff each data item in R can also be found in some Ri . Reconstruction If relation R is decomposed into fragments R1 , R2 , . . . , Rn , then there should exist some relational operator that reconstructs R from its fragments, i.e., R = R1 . . . Rn Union to combine horizontal fragments Join to combine vertical fragments Disjointness If relation R is decomposed into fragments R1 , R2 , . . . , Rn and data item di appears in fragment Rj , then di should not appear in any other fragment Rk , k j (exception: primary key attribute for vertical fragmentation) For horizontal fragmentation, data item is a tuple For vertical fragmentation, data item is an attribute DDBS12, SL02 13/60 ¨ M. Bohlen DDBS12, SL02 14/60 ¨ M. Bohlen Idea of Horizontal Fragmentation Information Requirements/1 Database information: Links between relations (a link models a 1:N relationship between Intuition behind horizontal fragmentation relations that are related to each other by an equality join) Every site should hold all information that is used to query the site The information at the site should be fragmented so the queries of the site run faster Horizontal fragmentation is defined as selection operation, σp (R ) Example: σBUDGET <200K (PROJ ) σBUDGET ≥200K (PROJ ) Links (one to many) Cardinality of relations: card(R) DDBS12, SL02 15/60 ¨ M. Bohlen DDBS12, SL02 16/60 ¨ M. Bohlen
  • 5. Information Requirements/2 02.1 Information Requirements/3 02.2 Application information: Simple predicates used in queries Relation R [A1 , A2 , ..., An ] Ai θvi a simple predicate (θ ∈ {=, <, ≤, >, ≥, }, vi ∈ Di , Di is the Application information (cnt’d): domain of Ai ) Minterm selectivities For relation R we define Pr = {p1 , p2 , ..., pm } The number of tuples of the relation that would be accessed by a user Example: PNAME = ”Maintenance”, BUDGET < 200000 query, which is specified according to a given minterm predicate mi . Minterm predicates Access frequencies minterm predicates are conjunctions of simple predicates The frequency with which a user application qi accesses data. Relation R, Pr = {p1 , p2 , ..., pm } M = {m1 , m2 , ..., mr } such that M = {mi | mi = ∧pj ∈Pr pj ∗}, 1 ≤ j ≤ m, 1 ≤ i ≤ z where pj ∗ = pj or pj ∗ = ¬pj DDBS12, SL02 17/60 ¨ M. Bohlen DDBS12, SL02 18/60 ¨ M. Bohlen Horizontal Fragmentation/1 Horizontal Fragmentation/2 A horizontal fragment Ri of relation R consists of all the tuples of R that We write Pr to denote the complete and minimal set of simple satisfy a predicate Fj : predicates generated from Pr: Rj = σFj (R ), 1 ≤ j ≤ w The set of predicates is complete if and only if any two tuples in the same fragment are referenced with the same probability by any where Fj is a selection formula, which is (preferably) a minterm predicate. application The set of predicates is minimal if and only if each simple predicate Computing the horizontal fragmentation: is relevant for determining the fragmentation and for each pair of 1. Determine Pr (80/20 rule) fragments there is at least one query that accesses the fragments 2. Compute Pr’ differently 3. Determine M 4. Minimize M (eliminating impossible minterms) DDBS12, SL02 19/60 ¨ M. Bohlen DDBS12, SL02 20/60 ¨ M. Bohlen
  • 6. Horizontal Fragmentation/3 Horizontal Fragmentation/4 Example: Fragmentation of the PROJ relation Consider the following query: Find the name and budget of projects given their location. If access is only according to the location, the above set of The query is issued at all three locations predicates is complete Fragmentation based on LOC, using the set of predicates i.e., in each fragment PROJi each tuple has the same probability of {LOC = ‘Montreal ‘, LOC = ‘NewYork ‘, LOC = ‘Paris ‘} being accessed If there is a second query/application that accesses only those PROJ1 = σLOC =‘Montreal ‘ (PROJ ) PROJ2 = σLOC =‘NewYork ‘ (PROJ ) project tuples where the budget is less than $200K, the set of PNO PNAME BUDGET LOC PNO PNAME BUDGET LOC predicates is not complete. P2 DB Develop. 135000 New York P2 in PROJ2 has higher probability to be accessed P1 Instrument. 150000 Montreal P3 CAD/CAM 250000 New York PROJ3 = σLOC =‘Paris ‘ (PROJ ) PNO PNAME BUDGET LOC P4 Maintenance 310000 Paris DDBS12, SL02 21/60 ¨ M. Bohlen DDBS12, SL02 22/60 ¨ M. Bohlen Horizontal Fragmentation/5 Horizontal Fragmentation/6 02.3 Example (contd.): Example (contd.): Now, PROJi , i = 1, 2, 3 will be split in two Add BUDGET < 200K and BUDGET ≥ 200K to the set of predicates fragments to make it complete. ⇒ {LOC = ‘Montreal ‘, LOC = ‘NewYork ‘, LOC = ‘Paris ‘, PROJ1 = σLOC =‘Montr ‘∧BDGT <200K (PROJ ) PROJ2 = σLOC =‘NY ‘∧BDGT <200K (PROJ ) BUDGET ≥ 200K , BUDGET < 200K } is a complete set PNO PNAME BUDGET LOC PNO PNAME BUDGET LOC Minterms to fragment the relation are given as follows: P1 Instrumnt 150000 Montreal P2 DB Devel 135000 New York PROJ3 = σLOC =‘Paris ‘∧BDGT ≥200K (PROJ ) PROJ2 = σLOC =‘NY ‘∧BDGT ≥200K (PROJ ) (LOC = ‘Montreal ‘) ∧ (BUDGET ≤ 200K ) PNO PNAME BUDGET LOC PNO PNAME BUDGET LOC (LOC = ‘Montreal ‘) ∧ (BUDGET > 200K ) P4 Maintenance 310000 Paris P3 CAD/CAM 250000 New York (LOC = ‘NewYork ‘) ∧ (BUDGET ≤ 200K ) (LOC = ‘NewYork ‘) ∧ (BUDGET > 200K ) Note that the following fragments are empty: (LOC = ‘Paris ‘) ∧ (BUDGET ≤ 200K ) σLOC =‘Paris ‘∧BDGT <200K (PROJ ) σLOC =‘Montreal ‘∧BDGT ≥200K (PROJ ) (LOC = ‘Paris ‘) ∧ (BUDGET > 200K ) DDBS12, SL02 23/60 ¨ M. Bohlen DDBS12, SL02 24/60 ¨ M. Bohlen
  • 7. Derived Horizontal Fragmentation/1 02.5 Derived Horizontal Fragmentation/2 Defined on a member (target) relation of a link according to a selection operation specified on its owner (source). Given a link L where owner (L ) = S and member (L ) = R, the Each link is an equijoin. derived horizontal fragments of R are defined as Fragments of member relation can be defined with a semijoin on fragments of owner relation. Ri = R |>< Si , 1 ≤ i ≤ w where w is the maximum number of fragments that will be defined on R and Si = σFi (S ) where Fi is the formula according to which the primary horizontal fragment Si is defined. DDBS12, SL02 25/60 ¨ M. Bohlen DDBS12, SL02 26/60 ¨ M. Bohlen Derived Horizontal Fragmentation/3 02.7 Vertical Fragmentation/1 Objective of vertical fragmentation is to partition a relation into a set of smaller relations so that many of the applications will run on only one fragment. Given link L 1 where Vertical fragmentation of a relation R produces fragments owner (L 1) = PAY and R1 , R2 , . . . , each of which contains a subset of R’s attributes. member (L 1) = EMP Vertical fragmentation is defined using the projection operation of and the relational algebra: PAY 1 = σSAL ≤30000 (PAY ) πA1 ,A2 ,...,An (R ) PAY 2 = σSAL >30000 (PAY ) we get EMP1 = EMP |>< PAY 1 Example: EMP2 = EMP |>< PAY 2 PROJ1 = πPNO ,BUDGET (PROJ ) PROJ2 = πPNO ,PNAME ,LOC (PROJ ) Vertical fragmentation has also been studied for (centralized) DBMS Smaller relations, and hence less page accesses e.g., MONET system DDBS12, SL02 27/60 ¨ M. Bohlen DDBS12, SL02 28/60 ¨ M. Bohlen
  • 8. Vertical Fragmentation/2 Vertical Fragmentation/3 Two types of heuristics for vertical fragmentation exist: Vertical fragmentation is more complicated than horizontal Grouping: assign each attribute to one fragment, and at each step, fragmentation join some of the fragments until some criteria is satisfied. In horizontal partitioning: for n simple predicates, the number of Bottom-up approach possible minterms is 2n ; some of them can be ruled out by existing Splitting: starts with a relation and decides on beneficial partitionings implications/constraints. based on the access behaviour of applications to the attributes. In vertical partitioning: for m non-primary key attributes, the number of possible fragments is equal to B (m) (= the mth Bell number), i.e., Top-down approach the number of partitions of a set with m members. Results in non-overlapping fragments For large numbers, B (m) ≈ mm (e.g., B (15) = 109 ) Optimal solution is probably closer to the full relation than to a set of small relations with only one attribute DDBS12, SL02 29/60 ¨ M. Bohlen DDBS12, SL02 30/60 ¨ M. Bohlen Vertical Fragmentation/4 Vertical Fragmentation/5 Given are the user queries/applications Q = (q1 , . . . , qq ) that will Application information: The major information required as input run on relation R (A1 , . . . , An ) for vertical fragmentation is related to applications (queries) Attribute Usage Matrix: Denotes which query uses which attribute: Since vertical fragmentation places in one fragment those attributes usually accessed together, there is a need for some measure that 1 iff qi uses Aj would define more precisely the notion of “togetherness”, i.e., how use (qi , Aj ) = 0 otherwise closely related the attributes are. This information is obtained from queries and collected in the Attribute Usage Matrix and Attribute Affinity Matrix. The use (qi , •) vectors for each application are easy to define if the designer knows the applications that willl run on the DB (consider also the 80-20 rule) DDBS12, SL02 31/60 ¨ M. Bohlen DDBS12, SL02 32/60 ¨ M. Bohlen
  • 9. Vertical Fragmentation/6 02.8 Vertical Fragmentation/7 Example: Consider relation PROJ (PNO , PNAME , BUDGET , LOC ) and queries: q1 = SELECT BUDGET FROM PROJ WHERE PNO=Value Attribute Affinity Matrix: Denotes the frequency of two attributes Ai q2 = SELECT PNAME,BUDGET FROM PROJ and Aj with respect to a set of queries Q = (q1 , . . . , qn ): q3 = SELECT PNAME FROM PROJ WHERE LOC=Value aff (Ai , Aj ) = ( ref l (qk ) ∗ acc l (qk )) q4 = SELECT SUM(BUDGET) FROM PROJ WHERE LOC =Value use (q ,A )=1, k : use (qk ,Ai )=1 sites l k j Abbreviations: A1 = PNO , A2 = PNAME , A3 = BUDGET , A4 = LOC where Attribute Usage Matrix ref l (qk ) is the cost (= number of accesses to (Ai , Aj )) of query qK at site l acc l (qk ) is the frequency of query qk at site l DDBS12, SL02 33/60 ¨ M. Bohlen DDBS12, SL02 34/60 ¨ M. Bohlen Vertical Fragmentation/8 Vertical Fragmentation/9 Example (contd.): Let the cost of each query be ref l (qk ) = 1, and the frequency acc l (qk ) of the queries be as follows: Site1 Site2 Site3 acc1 (q1 ) = 15 acc2 (q1 ) = 20 acc3 (q1 ) = 10 Idea: Take the attribute affinity matrix (AA) and reorganize the acc1 (q2 ) = 5 acc2 (q2 ) = 0 acc3 (q2 ) = 0 attribute orders to form clusters where the attributes in each cluster acc1 (q3 ) = 25 acc2 (q3 ) = 25 acc3 (q3 ) = 25 demonstrate high affinity to one another. acc1 (q4 ) = 3 acc2 (q4 ) = 0 acc3 (q4 ) = 0 Bond energy algorithm (BEA) has been suggested for that purpose for several reasons: Attribute affinity matrix aff (Ai , Aj ) = It is designed specifically to determine groups of similar items as opposed to a linear ordering of the items. The final groupings are insensitive to the order in which items are presented. The computation time is reasonable (O (n2 ), where n is the number of attributes) e.g., aff (A1 , A3 ) = 1 =1 3=1 acc l (qk ) = k l acc 1 (q1 ) + acc 2 (q1 ) + acc 3 (q1 ) = 45 (q1 is the only query to access both A1 and A3 ) DDBS12, SL02 35/60 ¨ M. Bohlen DDBS12, SL02 36/60 ¨ M. Bohlen
  • 10. Vertical Fragmentation/10 Vertical Fragmentation/11 Example (contd.): Cluster Affinity Matrix CA after running BEA BEA: Input: AA matrix Output: Clustered AA matrix (CA) Permutation is done in such a way to maximize the following global affinity mesaure (affinity of Ai and Aj with their neighbors): n n Elements with similar values are grouped together, and two clusters can be identified AM = aff(Ai , Aj )[aff(Ai , Aj −1 ) + aff(Ai , Aj +1 ) + An additional partitioning algorithm is needed to identify the clusters i =1 j =1 in CA aff(Ai −1 , Aj ) + aff(Ai +1 , Aj )] Usually more clusters and more than one candidate partitioning, thus additional steps are needed to select the best clustering. The resulting fragmentation after partitioning (PNO is added in PROJ2 explicilty as key): PROJ1 = {PNO , BUDGET } PROJ2 = {PNO , PNAME , LOC } DDBS12, SL02 37/60 ¨ M. Bohlen DDBS12, SL02 38/60 ¨ M. Bohlen BEA Algorithm/1 BEA Algorithm/2 In order to determine the best placement we define the contribution of a placement. Contribution of a placement: Input: The AA matrix Output: Clustered affinity matrix CA which is a permutation of AA cont (Ai , Ak , Aj ) = 2 ∗ bond (Ai , Ak )+ Initialization: Place and fix one of the columns of AA in CA (we 2 ∗ bond (Ak , Aj )− choose column 1 in our example) 2 ∗ bond (Ai , Aj ) Iteration: Place the remaining n-i columns in the remaining i+1 positions in the CA matrix. For each column choose the placement that makes the largest contribution to the global affinity measure. Bond between a pair of attributes is defined as: Row order: Order the rows according to the column ordering. bond (Ax , Ay ) = n aff (Az , Ax ) ∗ aff (Az , Ay ) z =1 DDBS12, SL02 39/60 ¨ M. Bohlen DDBS12, SL02 40/60 ¨ M. Bohlen
  • 11. BEA Algorithm/3 BEA Algorithm/4 Ordering (0-3-1) : Consider the following AA matrix and the corresponding CA matrix where cont (A 0, A 3, A 1) A1 and A2 have been placed. = 2 ∗ bond (A 0, A 3) + 2 ∗ bond (A 3, A 1)–2 ∗ bond (A 0, A 1) A1 A2 A3 A4 A1 A2 = 2 ∗ 0 + 2 ∗ 4410 – 2 ∗ 0 = 8820 A1 45 0 45 0 A1 45 0 Ordering (1-3-2) : AA = A2 0 80 5 75 CA = A2 0 80 A3 45 5 53 3 A3 45 5 cont (A 1, A 3, A 2) A4 0 75 3 78 A4 0 75 = 2 ∗ bond (A 1, A 3) + 2 ∗ bond (A 3, A 2)–2 ∗ bond (A 1, A 2) = 2 ∗ 4410 + 2 ∗ 890 – 2 ∗ 225 = 10150 bond (A3 , A1 ) = 45 ∗ 45 + 5 ∗ 0 + 53 ∗ 45 + 3 ∗ 0 = 4410 Ordering (2-3-4) : bond (A3 , A2 ) = 45 ∗ 0 + 5 ∗ 80 + 53 ∗ 5 + 3 ∗ 75 = 890 cont (A 2, A 3, A 4) = 2 ∗ bond (A 2, A 3) + 2 ∗ bond (A 3, A 4)–2 ∗ bond (A 2, A 4) The next step is to place A3 . = 1780 DDBS12, SL02 41/60 ¨ M. Bohlen DDBS12, SL02 42/60 ¨ M. Bohlen BEA Algorithm/5 02.9 BEA Algorithm/6 After adding A3 the CA matrix has the form The final step is to divide a set of clustered attributes {A1 , ..., An } into A1 A3 A2 two sets {A1 , ..., Ai } and {Ai +1 , ..., An } such that the costs of queries 45 45 0 that access both sets are minimized. CA = 0 5 80 The appropriate split point on the diagonal must be determined: 45 53 5 0 3 75 A1 A3 A2 A4 A1 45 45 0 0 After adding A4 and ordering rows the CA matrix has the form A3 45 53 5 3 A2 0 5 80 75 A1 A3 A2 A4 A4 0 3 75 78 A1 45 45 0 0 This is an optimization problem: the optimal point on the diagonal CA = A3 45 53 5 3 must be determined. A2 0 5 80 75 Note: Generalizations are needed to deal with A4 0 3 75 78 partitions located in the middle of the matrix multiple split points DDBS12, SL02 43/60 ¨ M. Bohlen DDBS12, SL02 44/60 ¨ M. Bohlen
  • 12. BEA Algorithm/7 Correctness of Vertical Fragmentation Relation R is decomposed into fragments R1 , R2 , . . . , Rn AQ (qi ) = {Aj | use (qi , Aj ) = 1} attributes accessed by qi e.g., PROJ = {PNO , BUDGET , PNAME , LOC } into PROJ1 = {PNO , BUDGET } and PROJ2 = {PNO , PNAME , LOC } TQ = {qi | AQ (qi ) ⊆ TA } queries that access TA only BQ = {qi | AQ (qi ) ⊆ BA } queries that access BA only Completeness OQ = Q − {TQ ∪ BQ } queries that access TA and BA Guaranteed by the partitioning algortihm, which assigns each attribute in A to one partition CQ = q i ∈Q ∀Sj refj (qi ) ∗ accj (qi ) cost of all queries Reconstruction CTQ = refj (qi ) ∗ accj (qi ) cost of TQ queries Join to reconstruct vertical fragments qi ∈TQ ∀Sj R = R1 · · · Rn = PROJ1 PROJ2 CBQ = qi ∈BQ ∀Sj refj (qi ) ∗ accj (qi ) cost of BQ queries Disjointness COQ = qi ∈OQ ∀Sj refj (qi ) ∗ accj (qi ) cost of other queries Attributes have to be disjoint in VF. Two cases are distinguished: If tuple IDs are used, the fragments are really disjoint CTQ ∗ CBQ − COQ 2 maximize Otherwise, key attributes are replicated automatically by the system e.g., PNO in the above example DDBS12, SL02 45/60 ¨ M. Bohlen DDBS12, SL02 46/60 ¨ M. Bohlen Mixed Fragmentation Replication and Allocation In most cases simple horizontal or vertical fragmentation of a DB Replication: Which fragements shall be stored as multiple copies? schema will not be sufficient to satisfy the requirements of the Complete Replication applications. Complete copy of the database is maintained in each site Mixed fragmentation (hybrid fragmentation): Consists of a Selective Replication horizontal fragment followed by a vertical fragmentation, or a vertical Selected fragments are replicated in some sites fragmentation followed by a horizontal fragmentation Allocation: On which sites to store the various fragments? Fragmentation is defined using the selection and projection Centralized Consists of a single DB and DBMS stored at one site with users operations of relational algebra: distributed across the network Partitioned σp (πA1 ,...,An (R )) Database is partitioned into disjoint fragments, each fragment assigned πA1 ,...,An (σp (R )) to one site DDBS12, SL02 47/60 ¨ M. Bohlen DDBS12, SL02 48/60 ¨ M. Bohlen
  • 13. Replication/1 Replication/2 Comparison of replication alternatives Replicated DB fully replicated: each fragment at each site partially replicated: each fragment at some of the sites Non-replicated DB (= partitioned DB) partitioned: each fragment resides at only one site Rule of thumb: read only queries If ≥ 1, then replication is advantageous, otherwise update queries replication may cause problems DDBS12, SL02 49/60 ¨ M. Bohlen DDBS12, SL02 50/60 ¨ M. Bohlen Fragment Allocation/1 Fragment Allocation/2 Fragment allocation problem Required information Given are: Database Information – fragments F = {F1 , F2 , ..., Fn } selectivity of fragments – network sites S = {S1 , S2 , ..., Sm } size of a fragment – and applications Q = {q1 , q2 , ..., ql } Application Information Find: the ”optimal” distribution of F to S RRij : number of read accesses of a query qi to a fragment Fj URij : number of update accesses of query qi to a fragment Fj Optimality uij : a matrix indicating which queries updates which fragments, Minimal cost rij : a similar matrix for retrievals originating site of each query Communication + storage + processing (read and update) Cost in terms of time (usually) Site Information Performance USCk : unit cost of storing data at a site Sk LPCk : cost of processing one unit of data at a site Sk Response time and/or throughput Network Information Constraints communication cost/frame between two sites Per site constraints (storage and processing) frame size DDBS12, SL02 51/60 ¨ M. Bohlen DDBS12, SL02 52/60 ¨ M. Bohlen
  • 14. Fragment Allocation/3 Fragment Allocation/4 The total cost function has two components: storage and query We discuss an allocation model which attempts to processing. minimize the total cost of processing and storage meet certain response time restrictions TOC = STCjk + QPCi General Form: Sk ∈S Fj ∈F q i ∈Q min(Total Cost) Storage cost of fragment Fj at site Sk : subject to response time constraint STCjk = USCk ∗ size (Fj ) ∗ xij storage constraint processing constraint where USCk is the unit storage cost at site k Decision variable xij Query processing cost for a query qi is composed of two components: 1 if fragment Fi is stored at site Sj xij = composed of processing cost (PC) and transmission cost (TC) 0 otherwise QPCi = PCi + TCi DDBS12, SL02 53/60 ¨ M. Bohlen DDBS12, SL02 54/60 ¨ M. Bohlen Fragment Allocation/5 Fragment Allocation/6 Processing cost is a sum of three components: The transmission cost is composed of two components: access cost (AC), integrity contraint cost (IE), concurency control cost Cost of processing updates (TCU) and cost of processing retrievals (CC) (TCR) PCi = ACi + IEi + CCi TCi = TCUi + TCRi Access cost: Cost of updates: Inform all the sites that have replicas + a short confirmation message ACi = (URij + RRij ) ∗ xij ∗ LPCk back sk ∈S Fj ∈F TCUi = uij ∗ (update message cost + acknowledgment cost) where LPCk is the unit process cost at site k Sk ∈S Fj ∈F Integrity and concurrency costs: Can be similarly computed, though depends on the specific constraints Retrieval cost: Send retrieval request to all sites that have a copy of fragments that are Note: ACi assumes that processing a query involves decomposing needed + sending back the results from these sites to the originating it into a set of subqueries, each of which works on a fragment, ..., site. This is a very simplistic model Does not take into consideration different query costs depending on TCRi = min (xjk ∗ (cost retrieval request + cost sending back result)) S k ∈S F j ∈F the operator or different algorithms that are applied DDBS12, SL02 55/60 ¨ M. Bohlen DDBS12, SL02 56/60 ¨ M. Bohlen
  • 15. Fragment Allocation/7 Fragment Allocation/8 Solution Methods Modeling the constraints The complexity of this allocation model/problem is NP-complete Correspondence between the allocation problem and similar Response time constraint for a query qi problems in other areas execution time of qi ≤ max. allowable response time for qi Plant location problem in operations research Knapsack problem Storage constraints for a site Sk Network flow problem Hence, solutions from these areas can be re-used storage requirement of Fj at Sk ≤ storage capacity of Sk Use different heuristics to reduce the search space F j ∈F Assume that all candidate partitionings have been determined together with their associated costs and benefits in terms of query processing. Processing constraints for a site Sk The problem is then reduced to find the optimal partitioning and placement for each relation processing load of qi at site Sk ≤ processing capacity ofSk Ignore replication at the first step and find an optimal non-replicated qi ∈Q solution Replication is then handeled in a second step on top of the previous non-replicated solution. DDBS12, SL02 57/60 ¨ M. Bohlen DDBS12, SL02 58/60 ¨ M. Bohlen Conclusion/1 Conclusion/2 Distributed design decides on the placement of (parts of the) data and programs across the sites of a computer network There are two key technical questions: fragmentation and allocation/replication of data Allocation/Replication of data Horizontal fragmentation is defined via the selection operation Type of replication: no replication, partial replication, full replication σp (R ) Optimal allocation/replication modelled as a cost function under a set of constraints Rewrites the queries of each site in the conjunctive normal form and The complexity of the problem is NP-complete finds a minimal and complete set of conjunctions to determine Use of different heuristics to reduce the complexity fragmentation Vertical fragmentation via the projection operation πA (R ) Computes the attribute affinity matrix and groups “similar” attributes together Mixed fragmentation is a combination of both approaches DDBS12, SL02 59/60 ¨ M. Bohlen DDBS12, SL02 60/60 ¨ M. Bohlen