        ON-LINE ANALYTICAL PROCESSING IN DISTRIBUTED DATA WAREHOUSES

                                          Jens Albrecht, Wolfgang Lehner
                                   University of Erlangen-Nuremberg, Germany
                                  {jalbrecht, lehner}@informatik.uni-erlangen.de


Abstract

The concepts of ‘Data Warehousing’ and ‘On-line Analytical Processing’ have seen a growing interest in the research and commercial product community. Today, the trend moves away from complex centralized data warehouses to distributed data marts integrated in a common conceptual schema. However, as the first part of this paper demonstrates, there are many problems and few solutions for large distributed decision support systems in worldwide operating corporations. After showing the benefits and problems of the distributed approach, this paper outlines possibilities for achieving performance in distributed on-line analytical processing. Finally, the architectural framework of the prototypical distributed OLAP system CUBESTAR is outlined.

1 Introduction

Today’s global economy has placed a premium on information, because in a dynamic market environment with many competitors it is crucial for an enterprise to have on-line information about its general business figures as well as detailed information on specific topics to be able to make the right decisions at the right time. That’s why today almost all big companies are trying to build data warehouses. In contrast to former management information systems based mostly on operational data, data warehouses contain integrated and non-volatile data and therefore provide a consistent basis for organizational decision making. In addition to classical reporting, the main application of data warehouses is On-line Analytical Processing (OLAP, [CoCS93]), i.e. the interactive exploration of the data. The data warehouse promise is getting accurate business information fast.

But the ultimate goal of data warehousing and OLAP goes even further: the vision is to “put a crystal ball on every desktop”. Today that ideal is far from being reality. First, building a single centralized data warehouse serving many different user groups takes a long time. Setup and maintenance are very expensive. All this contributes to a relatively inflexible architecture. Second, local access behavior is considered in the data warehouse design neither on the conceptual nor on the internal but only on the external layer.

Therefore, many companies today decide to start with smaller, flexible data marts dedicated to specific business areas. To enable cross-functional analysis, there are two possibilities. The first one is to create again a centralized data warehouse for only cross-functional summary data. The other one is to integrate the data marts into a common conceptual schema and thereby create a distributed data warehouse.

We will concentrate on the second alternative, which allows more flexible querying. To the user a distributed data warehouse should behave exactly like a centralized data warehouse. For transparent and efficient OLAP on a distributed data warehouse several problems need to be solved. The intention of this article is not to present final solutions but to show the potential of the distributed approach and the challenges for researchers and software vendors as well.

In the following section we will characterize different data warehouse architectures and their inherent drawbacks. After showing the benefits of a distributed approach to data warehousing we show the problems resulting for distributed OLAP systems. Performance as the main problem is the issue of section 3. Finally, in section 4 we give an overview of the prototypical distributed OLAP system CUBESTAR.

2 Data Warehouse Architectures

According to [Inmo92], a data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection of corporate data”. Since that definition is very generic, we will now give an overview of three types of data warehouse architectures, partly based on [WeCa96].

2.1 Enterprise Data Warehouses

What most people think of when they use the term data warehouse is the enterprise or corporate data warehouse, where all business data is stored in a single database with a single corporate data model (figure 1). Enterprise data warehouses are created to provide knowledge workers with consistent and integrated data from all major business areas. They offer the possibility to correlate information across the enterprise.
But, because of the huge subject area (the corporation) and the vast amount of different operational sources that need to be integrated, enterprise data warehouses are very difficult and time-consuming to implement. Too many different data sources and too many different user requirements are likely reasons for the failure of the whole data warehouse project.

Fig. 1: Enterprise Data Warehouse (OLAP server on a centralized DW)

Since the data warehousing setup strategy from the analyst’s point of view is, as Inmon puts it, “give me what I want and I will tell you what I really want” [Inmo92], an incremental approach has proven to be the right choice.

2.2 Data Marts

The term data mart stands for the idea of a highly-focused version of a data warehouse (figure 2). There is a need to distinguish data marts which are local extracts from an enterprise data warehouse and data marts which are serving as the data warehouse for a small business group with well defined needs. In the latter case, which is the definition we will use in this paper, the limitation in size and scope of the data warehouse, e.g. only sales or marketing, dramatically reduces setup costs. Data marts can be deployed in weeks or months. That is why most of the current data warehousing projects include data marts.

Fig. 2: Data Marts (OLAP servers on the Sales, Service and Logistics data marts; Asia)

2.3 Distributed Enterprise Data Warehouses

Although the data mart idea is very attractive and potentially offers the opportunity to build a corporate wide data warehouse from the bottom up, the benefits of data marts can easily be outweighed if there is no corporate-wide data warehouse strategy. If data mart design is not done carefully, independent islands of information are created and cross-functional analysis, as in the enterprise data warehouse, is not possible.

Therefore, the pragmatic approach to combine the virtues of the former is to integrate the departmental data marts into a common schema. The result is a distributed enterprise data warehouse (figure 3), also called federated data warehouse [WeCa96] or enterprise data mart [Info97].

Fig. 3: Distributed Enterprise Data Warehouse (distributed OLAP middleware with a global schema on top of data marts in Europe, America and Asia)

The benefits of this divide-and-conquer approach are clear. In the distributed enterprise data warehouse the individual data marts can be created and managed flexibly.

The flexibility for the individual data marts is traded for additional complexity in the management of the global schema and query processing across several data marts. Table 2.1 compares some characteristics to the centralized data warehouse.

Characteristics          Centralized DW   Distributed DW   Distributed DW
                                          (Local Area)     (Wide Area)
Conceptual Flexibility   low              low              moderate
Data Distribution        none             moderate         high
Query Processing         easy             moderate         complex
Communication Costs      low              moderate         high
Security Requirements    none             low              low
Local Maintenance        hard             easy             easy
Global Maintenance       n/a              easy             moderate

Tab. 2.1: Characteristics of Centralized and Distributed Data Warehouses

There are basically two scenarios for a distributed enterprise data warehouse. The local area scenario (e.g. the data marts of only the Asian division) characterizes what first comes to mind. The extension of this approach to a wide area scenario (the global setting depicted in figure 3) is possible, but the high degree of data distribution results in much more complexity for query processing since communication costs become a major factor. The more global the distributed data warehouse is, the more important is increasing local access by replication. Conversely, the more local the scenario is, the more likely techniques for load balancing in a shared-nothing server cluster can be applied.
2.4 Discussion

The current trend in the creation of data warehouses follows the bottom up approach where many data marts are created with a later integration into a distributed enterprise data warehouse in mind. The Ovum report [WeCa96] predicts that centralized data warehouses will become increasingly unpopular due to their inherent drawbacks and thus be replaced by a set of interoperating data marts.

The commercial product community seems to confirm the prediction. A first step towards easy integration was recently done by the Informatica Corp. with the release of its tools for enterprise data mart support. These tools are proposed to easily allow the creation and maintenance of distributed data marts and a seamless integration into a common schema. Moreover, they provide a metadata exchange API through which query tools can directly access the metadata. OLAP tool vendors like Microstrategy Inc. have already announced support for this feature. For querying, “it means that all data marts are working off the same sets of data definitions and business rules allowing users to query aggregate data that may be distributed across multiple disparate data marts” [Ecke97]. Companies like Cognos [Cogn97] and Hewlett Packard [HP97] also already offer rudimentary support for distributed data warehouses.

But to our knowledge, no query tool today is able to transparently and efficiently generate and execute queries in a distributed setting. An intelligent middleware layer as the basis for traffic routing, aggregate caching and load balancing is needed. This is the problem we will focus on in the remaining sections.

3 Achieving Performance in Distributed OLAP Systems

In order to deserve the attribute on-line, queries should never run much longer than a minute. In this section we will outline some specific methods and necessary prerequisites to achieve performance in large distributed OLAP environments. Note that some of these also apply in the non-distributed case but become essential in the distributed case.

3.1 The Selection and Use of Aggregates

Researchers and vendors agree that the main approach to speed up OLAP queries is the use of aggregates or materialized views [LeRT96]. In the last two years many articles on the selection of views to materialize in a data warehouse were published. A good overview of techniques for the selection of aggregates can be found in [ThSe97].

Dynamic Management of Aggregates. None of the recent articles considers the dynamics of an on-line environment with dozens of completely different queries to execute every minute. If user behavior is considered at all, only a fixed number of queries is taken as the basis for the aggregation algorithms ([GHRU97], [Gupt97], [BaPT97]). On the commercial product side there are OLAP tools like Microstrategy’s DSS Tools or Informix’ Metacube which are able to give the DBA hints for the creation of aggregates based on monitored user behavior. However, if the DBA decides to create aggregates, these are static and a change in the user behavior is not reflected.

Today, aggregates are more or less seen as another multidimensional access method, similar to an index, that must be managed by the DBA. But only a self-maintaining, dynamic aggregate management working like a (distributed) system buffer or cache for aggregates offers the needed flexibility and the potential for short query response times and low human maintenance efforts. The selection and the usage of aggregates should be completely transparent to the user. The DBA should only be able to set up parameters for the creation of aggregates, but not their content.

Partial Aggregates. The algorithms for the selection of aggregates explicitly supporting the multidimensional model compute preaggregates that are based on the full scope of the data cube. The query result in figure 4, if materialized, would be such a complete aggregate. In relational terms, only different attribute combinations in the group-by clause are considered. The where-clause in the materialized view definition is always empty. The benefit of this approach is that query processing is relatively simple. Therefore, this is the only kind of aggregate commercial products support today.

However, materialized views created in that manner may contain much data that is never needed and therefore storage space is wasted. Even more important, in a distributed setting with inherent data fragmentation the handling of partial aggregates (the candidates in figure 4) becomes a necessary prerequisite. A data mart server has only parts of the whole data cube stored locally. In fact, that is the fundamental idea of having data marts in a distributed enterprise data warehouse. Thus, at least at the lower aggregation levels it does not even make sense to create aggregates with full, enterprise-wide scope.

The big drawback is clearly that query optimization in the presence of partial aggregates becomes more complex. A patchworking algorithm finding the least expensive aggregate partitions must be invoked (figure 4).

Fig. 4: Patchworking algorithm (sites A, B and C; legend: query result, complete data cube, selected candidate aggregates, other possible candidates)

Data Allocation and Load Balancing. Directly related to the question of selecting aggregates in a distributed environment is the allocation problem. The system should not only manage the dynamic creation of partial aggregates, but also decide where to place them on a tactical basis. Moreover, the traditional method to increase local access performance in distributed environments by the use of replicas can and should also be applied in distributed OLAP.

Hence, redundancy in the distributed data warehouse can be distinguished into vertical redundancy involving the creation of summary data with a coarser granularity than the detailed raw data, and horizontal redundancy being introduced by the replication of aggregates (figure 5).

Fig. 5: Horizontal and Vertical Redundancy (vertical redundancy: summary data derived from raw data by aggregation operations; horizontal redundancy: summary data replicated)

In order to allow the system to adapt to changing user requirements, there should be no fixed locations for any kind of redundant data fragments. Aggregate data should be close to sites where they are needed in order to minimize the communication costs. In order to increase parallel query execution, data fragments should be distributed on several data mart servers, especially if communication cost is low. Redundant fragments must dynamically be created and dropped according to the changing user behavior.

Thus, a flexible, adaptive migration and replication policy is essential for data-based load balancing at the data mart servers. Involved in the process of choosing and allocating aggregates are not only access characteristics but also communication cost, maintenance cost for redundant data and additional storage cost.

Differences to traditional distributed DBMS. Although the fragmentation and allocation problem is well known from traditional distributed database systems, there are several specifics for distributed data warehouses.

The fragmentation of the raw data is already fixed due to the definition of the data mart. Usually, there is not much of a choice here. The problems as well as the performance potential of distributed data warehouses arise from the mentioned techniques of dynamic aggregate management and dynamic data allocation. These techniques were not previously applicable because redundancy in transaction-oriented systems always requires expensive synchronization mechanisms.

The query optimizer of a distributed OLAP system first needs to be aware of the aggregations which would be helpful in order to speed up query processing. If this information is known it must invoke a patchworking algorithm finding the least expensive aggregate partitions (figure 4). The choice how to assemble the query result may vary from query to query because the underlying redundant fragments may be dropped for the selection of new ones by the resource management and the system and network load may change.

An important point is that the determination of candidates as well as the patchworking itself, which includes the comparison of selection predicates, is only feasible in the context of a multidimensional data model. In contrast to the simple relational data model, selections (slices) can be defined by classification nodes (see section 4.2). Predicates defined in that manner contain much more semantics than a simple relational selection predicate. If one would allow arbitrary predicates for the definition of fragments, patchworking would not be possible.

3.2 Quality of Service Specification

Another point greatly distinguishes the distributed OLAP scenario from traditional distributed databases. In the analytical context general figures are of interest and not individual entities. It is often the case that approximate numbers are sufficient if they can be delivered faster. Thus, it is feasible to trade correctness for performance. Some factors characterizing correctness are accuracy, actuality and consistency.

Accuracy. One approach to relaxing the accuracy of the reports is offering summary data that is either aggregated at a higher level than requested or offering only parts of the requested information. For example, if a query requesting sales figures for each city in Germany would take hours, an answer containing sales figures for each county or even only for South Germany might be enough for the moment. An algorithm dealing with these issues is given in [Dyre96].

Actuality. Absolute actuality in a worldwide distributed enterprise data warehouse is an expensive requirement. If each data mart is updated once a day for example, it means for the distributed data warehouse that it is updated every
few hours. Therefore, the maintenance cost for summary data can increase dramatically. But getting the most current numbers may not be necessary for the analyst. Hence, a certain degree of staleness can be acceptable, allowing redundant data to be updated or dropped gradually [Lenz96].

Consistency. The consistency of the reports generated in a session should be another adjustable factor. In general, it is desirable that a sequence of drill-down and roll-up operations relies on the same base figures. However, this requirement might not be crucial if the analyst just wants to get an overview of some figures. Hence, this condition as well might be relaxed according to the analyst’s needs. Note that generating consistent reports can be very expensive, because all reports in a session must have the same actuality.

Especially in a large distributed environment where communication costs cannot be neglected, these factors are critical for reasonable overall performance. Slightly inaccurate, slightly stale or slightly inconsistent data may be sufficient for the analyst if the figures arrive in a few seconds rather than minutes.

4 The Distributed OLAP system CUBESTAR

This section focuses on the architecture and the principles for query processing and resource management of the prototypical distributed OLAP system CUBESTAR. In order to deal with the complexity of the issues we borrowed many ideas from the Mariposa system ([SDK+94]) and applied them to our special application domain.

4.1 The Architecture

The general architecture of CUBESTAR is a three tier architecture as shown in figure 6. At the client layer are the end user tools with reporting facilities. Queries are issued in the Cube-Query-Language (CQL, [BaLe97], figure 7), specifically supporting the multidimensional data model.

The heart of the architecture is the middleware layer hiding the details of data distribution. Its main task is to translate a user query and generate an optimized distributed query execution plan. For distributed query processing and aggregate management we apply a microeconomic paradigm. A site tries to maximize revenues earned for taking part in query processing. In order to compete, the site has to maintain a set of redundant data according to the current market needs. The middleware layer also implements a global classification information service providing a globally consistent view of the multidimensional data model.

The third layer (server layer) consists of a federation of data mart servers. Each data mart server controls the locally available set of data partitions and the parallel query execution at its machine.

Fig. 6: Architecture of CUBESTAR (clients, middleware, and a federation of data mart servers, each with bidder, resource manager and aggregation engines)

4.2 The Data Model

CUBESTAR consistently applies the multidimensional data model at all levels of query processing and aggregate management. Only issues of physical data access are managed by the underlying relational database engines.

Since the main operation in OLAP is the application of aggregation operations, classification hierarchies can be used to define groups for the application of aggregation operations (figure 7). For example, the typical OLAP query of figure 7 specified in CQL asks for the sales figures at the granularity of product families subsumed by video equipment (= camcorders and recorders).

Multidimensional Objects. In analogy to the ‘cubette’ approach of [Dyre96], a query may be seen as a multidimensional subcube, called Multidimensional Object (MO), basically described by three components [Lehn98]:
  • A specification of the original cell and the applied aggregation operator (SUM(Sales))
  • A multidimensional scope specification consisting of a tuple of classification nodes (<P.Group = ‘Video’, G.Country = ‘Germany’, T.Year = ‘1997’>)
  • A data granularity specification consisting of a tuple of categories (<P.Family, G.Top, T.Top>)
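The three components of an MO can be written down as a small data structure. The following Python sketch is an illustration only and not part of CUBESTAR: the names MO, LEVELS, PARENT, NODE_LEVEL, covers, rank and can_answer are hypothetical, the classification data loosely follows the product hierarchy of figure 7, and a distributive operator such as SUM is assumed. It shows why scopes defined by classification nodes make it possible to decide whether a stored aggregate can answer a query, the kind of predicate comparison that patchworking (section 3.1) depends on.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical classification data. Per dimension: category levels from fine
# to coarse, the parent of each classification node, and each node's level.
LEVELS = {
    "P": ["Family", "Group", "Area", "Top"],
    "G": ["Country", "Top"],
    "T": ["Year", "Top"],
}
PARENT = {
    "P": {"Camcorder": "Video", "Recorder": "Video",
          "Video": "Consumer Electronics", "Audio": "Consumer Electronics",
          "Consumer Electronics": "ALL"},
    "G": {"Germany": "ALL"},
    "T": {"1997": "ALL"},
}
NODE_LEVEL = {
    "P": {"Camcorder": "Family", "Recorder": "Family", "Video": "Group",
          "Audio": "Group", "Consumer Electronics": "Area", "ALL": "Top"},
    "G": {"Germany": "Country", "ALL": "Top"},
    "T": {"1997": "Year", "ALL": "Top"},
}


@dataclass
class MO:
    """Multidimensional Object: cell and operator, scope, granularity."""
    measure: str                 # original cell, e.g. "Sales"
    operator: str                # aggregation operator, e.g. "SUM"
    scope: Dict[str, str]        # dimension -> classification node
    granularity: Dict[str, str]  # dimension -> category


def covers(dim: str, ancestor: str, node: str) -> bool:
    """True if `node` equals `ancestor` or lies below it in the hierarchy."""
    while node != ancestor:
        if node not in PARENT[dim]:
            return False
        node = PARENT[dim][node]
    return True


def rank(dim: str, category: str) -> int:
    """Position of a category within its dimension (0 = finest)."""
    return LEVELS[dim].index(category)


def can_answer(stored: MO, query: MO) -> bool:
    """Can `query` be derived from the materialized aggregate `stored`?

    Assumes a distributive operator such as SUM, so that coarser results
    can be computed by aggregating finer ones again.
    """
    if (stored.measure, stored.operator) != (query.measure, query.operator):
        return False
    for dim in LEVELS:
        s_node, q_node = stored.scope[dim], query.scope[dim]
        # 1. The stored scope must contain the requested scope.
        if not covers(dim, s_node, q_node):
            return False
        # 2. The stored data must be at least as fine as the requested result.
        s_rank = rank(dim, stored.granularity[dim])
        if s_rank > rank(dim, query.granularity[dim]):
            return False
        # 3. A narrower scope can only be cut out if its node is still
        #    distinguishable at the stored granularity.
        if q_node != s_node and rank(dim, NODE_LEVEL[dim][q_node]) < s_rank:
            return False
    return True


# The example MO from the text and two hypothetical stored aggregates.
query = MO("Sales", "SUM",
           {"P": "Video", "G": "Germany", "T": "1997"},
           {"P": "Family", "G": "Top", "T": "Top"})
wider = MO("Sales", "SUM",
           {"P": "Consumer Electronics", "G": "Germany", "T": "1997"},
           {"P": "Family", "G": "Top", "T": "Top"})
no_geo = MO("Sales", "SUM",
            {"P": "Video", "G": "ALL", "T": "1997"},
            {"P": "Family", "G": "Top", "T": "Top"})
```

With these definitions, can_answer(wider, query) holds, because the families below ‘Video’ can be selected from the wider aggregate. can_answer(no_geo, query) does not hold: the stored aggregate has already summed over all countries, so the share of ‘Germany’ can no longer be separated. A full patchworking step would additionally have to combine several such partial candidates and weigh their costs, which this sketch does not attempt.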
     TOP         ALL
     Area        Consumer Electronics
     Group       Audio, Video
     Family      Camcorder, Recorder
     article#    TR-75, TS-78, A200, V-201, ClassicI

     SELECT SUM(SALES)
     FROM Product P, Geography G, Time T
     WHERE P.Group = 'Video',
           G.Country = 'Germany',
           T.Year = '1997'
     UPTO P.Family

     Fig. 7: A Sample Classification Hierarchy and a Sample Query in CQL

Within the distributed multidimensional OLAP environment, multidimensional objects are the basic units for distributed storage management and query processing. Therefore, different types of MOs are distinguished with regard to their generation in the distributed environment.

The raw or base data partitions at each data mart are called Base Multidimensional Objects (BMO). Thus, BMOs represent data of the finest granularity. BMOs are not allowed to migrate but are assigned a unique home site, where they are uploaded by the data mart loading tools. Consequently, they are by definition up-to-date.

From the BMOs any number of MOs can be derived. In a relational sense these derived MOs are materialized views. However, because of the definition of MOs, the questions whether one MO is contained in, intersected by or can be derived from another MO are easily decidable. According to the requirements of the system, MOs are allowed to migrate or be replicated freely.

4.3 Microeconomics

An elegant solution to cope with the complexity of dynamic aggregate management and load balancing is to put all issues related to shared resources into a microeconomic framework ([SDK+94]). In an economy where every site selfishly tries to maximize its utility there is no need for a central coordinator; the decision process is inherently decentralized. Moreover, the competition for orders and profits allows the system to adapt dynamically to market demands, i.e. to query behavior as well as system load.

In contrast to Mariposa, where demand and supply regulate the prices, CUBESTAR for the sake of simplicity applies a uniform billing scheme with fixed rates for the aggregation efforts to each executed query. The value of a query result is computed by a weighted combination of the following factors:

  • Actual size of the raw data MOs required for the computation (number of tuples).
    Taking the size of the raw data is reasonable, since requests for aggregates of huge fact tables should be more expensive than those of smaller ones. Moreover, this decision favors the storage of higher aggregates, since they need little space and yield high profits.
  • Actual size of the query result (number of tuples).
    This term simply values the additional overhead for storing large aggregates.
  • Aggregation time.
    This is the time spent from issuing the query to delivering the result, i.e. the local query processing time. Higher aggregation time results in a lower price, thus favoring fast data mart servers.
  • Transfer time.
    The computation may require the shipping of intermediate MOs from one data mart server to another.

4.4 Query Processing

A query in CUBESTAR is basically a multidimensional object specified by a CQL query like the one in figure 7. Associated with each query is a user-defined quality of service specification consisting of a limit on the query processing time and an indicator for the requested accuracy of the data. Thus, the user has the choice to trade speed against the currency of the data.

The query is issued to a broker which tries to perform the query in the best way possible on behalf of the user. Since all users are treated equally and prices are regulated, a query is not assigned a budget. The broker's objective is simply to get the best service available for the user's request, i.e. to minimize query execution time under the quality of service constraints. After query processing, the broker pays the price according to the billing scheme to the participating data mart servers. Note that in order to support user priorities it would make sense to limit the user's budget, which would add price as a second objective to query optimization.

Query processing basically proceeds in three steps: query preparation, query optimization, and query execution.

In the query preparation phase, the query string is handed over to the parser, where it is checked for syntactic and semantic correctness. The result of this step is a tree-like representation of the query with the queried MO at the top, intermediate results at the inner nodes, and raw data at the leaves.
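As an illustration of the billing scheme of section 4.3, the weighted combination of the four factors might be sketched as follows. The linear form, the weight values, the negative sign given to transfer time, the floor at zero and all names are assumptions made for this sketch; the description above only fixes that larger raw data raises the price and that higher aggregation time lowers it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillingWeights:
    """Hypothetical fixed rates; no concrete values are given in the text."""
    raw_size: float = 1.0          # per tuple of raw data MOs used
    result_size: float = 0.5       # per tuple of the query result
    aggregation_time: float = 2.0  # per second of local processing
    transfer_time: float = 1.0     # per second of shipping MOs (assumed penalty)


def query_price(raw_tuples, result_tuples, aggregation_seconds,
                transfer_seconds, weights=BillingWeights()):
    """Price paid by the broker for one query under the assumed linear scheme."""
    price = (weights.raw_size * raw_tuples
             + weights.result_size * result_tuples
             - weights.aggregation_time * aggregation_seconds
             - weights.transfer_time * transfer_seconds)
    # Assumption: a site is never charged for answering a query.
    return max(price, 0.0)


# Two servers answering the same request; the faster one earns more.
fast = query_price(raw_tuples=1000, result_tuples=10,
                   aggregation_seconds=5, transfer_seconds=0)
slow = query_price(raw_tuples=1000, result_tuples=10,
                   aggregation_seconds=50, transfer_seconds=0)
```

Under these assumed weights the fast server earns more than the slow one for the same request, which reflects the stated intent of favoring fast data mart servers.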
For the generation of a distributed query execution plan the current distribution of the aggregates must be taken into account. Note that due to the dynamic aggregation management the existence and location of aggregates and replicas change frequently. Therefore, the broker may issue a broadcast request to the data servers in order to obtain the necessary information. This kind of meta data is cached in the partition directory cache in order to avoid too many broadcasts. With this method it is possible that invalid meta data are used. In this case, there are two possibilities: either force a data mart server to recompute the dropped aggregates or submit a new broadcast.

At the data mart server side, the bidder component receiving the request checks for those locally available MOs which can contribute to the query. Each bid from a data mart server includes the following data:

  • the identifier of the data mart server,
  • a set of tuples { (Mi, Ti) }, each of which contains the description of an offered MO in the repository and the time the site expects to need to aggregate that MO to the granularity of the query M,
  • an indicator for the available processing power the site is willing to reserve for the query depending on the system load,
  • the free space the site is willing to reserve for intermediate results, possibly from other data mart servers.

The query optimizer's task is now to select, using a patchworking algorithm, those multidimensional objects offered by the bidding sites which require the least aggregation and transportation effort. To estimate communication costs, average network throughputs between the sites are used.

The result of this step is a distributed query execution plan, which is basically a tree with physically existing MOs at the leaves, the target MO at the top, and intermediate results at the inner nodes. Attached to each node are an operation, an estimated processing time for the operation, and the location for the execution of the operation, denoting the bidders that won the competition.

In the query execution step, the bidders are informed whether their offer was accepted or not. The notification includes the complete query execution plan, allowing the sites to understand the decision of the broker and to draw conclusions for their future resource management.

Once all MOs necessary for processing have arrived at the site, the local query execution is initiated by an aggregation engine. If subqueries can be executed locally in parallel, the aggregation engine may spawn new aggregation engines running in parallel threads. After the aggregation engine has completed its task, in the last step the resulting MO is sent to the target location.

4.5 Storage Management

So far, the existence of many redundant multidimensional objects was assumed. This subsection focuses on when, how, and where these MOs are generated. In order to deal with the complexity of the issues of section 3.1, this problem is handled by the application of microeconomics.

The resource manager at a data server aims at maximal usage of its local resources in terms of revenues. Therefore, a site always tries to expand its market share by cleverly creating and dropping aggregates and replicas. In order to estimate current demands the resource manager maintains a query history list containing those requests the site was not selected to answer, either because it did not have any suitable MOs or because other bidders were preferred. The list entries include for each query the actually selected MOs with their locations and revenues. Based on this information the resource manager can decide to obtain new MOs in order to underbid its competitors in the future. If the new MO is to be acquired from a remote location, the resource manager acts as a client itself, and therefore also has to pay for the MO from its own account. In this case, it is making an investment.

The other way to use the storage space is to offer it for query processing. If MOs from different locations must be coalesced, one site will receive the other parts. Note that it is very lucrative to receive other pieces of MOs fitting to locally stored ones, because next time a larger MO can be offered.

In fact, using the intermediate results of query processing is the most convenient way to get new MOs, also in the local case. Each of these MOs is checked by the resource manager for usefulness. Basically, storing MOs that were already needed works like a cache with a microeconomic replacement heuristic.

5 Summary and Future Work

In the first part of this paper we made clear the drawbacks of the centralized data warehouse and the non-integrated data mart approach. In order to circumvent these drawbacks and combine the benefits, integrated distributed enterprise data warehouses were presented as a possible solution. The on-line exploration of distributed data warehouses, or distributed OLAP, is related to many complex issues like distributed query optimization, meta data management and security policy handling, each of which demands innovative solutions. We characterized some possibilities for achieving query performance by dynamic and partial aggregation, data based load balancing and the specification of the quality of service. Finally, we discussed the architectural framework of the prototypical distributed OLAP system CUBESTAR.
We consider sophisticated aggregation management a major research issue. This covers algorithms for using distributed partial aggregations and for deciding which aggregate combinations yield the highest benefit and are the best to be materialized. A reasonable solution offering the needed flexibility is the application of a microeconomic framework as in the Mariposa system ([SDK+94]). We admit that this last statement still needs to be proven by the implementation.


References

BaLe97      Bauer, A.; Lehner, W.: The Cube-Query-Language for Multidimensional Statistical and Scientific Database Systems, in: 5th International Conference on Database Systems For Advanced Applications (DASFAA’97, Melbourne, Australia, April 1-4, 1997), pp. 263-272
BaPT97      Baralis, E.; Paraboschi, S.; Teniente, E.: Materialized View Selection in a Multidimensional Database, in: 23rd International Conference on Very Large Data Bases (VLDB’97, Athens, Greece, 1997)
CoCS93      Codd, E.F.; Codd, S.B.; Salley, C.T.: Providing OLAP (On-line Analytical Processing) to User Analysts: An IT Mandate, White Paper, Arbor Software Corporation, 1993
Cogn97      Cognos Corporation: Distributed OLAP, White Paper, http://www.cognos.com, 1997
Cube97      The CUBESTAR Project: http://www6.informatik.uni-erlangen.de/research/cubestar.html
Dyre96      Dyreson, C.: Information Retrieval from an Incomplete Data Cube, in: 22nd International Conference on Very Large Data Bases (VLDB’96, Mumbai, India, 1996)
Ecke97      Eckerson, W.: Building and Managing Data Marts, Patricia Seybold Group Report for Informatica, http://www.informatica.com, 1997
GHRU97      Gupta, H.; Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Index Selection for OLAP, in: 13th International Conference on Data Engineering (ICDE’97, Birmingham, UK, April 7-11, 1997)
Gupt97      Gupta, H.: Selection of Views to Materialize in a Data Warehouse, in: 6th International Conference on Database Theory (ICDT’97, Delphi, Greece, Jan 8-10, 1997), pp. 98-112
HaRU96      Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Implementing Data Cubes Efficiently, in: 25th International Conference on Management of Data (SIGMOD’96, Montreal, Quebec, Canada, June 4-6, 1996), pp. 205-216
HP97        N.N.: HP Intelligent Warehouse, White Paper, Hewlett Packard, http://www.hp.com, 1997
Inmo92      Inmon, W.H.: Building the Data Warehouse, John Wiley, 1992
Info97      N.N.: Enterprise-Scalable Data Marts: A New Strategy for Building and Deploying Fast, Scalable Data Warehousing Systems, White Paper, Informatica, http://www.informatica.com, 1997
Lehn98      Lehner, W.: Modeling Large Scale OLAP Scenarios, in: 6th International Conference on Extending Database Technology (EDBT’98, Valencia, Spain, March 23-27, 1998)
Lenz96      Lenz, R.: Adaptive Distributed Data Management with Weak Consistent Replicated Data, in: Proceedings of the 11th Annual Symposium on Applied Computing (SAC’96), Philadelphia, 1996
LeRT96      Lehner, W.; Ruf, T.; Teschke, M.: Improving Query Response Time in Scientific Databases Using Data Aggregation, in: 7th International Conference and Workshop on Database and Expert Systems Applications (DEXA’96, Zurich, Switzerland, Sept. 9-10, 1996)
SDK+94      Stonebraker, M.; Devine, R.; Kornacker, M.; Litwin, W.; Pfeffer, A.; Sah, A.; Staelin, C.: An Economic Paradigm for Query Processing and Data Migration in Mariposa, in: Proceedings of 3rd International Conference on Parallel and Distributed Information Systems (PDIS’94, Austin, TX., Sept. 28-30), 1994, pp. 58-67
ThSe97      Theodoratos, D.; Sellis, T.: Data Warehouse Configuration, in: 23rd International Conference on Very Large Data Bases (VLDB’97, Athens, Greece, 1997)
WeCa96      Wells, D.; Carnelley, P.: Ovum evaluates: The Data Warehouse, Ovum Ltd., London, 1996

 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Dernier (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Distributed Data Warehouses and OLAP Performance

ON-LINE ANALYTICAL PROCESSING IN DISTRIBUTED DATA WAREHOUSES

Jens Albrecht, Wolfgang Lehner
University of Erlangen-Nuremberg, Germany
{jalbrecht, lehner}@informatik.uni-erlangen.de

Abstract

The concepts of ‘Data Warehousing’ and ‘On-line Analytical Processing’ have seen a growing interest in the research and commercial product community. Today, the trend moves away from complex centralized data warehouses to distributed data marts integrated in a common conceptual schema. However, as the first part of this paper demonstrates, there are many problems and few solutions for large distributed decision support systems in worldwide operating corporations. After showing the benefits and problems of the distributed approach, this paper outlines possibilities for achieving performance in distributed on-line analytical processing. Finally, the architectural framework of the prototypical distributed OLAP system CUBESTAR is outlined.

1 Introduction

Today’s global economy has placed a premium on information, because in a dynamic market environment with many competitors it is crucial for an enterprise to have on-line information about its general business figures as well as detailed information on specific topics to be able to make the right decisions at the right time. That is why today almost all big companies are trying to build data warehouses. In contrast to former management information systems based mostly on operational data, data warehouses contain integrated and non-volatile data and therefore provide a consistent basis for organizational decision making. In addition to classical reporting, the main application of data warehouses is On-line Analytical Processing (OLAP, [CoCS93]), i.e. the interactive exploration of the data. The data warehouse promise is getting accurate business information fast.

But the ultimate goal of data warehousing and OLAP goes even further: the vision is to “put a crystal ball on every desktop”. Today that ideal is far from being reality. First, building a single centralized data warehouse serving many different user groups takes a long time. Setup and maintenance are very expensive. All this contributes to a relatively inflexible architecture. Second, local access behavior is considered in the data warehouse design neither on the conceptual nor on the internal but only on the external layer.

Therefore, many companies today decide to start with smaller, flexible data marts dedicated to specific business areas. There are two ways to enable cross-functional analysis on top of them. The first one is to create again a centralized data warehouse for only cross-functional summary data. The other one is to integrate the data marts into a common conceptual schema and therefore create a distributed data warehouse.

We will concentrate on the second alternative, which allows more flexible querying. To the user a distributed data warehouse should behave exactly like a centralized data warehouse. For transparent and efficient OLAP on a distributed data warehouse several problems need to be solved. The intention of this article is not to present final solutions but to show the potential of the distributed approach and the challenges for researchers and software vendors alike.

In the following section we will characterize different data warehouse architectures and their inherent drawbacks. After showing the benefits of a distributed approach to data warehousing we show the problems resulting for distributed OLAP systems. Performance as the main problem is the issue of section 3. Finally, in section 4 we give an overview of the prototypical distributed OLAP system CUBESTAR.

2 Data Warehouse Architectures

According to [Inmo92], a data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection of corporate data”. Since that definition is very generic, we will now give an overview of three types of data warehouse architectures, partly based on [WeCa96].

2.1 Enterprise Data Warehouses

What most people think of when they use the term data warehouse is the enterprise or corporate data warehouse, where all business data is stored in a single database with a single corporate data model (figure 1). Enterprise data warehouses are created to provide knowledge workers with consistent and integrated data from all major business areas. They offer the possibility to correlate information across the enterprise.
But, because of the huge subject area (the corporation) and the vast amount of different operational sources that need to be integrated, enterprise data warehouses are very difficult and time-consuming to implement. Too many different data sources and too many different user requirements are likely reasons for the failure of the whole data warehouse project. Since the data warehousing setup strategy from the analyst’s point of view is, as Inmon puts it, “give me what I want and I will tell you what I really want” [Inmo92], an incremental approach has proven to be the right choice.

[Fig. 1: Enterprise Data Warehouse]

2.2 Data Marts

The term data mart stands for the idea of a highly-focused version of a data warehouse (figure 2). There is a need to distinguish data marts which are local extracts from an enterprise data warehouse and data marts which are serving as the data warehouse for a small business group with well defined needs. In the latter case, which is the definition we will use in this paper, the limitation in size and scope of the data warehouse, e.g. only sales or marketing, dramatically reduces setup costs. Data marts can be deployed in weeks or months. That is why most of the current data warehousing projects include data marts.

[Fig. 2: Data Marts]

2.3 Distributed Enterprise Data Warehouses

Although the data mart idea is very attractive and potentially offers the opportunity to build a corporate-wide data warehouse from the bottom up, the benefits of data marts can easily be outweighed if there is no corporate-wide data warehouse strategy. If data mart design is not done carefully, independent islands of information are created and cross-functional analysis, as in the enterprise data warehouse, is not possible.

Therefore, the pragmatic approach to combine the virtues of the former is to integrate the departmental data marts into a common schema. The result is a distributed enterprise data warehouse (figure 3), also called federated data warehouse [WeCa96] or enterprise data mart [Info97].

[Fig. 3: Distributed Enterprise Data Warehouse]

The benefits of this divide-and-conquer approach are clear. In the distributed enterprise data warehouse the individual data marts can be created and managed flexibly. The flexibility for the individual data marts is traded for additional complexity in the management of the global schema and query processing across several data marts. Table 2.1 compares some characteristics to the centralized data warehouse.

Characteristics           Centralized DW   Distributed DW (Local Area)   Distributed DW (Wide Area)
Conceptual Flexibility    low              low                           moderate
Data Distribution         none             moderate                      high
Query Processing          easy             moderate                      complex
Communication Costs       low              moderate                      high
Security Requirements     none             low                           low
Local Maintenance         hard             easy                          easy
Global Maintenance        n/a              easy                          moderate

Tab. 2.1: Characteristics of Centralized and Distributed Data Warehouses

There are basically two scenarios for a distributed enterprise data warehouse. The local area scenario (e.g. the data marts of only the Asian division) characterizes what first comes to mind. The extension of this approach to a wide area scenario (the global setting depicted in figure 3) is possible, but the high degree of data distribution results in much more complexity for query processing since communication costs become a major factor. The more global the distributed data warehouse is, the more important it is to increase local access by replication. Conversely, the more local the scenario is, the more likely it is that techniques for load balancing in a shared-nothing server cluster can be applied.
2.4 Discussion

The current trend in the creation of data warehouses follows the bottom-up approach where many data marts are created with a later integration into a distributed enterprise data warehouse in mind. The Ovum report [WeCa96] predicts that centralized data warehouses will become increasingly unpopular due to their inherent drawbacks and thus be replaced by a set of interoperating data marts.

The commercial product community seems to confirm the prediction. A first step towards easy integration was recently taken by Informatica Corp. with the release of its tools for enterprise data mart support. These tools are intended to allow the easy creation and maintenance of distributed data marts and a seamless integration into a common schema. Moreover, a metadata exchange API gives query tools direct access to the metadata. OLAP tool vendors like Microstrategy Inc. have already announced support for this feature. For querying, “it means that all data marts are working off the same sets of data definitions and business rules allowing users to query aggregate data that may be distributed across multiple disparate data marts” [Ecke97]. Companies like Cognos [Cogn97] and Hewlett Packard [HP97] also already offer rudimentary support for distributed data warehouses.

But to our knowledge, no query tool today is able to transparently and efficiently generate and execute queries in a distributed setting. An intelligent middleware layer as the basis for traffic routing, aggregate caching and load balancing is needed. This is the problem we will focus on in the remaining sections.

3 Achieving Performance in Distributed OLAP Systems

In order to deserve the attribute on-line, queries should never run much longer than a minute. In this section we will outline some specific methods and necessary prerequisites to achieve performance in large distributed OLAP environments. Note that some of these also apply in the non-distributed case but become essential in the distributed case.

3.1 The Selection and Use of Aggregates

Researchers and vendors agree that the main approach to speed up OLAP queries is the use of aggregates or materialized views [LeRT96]. In the last two years many articles on the selection of views to materialize in a data warehouse were published. A good overview of techniques for the selection of aggregates can be found in [ThSe97].

Dynamic Management of Aggregates. None of the recent articles considers the dynamics of an on-line environment with dozens of completely different queries to execute in every minute. If user behavior is considered at all, only a fixed number of queries is taken as the basis for the aggregation algorithms ([GHRU97], [Gupt97], [BaPT97]). On the commercial product side there are OLAP tools like Microstrategy’s DSS Tools or Informix’ Metacube which are able to give the DBA hints for the creation of aggregates based on monitored user behavior. However, if the DBA decides to create aggregates, these are static and a change in the user behavior is not reflected.

Today, aggregates are more or less seen as another multidimensional access method, similar to an index, that must be managed by the DBA. But only a self-maintaining, dynamic aggregate management working like a (distributed) system buffer or cache for aggregates offers the needed flexibility and the potential for short query response times and low human maintenance efforts. The selection and the usage of aggregates should be completely transparent to the user. The DBA should only be able to set up parameters for the creation of aggregates, but not their content.

Partial Aggregates. The algorithms for the selection of aggregates explicitly supporting the multidimensional model compute preaggregates that are based on the full scope of the data cube. The query result in figure 4, if materialized, would be such a complete aggregate. In relational terms, only different attribute combinations in the group-by clause are considered. The where-clause in the materialized view definition is always empty. The benefit of this approach is that query processing is relatively simple. Therefore, this is the only kind of aggregates commercial products support today.

[Fig. 4: Patchworking algorithm. The query result is assembled from selected candidate aggregates stored at sites A, B and C; the legend distinguishes the complete data cube, the selected candidate aggregates and other possible candidates.]

However, materialized views created in that manner may contain much data that is never needed and therefore storage space is wasted. Even more important, in a distributed setting with inherent data fragmentation the handling of partial aggregates (the candidates in figure 4) becomes a necessary prerequisite. A data mart server has only parts of the whole data cube stored locally. In fact, that is the fundamental idea of having data marts in a distributed enterprise data warehouse. Thus, at least at the lower aggregation levels it does not even make sense to create aggregates with full, enterprise-wide scope.

The big drawback is clearly that query optimization in the presence of partial aggregates becomes more complex. A patchworking algorithm finding the least expensive aggregate partitions must be invoked (figure 4).

Data Allocation and Load Balancing. Directly related to the question of selecting aggregates in a distributed environment is the allocation problem. The system should not only manage the dynamic creation of partial aggregates, but also decide where to place them on a tactical basis. Moreover, the traditional method to increase local access performance in distributed environments by the use of replicas can and should also be applied in distributed OLAP.

Hence, redundancy in the distributed data warehouse can be divided into vertical redundancy, involving the creation of summary data with a coarser granularity than the detailed raw data, and horizontal redundancy, introduced by the replication of aggregates (figure 5).

[Fig. 5: Horizontal and Vertical Redundancy. Vertical redundancy: summary data derived from raw data by aggregation operations. Horizontal redundancy: summary data replicated.]

In order to allow the system to adapt to changing user requirements, there should be no fixed locations for any kind of redundant data fragments. Aggregate data should be close to sites where they are needed in order to minimize the communication costs. In order to increase parallel query execution, data fragments should be distributed on several data mart servers, especially if communication cost is low. Redundant fragments must dynamically be created and dropped according to the changing user behavior.

Thus, a flexible, adaptive migration and replication policy is essential for data-based load balancing at the data mart servers. Involved in the process of choosing and allocating aggregates are not only access characteristics but also communication cost, maintenance cost for redundant data and additional storage cost.

Differences to traditional distributed DBMS. Although the fragmentation and allocation problem is well known from traditional distributed database systems, there are several specifics for distributed data warehouses.

The fragmentation of the raw data is already fixed by the definition of the data mart. Usually, there is not much of a choice here. The problems as well as the performance potential of distributed data warehouses arise from the mentioned techniques of dynamic aggregate management and dynamic data allocation. These techniques were not previously applicable because redundancy in transaction-oriented systems always requires expensive synchronization mechanisms.

The query optimizer of a distributed OLAP system first needs to be aware of the aggregations which would be helpful in order to speed up query processing. If this information is known it must invoke a patchworking algorithm finding the least expensive aggregate partitions (figure 4). The choice how to assemble the query result may vary from query to query because the underlying redundant fragments may be dropped for the selection of new ones by the resource management and the system and network load may change.

An important point is that the determination of candidates as well as the patchworking itself, which includes the comparison of selection predicates, is only feasible in the context of a multidimensional data model. In contrast to the simple relational data model, selections (slices) can be defined by classification nodes (see section 4.2). Predicates defined in that manner contain much more semantics than a simple relational selection predicate. If arbitrary predicates were allowed for the definition of fragments, patchworking would not be possible.

3.2 Quality of Service Specification

Another point greatly distinguishes the distributed OLAP scenario from traditional distributed databases. In the analytical context general figures are of interest and not individual entities. It is often the case that approximate numbers are sufficient if they can be delivered faster. Thus, it is feasible to trade correctness for performance. Some factors characterizing correctness are accuracy, actuality and consistency.

Accuracy. One approach to relaxing the accuracy of the reports is offering summary data that is either aggregated at a higher level than requested or covers only parts of the requested information. For example, if a query requesting sales figures for each city in Germany would take hours, an answer containing sales figures for each county or even only for South Germany might be enough for the moment. An algorithm dealing with these issues is given in [Dyre96].

Actuality. Absolute actuality in a worldwide distributed enterprise data warehouse is an expensive requirement. If each data mart is updated once a day for example, it means for the distributed data warehouse that it is updated every
few hours. Therefore, the maintenance cost for summary data can increase dramatically. But getting the most current numbers may not be necessary for the analyst. Hence, a certain degree of staleness can be acceptable, allowing redundant data to be updated or dropped gradually [Lenz96].

Consistency. The consistency of the reports generated in a session should be another adjustable factor. In general, it is desirable that a sequence of drill-down and roll-up operations relies on the same base figures. However, this requirement might not be crucial if the analyst just wants to get an overview of some figures. Hence, this condition might also be relaxed according to the analyst's needs. Note that generating consistent reports can be very expensive, because all reports in a session must reflect the same state of the data.

Especially in a large distributed environment where communication costs cannot be neglected, these factors are critical for reasonable overall performance. Slightly inaccurate, slightly stale or slightly inconsistent data may be sufficient for the analyst if the figures arrive in a few seconds rather than minutes.

4 The Distributed OLAP system CUBESTAR

This section focuses on the architecture and the principles for query processing and resource management of the prototypical distributed OLAP system CUBESTAR. In order to deal with the complexity of the issues we borrowed many ideas from the Mariposa system ([SDK+94]) and applied them to our special application domain.

4.1 The Architecture

CUBESTAR has a three-tier architecture, as shown in figure 6. At the client layer are the end user tools with reporting facilities. Queries are issued in the Cube-Query-Language (CQL, [BaLe97], figure 7), which specifically supports the multidimensional data model.

The heart of the architecture is the middleware layer, which hides the details of data distribution. Its main task is to translate a user query and generate an optimized distributed query execution plan. For distributed query processing and aggregate management we apply a microeconomic paradigm. A site tries to maximize the revenues earned for taking part in query processing. In order to compete, the site has to maintain a set of redundant data according to the current market needs. The middleware layer also implements a global classification information service providing a globally consistent view of the multidimensional data model.

Fig. 6: Architecture of CUBESTAR (clients; middleware with parser, broker, partition directory cache and global classification information; distributed server federation of data mart servers, each with bidder (B), resource manager (RM) and aggregation engines (AE); control flow and data flow)

The third layer (server layer) consists of a federation of data mart servers. Each data mart server controls the locally available set of data partitions and the parallel query execution at its machine.

4.2 The Data Model

CUBESTAR consistently applies the multidimensional data model at all levels of query processing and aggregate management. Only issues of physical data access are managed by the underlying relational database engines. Since the main operation in OLAP is aggregation, classification hierarchies can be used to define the groups to which aggregation operations are applied (figure 7). For example, the typical OLAP query of figure 7, specified in CQL, asks for the sales figures at the granularity of product families subsumed by video equipment (= camcorders and recorders).

Multidimensional Objects. In analogy to the ‘cubette’ approach of [Dyre96], a query may be seen as a multidimensional subcube, called Multidimensional Object (MO), basically described by three components [Lehn98]:

• A specification of the original cell and the applied aggregation operator (SUM(Sales))
• A multidimensional scope specification consisting of a tuple of classification nodes (<P.Group = ‘Video’, G.Country = ‘Germany’, T.Year = ‘1997’>)
• A data granularity specification consisting of a tuple of categories (<P.Family, G.Top, T.Top>)
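The three components listed above can be written down as a small record type. The following Python sketch is only an illustration under stated assumptions: the class name `MO`, the `LEVELS` table and the helper `granularity_allows_derivation` are hypothetical and not part of CUBESTAR, and the Geography and Time levels below the ones named in the example are invented for the sake of a complete ordering. The check is deliberately simplified: it only compares granularities of MOs with identical operator, measure and scope, which is sufficient for a distributive operator such as SUM.

```python
from dataclasses import dataclass

# Assumed category orderings per dimension, finest first. The product levels
# follow figure 7; the Geography and Time levels are illustrative guesses.
LEVELS = {
    "P": ["Article", "Family", "Group", "Area", "Top"],
    "G": ["City", "Country", "Top"],
    "T": ["Month", "Year", "Top"],
}


@dataclass
class MO:
    """Hypothetical record for a Multidimensional Object."""
    operator: str      # applied aggregation operator, e.g. "SUM"
    measure: str       # original cell / measure, e.g. "Sales"
    scope: dict        # dimension -> (category, classification node)
    granularity: dict  # dimension -> category


def is_finer_or_equal(dim: str, a: str, b: str) -> bool:
    """True if category a is at least as fine as category b in dimension dim."""
    order = LEVELS[dim]
    return order.index(a) <= order.index(b)


def granularity_allows_derivation(source: MO, target: MO) -> bool:
    """Simplified check: same operator, measure and scope, and the source is
    at least as fine as the target in every dimension."""
    same_base = (source.operator, source.measure, source.scope) == (
        target.operator, target.measure, target.scope)
    if not same_base:
        return False
    return all(
        is_finer_or_equal(dim, source.granularity[dim], target.granularity[dim])
        for dim in LEVELS
    )


video_scope = {
    "P": ("Group", "Video"),
    "G": ("Country", "Germany"),
    "T": ("Year", "1997"),
}

# The MO described by the three example components above.
query_mo = MO("SUM", "Sales", video_scope,
              {"P": "Family", "G": "Top", "T": "Top"})

# A finer MO with the same scope, kept at article level.
article_mo = MO("SUM", "Sales", video_scope,
                {"P": "Article", "G": "Top", "T": "Top"})
```

With these definitions, `granularity_allows_derivation(article_mo, query_mo)` holds, while the reverse direction does not, since family totals cannot be split back into articles.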
Classification levels of the product dimension, from coarsest to finest:

TOP: ALL
Area: Consumer Electronics
Group: Audio, Video
Family: Camcorder, Recorder (under Video)
article#: TR-75, TS-78, A200, V-201, ClassicI

    SELECT SUM(SALES)
    FROM   Product P, Geography G, Time T
    WHERE  P.Group = ‘Video’,
           G.Country = ‘Germany’,
           T.Year = ‘1997’
    UPTO   P.Family

Fig. 7: A Sample Classification Hierarchy and a Sample Query in CQL

Within the distributed multidimensional OLAP environment, multidimensional objects are the basic units for distributed storage management and query processing. Therefore, different types of MOs are distinguished with regard to their generation in the distributed environment.

The raw or base data partitions at each data mart are called Base Multidimensional Objects (BMO). Thus, BMOs represent data of the finest granularity. BMOs are not allowed to migrate but are assigned a unique home site, where they are uploaded by the data mart loading tools. Consequently, they are by definition up to date.

From the BMOs any number of MOs can be derived. In a relational sense these derived MOs are materialized views. However, because of the definition of MOs, the questions whether one MO is contained in, intersected by or derivable from another MO are easily decidable. According to the requirements of the system, MOs are allowed to migrate or be replicated freely.

4.3 Microeconomics

An elegant solution to cope with the complexity of dynamic aggregate management and load balancing is to put all issues related to shared resources into a microeconomic framework ([SDK+94]). In an economy where every site tries to selfishly maximize its utility there is no need for a central coordinator; the decision process is inherently decentralized. Moreover, the competition for orders and profits allows the system to dynamically adapt to market demands, i.e. to query behavior and system load.

In contrast to Mariposa, where demand and supply regulate the prices, for the sake of simplicity a uniform billing scheme with fixed rates for the aggregation efforts is applied to each executed query. The value of a query result is computed by a weighted combination of the following factors:

• Actual size of the raw data MOs required for the computation (number of tuples). Taking the size of the raw data is reasonable, since requests for aggregates of huge fact tables should be more expensive than those of smaller ones. Moreover, this decision favors the storage of higher aggregates, since they need little space and yield high profits.
• Actual size of the query result (number of tuples). This term simply values the additional overhead for storing large aggregates.
• Aggregation time. This is the time spent from issuing the query to delivering the result, i.e. the local query processing time. Higher aggregation time results in a lower price, thus favoring fast data mart servers.
• Transfer time. The computation may require the shipping of intermediate MOs from one data mart server to another.

4.4 Query Processing

A query in CUBESTAR is basically a multidimensional object specified by a CQL query like the one in figure 7. Associated with each query is a user-defined quality of service specification consisting of a limit on the query processing time and an indicator for the requested accuracy of the data. Thus, the user can trade speed against the currency of the data.

The query is issued to a broker, which tries to perform the query in the best way possible on behalf of the user. Since all users are treated equally and prices are regulated, a query is not assigned a budget. The broker's objective is simply to get the best service available for the user's request, i.e. to minimize query execution time under the quality of service constraints. After query processing, the broker pays the price according to the billing scheme to the participating data mart servers. Note that in order to support user priorities it would make sense to limit the user's budget, which would add price as a second objective of query optimization.

Query processing basically proceeds in three steps: query preparation, query optimization, and query execution.

In the query preparation phase, the query string is handed over to the parser, where it is checked for syntactic and semantic correctness. The result of this step is a tree-like representation of the query with the queried MO at the top, intermediate results at the inner nodes, and raw data at the leaves.
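The tree-like query representation produced by the preparation phase can be pictured with a very small data structure. The sketch below is an assumption-laden illustration, not the CUBESTAR parser output: the class `QueryNode`, the helper `leaves` and the way the sample query of figure 7 is split into per-family intermediate results are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """Hypothetical node of the query tree: one MO description per node."""
    label: str
    children: list = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node: QueryNode) -> list:
    """Collect the labels of all leaf nodes (raw data) from left to right."""
    if node.is_leaf():
        return [node.label]
    collected = []
    for child in node.children:
        collected.extend(leaves(child))
    return collected


# Queried MO at the top, intermediate results at the inner nodes,
# raw data at the leaves. The decomposition by family is an assumption.
root = QueryNode(
    "SUM(Sales) by P.Family for Video, Germany, 1997",
    [
        QueryNode("SUM(Sales) for family Camcorder",
                  [QueryNode("raw sales, camcorder articles")]),
        QueryNode("SUM(Sales) for family Recorder",
                  [QueryNode("raw sales, recorder articles")]),
    ],
)
```

Calling `leaves(root)` lists the raw data inputs that the later optimization step has to locate, or replace by suitable pre-aggregated MOs.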
For the generation of a distributed query execution plan the current distribution of the aggregates must be taken into account. Note that, due to the dynamic aggregation management, the existence and location of aggregates and replicas change frequently. Therefore, the broker may issue a broadcast request to the data servers in order to obtain the necessary information. This kind of meta data is cached in the partition directory cache in order to avoid too many broadcasts. With this method it is possible that invalid meta data are used. In this case there are two possibilities: either force a data mart server to recompute dropped aggregates, or submit a new broadcast.

At the data mart server side, the bidder component receiving the request checks for those locally available MOs which can contribute to the query. Each bid from a data mart server includes the following data:

• the identifier of the data mart server,
• a set of tuples { (Mi, Ti) }, each of which contains the description of an offered MO in the repository and the time the site expects to need to aggregate that MO to the granularity of the query M,
• an indicator for the available processing power the site is willing to reserve for the query depending on the system load,
• the free space the site is willing to reserve for intermediate results, possibly from other data mart servers.

The query optimizer's task is now to select, from the set of multidimensional objects offered by the bidding sites, those which require the least aggregation and transportation efforts, using a patchworking algorithm. To estimate communication costs, average network throughputs between the sites are used.

The result of this step is a distributed query execution plan, which is basically a tree with physically existing MOs at the leaves, the target MO at the top, and intermediate results at the inner nodes. Attached to each node are an operation, an estimated processing time for the operation, and the location for the execution of the operation, denoting the bidders that won the competition.

In the query execution step, the bidders are informed whether their offer was accepted or not. The notification includes the complete query execution plan, allowing the sites to understand the decision of the broker and draw conclusions for their future resource management.

Once all MOs necessary for processing have arrived at the site, the local query execution is initiated by an aggregation engine. If subqueries can be executed locally in parallel, the aggregation engine may spawn new aggregation engines running in parallel threads. After the aggregation engine has completed its task, in the last step the resulting MO is sent to the target location.

4.5 Storage Management

So far, the existence of many redundant multidimensional objects was assumed. This subsection focuses on when, how, and where these MOs are generated. In order to deal with the complexity of the issues of section 3.1, this problem is handled by the application of microeconomics.

The resource manager at a data server aims at maximal usage of its local resources in terms of revenues. Therefore, a site always tries to expand its market share by cleverly creating and dropping aggregates and replicas. In order to estimate current demands the resource manager maintains a query history list containing those requests the site did not get to answer, either because it did not have any suitable MOs or because other bidders were preferred. The list entries include for each query the MOs that were actually selected, with their locations and revenues. Based on this information the resource manager can decide to obtain new MOs in order to underbid its competitors in the future. If the new MO is to be acquired from a remote location, the resource manager acts as a client itself, and therefore also has to pay for the MO from its own account. In this case, it is making an investment.

The other way to use the storage space is to offer it for query processing. If MOs from different locations must be coalesced, one site will receive the other parts. Note that it is very lucrative to receive pieces of MOs that fit locally stored ones, because next time a larger MO can be offered.

In fact, using the intermediate results of query processing is the most convenient way to get new MOs, also in the local case. Each of these MOs is checked by the resource manager for usefulness. Basically, storing MOs that were already needed works like a cache with a microeconomic heuristic for replacement.

5 Summary and Future Work

In the first part of this paper we made clear the drawbacks of the centralized data warehouse and the non-integrated data mart approach. In order to circumvent these drawbacks and combine the benefits, integrated distributed enterprise data warehouses were presented as a possible solution. The on-line exploration of distributed data warehouses, or distributed OLAP, is related to many complex issues like distributed query optimization, meta data management and security policy handling, each of which demands innovative solutions. We characterized some possibilities for achieving query performance by dynamic and partial aggregation, data based load balancing and the specification of the quality of service. Finally, we discussed the architectural framework of the prototypical distributed OLAP system CUBESTAR.
We consider sophisticated aggregation management a major research issue. This covers algorithms for using distributed partial aggregations and deciding which aggregate combinations yield the highest benefit and are the best to be materialized. A reasonable solution offering the needed flexibility is the application of a microeconomic framework as in the Mariposa system ([SDK+94]). We admit that this last statement needs to be proven by the implementation.

References

BaLe97 Bauer, A.; Lehner, W.: The Cube-Query-Language for Multidimensional Statistical and Scientific Database Systems, in: 5th International Conference on Database Systems For Advanced Applications (DASFAA’97, Melbourne, Australia, April 1-4, 1997), pp. 263-272

BaPT97 Baralis, E.; Paraboschi, S.; Teniente, E.: Materialized View Selection in a Multidimensional Database, in: 23rd International Conference on Very Large Data Bases (VLDB’97, Athens, Greece, 1997)

CoCS93 Codd, E.F.; Codd, S.B.; Salley, C.T.: Providing OLAP (On-line Analytical Processing) to User Analysts: An IT Mandate, White Paper, Arbor Software Corporation, 1993

Cogn97 Cognos Corporation: Distributed OLAP, White Paper, http://www.cognos.com, 1997

Cube97 The CUBESTAR Project: http://www6.informatik.uni-erlangen.de/research/cubestar.html

Dyre96 Dyreson, C.: Information Retrieval from an Incomplete Data Cube, in: 22nd International Conference on Very Large Data Bases (VLDB’96, Mumbai, India, 1996)

Ecke97 Eckerson, W.: Building and Managing Data Marts, Patricia Seybold Group Report for Informatica, http://www.informatica.com, 1997

GHRU97 Gupta, H.; Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Index Selection for OLAP, in: 13th International Conference on Data Engineering (ICDE’97, Birmingham, UK, April 7-11, 1997)

Gupt97 Gupta, H.: Selection of Views to Materialize in a Data Warehouse, in: 6th International Conference on Database Theory (ICDT’97, Delphi, Greece, Jan 8-10, 1997), pp. 98-112

HaRU96 Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Implementing Data Cubes Efficiently, in: 25th International Conference on Management of Data (SIGMOD96, Montreal, Quebec, Canada, June 4-6, 1996), pp. 205-216

HP97 N.N.: HP Intelligent Warehouse, White Paper, Hewlett Packard, http://www.hp.com, 1997

Inmo92 Inmon, W.H.: Building the Data Warehouse, John Wiley, 1992

Info97 N.N.: Enterprise-Scalable Data Marts: A New Strategy for Building and Deploying Fast, Scalable Data Warehousing Systems, White Paper, Informatica, http://www.informatica.com, 1997

Lehn98 Lehner, W.: Modeling Large Scale OLAP Scenarios, in: 6th International Conference on Extending Database Technology (EDBT’98, Valencia, Spain, March 23-27, 1998)

Lenz96 Lenz, R.: Adaptive Distributed Data Management with Weak Consistent Replicated Data, in: Proceedings of the 11th annual Symposium on Applied Computing (SAC’96), Philadelphia, 1996

LeRT96 Lehner, W.; Ruf, T.; Teschke, M.: Improving Query Response Time in Scientific Databases Using Data Aggregation, in: 7th International Conference and Workshop on Database and Expert Systems Applications (DEXA’96, Zurich, Switzerland, Sept. 9-10, 1996)

SDK+94 Stonebraker, M.; Devine, R.; Kornacker, M.; Litwin, W.; Pfeffer, A.; Sah, A.; Staelin, C.: An Economic Paradigm for Query Processing and Data Migration in Mariposa, in: Proceedings of 3rd International Conference on Parallel and Distributed Information Systems (PDIS’94, Austin, TX., Sept. 28-30), 1994, pp. 58-67

ThSe97 Theodoratos, D.; Sellis, T.: Data Warehouse Configuration, in: 23rd International Conference on Very Large Data Bases (VLDB’97, Athens, Greece, 1997)

WeCa96 Wells, D.; Carnelley, P.: Ovum evaluates: The Data Warehouse, Ovum Ltd., London, 1996