DBM630: Data Mining and
                       Data Warehousing

                              MS.IT. Rangsit University
                                                 Semester 2/2011




                           Lecture 2&3
    Data Warehouse and OLAP Technology

    by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Outline
Part I: Basic Knowledge about Data Warehousing
 Dimensional Modeling
 Data Cube
 Architecture


Part II: OLAP and Cube Computations
 OLAP, Data Cube and Data Analysis
 Cube Computations
 Demo PivotTable (bring a laptop with MS Excel, if you have one)

 2                              Data Warehousing and Data Mining by Kritsada Sriphaew
What is Data Warehouse?
       Defined in many different ways, but not rigorously.
         A decision support database that is maintained separately
          from the organization’s operational DB
          Supports information processing by providing a solid
          platform of consolidated, historical data for analysis.

       “A data warehouse is a subject-oriented, integrated, time-
        variant, and nonvolatile collection of data in support of
        management’s decision-making process.” (definition by W. H. Inmon)

       Data warehousing:
         Process of constructing and using data warehouses
Four Properties of Data Warehouses
 subject-oriented: data is stored by subject

 integrated: data is gathered into one consistent format

 time-variant: data is kept along the time dimension

 non-volatile: data is stable and is not lost
Subject-Oriented
   Subject-Oriented Property
       Organized around major subjects, such as customer, product, sales.
       Focusing on the modeling and analysis of data for decision makers, not on daily
        operations or transaction processing.
       Provide a simple and concise view around particular subject issues by excluding
        data that are not useful in the decision support process.

                 OPERATIONAL DB                  DATA WAREHOUSE
                 •   Loans                       •   Customer
                 •   Savings                     •   Vendor
                 •   Bank card                   •   Product
                 •   Trust                       •   Activity
                 An application orientation      A subject orientation


Integrated
   Integrated Property
       Constructed by integrating multiple, heterogeneous data sources
           Relational databases, flat files, on-line transaction
            records
       Data cleaning and data integration techniques are applied.
           Ensure consistency in naming conventions, encoding
            structures, attribute measures, etc. among different data
            sources
             e.g.,   Hotel price: currency, tax, breakfast covered, etc.
           When data is moved to the warehouse, it is converted.
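
The conversion step for the hotel price example can be sketched as follows. This is only an illustration: the exchange rates, the tax rate, and the names HotelPrice and to_warehouse_price are assumptions made for the sketch, and the "breakfast covered" attribute is left out to keep it short.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to one common currency (USD); illustrative values only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "THB": 0.03}


@dataclass
class HotelPrice:
    amount: float        # price as quoted by the source system
    currency: str        # currency code used by the source system
    tax_included: bool   # whether the quoted amount already includes tax
    tax_rate: float      # tax rate to add when tax is not included


def to_warehouse_price(p: HotelPrice) -> float:
    """Convert a source price to one convention: USD, tax included."""
    gross = p.amount if p.tax_included else p.amount * (1 + p.tax_rate)
    return round(gross * RATES_TO_USD[p.currency], 2)


# Two sources that quote the same kind of fact in different conventions.
source_a = HotelPrice(100.0, "EUR", tax_included=True, tax_rate=0.0)
source_b = HotelPrice(3000.0, "THB", tax_included=False, tax_rate=0.07)

print(to_warehouse_price(source_a))
print(to_warehouse_price(source_b))
```

After conversion, both prices follow the same convention and can be compared or aggregated directly in the warehouse.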

Time Variant/Non-Volatile
   Time Variant Property
       The time horizon for the data warehouse is significantly longer than that of
        operational systems.
           Operational database: current value data.
           Data warehouse data: provide information from a historical perspective (e.g., past 5-10
            years)
       Every key structure in the data warehouse
           Contains an element of time, explicitly or implicitly
           But the key of operational data may or may not contain “time element”.


   Non-Volatile Property
       A physically separated store of data transformed from the operational
        environment.
       Operational update of data does not occur in the data warehouse environment.
           Does not require transaction processing, recovery, and concurrency control mechanisms
           Requires only two operations in data accessing: initial loading of data and access of
            data.
Operational DBMS vs. Data Warehouse
(OLTP vs. OLAP)
       OLTP (on-line transaction processing)
           Major task of traditional relational DBMS
           Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll,
            registration, accounting, etc.
       OLAP (on-line analytical processing)
           Major task of data warehouse system
           Data analysis and decision making
       Distinct features (OLTP vs. OLAP):
           User and system orientation: customer vs. market
           Data contents: current, detailed vs. historical, consolidated
           Design: ER + application vs. star + subject
           View: current, local vs. evolutionary, integrated
           Access patterns: create/select/insert/delete/update vs. read-only but complex queries
OLTP vs. OLAP
                     OLTP                              OLAP
users                clerk, IT professional            knowledge worker
function             day-to-day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date, detailed,    historical, summarized, multidimensional,
                     flat relational, isolated         integrated, consolidated
usage                repetitive                        ad hoc
access               read/write,                       lots of scans
                     index/hash on primary key
unit of work         short, simple transaction         complex query
# records accessed   tens                              millions
# users              thousands                         tens
DB size              100 MB-GB                         100 GB-TB
metric               transaction throughput            query throughput, response time
Heterogeneous DBMS vs. Data Warehouse
    Traditional heterogeneous DB integration
        Build wrappers/mediators on top of heterogeneous DBs
        Query-driven approach
          When a query is posed to a client site, a meta-dictionary is used
           to translate the query into queries appropriate for individual
           heterogeneous sites involved, and the results are integrated into a
           global answer set
           Requires complex information filtering and competes for
           resources at the local sources


    Data warehouse (DW)
        Another database for high-performance analysis
        Update-driven approach
          Information from heterogeneous sources is integrated in advance
           and stored in warehouses for direct query/analysis

Heterogeneous DBMS vs. Data Warehouse
    [Figure: two architectures over the same heterogeneous operational
     databases (OLTP: Sales, Purchasing, Production)]
    Query-driven approach:
        queries reach the operational databases through wrappers/mediators
    Update-driven approach:
        data from the operational databases is loaded into the data warehouse,
        and OLAP queries run against the warehouse
Why Separate Data Warehouse?
   High performance for both systems
       DBMS— tuned for OLTP: access methods, indexing, concurrency
        control, recovery
       Warehouse—tuned for OLAP: complex OLAP queries,
        multidimensional view, consolidation.
   Different functions and different data:
       Missing data: Decision support (OLAP) requires historical data which
        operational DBs (OLTP) do not typically maintain
       Data consolidation: DS requires consolidation (aggregation,
        summarization) of data from heterogeneous sources
       Data quality: different sources typically use inconsistent data
        representations, codes and formats which have to be reconciled

* A multidimensional data model
 From Tables and Spreadsheets to Data Cubes
  A data warehouse is based on a multidimensional data
   model which views data in the form of a data cube
  A data cube, such as sales, allows data to be modeled
   and viewed in multiple dimensions
           Fact table contains measures (such as units_sold, dollars_sold) and keys to each of
            the related dimension tables
           Dimension tables, such as location (branch, country, continent), or time (day, week,
            month, quarter, year)
  Fact Table
  Date        Branch    Item     Buyer           Units sold   Dollars sold
  1/1/2008    London    VCD      First Company   20           5000
  1/1/2008    Bangkok   TV       First Company   30           9000
  10/1/2008   London    Ham      First Company   20           1000
  4/2/2008    London    Milk     First Company   80           1600
  15/2/2008   Bangkok   VCD      Best Company    30           7500
  2/5/2008    Bangkok   Orange   Best Company    20           500

  Dimension Table
  Branch    Country    Continent
  London    UK         Europe
  Glasgow   UK         Europe
  Berlin    Germany    Europe
  Bangkok   Thailand   Asia
  Phuket    Thailand   Asia
  Tokyo     Japan      Asia
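
As a rough sketch of how the fact table and a dimension table work together, the snippet below looks up each fact row's branch in the location dimension and totals dollars sold per country. The rows are the ones shown on the slide; the variable names and the dictionary-based layout are assumptions made for illustration.

```python
from collections import defaultdict

# Fact rows from the slide: (date, branch, item, buyer, units_sold, dollars_sold)
fact = [
    ("1/1/2008", "London", "VCD", "First Company", 20, 5000),
    ("1/1/2008", "Bangkok", "TV", "First Company", 30, 9000),
    ("10/1/2008", "London", "Ham", "First Company", 20, 1000),
    ("4/2/2008", "London", "Milk", "First Company", 80, 1600),
    ("15/2/2008", "Bangkok", "VCD", "Best Company", 30, 7500),
    ("2/5/2008", "Bangkok", "Orange", "Best Company", 20, 500),
]

# Location dimension table: branch -> (country, continent)
location = {
    "London": ("UK", "Europe"),
    "Glasgow": ("UK", "Europe"),
    "Berlin": ("Germany", "Europe"),
    "Bangkok": ("Thailand", "Asia"),
    "Phuket": ("Thailand", "Asia"),
    "Tokyo": ("Japan", "Asia"),
}

# The branch column of each fact row acts as the key into the dimension table.
dollars_by_country = defaultdict(int)
for _date, branch, _item, _buyer, _units, dollars in fact:
    country, _continent = location[branch]
    dollars_by_country[country] += dollars

print(dict(dollars_by_country))
```

The same lookup with the continent column would give the next level up in the location hierarchy.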
Three Concepts in Data Cubes
     Three concepts in data cubes are
     (1) Multidimension (2) Hierarchy (3) Measure

  Fact Table (4 dimensions, 2 measures)
  Date        Branch    Item     Buyer           Units sold   Dollars sold
  1/1/2008    London    VCD      First Company   20           5000
  1/1/2008    Bangkok   TV       First Company   30           9000
  10/1/2008   London    Ham      First Company   20           1000
  4/2/2008    London    Milk     First Company   80           1600
  15/2/2008   Bangkok   VCD      Best Company    30           7500
  2/5/2008    Bangkok   Orange   Best Company    20           500

  Location Dimension Table (3 levels)
  Branch    Country    Continent
  London    UK         Europe
  Glasgow   UK         Europe
  Berlin    Germany    Europe
  Bangkok   Thailand   Asia
  Phuket    Thailand   Asia
  Tokyo     Japan      Asia

  Product Dimension Table (3 levels)
  Item     Subcategory    Category
  VCD      Electric       Non-Food
  TV       Electric       Non-Food
  Shirt    Clothes        Non-Food
  Ham      Process food   Food
  Milk     Fresh food     Food
  Orange   Fresh food     Food

  Customer Dimension Table (2 levels)
  Buyer            Buyer Group
  First Company    Group 1
  Second Company   Group 1
  Third Company    Group 1
  Best Company     Group 2
  Good Company     Group 2
Cuboids in Data Cubes
 Each cuboid is identified by the set of dimensions it
  groups by; a k-D cuboid groups by k dimensions.
 The topmost 0-D cuboid, which holds the highest level
  of summarization, is called the apex cuboid.
 In data warehousing literature, the n-D cuboid, where n
  is the total number of dimensions, is called the
  base cuboid.
 The lattice of cuboids forms a data cube.
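
A small sketch of the lattice: each cuboid corresponds to one subset of the dimensions, so n dimensions give 2^n cuboids when concept hierarchies are not considered. The dimension names below (time, item, location, customer) are taken as an example; the variable names are illustrative.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "customer"]

# Group every subset of the dimensions by its size:
# 0 dimensions is the apex cuboid, all n dimensions is the base cuboid.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids:", cuboids)

# Total number of cuboids in the lattice.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```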



An Example of Cuboids (Dimension)
   (add vs. delete dimensions)
  Moving down one level adds a dimension; moving up one level deletes a dimension.

  0-D (apex) cuboid: Sales(all)
      Sales
      24000

  1-D cuboid: Sales(Q1), Sales(Q2), Sales(Q3), Sales(Q4)
      Time   Sales
      Q1     5000
      Q2     4000
      Q3     8000
      Q4     7000

  2-D cuboid: Sales(Q1, Thailand), Sales(Q1, Japan), ..., Sales(Q4, Thailand), Sales(Q4, Japan)
      Time   Thailand   Japan
      Q1     2000       3000
      Q2     1500       2500
      Q3     2000       6000
      Q4     3000       4000

  3-D cuboid: Sales(Q1, Thailand, Food), Sales(Q1, Thailand, NonFood), ...,
              Sales(Q4, Japan, Food), Sales(Q4, Japan, NonFood)
      Time   Thailand           Japan
             Food    NonFood    Food    NonFood
      Q1     1500    500        2000    1000
      Q2     900     600        1500    1000
      Q3     1200    800        4000    2000
      Q4     2000    1000       2500    1500
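
The relationship between the cuboids in this example can be checked with a short sketch: starting from the 3-D cuboid, deleting a dimension means summing over it. The function name aggregate and the tuple-keyed dictionary are assumptions made for illustration; the numbers are the ones on the slide.

```python
from collections import defaultdict

# 3-D cuboid from the slide: (quarter, country, category) -> sales
base = {
    ("Q1", "Thailand", "Food"): 1500, ("Q1", "Thailand", "NonFood"): 500,
    ("Q1", "Japan", "Food"): 2000, ("Q1", "Japan", "NonFood"): 1000,
    ("Q2", "Thailand", "Food"): 900, ("Q2", "Thailand", "NonFood"): 600,
    ("Q2", "Japan", "Food"): 1500, ("Q2", "Japan", "NonFood"): 1000,
    ("Q3", "Thailand", "Food"): 1200, ("Q3", "Thailand", "NonFood"): 800,
    ("Q3", "Japan", "Food"): 4000, ("Q3", "Japan", "NonFood"): 2000,
    ("Q4", "Thailand", "Food"): 2000, ("Q4", "Thailand", "NonFood"): 1000,
    ("Q4", "Japan", "Food"): 2500, ("Q4", "Japan", "NonFood"): 1500,
}


def aggregate(cells, keep):
    """Sum out every dimension whose position is not listed in `keep`."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


by_quarter_country = aggregate(base, keep=(0, 1))  # 2-D cuboid
by_quarter = aggregate(base, keep=(0,))            # 1-D cuboid
apex = aggregate(base, keep=())                    # 0-D (apex) cuboid

print(by_quarter_country)
print(by_quarter)
print(apex)
```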
Data Cube: A Lattice of Cuboids (Dimension)
  0-D (apex) cuboid:   all
  1-D cuboids:         time; item; location; customer
  2-D cuboids:         time, item; time, location; time, customer;
                       item, location; item, customer; location, customer
  3-D cuboids:         time, item, location; time, item, customer;
                       time, location, customer; item, location, customer
  4-D (base) cuboid:   time, item, location, customer
Designing Data Warehouses
   Conceptual Modeling (Star schema, Snowflake, Fact
    constellations), Concept Hierarchy

   Physical Modeling (Data sources, Data storage, OLAP
    engine)




Conceptual Modeling of Data Warehouses
        Modeling data warehouses: dimensions & measures
          Star schema: A fact table in the middle connected
           to a set of dimension tables
          Snowflake schema: A refinement of star schema
           where some dimensional hierarchy is normalized
           into a set of smaller dimension tables, forming a
           shape similar to snowflake
          Fact constellations: Multiple fact tables share
           dimension tables, viewed as a collection of stars,
           therefore called galaxy schema or fact constellation

Example of Star Schema
  Sales Fact Table
      keys:      time_key, item_key, branch_key, location_key
      measures:  units_sold, dollars_sold, avg_sales

  Dimension tables
      time:      time_key, day, day_of_the_week, month, quarter, year
      item:      item_key, item_name, brand, type, supplier_type
      branch:    branch_key, branch_name, branch_type
      location:  location_key, street, city, province_or_state, country
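
A minimal sketch of querying a star schema, using Python's built-in sqlite3 module with an in-memory database. Only two of the dimension tables are included, with a reduced set of columns; the table names time_dim, item_dim and sales_fact and the sample rows are assumptions made for illustration, not data from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (reduced column sets; names are illustrative)
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);

-- Fact table: foreign keys to the dimensions plus the measures
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);

-- Made-up sample rows
INSERT INTO time_dim VALUES (1, 'Q1', 2008), (2, 'Q2', 2008);
INSERT INTO item_dim VALUES (1, 'TV', 'Electric'), (2, 'Milk', 'Fresh food');
INSERT INTO sales_fact VALUES (1, 1, 30, 9000), (1, 2, 80, 1600), (2, 1, 10, 3000);
""")

# A typical star join: the fact table is joined to each dimension it needs,
# then the measure is aggregated by dimension attributes.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake schema the same query would need one more join for each normalized level, for example from item to a separate supplier table.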
Example of Snowflake Schema
  Sales Fact Table
      keys:      time_key, item_key, branch_key, location_key
      measures:  units_sold, dollars_sold, avg_sales

  Dimension tables
      time:      time_key, day, day_of_the_week, month, quarter, year
      item:      item_key, item_name, brand, type, supplier_key
      supplier:  supplier_key, supplier_type              (normalized out of item)
      branch:    branch_key, branch_name, branch_type
      location:  location_key, street, city_key
      city:      city_key, city, province_or_state, country   (normalized out of location)
Example of Fact Constellation
  Sales Fact Table
      keys:      time_key, item_key, branch_key, location_key
      measures:  units_sold, dollars_sold, avg_sales

  Shipping Fact Table
      keys:      time_key, item_key, shipper_key, from_location, to_location
      measures:  dollars_cost, units_shipped

  Dimension tables
      time:      time_key, day, day_of_the_week, month, quarter, year
      item:      item_key, item_name, brand, type, supplier_type
      branch:    branch_key, branch_name, branch_type
      location:  location_key, street, city, province_or_state, country
      shipper:   shipper_key, shipper_name, location_key, shipper_type
Measures: Three Categories
    Distributive: if the result derived by applying the function to n
     aggregate values is the same as that derived by applying the function
     on all the data without partitioning.
         e.g., count(), sum(), min(), max().
    Algebraic: if it can be computed by an algebraic function with M
     arguments (where M is a bounded integer), each of which is obtained
     by applying a distributive aggregate function.
         e.g., avg(), standard_deviation().
    Holistic: if there is no constant bound on the storage size needed to
     describe a sub-aggregate.
         e.g., median(), mode(), rank().



A Concept Hierarchy: Dimension (location)
  Levels, from most general to most detailed:
      all > region > country > city > office

  all:       all
  region:    Europe, ..., North_America
  country:   Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
  city:      Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
  office:    L. Chan, ..., M. Wind (Vancouver)
A Concept Hierarchy: Dimension
 (Distributive - count(), sum(), min(), max())
  all                          sum = 2100
    Europe                     sum = 1400
      Germany                  sum = 900
        Frankfurt 400, Berlin 200, Aachen 300
      Spain                    sum = 500
        Segovia 400, Madrid 100
    North_America              sum = 700
      Canada                   sum = 600
        Vancouver 200, Toronto 400
      Mexico                   sum = 100
        Mexico 100
A Concept Hierarchy: Dimension
 (Algebraic - avg(), standard_deviation())
  all                          avg = 262.5    (count = 8)
    Europe                     avg = 280      (count = 5)
      Germany                  avg = 300      (count = 3)
        Frankfurt 400, Berlin 200, Aachen 300
      Spain                    avg = 250      (count = 2)
        Segovia 400, Madrid 100
    North_America              avg = 233.33   (count = 3)
      Canada                   avg = 300      (count = 2)
        Vancouver 200, Toronto 400
      Mexico                   avg = 100      (count = 1)
        Mexico 100
A Concept Hierarchy: Dimension
  (Holistic - median(), mode(), rank())
  all                          median = 250
    Europe                     median = 300
      Germany                  median = 300
        Frankfurt 400, Berlin 200, Aachen 300
      Spain                    median = 250
        Segovia 400, Madrid 100
    North_America              median = 200
      Canada                   median = 300
        Vancouver 200, Toronto 400
      Mexico                   median = 100
        Mexico 100
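
The difference between the three categories can be sketched with the city-level values from the hierarchy examples above. The variable names are illustrative. The point of the sketch is which sub-aggregates are enough to compute the value for Europe: sub-sums for a distributive measure, (sum, count) pairs for an algebraic one, and the full set of values for a holistic one.

```python
from statistics import median

# City-level values from the hierarchy examples, grouped by country.
by_country = {
    "Germany": [400, 200, 300],
    "Spain": [400, 100],
    "Canada": [200, 400],
    "Mexico": [100],
}
europe = ["Germany", "Spain"]

# Distributive: the region sum is the sum of the country sums.
country_sum = {c: sum(v) for c, v in by_country.items()}
europe_sum = sum(country_sum[c] for c in europe)

# Algebraic: avg is computed from two distributive values, sum and count.
country_count = {c: len(v) for c, v in by_country.items()}
europe_avg = europe_sum / sum(country_count[c] for c in europe)

# Holistic: the median of the country medians is not the region median,
# so the country-level medians alone are not enough.
country_median = {c: median(v) for c, v in by_country.items()}
median_of_medians = median(country_median[c] for c in europe)
true_median = median(x for c in europe for x in by_country[c])

print(europe_sum, europe_avg)
print(median_of_medians, true_median)
```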
Multidimensional Data
         Sales volume as a function of product, month, and
          region
          Dimensions: Product, Location, Time
          Hierarchical summarization paths:
              Product:   Industry > Category > Product
              Location:  Region > Country > City > Office
              Time:      Year > Quarter > Month or Week > Day
          [Figure: cube of sales volume with the Product and Month axes labeled]
A Sample Data Cube
  [Figure: 3-D data cube of sales with three dimensions:
       Time:     1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
       Product:  TV, PC, VCR, sum
       Country:  U.S.A., Canada, Mexico, sum
   Highlighted cell: total annual sales of TV in U.S.A.]
Browsing a Data Cube
 Visualization
 OLAP capabilities
 Interactive manipulation




Typical OLAP Operations
    Drill up (roll up): summarize data
        by climbing up hierarchy or by dimension reduction
    Drill down (roll down): reverse of drill up
        from higher level summary to lower level summary or detailed data, or introducing new
         dimensions
    Slice and dice:
        project and select
        Slice (select on 1 dim.), dice (select on 2 or more dim.)
    Pivot (rotate):
        reorient the cube, visualization, 3D to series of 2D planes.
    Other operations
        drill across: involving (across) more than one fact table
        drill through: through the bottom level of the cube to its back-end relational tables
         (using SQL)
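
The operations above can be sketched on a tiny cube held in a dictionary. This is an illustration only: the function names roll_up, dice, slice_ and pivot and the tuple-keyed layout are assumptions, roll-up is shown only in its dimension-reduction form, and the values are the Q1 and Q2 cells of the earlier Thailand/Japan sales example.

```python
from collections import defaultdict

DIMS = ("quarter", "country", "category")

# (quarter, country, category) -> sales
cube = {
    ("Q1", "Thailand", "Food"): 1500, ("Q1", "Thailand", "NonFood"): 500,
    ("Q1", "Japan", "Food"): 2000, ("Q1", "Japan", "NonFood"): 1000,
    ("Q2", "Thailand", "Food"): 900, ("Q2", "Thailand", "NonFood"): 600,
    ("Q2", "Japan", "Food"): 1500, ("Q2", "Japan", "NonFood"): 1000,
}


def roll_up(cells, drop):
    """Roll up by dimension reduction: sum out the dimension named `drop`."""
    i = DIMS.index(drop)
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)


def dice(cells, **allowed):
    """Select on one or more dimensions: keep cells whose values are allowed."""
    return {
        key: value for key, value in cells.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }


def slice_(cells, dim, value):
    """Select on a single dimension with a single value."""
    return dice(cells, **{dim: {value}})


def pivot(table_2d):
    """Reorient a 2-D view by swapping its two axes."""
    return {(b, a): value for (a, b), value in table_2d.items()}


quarter_by_country = roll_up(cube, "category")
q1_only = slice_(cube, "quarter", "Q1")
japan_h1 = dice(cube, quarter={"Q1", "Q2"}, country={"Japan"})
country_by_quarter = pivot(quarter_by_country)

print(quarter_by_country)
print(q1_only)
print(japan_h1)
print(country_by_quarter)
```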


A Star-Net Query Model
  Each circle is called a footprint.
  [Figure: star-net with one radial line per dimension; the footprints on
   each line are listed here from the most general level (All) inward]
      Shipping Method:   All, Shipping type group, Shipping type
      Customer Orders:   All, Contracts, Order
      Customer Sex:      All, Male/Female
      Time:              All, Annually, Quarterly, Daily
      Product:           All, Product line, Product group, Item
      Location:          All, Region, Country, City
      Promotion:         All, Promotion type
      Organization:      All, Division, District, Salesperson
An Example
     Three concepts in data cubes are                                 * A multi-dimensional data model

     (1) Multidimension   (2) Hierarchy   (3) Measure

     Fact Table (4 dimensions, 2 measures)
       Date        Branch    Item     Buyer           Units sold   Dollars sold
       1/1/2008    London    VCD      First Company       20           5000
       1/1/2008    Bangkok   TV       First Company       30           9000
       10/1/2008   London    Ham      First Company       20           1000
       4/2/2008    London    Milk     First Company       80           1600
       15/2/2008   Bangkok   VCD      Best Company        30           7500
       2/5/2008    Bangkok   Orange   Best Company        20            500

     Location Dimension Table (3 levels)
       Branch     Country     Continent
       London     UK          Europe
       Glasgow    UK          Europe
       Berlin     Germany     Europe
       Bangkok    Thailand    Asia
       Phuket     Thailand    Asia
       Tokyo      Japan       Asia

     Product Dimension Table (3 levels)
       Item      Subcategory       Category
       VCD       Electric          Non-Food
       TV        Electric          Non-Food
       Shirt     Clothes           Non-Food
       Ham       Processed food    Food
       Milk      Fresh food        Food
       Orange    Fresh food        Food

     Customer Dimension Table (2 levels)
       Buyer             Buyer Group
       First Company     Group 1
       Second Company    Group 1
       Third Company     Group 1
       Best Company      Group 2
       Good Company      Group 2
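A minimal sketch of how the fact table and its dimension tables work together, assuming the tables are held as plain Python structures (the names FACT, LOCATION, PRODUCT and the function below are illustrative, not part of the lecture). Each fact row is looked up in the Location and Product dimensions and the Dollars sold measure is summed at the (country, category) level.

```python
from collections import defaultdict

# Fact rows from the slide: (date, branch, item, buyer, units_sold, dollars_sold)
FACT = [
    ("1/1/2008", "London", "VCD", "First Company", 20, 5000),
    ("1/1/2008", "Bangkok", "TV", "First Company", 30, 9000),
    ("10/1/2008", "London", "Ham", "First Company", 20, 1000),
    ("4/2/2008", "London", "Milk", "First Company", 80, 1600),
    ("15/2/2008", "Bangkok", "VCD", "Best Company", 30, 7500),
    ("2/5/2008", "Bangkok", "Orange", "Best Company", 20, 500),
]

# Dimension lookups (only the branches and items that occur in FACT).
LOCATION = {"London": ("UK", "Europe"), "Bangkok": ("Thailand", "Asia")}  # branch -> (country, continent)
PRODUCT = {"VCD": "Non-Food", "TV": "Non-Food", "Ham": "Food",
           "Milk": "Food", "Orange": "Food"}                              # item -> category


def dollars_by_country_and_category(fact):
    """Sum the 'dollars sold' measure at the (country, category) level."""
    totals = defaultdict(int)
    for _date, branch, item, _buyer, _units, dollars in fact:
        country, _continent = LOCATION[branch]
        totals[(country, PRODUCT[item])] += dollars
    return dict(totals)


summary = dollars_by_country_and_category(FACT)
for key in sorted(summary):
    print(key, summary[key])
```

The same loop could aggregate at any other level (continent, subcategory, buyer group) by swapping the lookup, which is what the hierarchies in the dimension tables are for.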
OLAP Operations (Drill down vs. Drill up)
   Time at year level, all customers:
     Time     Thailand             Japan                Total
              Food     NonFood     Food     NonFood
     2006      2400     2200       11000     5000       20600
     2007      4500     3200       12000     6000       25700
     2008      5600     2900       10000     5500       24000
     Total    12500     8300       33000    16500       70300

   Drill down (roll down) on Time, from year to quarter, for 2008:
     Time          Thailand             Japan                Total
                   Food     NonFood     Food     NonFood
     Q1            1500      500        2000     1000        5000
     Q2             900      600        1500     1000        4000
     Q3            1200      800        4000     2000        8000
     Q4            2000     1000        2500     1500        7000
     Total 2008    5600     2900       10000     5500       24000

   Drill down (roll down) again, from quarter to month, for Q1:
     Time          Thailand             Japan                Total
                   Food     NonFood     Food     NonFood
     January        700      200         800      500        2200
     February       500      100         700      200        1500
     March          300      200         500      300        1300
     Total Q1      1500      500        2000     1000        5000

   Drill up (roll up) goes the other way: month to quarter, quarter to year.

   Star-net dimensions and their levels (most general first):
     Time:      all, year, quarter, month, day
     Location:  all, continent, country, branch
     Product:   all, category, subcategory, item
     Customer:  all, buyer group, buyer
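A small sketch of roll up as an aggregation along one hierarchy, using the monthly Q1 figures from this slide. The dictionary layout and the names MONTHLY, MONTH_TO_QUARTER and roll_up_time are assumptions made for the example; drill down is simply reading the finer-grained table again.

```python
from collections import defaultdict

# Monthly cells from the slide: (month, country, category) -> sales
MONTHLY = {
    ("January", "Thailand", "Food"): 700, ("January", "Thailand", "NonFood"): 200,
    ("January", "Japan", "Food"): 800, ("January", "Japan", "NonFood"): 500,
    ("February", "Thailand", "Food"): 500, ("February", "Thailand", "NonFood"): 100,
    ("February", "Japan", "Food"): 700, ("February", "Japan", "NonFood"): 200,
    ("March", "Thailand", "Food"): 300, ("March", "Thailand", "NonFood"): 200,
    ("March", "Japan", "Food"): 500, ("March", "Japan", "NonFood"): 300,
}

# One step of the Time hierarchy: month -> quarter.
MONTH_TO_QUARTER = {"January": "Q1", "February": "Q1", "March": "Q1"}


def roll_up_time(cells, parent_of):
    """Replace each time value by its parent level and sum the measure."""
    rolled = defaultdict(int)
    for (time_value, country, category), sales in cells.items():
        rolled[(parent_of[time_value], country, category)] += sales
    return dict(rolled)


quarterly = roll_up_time(MONTHLY, MONTH_TO_QUARTER)
for key in sorted(quarterly):
    print(key, quarterly[key])
```

The four quarterly cells reproduce the "Total Q1" row of the monthly table.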
OLAP Operations (Slice and Dice)
   Before: all customers
     Time     Thailand             Japan                Total
              Food     NonFood     Food     NonFood
     2006      2400     2200       11000     5000       20600
     2007      4500     3200       12000     6000       25700
     2008      5600     2900       10000     5500       24000
     Total    12500     8300       33000    16500       70300

   Slice: Customer = Buyer Group 1
     Time     Thailand             Japan                Total
              Food     NonFood     Food     NonFood
     2006      1100     1500        7000     2000       11600
     2007      1200     1200        8000     3000       13400
     2008      3000     1200        5000     1800       11000
     Total     5300     3900       20000     6800       36000

   Dice: Time = January and Customer = Buyer Group 1
     Time     Thailand             Japan                Total
              Food     NonFood     Food     NonFood
     2006       200      200         800      300        1500
     2007       100      100         900      400        1500
     2008       300      200         400      100        1000
     Total      600      500        2100      800        4000

   [Star-net diagram: Time, Location, Product, Customer]
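A sketch of slice and dice as selections on a cube stored as a dictionary. The Group 1 values for 2006 come from the slide; the Group 2 values are not on the slide and are derived here by subtracting Group 1 from the all-customers row, so treat them as an assumption. The names DIMS, CUBE, slice_cube and dice_cube are illustrative.

```python
DIMS = ("year", "country", "category", "buyer_group")

# 2006 cells. Group 1 is from the slide; Group 2 = all customers minus Group 1 (derived).
CUBE = {
    (2006, "Thailand", "Food", "Group 1"): 1100,
    (2006, "Thailand", "NonFood", "Group 1"): 1500,
    (2006, "Japan", "Food", "Group 1"): 7000,
    (2006, "Japan", "NonFood", "Group 1"): 2000,
    (2006, "Thailand", "Food", "Group 2"): 1300,
    (2006, "Thailand", "NonFood", "Group 2"): 700,
    (2006, "Japan", "Food", "Group 2"): 4000,
    (2006, "Japan", "NonFood", "Group 2"): 3000,
}


def dice_cube(cube, **allowed):
    """Dice: keep cells whose value on every named dimension is in the allowed set."""
    position = {name: i for i, name in enumerate(DIMS)}
    return {
        key: value
        for key, value in cube.items()
        if all(key[position[dim]] in values for dim, values in allowed.items())
    }


def slice_cube(cube, dim, value):
    """Slice: a selection on exactly one dimension with a single value."""
    return dice_cube(cube, **{dim: {value}})


group1 = slice_cube(CUBE, "buyer_group", "Group 1")
thai_group1 = dice_cube(CUBE, buyer_group={"Group 1"}, country={"Thailand"})
print(sum(group1.values()), sum(thai_group1.values()))
```

The slice total matches the 2006 row of the Buyer Group 1 table (11600); the dice narrows it further to the Thailand cells.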
OLAP Operations (Pivot)
   Before pivot (rows: Time; columns: Location, Product), all customers:
     Time     Thailand             Japan                Total
              Food     NonFood     Food     NonFood
     2006      2400     2200       11000     5000       20600
     2007      4500     3200       12000     6000       25700
     2008      5600     2900       10000     5500       24000
     Total    12500     8300       33000    16500       70300

   After pivot (rows: Product; columns: Location, Time), all customers:
     Product    Thailand                Japan                   Total
                2006    2007    2008    2006    2007    2008
     Food       2400    4500    5600    11000   12000   10000   45500
     NonFood    2200    3200    2900     5000    6000    5500   24800
     Total      4600    7700    8500    16000   18000   15500   70300

   [Star-net diagram: Time, Location, Product, Customer]
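Pivot changes only how the same cells are laid out on rows and columns. A minimal sketch, assuming the cube is a dictionary keyed by (year, country, category); the helper name lay_out and the axis constants are hypothetical.

```python
# Cells from the slide: (year, country, category) -> sales
CELLS = {
    (2006, "Thailand", "Food"): 2400, (2006, "Thailand", "NonFood"): 2200,
    (2006, "Japan", "Food"): 11000, (2006, "Japan", "NonFood"): 5000,
    (2007, "Thailand", "Food"): 4500, (2007, "Thailand", "NonFood"): 3200,
    (2007, "Japan", "Food"): 12000, (2007, "Japan", "NonFood"): 6000,
    (2008, "Thailand", "Food"): 5600, (2008, "Thailand", "NonFood"): 2900,
    (2008, "Japan", "Food"): 10000, (2008, "Japan", "NonFood"): 5500,
}
YEAR, COUNTRY, CATEGORY = 0, 1, 2  # positions inside each key


def lay_out(cells, row_axis, col_axes):
    """Arrange cells as {row value: {column tuple: measure}}."""
    table = {}
    for key, value in cells.items():
        column = tuple(key[axis] for axis in col_axes)
        table.setdefault(key[row_axis], {})[column] = value
    return table


by_year = lay_out(CELLS, YEAR, (COUNTRY, CATEGORY))      # layout before the pivot
by_category = lay_out(CELLS, CATEGORY, (COUNTRY, YEAR))  # layout after the pivot

for row, columns in by_category.items():
    print(row, sum(columns.values()))
```

Both layouts hold the same twelve cells, so the grand total stays 70300 whichever axis is on the rows.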
OLAP Operations (Drill Across)
   Sales cube, all customers:
     Time     Thailand             Japan                Total
              Food     NonFood     Food     NonFood
     2006      2400     2200       11000     5000       20600
     2007      4500     3200       12000     6000       25700
     2008      5600     2900       10000     5500       24000
     Total    12500     8300       33000    16500       70300

   Drill across to the Purchase cube, all customers:
     Time     Thailand             Japan                Total
              Food     NonFood     Food     NonFood
     2006      1500     1500        6000     3000       12000
     2007      3500     2500        8000     4000       18000
     2008      4000     1500        5000     3000       13500
     Total     9000     5500       19000    10000       43500

   [Figure: the Sales cube and the Purchase cube share the Product and Location
    dimensions; drill across moves from one cube to the other.]
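One way to picture drill across is as a join of two cubes on the dimensions they share. The sketch below uses only the yearly totals from the two tables on this slide and joins them on the year; the function name drill_across and the output layout are assumptions for illustration.

```python
# Yearly totals from the slide (all customers, all locations, all products).
SALES = {2006: 20600, 2007: 25700, 2008: 24000}
PURCHASE = {2006: 12000, 2007: 18000, 2008: 13500}


def drill_across(first, second, first_name, second_name):
    """Line up the measures of two cubes on their shared dimension values."""
    shared = sorted(first.keys() & second.keys())
    return {key: {first_name: first[key], second_name: second[key]} for key in shared}


combined = drill_across(SALES, PURCHASE, "sales", "purchase")
for year, measures in combined.items():
    print(year, measures)
```

With the measures side by side, a derived figure such as sales minus purchases per year becomes a simple per-row computation.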
OLAP Operations (Drill Through)
   Sales cube, all customers:
     Time     Thailand             Japan                Total
              Food     NonFood     Food     NonFood
     2006      2400     2200       11000     5000       20600
     2007      4500     3200       12000     6000       25700
     2008      5600     2900       10000     5500       24000
     Total    12500     8300       33000    16500       70300

   In order to see the detail in the underlying relational table, we can
   perform ‘drill through’.

   Drill through result (relational table):
     Date        Branch    Item       Buyer           Units sold   Dollars sold
     1/1/2008    Phuket    VCD        First Company       10            250
     1/1/2008    Bangkok   TV         First Company       30            900
     10/1/2008   Phuket    TV         First Company       10            300
     4/2/2008    Phuket    Stereo     First Company       40            200
     15/2/2008   Bangkok   VCD        Best Company        30            750
     2/5/2008    Bangkok   Computer   Best Company        20            600

   [Figure: Sales cube with Product and Location dimensions]
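A short sketch of drill through as "return the detail rows behind a cell", using the relational rows shown on this slide. The row layout and the function drill_through are assumptions for the example; a real OLAP tool would issue a query against the source table instead.

```python
# Detail rows from the slide: (date, branch, item, buyer, units_sold, dollars_sold)
DETAIL = [
    ("1/1/2008", "Phuket", "VCD", "First Company", 10, 250),
    ("1/1/2008", "Bangkok", "TV", "First Company", 30, 900),
    ("10/1/2008", "Phuket", "TV", "First Company", 10, 300),
    ("4/2/2008", "Phuket", "Stereo", "First Company", 40, 200),
    ("15/2/2008", "Bangkok", "VCD", "Best Company", 30, 750),
    ("2/5/2008", "Bangkok", "Computer", "Best Company", 20, 600),
]
BRANCH = 1   # column positions inside a row
DOLLARS = 5


def drill_through(rows, branch):
    """Return the relational rows that contribute to the cell for one branch."""
    return [row for row in rows if row[BRANCH] == branch]


phuket_rows = drill_through(DETAIL, "Phuket")
phuket_dollars = sum(row[DOLLARS] for row in phuket_rows)
print(len(phuket_rows), phuket_dollars)
```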
An Example of OLAP tools
   [Screenshots of an OLAP tool. The annotated view shows measures laid out over
    2 dimensions: (1) Time, at the year level, and (2) Organization, at the
    continent level.]
Designing Data Warehouses
   Conceptual Modeling (Star schema, Snowflake, Fact
    constellations), Concept Hierarchy

   Physical Modeling (Data sources, Data storage, OLAP
    engine)




Multi-Tiered Architecture
   Data Sources
      Operational DBs and other sources
      Extract, Transform, Load into the warehouse
   Bottom Tier: Data Storage
      Data warehouse, data marts, metadata
      Monitor & Integrator
   Middle Tier: OLAP Engine
      OLAP server, which serves the front-end tools
   Top Tier: Front-End Tools
      Analysis, query, reports, data mining
Three Physical Data Warehouse Models
    Enterprise Warehouse
      collects all of the information about subjects spanning the entire
       organization
    Data Mart
      a subset of corporate-wide data that is of value to specific groups
       of users. Its scope is confined to specific, selected groups, such as
       a marketing data mart
            Independent vs. dependent (sourced directly from the warehouse)
             data mart
    Virtual Warehouse
      A set of views over operational databases
      Only some of the possible summary views may be materialized
      Easy to build, but requires excess capacity on the operational DB server.



OLAP Server Architectures
    Relational OLAP (ROLAP)
        Uses a relational or extended-relational DBMS to store and manage
         warehouse data, with OLAP middleware to support the missing pieces
        Includes optimization of the DBMS back end, implementation of aggregation
         navigation logic, and additional tools/services
        Greater scalability; summaries may be kept in the relational DB
    Multidimensional OLAP (MOLAP)
        Array-based multidimensional storage engine (sparse matrix techniques)
        Fast indexing to pre-computed summarized data
    Hybrid OLAP (HOLAP)
        User flexibility, e.g., low level: relational, high level: array
    Specialized SQL servers
        Specialized support for SQL queries over star/snowflake schemas in a
         read-only environment.

Efficient Data Cube Computation
   Data cube can be viewed as a lattice of cuboids
       The bottom-most cuboid is the base cuboid
       The top-most cuboid (apex) contains only one cell
       How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
              $T = \prod_{i=1}^{n} (L_i + 1)$
 Materialization of data cube
       Materialize every (cuboid) (full materialization), none (no materialization),
        or some (partial materialization)
       Selection of which cuboids to materialize
         Based on size, sharing, access frequency, etc.
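
The cuboid count T is the product of (L_i + 1) over the n dimensions, where the "+ 1" accounts for the virtual top level "all". A minimal sketch; the level counts below are an assumption read off the star-net used in this lecture (Time: year, quarter, month, day; Location: continent, country, branch; Product: category, subcategory, item; Customer: buyer group, buyer).

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1); the +1 is the 'all' level."""
    return prod(levels + 1 for levels in levels_per_dimension)


# Time has 4 levels, Location 3, Product 3, Customer 2 (excluding 'all').
with_hierarchies = count_cuboids([4, 3, 3, 2])

# Four dimensions with a single level each give the plain 2^n lattice.
without_hierarchies = count_cuboids([1, 1, 1, 1])

print(with_hierarchies, without_hierarchies)
```

Under these assumptions the cube has 5 x 4 x 4 x 3 = 240 cuboids, against 16 when the hierarchies are ignored, which is why full materialization quickly becomes expensive.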


Cube: A Lattice of Cuboids (Dimension)
    0-D (apex) cuboid:   all
    1-D cuboids:         (time), (item), (location), (customer)
    2-D cuboids:         (time, item), (time, location), (time, customer),
                         (item, location), (item, customer), (location, customer)
    3-D cuboids:         (time, item, location), (time, item, customer),
                         (time, location, customer), (item, location, customer)
    4-D (base) cuboid:   (time, item, location, customer)
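
The lattice can be enumerated mechanically: a k-D cuboid is any choice of k of the dimensions. A small sketch using the standard library; the function name cuboid_lattice is illustrative.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "customer")


def cuboid_lattice(dimensions):
    """Map k -> list of k-D cuboids (each cuboid is a tuple of dimension names)."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}


lattice = cuboid_lattice(DIMENSIONS)
for k, cuboids in lattice.items():
    print(f"{k}-D:", cuboids)
```

The empty tuple at level 0 stands for the apex cuboid "all", and the single tuple at level 4 is the base cuboid.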

Multi-way Array Aggregation for Cube
Computation
    Partition arrays into chunks (a small subcube which fits in memory).
    Compressed sparse array addressing: (chunk_id, offset)
    Compute aggregates in “multiway” fashion: visit the cube cells in the order that minimizes
     the number of times each cell is visited, which reduces memory access and storage cost.

    [Figure: a 3-D array over dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3),
     partitioned into 64 chunks numbered 1 to 64; the numbering runs along A first,
     then B, then C.]

    What is the best traversing order to do multi-way aggregation?
    The sizes of the dimensions A, B, and C are 6000, 1000, and 100000.
Multi-Way Array Aggregation for Cube
Computation (Cont.)
    Method: the planes should be sorted and computed
     according to their size in ascending order.
        Idea: keep the smallest plane in main memory; fetch and
         compute only one chunk at a time for the largest plane
    Limitation of the method: it works well only for a small
     number of dimensions
        If there are a large number of dimensions, “bottom-up computation”
         and iceberg cube computation methods can be explored. Iceberg
         cubes store only the cube partitions whose aggregate value is above
         some minimum support.
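
The core of the multiway idea is that one visit to a base cell can feed several aggregates at once. The sketch below shows only that simultaneous aggregation on a toy in-memory cuboid; it deliberately leaves out chunking and the plane ordering discussed above, and the names aggregate_planes and BASE are made up for the example.

```python
from collections import defaultdict


def aggregate_planes(base_cells):
    """One pass over the base cuboid that feeds the AB, AC and BC planes together."""
    ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
    for (a, b, c), value in base_cells.items():
        ab[(a, b)] += value  # aggregate away C
        ac[(a, c)] += value  # aggregate away B
        bc[(b, c)] += value  # aggregate away A
    return dict(ab), dict(ac), dict(bc)


# Toy 2 x 2 x 2 base cuboid holding the measures 1..8 (made-up values).
BASE = {(a, b, c): 4 * a + 2 * b + c + 1
        for a in range(2) for b in range(2) for c in range(2)}

ab, ac, bc = aggregate_planes(BASE)
print(ab[(0, 0)], ac[(0, 0)], bc[(0, 0)])
```

Each base cell is read once, yet all three 2-D planes end up complete, which is the saving the method aims for.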



Multi-Way Array Aggregation
(An Example)
    Suppose that a base cuboid has three dimensions A, B, C with the following
     numbers of cells: |A| = 6,000, |B| = 1,000, |C| = 50,000. Suppose that the
     dimensions A, B and C are partitioned into 6, 5 and 1000 portions for chunking,
     respectively.
      If each cube cell stores one measure with 4 bytes, what is the total size of the
        computed cube if the cube is very dense? That is, calculate the size of a base
        cuboid.
            The number of cells in the computed cube is
                $(6 \times 10^{3}) \times 10^{3} \times (5 \times 10^{4}) = 3 \times 10^{11}$ cells.
            The total size of the computed cube is
                $4 \times 3 \times 10^{11} = 1.2 \times 10^{12}$ bytes.
        State the order for computing the chunks in the cube that requires the least
         amount of space, and compute the total amount of main memory space required
         for computing the 2-D planes.

Multi-Way Array Aggregation for Cube
Computation (Cont.)
[Figure: the cube with axes A, B and C]

One chunk covers
   A: 6000/6     = 1000 cells
   B: 1000/5     =  200 cells
   C: 50000/1000 =   50 cells

The space we need for computing the cube is
   (1) All elements of plane AB:  6000 x 1000      =  6,000,000 cells = 24,000,000 bytes
   (2) One row of plane BC:       1000 x 50        =     50,000 cells =    200,000 bytes
   (3) One chunk of plane AC:     1000 x 50        =     50,000 cells =    200,000 bytes
   (4) One chunk of cube ABC:     1000 x 200 x 50  = 10,000,000 cells = 40,000,000 bytes

Total memory for keeping the result
   = 64,400,000 bytes = $6.44 \times 10^{7}$ bytes.
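The arithmetic of this example can be checked with a few lines of code. This is a sketch under the slide's own assumptions (|A| = 6,000, |B| = 1,000, |C| = 50,000, partitioned into 6, 5 and 1,000 portions, 4 bytes per cell); the variable names are illustrative.

```python
CELL_BYTES = 4
SIZES = {"A": 6000, "B": 1000, "C": 50000}
PARTS = {"A": 6, "B": 5, "C": 1000}

# Cells per chunk along each dimension: A 1000, B 200, C 50.
CHUNK = {d: SIZES[d] // PARTS[d] for d in SIZES}

# Size of the dense base cuboid.
base_cells = SIZES["A"] * SIZES["B"] * SIZES["C"]
base_bytes = base_cells * CELL_BYTES

# 2-D planes sorted by size in ascending order: the smallest stays in memory.
planes = {
    "AB": SIZES["A"] * SIZES["B"],
    "BC": SIZES["B"] * SIZES["C"],
    "AC": SIZES["A"] * SIZES["C"],
}
order = sorted(planes, key=planes.get)

memory_cells = (
    SIZES["A"] * SIZES["B"]                   # (1) whole AB plane
    + SIZES["B"] * CHUNK["C"]                 # (2) one row of the BC plane
    + CHUNK["A"] * CHUNK["C"]                 # (3) one chunk of the AC plane
    + CHUNK["A"] * CHUNK["B"] * CHUNK["C"]    # (4) one chunk of the ABC cube
)
memory_bytes = memory_cells * CELL_BYTES

print(order, base_bytes, memory_bytes)
```

The ordering AB, BC, AC is what justifies holding all of AB, one row of BC, and only a single chunk of AC.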
Efficient Processing OLAP Queries
   Determine which operations should be performed on
    the available cuboids:
       Transform drill, roll, etc. into the corresponding SQL and/or
        OLAP operations, e.g., dice = selection + projection
 Determine to which materialized cuboid(s) the
  relevant operations should be applied.
 Explore indexing structures and compressed vs.
  dense array structures in MOLAP
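
A sketch of the second step, choosing which materialized cuboid should answer a query: any cuboid that contains all the dimensions the query needs can serve it, and the smallest such cuboid is usually the cheapest to scan. The cuboid sizes below are hypothetical numbers chosen for the example, and pick_cuboid is an illustrative helper, not an API of any OLAP product.

```python
# Materialized cuboids and their sizes in cells (hypothetical figures).
MATERIALIZED = {
    frozenset({"time", "item", "location", "customer"}): 1_000_000,  # base cuboid
    frozenset({"time", "item", "location"}): 200_000,
    frozenset({"time", "location"}): 5_000,
    frozenset({"item"}): 50,
}


def pick_cuboid(materialized, query_dims):
    """Return the smallest materialized cuboid covering every queried dimension."""
    needed = frozenset(query_dims)
    candidates = [dims for dims in materialized if needed <= dims]
    return min(candidates, key=materialized.get) if candidates else None


print(sorted(pick_cuboid(MATERIALIZED, {"time", "location"})))
print(sorted(pick_cuboid(MATERIALIZED, {"item", "location"})))
```

A query on item and location cannot use the small (item) cuboid, so it falls back to (time, item, location) and aggregates time away.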


Metadata Repository
   Metadata is the data that defines warehouse objects. It
    includes the following kinds:
       The algorithms used for summarization
       The mapping from operational environment to the data
        warehouse
       Data related to system performance
           warehouse schema, view and derived data definitions
       Business data
           business terms and definitions, ownership of data, charging
            policies

Data Warehouse Back-End Tools and Utilities
    Extraction:
        get data from multiple, heterogeneous, and external sources
    Transformation:
        convert data from legacy or host format to warehouse format
    Load:
        sort, summarize, consolidate, compute views, check integrity, and
         build indices and partitions
    Cleaning:
        detect errors in the data and rectify them when possible
    Refresh:
        propagate the updates from the data sources to the warehouse



From OLAP to On Line Analytical Mining (OLAM)
   Why online analytical mining?
       High quality of data in data warehouses
         DW contains integrated, consistent, cleaned data
       Available information processing structure surrounding data warehouses
         ODBC, OLE DB, Web access, service facilities, reporting
         and OLAP tools
       OLAP-based exploratory data analysis
         mining with drilling, dicing, pivoting, etc.
       On-line selection of data mining functions
         integration and swapping of multiple mining functions,
         algorithms, and tasks.

An OLAM Architecture (Online Analytical Mining)
   Layer 4: User Interface
      User GUI API; receives the mining query and returns the mining result
   Layer 3: OLAP/OLAM
      OLAM engine and OLAP engine, on top of the Data Cube API
   Layer 2: MDDB
      Multidimensional database (MDDB) and metadata
      Filtering & integration through the Database API
   Layer 1: Data Repository
      Databases and data warehouse
      Data cleaning, data integration, filtering
Major Applications of Data Warehouses
Applications
 Online Analytical Processing (OLAP)
 Data Mining
 Customer Relationship Management (CRM)




Some commercial tools for
Data Warehouse
 Microsoft Excel’s PivotTable
 Business Objects’ Crystal Reports
 IBM Cognos
 Microsoft SQL Server Analysis Services
 Oracle Express
 Microsoft ProClarity
 Etc.




Exercise 1
   Given the following fact table
        Date       Branch    Item            Buyer          Units sold  Dollars sold
       1/01/2008   London     Chair     First Company           20         5000
       1/01/2008   Bangkok     TV       First Company           30         9000
      10/01/2008   London     Ham      Second Company           20         1000
       4/02/2008   London     Milk      First Company           80         1600
      15/02/2008   Bangkok    VCD       Best Company            30         7500
       2/05/2008    Berlin   Orange     Best Company            20         500
      12/06/2008   London     VCD       Good Company            20         5000
      14/06/2008   Bangkok     TV       First Company           10        12000
      16/07/2008    Phuket    VCD       Good Company            20         1000
      24/07/2008   London    Orange     First Company           30         6000
       2/08/2008   Bangkok    Table     Best Company            30         7500
      12/08/2008   Bangkok    Ham       Best Company            20         500
      12/10/2008   London     Table     Good Company            20         5000
      14/11/2008   Bangkok     TV       First Company           8          2000
      16/11/2008    Phuket    Ham       Good Company            5          2000
      24/12/2008   London     Milk      First Company           30         6000
62                                    Knowledge Management and Discovery © Kritsada Sriphaew
Exercise 1 (cont.)
   Given the following dimensions of the data
      [Star-net diagram] Dimensions and their levels (most general first):
        Location:  all, continent, country, branch
        Time:      all, year, quarter, month, day
        Product:   all, category, subcategory, item
        Customer:  all, buyer group, buyer
Exercise 1 (cont.)
   Question 1: Write concept hierarchies of time,
    location, product and customer.
       Buyer groups are
         Group 1: Quality Group; Good Company, Best Company
         Group 2: Number Group; First Company, Second Company
       Product’s categories are
         Cat 1: Food
             Subcategories are drink and non-drink
         Cat 2: NonFood
             Subcategories are electronic and non-electronic

Exercise 1 (cont.)
   Question 2: Write a result table of this star-net query
    model with sum dollars sold.
      [Star-net diagram: Location (all, continent, country, branch);
       Time (all, year, quarter, month, day);
       Product (all, category, subcategory, item);
       Customer (all, buyer group, buyer)]
Exercise 1 (cont.)
   Question 3: Write a result table of this star-net query
    model with average units sold.
   [Figure: star-net query model. Dimension lines and their levels:
    Location (branch, country, continent, all);
    Time (day, month, quarter, year, all);
    Product (item, subcategory, category, all);
    Customer (buyer, buyer group, all).]
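
A minimal sketch of how an average measure can be rolled up, since avg() is an algebraic measure built from sum() and count(). The rows are the six sample fact-table rows from the earlier multidimensional data model slide, not the exercise data. The grouping level (buyer, then all customers) is an assumption, because the footprints of the star-net diagram are not visible in this text. The helper name sum_count_by is hypothetical.

```python
from collections import defaultdict

# (date, branch, item, buyer, units_sold, dollars_sold)
# Sample rows copied from the earlier fact-table slide (assumed data).
FACTS = [
    ("1/1/2008", "London", "VCD", "First Company", 20, 5000),
    ("1/1/2008", "Bangkok", "TV", "First Company", 30, 9000),
    ("10/1/2008", "London", "Ham", "First Company", 20, 1000),
    ("4/2/2008", "London", "Milk", "First Company", 80, 1600),
    ("15/2/2008", "Bangkok", "VCD", "Best Company", 30, 7500),
    ("2/5/2008", "Bangkok", "Orange", "Best Company", 20, 500),
]


def sum_count_by(rows, key_index):
    """Group rows by one column and keep (sum of units_sold, row count)."""
    acc = defaultdict(lambda: [0, 0])
    for row in rows:
        cell = acc[row[key_index]]
        cell[0] += row[4]  # units_sold
        cell[1] += 1
    return {key: (total, n) for key, (total, n) in acc.items()}


# Average units sold at the buyer level of the Customer dimension.
by_buyer = sum_count_by(FACTS, 3)
avg_by_buyer = {buyer: total / n for buyer, (total, n) in by_buyer.items()}

# Roll up to "all customers" from the stored (sum, count) pairs,
# not by averaging the per-buyer averages.
total_units = sum(total for total, _ in by_buyer.values())
total_rows = sum(n for _, n in by_buyer.values())
avg_all = total_units / total_rows

for buyer, avg in avg_by_buyer.items():
    print(buyer, avg)
print("all", avg_all)
```

Keeping the (sum, count) pair per cell is what makes the roll-up correct: averaging the two buyer averages directly would give a different value from the average over all six rows.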

Exercise 1 (cont.)
   Question 4: Write the result table for this star-net query
    model, using the sum of dollars sold as the measure. Slice on
    the Bangkok location only.
   [Figure: star-net query model. Dimension lines and their levels:
    Location (branch, country, continent, all);
    Time (day, month, quarter, year, all);
    Product (item, subcategory, category, all);
    Customer (buyer, buyer group, all).]
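
A minimal sketch of a slice followed by a sum, assuming the six sample fact-table rows and the product dimension table from the earlier slides rather than the exercise data. Grouping by product category is an assumption, because the footprints of the star-net diagram are not visible in this text. The helper names slice_rows and sum_dollars_by_category are hypothetical.

```python
# (date, branch, item, buyer, units_sold, dollars_sold)
# Sample rows copied from the earlier fact-table slide (assumed data).
FACTS = [
    ("1/1/2008", "London", "VCD", "First Company", 20, 5000),
    ("1/1/2008", "Bangkok", "TV", "First Company", 30, 9000),
    ("10/1/2008", "London", "Ham", "First Company", 20, 1000),
    ("4/2/2008", "London", "Milk", "First Company", 80, 1600),
    ("15/2/2008", "Bangkok", "VCD", "Best Company", 30, 7500),
    ("2/5/2008", "Bangkok", "Orange", "Best Company", 20, 500),
]

# Product dimension: item -> category, from the earlier dimension table.
CATEGORY = {
    "VCD": "Non-Food",
    "TV": "Non-Food",
    "Ham": "Food",
    "Milk": "Food",
    "Orange": "Food",
}


def slice_rows(rows, branch):
    """Slice: select on a single dimension value (here, one branch)."""
    return [row for row in rows if row[1] == branch]


def sum_dollars_by_category(rows):
    """Roll item up to category and sum dollars_sold per category."""
    totals = {}
    for _date, _branch, item, _buyer, _units, dollars in rows:
        category = CATEGORY[item]
        totals[category] = totals.get(category, 0) + dollars
    return totals


bangkok_rows = slice_rows(FACTS, "Bangkok")
result = sum_dollars_by_category(bangkok_rows)
print(result)
```

The slice only filters rows; the aggregation afterwards is the same roll-up used for an unsliced query.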
Exercise 2
    (Data Cube Computation) Suppose that a base cuboid has three
     dimensions A, B, C with the following numbers of cells: |A| = 20,000,
     |B| = 8,000, |C| = 1,000. Also suppose that the dimensions A, B and C
     are partitioned into 10, 8 and 4 portions for chunking, respectively.
      Question 1: If each cube cell stores one 4-byte measure, what is
        the total size of the computed cube if the cube is very dense?
        That is, calculate the size of the base cuboid.
           What is the number of cells in the computed cube?
           What is the total size (in bytes) of the computed cube?
      Question 2: State the order for computing the chunks in the cube
        that requires the least amount of space, and compute the total
        amount of main memory space required for computing the 2-D
        planes.
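
The arithmetic behind both questions can be sketched as follows. This is a sketch under stated assumptions, not a model answer: it assumes the usual memory model of multiway array aggregation, where for a chunk scan order (innermost, middle, outermost) the plane of the two innermost dimensions is held whole, the plane of the innermost and outermost dimensions needs one row of chunks, and the plane of the middle and outermost dimensions needs one chunk. The names SIZE, PARTS and plane_memory are hypothetical.

```python
from itertools import permutations

SIZE = {"A": 20_000, "B": 8_000, "C": 1_000}  # cells per dimension
PARTS = {"A": 10, "B": 8, "C": 4}             # partitions per dimension
BYTES_PER_CELL = 4

# Question 1: a dense base cuboid has one cell per (a, b, c) combination.
base_cells = SIZE["A"] * SIZE["B"] * SIZE["C"]
base_bytes = base_cells * BYTES_PER_CELL


def chunk(dim):
    """Number of cells along one dimension inside a single chunk."""
    return SIZE[dim] // PARTS[dim]


def plane_memory(order):
    """Cells kept in memory for the three 2-D planes (assumed model).

    order = (innermost, middle, outermost) scan order of the chunks.
    """
    inner, middle, outer = order
    whole_plane = SIZE[inner] * SIZE[middle]   # completed only at the end
    one_row = SIZE[inner] * chunk(outer)       # completed per middle sweep
    one_chunk = chunk(middle) * chunk(outer)   # completed per inner sweep
    return whole_plane + one_row + one_chunk


# Question 2: try all six scan orders and keep the cheapest one.
best_order = min(permutations("ABC"), key=plane_memory)
best_cells = plane_memory(best_order)
best_bytes = best_cells * BYTES_PER_CELL

print("base cuboid cells:", base_cells)
print("base cuboid bytes:", base_bytes)
print("scan order (innermost first):", best_order)
print("2-D plane memory in cells:", best_cells)
print("2-D plane memory in bytes:", best_bytes)
```

Under this model the cheapest order scans the smallest dimension innermost, so that the smallest plane is the one held whole and the largest plane needs only a single chunk.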


Dbm630_Lecture02-03

  • 1. DBM630: Data Mining and Data Warehousing MS.IT. Rangsit University Semester 2/2011 Lecture 2&3 Data Warehouse and OLAP Technology by Kritsada Sriphaew (sriphaew.k AT gmail.com) 1
  • 2. Outline Part I: Basic Knowledge about Data Warehousing  Dimensional Modeling  Data Cube  Architecture Part II: OLAP and Cube Computations  OLAP, Data Cube and Data Analysis  Cube Computations  Demo PivotTable (bring laptop with MS Excel, if you have) 2 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 3. What is Data Warehouse?  Defined in many different ways, but not rigorously.  A decision support database that is maintained separately from the organization’s operational DB  Support information processing by providing a solid platform of consolidated, historical data for analysis.  “A data warehouse is a subject-oriented, integrated, time- variant, and nonvolatile collection of data in support of management’s decision-making process.” (definition by W. H. Inmon)  Data warehousing:  Process of constructing and using data warehouses 3 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 4. Four Properties of Data Warehouses  subject-oriented จัดเก็บเป็นเรื่องๆ  integrated รวบรวมอยู่ในรูปแบบเดียวกัน  time-variant มีข้อมูลตามมิติเวลา  non-volatile มีเสถียรภาพ ข้อมูลไม่สูญหาย 4 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 5. Subject-Oriented  Subject-Oriented Property  Organized around major subjects, such as customer, product, sales.  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.  Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. OPERATIONAL DB DATA WAREHOUSE • Loans • Customer • Savings • Vendor • Bank card • Product • Trust • Activity An application orientation A subject orientation 5 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 6. Integrated  Integrated Property  Constructed by integrating multiple, heterogeneous data sources  Relational databases, flat files, on-line transaction records  Data cleaning and data integration techniques are applied.  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources  e.g., Hotel price: currency, tax, breakfast covered, etc.  When data is moved to the warehouse, it is converted. 6 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 7. Time Variant/Non-Volatile  Time Variant Property  The time horizon for the data warehouse is significantly longer than that of operational systems.  Operational database: current value data.  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)  Every key structure in the data warehouse  Contains an element of time, explicitly or implicitly  But the key of operational data may or may not contain “time element”.  Non-Volatile Property  A physically separated store of data transformed from the operational environment.  Operational update of data does not occur in the data warehouse environment.  Does not require transaction processing, recovery, and concurrency control mechanisms  Requires only two operations in data accessing: initial loading of data and access of data. 7 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 8. Operational DBMS vs. Data Warehouse (OLTP vs. OLAP)  OLTP (on-line transaction processing)  Major task of traditional relational DBMS  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.  OLAP (on-line analytical processing)  Major task of data warehouse system  Data analysis and decision making  Distinct features (OLTP vs. OLAP):  User and system orientation: customer vs. market  Data contents: current, detailed vs. historical, consolidated  Design: ER + application vs. star + subject  View: current, local vs. evolutionary, integrated  Access patterns: create/select/insert/delete/update vs. read-only but complex queries 8 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 9. OLTP vs. OLAP OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date historical, detailed, flat relational summarized, isolated multidimensional integrated, consolidated usage repetitive ad-hoc access read/write lots of scans index/hash on primary key unit of work short, simple transaction complex query # records accessed tens millions #users thousands tens DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response 9 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 10. Heterogeneous DBMS vs. Data Warehouse  Traditional heterogeneous DB integration  Build wrappers/mediators on top of heterogeneous DBs  Query-driven approach  When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set  Complex information filtering, compete for resources  Data warehouse (DW)  Another database for high-performance analysis  Update-driven approach  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query/analysis 10 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 11. Heterogeneous DBMS vs. Data Warehouse OLTP Query-driven Approach Sales Wrapper/ Purchasing Mediators Production Heterogeneous Operational Database OLTP Update-driven Approach Sales Data Purchasing Warehouse Query Production OLAP Heterogeneous Operational Database 11 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 12. Why Separate Data Warehouse?  High performance for both systems  DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery  Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.  Different functions and different data:  Missing data: Decision support (OLAP) requires historical data which operational DBs (OLTP) do not typically maintain  Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources  Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled 12 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 13. * A multidimensional data model From Tables and Spreadsheets to Data Cubes  A data warehouse is based on a multidimensional data model which views data in the form of a data cube  A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions  Fact table contains measures (such as units_sold, dollars_sold) and keys to each of the related dimension tables  Dimension tables, such as location (branch, country, continent), or time (day, week, month, quarter, year) Date Branch Item Buyer Units sold Dollars sold Branch Country Continent 1/1/2008 London VCD First Company 20 5000 London UK Europe 1/1/2008 Bangkok TV First Company 30 9000 Glasgow UK Europe 10/1/2008 London Ham First Company 20 1000 Berlin Germany Europe 4/2/2008 London Milk First Company 80 1600 Bangkok Thailand Asia 15/2/2008 Bangkok VCD Best Company 30 7500 Phuket Thailand Asia 2/5/2008 Bangkok Orange Best Company 20 500 Tokyo Japan Asia 13 Fact Table Dimension Table Data Warehousing and Data Mining by Kritsada Sriphaew
  • 14. Three Concepts in Data Cubes Three concepts in data cubes are (1) Multidimension (2) Hierarchy Location Dimension Table (3 levels) (3) Measure Fact Table Branch Country Continent 4 dimensions 2 measures London UK Europe Glasgow UK Europe Date Branch Item Buyer Units sold Dollars sold Berlin Germany Europe 1/1/2008 London VCD First Company 20 5000 Bangkok Thailand Asia 1/1/2008 Bangkok TV First Company 30 9000 Phuket Thailand Asia 10/1/2008 London Ham First Company 20 1000 Tokyo Japan Asia 4/2/2008 London Milk First Company 80 1600 Item Subcategory Category 15/2/2008 Bangkok VCD Best Company 30 7500 VCD Electric Non-Food 2/5/2008 Bangkok Orange Best Company 20 500 TV Electric Non-Food Buyer Buyer Group Shirt Clothes Non-Food First Company Group 1 Customer Dimension Table Ham Process food Food Second Company Group 1 (2 levels) Milk Fresh food Food Third Company Group 1 Orange Fresh food Food Best Company Group 2 Good Company Group 2 Product Dimension Table (3 levels) 14 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 15. Cuboids in Data Cubes  Cuboid concept is formed by the number of dimensions.  The top most 0-D cuboid, which holds the highest- level of summarization, is called an apex cuboid.  In data warehousing literature, an n-D base cube where n is the total number of dimensions is called a base cuboid.  The lattice of cuboids forms a data cube. 15 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 16. An Example of Cuboids (Dimension) (add vs. delete dimensions) Time Sales Sales 0-D (apex) cuboid Sales(all) Q1 5000 24000 Add dim. Delete dim. Q2 4000 Sales(Q1), Sales(Q2), Q3 8000 1-D cuboid Sales(Q3), Sales(Q4) Q4 7000 Add dim. Delete dim. Time Thailand Japan Sales(Q1, Thailand), Sales(Q1, Japan) Sales(Q2, Thailand), Sales(Q2, Japan) Q1 2000 3000 2-D cuboid Sales(Q3, Thailand), Sales(Q3, Japan) Q2 1500 2500 Sales(Q4, Thailand), Sales(Q4, Japan) Add dim. Q3 2000 6000 Delete dim. Q4 3000 4000 3-D cuboid Thailand Japan Sales(Q1, Thailand, Food), Sales(Q1, Thailand, NonFood) Time Sales(Q2, Thailand, Food), Sales(Q2, Thailand, NonFood) Food NonFood Food NonFood Sales(Q3, Thailand, Food), Sales(Q3, Thailand, NonFood) Q1 1500 500 2000 1000 Sales(Q4, Thailand, Food), Sales(Q4, Thailand, NonFood) Sales(Q1, Japan, Food), Sales(Q1, Japan, NonFood) Q2 900 600 1500 1000 Sales(Q2, Japan, Food), Sales(Q2, Japan, NonFood) Sales(Q3, Japan, Food), Sales(Q3, Japan, NonFood) Q3 1200 800 4000 2000 Sales(Q4, Japan, Food), Sales(Q4, Japan, NonFood) Q4 2000 1000 2500 1500 16 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 17. Data Cube: A Lattice of Cuboids (Dimension) all 0-D (apex) cuboid time product location customer 1-D cuboids time, item time, location item, location location, customer 2-D cuboids time, customer item, customer time, item, location time, location, customer 3-D cuboids time, item, customer item, location, customer 4-D(base) cuboid time, item, location, customer 17 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 18. Designing Data Warehouses  Conceptual Modeling (Star schema, Snowflake, Fact constellations), Concept Hierarchy  Physical Modeling (Data sources, Data storage, OLAP engine) 18 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 19. Conceptual Modeling of Data Warehouses  Modeling data warehouses: dimensions & measures  Star schema: A fact table in the middle connected to a set of dimension tables  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 19 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 20. Example of Star Schema time time_key item day item_key day_of_the_week Sales Fact Table item_name month brand quarter time_key type year supplier_type item_key branch_key branch location location_key branch_key location_key branch_name units_sold street branch_type city dollars_sold province_or_street country avg_sales Measures 20 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 21. Example of Snowflake Schema time time_key item day item_key supplier day_of_the_week Sales Fact Table item_name supplier_key month brand supplier_type quarter time_key type year item_key supplier_key branch_key branch location location_key location_key branch_key units_sold street branch_name city_key city branch_type dollars_sold city_key avg_sales city province_or_street Measures country 21 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 22. Example of Fact Constellation time Shipping Fact Table item time_key time_key day Sales Fact Table item_key day_of_the_week item_name item_key month time_key brand quarter type shipper_key year item_key supplier_type from_location branch_key to_location branch location_key location branch_key dollars_cost units_sold location_key branch_name street units_shipped branch_type dollars_sold city province_or_street shipper avg_sales country Measures shipper_key shipper_name location_key shipper_type 22 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 23. Measures: Three Categories  Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.  e.g., count(), sum(), min(), max().  Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.  e.g., avg(), standard_deviation().  Holistic: if there is no constant bound on the storage size needed to describe a sub-aggregate.  e.g., median(), mode(), rank(). 23 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 24. A Concept Hierarchy: Dimension (location) all all region Europe ... North_America country Germany ... Spain Canada ... Mexico city Frankfurt ... Vancouver ... Toronto office L. Chan ... M. Wind 24 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 25. A Concept Hierarchy: Dimension (Distributive - count(), sum(), min(), max()) all all sum = 2100 sum = 1400 sum = 700 region Europe North_America sum = 900 sum = 500 sum = 600 sum = 100 country Germany Spain Canada Mexico city Frankfurt Berlin Aechen Segovia Madrid Vancouver Toronto 400 200 300 400 100 200 400 Mexico 100 25 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 26. A Concept Hierarchy: Dimension (Algebraic - avg(), standard_deviation()) all all avg = 262.5 count=8 avg = 280 count=5 avg = 233.33 region Europe North_America count=3 avg = 300 avg = 250 avg = 300 avg = 100 country Germany Spain Canada Mexico count=3 count=2 count=1 count=2 city Frankfurt Berlin Aechen Segovia Madrid Vancouver Toronto 400 200 300 400 100 200 400 Mexico 100 26 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 27. A Concept Hierarchy: Dimension (Holistic - median(), mode(), rank()) all all median = 250 median = 300 median = 200 region Europe North_America median = 300 median = 250 median = 300 country Germany Spain Canada Mexico median = 100 city Frankfurt Berlin Aechen Segovia Madrid Vancouver Toronto 400 200 300 400 100 200 400 Mexico 100 27 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 28. Multidimensional Data  Sales volume as a function of product, month, and region Dimensions: Product, Location, Time Hierarchical summarization paths Industry Region Year Product Category Country Quarter Product City Month Week Office Day Month 28 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 29. A Sample Data Cube Total annual sales Time of TV in U.S.A. 1Qtr 2Qtr 3Qtr 4Qtr sum TV PC U.S.A VCR Country sum Canada Mexico sum 29 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 30. Browsing a Data Cube  Visualization  OLAP capabilities  Interactive manipulation 30 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 31. Typical OLAP Operations  Drill up (roll up): summarize data  by climbing up hierarchy or by dimension reduction  Drill down (roll down): reverse of drill up  from higher level summary to lower level summary or detailed data, or introducing new dimensions  Slice and dice:  project and select  Slice (select on 1 dim.), dice (select on 2 or more dim.)  Pivot (rotate):  reorient the cube, visualization, 3D to series of 2D planes.  Other operations  drill across: involving (across) more than one fact table  drill through: through the bottom level of the cube to its back-end relational tables (using SQL) 31 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 32. A Star-Net Query Model Each circle is called Shipping Method Customer Orders a footprint All All Shipping type Customer Sex group Contracts Shipping type All Order Male/Female Time Annually Product line All All Quarterly Daily Item Product group Product City Salesperson Country District Region Promotion type Division All All All Location Promotion Organization 32 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 33. An Example Three concepts in data cubes are * A multi-dimensional data model (1) Multidimension (2) Hierarchy Location Dimension Table (3 levels) (3) Measure Fact Table Branch Country Continent 4 dimensions 2 measures London UK Europe Glasgow UK Europe Date Branch Item Buyer Units sold Dollars sold Berlin Germany Europe 1/1/2008 London VCD First Company 20 5000 Dimension Table Bangkok Thailand Asia 1/1/2008 Bangkok TV First Company 30 9000 Phuket Thailand Asia 10/1/2008 London Ham First Company 20 1000 Tokyo Japan Asia 4/2/2008 London Milk First Company 80 1600 Item Subcategory Category 15/2/2008 Bangkok VCD Best Company 30 7500 VCD Electric Non-Food 2/5/2008 Bangkok Orange Best Company 20 500 TV Electric Non-Food Buyer Buyer Group Shirt Clothes Non-Food First Company Group 1 Customer Dimension Table Ham Process food Food Second Company Group 1 (2 levels) Milk Fresh food Food Third Company Group 1 Orange Fresh food Food Best Company Group 2 Good Company Group 2 Product Dimension Table (3 levels) 33 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 34. OLAP Operations (Drill down vs. Drill up) All customers All customers Thailand Japan Total Thailand Japan Total Time Time Food NonFood Food NonFood Drill down Food NonFood Food NonFood 2006 2400 2200 11000 5000 20600 Roll down Q1 1500 500 2000 1000 5000 2007 4500 3200 12000 6000 25700 Q2 900 600 1500 1000 4000 2008 5600 2900 10000 5500 24000 Q3 1200 800 4000 2000 8000 Total 12500 8300 33000 16500 70300 Drill up Q4 2000 1000 2500 1500 7000 Location Roll up Total 2008 5600 2900 10000 5500 24000 all Drill down Drill up Roll down Roll up continent All customers country Thailand Japan Total Time branch Product Time all year month subcategory all Food NonFood Food NonFood quarter day item category January 700 200 800 500 2200 buyer February 500 100 700 200 1500 buyer group March 300 200 500 300 1300 all Total Q1 1500 500 2000 1000 5000 Customer 34 Data Warehousing and Data Mining by Kritsada Sriphaew
  • 35. OLAP Operations (Slice and Dice)
      All customers
        Time    Thailand           Japan              Total
                Food     NonFood   Food     NonFood
        2006    2400     2200      11000    5000      20600
        2007    4500     3200      12000    6000      25700
        2008    5600     2900      10000    5500      24000
        Total   12500    8300      33000    16500     70300

      Slice: Buyer Group 1
        Time    Thailand           Japan              Total
                Food     NonFood   Food     NonFood
        2006    1100     1500      7000     2000      11600
        2007    1200     1200      8000     3000      13400
        2008    3000     1200      5000     1800      11000
        Total   5300     3900      20000    6800      36000

      Dice: January and Buyer Group 1
        Time    Thailand           Japan              Total
                Food     NonFood   Food     NonFood
        2006    200      200       800      300       1500
        2007    100      100       900      400       1500
        2008    300      200       400      100       1000
        Total   600      500       2100     800       4000

      [Figure: star-net diagram over Location, Time, Product and Customer.]
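The difference between the two operations can be sketched on the six-row fact table of the earlier example slide: a slice fixes a single dimension (buyer group), while a dice restricts two or more dimensions (buyer group and month). The helper names below are assumptions made for illustration.

```python
# (date as d/m/yyyy, branch, item, buyer, units_sold, dollars_sold)
fact_rows = [
    ("1/1/2008", "London", "VCD", "First Company", 20, 5000),
    ("1/1/2008", "Bangkok", "TV", "First Company", 30, 9000),
    ("10/1/2008", "London", "Ham", "First Company", 20, 1000),
    ("4/2/2008", "London", "Milk", "First Company", 80, 1600),
    ("15/2/2008", "Bangkok", "VCD", "Best Company", 30, 7500),
    ("2/5/2008", "Bangkok", "Orange", "Best Company", 20, 500),
]

# Customer dimension: buyer -> buyer group
BUYER_GROUP = {"First Company": "Group 1", "Best Company": "Group 2"}


def in_group_1(row):
    """Condition on the Customer dimension (buyer group level)."""
    return BUYER_GROUP[row[3]] == "Group 1"


def in_january(row):
    """Condition on the Time dimension (month level); dates are d/m/yyyy."""
    return row[0].split("/")[1] == "1"


# Slice: restrict one dimension.
sliced = [r for r in fact_rows if in_group_1(r)]
# Dice: restrict two dimensions at once.
diced = [r for r in fact_rows if in_group_1(r) and in_january(r)]

slice_total = sum(r[5] for r in sliced)
dice_total = sum(r[5] for r in diced)
print(slice_total, dice_total)
```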
  • 36. OLAP Operations (Pivot)
      All customers, before the pivot (rows: Time; columns: Location, Product)
        Time    Thailand           Japan              Total
                Food     NonFood   Food     NonFood
        2006    2400     2200      11000    5000      20600
        2007    4500     3200      12000    6000      25700
        2008    5600     2900      10000    5500      24000
        Total   12500    8300      33000    16500     70300

      All customers, after the pivot (rows: Product; columns: Location, Time)
        Product   Thailand                Japan                   Total
                  2006    2007    2008    2006    2007    2008
        Food      2400    4500    5600    11000   12000   10000   45500
        NonFood   2200    3200    2900    5000    6000    5500    24800
        Total     4600    7700    8500    16000   18000   15500   70300

      [Figure: star-net diagram over Location, Time, Product and Customer.]
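A pivot only rearranges which dimensions label the rows and columns; the cell values stay the same. The sketch below stores the slide's cells keyed by (year, country, category) and regroups them so that product categories become the rows. The `pivot` function is a hypothetical helper, not an API of any OLAP tool.

```python
# Cells of the yearly table: (year, country, category) -> dollars sold
sales = {
    (2006, "Thailand", "Food"): 2400, (2006, "Thailand", "NonFood"): 2200,
    (2006, "Japan", "Food"): 11000, (2006, "Japan", "NonFood"): 5000,
    (2007, "Thailand", "Food"): 4500, (2007, "Thailand", "NonFood"): 3200,
    (2007, "Japan", "Food"): 12000, (2007, "Japan", "NonFood"): 6000,
    (2008, "Thailand", "Food"): 5600, (2008, "Thailand", "NonFood"): 2900,
    (2008, "Japan", "Food"): 10000, (2008, "Japan", "NonFood"): 5500,
}


def pivot(cells):
    """Rows become product categories; columns become (country, year)."""
    table = {}
    for (year, country, category), value in cells.items():
        table.setdefault(category, {})[(country, year)] = value
    return table


pivoted = pivot(sales)
row_totals = {category: sum(cols.values()) for category, cols in pivoted.items()}
print(row_totals)
```

The row totals change (they are now per category instead of per year), but the grand total is unchanged because no cell was added or removed.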
  • 37. OLAP Operations (Drill Across)
      Drill across moves between cubes that share dimensions (here Time, Product and Location), from the Sales cube to the Purchase cube.

      Sales cube, all customers
        Time    Thailand           Japan              Total
                Food     NonFood   Food     NonFood
        2006    2400     2200      11000    5000      20600
        2007    4500     3200      12000    6000      25700
        2008    5600     2900      10000    5500      24000
        Total   12500    8300      33000    16500     70300

      Purchase cube, all customers
        Time    Thailand           Japan              Total
                Food     NonFood   Food     NonFood
        2006    1500     1500      6000     3000      12000
        2007    3500     2500      8000     4000      18000
        2008    4000     1500      5000     3000      13500
        Total   9000     5500      19000    10000     43500
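One way to picture drill across is as a join of two cubes on their shared dimension values. The sketch below lines up the yearly totals of the Sales and Purchase cubes from the slide; `drill_across` is an illustrative helper name, and real tools would do this across full cubes rather than one-dimensional totals.

```python
# Yearly totals taken from the two tables on the slide.
sales_by_year = {2006: 20600, 2007: 25700, 2008: 24000}
purchase_by_year = {2006: 12000, 2007: 18000, 2008: 13500}


def drill_across(cubes):
    """Combine measures from several cubes for the keys they all share.

    cubes: dict mapping a cube name to {dimension_key: measure_value}.
    """
    shared_keys = set.intersection(*(set(cube) for cube in cubes.values()))
    return {
        key: {name: cube[key] for name, cube in cubes.items()}
        for key in sorted(shared_keys)
    }


combined = drill_across({"sales": sales_by_year, "purchase": purchase_by_year})
for year, measures in combined.items():
    print(year, measures)
```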
  • 38. OLAP Operations (Drill Through)
      In order to see the detail at the relational table, we can perform 'drill through'.

      Sales cube, all customers
        Time    Thailand           Japan              Total
                Food     NonFood   Food     NonFood
        2006    2400     2200      11000    5000      20600
        2007    4500     3200      12000    6000      25700
        2008    5600     2900      10000    5500      24000
        Total   12500    8300      33000    16500     70300

      Drill through to the underlying Sales table
        Date        Branch    Item       Buyer           Units sold   Dollars sold
        1/1/2008    Phuket    VCD        First Company   10           250
        1/1/2008    Bangkok   TV         First Company   30           900
        10/1/2008   Phuket    TV         First Company   10           300
        4/2/2008    Phuket    Stereo     First Company   40           200
        15/2/2008   Bangkok   VCD        Best Company    30           750
        2/5/2008    Bangkok   Computer   Best Company    20           600
  • 39. An Example of OLAP tools
  • 40. An Example of OLAP tools
      [Figure: screenshot of an OLAP tool showing measures over 2 dimensions: (1) Time, at year level, and (2) Organization, at continent level.]
  • 41. An Example of OLAP tools
  • 42. An Example of OLAP tools
  • 43. Designing Data Warehouses
      - Conceptual Modeling (Star schema, Snowflake, Fact constellations), Concept Hierarchy
      - Physical Modeling (Data sources, Data storage, OLAP engine)
  • 44. Multi-Tiered Architecture
      [Figure: three-tier architecture.
        Bottom tier: Data Sources (operational DBs, other sources) feed Data Storage (data warehouse, data marts, metadata, monitor & integrator) through Extract, Transform, Load.
        Middle tier: OLAP Engine (OLAP server), served from the data storage.
        Top tier: Front-End Tools (analysis, query, reports, data mining).]
  • 45. Three Physical Data Warehouse Models
      - Enterprise Warehouse
          - collects all of the information about subjects spanning the entire organization
      - Data Mart
          - a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
          - Independent vs. dependent (sourced directly from the warehouse) data mart
      - Virtual Warehouse
          - A set of views over operational databases
          - Only some of the possible summary views may be materialized
          - Easy to build but requires excess capacity on the operational DB server.
  • 46. OLAP Server Architectures
      - Relational OLAP (ROLAP)
          - Use relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
          - Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools/services
          - Greater scalability; we may keep summaries in the relational DB
      - Multidimensional OLAP (MOLAP)
          - Array-based multidimensional storage engine (sparse matrix techniques)
          - Fast indexing to pre-computed summarized data
      - Hybrid OLAP (HOLAP)
          - User flexibility, e.g., low level: relational, high level: array
      - Specialized SQL servers
          - Specialized support for SQL queries over star/snowflake schemas in a read-only environment.
  • 47. Efficient Data Cube Computation
      - Data cube can be viewed as a lattice of cuboids
          - The bottom-most cuboid is the base cuboid
          - The top-most cuboid (apex) contains only one cell
          - How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
              $T = \prod_{i=1}^{n} (L_i + 1)$
      - Materialization of data cube
          - Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
          - Selection of which cuboids to materialize
              - Based on size, sharing, access frequency, etc.
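The cuboid count can be checked with a few lines of code. The sketch below assumes the usual textbook convention that L_i counts the levels of dimension i without the virtual top level "all" (the +1 in the formula adds it back), and it takes the level counts from the star-net hierarchies shown on the OLAP operation slides: Time (year, quarter, month, day), Location (continent, country, branch), Product (category, subcategory, item) and Customer (buyer group, buyer).

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)


# Levels per dimension, excluding the virtual level "all" (assumed convention).
levels = {"time": 4, "location": 3, "product": 3, "customer": 2}

total_cuboids = count_cuboids(levels.values())      # (4+1)(3+1)(3+1)(2+1)
flat_cuboids = count_cuboids([1] * len(levels))     # no hierarchies: 2^n
print(total_cuboids, flat_cuboids)
```

With one level per dimension the formula reduces to 2^n, which is the size of the plain lattice of cuboids.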
  • 48. Cube: A Lattice of Cuboids (Dimension)
      0-D (apex) cuboid: all
      1-D cuboids: time; item; location; customer
      2-D cuboids: (time, item); (time, location); (time, customer); (item, location); (item, customer); (location, customer)
      3-D cuboids: (time, item, location); (time, item, customer); (time, location, customer); (item, location, customer)
      4-D (base) cuboid: (time, item, location, customer)
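The lattice is simply every subset of the dimension set, grouped by size. A minimal sketch that enumerates it with the standard library follows; the function name `cuboid_lattice` is an assumption for illustration.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "customer")


def cuboid_lattice(dims):
    """Map k -> list of k-D cuboids, each cuboid being a tuple of dimensions."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(DIMENSIONS)
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
```

The empty tuple at level 0 plays the role of the apex cuboid "all", and the single tuple at level 4 is the base cuboid.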
  • 49. Multi-way Array Aggregation for Cube Computation
      - Partition arrays into chunks (a chunk is a small subcube which fits in memory).
      - Compressed sparse array addressing: (chunk_id, offset)
      - Compute aggregates in "multiway" by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost.
      - What is the best traversing order to do multi-way aggregation? The sizes of the dimensions A, B, and C are 6000, 1000, and 100000.
      [Figure: 3-D array over dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64.]
  • 50. Multi-way Array Aggregation for Cube Computation
      [Figure: the 3-D array over dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64.]
  • 51. Multi-way Array Aggregation for Cube Computation
      [Figure: the 3-D array over dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64.]
  • 52. Multi-Way Array Aggregation for Cube Computation (Cont.)
      - Method: the planes should be sorted and computed according to their size in ascending order.
          - Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
      - Limitation of the method: it performs well only for a small number of dimensions
          - If there are a large number of dimensions, "bottom-up computation" and iceberg cube computation methods can be explored. Iceberg cubes store only cube partitions where the aggregate value is above some minimum support.
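To illustrate what an iceberg cube keeps, the sketch below computes every group-by over two dimensions of a tiny table and retains only the cells whose count reaches a minimum support. This is a deliberately naive version that computes all cells and filters afterwards; actual bottom-up methods prune during computation, which this example does not attempt. The rows and the function name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

DIMS = ("branch", "item")
rows = [
    ("London", "VCD"),
    ("Bangkok", "TV"),
    ("London", "Ham"),
    ("London", "Milk"),
    ("Bangkok", "VCD"),
    ("Bangkok", "Orange"),
]


def iceberg_cube(rows, dims, min_support):
    """For every cuboid, keep only cells with count >= min_support."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            # Count the rows falling into each cell of this cuboid.
            counts = Counter(tuple(row[i] for i in group) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= min_support}
            cube[tuple(dims[i] for i in group)] = kept
    return cube


cube = iceberg_cube(rows, DIMS, min_support=2)
for cuboid, cells in cube.items():
    print(cuboid, cells)
```

In this toy data the base cuboid (branch, item) ends up empty, because every branch/item pair occurs only once; that is the storage saving an iceberg cube aims for.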
  • 53. Multi-Way Array Aggregation (An Example)
      - Suppose that a base cuboid has three dimensions A, B, C with the following numbers of cells: |A| = 6,000, |B| = 1,000, |C| = 50,000. Suppose that the dimensions A, B and C are partitioned into 6, 5 and 1000 portions for chunking, respectively.
      - If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is very dense? That is, calculate the size of a base cuboid.
          - The number of cells in the computed cube is $6 \times 10^{3} \times 10^{3} \times 5 \times 10^{4} = 3 \times 10^{11}$ cells.
          - The total size of the computed cube is $4 \times 3 \times 10^{11} = 1.2 \times 10^{12}$ bytes.
      - State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.
  • 54. Multi-Way Array Aggregation for Cube Computation (Cont.)
      One chunk includes
        A: 6000/6 = 1000
        B: 1000/5 = 200
        C: 50000/1000 = 50
      The space we need for computing the cube is
        (1) All elements of plane AB: 6000 x 1000 cells = 6,000,000 cells = 24,000,000 bytes
        (2) One row of plane BC: 1000 x 50 cells = 50,000 cells = 200,000 bytes
        (3) One chunk of plane AC: 1000 x 50 cells = 50,000 cells = 200,000 bytes
        (4) One chunk (subcube) of ABC: 1000 x 200 x 50 cells = 10,000,000 cells = 40,000,000 bytes
      Total memory for keeping the result = 64,400,000 bytes = $6.44 \times 10^{7}$ bytes.
      [Figure: the chunked 3-D array over dimensions A, B and C.]
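The arithmetic of this worked example can be reproduced in a few lines, which also makes it easy to rerun with other dimension sizes. The sketch follows the slide's accounting (whole AB plane, one row of BC, one chunk of AC, one chunk of ABC, 4 bytes per cell); the variable names are illustrative.

```python
BYTES_PER_CELL = 4

size = {"A": 6000, "B": 1000, "C": 50000}      # cells per dimension
portions = {"A": 6, "B": 5, "C": 1000}         # chunk partitions per dimension
chunk = {d: size[d] // portions[d] for d in size}   # A: 1000, B: 200, C: 50

# Size of the dense base cuboid.
base_cells = size["A"] * size["B"] * size["C"]
base_bytes = base_cells * BYTES_PER_CELL

# Memory kept while aggregating, following the slide's four items.
plane_ab_cells = size["A"] * size["B"]                    # (1) whole plane AB
row_bc_cells = size["B"] * chunk["C"]                     # (2) one row of plane BC
chunk_ac_cells = chunk["A"] * chunk["C"]                  # (3) one chunk of plane AC
chunk_abc_cells = chunk["A"] * chunk["B"] * chunk["C"]    # (4) one chunk of ABC

total_bytes = BYTES_PER_CELL * (
    plane_ab_cells + row_bc_cells + chunk_ac_cells + chunk_abc_cells
)
print(base_cells, base_bytes, total_bytes)
```

AB is kept whole because it is the smallest of the three planes (6 million cells, against 50 million for BC and 300 million for AC), in line with the method stated on the preceding slide.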
  • 55. Efficient Processing OLAP Queries
      - Determine which operations should be performed on the available cuboids:
          - transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
      - Determine to which materialized cuboid(s) the relevant operations should be applied.
      - Explore indexing structures and compressed vs. dense array structures in MOLAP
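As a rough illustration of translating OLAP operations into SQL, the sketch below loads the six-row fact table of the earlier example slide into an in-memory SQLite database. A dice becomes a WHERE clause (selection) plus a column list (projection), and a roll-up becomes a GROUP BY. The table and column names are assumptions chosen for this example, not a schema given in the lecture.

```python
import sqlite3

rows = [
    ("1/1/2008", "London", "VCD", "First Company", 20, 5000),
    ("1/1/2008", "Bangkok", "TV", "First Company", 30, 9000),
    ("10/1/2008", "London", "Ham", "First Company", 20, 1000),
    ("4/2/2008", "London", "Milk", "First Company", 80, 1600),
    ("15/2/2008", "Bangkok", "VCD", "Best Company", 30, 7500),
    ("2/5/2008", "Bangkok", "Orange", "Best Company", 20, 500),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (date TEXT, branch TEXT, item TEXT, buyer TEXT, "
    "units INTEGER, dollars INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", rows)

# Dice = selection (WHERE on two dimensions) + projection (chosen columns).
dice = conn.execute(
    "SELECT branch, item, dollars FROM sales "
    "WHERE branch = 'Bangkok' AND buyer = 'Best Company' "
    "ORDER BY dollars"
).fetchall()

# Roll-up to branch level = GROUP BY with an aggregate on the measure.
roll_up = conn.execute(
    "SELECT branch, SUM(dollars) FROM sales GROUP BY branch ORDER BY branch"
).fetchall()

conn.close()
print(dice)
print(roll_up)
```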
  • 56. Metadata Repository
      - Metadata is the data defining warehouse objects. It includes the following kinds:
          - The algorithms used for summarization
          - The mapping from the operational environment to the data warehouse
          - Data related to system performance
              - warehouse schema, view and derived data definitions
          - Business data
              - business terms and definitions, ownership of data, charging policies
  • 57. Data Warehouse Back-End Tools and Utilities
      - Extraction: get data from multiple, heterogeneous, and external sources
      - Transformation: convert data from legacy or host format to warehouse format
      - Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
      - Cleaning: detect errors in the data and rectify them when possible
      - Refresh: propagate the updates from the data sources to the warehouse
  • 58. From OLAP to On Line Analytical Mining (OLAM)
      - Why online analytical mining?
          - High quality of data in data warehouses
              - DW contains integrated, consistent, cleaned data
          - Available information processing structure surrounding data warehouses
              - ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
          - OLAP-based exploratory data analysis
              - mining with drilling, dicing, pivoting, etc.
          - On-line selection of data mining functions
              - integration and swapping of multiple mining functions, algorithms, and tasks.
  • 59. An OLAM Architecture (Online Analytical Mining)
      [Figure: four-layer architecture.
        Layer 4, User Interface: the user sends a mining query and receives a mining result through the User GUI API.
        Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, on top of the Data Cube API.
        Layer 2, MDDB: multidimensional database and metadata, reached through the Database API with filtering & integration.
        Layer 1, Data Repository: databases and data warehouse, with data cleaning, data integration and filtering.]
  • 60. Major Applications of Data Warehouses
      Applications
      - Online Analytical Processing (OLAP)
      - Data Mining
      - Customer Relationship Management (CRM)
  • 61. Some commercial tools for Data Warehouse
      - Microsoft Excel's PivotTable
      - Crystal Reports' Business Objects
      - IBM Cognos
      - Microsoft SQL Server Analysis Services
      - Oracle Express
      - Microsoft ProClarity
      - Etc.
  • 62. Exercise 1
      - Given the following fact table
        Date         Branch    Item     Buyer            Units sold   Dollars sold
        1/01/2008    London    Chair    First Company    20           5000
        1/01/2008    Bangkok   TV       First Company    30           9000
        10/01/2008   London    Ham      Second Company   20           1000
        4/02/2008    London    Milk     First Company    80           1600
        15/02/2008   Bangkok   VCD      Best Company     30           7500
        2/05/2008    Berlin    Orange   Best Company     20           500
        12/06/2008   London    VCD      Good Company     20           5000
        14/06/2008   Bangkok   TV       First Company    10           12000
        16/07/2008   Phuket    VCD      Good Company     20           1000
        24/07/2008   London    Orange   First Company    30           6000
        2/08/2008    Bangkok   Table    Best Company     30           7500
        12/08/2008   Bangkok   Ham      Best Company     20           500
        12/10/2008   London    Table    Good Company     20           5000
        14/11/2008   Bangkok   TV       First Company    8            2000
        16/11/2008   Phuket    Ham      Good Company     5            2000
        24/12/2008   London    Milk     First Company    30           6000
      Knowledge Management and Discovery © Kritsada Sriphaew
  • 63. Exercise 1 (cont.)
      - Given the following dimensions of the data
          Location: all, continent, country, branch
          Time: all, year, quarter, month, day
          Product: all, category, subcategory, item
          Customer: all, buyer group, buyer
  • 64. Exercise 1 (cont.)
      - Question 1: Write concept hierarchies of time, location, product and customer.
          - Buyer groups are
              - Group1: Quality Group; good company, best company
              - Group2: Number Group; first company, second company
          - Product's categories are
              - Cat1: Food
                  - Subcategories are drink and non-drink
              - Cat2: NonFood
                  - Subcategories are electronic and non-electronic
  • 65. Exercise 1 (cont.)
      - Question 2: Write a result table of this star-net query model with sum dollars sold.
      [Figure: star-net query model over Location (all, continent, country, branch), Time (all, year, quarter, month, day), Product (all, category, subcategory, item) and Customer (all, buyer group, buyer).]
  • 66. Exercise 1 (cont.)
      - Question 3: Write a result table of this star-net query model with average units sold.
      [Figure: star-net query model over Location (all, continent, country, branch), Time (all, year, quarter, month, day), Product (all, category, subcategory, item) and Customer (all, buyer group, buyer).]
  • 67. Exercise 1 (cont.)
      - Question 4: Write a result table of this star-net query model with sum dollars sold. Slice only Bangkok location.
      [Figure: star-net query model over Location (all, continent, country, branch), Time (all, year, quarter, month, day), Product (all, category, subcategory, item) and Customer (all, buyer group, buyer).]
  • 68. Exercise 2
      - (Data Cube Computation) Suppose that a base cuboid has three dimensions A, B, C with the following numbers of cells: |A| = 20,000, |B| = 8,000, |C| = 1,000. Also suppose that the dimensions A, B and C are partitioned into 10, 8 and 4 portions for chunking, respectively.
      - Question 1: If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is very dense? That is, calculate the size of a base cuboid.
          - The number of cells in the computed cube?
          - The total size (bytes) of the computed cube?
      - Question 2: State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.