
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!

11,300 views

Published on

Joe Caserta went over the details inside the big data ecosystem and the Caserta Concepts Data Pyramid, which includes Data Ingestion, Data Lake/Data Science Workbench and the Big Data Warehouse. He then dove into the foundation of dimensional data modeling, which is as important as ever in the top tier of the Data Pyramid. Topics covered:

- The 3 grains of Fact Tables
- Modeling the different types of Slowly Changing Dimensions
- Advanced Modeling techniques like Ragged Hierarchies, Bridge Tables, etc.
- ETL Architecture.

He also talked about ModelStorming, a technique used to quickly convert business requirements into an Event Matrix and Dimensional Data Model.

This was a jam-packed, abbreviated version of four days of rigorous training on these techniques, taught in September by Joe Caserta (co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit) and Lawrence Corr (author of Agile Data Warehouse Design).

For more information, visit http://casertaconcepts.com/.

Published in: Technology
  • Great treatment and summary from two of my favorite frameworks, thank you!
  • This is a great mix of Kimball dimensional modelling and architecture with Corr requirements gathering.
  • adhoc query in hadoop


  1. 1. Big Data Warehousing Dimensional Modeling Still Matters!!!
  2. 2. Big Data Warehousing Meetup
  3. 3. Caserta Timeline, 1986 to 2015: began consulting in database programming and data modeling; working in data analysis, data warehousing and business intelligence since 1996; founded Caserta Concepts; web log analytics solution published in Intelligent Enterprise; co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley); launched Big Data practice; launched Data Science, Data Interaction and Cloud practices; launched the Big Data Warehousing (BDW) Meetup (NYC: 2,000+ members); established best practices for big data ecosystem implementations; dedicated to data governance techniques on big data (innovation); named among the Top 20 most powerful big data consulting firms (CIO Review); awarded for getting data out of SAP for data analytics; awarded Top Healthcare Analytics Solution Provider. 25+ years of hands-on experience building database solutions, with a laser focus on extending data analytics with big data solutions.
  4. 4. About Caserta Concepts • Technology innovation consulting company with expertise in: • Big Data Solutions • Data Warehousing • Business Intelligence • Solve highly complex business data challenges • Award-winning solutions • Business Transformation • Maximize Data Value • Innovation partner: • Transformative Data Strategies • Modern Data Engineering • Advanced Analytics • Strategic Consulting • Technical Architecture • Design and Build Solutions • Data Science & Analytics • Data on the Cloud • Data Interaction & Visualization
  5. 5. Client Portfolio: Retail/eCommerce & Manufacturing, Digital Media/AdTech, Education & Services, Finance, Healthcare & Insurance
  6. 6. The Evolution of Modern Data Engineering (architecture diagram): traditional BI feeds a traditional DW appliance via ETL from sources such as Enrollments, Claims and Finance, serving canned reporting and ad-hoc query; the modern stack adds a Big Data Lake, a horizontally scalable environment optimized for analytics, built on the Hadoop Distributed File System (HDFS) across many nodes, running Spark, MapReduce and Pig/Hive alongside NoSQL databases, and serving ETL, ad-hoc/canned reporting, big data analytics and data science.
  7. 7. The Data Pyramid, bottom tier to top: Landing Area (source data in “full fidelity”: raw machine data collection, collect everything); Data Lake (integrated sandbox); Data Science Workspace (agile business insight through data munging, machine learning, blending with external data, and development of to-be BDW facts); Big Data Warehouse (fully data governed and trusted: data ready to be turned into information, organized, well defined and complete, for arbitrary user-community queries and reporting). Data governance deepens up the pyramid: every tier carries a metadata catalog and ILM (who has access, how long do we “manage” it); upper tiers add data quality and monitoring (monitoring of completeness of data). Data has a different audience and usage pattern at each tier; all tiers work cohesively to comprise the big data ecosystem; all tiers are governed, but only the top tier is fully governed.
  8. 8. Data Warehousing Requirements • Improve analysis and measurements in terms of: • Timeliness • Flexibility • Level of Detail • Historical Completeness • Quality • Data requirements • Breadth – multiple data sources • Depth – ability to query the atomic detailed level • Consistency – reconcile differences between data sources, track history correctly
  9. 9. What is Dimensional Modeling? Dimensional modeling is a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access • Star-like structure: star join schema • Single fact table per star • Surrounded by 4-16 dimension tables. Example star from the slide: Sales_Fact (date_key, product_key, store_key, promotion_key, sale_amount, sale_quantity, cost_amount, customer_count) joined to Product (description, SKU_number, brand, subcategory, category, department, package, diet and shelf attributes), Promotion (promotion_name, price_reduction_type, ad/display/coupon types, promo cost and begin/end dates), Store (name, address, sales district and region, manager, floor plan and square-footage attributes) and Calendar (date, day/week/month/quarter/fiscal-period attributes, holiday_flag) dimensions.
  10. 10. …a bit of background on Star Join Optimization • A join strategy observed in most relational query optimizers • Rewrites queries to first pre-filter dimension tables • The details vary by RDBMS, but generally: • Dimension table keys are identified based on the query filters • These keys are then used to filter the relevant fact rows, leveraging indexes • The remaining fact rows are then hash-joined back to the dimensions • Weird, huh? But it results in 10x performance increases in dimensional queries
  11. 11. Star Join Optimization Continued. This query:
      SELECT store_name, SUM(sales)
      FROM f_sales f
      JOIN d_store s ON s.store_key = f.store_key
      JOIN d_date d ON d.date_key = f.date_key
      WHERE s.region = 'northeast' AND d.year = 2015
      GROUP BY store_name
      becomes:
      SELECT store_name, SUM(sales)
      FROM f_sales f
      JOIN d_store s ON s.store_key = f.store_key
      WHERE f.store_key IN (SELECT store_key FROM d_store WHERE region = 'northeast')
        AND f.date_key IN (SELECT date_key FROM d_date WHERE year = 2015)
      GROUP BY store_name
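To make the rewrite concrete, here is a runnable sketch in Python with SQLite, using the slide's table and column names on invented sample data; it checks that the original star query and the rewritten pre-filtered form return the same result:

```python
import sqlite3

# Tiny star schema matching the slide's example (data invented for illustration).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE d_store (store_key INT, store_name TEXT, region TEXT);
CREATE TABLE d_date  (date_key INT, year INT);
CREATE TABLE f_sales (store_key INT, date_key INT, sales REAL);
INSERT INTO d_store VALUES (1,'Albany','northeast'),(2,'Miami','southeast');
INSERT INTO d_date  VALUES (10,2015),(11,2014);
INSERT INTO f_sales VALUES (1,10,100.0),(1,11,50.0),(2,10,75.0);
""")

# Original star query: fact rows filtered through joined dimension attributes.
original = con.execute("""
SELECT s.store_name, SUM(f.sales)
FROM f_sales f
JOIN d_store s ON s.store_key = f.store_key
JOIN d_date  d ON d.date_key  = f.date_key
WHERE s.region = 'northeast' AND d.year = 2015
GROUP BY s.store_name
""").fetchall()

# Rewritten form: pre-filter each dimension, then restrict fact rows by key.
rewritten = con.execute("""
SELECT s.store_name, SUM(f.sales)
FROM f_sales f
JOIN d_store s ON s.store_key = f.store_key
WHERE f.store_key IN (SELECT store_key FROM d_store WHERE region = 'northeast')
  AND f.date_key  IN (SELECT date_key  FROM d_date  WHERE year = 2015)
GROUP BY s.store_name
""").fetchall()

assert original == rewritten == [('Albany', 100.0)]
```

Real optimizers perform this rewrite internally; the point is that pre-filtering small dimensions shrinks the set of fact rows touched before any join work happens.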
  12. 12. Row Oriented • Most relational databases are row-oriented: rows are persisted to disk contiguously • Compression is difficult because there is often little similarity across rows • If you retrieve one column of data from a row, the entire row must be deserialized (decoded from its on-disk storage format)
  13. 13. Column Oriented • Columns are stored contiguously • Compression and encoding is very efficient because values across columns are often very similar – very efficient for sparse data • If you retrieve one column, only that one column is deserialized, not the entire row
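A small Python sketch (sample data invented) of why the columnar layout compresses well: values within one column are similar, so even simple run-length encoding collapses them, while whole rows rarely repeat:

```python
from itertools import groupby

# Row-oriented: each record stored whole; column-oriented: each column's values
# stored together. Same logical data, two layouts (invented sample rows).
rows = [("2015-06-01", "NY", 0), ("2015-06-01", "NY", 1),
        ("2015-06-01", "NJ", 0), ("2015-06-02", "NJ", 5)]
columns = {name: [r[i] for r in rows]
           for i, name in enumerate(["date", "state", "returns"])}

def run_length_encode(values):
    """Collapse adjacent duplicates to (value, count) pairs -- cheap wins on similar data."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# Within a column, adjacent values repeat, so the encoding shrinks the data...
assert run_length_encode(columns["date"]) == [("2015-06-01", 3), ("2015-06-02", 1)]
# ...while whole rows are all distinct, so row-wise encoding buys nothing here.
assert run_length_encode(rows) == [(r, 1) for r in rows]
```

Real columnar formats (ORC, Parquet) use richer encodings plus general compression, but the intuition is the same, and reading one column deserializes only that column.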
  14. 14. MPPs • MPP = massively parallel processing • Large-scale parallel systems that present themselves as relational databases (SQL Fun) • Most leverage columnar storage (although some, like Teradata, are mainly row-oriented) • Very little performance tuning: generally no indexes or constraints
  15. 15. MPP Modeling Considerations • Generally love star schemas • Efforts should be made to keep dimensions small and to promote “broadcast” joins and distribution strategies (send copies of small dimensions to all nodes in the cluster) • De-normalization is sometimes favored to avoid large-table-to-large-table joins • Choose distribution keys that evenly balance the data across the cluster, and collocate your fact with the largest dimension. …and if you have to run Teradata: Teradata favors 3NF, to reduce row size (wide tables are not favored) and to play well with its optimizer.
  16. 16. Storage Optimization for Dimensional Modeling on Hadoop • “Late bind” is great, but don’t avoid ETL for queries with high SLAs and to maximize cluster resources and concurrency • Limit the amount of data that is scanned; limit the columns that are retrieved and deserialized • Use ORC or Parquet formats (column oriented) • Use partition strategies on fact tables that include the primary business date, to reduce the folders of data that are scanned
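The partition-pruning idea can be sketched in plain Python; the dict below stands in for date-named HDFS folders (folder names and amounts invented for illustration):

```python
from collections import defaultdict

# Fact rows laid out in partitions keyed by the primary business date
# (a stand-in for folders like /sales/sale_date=2015-06-01/ holding ORC/Parquet files).
partitions = defaultdict(list)
for sale_date, amount in [("2015-06-01", 100.0), ("2015-06-01", 40.0),
                          ("2015-06-02", 75.0), ("2015-07-01", 10.0)]:
    partitions[sale_date].append(amount)

def total_sales(date_prefix):
    """Partition pruning: only folders matching the date filter are ever scanned."""
    scanned = [d for d in partitions if d.startswith(date_prefix)]
    return sum(sum(partitions[d]) for d in scanned), scanned

june_total, june_scanned = total_sales("2015-06")
assert june_total == 215.0
assert sorted(june_scanned) == ["2015-06-01", "2015-06-02"]  # July folder never touched
```

Hive and Impala do exactly this when the partition column appears in the WHERE clause, which is why the partition key should be the date users actually filter on.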
  17. 17. Hadoop Modeling Considerations Very similar to MPP (in many ways Impala and Hive are MPP engines) • Keep dimension tables small - broadcast joins! • Denormalization is often favorable for very frequently accessed attributes and to avoid large table joins • Often the choice for normalization is based on data management concerns (to avoid reprocessing large amounts of fact/event data if an attribute changes)
  18. 18. Advantages of Dimensional Models • Data modeled from a user perspective • Data modeled from a query/measurement perspective • Simplicity – fewer tables, fewer joins, explicit facts • High performance query processing – star join optimization and aggregate management • Dimensional techniques for tracking history precisely • Breaks enterprise analytics and data warehouse development down into individual business processes worth measuring
  19. 19. Conformed Dimensions and Facts • Conformed means: • Exactly the same table or replicated copy • Subset of data with the same structure • Roll-up dimension with the same attributes • Attributes shared by conformed dimensions must have the same business meaning and their overlapping contents must be spelled identically so they can line up as common row headers in drill-across queries • Conformed Facts: Compatible calculation methods and units of measure – support additivity across business processes • If attributes or measures cannot be conformed, they must be uniquely labeled – weakens analytical potential and increases number of data items that must be maintained and explained
  20. 20. Data Mart Event/Bus Matrix • Single Document (single page) design overview of data warehouse. • Identify shared dimensions across the enterprise and communicate the importance of conforming them. • Coordinate separate data mart implementation teams/projects • Initial logical design showing business processes/subject areas against logical dimensions. • Planning/Design Tool - Modelstorming
  21. 21. Data Mart Event/Bus Matrix • Build a matrix of Subject Areas against Dimensions to identify common dimensions • If any dimension appears in multiple subject areas (and most do) it must be conformed.
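A tiny Python sketch of reading conformance needs off the bus matrix (the subject areas and dimensions below are invented for illustration): any dimension appearing in more than one row must be conformed:

```python
from collections import Counter

# Hypothetical event/bus matrix: business processes (rows) against dimensions (columns).
bus_matrix = {
    "Sales":     {"Date", "Product", "Store", "Promotion"},
    "Inventory": {"Date", "Product", "Store"},
    "Claims":    {"Date", "Customer", "Policy"},
}

# Count how many subject areas use each dimension; shared ones must be conformed.
usage = Counter(dim for dims in bus_matrix.values() for dim in dims)
conformed = sorted(d for d, n in usage.items() if n > 1)

assert conformed == ["Date", "Product", "Store"]
```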
  22. 22. Agile Data Warehouse Design • Understand the requirements and challenges of DW/BI design agile practices on DW/BI design • Use BEAM✲ to modelstorm BI data requirements • Use dimensional modeling • Plan, design and develop data warehouses incrementally • Identify and use dimensional design patterns • Find books and online material that support the techniques covered
  23. 23. Data-Driven Analysis - FAIL • (Operational) data source analysis • Data profiling tools • Data re-modeling • Avoids early user involvement • Packaged apps are challenging sources to profile • New/agile data sources may not even exist yet • “Build it and they will come” (Field of Dreams) • DW designs don’t meet BI user requirements
  24. 24. Reporting-Driven Analysis - FAIL (diagram): BI stakeholders give interview notes to a business analyst, who writes a requirement document for the data modeler, who produces data models and database schemas.
  25. 25. Modelstorming: Data Modeling + Brainstorming. The data modeler, developers and BI stakeholders work together: quick, inclusive, interactive, fun.
  26. 26. Why Model with Stakeholders • Identify significant events worth measuring • Scope and Prioritization • Discover how business events are described • Dimensions • Determine how they are measured • Measures, Hierarchies, Comparisons, KPIs • Unearth budgets, forecasts, targets and other user-controlled data sources • Extra data, Common summarization levels: Physical Optimization • Understand how information will be used • BI Applications, BI Tools • Get the data urgency, availability, duration requirements • ETL Techniques, Storage Requirements, Archival Strategy • Uncover the users’ success criteria and how to measure it • Service level Agreements, ROI • Communicate Project Goals and Manage Expectations
  27. 27. BEAM✲ Methodology: a structured, non-technical, collaborative working conversation directly with BI users. Inputs: the BI users’ business-process, organizational, hierarchical and data knowledge, plus focused data profiling by the BEAM✲ data modeler. Outputs: an event matrix plus logical and physical (Kimball-esque) dimensional data models.
  28. 28. The 7W BEAM✲ Building Blocks: 1. Who is involved? 2. What did they do? To what is it done? 3. When did it happen? 4. Where did it take place? 5. How many or how much was recorded; how can it be measured? 6. Why did it happen? 7. How did it happen, in what manner?
  29. 29. BEAM✲ – a Four Step Process • 1. Modelstorm Business Event(s) • 2. Modelstorm their Dimensions • 3. Modelstorm Event Sequences • 4. Star Schema Design
  30. 30.  Verbs – Events – Relationships – Fact Tables  Nouns – Details – Entities – Dimensions  Main Clause – Subject-Verb-Object  Prepositions – connect additional details to the main clause  Interrogatives –The 7Ws – Dimension Types  Business Vocabulary - no IT-Speak Design Using Natural Language
  31. 31. Identify Event Type Early
  32. 32. Adjust Conversation Based on Event Type  Discrete Event -> Transaction: instantaneous/short-duration, irregularly occurring events or transactions  Recurring Event -> Periodic Snapshot (measurement): regularly occurring events and ongoing processes, typically used to measure the cumulative effect of discrete events  Evolving Event -> Accumulating Snapshot (timeline): non-instantaneous/longer-duration, irregularly occurring events or transactions; represents current status and reflects adjustments
  33. 33. Dimensional Attributes • Subject and Operational ID: a high-cardinality descriptive attribute (e.g. Product Name) used to ‘uniquely’ identify each member, and its matching natural/business key(s) (BK) • Discriminators: descriptive attributes, mostly ‘physical’, used to differentiate members in place of the subject. For complex heterogeneous dimensions, many discriminators may not be valid for the entire population: exclusive attributes (Xn) • Categorical Information: mainly ‘logical’ labels used to segment or categorise members. Often represent characteristics (DCn,n) that control the validity of exclusive attributes
  34. 34. Using the 7Ws to Discover Dimensional Attributes
  35. 35. Modelstorming Event Sequences
  36. 36. Dimensional Data Modeling • Provide an introduction to data warehousing • Concepts • Terminology • Design • Fundamental dimensional modeling • Stars, Snowflakes • Facts, Dimensions • Slowly Changing Dimensions • Advanced techniques • Multi-valued dimensions • Variable-Depth Hierarchies • Project Lifecycle & Management • The Bus Matrix/Iterative development • Maintaining the data warehouse
  37. 37. Dimension Table Fundamentals • Dimensions explain Business processes. They describe ‘independent’ business entities in terms that are familiar to users. • A dimension record is a collection of text-like descriptors used to constrain and group fact table records. • Wide rows with numerous columns but small tables – usually less than a million rows. • Normally, there is exactly one value for each dimensional attribute associated with a fact record.
  38. 38. Fact Table Fundamentals • Record numeric measurements associated with important business activities, processes or events. • Support high performance access. • Provide detailed figures which can be analyzed by all relevant dimensions – allow drill down to detail and constraints on detail attributes. • Each fact table stores measurements that match a specific granularity.
  39. 39. Star Schema Layout
  40. 40. Data Model Design Canvas
  41. 41. Snowflake Schema  A snowflake schema is a dimensional model where one or more dimensions of a star schema have been normalized  The resulting additional lookup tables are known as outriggers. Example: a Product dimension (product_key, description, full_description, SKU_number, package_size, brand_key, department, diet_type, package_type_key, weight, weight_unit_of_measure) normalized into Brand (brand_key, brand_description, category_key), Category (category_key, category_description) and Package Type (package_type_key, package_type_description) outriggers.
  42. 42. Dimensional Modeling Steps • 1. Choose the Business Process • Usually a single source of data • 2. Identify the Granularity • What does the fact record mean? • 3. Select the Dimensions • All descriptive context that has a single value in the presence of the measurement, i.e., is true to the grain • 4. Define the Facts • Numeric additive measurements, true to the grain
  43. 43. Fact Table Granularity • Transactional: process or event based • Instantaneous/short-duration, irregularly occurring events or transactions • Periodic Snapshot: measurement based • Regularly occurring events, ongoing processes • Accumulating Snapshot: process/measurement based • Non-instantaneous/longer-duration, irregularly occurring events or transactions • Reflects adjustments and current status • Typically has multiple date dimensions, e.g. invoice date, shipment date, payment date • Line items from key documents, e.g. invoices, POs, claims
  44. 44. Transaction Facts • The most basic view of an operational system is the individual transaction • Low-level transactions fit the dimensional framework • Single fact: almost never add more numeric facts; instead add more transaction types. Not a schema change but a data content change. (Diagram: a Transaction fact with time_key, account_key, location_key, transaction_key, employee_key, audit_key, accountnum, transactionref and amount, joined to Time (day), Account, Location, Transaction, Employee and Audit dimensions.)
  45. 45. Transaction Facts • Good for:  analyzing behavior or efficiency in extreme detail  sequential behavior: fraud detection, cancellation warning  basket analysis • Poor for:  current status analysis. (Same transaction star as the previous slide.)
  46. 46. Snapshot Facts • Most transaction-level fact tables must be accompanied by a snapshot table to give a practical view of current and historical status. (Diagram: a Monthly_snapshot fact with time_key, account_key, status_key, audit_key, earned_revenue, transaction_count, ending_balance and avg_daily_balance, joined to Time (month), Account, Status and Audit dimensions.)
  47. 47. Snapshot Facts • Suppress transaction type, location… • Time is month rather than day • Add a Status dimension, e.g. New Account, Closed • More facts, open ended. (Same monthly snapshot star as the previous slide.)
  48. 48. Periodic Snapshot Facts • The grain of a Periodic snapshot can be Daily, Weekly, Monthly, Yearly, etc. • Usually an ‘open’ current rolling period • Facts are not revisited, only added • Select surrogate keys valid at the end of each period.
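A minimal Python sketch of building a monthly periodic snapshot from transaction-grain data (accounts, dates and amounts invented): each period row accumulates a transaction count and carries the balance as of the last transaction in that month:

```python
from collections import defaultdict

# Account transactions (account, date, amount) rolled up into a monthly
# periodic snapshot; column names follow the slide's monthly_snapshot example.
transactions = [
    ("A1", "2015-06-03",  500.0), ("A1", "2015-06-20", -100.0),
    ("A1", "2015-07-05",  250.0), ("A2", "2015-06-10",  300.0),
]

snapshot = defaultdict(lambda: {"transaction_count": 0, "ending_balance": 0.0})
running = defaultdict(float)                      # running balance per account
for account, date, amount in sorted(transactions, key=lambda t: t[1]):
    running[account] += amount
    row = snapshot[(account, date[:7])]           # grain: account x month
    row["transaction_count"] += 1
    row["ending_balance"] = running[account]      # balance after the month's last txn

assert snapshot[("A1", "2015-06")] == {"transaction_count": 2, "ending_balance": 400.0}
assert snapshot[("A1", "2015-07")] == {"transaction_count": 1, "ending_balance": 650.0}
```

Note the snapshot behavior the slide describes: once a period closes, its facts are not revisited; each load only adds the newly closed period's rows.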
  49. 49. Accumulating Snapshot Facts • An individual record is created when a shipment invoice is created. • Unknown dimensions are not applicable and their surrogate keys must point to the special record in the dimension corresponding to Not Applicable. • Over time, the record is revisited and the foreign keys are overwritten with keys pointing to dimension records with applicable values.
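A sketch of the accumulating snapshot lifecycle in Python, with invented keys and milestone names: the row is created at invoicing with milestone keys pointing at the Not Applicable member, then revisited and overwritten as milestones occur:

```python
NOT_APPLICABLE = 0   # special dimension member for milestones not yet reached

# Accumulating snapshot row created when the shipment invoice is cut
# (illustrative sketch; keys are plain integers of the form YYYYMMDD here).
order = {
    "invoice_date_key": 20150601,
    "shipment_date_key": NOT_APPLICABLE,
    "payment_date_key":  NOT_APPLICABLE,
    "status": "invoiced",
}

def record_milestone(row, key_column, date_key, status):
    """Revisit the row: overwrite the foreign key once the milestone occurs."""
    row[key_column] = date_key
    row["status"] = status

record_milestone(order, "shipment_date_key", 20150603, "shipped")
assert order["payment_date_key"] == NOT_APPLICABLE   # milestone not yet reached
record_milestone(order, "payment_date_key", 20150610, "paid")
assert order["status"] == "paid" and order["shipment_date_key"] == 20150603
```

This in-place overwriting is what distinguishes the accumulating grain from the other two: one row per process instance, updated over its lifetime.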
  50. 50. Insurance Grain Examples • Transaction grain: transaction_time_key, policy_key, customer_key, agent_key, coverage_key, covered_item_key, transaction_key, amount • Periodic Snapshot grain: reporting_month_key, policy_key, customer_key, agent_key, coverage_key, covered_item_key, status_key, earned_premium, incurred_claims, change_in_reserve, reserve_balance, number_transactions • Accumulating Line Item grain: effective_date_key, expiration_date_key, first_claim_date_key, last_payment_date_key, policy_key, customer_key, agent_key, coverage_key, covered_item_key, status_key, earned_premium_to_date, number_claims_to_date, claims_payments_to_date
  51. 51. Factless Fact Tables • Something happened but there was nothing to measure • Only possible quantity would be 1 • No money allocated to the event
  52. 52. The ‘Right’ Granularity • Makes query and analysis as simple as possible • Appeals to the user’s conceptual model of the business • Granular data is the most resilient to change • Granular data is the most dimensional • Granular facts are the most additive • There may not be a single ‘right’ granularity; don’t mix granularities: design separate fact tables.
  53. 53. Dimensions • Identify relevant dimensions that describe the business process • the 5Ws • Record the granularity for each dimension, including time. The granularity of the process determines the granularities of the dimensions. • Good dimensions have as many attributes as possible. 100+ for a customer or product is normal, but don’t get caught up in the detail initially • Resist snowflaking and normalization
  54. 54. Dimensional Attributes • Operational IDs and Descriptors • Natural/business keys and high cardinality descriptors i.e. product_name used to identify individual items • Discriminators • Descriptive attributes, mostly ‘physical’, used to differentiate items • Categorical Information • Mainly ‘logical’ labels used to segment or categorize items. Define known hierarchies and drill-down paths.
  55. 55. Dimensions – Handling Change • Slowly Changing Dimensions • Type 1. Overwrite History • Don’t care about previous values or mistake corrections – “as is” reporting • Type 2. Track History • Analyze historical facts using the dimensional values that were valid at the time. Analyze new facts using the current dimensional values – “as was” reporting. • Type 3. Alternative Realities • Analyze all facts using current or previous dimensional values. “as is or as previously” reporting.
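The Type 1 and Type 2 behaviors can be sketched in a few lines of Python (a hypothetical product dimension; surrogate keys are plain integers here): Type 1 overwrites in place, Type 2 expires the current row and inserts a new version under a fresh surrogate key:

```python
# Minimal SCD sketch on an invented product dimension.
dim_product = [
    {"product_key": 1, "sku": "P100", "category": "Snacks", "current": True},
]
next_key = 2  # next surrogate key to hand out

def scd_type1(sku, **updates):
    """Type 1: overwrite the current row; history is rewritten and lost."""
    for row in dim_product:
        if row["sku"] == sku and row["current"]:
            row.update(updates)

def scd_type2(sku, **updates):
    """Type 2: expire the current row and add a new version with a fresh key."""
    global next_key
    for row in dim_product:
        if row["sku"] == sku and row["current"]:
            row["current"] = False
            new_row = {**row, **updates, "product_key": next_key, "current": True}
            dim_product.append(new_row)
            next_key += 1
            return new_row

scd_type2("P100", category="Health Foods")     # track the change: history kept
assert [r["product_key"] for r in dim_product] == [1, 2]
scd_type1("P100", category="Health Food")      # correct a typo in place
assert [r["category"] for r in dim_product] == ["Snacks", "Health Food"]
```

Facts loaded before the change keep pointing at key 1, facts loaded after it at key 2, which is exactly the “as was” partitioning of history the slide describes.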
  56. 56. Type 1 SCD – Pros vs. Cons • Pros: processing similar to the operational system; appropriate for correcting mistakes • Cons: history is rewritten and lost; if overwritten attributes are used to define aggregates, the aggregates must be rebuilt
  57. 57. Type 2 SCD – Pros vs. Cons • Pros: gracefully tracks many changes to dimension values; each new record partitions history perfectly; no need to build historic aggregates • Cons: user queries must not use surrogate key values for filtering or sorting; browse queries must select distinct and count distinct; growth of the dimension table
  58. 58. Type 3 SCD – Pros vs. Cons • Pros: creates no additional dimension records • Cons: provides only “as is” and “as previous” analysis; requires restructuring of the table when the attribute policy changes
  59. 59. Why Use Surrogate Keys • Insulate data warehouse from production code glitches and administrative changes • Type 2 slowly changing dimensions • Ability to change the grain of the dimension if necessary • Allow dimensions to encode uncertainty – not known, not applicable – avoid using NULLs • Save space in fact tables • Join efficiently between fact and dimension tables • Enforce referential integrity via ETL process
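A sketch of a surrogate key lookup as used during fact loading (names invented for illustration): natural/business keys resolve to warehouse keys, late-arriving members get fresh keys, and a missing key maps to a special “not known” member rather than NULL:

```python
from itertools import count

# Surrogate key pipeline sketch: map natural keys to compact warehouse keys.
UNKNOWN_KEY = -1                  # "not known" member pre-seeded in the dimension
key_map = {"CUST-001": 1, "CUST-002": 2}
key_seq = count(start=3)          # next surrogate keys to assign

def lookup(natural_key):
    """Resolve a natural key to a surrogate key, never returning NULL."""
    if natural_key is None:
        return UNKNOWN_KEY                         # encode uncertainty explicitly
    if natural_key not in key_map:
        key_map[natural_key] = next(key_seq)       # late-arriving member: new key
    return key_map[natural_key]

assert lookup("CUST-002") == 2
assert lookup("CUST-999") == 3    # newly assigned surrogate
assert lookup(None) == UNKNOWN_KEY
```

Because facts carry only these small integers, the fact table stays narrow, joins are cheap, and the warehouse is insulated from changes to the source system's own identifiers.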
  60. 60. Design for Agile Iterations • Withstand changes in user behavior: • Atomic level detail facts – logical design independent of expected query patterns • Dimensions are symmetrically equal entry points • User interfaces and Query strategies are symmetrical • Cope gracefully with structural changes • Existing queries, reports and analytics run unaltered
  61. 61. Graceful Agile Extensibility? Starting from the retail sales star, the model absorbs change without breaking existing queries: • New dimensional attributes: add columns to a dimension (the slide adds shelf_width_cm, shelf_height_cm and shelf_depth_cm to Product) • New facts consistent with the grain: add measures to the fact table (the slide adds cost_amount to Sales_Fact) • New dimensions, single-valued for each fact record: add a foreign key to the fact table (the slide adds a Weather dimension with weather_key and weather_description) • New rollup attributes on existing dimensions (the slide adds Department to Store)
  62. 62. Junk/Abstract Dimensions. Remove flags from the fact table: • Collect together miscellaneous flags and non-additive fact junk • Group, if possible, junk that is correlated • Search the data for these correlations • Move the comments field to its own separate dimension
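A sketch of building a junk dimension in Python (flag names and sales rows invented): the miscellaneous flags are stripped off the fact rows and replaced with a single key into a small dimension holding only the flag combinations actually observed:

```python
# Junk dimension sketch: collapse miscellaneous fact-table flags into one
# dimension keyed by the distinct combinations present in the data.
fact_rows = [
    {"sale_amount": 10.0, "gift_wrap": "Y", "online_order": "N"},
    {"sale_amount": 25.0, "gift_wrap": "Y", "online_order": "N"},
    {"sale_amount":  7.5, "gift_wrap": "N", "online_order": "Y"},
]

junk_dim, junk_keys = [], {}
for row in fact_rows:
    combo = (row.pop("gift_wrap"), row.pop("online_order"))
    if combo not in junk_keys:                     # first time this combo is seen
        junk_keys[combo] = len(junk_dim) + 1
        junk_dim.append({"junk_key": junk_keys[combo],
                         "gift_wrap": combo[0], "online_order": combo[1]})
    row["junk_key"] = junk_keys[combo]             # fact keeps one narrow key

assert len(junk_dim) == 2                          # only observed combinations
assert fact_rows[0]["junk_key"] == fact_rows[1]["junk_key"] == 1
```

Searching the data for correlated combinations, as the slide suggests, is what keeps the junk dimension small: two flags with ten values each need at most the combinations that actually co-occur, not the full cross product.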
  63. 63. N-Level Hierarchies • Bill of materials • Organization structure • Cost centre rollups. (Diagram: a Customer dimension with customer_key, customer_number, customer_name, address and parent_customer_key, self-joined on customer_key = parent_customer_key.)
  64. 64. Example Organization Hierarchy (diagram): a Microsoft organization tree spanning levels 0-2 and below, with members including Microsoft, Products, Online Services, Education Software, Entertainment, Games, Multimedia, Windows, Developer Tools, BackOffice, SQL Server, DTS, Repository, OLAP Services, Office, Visio, Visio Europe, Consulting, MSN, MSN.co.uk, Hotmail.com, MSNBC, MSNBC Online, Expedia, Expedia.co.uk and WebTV. The tree is flattened into a Company Structure table with parent_key, subsidiary_key, subsidiary_level, sequence_number, lowest_flag and highest_flag columns.
  65. 65. Using the Company Structure Table • Analyze all, immediate, intermediate or lowest subsidiaries: join the parent Customer to the bridge on customer_key = parent_key, and the fact table on subsidiary_key = customer_key • Analyze all, immediate, intermediate or highest parents: join the subsidiary Customer to the bridge on customer_key = subsidiary_key, and the fact table on parent_key = customer_key • Browse organization structures: self-join the Customer dimension through the bridge, with the parent Customer on parent_key and the subsidiary Customer on subsidiary_key
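A Python sketch of flattening a parent-child hierarchy into the company-structure bridge (a few member names borrowed from the slide's example; the parent-child assignments are illustrative): every member gets one row per ancestor, plus a level-0 self row, so "all subsidiaries of X" becomes a simple equality filter:

```python
# Illustrative parent-child dimension: child -> parent.
parent_of = {"Online Services": "Microsoft", "MSN": "Online Services",
             "Expedia": "Online Services", "Products": "Microsoft"}

def build_bridge(parent_of):
    """Emit one bridge row per (ancestor, member) pair, including the self row."""
    rows = []
    members = set(parent_of) | set(parent_of.values())
    for member in members:
        level, node = 0, member
        while True:
            rows.append({"parent_key": node, "subsidiary_key": member,
                         "subsidiary_level": level})   # level 0 is the self row
            if node not in parent_of:                  # reached the hierarchy root
                break
            node, level = parent_of[node], level + 1
    return rows

bridge = build_bridge(parent_of)
msn_ancestors = sorted((r["parent_key"], r["subsidiary_level"])
                       for r in bridge if r["subsidiary_key"] == "MSN")
assert msn_ancestors == [("MSN", 0), ("Microsoft", 2), ("Online Services", 1)]
```

With the bridge in place, "all subsidiaries of Microsoft at any depth" is just `parent_key = Microsoft`, with no recursive SQL needed; subsidiary_level supports the immediate/intermediate variants, and lowest_flag/highest_flag (omitted above) mark leaves and roots.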
  66. 66. Multi-valued Dimensions • Dimensions normally have a single value for each instance of a measure • When a required dimension has multiple values, we may have the fact granularity wrong/not detailed enough, e.g. a ‘Daily Product Totals by Store’ fact table would cause Customer and Employee to be multi-valued • Solution: design atomic-level detailed fact tables; detailed data is the most dimensional • What happens when we are already at the atomic level of fact measurement?
  67. 67. Multi-valued Diagnosis Dimension (diagram): a healthcare Billing fact (date_key, doctor_key, service_key, patient_key, location_key, payer_key, status, charged_amount, service_quantity, paid_amount) joined to Calendar, Doctor, Service, Patient, Location and Payer dimensions. Because one bill can carry many diagnoses, the fact gains a dgroup_key pointing at a Diagnosis Group bridge (dgroup_key, diagnosis_key, diagnosis_group, weighting_factor), which links to the Diagnosis dimension (diagnosis_key, ICD10CM, diagnosis, diagnosis_type, body_system).
  68. 68. Multi-value Bridge Tables • A bridge table (or link entity) resolves the many-to-many relationship between the fact table and the multi-valued dimension • Embellish it with weighting factors for fact allocation: the weighting factors for any one fact row sum to 1, which avoids hard-coding the allocations in the fact table. (Bridge columns: dgroup_key, diagnosis_key, diagnosis_group, weighting_factor.)
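A sketch of weighting-factor allocation through the bridge (diagnosis codes, amounts and weights invented; the weights are chosen to sum to 1): the charged amount splits across the group's diagnoses with nothing double counted:

```python
# One diagnosis group (dgroup_key = 7) bridging a billing fact to three diagnoses.
diagnosis_group = [
    {"dgroup_key": 7, "diagnosis": "E11.9", "weighting_factor": 0.5},
    {"dgroup_key": 7, "diagnosis": "I10",   "weighting_factor": 0.25},
    {"dgroup_key": 7, "diagnosis": "E78.5", "weighting_factor": 0.25},
]
billing_fact = {"dgroup_key": 7, "charged_amount": 200.0}

# Join fact -> bridge and allocate the measure by weighting factor.
allocated = {row["diagnosis"]: billing_fact["charged_amount"] * row["weighting_factor"]
             for row in diagnosis_group
             if row["dgroup_key"] == billing_fact["dgroup_key"]}

assert allocated == {"E11.9": 100.0, "I10": 50.0, "E78.5": 50.0}
assert sum(allocated.values()) == billing_fact["charged_amount"]  # fully allocated
```

Without the weighting factors, summing charged_amount by diagnosis would triple count the bill; with them, per-diagnosis totals remain additive across the whole fact table.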
  69. 69. To Be Continued…… Next Meetup: Overview of ETL best practices to load dimensional structures
  70. 70. Recommended Reading • Agile Data Warehouse Design, by Lawrence Corr and Jim Stagnitto • The Data Warehouse ETL Toolkit, by Ralph Kimball and Joe Caserta
  71. 71. Workshops: www.casertaconcepts.com/training Sept 21-22 (2 days), Agile Data Warehousing taught by Lawrence Corr Sept 23-24 (2 days), ETL Architecture and Design taught by Joe Caserta (Big Data module added) USE BDW GROUP 20% OFF DISCOUNT CODE: BDW20 Agile DW & ETL Training in NYC, 2015
  72. 72. Joe Caserta, President, Caserta Concepts • joe@casertaconcepts.com • (914) 261-3648 • @joe_Caserta • Thank You
  73. 73. Questions and Answers
