DW 101

  1. 1. Data Warehousing Denise Jeffries [email_address] [email_address] 205.747.3301
  2. 2. Databases vs Data Warehousing <ul><li>Often mistaken for each other – vastly different </li></ul><ul><ul><li>The database supports data storage & retrieval for an application or specific purpose </li></ul></ul><ul><ul><ul><li>Don’t bog it down with informational reporting, operational in nature </li></ul></ul></ul><ul><ul><ul><ul><li>(would your app work after your next acquisition, when you have grown your customer base, have more users, produce even more reports)… </li></ul></ul></ul></ul><ul><ul><li>A data warehouse is used for informational purposes </li></ul></ul><ul><ul><ul><li>To facilitate business reporting and analysis </li></ul></ul></ul><ul><ul><ul><li>It is not operational </li></ul></ul></ul>
  3. 3. Definition of a Data Warehouse <ul><li>Data warehouse — + subject oriented + integrated + time variant + nonvolatile collection of data for management’s decisions …data warehouses are granular. They contain the bedrock data that forms the single source for all Decision Support System/Executive Information System processing. With a data warehouse there is reconcilability of information when there are differences of opinion. The atomic data found in the warehouse can be shaped in many ways, satisfying both known requirements and standing ready to satisfy unknown requirements. </li></ul><ul><li>http://www.itquestionbank.com/types-of-data-warehouse.html </li></ul>
  4. 4. DW Project Components <ul><li>Business Requirements </li></ul><ul><li>Physical (hw/sw) environment setup </li></ul><ul><li>Data Modeling </li></ul><ul><li>ETL </li></ul><ul><li>OLAP or ROLAP cube design </li></ul><ul><li>Report Development </li></ul><ul><li>Query Optimization </li></ul><ul><li>Data Quality Assurance </li></ul><ul><li>Promote to Production </li></ul><ul><li>Maintenance </li></ul><ul><li>Enhancement </li></ul>
  5. 5. Key milestones in the early years of data warehousing: <ul><li>1960: General Mills & Dartmouth College research project – coin terms DIMENSIONS & FACTS </li></ul><ul><li>1967: Edward Yourdon “Real-Time Systems Design” </li></ul><ul><li>1970: ACNielsen and IRI provide dimensional data marts for retail sales </li></ul><ul><li>1979 – Tom DeMarco “Structured Analysis and System Specification” </li></ul><ul><li>1988 – Barry Devlin and Paul Murphy publish “An architecture for a business and information system” in IBM Systems Journal – coin term “business data warehouse” </li></ul><ul><li>1991 – Bill Inmon publishes book “Building the Data Warehouse” </li></ul><ul><li>1995 – The Data Warehousing Institute is founded (for profit) </li></ul><ul><li>1996 – Ralph Kimball publishes “The Data Warehouse Toolkit” </li></ul><ul><li>2000 – Wayne Eckerson “Data Quality and the Bottom Line” report from TDWI </li></ul><ul><li>2004 – IBM states their main competitors are Oracle and Teradata </li></ul>
  6. 6. The beginnings <ul><li>Commercial viability occurred with a drop in disk storage prices. </li></ul><ul><li>Then came the BI vendors </li></ul><ul><li>The ETL vendors </li></ul><ul><li>The Data modelers </li></ul><ul><li>And the database vendors fought…. </li></ul>
  7. 7. History of Data Warehousing <ul><li>Data Warehouses became a distinct type of computer database during the late 1980s and early 1990s. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems </li></ul><ul><ul><li>the extra processing load of reporting slowed the response time of the operational systems </li></ul></ul><ul><ul><ul><li>the development of reports in operational systems requires writing specific SQL queries which put a heavy load on the system </li></ul></ul></ul><ul><li>Separate computer databases began to be built that were specifically designed to support management information and analysis purposes. </li></ul><ul><li>Data warehouses were able to bring in data from a range of different data sources </li></ul><ul><ul><li>mainframe computers, minicomputers, personal computers and office automation software such as spreadsheets </li></ul></ul><ul><li>Data warehouses integrate this information in a single place. </li></ul><ul><li>User-friendly reporting tools and freedom from operational impacts have led to a growth of data warehousing systems </li></ul><ul><li>http://www.dedupe.com/history.php </li></ul>
  8. 8. History of Data Warehousing <ul><li>As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages: </li></ul><ul><li>Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance. </li></ul><ul><li>Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure. </li></ul><ul><li>The real next generation warehousing – not really being done: </li></ul><ul><li>Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.) </li></ul><ul><li>Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization. </li></ul><ul><li>http://www.dedupe.com/history.php </li></ul>
  9. 9. Data Warehouse Architecture <ul><li>The term data warehouse architecture describes the overall structure of the system. </li></ul><ul><ul><li>Historical terms include decision support systems (DSS), management information systems (MIS) </li></ul></ul><ul><ul><li>Newer terms include business intelligence competency center (BICC) </li></ul></ul><ul><li>The data warehouse architecture describes the overall system components: infrastructure, data and processes. </li></ul><ul><ul><li>The infrastructure technology stack perspective determines the hardware and software products needed to implement the components of the system. </li></ul></ul><ul><ul><li>The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related. </li></ul></ul><ul><ul><li>The process perspective is primarily concerned with communicating the process and flow of data from the originating source system through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse. </li></ul></ul><ul><li>Architecture facilitates the structure, function and interrelationships of each component. </li></ul>
  10. 10. Advantages to DW <ul><li>Enables end-user access to a wide variety of data </li></ul><ul><li>Increased data consistency </li></ul><ul><li>Additional documentation of the data (published data models, data dictionaries) </li></ul><ul><li>Lower overall computing costs and increased productivity </li></ul><ul><li>Area to combine related data from separate sources </li></ul><ul><li>Flexible, easy to change computing infrastructure to support data changes in applications systems and business structures/hierarchies </li></ul><ul><li>Empowering end-users to perform ad-hoc queries and reports without impacting the performance of the operational systems </li></ul><ul><li>An enabler of commercial business applications, most notably customer relationship management (CRM) i.e. through feed-back loops. </li></ul>
  11. 11. Data Integration <ul><li>Data integration is the aspect of combining diverse sources and giving the user a unified view of their data. </li></ul><ul><li>This important problem emerges in a variety of situations both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories). </li></ul><ul><li>Data integration appears with increasing frequency as the volume and the need to share existing data explodes. </li></ul><ul><li>It has been the focus of extensive theoretical work and numerous open problems remain to be solved. </li></ul><ul><li>In practice, data integration is frequently called Enterprise Information Integration. </li></ul>
  12. 12. Data Warehousing Toolsets <ul><li>Data modeling </li></ul><ul><ul><li>Diagrams: ERD, etc. </li></ul></ul><ul><li>Data Dictionary </li></ul><ul><li>ETL Tools </li></ul><ul><li>Database of Choice </li></ul><ul><ul><li>Oracle, SQLServer, DB2, Teradata, Netezza, …. </li></ul></ul><ul><li>SQL and its tools </li></ul><ul><li>Data Validation </li></ul><ul><li>Bug trackers/issue trackers: Testing </li></ul>
  13. 13. Types of Data Warehouses <ul><ul><li>Not Data marts </li></ul></ul><ul><li>Operational Data Store (ODS) </li></ul><ul><li>Data warehouse (enterprise data warehouse - EDW) </li></ul><ul><li>Exploration data warehouse </li></ul><ul><li>Decision Support System (aka:Management Information System MIS) </li></ul>
  14. 14. Brief Description of Terms <ul><li>Operational Systems are the internal and external core systems that support the day-to-day business operations. They are accessed through application program interfaces (APIs) and are the source of data for the data warehouse and operational data store. (Encompasses all operational systems including ERP, relational and legacy.) </li></ul><ul><li>Data Acquisition is the set of processes that capture, integrate, transform, cleanse, reengineer and load source data into the data warehouse and operational data store. Data reengineering is the process of investigating, standardizing and providing clean consolidated data. </li></ul><ul><li>The Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data. </li></ul><ul><li>Primary Storage Management consists of the processes that manage data within and across the data warehouse and operational data store. It includes processes for backup and recovery, partitioning, summarization, aggregation, and archival and retrieval of data to and from alternative storage. </li></ul><ul><li>Alternative Storage is the set of devices used to cost-effectively store data warehouse and exploration warehouse data that is needed but not frequently accessed. These devices are less expensive than disks and still provide adequate performance when the data is needed. </li></ul><ul><li>Data Delivery is the set of processes that enable end users and their supporting IS group to build and manage views of the data warehouse within their data marts. It involves a three-step process consisting of filtering, formatting and delivering data from the data warehouse to the data marts. 
</li></ul><ul><li>The Data Mart is customized and/or summarized data derived from the data warehouse and tailored to support the specific analytical requirements of a business unit or function. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. The data mart may or may not be on the same server or location as the data warehouse. </li></ul>
  15. 15. Desc terms cont’d <ul><li>The Operational Data Store (ODS) is a subject-oriented, integrated, current, volatile collection of data used to support the tactical decision-making process for the enterprise. It is the central point of data integration for business management, delivering a common view of enterprise data. </li></ul><ul><li>Meta Data Management is the process for managing information needed to promote data legibility, use and administration. Contents are described in terms of data about data, activity and knowledge. </li></ul><ul><li>The Exploration Warehouse is a DSS architectural structure whose purpose is to provide a safe haven for exploratory and ad hoc processing. An exploration warehouse utilizes data compression to provide fast response times with the ability to access the entire database. </li></ul><ul><li>The Data Mining Warehouse is an environment created so analysts may test their hypotheses, assertions and assumptions developed in the exploration warehouse. Specialized data mining tools containing intelligent agents are used to perform these tasks. </li></ul><ul><li>Activities are the events captured by the enterprise legacy and/or ERP systems as well as external transactions such as Internet interactions. </li></ul><ul><li>Statistical Applications are set up to perform complex, difficult statistical analyses such as exception, means, average and pattern analyses. The data warehouse is the source of data for these analyses. These applications analyze massive amounts of detailed data and require a reasonably performing environment. </li></ul><ul><li>Analytic Applications are pre-designed, ready-to-install, decision support applications. They generally require some customization to fit the specific requirements of the enterprise. The source of data is the data warehouse. Examples of these applications are risk analysis, database marketing (CRM) analyses, vertical industry "data marts in a box," etc. 
</li></ul><ul><li>External Data is any data outside the normal data collected through an enterprise's internal applications. There can be any number of sources of external data such as demographic, credit, competitor and financial information. Generally, external data is purchased by the enterprise from a vendor of such information. </li></ul>
  16. 16. Bill Inmon <ul><li>Recognized as founder of the data warehouse (wrote the first book, offered first conference with Arnie Barnett, wrote the first column in a magazine {IBM Journal}, offered the first classes) </li></ul><ul><li>Created the accepted definition of what a DW is (subject orientated, nonvolatile, integrated, time variant collection of data in support of management’s decisions) </li></ul><ul><li>Approach is top-down </li></ul><ul><li>1991 Founded Prism Solutions, took public, 1995 founded PineCone Systems, renamed Ambeo. </li></ul><ul><li>1999 created Corporate Information Factory website to educate professionals </li></ul>
  17. 17. http://www.inmoncif.com/library/cif/
  18. 18. Ralph Kimball <ul><li>One of the original architects of data warehousing. </li></ul><ul><ul><li>DW must be understandable and FAST </li></ul></ul><ul><ul><li>Developed dimensional modeling (the Kimball method), the standard in decision support </li></ul></ul><ul><ul><ul><li>Bottom-up approach </li></ul></ul></ul><ul><ul><li>1986 founded Red Brick Systems (used indexes for performance gains), 1992 acquired by Informix, now owned by IBM </li></ul></ul><ul><ul><ul><li>Co-inventor of the Xerox Star workstation (first commercial product to use mice, icons and windows) </li></ul></ul></ul>
  19. 19. Data Models <ul><li>Provide definition and format of data </li></ul><ul><ul><li>Represent information areas of interest or Subject Areas </li></ul></ul><ul><li>Modeling methodologies: </li></ul><ul><ul><li>Bottom-up model design: </li></ul></ul><ul><ul><ul><li>Start with existing structures </li></ul></ul></ul><ul><ul><li>Top-down model design: </li></ul></ul><ul><ul><ul><li>Created fresh (by SMEs) as reference point/template </li></ul></ul></ul>
  20. 20. Data Normalization – what is it <ul><li>Normalization is a relational database modeling process where the relations or tables are progressively decomposed into smaller relations to a point where all attributes in a relation are very tightly coupled with the primary key of the relation. Most data modelers try to achieve the “Third Normal Form” with all of the relations before they de-normalize for performance, ease of query or other reasons. </li></ul><ul><li>First Normal Form : A relation is said to be in First Normal Form if it describes a single entity and it contains no arrays or repeating attributes. For example, an order table or relation with multiple line items would not be in First Normal Form because it would have repeating sets of attributes for each line item. The relational theory would call for separate tables for order and line items. </li></ul><ul><li>Second Normal Form : A relation is said to be in Second Normal Form if in addition to the First Normal Form properties, all attributes are fully dependent on the primary key for the relation. </li></ul><ul><li>Third Normal Form : A relation is in Third Normal Form if in addition to Second Normal Form, all non-key attributes are completely independent of each other. http://www.sserve.com/ftp/dwintro.doc </li></ul>
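The First Normal Form example above (an order with repeating line items) can be sketched in a few lines. This is only an illustration under assumed data: the record layout, the customer names and the `to_first_normal_form` helper are hypothetical and not part of the original deck.

```python
# Hypothetical denormalized order records: each order carries a repeating
# group of line items, so the structure is not in First Normal Form.
raw_orders = [
    {"order_id": 1, "customer": "Acme", "items": [("TV", 2), ("Cable", 5)]},
    {"order_id": 2, "customer": "Zenith", "items": [("Radio", 1)]},
]


def to_first_normal_form(orders):
    """Split the repeating line items out into their own relation."""
    order_rows = []
    line_item_rows = []
    for order in orders:
        order_rows.append((order["order_id"], order["customer"]))
        for line_no, (product, qty) in enumerate(order["items"], start=1):
            # (order_id, line_no) acts as the composite key of the line-item relation.
            line_item_rows.append((order["order_id"], line_no, product, qty))
    return order_rows, line_item_rows


order_rows, line_item_rows = to_first_normal_form(raw_orders)
```

After the split, every attribute in `order_rows` depends only on the order key, and every attribute in `line_item_rows` depends on the order key plus the line number.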
  21. 21. Entity Relationship Diagrams example 3rd normal form
  22. 22. Star Schema (facts and dimensions) <ul><li>The facts that the data warehouse helps analyze are classified along different dimensions: </li></ul><ul><ul><li>The FACT table houses the main data </li></ul></ul><ul><ul><ul><li>Includes a large amount of aggregated data (i.e. price, units sold) </li></ul></ul></ul><ul><ul><li>DIMENSION tables off the FACT include attributes that describe the FACT </li></ul></ul><ul><li>Star schemas provide simplicity for users </li></ul>
  23. 23. Star Schema example (Sales db)
  24. 24. SQL to select from Star Schema <ul><li>SELECT Brand, Country, SUM(Units_Sold) </li></ul><ul><li>FROM Fact.Sales </li></ul><ul><li>JOIN Dim.Date </li></ul><ul><li>ON Date_FK = Date_PK </li></ul><ul><li>JOIN Dim.Store </li></ul><ul><li>ON Store_FK = Store_PK </li></ul><ul><li>JOIN Dim.Product </li></ul><ul><li>ON Product_FK = Product_PK </li></ul><ul><li>WHERE [Year] = 2010 </li></ul><ul><li>AND Product_Category = 'TV' </li></ul><ul><li>GROUP BY Brand, Country </li></ul>
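For readers who want to run the star-schema query, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout and sample rows are assumptions made for illustration. SQLite has no `Fact.` or `Dim.` schemas and no `[Year]` bracket requirement, so the tables are named `Fact_Sales`, `Dim_Date` and so on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed star layout: one fact table and three denormalized dimensions.
conn.executescript("""
CREATE TABLE Dim_Date    (Date_PK INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Store   (Store_PK INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Product (Product_PK INTEGER PRIMARY KEY, Brand TEXT, Product_Category TEXT);
CREATE TABLE Fact_Sales  (Date_FK INTEGER, Store_FK INTEGER, Product_FK INTEGER, Units_Sold INTEGER);
INSERT INTO Dim_Date VALUES (1, 2010), (2, 2011);
INSERT INTO Dim_Store VALUES (1, 'US'), (2, 'DE');
INSERT INTO Dim_Product VALUES (1, 'Acme', 'TV'), (2, 'Acme', 'Radio');
INSERT INTO Fact_Sales VALUES (1, 1, 1, 10), (1, 1, 1, 5), (1, 2, 1, 7), (2, 1, 1, 99), (1, 1, 2, 3);
""")

# Same shape as the slide's query; ORDER BY is added only to make the output stable.
star_rows = conn.execute("""
SELECT Brand, Country, SUM(Units_Sold)
FROM Fact_Sales
JOIN Dim_Date    ON Date_FK = Date_PK
JOIN Dim_Store   ON Store_FK = Store_PK
JOIN Dim_Product ON Product_FK = Product_PK
WHERE Year = 2010 AND Product_Category = 'TV'
GROUP BY Brand, Country
ORDER BY Country
""").fetchall()
```

Only three joins are needed because each dimension is a single flat table.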
  25. 25. SnowFlake Schema <ul><li>Central FACT </li></ul><ul><li>Connected to multiple DIMENSIONS which are NORMALIZED into related tables </li></ul><ul><li>Snowflaking affects DIMS and never FACT </li></ul><ul><li>Used in Data warehouses and data marts when speed is more important than efficiency/ease of data selection </li></ul><ul><li>Needed for many BI OLAP tools </li></ul><ul><li>Stores less data </li></ul>
  26. 26. Snowflake Schema example (Sales db)
  27. 27. SQL to select from SnowFlake <ul><li>SELECT </li></ul><ul><ul><li>B.Brand, </li></ul></ul><ul><ul><li>G.Country, </li></ul></ul><ul><ul><li>SUM(F.Units_Sold) </li></ul></ul><ul><li>FROM Fact_Sales F (NOLOCK) </li></ul><ul><li>INNER JOIN Dim_Date D (NOLOCK) ON F.Date_Id = D.Id </li></ul><ul><li>INNER JOIN Dim_Store S (NOLOCK) ON F.Store_Id = S.Id </li></ul><ul><li>INNER JOIN Dim_Geography G (NOLOCK) ON S.Geography_Id = G.Id </li></ul><ul><li>INNER JOIN Dim_Product P (NOLOCK) ON F.Product_Id = P.Id </li></ul><ul><li>INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id </li></ul><ul><li>INNER JOIN Dim_Brand B (NOLOCK) ON P.Brand_Id = B.Id </li></ul><ul><li>WHERE D.Year = 2010 </li></ul><ul><li>AND C.Product_Category = 'tv' </li></ul><ul><li>GROUP BY B.Brand, G.Country </li></ul>
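The snowflake query can be tried the same way. The sketch below again uses Python's built-in `sqlite3` module with assumed table definitions and sample rows. The `(NOLOCK)` hints are SQL Server specific and are left out because SQLite does not accept them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed snowflake layout: store and product dimensions are normalized
# into geography, brand and category tables.
conn.executescript("""
CREATE TABLE Dim_Date (Id INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Geography (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Store (Id INTEGER PRIMARY KEY, Geography_Id INTEGER);
CREATE TABLE Dim_Brand (Id INTEGER PRIMARY KEY, Brand TEXT);
CREATE TABLE Dim_Product_Category (Id INTEGER PRIMARY KEY, Product_Category TEXT);
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand_Id INTEGER, Product_Category_Id INTEGER);
CREATE TABLE Fact_Sales (Date_Id INTEGER, Store_Id INTEGER, Product_Id INTEGER, Units_Sold INTEGER);
INSERT INTO Dim_Date VALUES (1, 2010), (2, 2011);
INSERT INTO Dim_Geography VALUES (1, 'US'), (2, 'DE');
INSERT INTO Dim_Store VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO Dim_Brand VALUES (1, 'Acme');
INSERT INTO Dim_Product_Category VALUES (1, 'tv'), (2, 'radio');
INSERT INTO Dim_Product VALUES (1, 1, 1), (2, 1, 2);
INSERT INTO Fact_Sales VALUES (1, 1, 1, 10), (1, 2, 1, 5), (1, 3, 1, 7), (2, 1, 1, 99), (1, 1, 2, 3);
""")

# Same joins as the slide; ORDER BY is added only to make the output stable.
snowflake_rows = conn.execute("""
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id
WHERE D.Year = 2010 AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country
ORDER BY G.Country
""").fetchall()
```

The result is the same kind of brand-by-country total, but reaching `Country` and `Brand` now takes six joins because the dimensions are split into related tables.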
  28. 28. Comparison of SQL Star vs SnowFlake <ul><li>SELECT Brand, Country, SUM(Units_Sold) </li></ul><ul><li>FROM Fact.Sales </li></ul><ul><li>JOIN Dim.Date </li></ul><ul><li>ON Date_FK = Date_PK </li></ul><ul><li>JOIN Dim.Store </li></ul><ul><li>ON Store_FK = Store_PK </li></ul><ul><li>JOIN Dim.Product </li></ul><ul><li>ON Product_FK = Product_PK </li></ul><ul><li>WHERE [Year] = 2010 </li></ul><ul><li>AND Product_Category = 'TV' </li></ul><ul><li>GROUP BY Brand, Country </li></ul><ul><li>SELECT </li></ul><ul><ul><li>B.Brand, </li></ul></ul><ul><ul><li>G.Country, </li></ul></ul><ul><ul><li>SUM(F.Units_Sold) </li></ul></ul><ul><li>FROM Fact_Sales F (NOLOCK) </li></ul><ul><li>INNER JOIN Dim_Date D (NOLOCK) ON F.Date_Id = D.Id </li></ul><ul><li>INNER JOIN Dim_Store S (NOLOCK) ON F.Store_Id = S.Id </li></ul><ul><li>INNER JOIN Dim_Geography G (NOLOCK) ON S.Geography_Id = G.Id </li></ul><ul><li>INNER JOIN Dim_Product P (NOLOCK) ON F.Product_Id = P.Id </li></ul><ul><li>INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id </li></ul><ul><li>INNER JOIN Dim_Brand B (NOLOCK) ON P.Brand_Id = B.Id </li></ul><ul><li>WHERE D.Year = 2010 </li></ul><ul><li>AND C.Product_Category = 'tv' </li></ul><ul><li>GROUP BY </li></ul><ul><li>B.Brand, </li></ul><ul><li>G.Country </li></ul>
  29. 29. Basic EDW Data Model Design Party Account Product & Service Event Each represents a subject area in the model, with third normal tables to accommodate the data and its relationships with hierarchy
  30. 30. Account, Customer & Address Relationships Account Contact Party Address link Account Party link Address Account Party Account Information loaded from ALL Source Systems ETL process builds the relationship between Accounts and Customers (Party) based on the relationship file from CUSTOMER CRM SYSTEM
  31. 31. Architecture for an EDW or other large Data Warehouse <ul><li>How do you get from where you are to implementing an actual system? </li></ul><ul><ul><li>Start with defining your requirements </li></ul></ul><ul><ul><li>Then modeling </li></ul></ul><ul><ul><li>Budget $$$ </li></ul></ul><ul><ul><li>Hire staff </li></ul></ul><ul><ul><li>Engage partners </li></ul></ul><ul><ul><li>DO IT YOURSELF, DO NOT RELY ON THE EXPERTS – staff augment and hire the talent internally </li></ul></ul>
  33. 33. Some Interesting Info: <ul><li>http://www.itquestionbank.com/an-introduction-to-data-warehousing.html </li></ul><ul><li>http://www.ralphkimball.com/ </li></ul><ul><li>http://www.inmoncif.com/home/ </li></ul>
  34. 34. SECTION 2 <ul><li>Data Warehouse Architecture </li></ul>
  35. 35. Where we are at / Next steps <ul><li>Have identified data needs </li></ul><ul><li>Designed a model to fit those needs </li></ul><ul><li>Now we need to identify how we will set up the architecture </li></ul><ul><ul><li>Physical hardware </li></ul></ul><ul><ul><li>Software </li></ul></ul><ul><li>People </li></ul><ul><ul><li>Who do we need on the project team </li></ul></ul><ul><li>Processes </li></ul>
  36. 36. State of many mature companies
  37. 37. EDW Process State Staging Area EDW Metadata | Data Governance | Data Management DM CPS MANTAS CRDB MKTG FIN SALES EDW Data cleansing Data profiling Sync & Sort BI Source System Cleanse / Pre-process IMP RM OEC ALS AFS ST RE DFP SBA AFS V-PR
  38. 38. Information Factory Concept
  39. 39. Moore’s Law (yes, it applies here too) <ul><li>Sharply increasing power of computer hardware </li></ul><ul><li>With the increase in power comes a decrease in price </li></ul><ul><li>(capacity of microprocessor will double every 18 months) also holds true for other computer components </li></ul><ul><li>Desktop power increasing as well as service power requirements (where GO GREEN comes from) </li></ul>
  40. 40. Explosion in innovation <ul><li>BI software now able to be deployed on intranet vs hard to maintain thick client apps </li></ul><ul><ul><li>Thick client still used for developers </li></ul></ul><ul><li>Web server, application server, database server </li></ul><ul><ul><li>Allows offloading of processing to correct tier </li></ul></ul><ul><ul><ul><li>More power for everyone </li></ul></ul></ul>
  41. 41. Change in Business <ul><li>Global economy changed needs of organizations worldwide </li></ul><ul><li>Global markets </li></ul><ul><li>Mergers and Acquisitions </li></ul><ul><li>All increase data needs </li></ul><ul><li>More tech-savvy end users (demand more data, more tools…) </li></ul><ul><li>More information-demanding executives facilitate sponsorship of DW </li></ul>
  42. 42. DW Evolving <ul><li>Care should be taken </li></ul><ul><ul><li>i.e. vendor claims </li></ul></ul><ul><ul><li>Size is not a factor </li></ul></ul><ul><ul><li>Operational vs informational </li></ul></ul><ul><ul><ul><li>Operational pre-defined </li></ul></ul></ul><ul><ul><ul><li>Informational more ad hoc in nature </li></ul></ul></ul><ul><ul><ul><li>Performance </li></ul></ul></ul><ul><ul><ul><li>Volatile vs non-volatile data </li></ul></ul></ul><ul><ul><ul><li>DW saves data for longer periods than transactional/operational systems (trending analysis, where I was vs where I am…..) </li></ul></ul></ul><ul><ul><li>Real-time DW vs point in time </li></ul></ul>
  43. 43. DW needs to be extendable, align with business structure http://www.sserve.com/ftp/dwintro.doc
  44. 44. Enterprise Data Solution Data Marts and OLAP Enterprise Data Warehouse Source Systems Reporting Data Mining OLAP Analysis Dashboard Scorecard Master Data Management Application Master / Reference Data Store
  45. 45. EDW - Objective Follow the process methodology to achieve these architectural aspects : Meta Data, Security, Scalability, Reliability and Supportability
  46. 46. EDW – Data Model Design Party Account Product & Service Event Each represents the subject area we have in the model, with third normal tables to accommodate the data and its relationships with hierarchy
  47. 47. Account, Customer & Address Relationships Account Contact Party Address link Account Party link Address Account Party Account Information loaded from ALL Source Systems Customer Information Loaded EDW ETL process builds the relationship between Account and Customers based on the relationship file from RM
  48. 48. Single definition of a data element <ul><li>DW brings in the data from multiple sources and conforms it so that it can be viewed together </li></ul><ul><ul><li>Multiple systems have individual customers/addresses, but warehouse gives single view of the customer and all the systems they are in </li></ul></ul><ul><ul><ul><li>Helping move from product centric systems to customer centric systems </li></ul></ul></ul>
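The idea of conforming customers from several systems into a single view can be sketched as follows. Everything here is hypothetical: the two source layouts, the field names and the name-plus-postal-code match key are assumptions chosen to keep the example small, and real matching rules are usually far richer.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems that hold the same customer.
loan_system = [{"cust_no": "L-17", "name": "  JOHN SMITH ", "zip": "35203"}]
deposit_system = [
    {"acct_holder": "John Smith", "postal": "35203", "id": "D-9"},
    {"acct_holder": "Mary Jones", "postal": "35004", "id": "D-12"},
]


def conform(name, postal):
    """Build a simple match key: collapsed whitespace, upper case, trimmed postal code."""
    return (" ".join(name.split()).upper(), postal.strip())


def build_customer_view(loans, deposits):
    """Map each conformed customer key to the systems and source ids it appears under."""
    view = defaultdict(list)
    for row in loans:
        view[conform(row["name"], row["zip"])].append(("loans", row["cust_no"]))
    for row in deposits:
        view[conform(row["acct_holder"], row["postal"])].append(("deposits", row["id"]))
    return dict(view)


customer_view = build_customer_view(loan_system, deposit_system)
```

Each key in `customer_view` is one customer, and the list behind it shows every system that customer is in, which is the customer-centric view the slide describes.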
  49. 49. Business view of data <ul><li>DW is only successful if it provides the view of its data that the business needs </li></ul><ul><li>A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis. </li></ul><ul><ul><li>Vivek R. Gupta, Senior Consultant [email_address] System Services corporation, Chicago, Illinois http://www.system-services.com </li></ul></ul>
  50. 50. Example of conforming data for business view: http://www.sserve.com/ftp/dwintro.doc
  51. 51. Business use of DW <ul><li>Business should use data mart created off data warehouse </li></ul><ul><li>Business users want to use existing tools/methods (replicate queries, Excel, extract to Access) against DW and validate the data between existing and DW </li></ul><ul><li>Over time LoB gains confidence in DW and then begins to explore new possibilities of data use and tool use </li></ul>
  52. 52. EDW – Process Flow
  53. 53. EDW ETL Design Source to Stage Mapping (For AFS) Stage to EDW Mapping (for AFS) EDW to FDM Mapping (for FACT)
  54. 54. ETL Tools are prolific <ul><li>Ab Initio </li></ul><ul><li>Syncsort DMExpress 6.5 </li></ul><ul><li>Oracle Warehouse Builder (OWB) 11gR1 Oracle </li></ul><ul><li>Data Integrator & Data Services XI 3.0 SAP Business Objects </li></ul><ul><li>IBM Information Server (Datastage) 8.1 IBM </li></ul><ul><li>SAS Data Integration Studio 4.2 SAS Institute </li></ul><ul><li>PowerCenter 9.0 Informatica </li></ul><ul><li>Elixir Repertoire 7.2.2 Elixir </li></ul><ul><li>Data Migrator 7.6 Information Builders </li></ul><ul><li>SQL Server Integration Services 10 Microsoft </li></ul><ul><li>Talend Open Studio & Integration Suite 4.0 Talend </li></ul><ul><li>DataFlow Manager 6.5 Pitney Bowes Business Insight </li></ul><ul><li>Data Integrator 9.2 Pervasive </li></ul><ul><li>Open Text Integration Center 7.1 Open Text </li></ul><ul><li>Transformation Manager 5.2.2 ETL Solutions Ltd. </li></ul><ul><li>Data Manager/Decision Stream 8.2 IBM (Cognos) </li></ul><ul><li>Clover ETL 2.9.2 Javlin </li></ul><ul><li>ETL4ALL 4.2 IKAN </li></ul><ul><li>DB2 Warehouse Edition 9.1 IBM </li></ul><ul><li>Pentaho Data Integration 3.0 Pentaho </li></ul><ul><li>Adeptia Integration Suite 5.1 Adeptia </li></ul><ul><li>Expressor </li></ul><ul><li>Sun – SeeBeyond ETL integrator </li></ul>
  55. 55. Commonly used toolsets: <ul><li>Commercial ETL Tools: </li></ul><ul><li>IBM Infosphere DataStage </li></ul><ul><li>Informatica PowerCenter </li></ul><ul><li>Oracle Warehouse Builder (OWB) </li></ul><ul><li>Oracle Data Integrator (ODI) </li></ul><ul><li>SAS ETL Studio </li></ul><ul><li>Business Objects Data Integrator (BODI) </li></ul><ul><li>Microsoft SQL Server Integration Services (SSIS) </li></ul><ul><li>Ab Initio </li></ul><ul><li>Freeware, open source ETL tools: </li></ul><ul><li>Pentaho Data Integration (Kettle) </li></ul><ul><li>Talend Integrator Suite </li></ul><ul><li>CloverETL </li></ul><ul><li>Jasper ETL </li></ul>
  56. 56. ETL Extract, Transform, Load <ul><li>Created to improve and facilitate data warehousing </li></ul><ul><li>EXTRACT </li></ul><ul><ul><li>Data brought in from external sources </li></ul></ul><ul><li>TRANSFORM </li></ul><ul><ul><li>Data fit to standards </li></ul></ul><ul><li>LOAD </li></ul><ul><ul><li>Load converted data into target DW </li></ul></ul><ul><li>Steps: </li></ul><ul><li>Initiate </li></ul><ul><li>Build reference data </li></ul><ul><li>Extract from sources </li></ul><ul><li>Validate </li></ul><ul><li>Transform </li></ul><ul><li>Load into staging tables </li></ul><ul><li>Audit reports </li></ul><ul><li>Publish </li></ul><ul><li>Archive </li></ul><ul><li>cleanup </li></ul>
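A few of the steps listed above (extract, validate, transform, load into staging, audit) can be sketched with the Python standard library. The source data, the `stg_account` table and the validation rule are all hypothetical, and a real job would normally run inside one of the ETL tools named in this deck rather than in hand-written code.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real job would read files or query a source system.
SOURCE_CSV = "account_id,balance,country\n1001, 250.50 ,us\n1002,not-a-number,de\n1003,75,DE\n"


def extract(text):
    """Read the raw source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(rows):
    """Separate rows whose balance parses as a number from rejected rows."""
    good, rejected = [], []
    for row in rows:
        try:
            float(row["balance"])
            good.append(row)
        except ValueError:
            rejected.append(row)
    return good, rejected


def transform(rows):
    """Fit data to standards: integer keys, numeric balances, upper-case country codes."""
    return [(int(r["account_id"]), float(r["balance"]), r["country"].strip().upper()) for r in rows]


def load(conn, rows):
    """Load the converted rows into an assumed staging table."""
    conn.execute("CREATE TABLE IF NOT EXISTS stg_account (account_id INTEGER, balance REAL, country TEXT)")
    conn.executemany("INSERT INTO stg_account VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
extracted = extract(SOURCE_CSV)
valid, rejected = validate(extracted)
load(conn, transform(valid))
# Simple audit report: row counts at each stage should reconcile.
audit = {"extracted": len(extracted), "loaded": len(valid), "rejected": len(rejected)}
staged = conn.execute("SELECT * FROM stg_account ORDER BY account_id").fetchall()
```

Keeping rejected rows and the audit counts makes it possible to reconcile what was extracted against what reached the staging table.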
  57. 57. Reconciliation Overview (EDW-data mart)
  58. 58. EDW Data Flow
  59. 59. EDW – Security Scheme Database User/Schema Unix User
  60. 60. EDW - Infrastructure Development Environment Production Environment
  61. 61. EDW – Development (SDLC)
  62. 62. EDW Development Project Cycle (New Source to EDW)
  63. 63. EDW – Support – Escalation Procedure
  64. 64. EDW – Support Process
  65. 65. EDW - Roadmap Management Architecture (Metadata, Data Security, Systems Management)
  66. 66. Architecture Exercise (1 of 2) <ul><li>Identify needs in the following categories </li></ul><ul><ul><li>Physical hardware </li></ul></ul><ul><ul><ul><li>CPU, Memory, Disk </li></ul></ul></ul><ul><ul><ul><ul><li>For disk – how much? </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Use your model and calculate size </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><li>For database & tools </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Will data be behind a firewall? </li></ul></ul></ul></ul><ul><ul><li>Software </li></ul></ul><ul><ul><ul><li>Database </li></ul></ul></ul><ul><ul><ul><li>ETL </li></ul></ul></ul><ul><ul><ul><li>BI tools </li></ul></ul></ul><ul><ul><ul><ul><li>Application and web server </li></ul></ul></ul></ul>
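For the "use your model and calculate size" step, a rough sizing calculation might look like the sketch below. The table names, row counts, average row widths and the 50% overhead allowance for indexes, staging and temp space are made-up planning inputs, not figures from the deck.

```python
# Hypothetical model: projected row counts at the end of the retention period
# and average row widths in bytes.
model = {
    "Fact_Sales": {"rows": 250_000_000, "avg_row_bytes": 40},
    "Dim_Product": {"rows": 10_000, "avg_row_bytes": 200},
}


def estimate_disk_gb(tables, overhead=0.5):
    """Raw data size times (1 + overhead), converted from bytes to GiB."""
    raw_bytes = sum(t["rows"] * t["avg_row_bytes"] for t in tables.values())
    return raw_bytes * (1 + overhead) / 1024**3


size_gb = estimate_disk_gb(model)
```

With these inputs the fact table dominates the estimate, which is typical and suggests that fact row counts and retention deserve the most scrutiny when sizing disk.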
  67. 67. Architecture Exercise (2 of 2) <ul><li>Break into former 6 teams </li></ul><ul><li>Ask each team to consider what they will need to build a DW </li></ul><ul><ul><li>Hardware </li></ul></ul><ul><ul><li>Software </li></ul></ul><ul><ul><li>People </li></ul></ul><ul><ul><li>Processes </li></ul></ul><ul><ul><li>Support and Operations </li></ul></ul><ul><li>Allow 30 minutes to brainstorm then discuss as a class </li></ul><ul><li>Volunteer team presents what they came up with </li></ul><ul><ul><li>Lists needs on board </li></ul></ul><ul><ul><ul><li>each progressive team adds what they have to the list </li></ul></ul></ul><ul><li>Group discussion on what they have uncovered </li></ul>
  68. 68. SECTION 3 <ul><li>What is Data Quality </li></ul><ul><ul><li>I can’t tell you what’s important, but your users can. </li></ul></ul><ul><ul><ul><li>Look for the fields that can identify potential problems with the data </li></ul></ul></ul><ul><li>What is Master Data Management (MDM) </li></ul>
  69. 69. Data Quality <ul><li>Data doesn’t stay the same </li></ul><ul><ul><li>Sometimes it does </li></ul></ul><ul><li>Considerations: </li></ul><ul><ul><li>What happens to the warehouse when the data changes </li></ul></ul><ul><ul><li>What happens when needs change </li></ul></ul>
  70. 70. Roadmap to DQ <ul><li>Data profiling </li></ul><ul><li>Establishing metrics/measures </li></ul><ul><li>Design and implement the rules </li></ul><ul><li>Deploy the plan </li></ul><ul><li>Review errors/exceptions </li></ul><ul><li>Monitor the results </li></ul>
  71. 71. Data Profiling <ul><li>What’s in the data </li></ul><ul><ul><li>Analyze the columns in the tables </li></ul></ul><ul><ul><ul><li>Provides metadata </li></ul></ul></ul><ul><ul><ul><li>Allows for good specifications for programmers </li></ul></ul></ul><ul><ul><ul><li>Reduces project risk (as data is now known) </li></ul></ul></ul><ul><ul><ul><ul><li>How many rows, number of distinct values in a column, how many null, data type identification </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Shows the data pattern </li></ul></ul></ul></ul>
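A small sketch of column profiling, covering the row count, null count, distinct count, and data pattern mentioned above. The function names and the pattern notation (`9` for a digit, `A` for a letter) are assumptions made for this example; commercial profiling tools report similar statistics in their own formats.

```python
import re
from collections import Counter

def pattern_of(value):
    """Reduce a value to its shape: letters become 'A', digits become '9'."""
    shaped = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", shaped)

def profile_column(values):
    """Return basic profiling statistics for one column of raw values."""
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "patterns": Counter(pattern_of(v) for v in non_null),
    }

# Hypothetical zip code column with one malformed value and two missing values.
zip_codes = ["35244", "3524B", None, "35244", "48108", ""]
profile = profile_column(zip_codes)
print(profile)
```

A minority pattern such as `9999A` in a zip code column is the kind of finding that belongs in the specification handed to programmers.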
  72. 72. Data Profiling Example
  73. 73. Data Quality is measured as the degree of superiority, or excellence, of the various data that we use to create information products. <ul><li>“Reason #1 for the failure of CRM projects: Data is ignored. Enterprise must have a detailed understanding of the quality of their data. How to clean it up, how to keep it clean, where to source it, and what 3rd-party data is required. Action item: Have a data quality strategy. Devote ½ of the total timeline of the CRM project to data elements.” - Gartner </li></ul>
  74. 74. Data Quality Tools (Gartner Magic Quadrant)
  75. 75. Dimensions of Quality Informatica.com
  76. 76. Data Quality Measures <ul><li>Definition </li></ul><ul><li>Accuracy </li></ul><ul><li>Completeness </li></ul><ul><li>Coverage </li></ul><ul><li>Timeliness </li></ul><ul><li>Validity </li></ul>
  77. 77. Definition <ul><li>Conformance: The degree to which data values are consistent with their agreed-upon definitions. </li></ul><ul><ul><li>A detailed definition must first exist before this can be measured. </li></ul></ul><ul><ul><li>Information quality begins with a comprehensive understanding of the data inventory. The information about the data is as important as the data itself. </li></ul></ul><ul><ul><li>A Data Dictionary must exist! An organized, authoritative collection of attributes is equivalent to the old “Card Catalog” in a library, or the “Parts and List Description” section of an inventory system. It must contain all the known usage rules and an acceptable list of values. All known caveats and anomalies must be described. </li></ul></ul>
  78. 78. Accuracy <ul><li>The degree to which a piece of data is correct and believable. The value can be compared to the original source for correctness, but it can still be unbelievable. Conformed values can be compared to lists of reference values. </li></ul><ul><ul><li>Zip code 35244 is correct and believable. </li></ul></ul><ul><ul><li>Zip code 3524B is incorrect and unbelievable. </li></ul></ul><ul><ul><li>Zip code 35290 is incorrect but believable (it looks right, but does not exist). </li></ul></ul><ul><ul><li>AL is a correct and believable state code (compared to the list of valid state codes) </li></ul></ul><ul><ul><li>A1 is an incorrect and unbelievable state code (compared to the list of valid state codes) </li></ul></ul><ul><ul><li>AA is an incorrect but believable state code (compared to the list of valid state codes) </li></ul></ul>
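The correct/believable distinction above can be sketched as two separate checks: a format test for "believable" and a reference-list lookup for "correct". The reference set below is a tiny hypothetical stand-in for a real postal reference file, and the function name is an assumption.

```python
import re

# Hypothetical reference list; a real check would use a full postal reference file.
KNOWN_ZIP_CODES = {"35244", "48108"}

def classify_zip(zip_code):
    """Return (correct, believable) for a zip code.

    believable: the value has the expected 5-digit shape.
    correct:    the value appears in the reference list.
    """
    believable = re.fullmatch(r"\d{5}", zip_code) is not None
    correct = zip_code in KNOWN_ZIP_CODES
    return correct, believable

for code in ["35244", "3524B", "35290"]:
    correct, believable = classify_zip(code)
    print(code, "correct" if correct else "incorrect",
          "believable" if believable else "unbelievable")
```

The same two-step check applies to state codes: a two-letter format test, then a lookup against the list of valid state codes.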
  79. 79. Completeness <ul><li>The degree to which all information expected is received. This is measured in two ways: </li></ul><ul><ul><li>Do we have all the records that were sent to us? </li></ul></ul><ul><ul><ul><li>Counts from the provider can be compared against counts of data received. </li></ul></ul></ul><ul><ul><li>Did the provider send us all the records that they have or just some of them? </li></ul></ul><ul><ul><li>This is difficult to measure without auditing and trending the source. </li></ul></ul><ul><ul><ul><li>How would we know that the provider had a ‘glitch’ in their system and records were missing from our feed? </li></ul></ul></ul>
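The first check, comparing the provider's count against the count received, can be sketched as a simple reconciliation. The idea that the provider supplies a control count with each batch is an assumption, as are the function and field names.

```python
def reconcile_counts(control_count, received_rows):
    """Compare the provider's stated record count with what actually arrived."""
    received = len(received_rows)
    return {
        "expected": control_count,
        "received": received,
        "missing": control_count - received,
        "complete": received == control_count,
    }

# Hypothetical batch: the provider says it sent 5 records, but only 4 arrived.
batch = ["rec1", "rec2", "rec3", "rec4"]
result = reconcile_counts(5, batch)
print(result)
```

This only answers the first question. It cannot detect records the provider never sent, which is why trending counts over time is also needed.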
  80. 80. Measures of Completeness <ul><li>The following questions can be answered for counts: </li></ul><ul><ul><li>How many records per batch by provider? </li></ul></ul><ul><ul><li>How does this batch’s count compare to the previous month’s average? </li></ul></ul><ul><ul><li>How does this batch’s count compare to the same time period last year? </li></ul></ul><ul><ul><li>How does this batch’s count compare to a 12-month average? </li></ul></ul>
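One way to answer these count questions is to compute the percentage deviation of the current batch from a historical average and flag large swings. The history values and the 20% tolerance below are illustrative assumptions; an appropriate threshold depends on how volatile each provider's volume normally is.

```python
from statistics import mean

def pct_deviation(current_count, history):
    """Percentage difference between the current count and the historical mean."""
    average = mean(history)
    return 100 * (current_count - average) / average

def is_suspicious(current_count, history, tolerance_pct=20):
    """Flag a batch whose count deviates from history by more than the tolerance."""
    return abs(pct_deviation(current_count, history)) > tolerance_pct

# Hypothetical record counts for one provider's previous batches.
previous_batches = [1000, 1100, 900]
current_batch = 700

deviation = pct_deviation(current_batch, previous_batches)
print(f"Deviation from average: {deviation:.1f}%")
print("Flag for review:", is_suspicious(current_batch, previous_batches))
```

The same function works for each comparison above by changing what goes into `history`: last month's batches, the same period last year, or the trailing 12 months.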
  81. 81. Coverage <ul><li>The degree to which all fields are populated with data. Columns of data can be measured for % of missing values and compared to expected % missing. </li></ul><ul><ul><li>e.g., Sale Type Code is expected to be populated 100% by all sources for Sales documents. </li></ul></ul>
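A sketch of the coverage measure: compute the percentage of missing values in a column and compare it with the expected percentage. The column names, sample rows, and the rule that blank strings count as missing are assumptions for this example.

```python
def missing_pct(rows, column):
    """Percentage of rows where the column is absent, None, or blank."""
    missing = sum(
        1 for row in rows
        if row.get(column) is None or str(row.get(column)).strip() == ""
    )
    return 100 * missing / len(rows)

# Hypothetical sales documents; sale_type_code is expected to be 100% populated.
sales_docs = [
    {"doc_id": 1, "sale_type_code": "R"},
    {"doc_id": 2, "sale_type_code": ""},
    {"doc_id": 3, "sale_type_code": "W"},
    {"doc_id": 4, "sale_type_code": "R"},
]

EXPECTED_MISSING_PCT = 0  # i.e. expected to be populated 100% of the time
actual = missing_pct(sales_docs, "sale_type_code")
print(f"sale_type_code missing: {actual:.1f}% (expected {EXPECTED_MISSING_PCT}%)")
print("Coverage issue:", actual > EXPECTED_MISSING_PCT)
```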
  82. 82. Timeliness <ul><li>The degree to which provider files are received, processed and made available for assembly into data marts. Expected receipt times are compared to actual receipt times. </li></ul><ul><ul><li>Late or missing files are flagged and reported on. </li></ul></ul><ul><ul><li>Proactive alerts trigger communication with the provider contact. </li></ul></ul><ul><ul><li>Proactive communication can alert downstream assembly processes. </li></ul></ul><ul><ul><li>Excessive lag times can be reported to providers in order to request delivery sooner. </li></ul></ul>
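Comparing expected and actual receipt times can be sketched with the standard `datetime` module. The provider names, timestamps, and 30-minute grace period are hypothetical; a real process would read receipt times from a file-arrival log.

```python
from datetime import datetime, timedelta

def receipt_status(expected, actual, grace=timedelta(minutes=30)):
    """Classify a provider file as on_time, late, or missing, with its lag."""
    if actual is None:
        return "missing", None
    lag = actual - expected
    if lag > grace:
        return "late", lag
    return "on_time", lag

# Hypothetical schedule: every file is expected by 06:00.
expected = datetime(2024, 1, 15, 6, 0)
feeds = {
    "provider_a": datetime(2024, 1, 15, 5, 50),
    "provider_b": datetime(2024, 1, 15, 9, 15),
    "provider_c": None,  # nothing received
}

report = {name: receipt_status(expected, actual) for name, actual in feeds.items()}
for name, (status, lag) in report.items():
    print(name, status, lag)
```

Late and missing entries in `report` are the ones that would be flagged and trigger contact with the provider.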
  83. 83. Validity <ul><li>The degree to which the relationships between different data are valid. </li></ul><ul><ul><li>Zip code 48108 is accurate. State code AL is accurate. Zip code 48108 is invalid for the state of AL. </li></ul></ul>
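Validity is a cross-field check: each value can pass on its own while the combination fails. The sketch below uses a tiny hypothetical zip-to-state lookup; a real check would use a complete postal reference table.

```python
# Hypothetical lookup of which state each zip code belongs to.
ZIP_TO_STATE = {"35244": "AL", "48108": "MI"}

def is_valid_pair(zip_code, state_code):
    """True only if the zip code is known and belongs to the given state."""
    return ZIP_TO_STATE.get(zip_code) == state_code

print(is_valid_pair("48108", "MI"))
print(is_valid_pair("48108", "AL"))
```

The second call corresponds to the slide's example: both values are individually accurate, but the relationship between them is invalid.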
  84. 84. Data Quality Measures <ul><li>How do you know if your data is of high quality? </li></ul><ul><ul><li>Agree upon the measures that are important to the organization and consistently report them out. </li></ul></ul><ul><ul><li>Use the data measures to communicate and inform. </li></ul></ul>
  85. 85. Measurement Informatica.com
  86. 86. Exercise: Changing the Data Warehouse (1 of 2) <ul><li>So, you need to add a new source </li></ul><ul><li>Or, you need to receive additional data from an existing source </li></ul><ul><li>Could be the data quality is an issue </li></ul><ul><li>Could be that the business rules weren’t defined adequately </li></ul>
  87. 87. Brainstorming Group Exercise (2 of 2) <ul><li>The data changed due to DQ measures – what do we have to do in the DW? </li></ul><ul><ul><li>What has to change </li></ul></ul><ul><ul><li>Estimate the change </li></ul></ul><ul><ul><li>Implement the change </li></ul></ul><ul><ul><li>How do we make sure it doesn’t happen again? </li></ul></ul><ul><ul><ul><li>What DQ measure can help? </li></ul></ul></ul>
  88. 88. MDM Master Data Management <ul><li>The newest ‘buzz word’ </li></ul>
  89. 89. Exercise: <ul><li>What processes need to be put in place for MDM </li></ul><ul><ul><li>Who needs to be involved </li></ul></ul><ul><ul><li>Who owns it </li></ul></ul>
  90. 90. SECTION 4 <ul><li>BI Tools </li></ul><ul><li>BICC </li></ul><ul><li>Jobs </li></ul><ul><li>Certifications </li></ul>
  91. 91. SECTION 4 <ul><li>What is business intelligence </li></ul><ul><ul><li>What are BI tools </li></ul></ul><ul><ul><li>What is a business intelligence competency center (BICC) </li></ul></ul><ul><li>What jobs are available </li></ul><ul><ul><li>Certifications </li></ul></ul>
  92. 92. BI Tools
  93. 93. BICC
  94. 94. Jobs in Data Warehousing
  95. 95. Certifications in DW
  96. 96. References <ul><li>Data Management and Integration Topic, Gartner, http://www.gartner.com/it/products/research/asset_137953_2395.jsp </li></ul><ul><ul><li>Articles: Key Issues for Implementing an Enterprise wide Data Quality Improvement Project, 2008, Key Issues for Enterprise Information Management Initiatives, 2008, Key Issues for Establishing Information Governance Policies, Processes and Organization, 2008 </li></ul></ul><ul><li>Data Quality Management, The Most Critical Initiative You Can Implement, J. G. Geiger, http://www2.sas.com/proceedings/sugi29/098-29.pdf </li></ul><ul><li>Information Management, How to Measure and Monitor the Quality of Master Data, http://www.information-management.com/issues/2007_58/master_data_management_mdm_quality-10015358-1.html?ET=informationmgmt:e963:2046487a:&st=email </li></ul><ul><li>Data Management Assn of Michigan Bits & Bytes, Critical Data Quality Controls, D Jeffries, Fall 2006 http://dama-michigan.org/2%20Newsletter.pdf </li></ul>