11. What is ETL?
- Extract: extract the relevant data from the sources.
- Transform: transform the data into the DW format; build DW keys, etc.; cleanse the data.
- Load: load the data into the DW; build aggregates, etc.
https://eprints.kfupm.edu.sa/74341/1/74341.pdf
12. [Diagram: Data Warehouse Architecture — data sources (operational DBs and other sources) feed an Extract/Transform/Load/Refresh layer into the data storage tier (data warehouse, data marts, metadata), which serves front-end tools for analysis, query/reports and data mining.]
http://infolab.stanford.edu/warehousing/
39. Data in the DW must be:
1] Precise: DW data must match known numbers, or an explanation is needed.
2] Complete: the DW has all relevant data, and the users know it.
3] Consistent: no contradictory data; aggregates fit with detail data.
4] Unique: the same thing is called the same and has the same key (e.g., customers).
5] Timely: data is updated "frequently enough", and the users know when.
41. Construct programs that check data quality:
- Are totals as expected?
- Do results agree with an alternative source?
- How many NULL values are there?
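The checks above can be sketched as a small Python program over an in-memory SQLite table. The `sales` table, its rows, and the expected total are all invented here purely for illustration:

```python
import sqlite3

# Hypothetical sales table used to illustrate automated quality checks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("p1", 100.0), ("p2", 250.0), ("p3", None)])

# Check 1: does the DW total agree with the figure from an
# alternative source (here an assumed number from the operational system)?
expected_total = 350.0
(dw_total,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()
total_ok = abs(dw_total - expected_total) < 0.01

# Check 2: how many NULL values slipped through the ETL?
(null_count,) = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE amount IS NULL").fetchone()

print(total_ok, null_count)
```

Note that SQL's `SUM` ignores NULLs, so the total check can pass even when the NULL-count check flags missing values; that is exactly why both checks are worth running.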
56. Loading Our ETL Results into the Data Repository: loading is just a matter of writing the output of the last XSLT transform step into the etl-target.rdbxml map we built earlier.
58. References:
- Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber
- http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/Lecture4.pdf
- http://en.wikipedia.org/wiki/Online_Analytical_Processing
- http://www.cs.sfu.ca/~han
- http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
- http://www.fmt.vein.hu/softcom/dw
60. OLAP Cube
Data warehouse and OLAP tools are based on a multidimensional data model which views data in the form of a data cube. An OLAP (Online Analytical Processing) cube is a data structure that allows fast analysis of data. The OLAP cube consists of numeric facts called measures, which are categorized by dimensions.
- Dimensions: the perspectives or entities with respect to which an organization wants to keep records.
- Facts: the quantities by which we want to analyze relations between dimensions.
The cube metadata may be created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table, and dimensions are derived from the dimension tables.
Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
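A minimal sketch of the idea of "measures categorized by dimensions": cells keyed by dimension tuples, with a measure value in each cell. The products, cities, quarters, and unit counts below are invented for illustration only:

```python
from collections import defaultdict

# Tiny in-memory "cube": the measure units_sold is categorized by
# three dimensions (product, city, quarter). All values are made up.
cells = {
    ("TV",    "Vancouver", "Q1"): 100,
    ("TV",    "Toronto",   "Q1"): 150,
    ("Phone", "Vancouver", "Q1"): 80,
    ("TV",    "Vancouver", "Q2"): 120,
}

# Summing the measure over one dimension (here: city) yields a
# lower-dimensional view of the same data -- the essence of aggregation
# along a cube dimension.
by_product_quarter = defaultdict(int)
for (product, city, quarter), units in cells.items():
    by_product_quarter[(product, quarter)] += units

print(by_product_quarter[("TV", "Q1")])  # 250
```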
61. Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Each of the elements of a dimension can be summarized using a hierarchy. The hierarchy is a series of parent-child relationships, typically where a parent member represents the consolidation of the members which are its children. Parent members can be further aggregated as the children of another parent.
Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
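The parent-child consolidation can be sketched as one step of a time hierarchy (month → quarter); the months and sales figures are assumptions for illustration:

```python
# One level of a concept hierarchy on the time dimension: each quarter
# (parent) consolidates its child months. The mapping and sales numbers
# are illustrative only.
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
monthly_sales = {"Jan": 10, "Feb": 20, "Mar": 5, "Apr": 7}

# A parent member's value is the consolidation (here: sum) of its children.
quarterly_sales = {}
for month, amount in monthly_sales.items():
    quarter = month_to_quarter[month]
    quarterly_sales[quarter] = quarterly_sales.get(quarter, 0) + amount

print(quarterly_sales)  # {'Q1': 35, 'Q2': 7}
```

The same pattern repeats at every level: quarterly values would consolidate into yearly ones by another quarter → year mapping.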
62. Example – Star Schema
Dimension tables:
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city, province_or_state, country)
Fact table:
- Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold), with units_sold as the measure
Reference: http://www.cs.sfu.ca/~han
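The star schema above can be expressed as DDL; here is a sketch using SQLite from Python. The column names follow the slide, while the column types are assumptions:

```python
import sqlite3

# Star schema from the slide as SQLite DDL (types are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (
    time_key INTEGER PRIMARY KEY,
    day TEXT, day_of_the_week TEXT, month TEXT, quarter TEXT, year INTEGER
);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT
);
CREATE TABLE branch (
    branch_key INTEGER PRIMARY KEY,
    branch_name TEXT, branch_type TEXT
);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT
);
-- The fact table sits at the center of the star: it references every
-- dimension table and carries the measure (units_sold).
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time(time_key),
    item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER
);
""")

tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)
```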
63. Example: hierarchical summarization paths for the dimensions Item, Location, and Time:
- Item: Item < Brand < Type
- Location: Street < City < Country < Region
- Time: Day < Month < Quarter < Year (with Day < Week as an alternative path)
Reference: http://www.cs.sfu.ca/~han
68. Pivot: a visualization operation which rotates the data axes in view in order to provide an alternative presentation of the data, i.e., selecting a different dimension (orientation) for analysis [http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]. E.g., a pivot operation where the location and item axes in a 2D slice are rotated. Other examples: rotating the axes in a 3D cube; transforming a 3D cube into a series of 2D planes.
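For a 2D slice, rotating the location and item axes amounts to transposing the cross-tab. A minimal sketch (cities, items, and counts are invented for illustration):

```python
# A 2D slice keyed (city, item) -> units_sold; values are illustrative.
slice_2d = {
    ("Vancouver", "TV"): 100,
    ("Vancouver", "Phone"): 80,
    ("Toronto", "TV"): 150,
}

# Pivot: swap the two axes so the same data is presented item-by-city
# instead of city-by-item. No values change, only the orientation.
pivoted = {(item, city): units for (city, item), units in slice_2d.items()}

print(pivoted[("TV", "Toronto")])  # 150
```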
69. Working Example (2)
Dimension tables:
- Market(Market_ID, City, Region)
- Product(Product_ID, Name, Category)
- Time(Time_ID, Week, Month, Quarter)
Fact table:
- Sales(Market_ID, Product_ID, Time_ID, Amount)
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf
70. Roll up & Drill down on Working Example (2)
Drill down sales on Market from region to city (build the city-level aggregate):
SELECT S.Product_ID, M.City, SUM(S.Amount)
INTO City_Sales
FROM Sales S, Market M
WHERE M.Market_ID = S.Market_ID
GROUP BY S.Product_ID, M.City
Roll up sales on Market from city to region:
SELECT T.Product_ID, M.Region, SUM(T.Amount)
FROM City_Sales T, Market M
WHERE T.City = M.City
GROUP BY T.Product_ID, M.Region
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf
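The two queries above can be run against an in-memory SQLite database; the sample market and sales rows are invented for illustration, and `SELECT ... INTO` (not supported by SQLite) is emulated with `CREATE TABLE ... AS SELECT`:

```python
import sqlite3

# Working Example (2) schema with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market (Market_ID INTEGER, City TEXT, Region TEXT);
CREATE TABLE Sales (Market_ID INTEGER, Product_ID INTEGER,
                    Time_ID INTEGER, Amount REAL);
INSERT INTO Market VALUES (1, 'Leeds', 'North'), (2, 'York', 'North'),
                          (3, 'London', 'South');
INSERT INTO Sales VALUES (1, 1002, 1, 10), (2, 1002, 1, 20),
                         (3, 1002, 1, 5);
""")

# City-level aggregate (SELECT ... INTO emulated via CREATE TABLE AS).
conn.execute("""
CREATE TABLE City_Sales AS
SELECT S.Product_ID, M.City, SUM(S.Amount) AS Amount
FROM Sales S, Market M
WHERE M.Market_ID = S.Market_ID
GROUP BY S.Product_ID, M.City
""")

# Roll up from city to region.
rows = conn.execute("""
SELECT T.Product_ID, M.Region, SUM(T.Amount)
FROM City_Sales T, Market M
WHERE T.City = M.City
GROUP BY T.Product_ID, M.Region
""").fetchall()

print(rows)
```

With the sample rows, product 1002 rolls up to 30 in the North region (Leeds 10 + York 20) and 5 in the South.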
71. Slice & Dice on Working Example (2)
Dicing sales in the time dimension (e.g. total sales for selected products, per quarter, within week 12):
SELECT S.Product_ID, T.Quarter, SUM(S.Amount)
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID
  AND T.Week = 'Week12'
  AND (S.Product_ID = '1002' OR S.Product_ID = '1003')
GROUP BY T.Quarter, S.Product_ID
Slicing the data cube in the time dimension (e.g. choosing sales only in week 12):
SELECT S.*
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID AND T.Week = 'Week12'
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf
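The slice query above can likewise be checked against an in-memory SQLite database; the time and sales rows below are assumptions for illustration:

```python
import sqlite3

# Working Example (2) with a made-up week-12 and week-20 time entry.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time (Time_ID INTEGER, Week TEXT, Month TEXT, Quarter TEXT);
CREATE TABLE Sales (Market_ID INTEGER, Product_ID INTEGER,
                    Time_ID INTEGER, Amount REAL);
INSERT INTO Time VALUES (1, 'Week12', 'March', 'Q1'),
                        (2, 'Week20', 'May',   'Q2');
INSERT INTO Sales VALUES (1, 1002, 1, 10), (1, 1003, 1, 15),
                         (1, 1002, 2, 99);
""")

# Slice: keep only the week-12 plane of the cube.
sliced = conn.execute("""
SELECT S.*
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID AND T.Week = 'Week12'
""").fetchall()

print(sliced)
```

Only the two rows with Time_ID 1 (week 12) survive the slice; the week-20 row is excluded.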
72. Other Operations
- Drill across: executes queries involving (across) more than one fact table.
- Drill through: makes use of relational SQL facilities to drill through the bottom level of the cube to its back-end relational tables.
Reference: http://www.cs.sfu.ca/~han
73. 10 Challenging Problems in Data Mining Research. Xindong Wu, Department of Computer Science, University of Vermont, 33 Colchester Avenue, Burlington, Vermont 05405, USA (xwu@cs.uvm.edu). Qiang Yang, Department of Computer Science, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong, China. Presented at ICDM '05, The Fifth IEEE International Conference on Data Mining.
84. 3. Sequential and Time Series Data. Example: real time-series data obtained from wireless sensors in the Hong Kong UST CS department hallway.
85. 4. Mining Complex Knowledge from Complex Data. An important type of complex knowledge comes in the form of graphs. Challenge: more research is required on discovering graphs and structured patterns from large data. Data that are not i.i.d. (independent and identically distributed): many objects are not independent of each other, and are not of a single type. Challenge: data mining systems are required that can soundly mine the rich structure of relations among objects, e.g. interlinked Web pages, social networks, metabolic networks in the cell.
88. Entities/nodes are distributed; hence, distributed means of identification are desired.
94. Multi-agent Data Mining: agents are often distributed and have proactive and reactive features. http://www-ai.cs.uni-dortmund.de/auto?self=$ejr31cyc http://www.csc.liv.ac.uk/~ali/wp/MADM.pdf
95. 7. Data Mining for Biological and Environmental Problems. Mining biological data is an extremely important problem, e.g. HIV vaccine design. Molecular biology examples: DNA chemical properties, 3D structures, functional properties.
100. How to do data mining for protection of security and privacy?
101. Knowledge integrity assessment
- Data are intentionally modified from their original version, in order to misinform the recipients or for privacy and security reasons.
- Development of measures is needed to evaluate the knowledge integrity of a collection of: data; knowledge and patterns.
103. 10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
- Data is non-static and constantly changing, e.g. data collected in 2000, then 2001, 2002, and so on; the problem is to correct the bias.
- Dealing with unbalanced and cost-sensitive data: there is much information on costs and benefits, but no overall model of profit and loss.
- Data may evolve with a bias introduced by sampling.
ICML 2003 Workshop on Learning from Imbalanced Data Sets
105. Conclusion There is still a lack of timely exchange of important topics in the community as a whole. These problems are sampled from a small, albeit important, segment of the community. The list should obviously be a function of time for this dynamic field.