ETL & Basic OLAP Operations
CSE – 590 Data Mining
Prof. Anita Wasilewska, SUNY Stony Brook
Presented by:
    -> Preeti Kudva (106887833)
    -> Kinjal Khandhar (106878039)
References:
Presentation Slides of Prof. Anita Wasilewska.
http://en.wikipedia.org/wiki/Extract,_transform,_load
Ralph Kimball, Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for  Extracting, Cleaning, Conforming and Delivering Data
Conceptual modeling for ETL processes by Panos Vassiliadis,Alkis Simitsis,Spiros Skiadopoulos.
http://en.wikipedia.org/wiki/Category:ETL_tools
http://www.1keydata.com/datawarehousing/tooletl.html
http://www.bi-bestpractices.com/view-articles/4738
http://www.computerworld.com/databasetopics/data/story/0,10801,80222,00.html
What is ETL?
Extract
    - Extract relevant data.
Transform
    - Transform data to DW format.
    - Build DW keys, etc.
    - Cleansing of data.
Load
    - Load data into DW.
    - Build aggregates, etc.
https://eprints.kfupm.edu.sa/74341/1/74341.pdf
Data Warehouse Architecture
[Figure: Data Sources (operational DBs, other sources) -> Extract, Transform, Load, Refresh -> Data Storage (data warehouse, data marts, metadata) -> Serve -> Front-End Tools (analysis, query, reports, data mining)]
http://infolab.stanford.edu/warehousing/
Extract
Extract
- Extract data from different data source formats such as flat files and relational database systems.
- Convert the data into a specific format for transformation processing.
- Parse the extracted data, which results in a check of whether the data meets the expected pattern/structure.
http://en.wikipedia.org/wiki/Extract,_transform,_load
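The parse-and-check step can be sketched as follows. This is only a minimal illustration, assuming a hypothetical comma-separated extract with the columns `id`, `name` and `hire_date`; the sample rows, the patterns and the `extract_rows` helper are invented for this sketch and are not part of the original slides.

```python
import csv
import io
import re

# Hypothetical flat-file extract; in practice this would be read from a file.
RAW = """id,name,hire_date
1,Guiles Makenzie,20060703
2,Bad Row,2006-07-03
3,,20060710
"""

# Expected pattern per column (assumptions for this sketch).
PATTERNS = {
    "id": re.compile(r"\d+"),
    "name": re.compile(r".+"),
    "hire_date": re.compile(r"\d{8}"),  # CCYYMMDD
}

def extract_rows(text):
    """Parse the extract and split rows into accepted and rejected."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        ok = all(p.fullmatch(row[col] or "") for col, p in PATTERNS.items())
        (accepted if ok else rejected).append(row)
    return accepted, rejected

accepted, rejected = extract_rows(RAW)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Rejected rows are kept rather than dropped silently, so they can be inspected or corrected before the transform step.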
Types of Data Sources
    - Snapshot sources: provide only a full copy of the source, e.g., files
    - Specific sources: each is different, e.g., legacy systems
    - Logged sources: write a change log, e.g., DB log
    - Queryable sources: provide a query interface, e.g., RDBMS

    - Replicated sources: publish/subscribe mechanism
    - Call back sources: call external code (ETL) when changes occur
    - Internal action sources: only internal actions when changes occur, e.g., DB triggers
https://intranet.cs.aau.dk/fileadmin/user_upload/Education/Courses/2009/DWML/slides/DW4_ETL.pdf
Extract from Operational System
    – Create/Import Data Sources definition
    – Define Stage or Work Areas
    – Validate Connectivity
    – Preview/Analyze Sources
    – Define Extraction Scheduling
        - Determine Extract Windows for source system
        - Batch Extract (overnight, weekly, monthly)
        - Continuous extracts (trigger on source table)

    – Connect to the predefined Data Sources as scheduled
    – Get raw data and save it locally in the workspace DB
Transform
Transformation – a series of rules/functions.
Common transformations are:
- Cleanse (automated)
    - Synonym substitutions.
    - Spelling corrections.
    - Encoding free-form values (map Male to 1 & Mr to M)
- Merge/Purge (join data from multiple sources).
- Aggregate (e.g., rollup)
- Calculate (sale_amt = qty * price)
- Data type conversion
- Data content audit
- Null value handling (null = not to load)
- Customized transformation (based on user).
http://www.bi-bestpractices.com/view-articles/4738
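A few of these transformations can be sketched in Python. This is only an illustration under stated assumptions: the row layout, the `transform` helper and the code for "female" are hypothetical; only "Male -> 1" and `sale_amt = qty * price` come from the slide.

```python
# "Male -> 1" is from the slide; the code for "female" is an assumption.
GENDER_CODES = {"male": 1, "female": 2}

def transform(row):
    """Return the transformed row, or None if the row must not be loaded."""
    # Null value handling: a missing quantity or price means "do not load".
    if row.get("qty") is None or row.get("price") is None:
        return None
    out = dict(row)
    # Data type conversion: source values arrive as strings.
    out["qty"] = int(row["qty"])
    out["price"] = float(row["price"])
    # Encoding free-form values.
    out["gender"] = GENDER_CODES.get(row["gender"].strip().lower())
    # Calculate.
    out["sale_amt"] = out["qty"] * out["price"]
    return out

rows = [
    {"gender": "Male", "qty": "3", "price": "2.50"},
    {"gender": "female", "qty": None, "price": "1.00"},
]
loaded = [t for t in map(transform, rows) if t is not None]
print(loaded)
```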
Common Transformations (contd.)
    - EBCDIC -> ASCII/Unicode.
    - String manipulations.
    - Date/time format conversions.
        -- E.g., unix time 1201928400 = what time?

    - To the desired DW format.
    - Depending on source format.

    - Table matches production keys to surrogate DW keys.
    - Correct handling of history - especially for total reload.
https://intranet.cs.aau.dk/fileadmin/user_upload/Education/Courses/2009/DWML/slides/DW4_ETL.pdf
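Two of these points lend themselves to a short sketch: the unix-time question and the table that matches production keys to surrogate keys. The `KeyMap` class and the sample production keys are hypothetical, and the timestamp is interpreted as UTC, which is an assumption (the slide does not name a time zone).

```python
from datetime import datetime, timezone

# Date/time conversion: the slide's example value, interpreted as UTC.
when = datetime.fromtimestamp(1201928400, tz=timezone.utc)
print(when.isoformat())  # 2008-02-02T05:00:00+00:00

class KeyMap:
    """Maps production (source-system) keys to surrogate DW keys."""

    def __init__(self):
        self._map = {}

    def surrogate(self, production_key):
        # Assign the next integer the first time a production key is seen;
        # later lookups return the same surrogate key.
        if production_key not in self._map:
            self._map[production_key] = len(self._map) + 1
        return self._map[production_key]

keys = KeyMap()
print(keys.surrogate("CUST-A17"), keys.surrogate("CUST-B02"), keys.surrogate("CUST-A17"))
```

In a real warehouse the mapping would be persisted as a table so that a total reload reuses the same surrogate keys.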
Cleansing
BI does not work on "raw" data
    - Pre-processing necessary for BI analysis.

    - Spellings, codings, …

    - Production keys, comments, …

    - City name instead of ZIP code, e.g., Aalborg Centrum vs. DK-9000

    - E.g., customer data from customer address, customer name, …
Aalborg University 2009 - DWML course
Cleansing
    - They are hard to understand in query/analysis operations.

    - Normal, abnormal, outside bounds, impossible, …
    - Facts can be taken in/out of analyses.

    - Use NULLs only for measure values (estimates instead?)
    - Use a special dimension key (i.e., surrogate key value) for NULL dimension values.
      E.g., for the time dimension, instead of NULL, use special key values to represent "Date not known", "Soon to happen".
Aalborg University 2009 - DWML course
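The special-key idea for NULL dimension values might look like this in code. All names and key values here (`DATE_NOT_KNOWN`, `SOON_TO_HAPPEN`, `TIME_KEYS`, `time_key`) are assumptions made for the sketch, not values from the course material.

```python
# Special surrogate keys for missing dates (values are assumptions).
DATE_NOT_KNOWN = -1
SOON_TO_HAPPEN = -2

# Hypothetical time-dimension lookup: ISO date -> surrogate key.
TIME_KEYS = {"2009-03-16": 101, "2009-03-17": 102}

def time_key(date, expected_later=False):
    """Return a time-dimension key that is never NULL."""
    if date is None:
        return SOON_TO_HAPPEN if expected_later else DATE_NOT_KNOWN
    return TIME_KEYS[date]

facts = [
    {"order": 1, "ship_date": "2009-03-16"},
    {"order": 2, "ship_date": None},
]
keyed = [
    {"order": f["order"], "ship_time_key": time_key(f["ship_date"])}
    for f in facts
]
print(keyed)
```

Because every fact row now joins to some row of the time dimension, queries do not lose facts through NULL join keys.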
Data Quality – most important
Data in DW must be:
1] Precise
    - DW data must match known numbers - or explanation needed.
2] Complete
    - DW has all relevant data and the users know.
3] Consistent
    - No contradictory data: aggregates fit with detail data.
4] Unique
    - The same thing is called the same and has the same key (customers).
5] Timely
    - Data is updated "frequently enough" and the users know when.
Improving Data Quality
    - Responsibility for data quality.
    - Includes manual inspections and corrections!

Construct programs that check data quality
    - Are totals as expected?
    - Do results agree with alternative source?
    - Number of NULL values?
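A program of this kind could be sketched as below, using Python's built-in `sqlite3` module as a stand-in for the warehouse. The `sales` table, its contents, the `check_quality` helper and the expected total are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (product_id TEXT, amount REAL);
    INSERT INTO sales VALUES ('1002', 10.0), ('1003', 5.5), ('1002', NULL);
""")

def check_quality(con, expected_total, tolerance=0.01):
    """Compare the loaded total with a known number and count NULLs."""
    total, nulls = con.execute(
        "SELECT COALESCE(SUM(amount), 0), "
        "       SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) "
        "FROM sales"
    ).fetchone()
    return {
        "total_ok": abs(total - expected_total) <= tolerance,
        "null_count": nulls,
    }

report = check_quality(con, expected_total=15.5)
print(report)
```

The expected total would normally come from an alternative source, such as a report produced by the operational system.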
Transformation in Operational System
    – Specify Criteria/Filter for aggregation.
    – Define operators (mostly Set/SQL based).
    – Map columns using operators/lookups.
    – Define other transformation rules.
    – Define mappings and/or add new fields.

    – Transform (cleanse, consolidate, apply business rules, de-normalize/normalize) extracted data by applying the operators mapped at design time.
    – Aggregate (create & populate raw table).
    – Create & populate staging table.
Load
Load
    - Loading chunks is much faster than total load.

    - Large overhead (optimization, locking, etc.) for every SQL call.
    - DB load tools are much faster.

    - Drop index and rebuild after load
    - Can be done per index partition

    - Dimensions can be loaded concurrently
    - Fact tables can be loaded concurrently
    - Partitions can be loaded concurrently
http://en.wikipedia.org/wiki/Extract,_transform,_load
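The "drop index, load, rebuild" pattern and the idea of avoiding per-row overhead can be sketched with `sqlite3`. This is a small stand-in, not a real DW bulk loader: the table, the index name and the generated rows are hypothetical, and no timing claims are made.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sales (market_id INTEGER, product_id INTEGER, amount REAL)"
)
con.execute("CREATE INDEX idx_sales_product ON sales (product_id)")

# Hypothetical load file: 100 markets x 100 products.
rows = [(m, p, 1.0) for m in range(100) for p in range(100)]

# Drop the index so it is not maintained row by row during the load.
con.execute("DROP INDEX idx_sales_product")

# Load all rows in a single transaction with one batched call.
with con:  # commits once when the block ends
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Rebuild the index after the load.
con.execute("CREATE INDEX idx_sales_product ON sales (product_id)")

row_count = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```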
Load
    - Referential integrity and data consistency must be ensured before loading (Why?)
        -- Because they won't be checked in the DW again
    - Can be done by loader

    - Can be built and loaded at the same time as the detail data.

    - Load without log.
    - Sort load file first.
    - Make only simple transformations in loader.
    - Use loader facilities for building aggregates.
http://en.wikipedia.org/wiki/Extract,_transform,_load
Load in Operational Systems
    – Design Warehouse.
    – Map Staging Data to fact or dimension table attributes.

    – Publish Staging data to Data mart (update dimension tables along with the fact tables).
ETL Tools
    - Oracle Warehouse Builder
    - IBM DB2 Warehouse Manager
    - Microsoft Integration Services

    - Data modeling
    - ETL code generation
    - Scheduling DW jobs

    - Choose based on your own needs
http://en.wikipedia.org/wiki/Category:ETL_tools
ETL Example
http://www.stylusstudio.com/etl/
Example of how ETL works. Consider the HR Department database:
Extract Step for the Use-Case:
Data can be extracted using XML converters: select the table and choose the dBase III converter, which transfers the data into XML.
The result of this extraction will be an XML file similar to this:
    <?xml version="1.0" encoding="UTF-8"?>
    <table date="20060731" rows="5">
        <row row="1">
            <NAME>Guiles, Makenzie</NAME>
            <STREET>145 Meadowview Road</STREET>
            <CITY>South Hadley</CITY>
            <STATE>MA</STATE>
            <ZIP>01075</ZIP>
            <DEAR_WHO>Macy</DEAR_WHO>
            <TEL_HOME>(413)555-6225</TEL_HOME>
            <BIRTH_DATE>19770201</BIRTH_DATE>
            <HIRE_DATE>20060703</HIRE_DATE>
            <INSIDE>yes</INSIDE>
        </row>
        ...
    </table>
Transforming Data into the Target Form
In a production ETL operation, each step would likely be more complicated, and/or would use different technologies or methods.
1] Convert the dates from CCYYMMDD into CCYY-MM-DD (the "ISO 8601" format) [etl-code-1.xsl]
2] Split the first and last names [etl-code-2.xsl]
3] Assign the manager based on inside or external sales [etl-code-3.xsl]
4] Map the data to the new schema [etl-code-4.xsl]
This mapping can be done by using XSLT mapper.
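The slides perform these steps with XSLT stylesheets that are not reproduced here. As a rough stand-in, the first three steps can be sketched in Python on a shortened version of the extracted row. The output field names, the manager labels and the assumption that `NAME` is stored as "Last, First" are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the extracted row.
ROW = """<row row="1">
    <NAME>Guiles, Makenzie</NAME>
    <BIRTH_DATE>19770201</BIRTH_DATE>
    <HIRE_DATE>20060703</HIRE_DATE>
    <INSIDE>yes</INSIDE>
</row>"""

def iso_date(ccyymmdd):
    """Step 1: CCYYMMDD -> CCYY-MM-DD."""
    return f"{ccyymmdd[:4]}-{ccyymmdd[4:6]}-{ccyymmdd[6:8]}"

def split_name(name):
    """Step 2: 'Last, First' -> (first, last); the order is an assumption."""
    last, first = [part.strip() for part in name.split(",", 1)]
    return first, last

def transform(row_xml):
    row = ET.fromstring(row_xml)
    first, last = split_name(row.findtext("NAME"))
    inside = row.findtext("INSIDE") == "yes"
    return {
        "first_name": first,
        "last_name": last,
        "birth_date": iso_date(row.findtext("BIRTH_DATE")),
        "hire_date": iso_date(row.findtext("HIRE_DATE")),
        # Step 3: placeholder manager labels, not taken from the source.
        "manager": "inside-sales-manager" if inside else "external-sales-manager",
    }

record = transform(ROW)
print(record)
```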
Output of above steps + etl-target.rdbxml gives:
Loading Our ETL Results into the Data Repository
Loading is just a matter of writing the output of the last XSLT transform step into the etl-target.rdbxml map we built earlier.
OLAP Operations
References:
Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber.
http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/Lecture4.pdf
http://en.wikipedia.org/wiki/Online_Analytical_Processing
http://www.cs.sfu.ca/~han
http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
http://www.fmt.vein.hu/softcom/dw
Overview
- OLAP
- OLAP Cube & Multidimensional data
- OLAP Operations:
    - Roll up (Drill up)
    - Drill down (Roll down)
    - Slice & Dice
    - Pivot
    - Other operations
OLAP Cube
- Data warehouse & OLAP tools are based on a multidimensional data model, which views data in the form of a data cube.
- An OLAP (Online Analytical Processing) cube is a data structure that allows fast analysis of data.
- The OLAP cube consists of numeric facts called measures, which are categorized by dimensions.
    - Dimensions: perspectives or entities with respect to which an organization wants to keep records.
    - Facts: quantities by which we want to analyze relations between dimensions.
- The cube metadata may be created from a star schema or snowflake schema of tables in a relational database.
- Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.
Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
Concept Hierarchy
- A concept hierarchy defines a sequence of mappings from a set of low level concepts to higher level, more general concepts.
- Each of the elements of a dimension could be summarized using a hierarchy.
- The hierarchy is a series of parent-child relationships, typically where a parent member represents the consolidation of the members which are its children.
- Parent members can be further aggregated as the children of another parent.
Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
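One level of such a hierarchy can be represented as a child-to-parent mapping, and consolidation then amounts to summing the children under each parent. The cities, countries and unit counts below are invented for illustration.

```python
from collections import Counter

# Hypothetical city -> country mapping (one level of a location hierarchy).
PARENT = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada"}

# Hypothetical measure at the low (city) level.
units_by_city = {"Chicago": 3, "New York": 4, "Toronto": 2}

# Consolidate each child into its parent member.
units_by_country = Counter()
for city, units in units_by_city.items():
    units_by_country[PARENT[city]] += units

print(dict(units_by_country))
```

A further mapping from country to region could be applied to the result in the same way.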
Example – Star Schema
Dimension tables:
    - time(time_key, day, day_of_the_week, month, quarter, year)
    - item(item_key, item_name, brand, type, supplier_type)
    - branch(branch_key, branch_name, branch_type)
    - location(location_key, street, city, province_or_street, country)
Sales Fact Table:
    - Keys: time_key, item_key, branch_key, location_key
    - Measures: units_sold
Reference: http://www.cs.sfu.ca/~han
Example:
Dimensions: Item, Location, Time
Hierarchical summarization paths:
    - Item: Type -> Brand -> Item
    - Location: Region -> Country -> City -> Street
    - Time: Year -> Quarter -> Month -> Week -> Day
[Figure: data cube with axes Region, Item, Month]
Reference: http://www.cs.sfu.ca/~han
Working Example(1)
Reference: http://www.fmt.vein.hu/softcom/dw
Roll up (Drill up)
- Performs aggregation on a data cube either by climbing up the concept hierarchy for a dimension or by dimension reduction. [http://www.cs.sfu.ca/~han]
- Specific grouping on one dimension where we go from a lower level of aggregation to a higher one. [http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]
- E.g. summarization over aggregate hierarchy (total sales per region, state)
Slice and Dice
Slice:
    - Performs a selection on one dimension of the given cube, resulting in a sub cube.
Dice:
    - Performs a selection operation on two or more dimensions.
Pivot
- Visualization operation which rotates the data axes in view in order to provide an alternate presentation of data.
- Select a different dimension (orientation) for analysis. [http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]
- E.g. pivot operation where location & item in a 2D slice are rotated.
- Other examples:
    - rotating the axes in a 3D cube.
    - transforming 3D cube into series of 2D planes.
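Rotating a 2D slice can be sketched as swapping which dimension is used for the rows. The cell values, the cities and the `as_table` helper are hypothetical; the data itself is unchanged by the pivot, only its presentation.

```python
from collections import defaultdict

# A 2D slice as (location, item) -> units sold; the values are made up.
cells = {
    ("Chicago", "phone"): 3,
    ("Chicago", "tv"): 5,
    ("Toronto", "phone"): 2,
}

def as_table(cells, row_axis):
    """Lay out the cells with the chosen axis (0 or 1) as rows."""
    table = defaultdict(dict)
    for key, value in cells.items():
        table[key[row_axis]][key[1 - row_axis]] = value
    return {row: dict(cols) for row, cols in table.items()}

by_location = as_table(cells, row_axis=0)  # rows are locations
by_item = as_table(cells, row_axis=1)      # pivoted view: rows are items
print(by_location)
print(by_item)
```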
Working Example(2)
Dimension Tables:
    - Market(Market_ID, City, Region)
    - Product(Product_ID, Name, Category)
    - Time(Time_ID, Week, Month, Quarter)
Fact Table:
    - Sales(Market_ID, Product_ID, Time_ID, Amount)
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf
Roll up & Drill down on Working Example(2)

Sales per product and city:
    SELECT S.Product_ID, M.City, SUM(S.Amount) AS Amount
    INTO City_Sales
    FROM Sales S, Market M
    WHERE M.Market_ID = S.Market_ID
    GROUP BY S.Product_ID, M.City

Roll up sales on Market from city to region:
    SELECT T.Product_ID, M.Region, SUM(T.Amount)
    FROM City_Sales T, Market M
    WHERE T.City = M.City
    GROUP BY T.Product_ID, M.Region

Drill down sales on Market from region to city: the reverse step, going from the region-level result back to the city-level query.
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf
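To try the roll-up, the queries can be run against a small in-memory database. This is a sketch under several assumptions: the sample rows are invented, SQLite's `CREATE TABLE ... AS` stands in for `SELECT ... INTO`, and the join goes through `SELECT DISTINCT City, Region` so that a city with several markets is not counted more than once.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Market (Market_ID INTEGER PRIMARY KEY, City TEXT, Region TEXT);
    CREATE TABLE Sales (Market_ID INTEGER, Product_ID TEXT,
                        Time_ID INTEGER, Amount REAL);
    INSERT INTO Market VALUES
        (1, 'Albany', 'East'), (2, 'Buffalo', 'East'), (3, 'Denver', 'West');
    INSERT INTO Sales VALUES
        (1, '1002', 1, 10), (2, '1002', 1, 20),
        (3, '1002', 1, 5),  (1, '1003', 2, 7);
""")

# City level (the drill-down view).
con.execute("""
    CREATE TABLE City_Sales AS
    SELECT S.Product_ID, M.City, SUM(S.Amount) AS Amount
    FROM Sales S JOIN Market M ON M.Market_ID = S.Market_ID
    GROUP BY S.Product_ID, M.City
""")

# Roll up from city to region.
region_sales = con.execute("""
    SELECT T.Product_ID, M.Region, SUM(T.Amount)
    FROM City_Sales T
    JOIN (SELECT DISTINCT City, Region FROM Market) M ON T.City = M.City
    GROUP BY T.Product_ID, M.Region
    ORDER BY T.Product_ID, M.Region
""").fetchall()
print(region_sales)
```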
Slice & Dice on Working Example(2)

Dicing sales in the time dimension (e.g. total sales for each product in each quarter):
    SELECT S.Product_ID, T.Quarter, SUM(S.Amount)
    FROM Sales S, Time T
    WHERE T.Time_ID = S.Time_ID
      AND T.Week = 'Week12'
      AND (S.Product_ID = '1002' OR S.Product_ID = '1003')
    GROUP BY T.Quarter, S.Product_ID

Slicing the data cube in the time dimension (e.g. choosing sales only in week 12):
    SELECT S.*
    FROM Sales S, Time T
    WHERE T.Time_ID = S.Time_ID
      AND T.Week = 'Week12'
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf
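The same two queries can be exercised on sample data with `sqlite3`. The rows below are invented, and the table name is quoted as `"Time"` purely as a precaution in this sketch.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE "Time" (Time_ID INTEGER PRIMARY KEY, Week TEXT,
                         Month TEXT, Quarter TEXT);
    CREATE TABLE Sales (Market_ID INTEGER, Product_ID TEXT,
                        Time_ID INTEGER, Amount REAL);
    INSERT INTO "Time" VALUES
        (1, 'Week12', 'Mar', 'Q1'), (2, 'Week13', 'Mar', 'Q1');
    INSERT INTO Sales VALUES
        (1, '1001', 1, 4), (1, '1002', 1, 10),
        (2, '1003', 1, 6), (1, '1002', 2, 8);
""")

# Slice: fix one dimension (time) to a single value.
slice_rows = con.execute("""
    SELECT S.*
    FROM Sales S JOIN "Time" T ON T.Time_ID = S.Time_ID
    WHERE T.Week = 'Week12'
    ORDER BY S.Product_ID
""").fetchall()

# Dice: select on two dimensions (time and product), then aggregate.
dice_rows = con.execute("""
    SELECT S.Product_ID, T.Quarter, SUM(S.Amount)
    FROM Sales S JOIN "Time" T ON T.Time_ID = S.Time_ID
    WHERE T.Week = 'Week12' AND S.Product_ID IN ('1002', '1003')
    GROUP BY T.Quarter, S.Product_ID
    ORDER BY S.Product_ID
""").fetchall()

print(slice_rows)
print(dice_rows)
```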
Other Operations
- drill across: executes queries involving (across) more than one fact table.
- drill through: makes use of relational SQL facilities to drill through the bottom level of the cube to its back-end relational tables.
Reference: [http://www.cs.sfu.ca/~han]

ETL

  • 1. ETL & Basic OLAP OperationsCSE – 590 Data Mining Prof. Anita Wasilewska SUNY Stony Brook Presented By : -> Preeti Kudva (106887833) -> Kinjal Khandhar(106878039)
  • 2.
  • 3. Presentation Slides of Prof. Anita Wasilewska.
  • 5. Ralph Kimball, Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data
  • 6. Conceptual modeling for ETL processes by Panos Vassiliadis,Alkis Simitsis,Spiros Skiadopoulos.
  • 10.
  • 11. What is ETL? Extract - Extract relevant data. Transform - Transform data to DW format. - Build DW keys, etc. - Cleansing of data. Load - Load data into DW. - Build aggregates, etc. https://eprints.kfupm.edu.sa/74341/1/74341.pdf
  • 12. Other sources Extract Transform Load Refresh Operational DBs Metadata Analysis Query Reports Data mining Serve Data Warehouse Data Marts Data Sources Front-End Tools Data Storage Data Warehouse Architecture http://infolab.stanford.edu/warehousing/
  • 14.
  • 15. Extract data from different data source formats like flat files, Relational Database Systems,etc.
  • 16. Convert data into a specific format for transformation processing.
  • 18. Result in a check if data meets expected pattern/structure.http://en.wikipedia.org/wiki/Extract,_transform,_load
  • 19.
  • 20.
  • 22.
  • 26. Encoding free-form values(map Male to 1 & Mr to M)
  • 27. Merge/Purge(join data from multiple sources).
  • 29. Calculate(sale_amt = qty * price)
  • 30. Data type conversion
  • 32. Null value handling(null = not to load)
  • 33. Customized transformation(based on user).http://www.bi-bestpractices.com/view-articles/4738
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39. Data in DW must be:1] Precise - DW data must match known numbers - or explanation needed. 2] Complete - DW has all relevant data and the users know. 3] Consistent - No contradictory data: aggregates fit with detail data. 4] Unique - The same things is called the same and has the same key(customers). 5] Timely - Data is updated ”frequently enough” and the users know when.
  • 40.
  • 41. Construct programs that check data quality - Are totals as expected? -Do results agree with alternative source? -Number of NULL values?
  • 42.
  • 43. Load
  • 44.
  • 45.
  • 46.
  • 47.
  • 49. Example of how ETL Works!!!! Consider the HR Department database:-
  • 50.
  • 51. Extracting data can be done using XML convertors .Just select our table and choose dbase III convertor will transfer the data into XML.
  • 52. The result of this extraction will be an XML file similar to this:
  • 53.
  • 54.
  • 55. Output of above steps + etl-target.rdbxml gives:
  • 56. Loading Our ETL Results into the Data Repository loading is a just matter of writing the output of the last XSLT transform step into the etl-target.rdbxml map we built earlier.
  • 58. References: Data Mining: Concepts & Techniques by Jiawei Han and MichelineKamber. http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/Lecture4.pdf http://en.wikipedia.org/wiki/Online_Analytical_Processing http://www.cs.sfu.ca/~han http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5 http://www.fmt.vein.hu/softcom/dw
  • 59.
  • 60. OLAP Cube: Data warehouse & OLAP tools are based on a multidimensional data model which views data in the form of a data cube. An OLAP (Online Analytical Processing) cube is a data structure that allows fast analysis of data. The OLAP cube consists of numeric facts called measures which are categorized by dimensions. - Dimensions: perspectives or entities with respect to which an organization wants to keep records. - Facts: quantities by which we want to analyze relations between dimensions. The cube metadata may be created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables. Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
  • 61. Concept Hierarchy: A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Each of the elements of a dimension could be summarized using a hierarchy. The hierarchy is a series of parent-child relationships, typically where a parent member represents the consolidation of the members which are its children. Parent members can be further aggregated as the children of another parent. Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
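The parent-child consolidation described above can be illustrated with a small Python sketch. The hierarchy members (cities, countries, continent) and the function name `roll_up_totals` are hypothetical and chosen only for the example.

```python
# Hypothetical location hierarchy: each member maps to its parent member.
PARENT = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "Canada": "North America",
    "USA": "North America",
}


def roll_up_totals(leaf_values, parent):
    """Consolidate leaf-level values into every ancestor in the hierarchy."""
    totals = dict(leaf_values)
    for member, value in leaf_values.items():
        node = parent.get(member)
        # Walk up the chain of parents, adding the leaf value at each level.
        while node is not None:
            totals[node] = totals.get(node, 0) + value
            node = parent.get(node)
    return totals
```

Each parent total is the sum of its children, and parents are themselves aggregated into their own parent, which is the relationship the slide describes.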
  • 62. Example: Star Schema. Dimension Tables: time(time_key, day, day_of_the_week, month, quarter, year) item(item_key, item_name, brand, type, supplier_type) branch(branch_key, branch_name, branch_type) location(location_key, street, city, province_or_street, country) Fact Table: Sales(time_key, item_key, branch_key, location_key, units_sold) Measures: units_sold Reference: http://www.cs.sfu.ca/~han
  • 63. Example: Dimensions: Item, Location, Time. Hierarchical summarization paths: Item: Type > Brand > Item. Location: Region > Country > City > Street. Time: Year > Quarter > Month or Week > Day. Cube axes shown: Item, Region, Month. Reference: http://www.cs.sfu.ca/~han
  • 64. Working Example (1) Reference: http://www.fmt.vein.hu/softcom/dw
  • 65.
  • 66.
  • 67.
  • 68. Pivot: A visualization operation which rotates the data axes in view in order to provide an alternate presentation of the data, i.e., it selects a different dimension (orientation) for analysis [http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]. E.g., a pivot operation where location & item in a 2D slice are rotated. Other examples: - rotating the axes in a 3D cube. - transforming a 3D cube into a series of 2D planes.
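A pivot of a 2D slice amounts to swapping its two axes. The sketch below assumes the slice is held as a nested dictionary (location, then item); that representation and the function name `pivot` are choices made for this example, not something the slides prescribe.

```python
def pivot(slice_2d):
    """Rotate a 2D slice: rows become columns and columns become rows."""
    rotated = {}
    for row_key, cells in slice_2d.items():
        for col_key, value in cells.items():
            # The former column key becomes the outer (row) key.
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated
```

The values are unchanged; only the orientation in which they are presented differs, so pivoting twice returns the original slice.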
  • 69. Working Example (2) Dimension Tables: Market(Market_ID, City, Region) Product(Product_ID, Name, Category) Time(Time_ID, Week, Month, Quarter) Fact Table: Sales(Market_ID, Product_ID, Time_ID, Amount) Reference: http://personalpages.manchester.ac.uk/staff/G.Nenandic/CN3023/lecture4.pdf
  • 70. Roll up & Drill down on Working Example (2). City-level sales: SELECT S.Product_ID, M.City, SUM(S.Amount) AS Amount INTO City_Sales FROM Sales S, Market M WHERE M.Market_ID = S.Market_ID GROUP BY S.Product_ID, M.City. Roll up sales on Market from city to region: SELECT T.Product_ID, M.Region, SUM(T.Amount) FROM City_Sales T, Market M WHERE T.City = M.City GROUP BY T.Product_ID, M.Region. Drill down sales on Market from region to city: the reverse step, from the region totals back to the city-level result. Reference: http://personalpages.manchester.ac.uk/staff/G.Nenandic/CN3023/lecture4.pdf
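The two queries can be tried out with Python's built-in sqlite3 module. This is a sketch under several assumptions: the sample rows are invented, SQLite has no `SELECT ... INTO`, so `CREATE TABLE ... AS` is used instead, and the roll-up joins against the distinct (City, Region) pairs so that a city with more than one market is not counted twice.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with invented sample data."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Market(Market_ID INTEGER PRIMARY KEY, City TEXT, Region TEXT);
        CREATE TABLE Sales(Market_ID INTEGER, Product_ID TEXT, Time_ID INTEGER, Amount REAL);
        INSERT INTO Market VALUES (1,'Leeds','North'),(2,'York','North'),(3,'Bristol','South');
        INSERT INTO Sales VALUES (1,'1002',1,10),(1,'1002',2,5),(2,'1002',1,7),
                                 (3,'1002',1,4),(3,'1003',1,6);
    """)
    return con


def city_sales(con):
    """City-level totals (the drill-down view), stored as City_Sales."""
    con.execute("""
        CREATE TABLE City_Sales AS
        SELECT S.Product_ID, M.City, SUM(S.Amount) AS Amount
        FROM Sales S JOIN Market M ON M.Market_ID = S.Market_ID
        GROUP BY S.Product_ID, M.City
    """)
    return con.execute(
        "SELECT Product_ID, City, Amount FROM City_Sales ORDER BY Product_ID, City"
    ).fetchall()


def roll_up(con):
    """Roll up City_Sales from city to region."""
    return con.execute("""
        SELECT T.Product_ID, M.Region, SUM(T.Amount)
        FROM City_Sales T
        JOIN (SELECT DISTINCT City, Region FROM Market) M ON T.City = M.City
        GROUP BY T.Product_ID, M.Region
        ORDER BY T.Product_ID, M.Region
    """).fetchall()
```

Calling `city_sales(con)` first and `roll_up(con)` second reproduces the roll-up direction; reading the results in the opposite order corresponds to drilling down from region to city.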
  • 71. Slice & Dice on Working Example (2). Dicing sales in the time dimension (e.g. total sales for each product in each quarter): SELECT S.Product_ID, T.Quarter, SUM(S.Amount) FROM Sales S, Time T WHERE T.Time_ID = S.Time_ID AND T.Week = 'Week12' AND (S.Product_ID = '1002' OR S.Product_ID = '1003') GROUP BY T.Quarter, S.Product_ID. Slicing the data cube in the time dimension (e.g. choosing sales only in week 12): SELECT S.* FROM Sales S, Time T WHERE T.Time_ID = S.Time_ID AND T.Week = 'Week12'. Reference: http://personalpages.manchester.ac.uk/staff/G.Nenandic/CN3023/lecture4.pdf
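The slice and dice queries can be run in the same way with sqlite3. The sample rows below are invented for illustration, and the helper names `slice_week12` and `dice_week12_products` are hypothetical. The slice fixes a single value on one dimension (week 12); the dice additionally restricts the product dimension to two products before aggregating.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with invented sample data."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Time(Time_ID INTEGER PRIMARY KEY, Week TEXT, Month TEXT, Quarter TEXT);
        CREATE TABLE Sales(Market_ID INTEGER, Product_ID TEXT, Time_ID INTEGER, Amount REAL);
        INSERT INTO Time VALUES (1,'Week11','Mar','Q1'),(2,'Week12','Mar','Q1');
        INSERT INTO Sales VALUES (1,'1001',2,3),(1,'1002',1,10),(1,'1002',2,5),
                                 (2,'1002',2,7),(3,'1003',2,6);
    """)
    return con


def slice_week12(con):
    """Slice: keep only the sales rows that fall in week 12."""
    return con.execute("""
        SELECT S.* FROM Sales S JOIN Time T ON T.Time_ID = S.Time_ID
        WHERE T.Week = 'Week12'
        ORDER BY S.Market_ID, S.Product_ID
    """).fetchall()


def dice_week12_products(con):
    """Dice: restrict both time (week 12) and product (1002, 1003), then total."""
    return con.execute("""
        SELECT S.Product_ID, T.Quarter, SUM(S.Amount)
        FROM Sales S JOIN Time T ON T.Time_ID = S.Time_ID
        WHERE T.Week = 'Week12'
          AND (S.Product_ID = '1002' OR S.Product_ID = '1003')
        GROUP BY T.Quarter, S.Product_ID
        ORDER BY S.Product_ID
    """).fetchall()
```

With this data the slice returns every week-12 row, including product 1001, while the dice drops that product and returns one total per remaining product and quarter.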
  • 72. Other Operations drill across: executes queries involving (across) more than one fact table. drill through: makes use of relational SQL facilities to drill through the bottom level of the cube to its back-end relational tables. Reference: [http://www.cs.sfu.ca/~han]
  • 73. 10 Challenging Problems in Data Mining Research. Xindong Wu, Department of Computer Science, University of Vermont, 33 Colchester Avenue, Burlington, Vermont 05405, USA, xwu@cs.uvm.edu. Qiang Yang, Department of Computer Science, Hong Kong University of Science & Technology, Clearwater Bay, Kowloon, Hong Kong, China. Presented at: ICDM '05, The Fifth IEEE International Conference on Data Mining
  • 74.
  • 79.
  • 80.
  • 81.
  • 82.
  • 83.
  • 84. 3. Sequential and Time Series Data: Real time series data obtained from wireless sensors in the Hong Kong UST CS department hallway
  • 85. 4. Mining Complex Knowledge from Complex Data: An important type of complex knowledge is in the form of graphs. Challenge: More research is required in the field of discovering graphs and structured patterns from large data. Data that are not i.i.d. (independent and identically distributed): many objects are not independent of each other, and are not of a single type. Challenge: Data mining systems are required that can soundly mine the rich structure of relations among objects. E.g.: interlinked Web pages, social networks, metabolic networks in the cell
  • 86.
  • 87.
  • 88. Entities/nodes are distributed. Hence, distributed means of identification are desired.
  • 89.
  • 90.
  • 91.
  • 92. In a distributed environment (sensor/IP network), distributed probes are placed at locations within the network.
  • 93.
  • 94. Multi-agent Data Mining: Agents are often distributed & have proactive and reactive features. http://www-ai.cs.uni-dortmund.de/auto?self=$ejr31cyc http://www.csc.liv.ac.uk/~ali/wp/MADM.pdf
  • 95. 7. Data Mining for Biological and Environmental Problems: Mining biological data is an extremely important problem, e.g., HIV vaccine design. Molecular biology, e.g., DNA chemical properties, 3D structures, functional properties.
  • 96.
  • 97.
  • 98. 8. Data-mining-Process Related Problems: Sampling, Feature Selection, Mining….
  • 99.
  • 100. How to do data mining for protection of security and privacy?
  • 101. Knowledge integrity assessment - Data are intentionally modified from their original version, in order to misinform the recipients or for privacy and security - Development of measures to evaluate the knowledge integrity of a collection of -- Data -- Knowledge and patterns
  • 102.
  • 103. 10. Dealing with Non-static, Unbalanced and Cost-sensitive Data: Data is non-static and constantly changing, e.g., collecting data in 2000, then 2001, 2002, …… The problem is to correct the bias. Deal with unbalanced & cost-sensitive data: there is much information on costs and benefits, but no overall model of profit and loss. Data may evolve with a bias introduced by sampling. ICML 2003 Workshop on Learning from Imbalanced Data Sets
  • 104.
  • 105. Conclusion There is still a lack of timely exchange of important topics in the community as a whole. These problems are sampled from a small, albeit important, segment of the community. The list should obviously be a function of time for this dynamic field.