11. What is ETL?
- Extract: extract the relevant data from the sources.
- Transform: transform the data into the DW format; build DW keys, etc.; cleanse the data.
- Load: load the data into the DW; build aggregates, etc.
https://eprints.kfupm.edu.sa/74341/1/74341.pdf
12. [Diagram: Data Warehouse Architecture — data sources (operational DBs and other sources) feed an Extract/Transform/Load/Refresh layer into the data storage tier (data warehouse, data marts, metadata), which serves front-end tools for analysis, query/reports and data mining.]
http://infolab.stanford.edu/warehousing/
39. Data in the DW must be:
1] Precise: DW data must match known numbers, or an explanation is needed.
2] Complete: the DW has all relevant data, and the users know it.
3] Consistent: no contradictory data; aggregates fit with detail data.
4] Unique: the same thing is called the same and has the same key (e.g., customers).
5] Timely: data is updated "frequently enough", and the users know when.
41. Construct programs that check data quality:
- Are totals as expected?
- Do results agree with an alternative source?
- How many NULL values are there?
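The checks above can be sketched as a small Python program over an in-memory SQLite table. The `sales` table, its rows, and the expected total are all invented here purely for illustration:

```python
import sqlite3

# Hypothetical sales table used to illustrate automated quality checks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("p1", 100.0), ("p2", 250.0), ("p3", None)])

# Check 1: does the DW total agree with the figure from an
# alternative source (here an assumed number from the operational system)?
expected_total = 350.0
(dw_total,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()
total_ok = abs(dw_total - expected_total) < 0.01

# Check 2: how many NULL values slipped through the ETL?
(null_count,) = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE amount IS NULL").fetchone()

print(total_ok, null_count)
```

Note that SQL's `SUM` ignores NULLs, so the total check can pass even when the NULL-count check flags missing values; that is exactly why both checks are worth running.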
56. Loading Our ETL Results into the Data Repository: loading is just a matter of writing the output of the last XSLT transform step into the etl-target.rdbxml map we built earlier.
58. References:
- Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber
- http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/Lecture4.pdf
- http://en.wikipedia.org/wiki/Online_Analytical_Processing
- http://www.cs.sfu.ca/~han
- http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
- http://www.fmt.vein.hu/softcom/dw
60. OLAP Cube
Data warehouse and OLAP tools are based on a multidimensional data model which views data in the form of a data cube. An OLAP (Online Analytical Processing) cube is a data structure that allows fast analysis of data. The OLAP cube consists of numeric facts called measures, which are categorized by dimensions.
- Dimensions: the perspectives or entities with respect to which an organization wants to keep records.
- Facts: the quantities by which we want to analyze relations between dimensions.
The cube metadata may be created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table, and dimensions are derived from the dimension tables.
Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
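A minimal sketch of the idea of "measures categorized by dimensions": cells keyed by dimension tuples, with a measure value in each cell. The products, cities, quarters, and unit counts below are invented for illustration only:

```python
from collections import defaultdict

# Tiny in-memory "cube": the measure units_sold is categorized by
# three dimensions (product, city, quarter). All values are made up.
cells = {
    ("TV",    "Vancouver", "Q1"): 100,
    ("TV",    "Toronto",   "Q1"): 150,
    ("Phone", "Vancouver", "Q1"): 80,
    ("TV",    "Vancouver", "Q2"): 120,
}

# Summing the measure over one dimension (here: city) yields a
# lower-dimensional view of the same data -- the essence of aggregation
# along a cube dimension.
by_product_quarter = defaultdict(int)
for (product, city, quarter), units in cells.items():
    by_product_quarter[(product, quarter)] += units

print(by_product_quarter[("TV", "Q1")])  # 250
```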
61. Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Each of the elements of a dimension can be summarized using a hierarchy. The hierarchy is a series of parent-child relationships, typically where a parent member represents the consolidation of the members which are its children. Parent members can be further aggregated as the children of another parent.
Reference: http://en.wikipedia.org/wiki/OLAP_cube#cite_note-OLAPGlossary1995-5
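The parent-child consolidation can be sketched as one step of a time hierarchy (month → quarter); the months and sales figures are assumptions for illustration:

```python
# One level of a concept hierarchy on the time dimension: each quarter
# (parent) consolidates its child months. The mapping and sales numbers
# are illustrative only.
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
monthly_sales = {"Jan": 10, "Feb": 20, "Mar": 5, "Apr": 7}

# A parent member's value is the consolidation (here: sum) of its children.
quarterly_sales = {}
for month, amount in monthly_sales.items():
    quarter = month_to_quarter[month]
    quarterly_sales[quarter] = quarterly_sales.get(quarter, 0) + amount

print(quarterly_sales)  # {'Q1': 35, 'Q2': 7}
```

The same pattern repeats at every level: quarterly values would consolidate into yearly ones by another quarter → year mapping.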
62. Example – Star Schema
Dimension tables:
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city, province_or_state, country)
Fact table:
- Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold), with units_sold as the measure
Reference: http://www.cs.sfu.ca/~han
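The star schema above can be expressed as DDL; here is a sketch using SQLite from Python. The column names follow the slide, while the column types are assumptions:

```python
import sqlite3

# Star schema from the slide as SQLite DDL (types are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (
    time_key INTEGER PRIMARY KEY,
    day TEXT, day_of_the_week TEXT, month TEXT, quarter TEXT, year INTEGER
);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT
);
CREATE TABLE branch (
    branch_key INTEGER PRIMARY KEY,
    branch_name TEXT, branch_type TEXT
);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT
);
-- The fact table sits at the center of the star: it references every
-- dimension table and carries the measure (units_sold).
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time(time_key),
    item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER
);
""")

tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)
```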
63. Example: hierarchical summarization paths for the dimensions Item, Location, and Time:
- Item: Item < Brand < Type
- Location: Street < City < Country < Region
- Time: Day < Month < Quarter < Year (with Day < Week as an alternative path)
Reference: http://www.cs.sfu.ca/~han
68. Pivot: a visualization operation which rotates the data axes in view in order to provide an alternative presentation of the data, i.e., selecting a different dimension (orientation) for analysis [http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf]. E.g., a pivot operation where the location and item axes in a 2D slice are rotated. Other examples: rotating the axes in a 3D cube; transforming a 3D cube into a series of 2D planes.
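For a 2D slice, rotating the location and item axes amounts to transposing the cross-tab. A minimal sketch (cities, items, and counts are invented for illustration):

```python
# A 2D slice keyed (city, item) -> units_sold; values are illustrative.
slice_2d = {
    ("Vancouver", "TV"): 100,
    ("Vancouver", "Phone"): 80,
    ("Toronto", "TV"): 150,
}

# Pivot: swap the two axes so the same data is presented item-by-city
# instead of city-by-item. No values change, only the orientation.
pivoted = {(item, city): units for (city, item), units in slice_2d.items()}

print(pivoted[("TV", "Toronto")])  # 150
```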
69. Working Example (2)
Dimension tables:
- Market(Market_ID, City, Region)
- Product(Product_ID, Name, Category)
- Time(Time_ID, Week, Month, Quarter)
Fact table:
- Sales(Market_ID, Product_ID, Time_ID, Amount)
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf
70. Roll up & Drill down on Working Example (2)
Drill down sales on Market from region to city (build the city-level aggregate):
SELECT S.Product_ID, M.City, SUM(S.Amount)
INTO City_Sales
FROM Sales S, Market M
WHERE M.Market_ID = S.Market_ID
GROUP BY S.Product_ID, M.City
Roll up sales on Market from city to region:
SELECT T.Product_ID, M.Region, SUM(T.Amount)
FROM City_Sales T, Market M
WHERE T.City = M.City
GROUP BY T.Product_ID, M.Region
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf
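The two queries above can be run against an in-memory SQLite database; the sample market and sales rows are invented for illustration, and `SELECT ... INTO` (not supported by SQLite) is emulated with `CREATE TABLE ... AS SELECT`:

```python
import sqlite3

# Working Example (2) schema with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market (Market_ID INTEGER, City TEXT, Region TEXT);
CREATE TABLE Sales (Market_ID INTEGER, Product_ID INTEGER,
                    Time_ID INTEGER, Amount REAL);
INSERT INTO Market VALUES (1, 'Leeds', 'North'), (2, 'York', 'North'),
                          (3, 'London', 'South');
INSERT INTO Sales VALUES (1, 1002, 1, 10), (2, 1002, 1, 20),
                         (3, 1002, 1, 5);
""")

# City-level aggregate (SELECT ... INTO emulated via CREATE TABLE AS).
conn.execute("""
CREATE TABLE City_Sales AS
SELECT S.Product_ID, M.City, SUM(S.Amount) AS Amount
FROM Sales S, Market M
WHERE M.Market_ID = S.Market_ID
GROUP BY S.Product_ID, M.City
""")

# Roll up from city to region.
rows = conn.execute("""
SELECT T.Product_ID, M.Region, SUM(T.Amount)
FROM City_Sales T, Market M
WHERE T.City = M.City
GROUP BY T.Product_ID, M.Region
""").fetchall()

print(rows)
```

With the sample rows, product 1002 rolls up to 30 in the North region (Leeds 10 + York 20) and 5 in the South.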
71. Slice & Dice on Working Example (2)
Dicing sales in the time dimension (e.g. total sales for selected products, per quarter, within week 12):
SELECT S.Product_ID, T.Quarter, SUM(S.Amount)
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID
  AND T.Week = 'Week12'
  AND (S.Product_ID = '1002' OR S.Product_ID = '1003')
GROUP BY T.Quarter, S.Product_ID
Slicing the data cube in the time dimension (e.g. choosing sales only in week 12):
SELECT S.*
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID AND T.Week = 'Week12'
Reference: http://personalpages.manchester.ac.uk/staff/G.Nenadic/CN3023/lecture4.pdf
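The slice query above can likewise be checked against an in-memory SQLite database; the time and sales rows below are assumptions for illustration:

```python
import sqlite3

# Working Example (2) with a made-up week-12 and week-20 time entry.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time (Time_ID INTEGER, Week TEXT, Month TEXT, Quarter TEXT);
CREATE TABLE Sales (Market_ID INTEGER, Product_ID INTEGER,
                    Time_ID INTEGER, Amount REAL);
INSERT INTO Time VALUES (1, 'Week12', 'March', 'Q1'),
                        (2, 'Week20', 'May',   'Q2');
INSERT INTO Sales VALUES (1, 1002, 1, 10), (1, 1003, 1, 15),
                         (1, 1002, 2, 99);
""")

# Slice: keep only the week-12 plane of the cube.
sliced = conn.execute("""
SELECT S.*
FROM Sales S, Time T
WHERE T.Time_ID = S.Time_ID AND T.Week = 'Week12'
""").fetchall()

print(sliced)
```

Only the two rows with Time_ID 1 (week 12) survive the slice; the week-20 row is excluded.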
72. Other Operations
- Drill across: executes queries involving (across) more than one fact table.
- Drill through: makes use of relational SQL facilities to drill through the bottom level of the cube to its back-end relational tables.
Reference: http://www.cs.sfu.ca/~han
73. 10 Challenging Problems in Data Mining Research. Xindong Wu, Department of Computer Science, University of Vermont, 33 Colchester Avenue, Burlington, Vermont 05405, USA (xwu@cs.uvm.edu). Qiang Yang, Department of Computer Science, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong, China. Presented at ICDM '05, The Fifth IEEE International Conference on Data Mining.
84. 3. Sequential and Time Series Data. Example: real time-series data obtained from wireless sensors in the Hong Kong UST CS department hallway.
85. 4. Mining Complex Knowledge from Complex Data. An important type of complex knowledge comes in the form of graphs. Challenge: more research is required on discovering graphs and structured patterns from large data. Data that are not i.i.d. (independent and identically distributed): many objects are not independent of each other, and are not of a single type. Challenge: data mining systems are required that can soundly mine the rich structure of relations among objects, e.g. interlinked Web pages, social networks, metabolic networks in the cell.
88. Entities/nodes are distributed; hence, distributed means of identification are desired.
94. Multi-agent Data Mining: agents are often distributed and have proactive and reactive features. http://www-ai.cs.uni-dortmund.de/auto?self=$ejr31cyc http://www.csc.liv.ac.uk/~ali/wp/MADM.pdf
95. 7. Data Mining for Biological and Environmental Problems. Mining biological data is an extremely important problem, e.g. HIV vaccine design. Molecular biology examples: DNA chemical properties, 3D structures, functional properties.
100. How to do data mining for protection of security and privacy?
101. Knowledge integrity assessment
- Data are intentionally modified from their original version, in order to misinform the recipients or for privacy and security reasons.
- Development of measures is needed to evaluate the knowledge integrity of a collection of: data; knowledge and patterns.
103. 10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
- Data is non-static and constantly changing, e.g. data collected in 2000, then 2001, 2002, and so on; the problem is to correct the bias.
- Dealing with unbalanced and cost-sensitive data: there is much information on costs and benefits, but no overall model of profit and loss.
- Data may evolve with a bias introduced by sampling.
ICML 2003 Workshop on Learning from Imbalanced Data Sets
105. Conclusion There is still a lack of timely exchange of important topics in the community as a whole. These problems are sampled from a small, albeit important, segment of the community. The list should obviously be a function of time for this dynamic field.