Data warehousing and mining

  Session VII (Part 1) 15:45 - 16:10

         Sunita Sarawagi
     School of IT, IIT Bombay
Introduction
• Organizations are getting larger and amassing ever-increasing
  amounts of data
• Historical data encodes useful information about the
  working of an organization.
• However, the data is scattered across multiple sources,
  in multiple formats.
• Data warehousing: process of consolidating data
  in a centralized location
• Data mining: process of analyzing data to find
  useful patterns and relationships
Dr. Sunita Sarawagi    Data Warehousing & Mining     2
Typical data analysis tasks
• Report the per-capita deposits broken down by
  region and profession.
• Are deposits from rural coastal areas increasing
  over the last five years?
• What percent of small business loans were cleared?
• Why is it less than last year’s? How did similar
  businesses that did not take loans perform?
• What should be the new rules for loan eligibility?
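
The first task above (per-capita deposits broken down by region and profession) is a plain group-by aggregate. A minimal sketch, assuming a hypothetical list of customer records; the field layout, the sample values, and the function name `per_capita_deposits` are illustrative and not from the talk:

```python
from collections import defaultdict

# Hypothetical customer records: (region, profession, deposit amount).
customers = [
    ("West", "Farmer", 1000.0),
    ("West", "Farmer", 3000.0),
    ("West", "Trader", 5000.0),
    ("East", "Farmer", 2000.0),
]

def per_capita_deposits(rows):
    """Average deposit per customer for each (region, profession) group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for region, profession, deposit in rows:
        totals[(region, profession)] += deposit
        counts[(region, profession)] += 1
    return {key: totals[key] / counts[key] for key in totals}

report = per_capita_deposits(customers)
for (region, profession), average in sorted(report.items()):
    print(f"{region:5} {profession:7} {average:10.2f}")
```

The later questions in the list (why a figure changed, what the new rules should be) do not reduce to a single aggregate like this, which is where OLAP and mining tools come in.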

Decision support tools
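[Figure: layered decision-support architecture]
• Top layer (analysis tools): direct query; reporting tools
  (e.g. Crystal reports); OLAP (e.g. Essbase); mining tools
  (e.g. Intelligent Miner)
• Middle layer: data warehouse, held in a relational DBMS+
  (e.g. Redbrick), built by merging, cleaning and summarizing
• Bottom layer: detailed transactional (operational) data from the
  Bombay, Delhi and Calcutta branches, plus GIS data and census
  data; source systems shown include Oracle, IMS and SAS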
Data warehouse construction
• Heterogeneous data integration
     – merge from various sources, fuzzy matches
     – remove inconsistencies
• Data cleaning:
     – missing data, outliers, clean fields e.g. names/addresses
     – Data mining techniques
• Data loading: summarize, create indices
• Products: Prism warehouse manager, Platinum info
   refiner, info pump, QDB, Vality
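
As an illustration of the fuzzy-match step in integration and name cleaning, here is a minimal sketch using Python's standard `difflib`. The normalization rules, the 0.85 threshold, and the sample names are assumptions made for the example; the products listed above use their own matching methods, which the slides do not describe.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lower-case, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def fuzzy_match(name, candidates, threshold=0.85):
    """Return the best-matching candidate, or None if it scores below threshold."""
    best = max(candidates, key=lambda c: similarity(name, c))
    return best if similarity(name, best) >= threshold else None

# Hypothetical names as they might appear in two source systems.
known = ["RK Sharma", "S. Verma"]
print(fuzzy_match("R. K. Sharma", known))
print(fuzzy_match("Anita Desai", known))
```

The threshold trades false merges against missed matches, so in practice it would be tuned on labeled pairs.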

Warehouse maintenance
• Data refresh
     – when to refresh, and in what form to send updates?
• Materialized view maintenance with batch
  updates.
• Query evaluation using materialized views
• Monitoring and reporting tools
     – HP intelligent warehouse advisor
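
Materialized view maintenance with batch updates can be sketched for the simplest case, a SUM aggregate, where a batch of inserted and deleted base rows is folded into the stored totals without rescanning the base data. The class name `SumByBranchView` and the sample rows are hypothetical:

```python
from collections import defaultdict

class SumByBranchView:
    """Materialized view: total amount per branch, refreshed incrementally."""

    def __init__(self, base_rows):
        # Initial load: one full pass over the base data.
        self.totals = defaultdict(float)
        for branch, amount in base_rows:
            self.totals[branch] += amount

    def apply_batch(self, inserted=(), deleted=()):
        """Fold a batch of inserted/deleted base rows into the stored totals."""
        for branch, amount in inserted:
            self.totals[branch] += amount
        for branch, amount in deleted:
            self.totals[branch] -= amount

base = [("Bombay", 100.0), ("Delhi", 50.0), ("Bombay", 25.0)]
view = SumByBranchView(base)
view.apply_batch(inserted=[("Delhi", 10.0), ("Calcutta", 40.0)],
                 deleted=[("Bombay", 25.0)])
print(dict(view.totals))
```

SUM and COUNT can be maintained this way from the batch alone. MIN and MAX cannot in general, because deleting the current extreme value forces a look back at the base data.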

OLAP
• Fast, interactive answers to large aggregate queries.
• Multidimensional model: dimensions with
  hierarchies
     – Dim 1: Bank location:
          • branch --> city --> state
     – Dim 2: Customer:
          • sub profession --> profession
     – Dim 3: Time:
          • month --> quarter --> year
• Measures: loan amount, #transactions, balance
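
A dimension hierarchy can be sketched as a chain of lookup tables, with a roll-up summing a measure at whichever level is requested. The branch and city names, the fact rows, and the function `roll_up` are made up for illustration; only the branch --> city --> state hierarchy and the loan-amount measure come from the slide.

```python
from collections import defaultdict

# Hypothetical hierarchy maps for the bank-location dimension.
BRANCH_TO_CITY = {"Andheri": "Bombay", "Powai": "Bombay", "Karol Bagh": "Delhi"}
CITY_TO_STATE = {"Bombay": "Maharashtra", "Delhi": "Delhi"}

# Hypothetical fact rows: (branch, month, loan amount).
facts = [
    ("Andheri", "1998-01", 10.0),
    ("Powai", "1998-01", 5.0),
    ("Powai", "1998-02", 7.0),
    ("Karol Bagh", "1998-01", 3.0),
]

def roll_up(rows, level):
    """Sum the loan-amount measure at the given location level."""
    totals = defaultdict(float)
    for branch, _month, amount in rows:
        city = BRANCH_TO_CITY[branch]
        key = {"branch": branch, "city": city, "state": CITY_TO_STATE[city]}[level]
        totals[key] += amount
    return dict(totals)

print(roll_up(facts, "branch"))
print(roll_up(facts, "city"))
print(roll_up(facts, "state"))
```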
OLAP
• Navigational operators: Pivot, drill-down,
  roll-up, select.
• Hypothesis-driven search: e.g. factors
  affecting defaulters
     – view defaulting rate by age, aggregated over other
       dimensions
     – for a particular age segment, drill down along profession
• Need interactive response to aggregate queries.
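
The defaulter example can be traced in a few lines: first the defaulting rate by age with every other dimension aggregated away, then a select on one age segment followed by a drill-down along profession. The loan records and the helper `default_rate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical loan records: (age segment, profession, defaulted?).
loans = [
    ("20-30", "Trader", True),
    ("20-30", "Trader", True),
    ("20-30", "Clerk", False),
    ("20-30", "Clerk", False),
    ("30-40", "Trader", False),
    ("30-40", "Clerk", False),
    ("30-40", "Clerk", True),
    ("30-40", "Trader", False),
]

def default_rate(rows, key):
    """Defaulting rate per group; `key` maps a row to its group label."""
    defaults = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        group = key(row)
        counts[group] += 1
        defaults[group] += int(row[2])
    return {group: defaults[group] / counts[group] for group in counts}

# Roll-up: rate by age, aggregated over all other dimensions.
by_age = default_rate(loans, key=lambda r: r[0])

# Select one age segment, then drill down along profession.
segment = [r for r in loans if r[0] == "20-30"]
by_profession = default_rate(segment, key=lambda r: r[1])

print(by_age)
print(by_profession)
```

Each step rescans the rows here. An OLAP server instead answers such queries from precomputed aggregates, which is how it stays interactive on large data.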
OLAP products
• About 30 OLAP vendors
• Dominant ones:
     – Oracle Express: largest market share: 20%
     – Arbor Essbase: technology leader
     – Microsoft Plato: introduced late last year,
       rapidly taking over...



Microsoft OLAP strategy
• Plato: OLAP server: powerful, integrating various
  operational sources
• OLE DB for OLAP: emerging industry standard
  based on MDX, a SQL-like query language for OLAP
• Pivot-table services: integrate with Office 2000
     – Every desktop will have OLAP capability.
• Client-side caching and calculations
• Partitioned and virtual cubes
• Hybrid relational and multidimensional storage
Data mining
• Process of semi-automatically analyzing large
  databases to find interesting and useful
  patterns
• Overlaps with machine learning, statistics,
  artificial intelligence and databases but
     – more scalable in number of features and instances
     – more automated to handle heterogeneous data

Some basic operations
• Predictive:
      – Regression
      – Classification
• Descriptive:
      – Clustering / similarity matching
      – Association rules and variants
      – Deviation detection


Classification
• Given old data about customers and payments,
  predict new applicant’s loan eligibility.
[Figure: records of previous customers (age, salary, profession,
location, customer type) are fed to a classifier, which produces
decision rules (e.g. Salary > 5 L, Prof. = Exec) that label a new
applicant's data as good/bad.]
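
To make the idea of learning a decision rule from old data concrete, here is a minimal sketch that learns a single threshold rule of the form "salary > t means good" from labeled records. The training data is invented, and a one-attribute rule is only an assumed stand-in; the slide does not say how its rules are learned or combined.

```python
# Hypothetical training data: (salary in lakhs, label), where True means good.
previous_customers = [(2.0, False), (3.5, False), (4.0, False),
                      (6.0, True), (7.5, True), (9.0, True)]

def learn_salary_rule(rows):
    """Pick the threshold t for the rule `salary > t => good` with fewest errors."""
    salaries = sorted({salary for salary, _ in rows})
    # Candidate thresholds: midpoints between consecutive distinct salaries.
    candidates = [(a + b) / 2 for a, b in zip(salaries, salaries[1:])]

    def errors(t):
        return sum((salary > t) != label for salary, label in rows)

    return min(candidates, key=errors)

threshold = learn_salary_rule(previous_customers)

def classify(salary):
    """Apply the learned rule to a new applicant's salary."""
    return "good" if salary > threshold else "bad"

print(threshold, classify(6.2), classify(4.9))
```

A decision tree classifier applies this kind of split search repeatedly, over every attribute and within each resulting subset.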
Classification methods
• Nearest neighbor
• Regression: (linear or any polynomial)
     – a*salary + b*age + c = eligibility score.
• Decision tree classifier
• Probabilistic/generative models
• Neural networks
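
Two of the methods above fit in a few lines each. This is a sketch under stated assumptions: the training rows are invented, Euclidean distance is an assumed choice for nearest neighbor, and the coefficients a, b, c in the linear score are made up rather than fitted.

```python
import math

# Hypothetical labeled customers: ((salary in lakhs, age in years), label).
training = [((2.0, 25.0), "bad"), ((3.0, 40.0), "bad"),
            ((8.0, 35.0), "good"), ((9.0, 50.0), "good")]

def nearest_neighbor(point, rows):
    """Label of the training row closest to `point` (Euclidean distance)."""
    _, label = min(rows, key=lambda row: math.dist(point, row[0]))
    return label

def linear_score(salary, age, a=1.0, b=0.1, c=-5.0):
    """Eligibility score a*salary + b*age + c; the coefficients are made up."""
    return a * salary + b * age + c

print(nearest_neighbor((7.5, 33.0), training))
print(linear_score(6.0, 30.0))
```

With raw features, the attribute with the larger numeric range dominates the distance, so features are usually rescaled before nearest-neighbor matching.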

Clustering
• Unsupervised learning, used when old data with class
  labels is not available, e.g. when introducing a new
  product.
• Group/cluster existing customers based on time
  series of payment history such that similar customers
  fall in the same cluster.
• Key requirement: need a good measure of similarity
  between instances.
• Identify micro-markets and develop policies for each
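
A minimal clustering sketch, assuming k-means with Euclidean distance as the similarity measure. The slides name neither an algorithm nor a measure, so both are assumptions, and the payment histories are invented.

```python
import math

# Hypothetical payment histories: days late in each of four months.
histories = {
    "c1": [0, 0, 1, 0],
    "c2": [1, 0, 0, 0],
    "c3": [20, 25, 30, 28],
    "c4": [22, 27, 29, 30],
}

def k_means(points, centers, rounds=10):
    """Plain k-means from given initial centers; returns name -> cluster index."""
    assignment = {}
    for _ in range(rounds):
        # Assign each point to its nearest center (Euclidean distance).
        assignment = {
            name: min(range(len(centers)), key=lambda i: math.dist(vec, centers[i]))
            for name, vec in points.items()
        }
        # Move each center to the mean of its members (keep it if empty).
        for i in range(len(centers)):
            members = [points[n] for n, c in assignment.items() if c == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

clusters = k_means(histories, centers=[histories["c1"][:], histories["c3"][:]])
print(clusters)
```

Euclidean distance compares histories month by month. For payment series that are shifted in time, a different similarity measure may group customers more sensibly, which is the point of the key requirement above.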
Association rules
• Given set T of groups of items
• Example: set of item sets purchased
       T
       Milk, cereal
       Tea, milk
       Tea, rice, bread
       cereal
• Goal: find all rules on itemsets of the
  form a --> b such that
     – support of a and b > user threshold s
     – conditional probability (confidence) of b
       given a > user threshold c
• Example: Milk --> bread
• Purchase of product A --> service B
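
Support and confidence can be computed directly from their definitions. The sketch below enumerates single-item rules a --> b by brute force; it is not an efficient miner such as Apriori. The baskets are hypothetical and are not the table T from the slide.

```python
from itertools import permutations

# Hypothetical baskets (not the slide's table T).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "tea"},
    {"milk", "cereal"},
    {"tea", "rice"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules(transactions, s, c):
    """Single-item rules a --> b with support > s and confidence > c."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in permutations(items, 2):
        both = support({a, b}, transactions)
        # Confidence of b given a is support(a and b) / support(a).
        if both > s and both / support({a}, transactions) > c:
            found.append((a, b))
    return found

print(rules(baskets, s=0.4, c=0.6))
```

Here milk and bread appear together in 2 of 4 baskets (support 0.5). The confidence of milk --> bread is 0.5 / 0.75, about 0.67, while bread --> milk has confidence 1.0, so the two directions of a rule can differ.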
Mining market
• Around 20 to 30 mining tool vendors
• Major players:
     –   Clementine,
     –   IBM’s Intelligent Miner,
     –   SGI’s MineSet,
     –   SAS’s Enterprise Miner.
• All offer pretty much the same set of tools
• Many embedded products: fraud detection, electronic
   commerce applications
Conclusions
• The value of warehousing and mining in
  effective decision making based on concrete
  evidence from old data
• Challenges of heterogeneity and scale in
  warehouse construction and maintenance
• Grades of data analysis tools: straight
  querying, reporting tools, multidimensional
  analysis and mining.

Editor's notes

  1. Start with a real-life scenario
  2. CHECK ON THE PRODUCTS INTERESTING ALGORITHMS
  3. Cognos and MicroStrategy are next in line. OLAP market: 1.4B in 1997, 40% growth from 1994-97, expected to be 3B in 2000. Source: http://www.olapreport.com/Market.htm
  4. Each topic is a talk..
  5. Absolute: 40 M$, expected to grow 10 times by 2000 (Forrester Research)