1) Data warehousing involves consolidating data from multiple sources into a centralized location, while data mining analyzes this data to find useful patterns.
2) Typical analysis tasks include examining loan approval rates by region or determining factors that influence loan defaults.
3) Online analytical processing (OLAP) tools allow interactive analysis of aggregated data and hypothesis testing through drill-downs and pivots. Leading OLAP products include Oracle Express and Essbase.
1. Data warehousing and mining
Session VII (Part 1) 15:45 - 16:10
Sunita Sarawagi
School of IT, IIT Bombay
2. Introduction
• Organizations getting larger and amassing ever
increasing amounts of data
• Historic data encodes useful information about
working of an organization.
• However, data scattered across multiple sources,
in multiple formats.
• Data warehousing: process of consolidating data
in a centralized location
• Data mining: process of analyzing data to find
useful patterns and relationships
Dr. Sunita Sarawagi Data Warehousing & Mining 2
3. Typical data analysis tasks
• Report the per-capita deposits broken down by
region and profession.
• Are deposits from rural coastal areas increasing
over last five years?
• What percent of small business loans were cleared?
• Why is it less than last year’s? How did similar
businesses that did not take loans perform?
• What should be the new rules for loan eligibility?
Dr. Sunita Sarawagi Data Warehousing & Mining 3
4. Decision support tools
Mining
Direct Reporting OLAP tools
Query tools
Essbase Intelligent Miner
Crystal reports
Merge Relational
Clean Data warehouse DBMS+
Summarize e.g. Redbrick
Detailed GIS
transactional data
data Operational data Census
Bombay branch Delhi branch Calcutta branch data
Oracle IMS SAS
Dr. Sunita Sarawagi Data Warehousing & Mining 4
5. Data warehouse construction
• Heterogeneous data integration
– merge from various sources, fuzzy matches
– remove inconsistencies
• Data cleaning:
– missing data, outliers, clean fields e.g. names/addresses
– Data mining techniques
• Data loading: summarize, create indices
• Products: Prism warehouse manager, Platinum info
refiner, info pump, QDB, Vality
Dr. Sunita Sarawagi Data Warehousing & Mining 5
6. Warehouse maintenance
• Data refresh
– when to refresh, what form to send updates?
• Materialized view maintenance with batch
updates.
• Query evaluation using materialized views
• Monitoring and reporting tools
– HP intelligent warehouse advisor
Dr. Sunita Sarawagi Data Warehousing & Mining 6
7. Decision support tools
Mining
Direct Reporting OLAP tools
Query tools
Essbase Intelligent Miner
Crystal reports
Merge Relational
Clean Data warehouse DBMS+
Summarize e.g. Redbrick
Detailed GIS
transactional data
data Operational data Census
Bombay branch Delhi branch Calcutta branch data
Oracle IMS SAS
Dr. Sunita Sarawagi Data Warehousing & Mining 7
8. OLAP
Fast, interactive answers to large aggregate queries.
• Multidimensional model: dimensions with
hierarchies
– Dim 1: Bank location:
• branch-->city-->state
– Dim 2: Customer:
• sub profession --> profession
– Dim 3: Time:
• month --> quarter --> year
• Measures: loan amount, #transactions, balance
Dr. Sunita Sarawagi Data Warehousing & Mining 8
9. OLAP
• Navigational operators: Pivot, drill-down,
roll-up, select.
• Hypothesis driven search: E.g. factors
affecting defaulters
– view defaulting rate on age aggregated over other
dimensions
– for particular age segment detail along profession
• Need interactive response to aggregate queries..
Dr. Sunita Sarawagi Data Warehousing & Mining 9
10. OLAP products
• About 30 OLAP vendors
• Dominant ones:
– Oracle Express: largest market share: 20%
– Arbor Essbase: technology leader
– Microsoft Plato: introduced late last year,
rapidly taking over...
Dr. Sunita Sarawagi Data Warehousing & Mining 10
11. Microsoft OLAP strategy
• Plato: OLAP server: powerful, integrating various
operational sources
• OLE-DB for OLAP: emerging industry standard
based on MDX --> extension of SQL for OLAP
• Pivot-table services: integrate with Office 2000
– Every desktop will have OLAP capability.
• Client side caching and calculations
• Partitioned and virtual cube
• Hybrid relational and multidimensional storage
Dr. Sunita Sarawagi Data Warehousing & Mining 11
12. Data mining
• Process of semi-automatically analyzing large
databases to find interesting and useful
patterns
• Overlaps with machine learning, statistics,
artificial intelligence and databases but
– more scalable in number of features and instances
– more automated to handle heterogeneous data
Dr. Sunita Sarawagi Data Warehousing & Mining 12
13. Some basic operations
• Predictive:
– Regression
– Classification
• Descriptive:
– Clustering / similarity matching
– Association rules and variants
– Deviation detection
Dr. Sunita Sarawagi Data Warehousing & Mining 13
14. Classification
• Given old data about customers and payments,
predict new applicant’s loan eligibility.
Previous customers Classifier Decision rules
Age
Salary > 5 L
Salary Good/
Profession Prof. = Exec bad
Location
Customer
type
New applicant’s data
Dr. Sunita Sarawagi Data Warehousing & Mining 14
15. Classification methods
• Nearest neighbor
• Regression: (linear or any polynomial)
– a*salary + b*age + c = eligibility score.
• Decision tree classifier
• Probabilistic/generative models
• Neural networks
Dr. Sunita Sarawagi Data Warehousing & Mining 15
16. Clustering
• Unsupervised learning when old data with class
labels not available e.g. when introducing a new
product.
• Group/cluster existing customers based on time
series of payment history such that similar customers
in same cluster.
• Key requirement: Need a good measure of similarity
between instances.
• Identify micro-markets and develop policies for each
Dr. Sunita Sarawagi Data Warehousing & Mining 16
17. Association rules
T
Milk, cereal
• Given set T of groups of items Tea, milk
• Example: set of item sets purchased
Tea, rice, bread
• Goal: find all rules on itemsets of the
form a-->b such that
– support of a and b > user threshold s
– conditional probability (confidence) of b
given a > user threshold c
• Example: Milk --> bread
• Purchase of product A --> service B
Dr. Sunita Sarawagi Data Warehousing & Mining cereal 17
18. Mining market
• Around 20 to 30 mining tool vendors
• Major players:
– Clementine,
– IBM’s Intelligent Miner,
– SGI’s MineSet,
– SAS’s Enterprise Miner.
• All pretty much the same set of tools
• Many embedded products: fraud detection, electronic
commerce applications
Dr. Sunita Sarawagi Data Warehousing & Mining 18
19. Conclusions
• The value of warehousing and mining in
effective decision making based on concrete
evidence from old data
• Challenges of heterogeneity and scale in
warehouse construction and maintenance
• Grades of data analysis tools: straight
querying, reporting tools, multidimensional
analysis and mining.
Dr. Sunita Sarawagi Data Warehousing & Mining 19
Notes de l'éditeur
Start with a real-life scenario
CHECK ON THE PRODUCTS INTERESTING ALGORITHMS
Cognos and microstrategy next in line 1.4B in 1997, 40% growth from 1994-97, expected to be 3B in 2000 Source: http://www.olapreport.com/Market.htm
Each topic is a talk..
Absolute: 40 M$ 40M$, expected to grow 10 times by 2000 --Forrester research