• Data Explosion Problem
1. Automated data collection tools (e.g. web, sensor networks) and mature
database technology lead to tremendous amounts of data stored in databases,
data warehouses and other information repositories.
2. Enterprises are currently facing a data explosion problem.
• Electronic Information: an Important Asset for Business Decisions
1. With the growth of electronic information, enterprises began to realize that the
accumulated information can be an important asset in their business decisions.
2. There is potential business intelligence hidden in the large volume of data.
3. This intelligence can be the secret weapon on which the success of a business
may depend.
1. It is not a simple matter to discover business intelligence in a mountain of accumulated data.
2. What is required are techniques that allow the enterprise to extract the most valuable information.
3. The field of Data Mining provides such techniques.
4. These techniques can find novel (previously unknown) patterns that may assist an enterprise in understanding the business better and in forecasting.
• SQL. SQL is a query language that is difficult for business people to use.
• EIS = Executive Information Systems. EIS systems provide graphical interfaces that give executives a pre-programmed (and therefore limited) selection of reports, automatically generating the necessary SQL for each.
• OLAP allows views along multiple dimensions, and drill-down, therefore giving access to a vast array of analyses. However, it requires manual navigation through scores of reports, and the user has to notice interesting patterns themselves.
• Data Mining picks out interesting patterns. The user can then use visualization tools to investigate further.
• What is driving sales of walking sticks?
• Step 1: View some OLAP graphs, e.g. walking stick sales by city.
• Step 2: Noticing that Islamabad has high sales, you decide to investigate further.
• (Before OLAP, you would have had to write a very complex SQL query instead of simply clicking to drill down.)
• It seems that old people are responsible for most walking stick sales. You confirm this by viewing a chart of age distributions by city.
• But imagine if you had to do this manual investigation for all of the 10,000 products in your range! Here, OLAP gives way to Data Mining.
[Chart: Walking Stick Sales by City. Karachi 50, Lahore 10, Islamabad 400]
[Chart: Walking Stick Sales in Islamabad by Age. Less than 20: 10, 20 to 60: 30, Older than 60: 360]
[Chart: Age Distribution by City (Karachi, Lahore, Islamabad). Series: Younger than 20, 20 to 60, Older than 60]
• Expert Systems = Rule-Driven Deduction
Top-down: From known rules (expertise) and data to decisions.
[Diagram: Rules + Data → Expert System → Decisions]
• Data Mining = Data-Driven Induction
Bottom-up: From data about past decisions to discovered rules (general
rules induced from the data).
[Diagram: Data (including past decisions) → Data Mining → Rules]
• Machine Learning techniques were designed to deal with limited amounts of data, as studied in artificial intelligence, whereas Data Mining techniques deal with the large amounts of data held in databases.
• Data Mining (Knowledge Discovery in Databases)
▪ Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
• What is not Data Mining?
▪ (Deductive) query processing.
▪ Expert systems or small ML/statistical programs.
• Random Guessing vs. Potential Knowledge
▪ Suppose we have to forecast the probability of rain in Islamabad for any particular day.
▪ Without any prior knowledge, the probability of rain would be 50% (pure random guess).
▪ If we had a lot of weather data, we could extract potential rules using Data Mining, which can then forecast the chance of rain better than random guessing.
▪ Example: the rule
if [Temperature = ‘hot’ and Humidity = ‘high’] then there is a 66.7% chance of rain (2 of the 3 matching records have Rain = Yes).
Temperature  Humidity  Windy  Rain
hot          high      false  No
hot          high      true   Yes
hot          high      false  Yes
mild         high      false  No
cool         normal    false  No
cool         normal    true   Yes
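The 66.7% figure can be checked directly from the six records. The following is a minimal sketch, not part of the original slides; the function name `rule_confidence` and the tuple layout are illustrative assumptions.

```python
# Weather records copied from the table: (temperature, humidity, windy, rain)
records = [
    ("hot", "high", False, "No"),
    ("hot", "high", True, "Yes"),
    ("hot", "high", False, "Yes"),
    ("mild", "high", False, "No"),
    ("cool", "normal", False, "No"),
    ("cool", "normal", True, "Yes"),
]


def rule_confidence(records, temperature, humidity):
    """Fraction of records matching the condition that have Rain = Yes.

    Returns None when no record matches the condition.
    """
    matching = [r for r in records if r[0] == temperature and r[1] == humidity]
    if not matching:
        return None
    rainy = sum(1 for r in matching if r[3] == "Yes")
    return rainy / len(matching)


confidence = rule_confidence(records, "hot", "high")
print(f"Chance of rain when hot and humid: {confidence:.1%}")
```

Three records are hot and humid, and two of them have Rain = Yes, which gives 2/3.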
• Step 0: Determine Business Objective
 - e.g. forecasting the probability of rain.
 - Must have relevant prior knowledge and goals of the application.
• Step 1: Prepare Data
 - Handling noisy and missing values (Data Cleaning).
 - Data Transformation (Normalization/Discretization).
 - Attribute/Feature Selection.
• Step 2: Choose the Function of Data Mining
 - Classification, Clustering, Association Rules.
• Step 3: Choose the Mining Algorithm
 - Select the appropriate algorithm depending on the quality of the data.
 - Select the appropriate algorithm depending on the density of the data.
• Step 4: Data Mining
 - Search for patterns of interest. A typical data mining algorithm can mine millions of patterns.
• Step 5: Visualization/Knowledge Representation
 - Visualization/representation of the interesting patterns, etc.
• Data mining: the core of the knowledge discovery process.
[Diagram: Databases → Data Integration → Data Warehouse → Task-relevant Data → Data Mining → Pattern Evaluation]
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Advanced DB and information repositories
▪ Time-series data and temporal data
▪ Text databases
▪ Multimedia databases
▪ Data streams (sensor network data)
▪ WWW
• Data Preprocessing
▪ Handling Missing and Noisy Data (Data Cleaning).
▪ Techniques we will cover:
 - Missing value imputation using mean, median and mode.
 - Missing value imputation using K-Nearest Neighbor.
 - Missing value imputation using Association Rules Mining.
 - Missing value imputation using Fault-Tolerant Patterns (will be a research project).
 - Data binning for noisy data.
TID  Refund  Country    Taxable Income  Cheat
1    Yes     USA        125K            No
2    ?       UK         100K            No
3    No      Australia  70K             No
4    ?       ?          120K            No
5    No      NZL        95K             Yes
(? = missing value)
• Data Preprocessing
▪ Data Transformation (Discretization and Normalization).
▪ With the help of data transformation, rules become more general and compact.
▪ General and compact rules increase the accuracy of classification.
Bins: Child = (0 to 20), Young = (21 to 47), Old = (48 to 120)
Age  Discretized Age
15   Child
18   Child
40   Young
33   Young
55   Old
48   Old
12   Child
23   Young
Rules before discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = 08 then Buy_Computer = No.
2. If attribute 1 = value1 & attribute 2 = value2 and Age = 09 then Buy_Computer = No.
3. If attribute 1 = value1 & attribute 2 = value2 and Age = 10 then Buy_Computer = No.
Rule after discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = Child then Buy_Computer = No.
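The discretization step can be sketched in a few lines. This is an illustrative sketch only: the bin boundaries are the ones given on the slide, while the function name `discretize_age` is an assumption introduced here.

```python
def discretize_age(age):
    """Map a numeric age to the slide's bins: Child, Young or Old."""
    if 0 <= age <= 20:
        return "Child"
    if 21 <= age <= 47:
        return "Young"
    if 48 <= age <= 120:
        return "Old"
    raise ValueError(f"age out of range: {age}")


ages = [15, 18, 40, 33, 55, 48, 12, 23]
labels = [discretize_age(a) for a in ages]
print(labels)

# Ages 8, 9 and 10 all fall into the same bin, which is why the three
# age-specific rules collapse into a single rule on Age = Child.
print({discretize_age(a) for a in (8, 9, 10)})
```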
• Data Preprocessing
▪ Attribute Selection/Feature Selection
 - Selection of those attributes which are most relevant to the data mining task.
 - Advantage 1: Decreases the processing time of the mining task.
 - Advantage 2: Generalizes the rules.
▪ Example
 - Suppose our mining goal is to find which countries have more Cheat cases, and at which Taxable Income.
 - Then obviously the Date attribute will not be an important factor in our mining task.
Date        Refund  Country    Taxable Income  Cheat
11/02/2002  Yes     USA        125K            No
13/02/2002  Yes     UK         100K            No
16/02/2002  No      Australia  120K            Yes
21/03/2002  No      Australia  120K            Yes
26/02/2002  No      NZL        95K             Yes
• Data Preprocessing
▪ We will cover the following Attribute/Feature Selection techniques:
 - Principal Component Analysis
 - Wrapper Based
 - Filter Based
• Association Rule Mining
▪ In the Association Rule Mining framework we have to find all the rules in a transactional/relational dataset whose support (frequency) is greater than or equal to some minimum support (min_sup) threshold (provided by the user).
▪ For example, with min_sup = 50%:
Transaction ID  Items Bought
2000            Bread, Butter, Egg
1000            Bread, Butter, Egg
4000            Bread, Butter, Tea
5000            Butter, Ice cream, Cake

Itemset               Support
{Butter}              4
{Bread}               3
{Egg}                 2
{Bread, Butter}       3
{Bread, Butter, Egg}  2
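The support counts can be reproduced with a brute-force sketch. This is not Apriori or FP-Growth, only a level-by-level enumeration for a tiny dataset; the function names `support` and `frequent_itemsets` are illustrative assumptions. Because it enumerates every candidate, it also reports {Bread, Egg} and {Butter, Egg}, which the slide's table does not list.

```python
from itertools import combinations

# Transactions copied from the slide
transactions = {
    2000: {"Bread", "Butter", "Egg"},
    1000: {"Bread", "Butter", "Egg"},
    4000: {"Bread", "Butter", "Tea"},
    5000: {"Butter", "Ice cream", "Cake"},
}


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if set(itemset) <= items)


def frequent_itemsets(transactions, min_sup_fraction):
    """Enumerate all itemsets whose support count is >= min_sup."""
    min_count = min_sup_fraction * len(transactions)
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(all_items) + 1):
        found_any = False
        for candidate in combinations(all_items, size):
            count = support(candidate, transactions)
            if count >= min_count:
                result[frozenset(candidate)] = count
                found_any = True
        if not found_any:
            # No frequent itemset of this size, so none of a larger size
            # can be frequent either (every subset of a frequent itemset
            # must itself be frequent).
            break
    return result


frequent = frequent_itemsets(transactions, 0.5)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(sorted(itemset), count)
```

With min_sup = 50% of 4 transactions, an itemset needs a support count of at least 2.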
• Association Rule Mining
▪ Topics we will cover:
 - Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bit-vector).
 - Fault-Tolerant/Approximate Frequent Itemset Mining.
 - N-Most Interesting Frequent Itemset Mining.
 - Closed and Maximal Frequent Itemset Mining.
 - Incremental Frequent Itemset Mining.
 - Sequential Patterns.
▪ Projects
 - Mining Fault-Tolerant Patterns Using Pattern-Growth.
 - Application of Fault-Tolerant Frequent Patterns to Missing Value Imputation (Course Project).
• Classification and Prediction
▪ Finding models (functions) that describe and distinguish classes or concepts for future prediction.
▪ Example: Classify rainy/non-rainy cities based on the Temperature, Humidity and Windy attributes.
▪ The previous business decisions (class labels) must be known (Supervised Learning).
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  No
Islamabad   hot          high      true   Yes
Islamabad   hot          high      false  Yes
Multan      mild         low       false  No
Karachi     cool         normal    false  No
Rawalpindi  hot          high      true   Yes
Rule
• If Temperature = Hot & Humidity = High then Rain = Yes.
Prediction of unknown records:
City   Temperature  Humidity  Windy  Rain
Muree  hot          high      false  ?
Sibi   mild         low       true   ?
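Applying the discovered rule to the unknown records can be sketched as follows. This is illustrative only: the function `predict_rain` is an assumption, and returning `None` when the rule does not fire is a design choice of this sketch, since the slide gives no rule for the other cases.

```python
# Labelled records from the slide: (city, temperature, humidity, windy, rain)
training = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]

# Unlabelled records from the slide: (city, temperature, humidity, windy)
unknown = [("Muree", "hot", "high", False), ("Sibi", "mild", "low", True)]


def predict_rain(temperature, humidity):
    """Apply the slide's rule; return None when the rule does not fire."""
    if temperature == "hot" and humidity == "high":
        return "Yes"
    return None


# Check the rule against the labelled data wherever it fires.
fired = [
    (rain, predict_rain(t, h))
    for _, t, h, _, rain in training
    if predict_rain(t, h) is not None
]
rule_is_consistent = all(actual == predicted for actual, predicted in fired)

predictions = {city: predict_rain(t, h) for city, t, h, _ in unknown}
print(rule_is_consistent, predictions)
```

The rule fires for Muree (hot, high) and stays silent for Sibi (mild, low); a complete classifier would need further rules or a default class.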
• Cluster Analysis
▪ Group data to form new classes based on unlabeled data.
▪ Business decisions are unknown (also called Unsupervised Learning).
▪ Example: Group cities into rainy/non-rainy clusters based on the Temperature, Humidity and Windy attributes.
City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?
[Figure: the records grouped into 3 clusters]
• Outlier Analysis
▪ Outlier: a data object that does not comply with the general behavior of the data.
▪ It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis.
[Figure: data set with 2 outliers]
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
▪ Suggested approach: query-based, constraint-based mining.
• Interestingness Measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
• Find all the interesting patterns: Completeness
▪ Can a data mining system find all the interesting patterns?
▪ Remember that most of the problems in Data Mining are NP-Complete.
▪ There is no global best solution for any single problem.
• Search for only interesting patterns: Optimization
▪ Can a data mining system find only the interesting patterns?
▪ Approaches
 - First generate all the patterns and then filter out the uninteresting ones.
 - Generate only the interesting patterns: constraint-based mining (give threshold factors in mining).
• Book Chapter
▪ Chapter 1 of Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”.
• Some Nice Resources
▪ ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD): http://www.acm.org/sigs/sigkdd/.
▪ Knowledge Discovery Nuggets: www.kdnuggets.com.
▪ IEEE Transactions on Knowledge and Data Engineering: http://www.computer.org/tkde/.
▪ IEEE Transactions on Pattern Analysis and Machine Intelligence: http://www.computer.org/tpami/.
▪ Data Mining and Knowledge Discovery. Publisher: Springer Science+Business Media B.V., formerly Kluwer Academic Publishers B.V. http://www.kluweronline.com/issn/1384-5810/.
▪ Current and previous offerings of Data Mining courses at Stanford, CMU, MIT and Helsinki.
• The course will be mainly based on research literature; the following texts may however be consulted:
▪ Jiawei Han and Micheline Kamber. “Data Mining: Concepts and Techniques”.
1. David Hand, Heikki Mannila and Padhraic Smyth. “Principles of Data Mining”. Pub. Prentice Hall of India, 2004.
2. Sushmita Mitra and Tinku Acharya. “Data Mining: Multimedia, Soft Computing and Bioinformatics”. Pub. Wiley and Sons Inc., 2003.
3. Usama M. Fayyad et al. “Advances in Knowledge Discovery and Data Mining”. The MIT Press, 1996.
• Data quality
• Missing value imputation using the mean, median and k-Nearest Neighbor approaches
• Distance measures
• Data quality is a major concern in Data Mining and Knowledge Discovery tasks.
• Why: Almost all Data Mining algorithms induce knowledge strictly from data.
• The quality of the knowledge extracted depends heavily on the quality of the data.
• There are two main problems in data quality:
▪ Missing data: the data is not present.
▪ Noisy data: the data is present but not correct.
• Sources of missing/noisy data:
▪ Hardware failure.
▪ Data transmission error.
▪ Data entry problem.
▪ Refusal of respondents to answer certain questions.
Training data:
age    income  student  buys_computer
<=30   high    yes      yes
<=30   high    no       yes
>40    medium  yes      no
>40    medium  no       no
>40    low     yes      yes
31…40  ?       no       yes
31…40  medium  yes      yes
(? = missing value)

Data Mining: discover only those rules whose support (frequency) is >= 2:
• If ‘age <= 30’ and income = ‘high’ then buys_computer = ‘yes’
• If ‘age > 40’ and income = ‘medium’ then buys_computer = ‘no’

Testing data or actual data:
age    income  student  buys_computer
<=30   high    no       ?
>40    medium  yes      ?
31…40  medium  yes      ?

Due to the missing value in the training dataset, the accuracy of prediction decreases to 66.7% (only two of the three test records match a discovered rule).
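The effect of the missing value can be reproduced with a small sketch. It is not the original author's mining algorithm: it only counts (age, income) → buys_computer combinations and keeps those with support >= 2. The function name `mine_rules` is illustrative, and the repaired value "medium" is a hypothetical assumption used only to show what a correct imputation would change.

```python
from collections import Counter

# Training data from the slide: (age, income, student, buys_computer).
# None marks the missing income value.
training = [
    ("<=30", "high", "yes", "yes"),
    ("<=30", "high", "no", "yes"),
    (">40", "medium", "yes", "no"),
    (">40", "medium", "no", "no"),
    (">40", "low", "yes", "yes"),
    ("31…40", None, "no", "yes"),
    ("31…40", "medium", "yes", "yes"),
]

# Testing data from the slide: (age, income, student)
testing = [
    ("<=30", "high", "no"),
    (">40", "medium", "yes"),
    ("31…40", "medium", "yes"),
]

MIN_SUPPORT = 2


def mine_rules(rows, min_support):
    """Keep (age, income) -> buys_computer rules seen at least min_support times."""
    counts = Counter(
        (age, income, buys) for age, income, _, buys in rows if income is not None
    )
    return {
        (age, income): buys
        for (age, income, buys), n in counts.items()
        if n >= min_support
    }


rules = mine_rules(training, MIN_SUPPORT)
covered = [row for row in testing if (row[0], row[1]) in rules]
coverage = len(covered) / len(testing)
print(rules, f"{coverage:.1%}")

# Hypothetical repair: assume the missing income was "medium".
repaired = [
    (age, "medium" if income is None else income, student, buys)
    for age, income, student, buys in training
]
rules_repaired = mine_rules(repaired, MIN_SUPPORT)
print(len(rules_repaired))
```

With the missing value in place, only two rules reach the support threshold, so the third test record is left without a matching rule.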
• Imputation is a term that denotes a procedure that replaces the missing values in a dataset with plausible values,
▪ i.e. by considering the relationships among correlated values of the attributes of the dataset.
Attribute 1  Attribute 2  Attribute 3  Attribute 4
20           cool         high         false
?            cool         high         true
20           cool         high         true
20           mild         low          false
30           cool         normal       false
10           mild         high         true
(? = missing value to impute)
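• If we consider only {attribute#2}, the value “cool” appears in 4 records, one of which is the record with the missing value. Of the other 3 records, two have Attribute 1 = 20 and one has Attribute 1 = 30:
▪ Probability of imputing value (20) = 66.7%
▪ Probability of imputing value (30) = 33.3%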
• For {attribute#4}, the value “true” appears in 3 records (including the record with the missing value):
▪ Probability of imputing value (20) = 50%
▪ Probability of imputing value (10) = 50%
• For {attribute#2, attribute#3}, the values {“cool”, “high”} appear together in only 2 records with a known Attribute 1 value, and both have the value 20:
▪ Probability of imputing value (20) = 100%
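These conditional probabilities can be computed mechanically. The sketch below is one possible reading of the slides, not a definitive implementation: it looks at the records that agree with the incomplete record on the chosen attributes and counts their known Attribute 1 values. The function name `imputation_probabilities` is an assumption.

```python
from collections import Counter

# Rows from the slide: (attribute1, attribute2, attribute3, attribute4).
# None marks the missing value to impute.
rows = [
    (20, "cool", "high", False),
    (None, "cool", "high", True),
    (20, "cool", "high", True),
    (20, "mild", "low", False),
    (30, "cool", "normal", False),
    (10, "mild", "high", True),
]


def imputation_probabilities(rows, target_index, condition_indexes):
    """Distribution of known target values among records that match the
    incomplete record on the given condition attributes."""
    incomplete = next(r for r in rows if r[target_index] is None)
    wanted = tuple(incomplete[i] for i in condition_indexes)
    values = [
        r[target_index]
        for r in rows
        if r[target_index] is not None
        and tuple(r[i] for i in condition_indexes) == wanted
    ]
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Condition on attribute 4 only (index 3)
by_attribute4 = imputation_probabilities(rows, 0, [3])
# Condition on attributes 2 and 3 together (indexes 1 and 2)
by_attributes_2_3 = imputation_probabilities(rows, 0, [1, 2])
print(by_attribute4, by_attributes_2_3)
```

Conditioning on more correlated attributes narrows the candidates: attribute 4 alone leaves a 50/50 split, while attributes 2 and 3 together point to a single value.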
• Missing data randomness is divided into three classes.
1. Missing completely at random (MCAR): occurs when the probability of an instance (case) having a missing value for an attribute does not depend on either the known attribute values or the missing attribute itself.
2. Missing at random (MAR): occurs when the probability of an instance (case) having a missing value for an attribute depends on the known attribute values, but not on the missing attribute itself.
3. Not missing at random (NMAR): occurs when the probability of an instance having a missing value for an attribute could depend on the value of that attribute.
• Ignoring and discarding data: There are two main ways to discard data with missing values.
▪ Discard all those records which have missing data (also called discard case analysis).
▪ Discard only those attributes which have a high level of missing data.
• Imputation using mean/median or mode: one of the most frequently used methods (statistical technique).
▪ Replace missing values of a numeric (continuous) attribute using the mean/median. (The median is robust against noise.)
▪ Replace missing values of a discrete attribute using the mode.
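Mean, median and mode imputation can be sketched with Python's standard `statistics` module. The helper `impute` is an illustrative assumption; the two columns are taken from tables shown earlier in the slides (Attribute 1 of the imputation example and Refund of the tax table), with `None` marking a missing value.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace every None with strategy(known values)."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]


# Numeric attribute: use mean or median
attribute1 = [20, None, 20, 20, 30, 10]
by_mean = impute(attribute1, mean)
by_median = impute(attribute1, median)

# Discrete attribute: use mode (the most frequent known value)
refund = ["Yes", None, "No", None, "No"]
by_mode = impute(refund, mode)

print(by_mean, by_median, by_mode)
```

A single extreme value would shift the mean but leave the median almost unchanged, which is what "robust against noise" refers to.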
• Replace missing values using a prediction/classification model:
▪ Advantage: it considers the relationship between the known attribute values and the missing values, so the imputation accuracy can be very high.
▪ Disadvantage: if no correlation exists between the missing attribute values and the known attribute values, the imputation cannot be performed.
▪ (Alternative approach): use a hybrid combination of a prediction/classification model and mean/mode.
 - First try to impute the missing value using the prediction/classification model; if that fails, fall back to the median/mode.
• We will study more about this topic in Association Rules Mining.
• Similarity
▪ Numerical measure of how alike two data objects are.
▪ Is higher when objects are more alike.
▪ Often falls in the range [0,1].
• Dissimilarity
▪ Numerical measure of how different two data objects are.
▪ Is lower when objects are more alike.
▪ Minimum dissimilarity is often 0.
▪ Upper limit varies.
• Proximity refers to either a similarity or a dissimilarity.
• Remember that the K-Nearest Neighbors are determined on the basis of some kind of “distance” between points.
• Two major classes of distance measure:
1. Euclidean: based on the position of points in some k-dimensional space.
2. Non-Euclidean: not related to position or space.
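To show how a distance measure drives K-Nearest Neighbor imputation, here is a minimal sketch under stated assumptions: all attributes are numeric, only the target attribute can be missing, Euclidean distance is used, and the fill value is the mean of the k nearest complete records. The data is a hypothetical toy example, not taken from the slides, and `knn_impute` is an illustrative name.

```python
import math
from statistics import mean


def knn_impute(rows, target_index, k):
    """Fill a missing target value with the mean of that attribute over
    the k complete records nearest in the remaining attributes."""
    complete = [r for r in rows if r[target_index] is not None]
    result = []
    for row in rows:
        if row[target_index] is not None:
            result.append(row)
            continue

        def distance(other):
            # Euclidean distance over all attributes except the target
            return math.sqrt(
                sum(
                    (row[i] - other[i]) ** 2
                    for i in range(len(row))
                    if i != target_index
                )
            )

        nearest = sorted(complete, key=distance)[:k]
        fill = mean(r[target_index] for r in nearest)
        result.append(row[:target_index] + (fill,) + row[target_index + 1:])
    return result


# Hypothetical toy data: (age, income in thousands, spending score)
rows = [
    (25, 40, 10),
    (27, 42, 12),
    (60, 90, 50),
    (62, 95, 54),
    (26, 41, None),  # spending score is missing
]
imputed = knn_impute(rows, target_index=2, k=2)
print(imputed[-1])
```

The incomplete record is closest to the first two records, so its missing value is filled from them rather than from the distant third and fourth records.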
• Applying a distance measure largely depends on the type of input data.
• Major scales of measurement:
1. Nominal Data (aka Nominal Scale Variables)
▪ Typically classification data, e.g. m/f
▪ No ordering, e.g. it makes no sense to state that M > F
▪ Binary variables are a special case of nominal scale variables.
2. Ordinal Data (aka Ordinal Scale)
▪ Ordered, but differences between values are not important
▪ e.g., political parties on a left-to-right spectrum given labels 0, 1, 2
▪ e.g., Likert scales: rank your degree of satisfaction on a scale of 1..5
▪ e.g., restaurant ratings
3. Numeric Data (aka Interval Scaled)
▪ Ordered and equal intervals. Measured on a linear scale.
▪ Differences make sense
▪ e.g., temperature (C, F), height, weight, age, date
• Only certain operations can be performed on certain scales of measurement.
Scale           Supported operations
Nominal Scale   1. Equality, 2. Count
Ordinal Scale   1. Equality, 2. Count, 3. Rank (cannot quantify difference)
Interval Scale  1. Equality, 2. Count, 3. Rank, 4. Quantify the difference
• d is a distance measure if it is a function from pairs of points to reals such that:
1. d(x,x) = 0.
2. d(x,y) = d(y,x).
3. d(x,y) > 0 for x ≠ y.
4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).
• L2 norm (also called the common or Euclidean distance):
▪ The most common notion of “distance.”
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \dots + |x_{ip}-x_{jp}|^2}$$
• L1 norm (also called the Manhattan distance):
▪ The distance if you had to travel along coordinates only.
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \dots + |x_{ip}-x_{jp}|$$
Example: x = (5,5), y = (9,8)
L2-norm: $dist(x,y) = \sqrt{4^2 + 3^2} = 5$
L1-norm: $dist(x,y) = 4 + 3 = 7$
[Figure: right triangle between x and y with legs 4 and 3 and hypotenuse 5]
• L∞ norm: d(x,y) = the maximum of the absolute differences between x and y in any dimension.
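The three norms can be compared on the example points x = (5,5) and y = (9,8). This is a minimal sketch; the function names are illustrative assumptions.

```python
import math


def l1_distance(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linf_distance(x, y):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (5, 5), (9, 8)
print(l1_distance(x, y), l2_distance(x, y), linf_distance(x, y))
```

The coordinate differences are 4 and 3, so L1 adds them, L2 takes the hypotenuse, and L∞ keeps only the larger one.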

Contenu connexe

Tendances

Big Data & Oracle Technologies
Big Data & Oracle TechnologiesBig Data & Oracle Technologies
Big Data & Oracle TechnologiesOleksii Movchaniuk
 
The Intelligent Thing -- Using In-Memory for Big Data and Beyond
The Intelligent Thing -- Using In-Memory for Big Data and BeyondThe Intelligent Thing -- Using In-Memory for Big Data and Beyond
The Intelligent Thing -- Using In-Memory for Big Data and BeyondInside Analysis
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An OverviewC. Scyphers
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataSpringPeople
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataHaluan Irsad
 
DGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data QualityDGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data QualityCaserta
 
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Caserta
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataVipin Batra
 
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Mark Rittman
 
Lecture 19
Lecture 19Lecture 19
Lecture 19Shani729
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopSavvycom Savvycom
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Seeling Cheung
 
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...Simplilearn
 
Testing the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTesting the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTechWell
 

Tendances (20)

Big Data & Oracle Technologies
Big Data & Oracle TechnologiesBig Data & Oracle Technologies
Big Data & Oracle Technologies
 
The Intelligent Thing -- Using In-Memory for Big Data and Beyond
The Intelligent Thing -- Using In-Memory for Big Data and BeyondThe Intelligent Thing -- Using In-Memory for Big Data and Beyond
The Intelligent Thing -- Using In-Memory for Big Data and Beyond
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
DGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data QualityDGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data Quality
 
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
Big Data Warehousing Meetup: Real-time Trade Data Monitoring with Storm & Cas...
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
 
Lecture 19
Lecture 19Lecture 19
Lecture 19
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 
Bigdata
Bigdata Bigdata
Bigdata
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
 
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
 
Testing the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTesting the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big Problems
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 

En vedette

Python tutorial
Python tutorialPython tutorial
Python tutorialShani729
 
Lecture 29
Lecture 29Lecture 29
Lecture 29Shani729
 
Lecture 35
Lecture 35Lecture 35
Lecture 35Shani729
 
Lecture 37
Lecture 37Lecture 37
Lecture 37Shani729
 
Lecture 40
Lecture 40Lecture 40
Lecture 40Shani729
 
Lecture 39
Lecture 39Lecture 39
Lecture 39Shani729
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodShani729
 
Lecture 38
Lecture 38Lecture 38
Lecture 38Shani729
 
Python tutorialfeb152012
Python tutorialfeb152012Python tutorialfeb152012
Python tutorialfeb152012Shani729
 

En vedette (9)

Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Lecture 29
Lecture 29Lecture 29
Lecture 29
 
Lecture 35
Lecture 35Lecture 35
Lecture 35
 
Lecture 37
Lecture 37Lecture 37
Lecture 37
 
Lecture 40
Lecture 40Lecture 40
Lecture 40
 
Lecture 39
Lecture 39Lecture 39
Lecture 39
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
 
Lecture 38
Lecture 38Lecture 38
Lecture 38
 
Python tutorialfeb152012
Python tutorialfeb152012Python tutorialfeb152012
Python tutorialfeb152012
 

Similaire à Data warehousing and mining furc

Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2Mahmoud Alfarra
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Roger Barga
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on HadoopCaserta
 
1-_Intro_to_Data_Minning__DWH.ppt
1-_Intro_to_Data_Minning__DWH.ppt1-_Intro_to_Data_Minning__DWH.ppt
1-_Intro_to_Data_Minning__DWH.pptBsMath3rdsem
 
Introduction to AutoML and Data Science using the Oracle Autonomous Database ...
Introduction to AutoML and Data Science using the Oracle Autonomous Database ...Introduction to AutoML and Data Science using the Oracle Autonomous Database ...
Introduction to AutoML and Data Science using the Oracle Autonomous Database ...Sandesh Rao
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining Sushil Kulkarni
 
IT Operation Analytic for security- MiSSconf(sp1)
IT Operation Analytic for security- MiSSconf(sp1)IT Operation Analytic for security- MiSSconf(sp1)
IT Operation Analytic for security- MiSSconf(sp1)stelligence
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningNofel Elahi
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptAravindReddy565690
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the studyanjanishah774
 
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptlect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptDEEPAK948083
 
lec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptxlec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptxAmjadAlDgour
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptxXanGwaps
 
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysWhat is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysNEWYORKSYS-IT SOLUTIONS
 

Similaire à Data warehousing and mining furc (20)

Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Data warehousing and Data mining
Data warehousing and Data mining Data warehousing and Data mining
Data warehousing and Data mining
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Data warehousing and mining furc

  • 1.
  • 2. Data Explosion Problem
      1. Automated data collection tools (e.g. web, sensor networks) and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
      2. Enterprises currently face a data explosion problem.
     Electronic Information: an Important Asset for Business Decisions
      1. With the growth of electronic information, enterprises began to realize that the accumulated information can be an important asset in their business decisions.
      2. There is potential business intelligence hidden in the large volume of data.
      3. This intelligence can be the secret weapon on which the success of a business may depend.
  • 3.
      1. It is not a simple matter to discover business intelligence from a mountain of accumulated data.
      2. What is required are techniques that allow the enterprise to extract the most valuable information.
      3. The field of Data Mining provides such techniques.
      4. These techniques can find novel (previously unknown) patterns that may assist an enterprise in understanding the business better and in forecasting.
  • 4.
      • SQL. SQL is a query language that is difficult for business people to use.
      • EIS = Executive Information Systems. EIS systems provide graphical interfaces that give executives a pre-programmed (and therefore limited) selection of reports, automatically generating the necessary SQL for each.
      • OLAP allows views along multiple dimensions, and drill-down, therefore giving access to a vast array of analyses. However, it requires manual navigation through scores of reports, requiring the user to notice interesting patterns themselves.
      • Data Mining picks out interesting patterns. The user can then use visualization tools to investigate further.
  • 5. What is driving sales of walking sticks?
      • Step 1: View some OLAP graphs, e.g. walking stick sales by city.
      • Step 2: Noticing that Islamabad has high sales, you decide to investigate further.
      • (Before OLAP, you would have had to write a very complex SQL query instead of simply clicking to drill down.)
      • It seems that old people are responsible for most walking stick sales. You confirm this by viewing a chart of age distributions by city.
      • But imagine if you had to do this manual investigation for all of the 10,000 products in your range! Here, OLAP gives way to Data Mining.
      [Chart: Walking Stick Sales by City. Karachi 50, Lahore 10, Islamabad 400]
      [Chart: Walking Stick Sales in Islamabad by Age. Less than 20: 10, 20 to 60: 30, Older than 60: 360]
      [Chart: Age Distribution by City. Cities: Karachi, Lahore, Islamabad; age groups: Younger than 20, 20 to 60, Older than 60]
  • 6.
      • Expert Systems = Rule-Driven Deduction
        Top-down: from known rules (expertise) and data to decisions.
        [Diagram: Rules + Data → Expert System → Decisions]
      • Data Mining = Data-Driven Induction
        Bottom-up: from data about past decisions to discovered rules (general rules induced from the data).
        [Diagram: Data (including past decisions) → Data Mining → Rules]
  • 7.
      ▪ Machine learning techniques are designed to deal with limited amounts of data from artificial intelligence applications, whereas data mining techniques deal with large amounts of data held in databases.
      ▪ Data Mining (Knowledge Discovery in Databases)
          - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
      ▪ What is not Data Mining?
          - (Deductive) query processing.
          - Expert systems or small ML/statistical programs.
  • 8. Random Guessing vs. Potential Knowledge
      ▪ Suppose we have to forecast the probability of rain in Islamabad for any particular day.
      ▪ Without any prior knowledge, the probability of rain would be 50% (pure random guess).
      ▪ If we had a lot of weather data, we could extract potential rules using Data Mining, which could then forecast the chance of rain better than random guessing.
      ▪ Example: the rule "if [Temperature = 'hot' and Humidity = 'high'] then Rain = Yes" holds in 2 of the 3 matching records below, i.e. a 66.7% chance of rain.

      Temperature  Humidity  Windy  Rain
      hot          high      false  No
      hot          high      true   Yes
      hot          high      false  Yes
      mild         high      false  No
      cool         normal    false  No
      cool         normal    true   Yes
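The 66.7% figure in the rain example can be checked with a short Python sketch. The function name `rule_confidence` and the tuple layout are illustrative choices, not part of the slides; the six records are the ones in the slide's table.

```python
# Weather records from the slide: (temperature, humidity, windy, rain)
records = [
    ("hot", "high", False, "No"),
    ("hot", "high", True, "Yes"),
    ("hot", "high", False, "Yes"),
    ("mild", "high", False, "No"),
    ("cool", "normal", False, "No"),
    ("cool", "normal", True, "Yes"),
]


def rule_confidence(rows, temperature, humidity):
    """Fraction of rows matching the rule's condition in which it rained.

    Returns None when no row matches the condition.
    """
    matching = [r for r in rows if r[0] == temperature and r[1] == humidity]
    if not matching:
        return None
    rainy = sum(1 for r in matching if r[3] == "Yes")
    return rainy / len(matching)


confidence = rule_confidence(records, "hot", "high")
print(f"{confidence:.1%}")  # 66.7%
```

Three records satisfy the condition and two of them have Rain = Yes, which is where the two-thirds estimate comes from.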
  • 9.
      • Step 0: Determine Business Objective
          - e.g. forecasting the probability of rain.
          - Must have relevant prior knowledge and goals of the application.
      • Step 1: Prepare Data
          - Handling of noisy and missing values (Data Cleaning).
          - Data Transformation (Normalization/Discretization).
          - Attribute/Feature Selection.
      • Step 2: Choose the Function of Data Mining
          - Classification, Clustering, Association Rules.
      • Step 3: Choose the Mining Algorithm
          - Selection of the correct algorithm depending upon the quality of data.
          - Selection of the correct algorithm depending upon the density of data.
      • Step 4: Data Mining
          - Search for patterns of interest: a typical data mining algorithm can mine millions of patterns.
      • Step 5: Visualization/Knowledge Representation
          - Visualization/representation of interesting patterns, etc.
  • 10. Data mining: the core of the knowledge discovery process.
      [Diagram: Databases → Data Integration → Data Warehouse → Task-relevant Data → Data Mining → Pattern Evaluation]
  • 11.
      1. Relational databases
      2. Data warehouses
      3. Transactional databases
      4. Advanced DB and information repositories
          ▪ Time-series data and temporal data
          ▪ Text databases
          ▪ Multimedia databases
          ▪ Data streams (sensor network data)
          ▪ WWW
  • 12. Data Preprocessing
      ▪ Handling Missing and Noisy Data (Data Cleaning).
      ▪ Techniques we will cover:
          - Missing values imputation using Mean, Median and Mode.
          - Missing values imputation using K-Nearest Neighbor.
          - Missing values imputation using Association Rules Mining.
          - Missing values imputation using Fault-Tolerant Patterns (will be a research project).
          - Data Binning for noisy data.

      Example with missing values (shown as ?):

      TID  Refund  Country    Taxable Income  Cheat
      1    Yes     USA        125K            No
      2    ?       UK         100K            No
      3    No      Australia  70K             No
      4    ?       ?          120K            No
      5    No      NZL        95K             Yes
  • 13. Data Preprocessing
      ▪ Data Transformation (Discretization and Normalization).
      ▪ With the help of data transformation, rules become more general and compact.
      ▪ General and compact rules increase the accuracy of classification.

      Bins: Child = (0 to 20), Young = (21 to 47), Old = (48 to 120)

      Age  Discretized Age
      15   Child
      18   Child
      40   Young
      33   Young
      55   Old
      48   Old
      12   Child
      23   Young

      Before discretization (one rule per age value):
      1. If attribute 1 = value1 & attribute 2 = value2 and Age = 08 then Buy_Computer = No.
      2. If attribute 1 = value1 & attribute 2 = value2 and Age = 09 then Buy_Computer = No.
      3. If attribute 1 = value1 & attribute 2 = value2 and Age = 10 then Buy_Computer = No.

      After discretization (a single rule):
      1. If attribute 1 = value1 & attribute 2 = value2 and Age = Child then Buy_Computer = No.
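The age discretization on this slide can be sketched in a few lines of Python. The function name `discretize_age` is an illustrative choice; the bin boundaries and the eight ages are the ones given on the slide.

```python
def discretize_age(age):
    """Map a numeric age to the slide's three bins."""
    if 0 <= age <= 20:
        return "Child"
    if 21 <= age <= 47:
        return "Young"
    if 48 <= age <= 120:
        return "Old"
    raise ValueError(f"age out of range: {age}")


ages = [15, 18, 40, 33, 55, 48, 12, 23]
labels = [discretize_age(a) for a in ages]
print(labels)
# ['Child', 'Child', 'Young', 'Young', 'Old', 'Old', 'Child', 'Young']
```

After this step the three separate rules for Age = 08, 09 and 10 collapse into the single rule for Age = Child, because all three values fall into the same bin.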
  • 14. Data Preprocessing
      ▪ Attribute Selection/Feature Selection
          - Selection of those attributes which are most relevant to the data mining task.
          - Advantage 1: Decreases the processing time of the mining task.
          - Advantage 2: Generalizes the rules.
      ▪ Example
          - Suppose our mining goal is to find which countries have more Cheat cases, and at which Taxable Income levels.
          - Then obviously the Date attribute will not be an important factor in our mining task.

      Date        Refund  Country    Taxable Income  Cheat
      11/02/2002  Yes     USA        125K            No
      13/02/2002  Yes     UK         100K            No
      16/02/2002  No      Australia  120K            Yes
      21/03/2002  No      Australia  120K            Yes
      26/02/2002  No      NZL        95K             Yes
  • 15. Data Preprocessing
      ▪ We will cover the following Attribute/Feature Selection techniques:
          - Principal Component Analysis
          - Wrapper Based
          - Filter Based
  • 16. Association Rule Mining
      ▪ In the Association Rule Mining framework we have to find all the rules in a transactional/relational dataset whose support (frequency) is greater than or equal to some minimum support (min_sup) threshold provided by the user.
      ▪ For example, with min_sup = 50% (at least 2 of the 4 transactions):

      Transaction ID  Items Bought
      2000            Bread, Butter, Egg
      1000            Bread, Butter, Egg
      4000            Bread, Butter, Tea
      5000            Butter, Ice cream, Cake

      Itemset               Support
      {Butter}              4
      {Bread}               3
      {Egg}                 2
      {Bread, Butter}       3
      {Bread, Butter, Egg}  2
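The support counts in this example can be reproduced with a brute-force Python sketch. This is not Apriori or FP-Growth: it simply enumerates every subset of every transaction, which is only practical for tiny datasets. The function name `frequent_itemsets` is an illustrative choice, and min_sup is treated as "at least 50% of transactions".

```python
from collections import Counter
from itertools import combinations

# Transactions from the slide, keyed by transaction ID.
transactions = {
    2000: {"Bread", "Butter", "Egg"},
    1000: {"Bread", "Butter", "Egg"},
    4000: {"Bread", "Butter", "Tea"},
    5000: {"Butter", "Ice cream", "Cake"},
}


def frequent_itemsets(transactions, min_sup):
    """Count every itemset by brute force and keep those meeting min_sup.

    min_sup is a fraction of the number of transactions (0.5 means 50%).
    """
    min_count = min_sup * len(transactions)
    counts = Counter()
    for items in transactions.values():
        for size in range(1, len(items) + 1):
            for itemset in combinations(sorted(items), size):
                counts[frozenset(itemset)] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


frequent = frequent_itemsets(transactions, 0.5)
for itemset, support in sorted(frequent.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), support)
```

Besides the five itemsets listed on the slide, the enumeration also reports {Bread, Egg} and {Butter, Egg}, each with support 2, since they meet the same threshold.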
  • 17. Association Rule Mining
      ▪ Topics we will cover
          - Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bit-vector).
          - Fault-Tolerant/Approximate Frequent Itemset Mining.
          - N-Most Interesting Frequent Itemset Mining.
          - Closed and Maximal Frequent Itemset Mining.
          - Incremental Frequent Itemset Mining.
          - Sequential Patterns.
      ▪ Projects
          - Mining Fault-Tolerant Frequent Patterns Using Pattern-Growth.
          - Application of Fault-Tolerant Frequent Patterns in Missing Values Imputation (Course Project).
  • 18. Classification and Prediction
      ▪ Finding models (functions) that describe and distinguish classes or concepts for future prediction.
      ▪ Example: classify cities as rainy or not rainy based on the Temperature, Humidity and Windy attributes.
      ▪ The previous business decisions (class labels) must be known (Supervised Learning).

      Training data:

      City        Temperature  Humidity  Windy  Rain
      Lahore      hot          low       false  No
      Islamabad   hot          high      true   Yes
      Islamabad   hot          high      false  Yes
      Multan      mild         low       false  No
      Karachi     cool         normal    false  No
      Rawalpindi  hot          high      true   Yes

      Rule
      • If Temperature = Hot & Humidity = High then Rain = Yes.

      Prediction of unknown records:

      City    Temperature  Humidity  Windy  Rain
      Murree  hot          high      false  ?
      Sibi    mild         low       true   ?
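As a minimal sketch of how the discovered rule is applied to unlabeled records, the Python below encodes the single rule from the slide and runs it on the two unknown cities. The function name `predict_rain` and the choice to return `None` when no rule fires are assumptions made for illustration; the slide does not say how uncovered records are handled.

```python
def predict_rain(temperature, humidity):
    """Apply the slide's rule; return None when the rule does not fire."""
    if temperature == "hot" and humidity == "high":
        return "Yes"
    return None


# Labeled training records from the slide: (city, temperature, humidity, windy, rain)
training = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]

# Unknown records to classify: (city, temperature, humidity, windy)
unknown = [
    ("Murree", "hot", "high", False),
    ("Sibi", "mild", "low", True),
]

predictions = {city: predict_rain(t, h) for city, t, h, _ in unknown}
print(predictions)  # {'Murree': 'Yes', 'Sibi': None}
```

The rule covers the first unknown record but says nothing about the second, so a usable classifier would need further rules or a default class.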
  • 19. Cluster Analysis
      ▪ Group data to form new classes based on unlabeled data.
      ▪ Business decisions (class labels) are unknown (also called Unsupervised Learning).
      ▪ Example: group cities into rainy and non-rainy clusters based on the Temperature, Humidity and Windy attributes.

      City        Temperature  Humidity  Windy  Rain
      Lahore      hot          low       false  ?
      Islamabad   hot          high      true   ?
      Islamabad   hot          high      false  ?
      Multan      mild         low       false  ?
      Karachi     cool         normal    false  ?
      Rawalpindi  hot          high      true   ?

      [Figure: 3 clusters]
  • 20. Outlier Analysis
      ▪ Outlier: a data object that does not comply with the general behavior of the data.
      ▪ It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis.
      [Figure: 2 outliers]
  • 21.
      ▪ A data mining system/query may generate thousands of patterns; not all of them are interesting.
          - Suggested approach: query-based, constraint mining.
      ▪ Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
  • 22.
      ▪ Find all the interesting patterns: Completeness
          - Can a data mining system find all the interesting patterns?
          - Remember that most of the problems in Data Mining are NP-complete.
          - There is no global best solution for any single problem.
      ▪ Search for only interesting patterns: Optimization
          - Can a data mining system find only the interesting patterns?
          - Approaches
              · First generate all the patterns and then filter out the uninteresting ones.
              · Generate only the interesting patterns: constraint-based mining (give threshold factors in mining).
  • 23. Book Chapter
      ▪ Chapter 1 of Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques".
  • 24. Some Nice Resources
      ▪ ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD): http://www.acm.org/sigs/sigkdd/
      ▪ Knowledge Discovery Nuggets: www.kdnuggests.com
      ▪ IEEE Transactions on Knowledge and Data Engineering: http://www.computer.org/tkde/
      ▪ IEEE Transactions on Pattern Analysis and Machine Intelligence: http://www.computer.org/tpami/
      ▪ Data Mining and Knowledge Discovery. Publisher: Springer Science+Business Media B.V., formerly Kluwer Academic Publishers B.V. http://www.kluweronline.com/issn/1384-5810/
      ▪ Current and previous offerings of Data Mining courses at Stanford, CMU, MIT and Helsinki.
  • 25. The course will be mainly based on research literature; the following texts may however be consulted:
      ▪ Jiawei Han and Micheline Kamber. "Data Mining: Concepts and Techniques".
      1. David Hand, Heikki Mannila and Padhraic Smyth. "Principles of Data Mining". Pub. Prentice Hall of India, 2004.
      2. Sushmita Mitra and Tinku Acharya. "Data Mining: Multimedia, Soft Computing and Bioinformatics". Pub. Wiley and Sons Inc., 2003.
      3. Usama M. Fayyad et al. "Advances in Knowledge Discovery and Data Mining". The MIT Press, 1996.
  • 26.
      ▪ Data quality
      ▪ Missing values imputation using Mean, Median and the k-Nearest Neighbor approach
      ▪ Distance Measure
  • 27.
      ▪ Data quality is a major concern in Data Mining and Knowledge Discovery tasks.
      ▪ Why: almost all Data Mining algorithms induce knowledge strictly from data.
      ▪ The quality of the knowledge extracted depends highly on the quality of the data.
      ▪ There are two main problems in data quality:
          - Missing data: the data is not present.
          - Noisy data: the data is present but not correct.
      ▪ Sources of missing/noisy data:
          - Hardware failure.
          - Data transmission error.
          - Data entry problem.
          - Refusal of respondents to answer certain questions.
  • 28.
      Training data (one income value is missing, shown as ?):

      age    income  student  buys_computer
      <=30   high    yes      yes
      <=30   high    no       yes
      >40    medium  yes      no
      >40    medium  no       no
      >40    low     yes      yes
      31…40  ?       no       yes
      31…40  medium  yes      yes

      Data Mining: discover only those rules whose support (frequency) is >= 2.
      • If 'age <= 30' and income = 'high' then buys_computer = 'yes'
      • If 'age > 40' and income = 'medium' then buys_computer = 'no'

      Testing data or actual data:

      age    income  student  buys_computer
      <=30   high    no       ?
      >40    medium  yes      ?
      31…40  medium  yes      ?

      Due to the missing value in the training dataset, the rule for 'age 31…40' and income = 'medium' has support 1 and is not discovered, so only 2 of the 3 test records can be predicted: the accuracy of prediction decreases to 66.7%.
  • 29.
      ▪ Imputation is a term that denotes a procedure that replaces the missing values in a dataset by some plausible values,
      ▪ i.e. by considering the relationships among correlated values of the attributes of the dataset.

      Attribute 1  Attribute 2  Attribute 3  Attribute 4
      20           cool         high         false
      ?            cool         high         true
      20           cool         high         true
      20           mild         low          false
      30           cool         normal       false
      10           mild         high         true

      If we consider only {attribute#2}, the value "cool" appears in 4 records, 3 of which have a known Attribute 1 value (20, 20 and 30).
      Probability of imputing value (20) = 66.7%
      Probability of imputing value (30) = 33.3%
  • 30.
      Attribute 1  Attribute 2  Attribute 3  Attribute 4
      20           cool         high         false
      ?            cool         high         true
      20           cool         high         true
      20           mild         low          false
      30           cool         normal       false
      10           mild         high         true

      For {attribute#4}, the value "true" appears in 3 records, 2 of which have a known Attribute 1 value (20 and 10).
      Probability of imputing value (20) = 50%
      Probability of imputing value (10) = 50%

      For {attribute#2, attribute#3}, the value {"cool", "high"} appears in only 2 records with a known Attribute 1 value (both 20).
      Probability of imputing value (20) = 100%
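The imputation probabilities for this table can be worked out with a small Python sketch. It counts only the records whose Attribute 1 value is known and that agree with the incomplete record on the chosen attributes. The function name `imputation_probabilities` and the column-index convention are illustrative assumptions, not part of the slides.

```python
from collections import Counter

# Rows from the slide: (attribute1, attribute2, attribute3, attribute4).
# None marks the missing Attribute 1 value in the second record.
rows = [
    (20, "cool", "high", False),
    (None, "cool", "high", True),
    (20, "cool", "high", True),
    (20, "mild", "low", False),
    (30, "cool", "normal", False),
    (10, "mild", "high", True),
]


def imputation_probabilities(rows, target, conditions):
    """Relative frequency of each known target value among matching rows.

    conditions maps a column index to the value that column must have.
    Rows whose target value is missing are ignored.
    """
    matches = [
        r[target]
        for r in rows
        if r[target] is not None
        and all(r[col] == val for col, val in conditions.items())
    ]
    counts = Counter(matches)
    return {value: n / len(matches) for value, n in counts.items()}


by_attr2 = imputation_probabilities(rows, 0, {1: "cool"})
by_attr4 = imputation_probabilities(rows, 0, {3: True})
by_attr2_and_3 = imputation_probabilities(rows, 0, {1: "cool", 2: "high"})
print(by_attr2, by_attr4, by_attr2_and_3)
```

With this counting rule, conditioning on "cool" alone gives 2/3 for the value 20 and 1/3 for 30, conditioning on "true" gives one half each for 20 and 10, and conditioning on {"cool", "high"} leaves 20 as the only candidate. Conditioning on more attributes narrows the candidates but leaves fewer records to estimate from.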
  • 31. Missing data randomness is divided into three classes.
      1. Missing completely at random (MCAR): occurs when the probability of an instance (case) having a missing value for an attribute does not depend on either the known attribute values or the missing data attribute.
      2. Missing at random (MAR): occurs when the probability of an instance (case) having a missing value for an attribute depends on the known attribute values, but not on the missing data attribute.
      3. Not missing at random (NMAR): occurs when the probability of an instance having a missing value for an attribute could depend on the value of that attribute.
  • 32.
      ▪ Ignoring and discarding data: there are two main ways to discard data with missing values.
          - Discard all those records which have missing data (also called discard case analysis).
          - Discard only those attributes which have a high level of missing data.
      ▪ Imputation using Mean/Median or Mode: one of the most frequently used methods (statistical technique).
          - Replace missing values of (numeric, continuous) attributes using the mean or median. (The median is robust against noise.)
          - Replace missing values of (discrete) attributes using the mode.
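Mean, median and mode imputation can be sketched with Python's standard `statistics` module. The helper name `impute` and the sample values below are hypothetical and chosen only to show why the median is more robust to a noisy value than the mean.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with strategy(known values)."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]


# Hypothetical numeric attribute with one missing value and one noisy value (950).
incomes = [125, 100, None, 120, 950]
print(impute(incomes, mean))    # [125, 100, 323.75, 120, 950]
print(impute(incomes, median))  # [125, 100, 122.5, 120, 950]

# Hypothetical discrete attribute: use the most frequent value.
refund = ["Yes", None, "No", "No", "No"]
print(impute(refund, mode))     # ['Yes', 'No', 'No', 'No', 'No']
```

The single noisy value pulls the mean far above the typical income, while the median stays close to the other observations.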
  • 33. Replace missing values using a prediction/classification model:
      ▪ Advantage: it considers the relationship among the known attribute values and the missing values, so the imputation accuracy is very high.
      ▪ Disadvantage: if no correlation exists between some missing attribute values and the known attribute values, the imputation cannot be performed.
      ▪ (Alternative approach): use a hybrid combination of a prediction/classification model and Mean/Mode.
          - First try to impute the missing value using the prediction/classification model, and then fall back to Median/Mode.
      ▪ We will study more about this topic in Association Rules Mining.
  • 34.
      ▪ Similarity
          - Numerical measure of how alike two data objects are.
          - Is higher when objects are more alike.
          - Often falls in the range [0,1].
      ▪ Dissimilarity
          - Numerical measure of how different two data objects are.
          - Lower when objects are more alike.
          - Minimum dissimilarity is often 0.
          - Upper limit varies.
      ▪ Proximity refers to a similarity or dissimilarity.
  • 35.
      ▪ Remember that the K nearest neighbors are determined on the basis of some kind of "distance" between points.
      ▪ Two major classes of distance measure:
          1. Euclidean: based on the position of points in some k-dimensional space.
          2. Non-Euclidean: not related to position or space.
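To connect distance measures to missing value imputation, here is a minimal k-nearest-neighbor imputation sketch in Python. It assumes all attributes are numeric, that only the target column has missing values, and that the missing value is filled with the mean of the k nearest complete records under Euclidean distance; the function name `knn_impute` and the sample records are hypothetical.

```python
import math


def knn_impute(records, target, k):
    """Fill missing target values with the mean of the k nearest complete records.

    Distance is Euclidean over all columns except the target column.
    """
    complete = [r for r in records if r[target] is not None]
    if not complete:
        raise ValueError("no complete records to learn from")

    def distance(a, b):
        return math.sqrt(
            sum((a[i] - b[i]) ** 2 for i in range(len(a)) if i != target)
        )

    result = []
    for r in records:
        filled = list(r)
        if r[target] is None:
            neighbours = sorted(complete, key=lambda c: distance(r, c))[:k]
            filled[target] = sum(n[target] for n in neighbours) / len(neighbours)
        result.append(filled)
    return result


# Hypothetical records: (age, income in thousands, monthly spend)
records = [
    (25, 40, 10),
    (27, 42, 12),
    (60, 90, 50),
    (26, 41, None),
]
imputed = knn_impute(records, target=2, k=2)
print(imputed[3])  # [26, 41, 11.0]
```

The incomplete record is closest to the first two records, so its missing value becomes the average of 10 and 12. In practice the attributes would usually be normalized first so that no single attribute dominates the distance.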
  • 36.
      ▪ Applying a distance measure largely depends on the type of input data.
      ▪ Major scales of measurement:
          1. Nominal Data (aka Nominal Scale Variables)
              - Typically classification data, e.g. m/f.
              - No ordering, e.g. it makes no sense to state that M > F.
              - Binary variables are a special case of nominal scale variables.
          2. Ordinal Data (aka Ordinal Scale)
              - Ordered, but differences between values are not important.
              - e.g., political parties on a left to right spectrum given labels 0, 1, 2.
              - e.g., Likert scales: rank your degree of satisfaction on a scale of 1..5.
              - e.g., restaurant ratings.
  • 37.
      ▪ Applying a distance function largely depends on the type of input data.
      ▪ Major scales of measurement:
          3. Numeric Data (aka Interval Scaled)
              - Ordered and equal intervals. Measured on a linear scale.
              - Differences make sense.
              - e.g., temperature (C, F), height, weight, age, date.
  • 38. Only certain operations can be performed on certain scales of measurement.
      ▪ Nominal Scale: 1. Equality, 2. Count.
      ▪ Ordinal Scale: the above, plus 3. Rank (cannot quantify the difference).
      ▪ Interval Scale: the above, plus 4. Quantify the difference.
  • 39. d is a distance measure if it is a function from pairs of points to reals such that:
      1. d(x,x) = 0.
      2. d(x,y) = d(y,x).
      3. d(x,y) > 0 for x ≠ y.
      4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).
  • 40.
      ▪ L2 norm (also called the common or Euclidean distance):
          - The most common notion of "distance."
          - $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$
      ▪ L1 norm (also called the Manhattan distance):
          - The distance if you had to travel along coordinates only.
          - $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
  • 41. x = (5,5), y = (9,8)
      L2-norm: dist(x,y) = $\sqrt{4^2 + 3^2}$ = 5
      L1-norm: dist(x,y) = 4 + 3 = 7
      [Figure: right triangle with legs 4 and 3 and hypotenuse 5]
  • 42. L∞ norm: d(x,y) = the maximum of the differences between x and y in any dimension.
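The three norms can be compared on the points x = (5,5) and y = (9,8) with a short Python sketch; the function names are illustrative.

```python
import math


def l1_distance(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linf_distance(x, y):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x = (5, 5)
y = (9, 8)
print(l1_distance(x, y))    # 7
print(l2_distance(x, y))    # 5.0
print(linf_distance(x, y))  # 4
```

For any pair of points the three values are ordered L∞ ≤ L2 ≤ L1, as they are here (4 ≤ 5 ≤ 7).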