q-Maxim on Data mining and machine learning 
Some intuition about data mining / machine learning in jargon free lucid language 
1 
By 
Jagadish C.A. (Rao) , Founder of q-Maxim 
V 1.4a 13-8-2013
BY READING THIS, ONE CAN GET SOME INTUITION ABOUT WHAT DATA MINING IS ALL ABOUT AND HOW TO APPLY IT IN ONE'S OWN WORK
THIS PRESENTATION GIVES AN OVERVIEW OF DATA MINING & MACHINE LEARNING, THEN GOES ON TO DESCRIBE SOME OF THE ASPECTS IN SOME DETAIL
2
3 
•Overview - what is data mining & machine learning – why, where used
•Types of data mining 
•Data mining Steps - overview 
•Data mining Steps in detail 
•Caution notice, Data mining software, references 
•About q-Maxim & Jagadish C A
What is data mining? 
•Many interpretations about the term 
•“Data mining is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems” – Wikipedia 
•In other words, data mining is the process of knowledge discovery in large databases
4
What is data mining? 
•Process of analyzing data to identify patterns or relationships.
•Data mining involves developing predictive OR descriptive capacity for the dataset of interest
•Unlike querying, reporting, or even OLAP, it can yield information without asking specific questions.
•Usually involves complex algorithms and advanced statistical techniques
See an example of predictive data mining & its terminology in the next slide. Data is generally in the form shown there
5
What is data mining? Example of prediction - predicting house prices
6
Our task is to predict the market price of flats in Bangalore. We have a dataset (sample below) of 1000 flats & their past market prices. Knowing various aspects of a flat, such as area, number of rooms, age, etc., we would like to predict its market value.

Row no | Area [sq. ft.] | Number of rooms | Age of flat [years] | Gym [Y/N] | Swimming pool [Y/N] | Other features | Market price [x 100,000 rupees]
1      | 1800           | 5               | 1.1                 | yes       | yes                 | (not shown)    | 68.6
2      | 900            | 3               | 4                   | no        | no                  | (not shown)    | 34.5
3      | 1720           | 5               | 8                   | yes       | no                  | (not shown)    | 47.7
4      | 560            | 2               | 0.7                 | no        | no                  | (not shown)    | 25.4
...    | ...            | ...             | ...                 | ...       | ...                 | ...            | ...
1000   | 2400           | 6               | 3                   | yes       | yes                 | (not shown)    | 91.8

The market price column is called the target, outcome or output
The columns from area to swimming pool (and the features not shown) are called predictors, inputs or features
Each line of the table is called a record or row
What is data mining? What it is & what it is not – some intuition 
Example 1:
My company has extensive sales data for various locations & time periods. We would like to answer the following business questions.
“What were unit sales in New England last March? What is the trend like? Drill down to Boston”. 
This is not a data mining problem. 
“What’s likely to be Boston unit sales next month? Why?” 
This is a data mining problem. 
Example 2:
I apply for a credit card. The bank compares my income, age, past credit record and assets with the credit card repayment records of thousands of other card holders whose background is similar to mine, to decide whether I am creditworthy or not.
This is a data mining problem. 
7
Machine learning?
•One of the most important applications of data mining is in “Machine Learning”
•Definition: “A computer is able to learn by experience without explicitly being programmed, & improves performance as it learns”
•Based on the field of artificial intelligence
•Examples:
–Mining large datasets of website click-through data to improve purchase conversion rate
–Autonomous self flying helicopter (Stanford University) 
–Voice recognition (Siri in iPhone) 
–Classify e-mail as spam or not spam (Outlook filtering spam) 
–handwriting recognition (tablets) 
–Computer Vision (reading car number plates & giving speeding tickets) 
–Self driven cars (Google self driving car) 
–Recommender systems (Amazon recommending books) 
8
Why data mining? 
•Data deluge: exponential growth of data but too little information. (Data grows 40% yearly, per a McKinsey Global Institute study. In 2012, 2.5 quintillion bytes of data were created every day, per other sources.)
Note: quintillion = 1 followed by 18 zeros
•There is a great need to extract useful information from the data and to interpret it to develop useful knowledge.
9
Why data mining? applications 
Wide ranging applications: 
–Biology –e.g. genome research 
–Health care – e.g. Deciding on treatment for emergency room patients 
–Pharma – e.g. drug discovery 
–Artificial intelligence applications e.g. Self driven car, machine vision 
–Manufacturing 
– engineering 
–Social media analysis 
–Banking, finance 
–Advanced data analysis in Six Sigma 
10
Types of data mining 
1. Classification: the predicted target is a discrete class, such as true/false. Examples:
whether an email is spam or not, whether a financial transaction is fraudulent or not, whether a tumor is malignant or not.
The number of classes can be 2 or more
Note: This is predictive type data mining 
13
Types of data mining 
2. Regression: the predicted target is a continuous value. Example: knowing area (m²), number of rooms (1-5), etc., we predict the market price (US$) of the house
Note: This is predictive type data mining 
14 
[Figure: example of market price (US$) predicted from area of house (m²), with two predictive curves fitted]
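As an illustration of fitting one such predictive curve, here is a minimal Python sketch that fits a straight line by ordinary least squares. It reuses the five sample rows of the flat-price table shown earlier (area in sq. ft., price in units of 100,000 rupees) rather than the m² / US$ data behind the figure. The names fit_line and predict_price are made up for this example.

```python
# Sample rows from the flat-price table earlier in the deck:
# (area in sq. ft., market price in units of 100,000 rupees).
samples = [(1800, 68.6), (900, 34.5), (1720, 47.7), (560, 25.4), (2400, 91.8)]


def fit_line(points):
    """Ordinary least squares fit of price = theta0 + theta1 * area."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    theta1 = sxy / sxx                  # slope: price change per sq. ft.
    theta0 = mean_y - theta1 * mean_x   # intercept
    return theta0, theta1


theta0, theta1 = fit_line(samples)


def predict_price(area):
    """Predicted price for a flat of the given area, using the fitted line."""
    return theta0 + theta1 * area


print(round(predict_price(1500), 1))
```

Five rows are far too few for a real model; the point is only to show what "fitting a curve" means in practice.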
Types of data mining 
3. Clustering: a method of automatically assigning a set of objects into groups based on similarities. Example:
creating customer segments based on income, age, race, location, etc.
Note: This is descriptive type data mining 
15 
[Figure: example showing three clusters found]
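To give some feel for how clusters can be found automatically, below is a minimal sketch of k-means, one common clustering algorithm (the slides do not say which algorithm produced the figure). The customer data (age, income in thousands) is invented for illustration, and the initialisation simply takes the first k points, which real tools do more carefully.

```python
import math


def kmeans(points, k, iterations=20):
    """Very small k-means: returns (centroids, labels)."""
    centroids = list(points[:k])  # naive initialisation: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels


# Hypothetical customers: (age in years, income in thousands).
customers = [(25, 20), (45, 60), (65, 35), (27, 22), (47, 65),
             (63, 33), (23, 18), (44, 58), (66, 38)]

centroids, labels = kmeans(customers, k=3)
print(labels)
```

Each customer ends up with a cluster label; no target column was needed, which is why clustering is called descriptive rather than predictive.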
Types of data mining 
4. Anomaly detection: detecting patterns that do not conform to established normal behavior. Examples:
financial fraud detection, network intrusion attempts, aircraft engine failure prediction based on vibration, monitoring machines in a data center to detect failures before they occur
Note: This is predictive type data mining 
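One simple way to detect an anomaly, sketched below under stated assumptions, is to learn the mean and standard deviation of past "normal" readings and flag any new reading that lies too many standard deviations away. The vibration readings and the threshold of 3 are invented for illustration; real systems typically use richer models.

```python
import statistics


def find_anomalies(history, new_readings, threshold=3.0):
    """Flag readings more than `threshold` standard deviations
    away from the mean of the normal history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # assumes history is not constant
    return [r for r in new_readings if abs(r - mean) / stdev > threshold]


# Hypothetical vibration amplitudes recorded during normal operation.
normal_vibration = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.49]

print(find_anomalies(normal_vibration, [0.50, 0.53, 0.95]))
```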
16
Types of data mining 
5. Association rules: discovering interesting rules between variables. An association algorithm creates rules that describe how often events have occurred together. Example:
“A supermarket chain found that people who buy hotdog sausages also buy tomato ketchup in 99% of cases” = high confidence. “People who buy hotdog buns also buy hangers in 0.005% of cases” = low confidence. (Support, by contrast, is the share of all transactions that contain both items.) Conclusion: keep hotdog sausages & tomato ketchup in adjacent racks, thus increasing the probability of purchase
Note : This presentation covers types #1 & #2 only 
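Two standard measures for association rules are support (the share of all transactions that contain the items) and confidence (among transactions containing the first item, the share that also contain the second). The sketch below computes both on a handful of invented shopping baskets; the function names are chosen for this example only.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"sausages", "ketchup", "buns"},
    {"sausages", "ketchup"},
    {"sausages", "ketchup", "milk"},
    {"milk", "bread"},
    {"buns", "hangers"},
]

print(support(baskets, {"sausages", "ketchup"}))
print(confidence(baskets, {"sausages"}, {"ketchup"}))
print(confidence(baskets, {"buns"}, {"hangers"}))
```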
17
Data mining Steps – overview Predictive data mining phases 
Has two major phases: 
1.Learning phase 
Expose the dataset of past data to a learning algorithm (more on this later) so that it builds a predictive model (i.e. learns). Tune the model until the error between predicted and actual values of the target variable is as low as possible & within acceptable limits.
2.Scoring phase
Use the model to make predictions (i.e. score), in real time or by putting the model into production
See the schematic in the next slide; details of each step are in subsequent slides
21
Data mining – overview
Example - predicting market price of a house using a simple linear learning algorithm
22
[Schematic of the two phases]
Learning phase: Past data of housing market (having features & target) → Sampled training dataset → Learning algorithm → predictive hypothesis h(x)
Scoring phase: Known features (also called predictors): 1. Area of house, 2. Number of rooms, 3. Age of house, 4. Location, 5. Gym [y/n], 6. ... etc. → h(x) → Prediction: market price of house (called target or outcome)
h(x) is a linear equation of the type: hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn
24
Data mining Steps in detail
[Schematic of the data mining process, with numbered steps]
1. Business objectives: identify and define business opportunity
2. Data mining project: selection of target data from many sources
3. Pre-processing, cleaning, exploring → pre-processed data
4. Transformation → transformed data
5. Data mining: train model
6. Interpret / evaluate → knowledge; export in PMML & deploy; model in daily use; evaluate performance
Data mining Steps in detail Predictive data mining steps 
Each of the steps shown in the schematic diagram in the previous slide is explained in some detail in the following slides 
Steps are numbered (such as this: 6) as per the marking in the schematic diagram in the previous slide
25
Data mining Steps in detail selection 
1.Identify and define business opportunity
2.Select data mining project(s)
3.Identify data sources, which could be
1.Spread across many databases
2.External – social media (Facebook, Twitter, news items, blogs)
3.Internal – ERP, CRM, data warehouse, relational technologies, XML databases, MS-Office files, etc.
The dataset might consist of thousands (even millions) of records and hundreds & sometimes thousands of features. For example, a data mining project on census records of the US population would have a dataset of more than 300 million records, as the population of the US is about 300 million
26 
1 
2
Data mining Steps in detail Selection 
•Extract data of interest 
Many techniques may have to be used to extract useful information such as - 
•SQL 
•Roll-up 
•Drill-down 
•Slice and dice 
•Pivot 
27 
1 
2
Data mining Steps in detail Pre-processing –scrub, explore 
•Scrub data 
•Clean data – errors, inconsistent units, etc. E.g.: area of flat might be in m² in some records and in ft² in other records
•Fill in missing data, e.g. some fields might be empty
•Hide identity if necessary e.g. Patient medical records 
•Remove duplicate fields 
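A minimal sketch of such scrubbing is shown below. It converts areas to one unit, fills missing ages with the median age, and drops exact duplicate records. The record layout, the field names and the choice of the median are all assumptions made for this example, and it removes duplicate records rather than duplicate fields.

```python
import statistics

SQFT_PER_M2 = 10.7639  # square feet in one square metre


def scrub(records):
    """Convert areas to sq. ft., fill missing ages with the median age,
    and drop exact duplicate records."""
    cleaned = []
    for rec in records:
        rec = dict(rec)  # work on a copy, leave the input untouched
        if rec["area_unit"] == "m2":
            rec["area"] = round(rec["area"] * SQFT_PER_M2, 1)
            rec["area_unit"] = "sqft"
        cleaned.append(rec)

    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    median_age = statistics.median(known_ages)
    for rec in cleaned:
        if rec["age"] is None:
            rec["age"] = median_age

    unique, seen = [], set()
    for rec in cleaned:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical raw records with mixed units, a gap and a duplicate.
raw = [
    {"area": 1800, "area_unit": "sqft", "age": 1.1},
    {"area": 100, "area_unit": "m2", "age": None},
    {"area": 900, "area_unit": "sqft", "age": 4},
    {"area": 900, "area_unit": "sqft", "age": 4},
]

for row in scrub(raw):
    print(row)
```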
28 
3
Data mining Steps in detail Pre-processing –scrub, explore 
•Explore data by visualisation 
•Visualise the data to get a quick overview. Use some of these graph types:
–Scatter plots, box plots, bar charts, histograms, density plots
–Advanced graphs: heat maps, cluster dendrograms
See next slide for pictures of graph types
29 
3
Data mining Steps in detail pre-processing –visualisation graph types 
30 
3 
[Figures: examples of histograms, box plots, bar plots, density plots, scatter plots, heat maps and a clustering dendrogram]
Data mining Steps in detail transformation 
Sometimes it is necessary to convert features or target variables to a different format. One or more of these may be used: 
–Feature scaling 
•Make sure features are on a similar scale, e.g. convert every feature to a scale between -1 and +1. This makes some data mining programs run faster.
–Mean normalization
•Replace each feature value by (value − mean of that feature) so that every feature has zero mean.
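The two transformations above can be sketched in a few lines of Python. Dividing the mean-normalised values by the range (max − min) is one common way to land within -1 to +1; it is an assumption here, not necessarily the method the author had in mind. The areas are the sample values from the flat-price table.

```python
def mean_normalize(values):
    """Subtract the feature mean so that the feature has zero mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def scale_to_unit_range(values):
    """Mean-normalise, then divide by the range so that every value
    lies between -1 and +1. Assumes the feature is not constant."""
    centred = mean_normalize(values)
    spread = max(values) - min(values)
    return [c / spread for c in centred]


# Areas (sq. ft.) from the sample rows of the flat-price table.
areas = [1800, 900, 1720, 560, 2400]

print(mean_normalize(areas))
print([round(s, 3) for s in scale_to_unit_range(areas)])
```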
31 
4
Data mining Steps in detail transformation (cont.) 
–Combine several features to a single feature (e.g. Convert dimensions of the house to area) 
–Date conversion for doing date arithmetic 
–Generally, if target variable data is skewed, apply one of these functions
•Log, square root, squared, polynomial ... 
32 
4
Data mining Steps in detail train model 
Reaching this stage typically takes as much as 60% of the data mining effort
This step has several sub-steps & is explained in some detail
A schematic of this step is in the next slide, with additional explanation in subsequent slides
33 
5
Data mining Steps in detail train model 
34 
5 
[Schematic of the train model step]
Pre-processed, transformed dataset → Sample → Sampled data
Sampled data → Split data into 1. Training, 2. Validation, 3. Test datasets, typically in the ratio 70:15:15
Training & validation datasets → Build predictive model using one or more learning algorithms (listed below) → Predictive model
Predictive model → Measure the prediction performance of the model on the validation dataset using error rate. Tune model as necessary → Tuned predictive model
Typical learning algorithms:
1.Linear regression 
2.Polynomial regression 
3.Logistic regression 
4.Neural network 
5.Support vector machine 
6.Random forest
Data mining Steps in detail train model -sample & split dataset 
–Cleaned & transformed data is sampled, as the original dataset may be very large
–Sampled data is split into three subsets, typically in a 70:15:15 ratio:
–Training dataset
–Validation dataset
–Test dataset
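A 70:15:15 split can be sketched as follows: shuffle the rows, then cut the list in two places. The function name and the fixed seed (used only to make the shuffle repeatable) are choices made for this example.

```python
import random


def split_dataset(rows, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle the rows, then cut them into training, validation
    and test subsets according to `ratios`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n_train = round(len(shuffled) * ratios[0])
    n_valid = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains
    return train, valid, test


# Row numbers stand in for the 1000 sampled records.
train, valid, test = split_dataset(range(1000))
print(len(train), len(valid), len(test))
```

Shuffling first matters: if the records are sorted (say by price), cutting without shuffling would give three subsets that do not resemble each other.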
35 
5
Data mining Steps in detail train model -sample & split dataset (cont.) 
–Only the training and validation datasets are used to build the model. The model is built on the training dataset & its predictive performance is repeatedly tested on the validation dataset.
–Goodness of the model so built is evaluated on the test dataset
36 
5
Data mining Steps in detail train model -build model 
–Depending on application, one or more of the learning algorithms is used to build predictive models. 
–Each learning algorithm is based on different principles 
–Most common algorithms are: 
1.Linear regression 
2.Polynomial regression 
3.Logistic regression 
4.Neural networks 
5.Support vector machine (SVM) 
6.Random forest 
37 
5
Data mining Steps in detail train model -build model 
-Each learning algorithm has its own parameters for improving its performance, called tuning parameters
-Most data mining programs have libraries for doing this
A brief explanation of the learning algorithms follows in the next few slides
38 
5
Data mining Steps in detail train model -build model 
Learning algorithms : 
1.Linear regression 
Simplest of the lot; assumes a linear relationship between features and target. The hypothesis of a model with n features would look like this: hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn
2.Polynomial regression
Assumes a polynomial relationship between features and target. A typical hypothesis of a polynomial model in a single feature x would look like this: hθ(x) = θ0 + θ1x² + θ2x³ + θ3x⁴
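In code, evaluating a linear hypothesis is just a weighted sum. The sketch below also shows a common way to view polynomial regression: build powers of a feature and feed them to the same linear hypothesis. The θ values are arbitrary numbers for illustration, and the helper polynomial_features (which includes the first power of x) is an assumption of this example rather than the exact form on the slide.

```python
def linear_hypothesis(theta, x):
    """h(x) = theta[0] + theta[1]*x[0] + ... + theta[n]*x[n-1]."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def polynomial_features(x, degree):
    """Turn one feature x into [x, x**2, ..., x**degree]."""
    return [x ** power for power in range(1, degree + 1)]


# Linear model with two features and arbitrary illustrative parameters.
print(linear_hypothesis([1.0, 2.0, 3.0], [10, 100]))

# Polynomial model of degree 2 in one feature, reusing the linear form.
print(linear_hypothesis([1.0, 0.0, 2.0], polynomial_features(3, 2)))
```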
39 
5
Data mining Steps in detail train model -build model 
Learning algorithms : 
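3. Logistic regression
Used when the target is of classification type, e.g. e-mail spam or not spam, tumour malignant or benign. A typical hypothesis for a model with 4 features would look like this:
hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4), where g is the logistic (sigmoid) function, which maps any value into the range 0 to 1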
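A minimal sketch of this hypothesis in Python follows, taking g to be the logistic (sigmoid) function, as is standard for logistic regression. The θ values, the example inputs and the 0.5 decision threshold are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_hypothesis(theta, x):
    """h(x) = g(theta[0] + theta[1]*x[0] + ... + theta[n]*x[n-1])."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)


def classify(theta, x, threshold=0.5):
    """Predict the positive class when h(x) reaches the threshold."""
    return logistic_hypothesis(theta, x) >= threshold


# Arbitrary illustrative parameters for a model with 4 features.
theta = [-1.0, 2.0, 0.5, -0.5, 1.0]

print(logistic_hypothesis(theta, [1, 0, 0, 0]))
print(classify(theta, [1, 0, 0, 0]), classify(theta, [0, 0, 0, 0]))
```

The output of h(x) can be read as the estimated probability of the positive class, e.g. the probability that an e-mail is spam.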
40 
5
Data mining Steps in detail train model -build model 
Learning algorithms (advanced) : 
4. Neural networks 
Can handle classification & regression target types. A machine learning algorithm that can handle non-linear & complicated hypotheses. Resembles the functioning of neurons in the human brain. Though its working is not easy to understand, it can produce very good predictions.
5. Support vector machine (SVM)
Can handle classification & regression target types. A machine learning algorithm that can handle non-linear & complicated hypotheses.
41 
5
Data mining Steps in detail train model -build model 
Learning algorithms (advanced) : 
6. Random forest (decision trees)
Can handle classification & regression target types. An ensemble learning method that operates by constructing a multitude of decision trees (see next slide for an example of a single tree). Each tree is built by recursive partitioning. Can handle non-linear & complicated hypotheses. Can also list the relative importance of features.
42 
5
Data mining Steps in detail train model -build model-decision tree example 
43 
5 
Decision tree of the probability of a person surviving the Titanic sinking
[Figure: a tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf. Source: Wikipedia]
Data mining Steps in detail Interpret / evaluate performance 
–Predictive ability of the model so built is evaluated by applying it to unseen data, i.e. the test dataset (also called scoring)
–Predictive ability is quantified by error measures
–Error measures are different for regression and classification problems
44 
6
Data mining Steps in detail Interpret / evaluate performance (cont.) 
–Common Error measures for regression 
•Adjusted R², AIC, BIC
•Mean squared error (MSE) and root-mean-square error (RMSE): these quantify the difference between the values predicted by the model and the true values of the quantity being predicted.
–Common Error measures for classification 
•Precision, recall, F1 score, accuracy 
•Lift, Area under ROC (receiver operating characteristic curve) 
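Some of these measures are easy to compute by hand. The sketch below implements MSE and RMSE for regression, and precision, recall, F1 score and accuracy for a two-class problem. The actual and predicted labels are invented for illustration, and the code assumes at least one positive case is both present and predicted, so that no division by zero occurs.

```python
import math


def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-mean-square error: same units as the target variable."""
    return math.sqrt(mse(actual, predicted))


def classification_measures(actual, predicted):
    """Precision, recall, F1 and accuracy for labels coded as 1 / 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


print(rmse([3.0, 5.0], [1.0, 5.0]))

# Hypothetical spam labels: 1 = spam, 0 = not spam.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_measures(actual, predicted))
```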
45 
6
Data mining Steps in detail Interpret / evaluate performance 
–These error measures are used as a basis for 
–Confirming performance of the model 
–Comparing performance of different algorithms 
–Sometimes a model fits the training & validation sets very well but is unable to generalise to new samples; this is an overfit (called high variance). A model that fits poorly even on the training set is an underfit (called high bias).
46 
6
Data mining Steps in detail Interpret / evaluate performance 
•The performance of the model is not always at the desired level. One or more of the following measures could be tried to improve it:
–Increase training samples
–Increase number of features
–Decrease number of features
–Add polynomial features (e.g. hθ(x) = θ0 + θ1x² + θ2x³ + θ3x⁴)
–Improve the model by tuning the learning algorithm. Each algorithm has tuning parameters (e.g. for the SVM learning algorithm they are cost, gamma and epsilon)
47 
6
Data mining Steps in detail deploying model 
–Models are deployed for routine use & data can be scored in real time
–Before deploying, the model is often exported to the open-standard PMML format
–PMML (Predictive Model Markup Language) provides a standard way to represent data mining models. It allows for the interchange of models among different tools and environments
–Companies like Zementis provide PMML-based scoring engines for many platforms
48 
6
Data mining caution notice 
–One must carefully distinguish between correlation and causation 
–The fact that a data mining study indicates a high level of model performance does not necessarily imply causation.
–It is possible to get good correlation by fitting a model to just noise, not signal
–Healthy scepticism is desirable. Before concluding anything about causation, facts have to be verified.
50 
6
Data mining software
51
Several data mining packages exist which make the data mining task relatively painless. Some of the prominent open source ones are:
1.R
2.Rattle – R with graphical interface
3.Octave
Data mining software
52
Some of the prominent commercial ones are:
1.Revolution Analytics – enhanced R
2.Minitab (some data mining aspects)
3.Microsoft Office data mining add-ins
4.IBM SPSS, SAS, Statistica
5....... & many more
Data mining references 
1.Data mining - Wikipedia, the free encyclopedia 
2.Big data: The next frontier for innovation, competition, and productivity –McKinsey global publication 
3.Machine learning- Wikipedia 
4.A. Guazzelli, M. Zeller, W. Chen, and G. Williams. PMML: An Open Standard for Sharing Models. The R Journal, Volume 1/1, May 2009. 
5.Data analysis and machine learning online courses in Coursera 
6.R: A programming language and software environment for statistical computing, data mining, and graphics. Numerous other R resources on the web 
7.Rattle: A Data Mining GUI for R - WILLIAMS - The R Journal 
8.Support vector machine (SVM) –Wikipedia 
9.Neural network software - Wikipedia, the free encyclopedia 
10.Random forest - Wikipedia, the free encyclopedia 
11.Publications / websites of commercial data mining software companies listed in previous slide 
12.Jagadish’s notes based on his past experience 
53
CONTACT US FOR DETAILS (DETAILS NEXT SLIDE) 
QUESTIONS? 
DOUBTS? 
WHAT NEXT? 
WOULD YOU LIKE TO DISCUSS FURTHER TO EXPLORE DATA MINING/MACHINE LEARNING ? 
55
About Us:
q-Maxim is a niche consultancy focussed on advanced problem solving, quality, optimization and Japanese quality methodologies.
Contact: q-Maxim, Jagadish C.A. (Rao), Founder, President
jagadish.chandra@qmaxim.com
+91 9538328704
+91 80 2693 1804
LinkedIn: http://in.linkedin.com/in/jagdishca/
Blog: qmaxim.wordpress.com
Note: Contents of this presentation, concepts, data and style are proprietary in nature & are subject to intellectual property restrictions

More Related Content

What's hot

Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
Saif Ullah
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
butest
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
Phi Jack
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
Thanveen
 
Data mining by_ashok
Data mining by_ashokData mining by_ashok
Data mining by_ashok
Ashok Kumar
 

What's hot (20)

Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data Mining
Data MiningData Mining
Data Mining
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques
 
Introduction data mining
Introduction data miningIntroduction data mining
Introduction data mining
 
Introduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in DatabaseIntroduction-to-Knowledge Discovery in Database
Introduction-to-Knowledge Discovery in Database
 
Dwdm
DwdmDwdm
Dwdm
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
 
Data mining
Data miningData mining
Data mining
 
Data Mining
Data MiningData Mining
Data Mining
 
Basic Overview of Data Mining
Basic Overview of Data MiningBasic Overview of Data Mining
Basic Overview of Data Mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining by_ashok
Data mining by_ashokData mining by_ashok
Data mining by_ashok
 
Data mining
Data mining Data mining
Data mining
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 

Viewers also liked

Data Mining in Life Insurance Business
Data Mining in Life Insurance BusinessData Mining in Life Insurance Business
Data Mining in Life Insurance Business
Ankur Khanna
 
Introduction to Data Mining for Newbies
Introduction to Data Mining for NewbiesIntroduction to Data Mining for Newbies
Introduction to Data Mining for Newbies
Eunjeong (Lucy) Park
 
12 งานนำสนอ cluster analysis
12 งานนำสนอ cluster analysis12 งานนำสนอ cluster analysis
12 งานนำสนอ cluster analysis
khuwawa2513
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
smj
 

Viewers also liked (20)

Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Clustering Medical Data to Predict the Likelihood of Diseases
Clustering Medical Data to Predict the Likelihood of DiseasesClustering Medical Data to Predict the Likelihood of Diseases
Clustering Medical Data to Predict the Likelihood of Diseases
 
Data Mining in Life Insurance Business
Data Mining in Life Insurance BusinessData Mining in Life Insurance Business
Data Mining in Life Insurance Business
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
Lecture - Data Mining
 
Crisp dm
Crisp dmCrisp dm
Crisp dm
 
Preprocessing of Academic Data for Mining Association Rule, Presentation @WAD...
Preprocessing of Academic Data for Mining Association Rule, Presentation @WAD...Preprocessing of Academic Data for Mining Association Rule, Presentation @WAD...
Preprocessing of Academic Data for Mining Association Rule, Presentation @WAD...
 
Introduction to Data Mining for Newbies
Introduction to Data Mining for NewbiesIntroduction to Data Mining for Newbies
Introduction to Data Mining for Newbies
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
 
cluster analysis
cluster analysis cluster analysis
cluster analysis
 
12 งานนำสนอ cluster analysis
12 งานนำสนอ cluster analysis12 งานนำสนอ cluster analysis
12 งานนำสนอ cluster analysis
 
CRISP-DM: a data science project methodology
CRISP-DM: a data science project methodologyCRISP-DM: a data science project methodology
CRISP-DM: a data science project methodology
 
Introduction to predictive modeling v1
Introduction to predictive modeling v1Introduction to predictive modeling v1
Introduction to predictive modeling v1
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data mining
Data miningData mining
Data mining
 
Forest resource
Forest resourceForest resource
Forest resource
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web Data
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 

Similar to Data mining and Machine learning expained in jargon free & lucid language

lec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptxlec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptx
AmjadAlDgour
 
Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)
Bikramjit Sarkar, Ph.D.
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and
butest
 
Data Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business DatabasesData Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business Databases
butest
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
Subrat Swain
 
ch1vsat2k_BDA_Introduction11Jan17-converted.pptx
ch1vsat2k_BDA_Introduction11Jan17-converted.pptxch1vsat2k_BDA_Introduction11Jan17-converted.pptx
ch1vsat2k_BDA_Introduction11Jan17-converted.pptx
Mrityunjay Emmi
 
A Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationA Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence Application
Kate Subramanian
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Young Alista
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Harry Potter
 

Similar to Data mining and Machine learning expained in jargon free & lucid language (20)

Data mining
Data miningData mining
Data mining
 
lec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptxlec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptx
 
IRJET- Fault Detection and Prediction of Failure using Vibration Analysis
IRJET-	 Fault Detection and Prediction of Failure using Vibration AnalysisIRJET-	 Fault Detection and Prediction of Failure using Vibration Analysis


Data mining and machine learning explained in jargon-free & lucid language

  • 1. q-Maxim on data mining and machine learning: some intuition about data mining / machine learning in jargon-free, lucid language. By Jagadish C.A. (Rao), Founder of q-Maxim. V 1.4a, 13-8-2013
  • 2. By reading this presentation, one can get some intuition about what data mining is all about and how to apply it in one's own work. The presentation gives an overview of data mining & machine learning, then goes on to describe some of the aspects in some detail.
  • 3. • Overview: what is data mining & machine learning, why and where it is used • Types of data mining • Data mining steps: overview • Data mining steps in detail • Caution notice, data mining software, references • About q-Maxim & Jagadish C A
  • 4. What is data mining? • There are many interpretations of the term. • “Data mining is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems” (Wikipedia). • In other words, data mining is the process of knowledge discovery in large databases.
  • 5. What is data mining? • The process of analyzing data to identify patterns or relationships. • Data mining involves developing predictive OR descriptive capacity for the dataset of interest. • Unlike querying, reporting, or even OLAP, it can yield information without specific questions being asked. • It usually involves complex algorithms and advanced statistical techniques. See an example of predictive data mining & its terminology in the next slide; data is generally in the form shown there.
  • 6. What is data mining? Example of prediction: predicting house prices. Our task is to predict the market price of flats in Bangalore. We have a dataset of 1000 flats with their past market prices (sample rows below). Knowing various aspects of a flat, such as area, number of rooms, age, etc., we would like to predict its market value.
      Row no | Area [sq. ft.] | Number of rooms | Age of flat [years] | Gym [Y/N] | Swimming pool [Y/N] | (other features not shown) | Market price [x 100000 Rupees]
      1 | 1800 | 5 | 1.1 | yes | yes | ... | 68.6
      2 | 900 | 3 | 4 | no | no | ... | 34.5
      3 | 1720 | 5 | 8 | yes | no | ... | 47.7
      4 | 560 | 2 | 0.7 | no | no | ... | 25.4
      ...
      1000 | 2400 | 6 | 3 | yes | yes | ... | 91.8
    The market price column is called the target (or outcome, or output). The other columns are called predictors (or inputs, or features). Each row is a record.
  • 7. What is data mining? What it is & what it is not: some intuition. Example 1: My company has extensive sales data for various locations & time periods, and we would like to answer the following business questions. “What were unit sales in New England last March? What is the trend like? Drill down to Boston.” This is not a data mining problem. “What are Boston unit sales likely to be next month? Why?” This is a data mining problem. Example 2: I apply for a credit card. The bank checks my income, age, past credit record and assets against the credit card repayment records of thousands of other card holders with a background similar to mine, to decide whether I am creditworthy or not. This is a data mining problem.
  • 8. Machine learning? • One of the most important applications of data mining is in “machine learning”. • Definition: “A computer is able to learn by experience without explicitly being programmed, & improves performance as it learns”. • Based on the field of artificial intelligence. • Examples: – Mining large datasets of website click-through data to improve purchase conversion rate – Autonomous self-flying helicopter (Stanford University) – Voice recognition (Siri on the iPhone) – Classifying e-mail as spam or not spam (Outlook spam filtering) – Handwriting recognition (tablets) – Computer vision (reading car number plates & issuing speeding tickets) – Self-driving cars (Google self-driving car) – Recommender systems (Amazon recommending books)
  • 9. Why data mining? • Data deluge: data is growing exponentially (40% yearly growth according to a McKinsey Global Institute study; in 2012, 2.5 quintillion bytes of data were created every day according to other sources), yet there is too little information. Note: quintillion = 1 followed by 18 zeros. • There is a great need to extract useful information from the data and to interpret the data to develop useful knowledge.
  • 10. Why data mining? Applications. Wide-ranging applications: – Biology, e.g. genome research – Health care, e.g. deciding on treatment for emergency room patients – Pharma, e.g. drug discovery – Artificial intelligence applications, e.g. self-driving cars, machine vision – Manufacturing, engineering – Social media analysis – Banking, finance – Advanced data analysis in Six Sigma
  • 11. • Overview: what is data mining & machine learning, why and where it is used • Types of data mining • Data mining steps: overview • Data mining steps in detail • Caution notice, data mining software, references • About q-Maxim & Jagadish C A
  • 12. Types of data mining. 1. Classification: the predicted target is a discrete class, such as true/false. Examples: whether an email is spam or not, whether a financial transaction is fraudulent or not, whether a tumor is malignant or not. The number of classes can be 2 or more. Note: this is predictive-type data mining.
  • 13. Types of data mining. 2. Regression: the predicted target is a continuous value. Example: knowing the area (m2), number of rooms (1-5), etc., we predict the market price (US$) of a house. Note: this is predictive-type data mining. [Figure: market price prediction based on area, with two predictive curves fitted; x-axis: area of house (m2), y-axis: market price (US$)]
  • 14. Types of data mining. 3. Clustering: a method of automatically assigning a set of objects to groups based on their similarities. Example: creating a customer segmentation based on income, age, race, location, etc. Note: this is descriptive-type data mining. [Figure: example with three clusters found]
  • 15. Types of data mining. 4. Anomaly detection: detecting patterns that do not conform to an established normal behavior. Examples: financial fraud detection, detecting network intrusion attempts, predicting aircraft engine failure based on vibration, monitoring machines in a data center to detect failures before they occur. Note: this is predictive-type data mining.
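One simple way to make "does not conform to an established normal behavior" concrete is a z-score check. The following is only a minimal sketch, not a method from the slides: the baseline vibration readings, the function names and the threshold of 3 standard deviations are all illustrative assumptions.

```python
from statistics import mean, stdev

def fit_baseline(normal_readings):
    """Summarise known-normal readings by their mean and standard deviation."""
    return mean(normal_readings), stdev(normal_readings)

def is_anomaly(reading, mu, sigma, threshold=3.0):
    """Flag a reading that lies more than `threshold` standard deviations from the mean."""
    if sigma == 0:
        return reading != mu
    return abs(reading - mu) / sigma > threshold

# Hypothetical vibration amplitudes recorded while a machine was healthy.
baseline = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.49, 0.50, 0.52]
mu, sigma = fit_baseline(baseline)

for reading in (0.50, 0.53, 1.90):
    print(reading, "anomaly" if is_anomaly(reading, mu, sigma) else "normal")
```

Real anomaly detection usually looks at many features at once; the single-feature case is shown only to convey the idea.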
  • 16. Types of data mining. 5. Association rules: discovering interesting rules between variables. An association algorithm creates rules that describe how often events have occurred together. Example: “A supermarket chain found that people who buy hotdog sausages also buy tomato ketchup in 99% of cases” = high confidence; “People who buy hotdog buns buy hangers in 0.005% of cases” = low confidence. Conclusion: keep hotdog sausages & tomato ketchup in adjacent racks, thus increasing the probability of purchase. Note: this presentation covers types #1 & #2 only.
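Association rules are usually scored with two numbers: support (the share of all baskets that contain both items) and confidence (the share of baskets containing the first item that also contain the second). A statement of the form "people who buy A also buy B in X% of cases" corresponds to confidence. Below is a minimal sketch with made-up baskets; the data and function names are illustrative assumptions.

```python
def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= set(b)) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for b in baskets if a <= set(b))
    if n_a == 0:
        return 0.0
    n_both = sum(1 for b in baskets if both <= set(b))
    return n_both / n_a

# Hypothetical shopping baskets.
baskets = [
    {"sausages", "ketchup", "buns"},
    {"sausages", "ketchup"},
    {"buns", "hangers"},
    {"sausages", "ketchup", "milk"},
    {"milk"},
]

print(support(baskets, {"sausages", "ketchup"}))       # share of all baskets with both
print(confidence(baskets, {"sausages"}, {"ketchup"}))  # sausages -> ketchup
print(confidence(baskets, {"buns"}, {"hangers"}))      # buns -> hangers
```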
  • 17. • Overview: what is data mining & machine learning, why and where it is used • Types of data mining • Data mining steps: overview • Data mining steps in detail • Data mining software, references • About q-Maxim & Jagadish C A
  • 18. Data mining steps: overview. Predictive data mining has two major phases: 1. Learning phase: expose the dataset of past data to a learning algorithm (more on this later) so that it builds a predictive model (i.e. learns). Tune the model until the error between predicted and actual values of the target variable is as low as possible & within acceptable limits. 2. Scoring phase: use the model to make predictions (to score) in real time, i.e. put the model into production. See the schematic in the next slide; details about each of the steps follow in subsequent slides.
  • 19. Data mining overview example: predicting the market price of a house using a simple linear learning algorithm. [Schematic] Learning phase: past data of the housing market, containing features & target, is sampled into a training dataset and passed to the learning algorithm, which produces a predictive hypothesis h(x). Scoring phase: the known values (1. area of house, 2. number of rooms, 3. age of house, 4. location, 5. gym [y/n], 6. etc.), called features or predictors, are fed to h(x), which predicts the market price of the house, called the target or outcome. h(x) is a linear equation of the type: hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn
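To illustrate the two phases, here is a minimal sketch that fits hθ(x) = θ0 + θ1x1 by ordinary least squares and then scores a new flat. It is a simplification under stated assumptions: it uses only the area feature and only the five sample rows shown in the house price table (area in sq. ft., price in units of 100000 Rupees), and the function names are illustrative.

```python
def fit_simple_linear(xs, ys):
    """Learning phase: closed-form least squares for h(x) = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

def h(x, theta0, theta1):
    """Scoring phase: apply the learned hypothesis to a new feature value."""
    return theta0 + theta1 * x

# Sample rows from the house price example (one feature only).
areas = [1800, 900, 1720, 560, 2400]
prices = [68.6, 34.5, 47.7, 25.4, 91.8]

theta0, theta1 = fit_simple_linear(areas, prices)
print("theta0 =", theta0, "theta1 =", theta1)
print("predicted price for 1500 sq. ft.:", h(1500, theta0, theta1))
```

With several features the same idea applies, but the parameters are normally found with a library routine rather than by hand.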
  • 20. • Overview: what is data mining & machine learning, why and where it is used • Types of data mining • Data mining steps: overview • Data mining steps in detail • Data mining software, references • About q-Maxim & Jagadish C A
  • 21. Data mining steps in detail. [Schematic of a data mining project, steps numbered 1 to 6] Business objectives (identify and define business opportunity) → data from many sources → selection (steps 1 & 2) → target data → pre-processing: clean, explore (step 3) → pre-processed data → transformation (step 4) → transformed data → data mining: train model (step 5) → interpret / evaluate (step 6) → knowledge → export in PMML & deploy → model in daily use; evaluate performance
  • 22. Data mining steps in detail: predictive data mining steps. Each of the steps shown in the schematic diagram in the previous slide is explained in some detail in the following slides. Steps are numbered (such as this: 6) as per the marking in the schematic diagram in the previous slide.
  • 23. Data mining steps in detail: selection (steps 1 & 2). 1. Identify and define the business opportunity. 2. Select data mining project(s). 3. Identify data sources, which could be: – spread across many databases – external: social media (Facebook, Twitter, news items, blogs) – internal: ERP, CRM, data warehouse, relational technologies, XML databases, MS-Office files, etc. A dataset might consist of thousands (even millions) of records and hundreds, sometimes thousands, of features. For example, for a data mining project on census records of US citizens, the dataset will have > 300 million records, as the population of the US is about 300 million.
  • 24. Data mining steps in detail: selection (steps 1 & 2). • Extract data of interest. Many techniques may have to be used to extract useful information, such as: • SQL • Roll-up • Drill-down • Slice and dice • Pivot
  • 25. Data mining steps in detail: pre-processing, i.e. scrub & explore (step 3). • Scrub data: • Clean data: errors, inconsistent units, etc. E.g. the area of a flat might be in m2 in some records and in ft2 in others • Fill in missing data, e.g. some fields might be empty • Hide identity if necessary, e.g. in patient medical records • Remove duplicate fields
  • 26. Data mining steps in detail: pre-processing, i.e. scrub & explore (step 3). • Explore data by visualisation: • Visualise the data to get a quick overview, using some of these graph types: – Scatter plots, box plots, bar charts, histograms, density plots – Advanced graphs: heat maps, cluster dendrograms. See the next slide for pictures of the graph types.
  • 27. Data mining steps in detail: pre-processing, visualisation graph types (step 3). [Figure: examples of histograms, box plots, bar plots, density plots, scatter plots, heat maps and a clustering dendrogram]
  • 28. Data mining steps in detail: transformation (step 4). Sometimes it is necessary to convert features or target variables to a different format. One or more of these may be used: – Feature scaling • Make sure features are on a similar scale, e.g. convert every feature to a scale between -1 and +1. This makes some data mining programs run faster. – Mean normalization • Replace each feature value by (value - mean of that feature over the dataset), so that features have zero mean.
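Feature scaling and mean normalization are often combined in one step: subtract the feature's mean and divide by its range. A minimal sketch follows; dividing by (max - min) is one common choice and is an assumption here, as the slides do not fix a formula, and the function name is illustrative.

```python
def mean_normalize(values):
    """Return values shifted to zero mean and scaled by their range (max - min)."""
    mu = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / spread for v in values]

# Areas in sq. ft. from the house price example.
areas = [1800, 900, 1720, 560, 2400]
scaled = mean_normalize(areas)
print(scaled)  # every value now lies between -1 and +1 and the mean is zero
```

The same mean and range must be stored and reused when new records are scored later, so that training data and new data are on the same scale.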
  • 29. Data mining steps in detail: transformation, cont. (step 4). – Combine several features into a single feature (e.g. convert the dimensions of a house to its area) – Date conversion for doing date arithmetic – Generally, if the target variable data is skewed, apply one of these functions: • log, square root, square, polynomial ...
  • 30. Data mining steps in detail: train model (step 5). Reaching this stage typically accounts for as much as 60% of the data mining effort. This step has several sub-steps & is explained in some detail. A schematic picture of this step is in the next slide; additional explanation follows in subsequent slides.
  • 31. Data mining steps in detail: train model (step 5). [Schematic] Pre-processed, transformed dataset → sampled data → split data into 1. training, 2. validation, 3. test datasets, typically in the ratio 70:15:15 → build predictive model on training & validation datasets using one or more learning algorithms → predictive model → measure the prediction performance of the model on the validation dataset using error rate; tune model as necessary → tuned predictive model. Typical learning algorithms: 1. Linear regression 2. Polynomial regression 3. Logistic regression 4. Neural network 5. Support vector machine 6. Random forest
  • 32. Data mining steps in detail: train model, sample & split dataset (step 5). – Cleaned & transformed data is sampled, as the original dataset may be very large – Sampled data is split into three subsets, typically in a 70:15:15 ratio: – Training – Validation – Test datasets
  • 33. Data mining steps in detail: train model, sample & split dataset, cont. (step 5). – Only the training and validation datasets are used to build the model. The model is built on the training dataset & its predictive performance is repeatedly tested on the validation dataset. – The goodness of the model so built is evaluated on the test dataset.
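The 70:15:15 split can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name, the fixed random seed and the use of integers as stand-in records are assumptions. Records are shuffled first so that each subset is a random sample.

```python
import random

def split_dataset(records, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle records and split them into training, validation and test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test

# Stand-in for 1000 sampled records (e.g. row numbers of flats).
records = list(range(1, 1001))
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))
```

The test subset is set aside and used only once, at the end, so that it remains unseen data for the final evaluation.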
  • 34. Data mining steps in detail: train model, build model (step 5). – Depending on the application, one or more learning algorithms are used to build predictive models. – Each learning algorithm is based on different principles. – The most common algorithms are: 1. Linear regression 2. Polynomial regression 3. Logistic regression 4. Neural networks 5. Support vector machine (SVM) 6. Random forest
  • 35. Data mining steps in detail: train model, build model (step 5). – Each learning algorithm has its own parameters for improving its performance, called tuning parameters. – Most data mining programs have libraries for doing this. A brief explanation of the learning algorithms follows in the next few slides.
  • 36. Data mining steps in detail: train model, build model (step 5). Learning algorithms: 1. Linear regression: the simplest of the lot; assumes a linear relationship between features and target. The hypothesis of a model with n features would look like this: hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn 2. Polynomial regression: assumes a polynomial relationship between features and target. A typical hypothesis of a polynomial model would look like this: hθ(x) = θ0 + θ1x^2 + θ2x^3 + θ3x^4
  • 37. Data mining steps in detail: train model, build model (step 5). Learning algorithms: 3. Logistic regression: used when the target is of classification type, e.g. e-mail spam or not spam, tumour malignant or benign. A typical hypothesis for a model with 4 features would look like this: hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4), where g is the sigmoid (logistic) function.
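In logistic regression, g is the sigmoid function g(z) = 1 / (1 + e^(-z)), which squeezes any number into the range 0 to 1 so that the output can be read as the probability of the positive class. A minimal sketch of the 4-feature hypothesis follows; the θ values, the feature values and the 0.5 decision threshold are made-up assumptions for illustration, not a fitted model.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h(x) = g(theta0 + theta1*x1 + ... + thetan*xn); theta has one more entry than x."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

def classify(theta, x, threshold=0.5):
    """Predict class 1 when the estimated probability reaches the threshold, else class 0."""
    return 1 if hypothesis(theta, x) >= threshold else 0

# Hypothetical parameters for a 4-feature spam model (not learned from real data).
theta = [-1.0, 2.0, 0.5, -0.3, 0.0]
email_features = [1.0, 0.0, 0.0, 3.0]

print(hypothesis(theta, email_features))  # estimated probability of class 1
print(classify(theta, email_features))    # predicted class
```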
  • 38. Data mining steps in detail: train model, build model (step 5). Learning algorithms (advanced): 4. Neural networks: can handle categorical & regression target types. A machine learning type of algorithm. Can handle non-linear & complicated hypotheses. Resemble the functioning of neurons in the human brain. Though their working is not easy to understand, they can produce very good predictions. 5. Support vector machine (SVM): can handle categorical & regression target types. A machine learning type of algorithm. Can handle non-linear & complicated hypotheses.
  • 39. Data mining steps in detail: train model, build model (step 5). Learning algorithms (advanced): 6. Random forest (decision trees): can handle categorical & regression target types. An ensemble learning method that operates by constructing a multitude of decision trees (see the next slide for an example). A recursive partitioning method of the machine learning type. Can handle non-linear & complicated hypotheses. Can also list the relative importance of features.
  • 40. Data mining steps in detail: train model, build model, decision tree example (step 5). [Figure: decision tree of the possibility of a person surviving the sinking of the Titanic] A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf. Source: Wikipedia
  • 41. Data mining steps in detail: interpret / evaluate performance (step 6). – The predictive ability of the model so built is evaluated by applying it to unseen data, i.e. the test dataset (also called scoring). – Predictive ability is measured by error measures. – Error measures are different for regression and classification problems.
  • 42. Data mining steps in detail: interpret / evaluate performance, cont. (step 6). – Common error measures for regression: • Adjusted R2, AIC, BIC • Root-mean-square error (RMSE) and mean squared error (MSE): each is one of many ways to quantify the difference between the values implied by an estimator and the true values of the quantity being estimated. – Common error measures for classification: • Precision, recall, F1 score, accuracy • Lift, area under the ROC (receiver operating characteristic) curve
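A few of these measures are simple enough to compute by hand. The sketch below shows RMSE for a regression problem and precision, recall and F1 score for a two-class problem; the function names and the small example lists are illustrative assumptions.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error: square root of the average squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def precision_recall_f1(actual, predicted, positive=1):
    """Precision, recall and F1 score for the class labelled `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Regression: hypothetical actual vs predicted prices.
print(rmse([68.6, 34.5, 47.7], [66.0, 36.0, 50.0]))

# Classification: 1 = spam, 0 = not spam (hypothetical labels).
actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(actual, predicted))
```

Precision answers "of the items flagged as positive, how many really were?", while recall answers "of the real positives, how many were found?"; F1 is their harmonic mean.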
  • 43. Data mining steps in detail: interpret / evaluate performance (step 6). – These error measures are used as a basis for: – Confirming the performance of the model – Comparing the performance of different algorithms – Sometimes a model fits the training & validation sets very well but is unable to generalise to new samples; this is an overfit (called high variance). A model that fits poorly even on the training set is an underfit (called high bias).
  • 44. Data mining steps in detail: interpret / evaluate performance (step 6). • The performance of the model is not always at the desired level. One or more of the following measures could be tried to improve it: – Increase the number of training samples – Increase the number of features – Decrease the number of features – Add polynomial features (e.g. hθ(x) = θ0 + θ1x^2 + θ2x^3 + θ3x^4) – Improve the model by tuning the learning algorithm. Each algorithm has tuning parameters (e.g. for the SVM learning algorithm they are cost, gamma and epsilon).
  • 45. Data mining steps in detail: deploying the model (step 6). – Models are deployed for routine use & data can be scored in real time. – Before deployment, the model is often exported to an open standard format, PMML. – PMML (Predictive Model Markup Language) provides a standard way to represent data mining models. It allows for the interchange of models among different tools and environments. – Companies like Zementis provide PMML-based scoring engines for many platforms.
  • 46. • Overview: what is data mining & machine learning, why and where it is used • Types of data mining • Data mining steps: overview • Data mining steps in detail • Caution notice, data mining software, references • About q-Maxim & Jagadish C A
  • 47. Data mining caution notice. – One must carefully distinguish between correlation and causation. – The fact that a data mining study indicates a high level of model performance does not necessarily imply causation. – It is possible to get good correlation by fitting a model to mere noise rather than signal. – Healthy scepticism is desirable. Before concluding anything about causation, the facts have to be verified.
  • 48. Data mining software. Several data mining packages exist which make the data mining task relatively painless. Some of the prominent open source ones are: 1. R 2. Rattle (R with a graphical interface) 3. Octave
  • 49. Data mining software. Some of the prominent commercial ones are: 1. Revolution Analytics (enhanced R) 2. Minitab (some data mining aspects) 3. MS-Office data mining add-in 4. IBM SPSS, SAS, Statistica 5. Microsoft Office data mining extensions 6. ... & many more
  • 50. Data mining references. 1. Data mining, Wikipedia, the free encyclopedia 2. Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute publication 3. Machine learning, Wikipedia 4. A. Guazzelli, M. Zeller, W. Chen, and G. Williams. PMML: An Open Standard for Sharing Models. The R Journal, Volume 1/1, May 2009 5. Data analysis and machine learning online courses on Coursera 6. R: a programming language and software environment for statistical computing, data mining, and graphics; numerous other R resources on the web 7. G. Williams. Rattle: A Data Mining GUI for R. The R Journal 8. Support vector machine (SVM), Wikipedia 9. Neural network software, Wikipedia, the free encyclopedia 10. Random forest, Wikipedia, the free encyclopedia 11. Publications / websites of the commercial data mining software companies listed in the previous slide 12. Jagadish’s notes based on his past experience
  • 51. • Overview: what is data mining & machine learning, why and where it is used • Types of data mining • Data mining steps: overview • Data mining steps in detail • Data mining software, references • About q-Maxim, Jagadish C A
  • 52. Questions? Doubts? What next? Would you like to discuss further to explore data mining / machine learning? Contact us for details (details in the next slide).
  • 53. Contact: q-Maxim, Jagadish C.A. (Rao), Founder, President. jagadish.chandra@qmaxim.com, +91 9538328704, +91 80 2693 1804. LinkedIn: http://in.linkedin.com/in/jagdishca/ Blog: qmaxim.wordpress.com. About us: Q-Maxim is a niche consultancy focussed on advanced problem solving, quality, optimization and Japanese quality methodologies. Note: the contents, concepts, data and style of this presentation are proprietary in nature and are subject to intellectual property restrictions.