SlideShare a Scribd company logo
1 of 12
Data preprocessing
concepts
 Feature Engneering
 Feature Scaling
 Train test split
 Encoding techniques
 Imbalanced data
Table of Contents
Feature Engineering
 some times we are having more no.of columns it leads for the curse of dimensionality or
high dimensionality at that point model will not perform well
 In this case we just remove the un usefull columns they are
droping columns (columns with more than 70% of Missing values)
 columns with less variance (the spread of the data is very high )
 columns with same co-relation with other columns
 Finally we will go for PCA(principle component analysis) for unsupervised
machine learning algorithms
 LDA (Latent dimension analysis ) for supervised machine learning models
Feature Scaling
MinMaxScaler
StandardScaler
x(i)norm = (xi- xmin )/(xmax- xmin )
x(i)std = ( x(i)- μx)/(σx )
*Min-max scaling scales the data to a limited range of values.
in standardization, we center the feature columns at mean 0 with standard deviation 1
so that the feature columns takes the form of a normal distribution,
Imbalanced Data
The problem of imbalanced classes arises when one set of classes
dominate over another set of classes
Example of imbalanced classes
Disease detection
Fraud detection
Spam filtering
Earthquake prediction
Different ways to fix the imbalanced data :
 Undersampling methods
 Oversampling
methods
 Synthetic Minority Oversampling Technique
(SMOTE)
Creating synthetic samples
Encoding techniques
Usually Data in the form of three types :-
1. Nominal
2. Ordinal
3. Binary
The Data Variable that represents names, no.s, places with no
order
The Data Variable that represents the data with some order
The data that is in the form of binary like 0 or 1, yes or no, True or False etc…..
1.Binary: Either/or
Examples:
1. Yes, No
2. True, False
1.Ordinal: Specific ordered Groups.
Examples:
1. low, medium, high
2. cold, hot, lava Hot
1.Nominal: Unordered
Groups.
Examples
1. cat, dog, tiger
2. pizza, burger, coke
Encoding tech
 Converting all catogeries in a variable into numerical
Encoding Types
1 . One hot encoding
2. Label Encoding
4. Mean Encoding
5. Top 10 Mean Frequency Encoding
3. Frequency Encoding
Splitting Data
Train test split is nothing but splitting the data into train data
1. In this splitting train data is for training the data
2. Test data is for test the model performance
1. x_train
2. x_test
3. y_train
4. y_test

More Related Content

Similar to preprocessing techniques.pptx

Machine learning and decision trees
Machine learning and decision treesMachine learning and decision trees
Machine learning and decision treesPadma Metta
 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learningRajasekhar364622
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 
Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...butest
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfAmmarAhmedSiddiqui2
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningAI Summary
 
How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? HackerEarth
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)Abhimanyu Dwivedi
 
An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boostingbutest
 
Credit Card Fraud Detection - Anomaly Detection
Credit Card Fraud Detection - Anomaly DetectionCredit Card Fraud Detection - Anomaly Detection
Credit Card Fraud Detection - Anomaly DetectionLalit Jain
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfAaryanArora10
 
Machine learning Mind Map
Machine learning Mind MapMachine learning Mind Map
Machine learning Mind MapAshish Patel
 

Similar to preprocessing techniques.pptx (20)

Classification
ClassificationClassification
Classification
 
Classification
ClassificationClassification
Classification
 
ML-Unit-4.pdf
ML-Unit-4.pdfML-Unit-4.pdf
ML-Unit-4.pdf
 
Machine learning and decision trees
Machine learning and decision treesMachine learning and decision trees
Machine learning and decision trees
 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learning
 
LR2. Summary Day 2
LR2. Summary Day 2LR2. Summary Day 2
LR2. Summary Day 2
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdf
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ?
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boosting
 
Credit Card Fraud Detection - Anomaly Detection
Credit Card Fraud Detection - Anomaly DetectionCredit Card Fraud Detection - Anomaly Detection
Credit Card Fraud Detection - Anomaly Detection
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdf
 
Machine learning Mind Map
Machine learning Mind MapMachine learning Mind Map
Machine learning Mind Map
 

Recently uploaded

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfSayantanBiswas37
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numberssuginr1
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...HyderabadDolls
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 

Recently uploaded (20)

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 

preprocessing techniques.pptx

  • 2.  Feature Engneering  Feature Scaling  Train test split  Encoding techniques  Imbalanced data Table of Contents
  • 3. Feature Engineering  some times we are having more no.of columns it leads for the curse of dimensionality or high dimensionality at that point model will not perform well  In this case we just remove the un usefull columns they are droping columns (columns with more than 70% of Missing values)  columns with less variance (the spread of the data is very high )  columns with same co-relation with other columns  Finally we will go for PCA(principle component analysis) for unsupervised machine learning algorithms  LDA (Latent dimension analysis ) for supervised machine learning models
  • 4. Feature Scaling MinMaxScaler StandardScaler x(i)norm = (xi- xmin )/(xmax- xmin ) x(i)std = ( x(i)- μx)/(σx ) *Min-max scaling scales the data to a limited range of values. in standardization, we center the feature columns at mean 0 with standard deviation 1 so that the feature columns takes the form of a normal distribution,
  • 5. Imbalanced Data The problem of imbalanced classes arises when one set of classes dominate over another set of classes Example of imbalanced classes Disease detection Fraud detection Spam filtering Earthquake prediction
  • 6. Different ways to fix the imbalanced data :  Undersampling methods  Oversampling methods  Synthetic Minority Oversampling Technique (SMOTE)
  • 8. Encoding techniques Usually Data in the form of three types :- 1. Nominal 2. Ordinal 3. Binary The Data Variable that represents names, no.s, places with no order The Data Variable that represents the data with some order The data that is in the form of binary like 0 or 1, yes or no, True or False etc…..
  • 9.
  • 10. 1.Binary: Either/or Examples: 1. Yes, No 2. True, False 1.Ordinal: Specific ordered Groups. Examples: 1. low, medium, high 2. cold, hot, lava Hot 1.Nominal: Unordered Groups. Examples 1. cat, dog, tiger 2. pizza, burger, coke
  • 11. Encoding tech  Converting all catogeries in a variable into numerical Encoding Types 1 . One hot encoding 2. Label Encoding 4. Mean Encoding 5. Top 10 Mean Frequency Encoding 3. Frequency Encoding
  • 12. Splitting Data Train test split is nothing but splitting the data into train data 1. In this splitting train data is for training the data 2. Test data is for test the model performance 1. x_train 2. x_test 3. y_train 4. y_test