Version 1.0
Machine Learning - Feature Selection
Feature selection describes the process of picking particular,
relevant data features out of a wider data set, to be used to
perform model training.
Obioma Anomnachi
Engineer @ Anant
Data Preparation
● Data preparation deals with
transformations applied to data that
prepare it for use with machine
learning algorithms
○ Previously, we’ve covered a number
of methods within the field:
https://blog.anant.us/spark-and-cassandra-for-machine-learning-data-pre-processing/
○ Vectorization and Encoding help
organize raw data into a form that
ML models can work with
○ Standardization rescales numeric fields
to a common scale (for example zero
mean and unit variance) and prepares
them for models that expect data
within certain ranges
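As a rough illustration of the encoding and standardization steps mentioned above, here is a minimal sketch using only the Python standard library. The helper names `one_hot` and `standardize` and the sample values are made up for this example and are not taken from the earlier blog post.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Encode category labels as one-hot vectors (columns in sorted category order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def standardize(column):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no variance to rescale.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Hypothetical sample data.
colors = ["red", "green", "red", "blue"]
encoded = one_hot(colors)          # columns: blue, green, red
heights_cm = [150.0, 160.0, 170.0, 180.0]
scaled = standardize(heights_cm)   # mean 0, standard deviation 1
```

Libraries such as scikit-learn or Spark ML provide equivalent transformers; the hand-written versions here only make the arithmetic visible.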
Data Preparation (2)
● Imputation is one of a number of methods
for dealing with missing fields for particular
rows within your data
● Feature selection actually falls within the
same category as PCA, a previously covered
topic. Both methods are types of
dimensionality reduction.
○ Dimensionality reduction focuses on
removing irrelevant data from the data set to
reduce computational costs, improve model
performance, and work towards “legibility” -
or the ability of the model to be understood
by humans.
Feature Selection - Overview
● Feature selection, as a subcategory of dimensionality reduction, is concerned with picking the
most relevant features out of a dataset. It is a process for removing irrelevant or misleading
columns from a dataset before any models are trained.
○ Just like ML models in general, feature selection methods can be supervised or unsupervised, depending on
whether the data that they interact with is labeled or not.
■ Unsupervised feature selection processes do not have a label against which they can compare the
relevance of the data, so the most they can accomplish is to remove redundant or uninformative data from the data set.
■ Supervised processes can compare how highly certain fields are correlated with the label we want the
model to predict in the end, so data can be defined as irrelevant if it has no bearing on that outcome.
■ Essentially, supervised methods are about the relationship between your data and the labels while
unsupervised methods are about the relationships between your data and the rest of your data.
Feature Selection - Unsupervised
● Unsupervised methods can work within individual features to remove ones that, even in isolation, fail
to add information to the wider data set.
○ Variance thresholds are used to remove any fields with variance below a certain value.
○ In the most extreme case, fields that contain the same value for every row in the dataset can safely be
dropped. Higher thresholds also remove fields that vary only slightly. Variance depends on a field's scale, so
thresholds are best applied to comparably scaled fields.
● They can also work across the entire set of features to remove redundant ones.
○ A correlation matrix can be built between fields in the data set. When two fields show extremely high correlation
with each other, only one of them needs to be kept.
○ For an extreme example, consider a data set that contains two fields measuring the exact same thing in
different units. At most one of those should make it into the training set. In the correlation matrix they would show a
correlation of 1 (100%) with each other, signalling that we only need one.
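The two unsupervised steps above can be sketched as follows: first a variance threshold, then a greedy pass that drops any field too strongly correlated with one already kept. This is a minimal sketch; the function name `unsupervised_select`, the default thresholds, and the sample columns are assumptions made for illustration.

```python
from math import sqrt
from statistics import pvariance

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined for constant columns
    return cov / (sx * sy)

def unsupervised_select(columns, min_variance=0.0, max_abs_corr=0.95):
    """Return the names of columns that survive both filters."""
    # Step 1: variance threshold (drops constant columns when min_variance is 0).
    kept = [name for name, col in columns.items() if pvariance(col) > min_variance]
    # Step 2: keep a column only if it is not highly correlated with one already selected.
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[other])) < max_abs_corr
               for other in selected):
            selected.append(name)
    return selected

heights_cm = [150.0, 160.0, 170.0, 180.0]
data = {
    "height_cm": heights_cm,
    "height_in": [h / 2.54 for h in heights_cm],  # same measurement, different unit
    "country_code": [1.0, 1.0, 1.0, 1.0],         # identical in every row
    "age": [31.0, 25.0, 47.0, 38.0],
}
selected = unsupervised_select(data)
print(selected)
```

Which of two correlated fields is kept depends here only on column order; in practice that choice is usually made on domain grounds, such as which unit downstream consumers expect.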
Feature Selection - Supervised
● Supervised feature selection methods compare
predictor fields to the label field, picking out
the fields most relevant to the prediction
outcome.
○ Supervised methods are divided further into
three groups.
○ Filter methods score each field with a
statistical or information-theoretic measure of
its relation to the label field and drop the
lowest-scoring fields.
○ Wrapper methods train and test models on
different subsets of fields, using the test
results to decide which fields to keep or
remove.
○ Intrinsic (also called embedded) methods
perform selection as part of training the final
model itself, so no separate search over
subsets of fields is needed.
Feature Selection - Supervised - Filter Methods
● Filter methods use statistical analysis to
perform feature selection. Which test to use
depends on the types of the label field and
of the predictor field being analyzed.
○ Numerical fields cover any fields with integer
or decimal types.
○ Categorical fields include boolean types,
ordinal categories, and nominal categories.
● Each combination of types has its own
associated statistical tests. Some of these are
familiar, like Pearson’s correlation
coefficient, a measure of linear correlation
between two numerical fields, and ANOVA
(analysis of variance), which tests whether the
mean of a numerical field differs between the
groups defined by a categorical field.
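To make the filter idea concrete, the sketch below computes a one-way ANOVA F-statistic for each numerical predictor against a categorical label and ranks the predictors by that score. The function name `anova_f`, the field names, and the sample values are hypothetical.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of a numerical field grouped by a categorical label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more rows than groups")
    grand_mean = sum(values) / n
    # Variation of the group means around the overall mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Variation of the values around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf")  # groups are perfectly separated
    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = ["a", "a", "a", "b", "b", "b"]
features = {
    "informative": [1.0, 1.2, 0.9, 3.0, 3.1, 2.8],  # differs clearly between groups
    "noise": [2.0, 1.0, 3.0, 2.0, 3.0, 1.0],        # same mean in both groups
}
scores = {name: anova_f(col, labels) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A filter would then keep the top-scoring fields or those above a chosen cutoff. The cutoff itself is a judgment call that the statistic does not decide.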
Feature Selection - Supervised - Wrapper Methods
● Wrapper methods train models on subsets of fields and evaluate the
performance of those models to determine the best subset of features to
select.
○ The most obvious method in this group is Exhaustive Feature Selection, where every
combination of features is used to train a model. Each model’s performance is
compared and the best performing subset is selected as the set of features for
the actual learning task. This returns the best performing subset over all possible
combinations, but with n features it requires training 2^n - 1 models, so it is only
practical for small feature sets.
○ Other techniques include:
■ Forward Feature Selection - Start with the best single feature and add
features until criteria are met.
■ Backward Feature Elimination - Start with all of the features and remove
them until criteria are satisfied.
■ Recursive Feature Elimination - Recursively remove features or groups
of features that are determined to be least important
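The wrapper approach can be sketched with any model and scoring routine. In the minimal example below, the "model" is a 1-nearest-neighbour classifier scored by leave-one-out accuracy. That model, the function names, and the toy data are all assumptions chosen to keep the example self-contained.

```python
from itertools import combinations

def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier using only `subset` columns."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_dist = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue
            dist = sum((row[c] - other[c]) ** 2 for c in subset)
            if dist < best_dist:
                best_j, best_dist = j, dist
        correct += labels[best_j] == labels[i]
    return correct / len(rows)

def exhaustive_selection(rows, labels, n_features):
    """Score every non-empty subset of features (2^n - 1 of them) and keep the best."""
    best_subset, best_score = None, -1.0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            score = loo_accuracy(rows, labels, subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

def forward_selection(rows, labels, n_features):
    """Add the single most helpful feature at each step; stop when the score stops improving."""
    selected, best_score = [], -1.0
    remaining = set(range(n_features))
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f)
                  for f in sorted(remaining)]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Toy data: column 0 separates the classes, columns 1 and 2 do not.
rows = [
    [0.0, 5.0, 1.0], [0.1, 1.0, 9.0], [0.2, 8.0, 4.0],
    [1.0, 5.1, 1.1], [1.1, 1.1, 9.1], [1.2, 8.1, 4.1],
]
labels = [0, 0, 0, 1, 1, 1]
exhaustive_result = exhaustive_selection(rows, labels, 3)
forward_result = forward_selection(rows, labels, 3)
print(exhaustive_result, forward_result)
```

Forward selection evaluates far fewer subsets than the exhaustive search, at the cost of possibly missing combinations that only help together.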
Feature Selection - Supervised - Intrinsic Methods
● Intrinsic methods are similar to wrapper methods of feature selection in that they involve training
a model.
○ While wrapper methods train preliminary models in order to compare candidate subsets of fields,
intrinsic methods select fields during the actual model training process.
○ L1 Regularization - or LASSO - directly changes the cost function to help avoid overfitting. It adds a
penalty proportional to the absolute value of each field's coefficient. This penalty can drive coefficients all
the way down to zero, effectively removing those fields from the model.
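The zeroing effect of the L1 penalty can be seen in a small coordinate-descent sketch. It assumes centered data with no intercept and minimizes (1/(2n)) * sum of squared errors + alpha * sum(|w_j|). The function names, the value of `alpha`, and the data are illustrative assumptions, not the method used in the demo.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; values inside the band become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Fit LASSO coefficients by cyclic coordinate descent (no intercept term)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w

# Toy data: y depends strongly on x0 and only weakly on x1.
x0 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x1 = [1.0, -1.0, 0.0, -1.0, 1.0]
X = [[a, b] for a, b in zip(x0, x1)]
y = [3.0 * a + 0.1 * b for a, b in zip(x0, x1)]

w = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, coef in enumerate(w) if coef != 0.0]
print(w, selected)
```

With these values an unpenalized least-squares fit would give coefficients of about 3 and 0.1. The penalty shrinks the first and sets the second to exactly zero, which is what removes the weak field from the model.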
Demo
Resources
● https://sebastianraschka.com/faq/docs/feature_sele_categories.html
● http://www.feat.engineering/classes-of-feature-selection-methodologies.html
● https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/#:~:text=Feature%20selection%20is%20the%20process,the%20performance%20of%20the%20model.
● https://scikit-learn.org/stable/modules/feature_selection.html
● https://en.wikipedia.org/wiki/Feature_selection
● https://www.simplilearn.com/tutorials/machine-learning-tutorial/feature-selection-in-machine-learning
● https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM,CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 

Data Engineer’s Lunch #67: Machine Learning - Feature Selection

  • 1. Version 1.0
    Machine Learning - Feature Selection
    Feature selection describes the process of picking particular, relevant data features out of a wider data set, to be used to perform model training.
    Obioma Anomnachi, Engineer @ Anant
  • 2. Data Preparation
    ● Data preparation deals with transformations applied to data that prepare it for use with machine learning algorithms
      ○ Previously, we’ve covered a number of methods within the field: https://blog.anant.us/spark-and-cassandra-for-machine-learning-data-pre-processing/
      ○ Vectorization and Encoding help organize raw data into a form that ML models can work with
      ○ Standardization can help to better express the variance within data and prepare it for models that expect data within certain ranges
  • 3. Data Preparation (2)
    ● Imputation is one of a number of methods for dealing with missing fields for particular rows within your data
    ● Feature selection actually falls within the same category as PCA, a previously covered topic. Both methods are types of dimensionality reduction.
      ○ Dimensionality reduction focuses on removing irrelevant data from the data set to reduce computational costs, improve model performance, and work towards “legibility” - or the ability of the model to be understood by humans.
  • 4. Feature Selection - Overview
    ● Feature selection, as a subcategory of dimensionality reduction, is concerned with picking the most relevant features out of a dataset. It is a process for removing irrelevant or misleading columns from a dataset before any models are trained.
      ○ Just like ML models in general, feature selection methods can be supervised or unsupervised, depending on whether the data that they interact with is labeled or not.
        ■ Unsupervised feature selection processes do not have a label against which they can compare the relevance of the data, so the most they can accomplish is to remove redundant data from the data set.
        ■ Supervised processes can compare how highly certain fields are correlated with the label we want the model to predict in the end, so data can be defined as irrelevant if it has no bearing on that outcome.
        ■ Essentially, supervised methods are about the relationship between your data and the labels, while unsupervised methods are about the relationships between your data and the rest of your data.
  • 5. Feature Selection - Unsupervised
    ● Unsupervised methods can examine individual features and remove those that, even in isolation, fail to add information to the wider data set
      ○ Variance thresholds remove any fields whose variance falls below a chosen value.
      ○ In the most extreme case, fields that contain the same value for every row in the dataset (zero variance) can safely be dropped. Nonzero thresholds apply the same idea less strictly, dropping fields that barely vary.
    ● They can also work across the entire set of features to remove redundant ones.
      ○ A correlation matrix can be built between the fields in the data set. When two fields are very highly correlated with each other, one of them can be dropped with little loss of information.
      ○ For an extreme example, consider a data set that contains two fields measuring the exact same thing in different units. At most one of those should make it into the training set. In the correlation matrix they would show a correlation of 1 with each other, signalling that we only need one.
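The two unsupervised checks on this slide can be sketched with NumPy. This is a minimal illustration, not the code from the talk's demo: the helper names `drop_low_variance` and `drop_redundant`, the 0.95 correlation cutoff, and the toy data are all assumptions made for the example.

```python
import numpy as np

def drop_low_variance(X, threshold=0.0):
    """Return indices of columns whose variance is above `threshold`."""
    variances = X.var(axis=0)
    return [j for j in range(X.shape[1]) if variances[j] > threshold]

def drop_redundant(X, columns, max_abs_corr=0.95):
    """From `columns`, keep those not highly correlated with an earlier kept column."""
    if len(columns) < 2:
        return list(columns)
    corr = np.corrcoef(X[:, columns], rowvar=False)
    kept = []  # positions within `columns`
    for pos in range(len(columns)):
        if all(abs(corr[pos, k]) <= max_abs_corr for k in kept):
            kept.append(pos)
    return [columns[pos] for pos in kept]

# Toy data (hypothetical): the same height in two units, a constant, and a weight.
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
X = np.column_stack([
    height_cm,                                  # column 0: height in cm
    height_cm / 2.54,                           # column 1: same height in inches
    np.ones(5),                                 # column 2: constant field
    np.array([55.0, 80.0, 62.0, 90.0, 70.0]),   # column 3: weight in kg
])

varying = drop_low_variance(X)        # removes the constant column
selected = drop_redundant(X, varying) # removes one of the two height columns
print(selected)
```

The variance filter runs first so that the correlation matrix is never computed on a constant column, whose correlation is undefined. On this toy data the constant field and the duplicate height field are dropped, leaving columns 0 and 3.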
  • 6. Feature Selection - Supervised
    ● Supervised feature selection methods compare predictor fields to the label field, picking out the fields most relevant to the prediction outcome.
      ○ Supervised methods are divided further into three groups.
      ○ Filter methods use statistical or information-theoretic measures to score each field's relationship to the label field and drop the least relevant fields.
      ○ Wrapper methods train and test models on different subsets of fields, using the test results to determine which fields to keep or remove.
      ○ Intrinsic methods perform feature selection as part of training the model itself, rather than as a separate search over subsets of fields.
  • 7. Feature Selection - Supervised - Filter Methods
    ● Filter methods use statistical analysis to perform feature selection. Which statistical test to use depends on the data types of the label field and of the predictor field being analyzed.
      ○ Numerical fields cover any fields with integer or decimal types.
      ○ Categorical fields include boolean types, ordinal categories, and nominal categories.
    ● Each combination of predictor and label types has its own associated statistical tests. Some of these are familiar, like Pearson’s correlation coefficient, a measure of linear correlation between two numerical fields, and ANOVA, a significance test that compares the means of a numerical field across the groups of a categorical field.
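For the numerical-predictor, numerical-label case, a filter method can be sketched as: score every field by the absolute value of its Pearson correlation with the label, then keep the top k. The function names `pearson_scores` and `select_k_best`, the choice of k, and the synthetic data below are assumptions for illustration only.

```python
import numpy as np

def pearson_scores(X, y):
    """Absolute Pearson correlation between each column of X and the label y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / denom

def select_k_best(X, y, k):
    """Return the indices of the k columns with the highest scores."""
    scores = pearson_scores(X, y)
    top = np.argsort(scores)[::-1][:k]
    return sorted(int(j) for j in top)

# Synthetic data (hypothetical): only columns 0 and 2 influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

scores = pearson_scores(X, y)
selected = select_k_best(X, y, k=2)
print(scores.round(2), selected)
```

No model is trained here, which is what distinguishes a filter from the wrapper and intrinsic methods. Note that Pearson correlation only captures linear relationships, so a field with a strong nonlinear effect on the label could still score low.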
  • 8. Feature Selection - Supervised - Wrapper Methods
    ● Wrapper methods train models on subsets of fields and evaluate the performance of those models to determine the best subset of features to select.
      ○ The most obvious method in this group is exhaustive feature selection, where every combination of features is used to train a model. Each model’s performance is compared and the best performing subset is selected as the set of features for the actual learning task. This returns the best performing subset over all possible combinations, but with n features there are 2^n - 1 non-empty subsets to try, so it is only practical for small numbers of features.
      ○ Other techniques include:
        ■ Forward Feature Selection - Start with the best single feature and add features one at a time until criteria are met.
        ■ Backward Feature Elimination - Start with all of the features and remove them one at a time until criteria are satisfied.
        ■ Recursive Feature Elimination - Recursively remove features or groups of features that are determined to be least important.
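Forward feature selection can be sketched as a greedy loop around any model. The version below is an assumption-laden illustration rather than the talk's demo: it uses ordinary least squares as the wrapped model, mean squared error on a held-out validation split as the score, and a hypothetical `min_improvement` stopping criterion.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, columns):
    """Fit least squares (with intercept) on `columns`; return MSE on the validation split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, columns]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, columns]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residuals = A_val @ coef - y_val
    return float(np.mean(residuals ** 2))

def forward_selection(X_train, y_train, X_val, y_val, min_improvement=0.005):
    """Greedily add the feature that lowers validation MSE most; stop when gains are small."""
    selected = []
    remaining = list(range(X_train.shape[1]))
    # Baseline: intercept-only model with no features.
    best_mse = validation_mse(X_train, y_train, X_val, y_val, selected)
    while remaining:
        scores = {
            j: validation_mse(X_train, y_train, X_val, y_val, selected + [j])
            for j in remaining
        }
        best_j = min(scores, key=scores.get)
        if best_mse - scores[best_j] < min_improvement:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_mse = scores[best_j]
    return selected, best_mse

# Synthetic data (hypothetical): only columns 1 and 3 influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

selected, best_mse = forward_selection(X[:200], y[:200], X[200:], y[200:])
print(selected, round(best_mse, 4))
```

Each pass trains one model per remaining feature, so the loop trains on the order of n² models in the worst case instead of the 2^n - 1 needed for exhaustive search. Backward elimination follows the same pattern, starting from all features and removing the one whose removal hurts the score least.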
  • 9. Feature Selection - Supervised - Intrinsic Methods
    ● Intrinsic methods are similar to wrapper methods of feature selection in that they involve training a model.
      ○ While wrapper methods train preliminary models in order to evaluate candidate feature subsets, intrinsic methods select features during the actual model training process.
      ○ L1 regularization, or LASSO, directly changes the cost function to help avoid overfitting. It adds a penalty proportional to the sum of the absolute values of the model's coefficients, one coefficient per field in the training set. This penalty can drive coefficients all the way down to zero, effectively removing those fields from the model.
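The way an L1 penalty zeroes out coefficients can be seen in a small coordinate-descent sketch. This is one common way to fit LASSO, shown under assumptions: the objective is taken as (1/(2n))·||y - Xw||² + alpha·||w||₁, the features are standardized and the label centered so no intercept is needed, and the names `soft_threshold`, `lasso_coordinate_descent`, and the value of `alpha` are chosen for the example.

```python
import numpy as np

def soft_threshold(value, alpha):
    """Shrink `value` toward zero by `alpha`; values within [-alpha, alpha] become 0."""
    return np.sign(value) * max(abs(value) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 one coefficient at a time."""
    n, p = X.shape
    w = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w

# Synthetic data (hypothetical): only columns 0 and 3 influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Standardize features and center the label so no intercept term is needed.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

w = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [int(j) for j in np.flatnonzero(w)]
print(w.round(3), selected)
```

The soft-thresholding step is where selection happens: any field whose correlation with the current residual is smaller than `alpha` gets a coefficient of exactly zero. Larger values of `alpha` remove more fields, at the cost of shrinking the surviving coefficients further.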
  • 10. Demo
  • 11. Resources
    ● https://sebastianraschka.com/faq/docs/feature_sele_categories.html
    ● http://www.feat.engineering/classes-of-feature-selection-methodologies.html
    ● https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/#:~:text=Feature%20selection%20is%20the%20process,the%20performance%20of%20the%20model.
    ● https://scikit-learn.org/stable/modules/feature_selection.html
    ● https://en.wikipedia.org/wiki/Feature_selection
    ● https://www.simplilearn.com/tutorials/machine-learning-tutorial/feature-selection-in-machine-learning
    ● https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
  • 12. Strategy: Scalable Fast Data
    Architecture: Cassandra, Spark, Kafka
    Engineering: Node, Python, JVM, CLR
    Operations: Cloud, Container
    Rescue: Downtime!! I need help.
    www.anant.us | solutions@anant.us | (855) 262-6826
    3 Washington Circle, NW | Suite 301 | Washington, DC 20037