Chapters 18 & 19
Building a model can be a never-ending process. Ways to improve the model: adding interactions, taking away variables, and doing transformations.
How do we judge the quality of the model? The answer: in relation to other models, using an analysis of residuals, drop-in deviance, the results of an ANOVA test, a Wald test, the AIC or BIC score, cross-validation error, or bootstrapping.
18.1. Residuals
Residuals are the difference between the actual response and the fitted values. In the linear model the errors, to which residuals are akin, are assumed to be normally distributed. The basic idea is that if the model is appropriately fitted to the data, the residuals should be normally distributed as well.
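As a minimal base-R sketch (the built-in mtcars data stands in for the book's data), residuals can be pulled from a fitted lm and checked for normality:

```r
# mtcars stands in for the book's data
model <- lm(mpg ~ wt + factor(cyl), data = mtcars)

res <- residuals(model)   # actual response minus fitted values
head(res)

shapiro.test(res)         # formal check of the residuals' normality
```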
In a coefficient plot, each coefficient is plotted as a point, with a thick line representing the one-standard-error confidence interval, a thin line representing the two-standard-error confidence interval, and a vertical line indicating 0. Remember: a good rule of thumb is that if the two-standard-error confidence interval does not contain 0, the coefficient is statistically significant.
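The book draws this plot with the coefplot package; as a base-R sketch (mtcars standing in for the book's data), the numbers behind such a plot are just each estimate plus or minus one and two standard errors:

```r
# intervals behind a coefficient plot, sketched in base R
model <- lm(mpg ~ wt + factor(cyl), data = mtcars)
co <- summary(model)$coefficients

intervals <- data.frame(
  estimate = co[, "Estimate"],
  inner_lo = co[, "Estimate"] - co[, "Std. Error"],      # thick line
  inner_hi = co[, "Estimate"] + co[, "Std. Error"],
  outer_lo = co[, "Estimate"] - 2 * co[, "Std. Error"],  # thin line
  outer_hi = co[, "Estimate"] + 2 * co[, "Std. Error"]
)
# rule of thumb: significant if the two-SE interval excludes 0
intervals$significant <- intervals$outer_lo > 0 | intervals$outer_hi < 0
intervals
```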
ggplot2 with linear regression
ggplot2 has a handy trick for dealing with lm models: we can use the model as the data source, and ggplot2 "fortifies" it, creating new columns for easy plotting.
The basic structure for ggplot2 starts with the ggplot function, which at its most basic should take the data as its first argument. It can take more arguments, or fewer, but we will stick with that for now. After initializing the object, we add layers using the + symbol. To start, we will just discuss geometric layers such as points, lines and histograms. They are included using functions like geom_point, geom_line and geom_histogram. These functions take multiple arguments, the most important being which variable in the data gets mapped to which axis or other aesthetic, using aes. Furthermore, each layer can have different aesthetic mappings and even different data.
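A short sketch of the fortify trick, assuming ggplot2 is installed (mtcars stands in for the book's data):

```r
# a sketch assuming ggplot2 is installed
library(ggplot2)

model <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# fortify() turns the model into a data.frame with .fitted, .resid,
# .stdresid and similar columns ready for plotting
head(fortify(model))

# residuals against fitted values, built layer by layer with +
ggplot(model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)
```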
Q-Q plot
If the model is a good fit, the standardized residuals should all fall along a straight line when plotted against the theoretical quantiles of the normal distribution. Both the base graphics and ggplot2 versions are shown on the next slide.
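The base graphics version can be sketched as follows (mtcars standing in for the book's data):

```r
# base graphics Q-Q plot of standardized residuals
model <- lm(mpg ~ wt + factor(cyl), data = mtcars)
std_resid <- rstandard(model)   # standardized residuals

qqnorm(std_resid)               # residual quantiles vs. normal quantiles
qqline(std_resid)               # the straight line a good fit should follow
```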
Next is a histogram of the residuals. This time we will not show the base graphics alternative, because a histogram is a standard plot that we have shown repeatedly. This histogram is not normally distributed, meaning the model is not an entirely correct specification.
All of this measuring of model fit only really makes sense
when comparing multiple models, because all of these
measures are relative.
The ANOVA F statistic is

F = [ Σ_i n_i (Ȳ_i − Ȳ)² / (K − 1) ] / [ Σ_ij (Y_ij − Ȳ_i)² / (N − K) ]

where:
n_i is the number of observations in group i,
Ȳ_i is the mean of group i,
Ȳ is the overall mean,
Y_ij is observation j in group i,
N is the total number of observations, and
K is the number of groups.
Whatever the merits of ANOVA as a multisample test, we do believe it serves a useful purpose in testing the relative merits of different models. Simply passing multiple model objects to anova will return a table of results including the residual sum of squares (RSS), which is a measure of error; the lower, the better.
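For example, comparing nested models in base R (mtcars standing in for the book's data):

```r
# comparing nested models with anova()
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

anova(m1, m2, m3)   # the RSS column is the error measure; lower is better
```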
AIC & BIC
Another measure is the Akaike Information Criterion (AIC). As with RSS, the model with the lowest AIC (even negative values) is considered optimal. The BIC (Bayesian Information Criterion) is a similar measure where, once again, lower is better.
The formulas for AIC and BIC are:
AIC = −2 ln(L) + 2p
BIC = −2 ln(L) + p ln(n)
where L is the maximized likelihood of the model, p is the number of estimated parameters, and n is the number of observations.
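In R, both criteria accept multiple models at once (mtcars standing in for the book's data):

```r
# mtcars stands in for the book's data
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(m1, m2)   # lower AIC is better
BIC(m1, m2)   # BIC penalizes each extra parameter more heavily
```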
Cross-Validation
The results from cv.glm include delta, which has two numbers:
• the raw cross-validation error, based on the cost function (in this case the mean squared error, which is a measure of correctness for an estimator) averaged over all the folds;
• the adjusted cross-validation error. This second number compensates for not using leave-one-out cross-validation, which is like k-fold cross-validation except that each fold consists of all but one data point, with that one point held out. Leave-one-out is very accurate but highly computationally intensive.
Although we get a nice number for the error, it helps us only if we can compare it to other models.
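A minimal sketch of cv.glm, which comes from the boot package shipped with R (mtcars stands in for the book's data):

```r
library(boot)   # provides cv.glm

# a gaussian glm, as cv.glm expects a glm object
model <- glm(mpg ~ wt + hp, data = mtcars)
set.seed(617)
cv5 <- cv.glm(data = mtcars, glmfit = model, K = 5)

cv5$delta   # raw cross-validation error, then the adjusted error
```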
Bootstrapping
• The idea is that we start with n rows of data. Some statistic (whether a mean, regression, or some arbitrary function) is applied to the data.
• Then the data are resampled with replacement, creating a new dataset. This new set still has n rows, except that some rows are repeated and other rows are entirely missing.
• The statistic is applied to this new dataset.
• The process is repeated R times (typically around 1,200), which generates an entire distribution for the statistic.
• This distribution can then be used to find the mean and confidence interval (typically 95%) for the statistic.
• The boot package is a very robust set of tools for making the bootstrap easy to compute.
• The way to compute the batting average is to divide total hits by total at bats. This means we cannot simply run mean(h/ab) and sd(h/ab) to get the mean and standard deviation. Rather, the batting average is calculated as sum(h)/sum(ab), and its standard deviation is not easily calculated. This problem is a great candidate for using the bootstrap.
• We calculate the overall batting average with the original data. Then we sample n rows with replacement and calculate the batting average again. We do this repeatedly until a distribution is formed. Rather than doing this manually, though, we use boot.
• The first argument to boot is the data. The second argument is the function that is to be computed on the data. This function must take at least two arguments.
• The beautiful thing about the bootstrap is its near-universal applicability. It can be used in just about any situation where an analytical solution is impractical or impossible.
Visualizing the distribution is as simple as plotting a histogram of the replicate results
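The steps above can be sketched with boot; since the book's baseball data is not reproduced here, a simulated stand-in (hypothetical hits and at-bats) is used:

```r
library(boot)
set.seed(123)

# toy stand-in for the book's baseball data: h = hits, ab = at bats
players <- data.frame(ab = rpois(100, 400))
players$h <- rbinom(100, players$ab, 0.27)

# boot passes the data plus the indices of the resampled rows
bat_avg <- function(data, idx) sum(data$h[idx]) / sum(data$ab[idx])

res <- boot(data = players, statistic = bat_avg, R = 1200)
res$t0                        # batting average on the original data
boot.ci(res, type = "perc")   # percentile 95% confidence interval
hist(res$t, main = "Bootstrapped batting average")
```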
18.5. Stepwise Variable Selection
• A common, though increasingly discouraged, way to select variables for a model is stepwise selection. This is the process of iteratively adding and removing variables from a model and testing the model at each step, usually using AIC.
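In base R this is the step function; a minimal sketch with mtcars standing in for the book's data:

```r
# AIC-guided stepwise search with step()
full <- lm(mpg ~ ., data = mtcars)
stepped <- step(full, direction = "both", trace = 0)

formula(stepped)   # the variables that survived
AIC(full)
AIC(stepped)       # never worse than the starting model
```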
Return to the book to see all results.
18.6. Conclusion
• Determining the quality of a model is an important step in the model-building process. This can take the form of traditional tests of fit such as ANOVA or more modern techniques like cross-validation.
• The bootstrap is another means of determining model uncertainty, especially for models where confidence intervals are impractical to calculate.
• These tools can also help to shape a model by selecting which variables are included and which are excluded.
Chapter 19. Regularization and Shrinkage
• 19.1. Elastic Net
• The Elastic Net is a dynamic blending of lasso and ridge regression.
• The lasso uses an L1 penalty to perform variable selection and dimension reduction, while the ridge uses an L2 penalty to shrink the coefficients for more stable predictions.
• The Elastic Net minimizes

(1/(2N)) Σ_{i=1}^{N} (y_i − x_iᵀβ)² + λ [ (1 − α)/2 ‖β‖₂² + α ‖β‖₁ ]

• where λ is a complexity parameter controlling the amount of shrinkage (0 is no penalty and ∞ is complete penalty),
• α regulates how much of the solution is ridge versus lasso, with α = 0 being complete ridge and α = 1 being complete lasso.
• Γ, not seen here, is a vector of penalty factors, one value per variable, that multiplies λ for fine-tuning of the penalty applied to each variable.
Lasso vs ridge
glmnet
• The glmnet package fits generalized linear models with the Elastic Net.
• It is designed for speed and for larger, sparser data.
• Where functions like lm and glm take a formula to specify the model, glmnet requires a matrix of predictors (including an intercept) and a response matrix.
• We will look at the American Community Survey (ACS) data for New York State. We will throw every possible predictor into the model and see which are selected.
• λ controls the amount of shrinkage.
• By default, glmnet fits the regularization path on 100 different values of λ.
• The glmnet package has a function, cv.glmnet, that computes the cross-validation automatically. By default α = 1, meaning only the lasso is calculated.
• Selecting the best α requires an additional layer of cross-validation.
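A minimal sketch, assuming glmnet is installed; the ACS data is not reproduced here, so mtcars stands in for it:

```r
# a sketch assuming glmnet is installed; mtcars stands in for the ACS data
library(glmnet)

x <- model.matrix(mpg ~ . - 1, data = mtcars)   # numeric predictor matrix
y <- mtcars$mpg

set.seed(5)
cv_lasso <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = 5)  # alpha = 1: lasso

cv_lasso$lambda.min               # lambda with the lowest CV error
coef(cv_lasso, s = "lambda.min")  # dropped variables show as dots
```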
Visualizing where variables enter the model along the λ path can be illuminating
Finding the optimal value of α requires an additional layer of cross-validation, and unfortunately glmnet does not do that automatically. This will require us to run cv.glmnet at various levels of α, which will take a fairly large chunk of time if performed sequentially, making this a good time to use parallelization. The most straightforward way to run code in parallel is to use the parallel, doParallel and foreach packages.
• First, we build some helper objects to speed along the process.
• When a two-layered cross-validation is run, an observation should fall in the same fold each time, so we build a vector specifying fold membership.
• We also specify the sequence of α values that foreach will loop over.
• It is generally considered better to lean toward the lasso rather than the ridge, so we consider only α values greater than 0.5.
Before running a parallel job, a cluster (even on a single machine) must be started and registered with makeCluster and registerDoParallel. After the job is done, the cluster should be stopped with stopCluster.
Setting .errorhandling to "remove" means that if an error occurs, that iteration will be skipped. Setting .inorder to FALSE means that the order of combining the results does not matter and they can be combined whenever returned, which yields significant speed improvements. Because we are using the default combination function, list, which takes multiple arguments at once, we can speed up the process by setting .multicombine to TRUE.
We specify in .packages that glmnet should be loaded on each of the workers, again leading to performance improvements. The operator %dopar% tells foreach to work in parallel.
Parallel computing can be dependent on the environment, so we explicitly load some variables into the foreach environment using .export, namely acsX, acsY, alphas and theFolds.
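The book uses foreach/doParallel for this loop; as a minimal sketch of the same start-compute-stop pattern using base R's parallel package (a cheap stand-in function replaces cv.glmnet, and the data export step is only noted in a comment):

```r
library(parallel)

alphas <- seq(0.5, 1, by = 0.05)   # lean toward the lasso: alpha >= 0.5

cl <- makeCluster(2)               # start a cluster, even on one machine
# a real run would clusterExport() the data and load glmnet on each worker
fits <- parLapply(cl, alphas, function(a) a^2)
stopCluster(cl)                    # always stop the cluster when done

length(fits)
```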
19.2. Bayesian Shrinkage
• Bayesian shrinkage is useful when a model is built on data that does not have a large enough number of rows for some combinations of the variables. For this example, we blatantly steal an example.
Chapter 18,19
Chapter 18,19
Chapter 18,19
Chapter 18,19
Chapter 18,19
Chapter 18,19
Chapter 18,19

Contenu connexe

Tendances

Logarithmic transformations
Logarithmic transformationsLogarithmic transformations
Logarithmic transformationsamylute
 
Use of Linear Regression in Machine Learning for Ranking
Use of Linear Regression in Machine Learning for RankingUse of Linear Regression in Machine Learning for Ranking
Use of Linear Regression in Machine Learning for Rankingijsrd.com
 
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with YellowbrickRebecca Bilbro
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Kazi Toufiq Wadud
 
PRML Chapter 11
PRML Chapter 11PRML Chapter 11
PRML Chapter 11Sunwoo Kim
 
PRML Chapter 9
PRML Chapter 9PRML Chapter 9
PRML Chapter 9Sunwoo Kim
 
Binary classification metrics_cheatsheet
Binary classification metrics_cheatsheetBinary classification metrics_cheatsheet
Binary classification metrics_cheatsheetJakub Czakon
 
Implement principal component analysis (PCA) in python from scratch
Implement principal component analysis (PCA) in python from scratchImplement principal component analysis (PCA) in python from scratch
Implement principal component analysis (PCA) in python from scratchEshanAgarwal4
 
HRUG - Linear regression with R
HRUG - Linear regression with RHRUG - Linear regression with R
HRUG - Linear regression with Regoodwintx
 
PRML Chapter 12
PRML Chapter 12PRML Chapter 12
PRML Chapter 12Sunwoo Kim
 
PRML Chapter 5
PRML Chapter 5PRML Chapter 5
PRML Chapter 5Sunwoo Kim
 
Principal Component Analysis and Clustering
Principal Component Analysis and ClusteringPrincipal Component Analysis and Clustering
Principal Component Analysis and ClusteringUsha Vijay
 
Prediction of house price using multiple regression
Prediction of house price using multiple regressionPrediction of house price using multiple regression
Prediction of house price using multiple regressionvinovk
 
Forecasting warranty returns with Wiebull Fit
Forecasting warranty returns with Wiebull FitForecasting warranty returns with Wiebull Fit
Forecasting warranty returns with Wiebull FitTonda MacLeod
 

Tendances (20)

working with python
working with pythonworking with python
working with python
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 
Logarithmic transformations
Logarithmic transformationsLogarithmic transformations
Logarithmic transformations
 
Use of Linear Regression in Machine Learning for Ranking
Use of Linear Regression in Machine Learning for RankingUse of Linear Regression in Machine Learning for Ranking
Use of Linear Regression in Machine Learning for Ranking
 
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with Yellowbrick
 
Factors affecting customer satisfaction
Factors affecting customer satisfactionFactors affecting customer satisfaction
Factors affecting customer satisfaction
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?
 
PRML Chapter 11
PRML Chapter 11PRML Chapter 11
PRML Chapter 11
 
PRML Chapter 9
PRML Chapter 9PRML Chapter 9
PRML Chapter 9
 
2. diagnostics, collinearity, transformation, and missing data
2. diagnostics, collinearity, transformation, and missing data 2. diagnostics, collinearity, transformation, and missing data
2. diagnostics, collinearity, transformation, and missing data
 
Binary classification metrics_cheatsheet
Binary classification metrics_cheatsheetBinary classification metrics_cheatsheet
Binary classification metrics_cheatsheet
 
Credit risk - loan default model
Credit risk - loan default modelCredit risk - loan default model
Credit risk - loan default model
 
Implement principal component analysis (PCA) in python from scratch
Implement principal component analysis (PCA) in python from scratchImplement principal component analysis (PCA) in python from scratch
Implement principal component analysis (PCA) in python from scratch
 
HRUG - Linear regression with R
HRUG - Linear regression with RHRUG - Linear regression with R
HRUG - Linear regression with R
 
PRML Chapter 12
PRML Chapter 12PRML Chapter 12
PRML Chapter 12
 
PRML Chapter 5
PRML Chapter 5PRML Chapter 5
PRML Chapter 5
 
Principal Component Analysis and Clustering
Principal Component Analysis and ClusteringPrincipal Component Analysis and Clustering
Principal Component Analysis and Clustering
 
Prediction of house price using multiple regression
Prediction of house price using multiple regressionPrediction of house price using multiple regression
Prediction of house price using multiple regression
 
Forecasting warranty returns with Wiebull Fit
Forecasting warranty returns with Wiebull FitForecasting warranty returns with Wiebull Fit
Forecasting warranty returns with Wiebull Fit
 

En vedette

Final presentation
Final presentationFinal presentation
Final presentationheba_ahmad
 
heba alsayed ahmad_Recomm_#2
heba alsayed ahmad_Recomm_#2heba alsayed ahmad_Recomm_#2
heba alsayed ahmad_Recomm_#2heba_ahmad
 
Be a Data Scientist in 8 steps!
Be a Data Scientist in 8 steps! Be a Data Scientist in 8 steps!
Be a Data Scientist in 8 steps! PromptCloud
 
Introduction to Shiny for building web apps in R
Introduction to Shiny for building web apps in RIntroduction to Shiny for building web apps in R
Introduction to Shiny for building web apps in RPaul Richards
 
heba alsayed ahmad_Recomm_#
heba alsayed ahmad_Recomm_#heba alsayed ahmad_Recomm_#
heba alsayed ahmad_Recomm_#heba_ahmad
 
recommendation dr jose
recommendation dr joserecommendation dr jose
recommendation dr joseheba_ahmad
 
bassel alkhatib recommendation
bassel alkhatib recommendation bassel alkhatib recommendation
bassel alkhatib recommendation heba_ahmad
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data ScienceEdureka!
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientistryanorban
 

En vedette (15)

Data mining
Data miningData mining
Data mining
 
Ggplot2 ch2
Ggplot2 ch2Ggplot2 ch2
Ggplot2 ch2
 
Final presentation
Final presentationFinal presentation
Final presentation
 
The portable mba in marketing
The portable mba in marketingThe portable mba in marketing
The portable mba in marketing
 
heba alsayed ahmad_Recomm_#2
heba alsayed ahmad_Recomm_#2heba alsayed ahmad_Recomm_#2
heba alsayed ahmad_Recomm_#2
 
Shiny in R
Shiny in RShiny in R
Shiny in R
 
Be a Data Scientist in 8 steps!
Be a Data Scientist in 8 steps! Be a Data Scientist in 8 steps!
Be a Data Scientist in 8 steps!
 
Introduction to Shiny for building web apps in R
Introduction to Shiny for building web apps in RIntroduction to Shiny for building web apps in R
Introduction to Shiny for building web apps in R
 
Mba Basics
Mba BasicsMba Basics
Mba Basics
 
heba alsayed ahmad_Recomm_#
heba alsayed ahmad_Recomm_#heba alsayed ahmad_Recomm_#
heba alsayed ahmad_Recomm_#
 
recommendation dr jose
recommendation dr joserecommendation dr jose
recommendation dr jose
 
bassel alkhatib recommendation
bassel alkhatib recommendation bassel alkhatib recommendation
bassel alkhatib recommendation
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
 

Similaire à Chapter 18,19

Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfBeyaNasr1
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMSAli T. Lotia
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdfgadissaassefa
 
Deep learning concepts
Deep learning conceptsDeep learning concepts
Deep learning conceptsJoe li
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
 
CFM Challenge - Course Project
CFM Challenge - Course ProjectCFM Challenge - Course Project
CFM Challenge - Course ProjectKhalilBergaoui
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson ChallengeRaouf KESKES
 
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONGENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONijaia
 
Regression kriging
Regression krigingRegression kriging
Regression krigingFAO
 
IBM SPSS Statistics Algorithms.pdf
IBM SPSS Statistics Algorithms.pdfIBM SPSS Statistics Algorithms.pdf
IBM SPSS Statistics Algorithms.pdfNorafizah Samawi
 
Regularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptxRegularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptxMohamed Essam
 
Essay on-data-analysis
Essay on-data-analysisEssay on-data-analysis
Essay on-data-analysisRaman Kannan
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfAaryanArora10
 
How to understand and implement regression analysis
How to understand and implement regression analysisHow to understand and implement regression analysis
How to understand and implement regression analysisClaireWhittaker5
 
Numerical analysis using Scilab: Numerical stability and conditioning
Numerical analysis using Scilab: Numerical stability and conditioningNumerical analysis using Scilab: Numerical stability and conditioning
Numerical analysis using Scilab: Numerical stability and conditioningScilab
 
Deep learning MindMap
Deep learning MindMapDeep learning MindMap
Deep learning MindMapAshish Patel
 
Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom IndustrySatyam Barsaiyan
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & UnderfittingSOUMIT KAR
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Max Kleiner
 

Similaire à Chapter 18,19 (20)

Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
 
Deep learning concepts
Deep learning conceptsDeep learning concepts
Deep learning concepts
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
CFM Challenge - Course Project
CFM Challenge - Course ProjectCFM Challenge - Course Project
CFM Challenge - Course Project
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
 
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONGENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
 
Regression kriging
Regression krigingRegression kriging
Regression kriging
 
IBM SPSS Statistics Algorithms.pdf
IBM SPSS Statistics Algorithms.pdfIBM SPSS Statistics Algorithms.pdf
IBM SPSS Statistics Algorithms.pdf
 
Regularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptxRegularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptx
 
Essay on-data-analysis
Essay on-data-analysisEssay on-data-analysis
Essay on-data-analysis
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdf
 
How to understand and implement regression analysis
How to understand and implement regression analysisHow to understand and implement regression analysis
How to understand and implement regression analysis
 
Numerical analysis using Scilab: Numerical stability and conditioning
Numerical analysis using Scilab: Numerical stability and conditioningNumerical analysis using Scilab: Numerical stability and conditioning
Numerical analysis using Scilab: Numerical stability and conditioning
 
Deep learning MindMap
Deep learning MindMapDeep learning MindMap
Deep learning MindMap
 
Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom Industry
 
Regresión
RegresiónRegresión
Regresión
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62
 

Dernier

Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
The Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsThe Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsRommel Regala
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxElton John Embodo
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 

Dernier (20)

YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
The Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsThe Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World Politics
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docx
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 

Chapter 18,19

  • 2. Building a model can be a never-ending process IMPROVE THE MODEL ADDING INTERACTION S Taking away variables Doing transformation
  • 3. How do we judge the quality of the model? The answer: in relation to other models, using an analysis of residuals, a drop-in-deviance test, the results of an ANOVA test, the Wald test, the AIC or BIC score, cross-validation error, or bootstrapping.
  • 4. 18.1. Residuals The difference between the actual response and the fitted values. Recall the linear model formulation, where the errors, akin to residuals, are normally distributed. The basic idea is that if the model is appropriately fitted to the data, the residuals should be normally distributed as well.
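A minimal sketch of this idea (not from the slides): fit a linear model, extract the residuals, and check them for normality. The built-in mtcars data is a stand-in here, not the book's dataset.

```r
# Fit a linear model to the built-in mtcars data
mod <- lm(mpg ~ wt + cyl, data = mtcars)

res <- residuals(mod)   # actual response minus fitted values
fit <- fitted(mod)

# Residuals plus fitted values reconstruct the observed response exactly
all.equal(unname(fit + res), unname(mtcars$mpg))

# A quick formal check that the residuals look normally distributed
shapiro.test(res)
```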
  • 6. Remember: each coefficient is plotted as a point, with a thick line representing the one-standard-error confidence interval and a thin line representing the two-standard-error confidence interval. There is a vertical line indicating 0. In general, a good rule of thumb is that if the two-standard-error confidence interval does not contain 0, the coefficient is statistically significant.
  • 9. ggplot2 with linear regression: ggplot2 has a handy trick for dealing with lm models. We can use the model as the data source and ggplot2 “fortifies” it, creating new columns for easy plotting. The basic structure for ggplot2 starts with the ggplot function, which at its most basic should take the data as its first argument. It can take more arguments, or fewer, but we will stick with that for now. After initializing the object, we add layers using the + symbol. To start, we will just discuss geometric layers such as points, lines and histograms. They are included using functions like geom_point, geom_line and geom_histogram. These functions take multiple arguments, the most important being which variable in the data gets mapped to which axis or other aesthetic, using aes. Furthermore, each layer can have different aesthetic mappings and even different data.
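A sketch of the fortify trick described above: passing an lm object where ggplot2 expects data makes it build columns such as .fitted, .resid and .stdresid. Again, mtcars stands in for the book's data.

```r
library(ggplot2)

mod <- lm(mpg ~ wt + cyl, data = mtcars)

# Residuals-versus-fitted plot, using the model itself as the data source;
# ggplot2 "fortifies" the lm object into a data frame behind the scenes
ggplot(aes(x = .fitted, y = .resid), data = mod) +
    geom_point() +
    geom_hline(yintercept = 0) +
    geom_smooth(se = FALSE) +
    labs(x = "Fitted Values", y = "Residuals")
```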
  • 15. Q-Q plot: If the model is a good fit, the standardized residuals should all fall along a straight line when plotted against the theoretical quantiles of the normal distribution. Both the base graphics and ggplot2 versions are shown on the next slide.
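The slide's plot images did not survive extraction; this sketch shows both versions on a stand-in model fitted to mtcars.

```r
mod <- lm(mpg ~ wt + cyl, data = mtcars)

# Base graphics: panel 2 of plot.lm is the normal Q-Q plot
plot(mod, which = 2)

# ggplot2 version: plot the standardized residuals against theoretical
# quantiles; points on the reference line indicate a good fit
library(ggplot2)
ggplot(mod, aes(sample = .stdresid)) +
    stat_qq() +
    geom_abline()
```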
  • 17. Histogram of the residuals: This time we will not be showing the base graphics alternative, because a histogram is a standard plot that we have shown repeatedly. Here the histogram is not normally distributed, meaning the model is not entirely correct.
  • 19. All of this measuring of model fit only really makes sense when comparing multiple models, because all of these measures are relative.
  • 22. where: ni is the number of observations in group i, Ȳi is the mean of group i, Ȳ is the overall mean, Yij is observation j in group i, N is the total number of observations and K is the number of groups.
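The formula image on this slide was lost in extraction; the standard one-way ANOVA F-statistic that these definitions describe is:

```latex
F = \frac{\sum_{i} n_i \left(\bar{Y}_i - \bar{Y}\right)^2 / (K - 1)}
         {\sum_{ij} \left(Y_{ij} - \bar{Y}_i\right)^2 / (N - K)}
```

The numerator measures variation between the group means and the denominator measures variation within the groups.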
  • 23. While ANOVA may be debatable as a multisample test, we do believe it serves a useful purpose in testing the relative merits of different models. Simply passing multiple model objects to anova will return a table of results including the residual sum of squares (RSS), which is a measure of error; the lower, the better.
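A sketch of that comparison with anova(), using nested models on mtcars as a stand-in:

```r
# Three nested models of increasing complexity
mod1 <- lm(mpg ~ wt, data = mtcars)
mod2 <- lm(mpg ~ wt + cyl, data = mtcars)
mod3 <- lm(mpg ~ wt + cyl + hp, data = mtcars)

# The resulting table reports RSS for each model; lower is better
anova(mod1, mod2, mod3)
```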
  • 25. AIC & BIC: the Akaike Information Criterion (AIC). As with RSS, the model with the lowest AIC—even negative values—is considered optimal. The BIC (Bayesian Information Criterion) is a similar measure where, once again, lower is better.
  • 26. The formulas for AIC and BIC are:
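The formula image on this slide did not survive extraction; the standard definitions are:

```latex
\mathrm{AIC} = -2\ln(L) + 2p
\qquad
\mathrm{BIC} = -2\ln(L) + p\,\ln(n)
```

where L is the maximized likelihood of the model, p is the number of estimated parameters and n is the number of observations. BIC penalizes additional parameters more heavily than AIC once ln(n) exceeds 2.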
  • 32. The results from cv.glm include delta, which has two numbers:  the raw cross-validation error, based on the cost function (in this case the mean squared error, which is a measure of correctness for an estimator) for all the folds;  and the adjusted cross-validation error. This second number compensates for not using leave-one-out cross-validation, which is like k-fold cross-validation except that each fold consists of all but one data point, with that one point held out. This is very accurate but highly computationally intensive.
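A sketch of k-fold cross-validation with boot::cv.glm, using a gaussian glm on mtcars as a stand-in for the slides' model:

```r
library(boot)

# cv.glm requires a glm object, even for an ordinary linear model
mod <- glm(mpg ~ wt + cyl, data = mtcars, family = gaussian)

set.seed(1234)               # fold assignment is random
cvResult <- cv.glm(mtcars, mod, K = 5)

# delta[1]: raw cross-validation (mean squared) error across the folds
# delta[2]: adjusted error, compensating for not using leave-one-out
cvResult$delta
```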
  • 33. Although we got a nice number for the error, it helps us only if we can compare it to other models.
  • 37. Bootstrapping  The idea is that we start with n rows of data. Some statistic (whether a mean, regression or some arbitrary function) is applied to the data.  Then the data are sampled, creating a new dataset.  This new set still has n rows, except that some rows are repeated and other rows are entirely missing.  The statistic is applied to this new dataset.  The process is repeated R times (typically around 1,200), which generates an entire distribution for the statistic.  This distribution can then be used to find the mean and confidence interval (typically 95%) for the statistic.  The boot package is a very robust set of tools for making the bootstrap easy to compute.
  • 39. Bootstrapping  The way to compute the batting average is to divide total hits by total at bats. This means we cannot simply run mean(h/ab) and sd(h/ab) to get the mean and standard deviation. Rather, the batting average is calculated as sum(h)/sum(ab) and its standard deviation is not easily calculated. This problem is a great candidate for using the bootstrap.  We calculate the overall batting average with the original data. Then we sample n rows with replacement and calculate the batting average again. We do this repeatedly until a distribution is formed. Rather than doing this manually, though, we use boot.  The first argument to boot is the data. The second argument is the function that is to be computed on the data. This function must take at least two arguments.  The beautiful thing about the bootstrap is its near universal applicability. It can be used in just about any situation where an analytical solution is impractical or impossible.
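A sketch of bootstrapping this ratio statistic with boot. The data frame and its numbers are made up for illustration; only the column names h and ab follow the slide.

```r
library(boot)

# Made-up hits (h) and at-bats (ab); the real example uses baseball data
batting <- data.frame(h = c(150, 120, 180, 90), ab = c(500, 430, 600, 350))

# The statistic function must take the data and a vector of row indices
avgFunc <- function(data, indices) {
    with(data[indices, ], sum(h) / sum(ab))
}

set.seed(1234)
avgBoot <- boot(data = batting, statistic = avgFunc, R = 1200)

avgBoot                          # original estimate, bias and standard error
boot.ci(avgBoot, type = "perc")  # bootstrapped 95% confidence interval
hist(avgBoot$t, main = "Bootstrapped batting average")
```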
  • 41. Visualizing the distribution is as simple as plotting a histogram of the replicate results
  • 42. 18.5. Stepwise Variable Selection  A common, though becoming increasingly discouraged, way to select variables for a model is stepwise selection. This is the process of iteratively adding and removing variables from a model and testing the model at each step, usually using AIC. Return to the book to see all results.
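A sketch of AIC-based stepwise selection with step(), searching between a null and a full model; mtcars is a stand-in dataset.

```r
# Null model (intercept only) and full model (all predictors)
nullMod <- lm(mpg ~ 1, data = mtcars)
fullMod <- lm(mpg ~ ., data = mtcars)

# Iteratively add and remove variables, judging each step by AIC
stepMod <- step(nullMod,
                scope = list(lower = nullMod, upper = fullMod),
                direction = "both")
summary(stepMod)
```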
  • 43. 18.6. Conclusion  Determining the quality of a model is an important step in the model-building process. This can take the form of traditional tests of fit such as ANOVA or more modern techniques like cross-validation.  The bootstrap is another means of determining model uncertainty, especially for models where confidence intervals are impractical to calculate. These measures can all help in selecting which variables are included in a model and which are excluded.
  • 44. Chapter 19. Regularization and Shrinkage  19.1. Elastic Net  a dynamic blending of lasso and ridge regression.  The lasso uses an L1 penalty to perform variable selection and dimension reduction, while the ridge uses an L2 penalty to shrink the coefficients for more stable predictions.
  • 45.  The formula for the Elastic Net is:  where λ is a complexity parameter controlling the amount of shrinkage (0 is no penalty and ∞ is complete penalty)  α regulates how much of the solution is ridge versus lasso with α = 0 being complete ridge and α = 1 being complete lasso.  Γ, not seen here, is a vector of penalty factors—one value per variable—that multiplies λ for fine tuning of the penalty applied to each variable;
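The formula image for this slide was lost in extraction; the Elastic Net objective matching these definitions, as given in the glmnet documentation, is:

```latex
\min_{\beta_0,\,\beta} \;
\frac{1}{N} \sum_{i=1}^{N} l\!\left(y_i,\; \beta_0 + \beta^{T} x_i\right)
+ \lambda \left[ (1 - \alpha)\,\frac{\lVert \beta \rVert_2^2}{2}
+ \alpha\, \lVert \beta \rVert_1 \right]
```

where l is the (negative) log-likelihood contribution of observation i; α = 0 gives the pure ridge (L2) penalty and α = 1 the pure lasso (L1) penalty.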
  • 47. Glmnet  fits generalized linear models with the Elastic Net penalty.  It is designed for speed and larger, sparser data.  Where functions like lm and glm take a formula to specify the model, glmnet requires a matrix of predictors (without an intercept column, since glmnet fits its own intercept) and a response vector or matrix.
  • 48. We will look at the American Community Survey (ACS) data for New York State. We will throw every possible predictor into the model and see which are selected.
  • 49.  λ controls the amount of shrinkage.  By default glmnet fits the regularization path on 100 different values of λ.  The glmnet package has a function, cv.glmnet, that computes the cross-validation automatically. By default α = 1, meaning only the lasso is calculated.  Selecting the best α requires an additional layer of cross-validation.
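A sketch of cross-validating the lasso with cv.glmnet. The predictor and response matrices here are built from mtcars as a stand-in for the ACS data.

```r
library(glmnet)

acsX <- as.matrix(mtcars[, -1])   # predictors, no intercept column
acsY <- mtcars$mpg                # response

set.seed(1234)                    # fold assignment is random
cvFit <- cv.glmnet(x = acsX, y = acsY, nfolds = 5)  # alpha = 1 (lasso) by default

cvFit$lambda.min   # lambda with the lowest cross-validation error
cvFit$lambda.1se   # largest lambda within one standard error of the minimum
plot(cvFit)        # cross-validation error along the lambda path
coef(cvFit, s = "lambda.1se")  # which variables were selected
```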
  • 51. Visualizing where variables enter the model along the λ path can be illuminating
  • 52. Finding the optimal value of α requires an additional layer of cross-validation, and unfortunately glmnet does not do that automatically. This will require us to run cv.glmnet at various levels of α, which will take a fairly large chunk of time if performed sequentially, making this a good time to use parallelization. The most straightforward way to run code in parallel is to use the parallel, doParallel and foreach packages.  First, we build some helper objects to speed along the process.  When a two-layered cross-validation is run, an observation should fall in the same fold each time, so we build a vector specifying fold membership.  We also specify the sequence of α values that foreach will loop over.  It is generally considered better to lean toward the lasso rather than the ridge, so we consider only α values greater than 0.5.
  • 53. Before running a parallel job, a cluster (even on a single machine) must be started and registered with makeCluster and registerDoParallel. After the job is done, the cluster should be stopped with stopCluster. Setting .errorhandling to 'remove' means that if an error occurs, that iteration will be skipped. Setting .inorder to FALSE means that the order of combining the results does not matter and they can be combined whenever returned, which yields significant speed improvements. Because we are using the default combination function, list, which takes multiple arguments at once, we can speed up the process by setting .multicombine to TRUE. We specify in .packages that glmnet should be loaded on each of the workers, again leading to performance improvements. The operator %dopar% tells foreach to work in parallel. Parallel computing can be dependent on the environment, so we explicitly load some variables into the foreach environment using .export, namely acsX, acsY, alphas and theFolds.
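A sketch tying those pieces together. The variable names (acsX, acsY, alphas, theFolds) follow the slide; the data is a stand-in built from mtcars, and the cluster size and seed are arbitrary choices.

```r
library(parallel)
library(doParallel)
library(foreach)
library(glmnet)

acsX <- as.matrix(mtcars[, -1])   # stand-in for the ACS predictor matrix
acsY <- mtcars$mpg                # stand-in for the ACS response

# Fixed fold membership so every observation lands in the same fold each time
set.seed(2834673)
theFolds <- sample(rep(1:5, length.out = nrow(acsX)))

# Lean toward the lasso: only alpha values of 0.5 and above
alphas <- seq(from = 0.5, to = 1, by = 0.05)

cl <- makeCluster(2)
registerDoParallel(cl)

# One cv.glmnet fit per alpha, run in parallel across the workers
acsDouble <- foreach(i = 1:length(alphas),
                     .errorhandling = "remove", .inorder = FALSE,
                     .multicombine = TRUE,
                     .export = c("acsX", "acsY", "alphas", "theFolds"),
                     .packages = "glmnet") %dopar% {
    cv.glmnet(x = acsX, y = acsY, alpha = alphas[i], foldid = theFolds)
}

stopCluster(cl)
```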
  • 61. 19.2. Bayesian Shrinkage  Useful when a model is built on data that does not have a large enough number of rows for some combinations of the variables. For this example, we blatantly steal an example