SlideShare une entreprise Scribd logo
1  sur  15
Geometric and
Topological Extensions
of Regression Models
Colleen M. Farrelly
Background
Introduction
 Real data is messy.
 Large volumes
 Small volumes
 More predictors than individuals
 Missing data
 Correlated predictors
 The messiness of data can create computational
issues for algorithms based on linear algebra
solvers.
 Least squares algorithm
 Principle components algorithm
 Introducing solvers based on topology and
geometry can mitigate some of these issues and
produce robust algorithms.
Generalized Linear Models
 Flexible extensions of multiple
regression (Gaussian distribution)
common in data science today:
 Yes/no outcomes (binomial distribution)
 Count outcomes (Poisson distribution)
 Survival models (Weibull distribution)
 Transforms regression equation to fit
the outcome distribution
 Sort of like silly putty stretching the
outcome variable in the data space
 Suffers same drawbacks as multiple
regression:
 P>n
 Correlations between predictors
 Local optima
 Impose penalties on the generalized linear
model frameworks:
 Sparsity (set most estimates to 0 to reduce
model size and complexity)
 Robustness (generalizability of the results
under noise)
 Reduce the number of predictors
 Shrink some predictor estimates to 0
 Examine sets of similar predictors
 Similar to a cowboy at the origin roping
coefficients that get too close
 Includes LASSO, LARS, elastic net, and
ridge regression, among others
Penalized Regression Models
Homotopy-Based LASSO (lasso2)
 Homotopy arrow example
◦ Red and blue arrows
 Anchor start and finish points
 Wiggle middle parts of the line until
arrows overlap
◦ Yellow arrow
 Hole presents issues
 Can’t wiggle into blue or red arrow
without breaking the yellow arrow
 Homotopy LASSO/LARS wiggles an
easy regression path into an
optimal regression path
◦ Avoids local optima
 Peaks
 Valleys
 Saddles
 R package lasso2 implements for a
variety of outcome types
 Homotopy as path equivalence
◦ Intrinsic property of topological
spaces
 Instead of fitting model to data, fit model to tangent space (what isn’t
the data)
 Deals with collinearity, as parallel vectors share the same tangent space
 LARS/LASSO extensions
 Partition model into sets of predictors based on tangent space
 Fit sets that correspond well to an outcome
 Rao scoring for selection.
 Effect estimates (angles)
 Model selection criteria
 Information criteria
 Deviance scoring
 New extensions of R package dglars
 Most exponential family distributions
 Binomial
 Poisson
 Gaussian
 Gamma
Differential Geometry and Regression (dglars)
Applications in R
Example Dataset (Open-Source)
 Link to code and data:
 https://www.researchgate.net/project/Miami-Data-Science-Meetup
 https://archive.ics.uci.edu/ml/datasets/Student+Performance (original downloaded data)
 Code:
#load data
mydata<-read.csv("MathScores.csv")
#retrieve only first term scores
mydata<-mydata[,-c(32:33)]
#split to train and test set
s<-sample(1:395,0.7*395)
train<-mydata[s,]
test<-mydata[-s,]
lasso2 Package
 R package implementing homotopy-based LASSO model
 Example pieces of code for logistic regression:
library(lasso2)
#run the model, can use multiple bounds and compare fit
etastart<-NULL
las<-gl1ce(G1~., train, family=gaussian(link=identity), bound=5, standardize=F)
#predict scores of test group
lpred<-predict(las, test, link="response")
sum((lpred-test$G1)^2)/119
#compare to MSE of mean model
sum((mean(test$G1)-test$G1)^2)/119
#obtain coefficients
coef(las)
#obtain deviance estimate (model fit—can be used to derive AIC/BIC)
deviance(las)
 Try it out on your dataset!
dglars Package
 R package implementing differential-geometry-based LARS algorithm
 Example pieces of code for logistic regression:
library(dglars)
dg<-dglars(G1~., family="gaussian", data=train)
#can also use cross-validation (cvdglars() function)
dg2<-cvdglars(G1~., family="gaussian", data=train)
#summary of the model
summary(dg)
#extract coefficients from matrix of coefficients at each step
coef(dg)
#obtain model fit statistics, can also use logLik(dg)
AIC(dg)
AIC(dg2)
#plot path of LARS algorithm or model fit for cross-validated model
plot(dg)
plot(dg2)
 Try it out on your dataset!
Compare with multiple linear regression
#compare DGLARS with multiple linear regression
gl<-lm(G1~., data=train)
AIC(gl) #1418
AIC(dg) #1402
AIC(dg2) #1403
#obtain coefficients to compare with both penalized models
summary(gl)
#Compare prediction accuracy
pred<-predict(gl, test, link="response")
sum((pred-test$G1)^2)/119
sum((lpred-test$G1)^2)/119
sum((mean(test$G1)-test$G1)^2)/119
Conclusions and References
Summary
 Geometry and topology can be leveraged to improve generalized linear
regression and penalized regression model performance, particularly when
data suffers from general “messiness.”
 Multiple R packages exist to implement these algorithms, and algorithms are
built to accommodate many common exponential family distributions of
outcomes.
 Packages provide interpretable models similar to generalized linear
regression, model fit statistics, and prediction capabilities.
 Many more extensions of regression are possible, and there is work being done
to modify other algorithms based on topology and differential geometry.
Open-Source References
 Augugliaro, L., & Mineo, A. (2013, September). Estimation of sparse
generalized linear models: the dglars package. In 9th Scientific Meeting of
the Classification and Data Analysis Group (pp. 20-23). Tommaso Minerva,
Isabella Morlini, Francesco Palumbo.
 Farrelly, C. M. (2017). Topology and Geometry in Machine Learning for Logistic
Regression.
 Lokhorst, J., Venables, B., Turlach, B., & Turlach, M. B. (2013). Package
‘lasso2’.
 Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). A new approach to
variable selection in least squares problems. IMA journal of numerical
analysis, 20(3), 389-403.
 R package tutorials:
 https://cran.r-project.org/web/packages/dglars/dglars.pdf
 https://cran.r-project.org/web/packages/lasso2/lasso2.pdf

Contenu connexe

Tendances

Logistic regression: topological and geometric considerations
Logistic regression: topological and geometric considerationsLogistic regression: topological and geometric considerations
Logistic regression: topological and geometric considerationsColleen Farrelly
 
Machine Learning by Analogy II
Machine Learning by Analogy IIMachine Learning by Analogy II
Machine Learning by Analogy IIColleen Farrelly
 
Quantum persistent k cores for community detection
Quantum persistent k cores for community detectionQuantum persistent k cores for community detection
Quantum persistent k cores for community detectionColleen Farrelly
 
Machine Learning by Analogy
Machine Learning by AnalogyMachine Learning by Analogy
Machine Learning by AnalogyColleen Farrelly
 
Deep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problemsDeep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problemsColleen Farrelly
 
Empirical Network Classification
Empirical Network ClassificationEmpirical Network Classification
Empirical Network ClassificationColleen Farrelly
 
Multiscale Mapper Networks
Multiscale Mapper NetworksMultiscale Mapper Networks
Multiscale Mapper NetworksColleen Farrelly
 
2021 American Mathematical Society Data Science Talk
2021 American Mathematical Society Data Science Talk2021 American Mathematical Society Data Science Talk
2021 American Mathematical Society Data Science TalkColleen Farrelly
 
Cluster analysis for market segmentation
Cluster analysis for market segmentationCluster analysis for market segmentation
Cluster analysis for market segmentationVishal Tandel
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysissaba khan
 
Marketing analytics - clustering Types
Marketing analytics - clustering TypesMarketing analytics - clustering Types
Marketing analytics - clustering TypesSuryakumar Thangarasu
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction StratergiesAnjaliSoorej
 
Slides distancecovariance
Slides distancecovarianceSlides distancecovariance
Slides distancecovarianceShrey Nishchal
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisJaclyn Kokx
 
Cannonical Correlation
Cannonical CorrelationCannonical Correlation
Cannonical Correlationdomsr
 

Tendances (20)

Logistic regression: topological and geometric considerations
Logistic regression: topological and geometric considerationsLogistic regression: topological and geometric considerations
Logistic regression: topological and geometric considerations
 
Morse-Smale Regression
Morse-Smale RegressionMorse-Smale Regression
Morse-Smale Regression
 
Machine Learning by Analogy II
Machine Learning by Analogy IIMachine Learning by Analogy II
Machine Learning by Analogy II
 
Quantum persistent k cores for community detection
Quantum persistent k cores for community detectionQuantum persistent k cores for community detection
Quantum persistent k cores for community detection
 
Machine Learning by Analogy
Machine Learning by AnalogyMachine Learning by Analogy
Machine Learning by Analogy
 
Topology for data science
Topology for data scienceTopology for data science
Topology for data science
 
Deep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problemsDeep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problems
 
Empirical Network Classification
Empirical Network ClassificationEmpirical Network Classification
Empirical Network Classification
 
Multiscale Mapper Networks
Multiscale Mapper NetworksMultiscale Mapper Networks
Multiscale Mapper Networks
 
2021 American Mathematical Society Data Science Talk
2021 American Mathematical Society Data Science Talk2021 American Mathematical Society Data Science Talk
2021 American Mathematical Society Data Science Talk
 
Cluster analysis for market segmentation
Cluster analysis for market segmentationCluster analysis for market segmentation
Cluster analysis for market segmentation
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Marketing analytics - clustering Types
Marketing analytics - clustering TypesMarketing analytics - clustering Types
Marketing analytics - clustering Types
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction Stratergies
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Slides distancecovariance
Slides distancecovarianceSlides distancecovariance
Slides distancecovariance
 
Canonical correlation
Canonical correlationCanonical correlation
Canonical correlation
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant Analysis
 
Summary2 (1)
Summary2 (1)Summary2 (1)
Summary2 (1)
 
Cannonical Correlation
Cannonical CorrelationCannonical Correlation
Cannonical Correlation
 

Similaire à Data Science Meetup: DGLARS and Homotopy LASSO for Regression Models

Regression kriging
Regression krigingRegression kriging
Regression krigingFAO
 
Demography 7263 fall 2015 spatially autoregressive models 2
Demography 7263 fall 2015 spatially autoregressive models 2Demography 7263 fall 2015 spatially autoregressive models 2
Demography 7263 fall 2015 spatially autoregressive models 2Corey Sparks
 
Relaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataRelaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataAlessandro Adamou
 
Statistical Clustering
Statistical ClusteringStatistical Clustering
Statistical Clusteringtim_hare
 
Big Data Processing using a AWS Dataset
Big Data Processing using a AWS DatasetBig Data Processing using a AWS Dataset
Big Data Processing using a AWS DatasetVishva Abeyrathne
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using RUmmiya Mohammedi
 
Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analyticstempledf
 
Accounting for uncertainty in species delineation during the analysis of envi...
Accounting for uncertainty in species delineation during the analysis of envi...Accounting for uncertainty in species delineation during the analysis of envi...
Accounting for uncertainty in species delineation during the analysis of envi...methodsecolevol
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMSAli T. Lotia
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
result analysis for deep leakage from gradients
result analysis for deep leakage from gradientsresult analysis for deep leakage from gradients
result analysis for deep leakage from gradients國騰 丁
 
Jgrass-NewAge: Kriging component
Jgrass-NewAge: Kriging componentJgrass-NewAge: Kriging component
Jgrass-NewAge: Kriging componentNiccolò Tubini
 
A course work on R programming for basics to advance statistics and GIS.pdf
A course work on R programming for basics to advance statistics and GIS.pdfA course work on R programming for basics to advance statistics and GIS.pdf
A course work on R programming for basics to advance statistics and GIS.pdfSEEMAB AKHTAR
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageMajid Abdollahi
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
Surrogate modeling for industrial design
Surrogate modeling for industrial designSurrogate modeling for industrial design
Surrogate modeling for industrial designShinwoo Jang
 
Variable selection for classification and regression using R
Variable selection for classification and regression using RVariable selection for classification and regression using R
Variable selection for classification and regression using RGregg Barrett
 
Building Predictive Models R_caret language
Building Predictive Models R_caret languageBuilding Predictive Models R_caret language
Building Predictive Models R_caret languagejaved khan
 

Similaire à Data Science Meetup: DGLARS and Homotopy LASSO for Regression Models (20)

Regression kriging
Regression krigingRegression kriging
Regression kriging
 
Demography 7263 fall 2015 spatially autoregressive models 2
Demography 7263 fall 2015 spatially autoregressive models 2Demography 7263 fall 2015 spatially autoregressive models 2
Demography 7263 fall 2015 spatially autoregressive models 2
 
Colombo14a
Colombo14aColombo14a
Colombo14a
 
User biglm
User biglmUser biglm
User biglm
 
Relaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataRelaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked data
 
Statistical Clustering
Statistical ClusteringStatistical Clustering
Statistical Clustering
 
Big Data Processing using a AWS Dataset
Big Data Processing using a AWS DatasetBig Data Processing using a AWS Dataset
Big Data Processing using a AWS Dataset
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using R
 
Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analytics
 
Accounting for uncertainty in species delineation during the analysis of envi...
Accounting for uncertainty in species delineation during the analysis of envi...Accounting for uncertainty in species delineation during the analysis of envi...
Accounting for uncertainty in species delineation during the analysis of envi...
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
result analysis for deep leakage from gradients
result analysis for deep leakage from gradientsresult analysis for deep leakage from gradients
result analysis for deep leakage from gradients
 
Jgrass-NewAge: Kriging component
Jgrass-NewAge: Kriging componentJgrass-NewAge: Kriging component
Jgrass-NewAge: Kriging component
 
A course work on R programming for basics to advance statistics and GIS.pdf
A course work on R programming for basics to advance statistics and GIS.pdfA course work on R programming for basics to advance statistics and GIS.pdf
A course work on R programming for basics to advance statistics and GIS.pdf
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R Language
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Surrogate modeling for industrial design
Surrogate modeling for industrial designSurrogate modeling for industrial design
Surrogate modeling for industrial design
 
Variable selection for classification and regression using R
Variable selection for classification and regression using RVariable selection for classification and regression using R
Variable selection for classification and regression using R
 
Building Predictive Models R_caret language
Building Predictive Models R_caret languageBuilding Predictive Models R_caret language
Building Predictive Models R_caret language
 

Plus de Colleen Farrelly

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Hands-On Network Science, PyData Global 2023
Hands-On Network Science, PyData Global 2023Hands-On Network Science, PyData Global 2023
Hands-On Network Science, PyData Global 2023Colleen Farrelly
 
Modeling Climate Change.pptx
Modeling Climate Change.pptxModeling Climate Change.pptx
Modeling Climate Change.pptxColleen Farrelly
 
Natural Language Processing for Beginners.pptx
Natural Language Processing for Beginners.pptxNatural Language Processing for Beginners.pptx
Natural Language Processing for Beginners.pptxColleen Farrelly
 
The Shape of Data--ODSC.pptx
The Shape of Data--ODSC.pptxThe Shape of Data--ODSC.pptx
The Shape of Data--ODSC.pptxColleen Farrelly
 
Generative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxGenerative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxColleen Farrelly
 
Emerging Technologies for Public Health in Remote Locations.pptx
Emerging Technologies for Public Health in Remote Locations.pptxEmerging Technologies for Public Health in Remote Locations.pptx
Emerging Technologies for Public Health in Remote Locations.pptxColleen Farrelly
 
Applications of Forman-Ricci Curvature.pptx
Applications of Forman-Ricci Curvature.pptxApplications of Forman-Ricci Curvature.pptx
Applications of Forman-Ricci Curvature.pptxColleen Farrelly
 
Geometry for Social Good.pptx
Geometry for Social Good.pptxGeometry for Social Good.pptx
Geometry for Social Good.pptxColleen Farrelly
 
Topology for Time Series.pptx
Topology for Time Series.pptxTopology for Time Series.pptx
Topology for Time Series.pptxColleen Farrelly
 
Time Series Applications AMLD.pptx
Time Series Applications AMLD.pptxTime Series Applications AMLD.pptx
Time Series Applications AMLD.pptxColleen Farrelly
 
An introduction to quantum machine learning.pptx
An introduction to quantum machine learning.pptxAn introduction to quantum machine learning.pptx
An introduction to quantum machine learning.pptxColleen Farrelly
 
An introduction to time series data with R.pptx
An introduction to time series data with R.pptxAn introduction to time series data with R.pptx
An introduction to time series data with R.pptxColleen Farrelly
 
NLP: Challenges and Opportunities in Underserved Areas
NLP: Challenges and Opportunities in Underserved AreasNLP: Challenges and Opportunities in Underserved Areas
NLP: Challenges and Opportunities in Underserved AreasColleen Farrelly
 
Geometry, Data, and One Path Into Data Science.pptx
Geometry, Data, and One Path Into Data Science.pptxGeometry, Data, and One Path Into Data Science.pptx
Geometry, Data, and One Path Into Data Science.pptxColleen Farrelly
 
Topological Data Analysis.pptx
Topological Data Analysis.pptxTopological Data Analysis.pptx
Topological Data Analysis.pptxColleen Farrelly
 
Transforming Text Data to Matrix Data via Embeddings.pptx
Transforming Text Data to Matrix Data via Embeddings.pptxTransforming Text Data to Matrix Data via Embeddings.pptx
Transforming Text Data to Matrix Data via Embeddings.pptxColleen Farrelly
 
Natural Language Processing in the Wild.pptx
Natural Language Processing in the Wild.pptxNatural Language Processing in the Wild.pptx
Natural Language Processing in the Wild.pptxColleen Farrelly
 
SAS Global 2021 Introduction to Natural Language Processing
SAS Global 2021 Introduction to Natural Language Processing SAS Global 2021 Introduction to Natural Language Processing
SAS Global 2021 Introduction to Natural Language Processing Colleen Farrelly
 
WIDS 2021--An Introduction to Network Science
WIDS 2021--An Introduction to Network ScienceWIDS 2021--An Introduction to Network Science
WIDS 2021--An Introduction to Network ScienceColleen Farrelly
 

Plus de Colleen Farrelly (20)

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Hands-On Network Science, PyData Global 2023
Hands-On Network Science, PyData Global 2023Hands-On Network Science, PyData Global 2023
Hands-On Network Science, PyData Global 2023
 
Modeling Climate Change.pptx
Modeling Climate Change.pptxModeling Climate Change.pptx
Modeling Climate Change.pptx
 
Natural Language Processing for Beginners.pptx
Natural Language Processing for Beginners.pptxNatural Language Processing for Beginners.pptx
Natural Language Processing for Beginners.pptx
 
The Shape of Data--ODSC.pptx
The Shape of Data--ODSC.pptxThe Shape of Data--ODSC.pptx
The Shape of Data--ODSC.pptx
 
Generative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxGenerative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptx
 
Emerging Technologies for Public Health in Remote Locations.pptx
Emerging Technologies for Public Health in Remote Locations.pptxEmerging Technologies for Public Health in Remote Locations.pptx
Emerging Technologies for Public Health in Remote Locations.pptx
 
Applications of Forman-Ricci Curvature.pptx
Applications of Forman-Ricci Curvature.pptxApplications of Forman-Ricci Curvature.pptx
Applications of Forman-Ricci Curvature.pptx
 
Geometry for Social Good.pptx
Geometry for Social Good.pptxGeometry for Social Good.pptx
Geometry for Social Good.pptx
 
Topology for Time Series.pptx
Topology for Time Series.pptxTopology for Time Series.pptx
Topology for Time Series.pptx
 
Time Series Applications AMLD.pptx
Time Series Applications AMLD.pptxTime Series Applications AMLD.pptx
Time Series Applications AMLD.pptx
 
An introduction to quantum machine learning.pptx
An introduction to quantum machine learning.pptxAn introduction to quantum machine learning.pptx
An introduction to quantum machine learning.pptx
 
An introduction to time series data with R.pptx
An introduction to time series data with R.pptxAn introduction to time series data with R.pptx
An introduction to time series data with R.pptx
 
NLP: Challenges and Opportunities in Underserved Areas
NLP: Challenges and Opportunities in Underserved AreasNLP: Challenges and Opportunities in Underserved Areas
NLP: Challenges and Opportunities in Underserved Areas
 
Geometry, Data, and One Path Into Data Science.pptx
Geometry, Data, and One Path Into Data Science.pptxGeometry, Data, and One Path Into Data Science.pptx
Geometry, Data, and One Path Into Data Science.pptx
 
Topological Data Analysis.pptx
Topological Data Analysis.pptxTopological Data Analysis.pptx
Topological Data Analysis.pptx
 
Transforming Text Data to Matrix Data via Embeddings.pptx
Transforming Text Data to Matrix Data via Embeddings.pptxTransforming Text Data to Matrix Data via Embeddings.pptx
Transforming Text Data to Matrix Data via Embeddings.pptx
 
Natural Language Processing in the Wild.pptx
Natural Language Processing in the Wild.pptxNatural Language Processing in the Wild.pptx
Natural Language Processing in the Wild.pptx
 
SAS Global 2021 Introduction to Natural Language Processing
SAS Global 2021 Introduction to Natural Language Processing SAS Global 2021 Introduction to Natural Language Processing
SAS Global 2021 Introduction to Natural Language Processing
 
WIDS 2021--An Introduction to Network Science
WIDS 2021--An Introduction to Network ScienceWIDS 2021--An Introduction to Network Science
WIDS 2021--An Introduction to Network Science
 

Dernier

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 

Dernier (20)

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 

Data Science Meetup: DGLARS and Homotopy LASSO for Regression Models

  • 1. Geometric and Topological Extensions of Regression Models Colleen M. Farrelly
  • 3. Introduction  Real data is messy.  Large volumes  Small volumes  More predictors than individuals  Missing data  Correlated predictors  The messiness of data can create computational issues for algorithms based on linear algebra solvers.  Least squares algorithm  Principle components algorithm  Introducing solvers based on topology and geometry can mitigate some of these issues and produce robust algorithms.
  • 4. Generalized Linear Models  Flexible extensions of multiple regression (Gaussian distribution) common in data science today:  Yes/no outcomes (binomial distribution)  Count outcomes (Poisson distribution)  Survival models (Weibull distribution)  Transforms regression equation to fit the outcome distribution  Sort of like silly putty stretching the outcome variable in the data space  Suffers same drawbacks as multiple regression:  P>n  Correlations between predictors  Local optima
  • 5.  Impose penalties on the generalized linear model frameworks:  Sparsity (set most estimates to 0 to reduce model size and complexity)  Robustness (generalizability of the results under noise)  Reduce the number of predictors  Shrink some predictor estimates to 0  Examine sets of similar predictors  Similar to a cowboy at the origin roping coefficients that get too close  Includes LASSO, LARS, elastic net, and ridge regression, among others Penalized Regression Models
  • 6. Homotopy-Based LASSO (lasso2)  Homotopy arrow example ◦ Red and blue arrows  Anchor start and finish points  Wiggle middle parts of the line until arrows overlap ◦ Yellow arrow  Hole presents issues  Can’t wiggle into blue or red arrow without breaking the yellow arrow  Homotopy LASSO/LARS wiggles an easy regression path into an optimal regression path ◦ Avoids local optima  Peaks  Valleys  Saddles  R package lasso2 implements for a variety of outcome types  Homotopy as path equivalence ◦ Intrinsic property of topological spaces
  • 7.  Instead of fitting model to data, fit model to tangent space (what isn’t the data)  Deals with collinearity, as parallel vectors share the same tangent space  LARS/LASSO extensions  Partition model into sets of predictors based on tangent space  Fit sets that correspond well to an outcome  Rao scoring for selection.  Effect estimates (angles)  Model selection criteria  Information criteria  Deviance scoring  New extensions of R package dglars  Most exponential family distributions  Binomial  Poisson  Gaussian  Gamma Differential Geometry and Regression (dglars)
  • 9. Example Dataset (Open-Source)  Link to code and data:  https://www.researchgate.net/project/Miami-Data-Science-Meetup  https://archive.ics.uci.edu/ml/datasets/Student+Performance (original downloaded data)  Code: #load data mydata<-read.csv("MathScores.csv") #retrieve only first term scores mydata<-mydata[,-c(32:33)] #split to train and test set s<-sample(1:395,0.7*395) train<-mydata[s,] test<-mydata[-s,]
  • 10. lasso2 Package  R package implementing homotopy-based LASSO model  Example pieces of code for logistic regression: library(lasso2) #run the model, can use multiple bounds and compare fit etastart<-NULL las<-gl1ce(G1~., train, family=gaussian(link=identity), bound=5, standardize=F) #predict scores of test group lpred<-predict(las, test, link="response") sum((lpred-test$G1)^2)/119 #compare to MSE of mean model sum((mean(test$G1)-test$G1)^2)/119 #obtain coefficients coef(las) #obtain deviance estimate (model fit—can be used to derive AIC/BIC) deviance(las)  Try it out on your dataset!
  • 11. dglars Package  R package implementing differential-geometry-based LARS algorithm  Example pieces of code for logistic regression: library(dglars) dg<-dglars(G1~., family="gaussian", data=train) #can also use cross-validation (cvdglars() function) dg2<-cvdglars(G1~., family="gaussian", data=train) #summary of the model summary(dg) #extract coefficients from matrix of coefficients at each step coef(dg) #obtain model fit statistics, can also use logLik(dg) AIC(dg) AIC(dg2) #plot path of LARS algorithm or model fit for cross-validated model plot(dg) plot(dg2)  Try it out on your dataset!
  • 12. Compare with multiple linear regression #compare DGLARS with multiple linear regression gl<-lm(G1~., data=train) AIC(gl) #1418 AIC(dg) #1402 AIC(dg2) #1403 #obtain coefficients to compare with both penalized models summary(gl) #Compare prediction accuracy pred<-predict(gl, test, link="response") sum((pred-test$G1)^2)/119 sum((lpred-test$G1)^2)/119 sum((mean(test$G1)-test$G1)^2)/119
  • 14. Summary  Geometry and topology can be leveraged to improve generalized linear regression and penalized regression model performance, particularly when data suffers from general “messiness.”  Multiple R packages exist to implement these algorithms, and algorithms are built to accommodate many common exponential family distributions of outcomes.  Packages provide interpretable models similar to generalized linear regression, model fit statistics, and prediction capabilities.  Many more extensions of regression are possible, and there is work being done to modify other algorithms based on topology and differential geometry.
  • 15. Open-Source References  Augugliaro, L., & Mineo, A. (2013, September). Estimation of sparse generalized linear models: the dglars package. In 9th Scientific Meeting of the Classification and Data Analysis Group (pp. 20-23). Tommaso Minerva, Isabella Morlini, Francesco Palumbo.  Farrelly, C. M. (2017). Topology and Geometry in Machine Learning for Logistic Regression.  Lokhorst, J., Venables, B., Turlach, B., & Turlach, M. B. (2013). Package ‘lasso2’.  Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA journal of numerical analysis, 20(3), 389-403.  R package tutorials:  https://cran.r-project.org/web/packages/dglars/dglars.pdf  https://cran.r-project.org/web/packages/lasso2/lasso2.pdf

Notes de l'éditeur

  1. Same assumptions as multiple regression, minus outcome’s normal distribution (link function extends to non-normal distributions). McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research, 16(3), 285-292.
  2. Relaxes predictor independence requirement and adds penalty term. Adds a penalty to reduce generalized linear model’s model size. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
  3. Exists for several types of models, including survival, binomial, and Poisson regression models Augugliaro, L., Mineo, A. M., & Wit, E. C. (2013). Differential geometric least angle regression: a differential geometric approach to sparse generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 471-498. Augugliaro, L., & Mineo, A. M. (2015). Using the dglars Package to Estimate a Sparse Generalized Linear Model. In Advances in Statistical Models for Data Analysis (pp. 1-8). Springer International Publishing.