Model selection

Recall Figure 2 in Hooten et al. (2015).


Model-based model selection

Refresh: Ordinary Least Squares Regression

\[\hat{\beta}_{OLS}=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]
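As a quick sanity check of the closed form above, here is a minimal sketch on simulated data (the names `X_sim`, `y_sim`, and `beta_ols` are illustrative, not from the original notes), comparing it with `lm()`:

set.seed(1)
n <- 50
X_sim <- cbind(1, matrix(rnorm(n * 2), n, 2))           # intercept column + 2 predictors
y_sim <- as.vector(X_sim %*% c(1, 2, -0.5) + rnorm(n))  # simulated response

# closed-form OLS estimate
beta_ols <- solve(t(X_sim) %*% X_sim) %*% t(X_sim) %*% y_sim

# should agree with lm(), which uses a QR decomposition internally
cbind(closed_form = as.vector(beta_ols),
      lm_fit      = coef(lm(y_sim ~ X_sim[, -1])))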

Ridge Regression

  • Solve \(\underset{\beta}{\text{argmin}}\left\{(\mathbf{y}-\mathbf{X}\beta)^\top(\mathbf{y}-\mathbf{X}\beta)+\lambda(\beta^\top \beta - c)\right\}\).
  • \(\hat{\beta}_{\text{Ridge}}=(\mathbf{X}^\top\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}\).
  • \(\hat{\beta}_{\text{Ridge}} = \hat{\beta}_{OLS}\) for \(\lambda = 0\) (see the sketch after this list).
  • As the ridge parameter \(\lambda\) increases, the estimates shrink toward zero, but they are not set exactly to zero.
  • Useful for cases with collinearity and unstable \(\hat{\beta}_{OLS}\).
  • Pick the smallest value of \(\lambda\) that produces stable estimates of \(\beta\).
  • There are also automatic ways to select \(\lambda\), e.g., cross-validation (see the example after the ridge plots below).
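A minimal sketch of the closed-form ridge estimator on simulated data (again, `X_sim`, `y_sim`, and `ridge_beta` are illustrative names): \(\lambda = 0\) reproduces OLS, and the estimates shrink toward zero as \(\lambda\) grows.

set.seed(1)
n <- 50
X_sim <- matrix(rnorm(n * 3), n, 3)
y_sim <- as.vector(X_sim %*% c(2, -1, 0.5) + rnorm(n))

# closed-form ridge estimate for a given penalty lambda
ridge_beta <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
}

# lambda = 0 reproduces OLS; larger lambda shrinks the estimates toward zero
round(sapply(c(0, 1, 10, 100), function(l) ridge_beta(X_sim, y_sim, l)), 3)
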
library(glmnet)

This data example was used in Zou and Hastie (2005) and is available for easy download here.

dd <- read.csv("data/prostate.csv")
names(dd)
##  [1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason"
##  [8] "pgg45"   "lpsa"    "train"
y <- dd$lpsa # log prostate-specific antigen
X <- model.matrix(lpsa ~ ., dd[,-10])[, -1]
m_ridge <- glmnet(X, y, family = "gaussian", alpha = 0, standardize = FALSE)
par(mfrow = c(1, 2))
plot(m_ridge)
plot(m_ridge, xvar = "lambda", label = TRUE)
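One common automatic choice of \(\lambda\) is cross-validation with `cv.glmnet()`. A minimal sketch, assuming we reuse the `X` and `y` objects defined above (whether to report `lambda.min` or `lambda.1se` is a judgment call):

set.seed(123)
cv_ridge <- cv.glmnet(X, y, alpha = 0, standardize = FALSE)  # 10-fold CV by default
plot(cv_ridge)                    # CV error across the lambda grid
cv_ridge$lambda.min               # lambda minimizing the CV error
cv_ridge$lambda.1se               # largest lambda within 1 SE of the minimum
coef(cv_ridge, s = "lambda.1se")  # ridge coefficients at that lambda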

Lasso Regression

  • Lasso = Least absolute shrinkage and selection operator
  • Solve \(\underset{\beta_0, \beta}{\text{argmin}} \left\{ \sum_{i=1}^{N}(y_i - \beta_0 - x_i^\top \beta)^2 \right\} \text{ subject to } \sum_{j=1}^{p}|\beta_j| \leq t\).
  • \(\hat{\beta}\) does not have a closed form solution.
  • As penalization increases, some \(\beta\) estimates are set exactly to zero, so the lasso also performs variable selection (see the coefficient sketch after the lasso plots below).
  • Not great for cases with collinearity, since it tends to pick one of a group of correlated predictors arbitrarily. Elastic net is a better alternative.
  • Problems with high-dimensional data where \(p > n\) (e.g., genetics data), since it selects at most \(n\) variables. Elastic net is a better alternative.
fit_lasso <- glmnet(X, y, family = "gaussian", alpha = 1, standardize = FALSE)
par(mfrow = c(1, 2))
plot(fit_lasso)
plot(fit_lasso, xvar = "lambda", label = TRUE)
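To see the exact zeros mentioned above, a minimal sketch extracting the lasso coefficients at a cross-validated \(\lambda\) (again reusing `X` and `y` from above):

set.seed(123)
cv_lasso <- cv.glmnet(X, y, alpha = 1, standardize = FALSE)
# coefficients shrunk exactly to zero are shown as "." in the sparse output
coef(cv_lasso, s = "lambda.1se")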

Elastic Net Regression

  • Combines the L1 (Lasso) and L2 (Ridge) regularization penalties.
  • \(\hat{\beta} \equiv \underset{\beta}{\text{argmin}}(||y-\mathbf{X}\beta||^2 +\lambda_2||\beta||^2 +\lambda_1||\beta||_1)\), where \(||\beta||_1 = \sum_{j=1}^{p}|\beta_j|\).
  • Ridge and Lasso are special cases of Elastic Net (see the sketch after the elastic net plots below).
fit_elasticnet <- glmnet(X, y, family = "gaussian", alpha = .5, standardize = FALSE)
par(mfrow = c(1, 2))
plot(fit_elasticnet)
plot(fit_elasticnet, xvar = "lambda", label = TRUE)
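In `glmnet`'s parameterization the elastic net penalty is \(\lambda\left[(1-\alpha)\|\beta\|_2^2/2 + \alpha\|\beta\|_1\right]\), so \(\alpha = 0\) gives ridge and \(\alpha = 1\) gives the lasso. A rough sketch (reusing `X` and `y`; the \(\alpha\) grid and the use of `lambda.1se` are arbitrary choices) of how many coefficients remain nonzero as \(\alpha\) moves from ridge to lasso:

set.seed(123)
alphas <- c(0, 0.25, 0.5, 0.75, 1)
nonzero <- sapply(alphas, function(a) {
  cv_fit <- cv.glmnet(X, y, alpha = a, standardize = FALSE)
  b <- as.matrix(coef(cv_fit, s = "lambda.1se"))[-1, 1]  # drop the intercept
  sum(b != 0)  # ridge (alpha = 0) keeps all terms; lasso (alpha = 1) may drop some
})
data.frame(alpha = alphas, nonzero = nonzero)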

Some thoughts about regularization

  • Interpretation: regularized estimates are shrunken (biased), so interpret effect sizes with care.
  • Objective of the study: whether the goal is prediction or inference about individual coefficients guides how much shrinkage is appropriate.

Other shrinkage examples

Principal Components

  • Dimension reduction.
  • The original variables are reduced to a few “principal components”, each a linear combination of those variables.
  • Helpful for cases that present collinearity.
pca <- prcomp(dd)  # PCA on the raw variables (centered by default, not scaled)
pca
## Standard deviations (1, .., p=10):
##  [1] 28.3128508  7.1567817  1.5647572  1.3808222  0.8129917  0.5710866
##  [7]  0.4881765  0.4136752  0.3298806  0.2713449
## 
## Rotation (n x k) = (10 x 10):
##                 PC1          PC2         PC3          PC4          PC5
## lcavol  0.018209408  0.016873128 -0.60226447 -0.096028751 -0.189062877
## lweight 0.001733282  0.020076745 -0.09288546  0.109978129 -0.076095426
## age     0.077590211  0.993864995  0.01728889 -0.072237987  0.007021371
## lbph    0.004375509  0.071633091 -0.06293385  0.970367573  0.157836437
## svi     0.006704749 -0.001112326 -0.13664589 -0.051679105  0.049232261
## lcp     0.031208032 -0.012011062 -0.54156574 -0.127311476  0.775311057
## gleason 0.019215746  0.004958514 -0.04232931 -0.007612331 -0.007549060
## pgg45   0.995959332 -0.077956766  0.03836989  0.005570729 -0.012100079
## lpsa    0.017321304  0.008215582 -0.55417421  0.109018476 -0.572842885
## train   0.001694013  0.009922213  0.04529459 -0.031603567 -0.040244907
##                  PC6          PC7          PC8          PC9         PC10
## lcavol   0.671290542  0.358944315  0.094572994  0.011688492 -0.056055777
## lweight -0.154894134  0.021776013  0.179714119  0.946642705 -0.142072328
## age     -0.014059758 -0.011933186  0.009168881 -0.013345716  0.006583614
## lbph     0.094407842  0.064132049 -0.037880421 -0.092203422 -0.037432071
## svi     -0.197564040  0.007911475  0.005382256 -0.179720175 -0.951191208
## lcp     -0.226551635 -0.070686058 -0.060086870  0.025118571  0.166461277
## gleason  0.384150965 -0.659546669 -0.610006091  0.172544720 -0.115093780
## pgg45   -0.001450991  0.011953094  0.013798218 -0.000957456  0.001330172
## lpsa    -0.490045822 -0.233226076 -0.085297103 -0.146235623  0.171142213
## train   -0.195422946  0.609779686 -0.757645677  0.104707457  0.014718193
# factoextra::fviz_pca_biplot(pca, label = "var", repel = TRUE)
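Because `prcomp()` above is run on the unscaled variables, the large-variance variable `pgg45` dominates the first component. A common alternative, not part of the original analysis, is to rescale the variables first and inspect the proportion of variance explained:

pca_scaled <- prcomp(dd, scale. = TRUE)  # rescale each variable to unit variance
summary(pca_scaled)                      # proportion of variance explained per component
# screeplot(pca_scaled, type = "lines")  # quick visual check of the variance decay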

Helpful visualization of dimension reduction

For next class