Day 4 - 08/26/2024

From last class

I emailed those who needed partner for their semester project.
Quiz: Go to kahoot.it

Linear models

A model is a linear model when the parameters enter linearly.

1. \(\mu_i = \beta_0 + x_i \beta_1 + x^2_i \beta_2\)
1. \(\mu_i = \beta_0 + x_{1i} \beta_1 + x_{2i}^{\beta_2}\)
1. \(\mu_i = \exp(x_{1i}\beta_1 + x_{2i}\beta_2)\)
1. The famous linear-plus-plateau model: \[\mu_i = \left \{ \begin{aligned} & \beta_0 + x_i \beta_1 && \text{if } x_i \leq x_{crit}\\ & \beta_0 + x_{crit} \beta_1 && \text{if } x_i > x_{crit} \end{aligned} \right.\]

Estimation

Continuous versus discrete independent variables.
- How do we interpret them?

Recall the clover example:

3 clover species planted in pots.
24 plants per species.
After 2 months, we harvest and weigh their biomass.
We want to estimate how much biomass each genotype produces, eventually know how much and which one produces more.

library(tidyverse)
dd_lotus <- read.csv("data/lotus_part1.csv") %>% 
  transmute(species = factor(species), agb_g) 
summary(dd_lotus)

 species     agb_g      
 A:24    Min.   :1.265  
 C:24    1st Qu.:1.969  
 D:24    Median :2.355  
         Mean   :2.492  
         3rd Qu.:2.936  
         Max.   :4.354

m1 <- lm(agb_g ~ 0 + species, data = dd_lotus)

By programming m1 <- lm(agb_g ~ 0 + species, data = dd_lotus), we are fitting

\[\mathbf{y} \sim \text{N}(\boldsymbol{\mu}, \sigma^2\mathbf{I}),\] \[\boldsymbol{\mu}=\mathbf{X}\boldsymbol{\beta}.\]

What is another way of writing the model shown here?

Point estimates in \(\boldsymbol{\beta}\)

Let’s calculate \(\hat{\boldsymbol{\beta}}_{LSE} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

X <- model.matrix(m1)
y <- dd_lotus$agb_g
(beta_hat <- solve(t(X)%*%X) %*% t(X)%*%y)

             [,1]
speciesA 2.936576
speciesC 2.396695
speciesD 2.143253

coefficients(m1)

speciesA speciesC speciesD 
2.936576 2.396695 2.143253

Residuals

head(residuals <- y - X%*%beta_hat)

         [,1]
1 -0.02957583
2 -0.61733583
3 -0.21065583
4  0.47442417
5  1.29108417
6  0.25212417

quantile(residuals)

        0%        25%        50%        75%       100% 
-1.0413558 -0.4618581  0.0325275  0.4484598  1.4173242

head(residuals(m1))

          1           2           3           4           5           6 
-0.02957583 -0.61733583 -0.21065583  0.47442417  1.29108417  0.25212417

quantile(residuals(m1))

        0%        25%        50%        75%       100% 
-1.0413558 -0.4618581  0.0325275  0.4484598  1.4173242

hist(residuals)

hist(residuals(m1))

This will help us for model diagnostics (coming soon!)

Variance \(\sigma^2\)

Perhaps one of the most important parameters we will estimate.
Prediction accuracy.
Prediction intervals, estimation intervals.
Degrees of freedom. From \(n-1\) to \(n-p\).

Let’s calculate \(\hat{\sigma}^2 = \frac{\boldsymbol{\hat{\varepsilon}^T\hat{\varepsilon}}}{n-p} = \frac{RSS}{n-p}\)

RSS <- sum(residuals^2)
n <- nrow(dd_lotus)
p <- length(beta_hat)
(sigma2_hat <- RSS/(n-p))

[1] 0.3704

(sigma_hat <- sqrt(sigma2_hat))

[1] 0.6086049

summary(m1)$sigma

[1] 0.6086049

Confidence intervals

Confidence intervals are a measure of uncertainty. They are related to other measures of uncertainty, like \(\sigma^2\). We know that \[CI = \hat{\beta} \pm t_{\alpha/2, dfe} \ s.e.,\] where \(t_{\alpha/2, dfe}\) is the t-score that corresponds to the confidence level (i.e., \(\alpha\)) with the degrees of freedom of the error \(\text{rdf}\), and \(s.e.\) is the standard error.

First, let’s calculate the standard error:

unscaled_covariance <- solve(t(X)%*%X)

(se <- sqrt(diag(unscaled_covariance))*sigma_hat)

speciesA speciesC speciesD 
0.124231 0.124231 0.124231

alpha <- 0.05
dfe <- n-p
(t_score <- qt(p=alpha/2, df = dfe, lower.tail=F))

[1] 1.994945

se*t_score

speciesA speciesC speciesD 
0.247834 0.247834 0.247834

data.frame(lower = beta_hat - se*t_score, 
           upper = beta_hat + se*t_score)

            lower    upper
speciesA 2.688742 3.184410
speciesC 2.148861 2.644529
speciesD 1.895419 2.391086

confint(m1)

            2.5 %   97.5 %
speciesA 2.688742 3.184410
speciesC 2.148861 2.644529
speciesD 1.895419 2.391086

In `summary`:

summary(m1)


Call:
lm(formula = agb_g ~ 0 + species, data = dd_lotus)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04136 -0.46186  0.03253  0.44846  1.41732 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
speciesA   2.9366     0.1242   23.64   <2e-16 ***
speciesC   2.3967     0.1242   19.29   <2e-16 ***
speciesD   2.1433     0.1242   17.25   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6086 on 69 degrees of freedom
Multiple R-squared:  0.9468,    Adjusted R-squared:  0.9445 
F-statistic: 409.5 on 3 and 69 DF,  p-value: < 2.2e-16

For next class

No reading!
Finish the assignment.
Schedule a meeting for the semester project.