Day 5 - 08/28/2024

From last class

  • Please submit Assignment 1 today!

  • A few things to review

    • What is an unbiased estimator?
      When we use \(\hat{\beta}\) as an estimator of \(\beta\), there are no systematic mechanisms in the computation of \(\hat{\beta}\) that will lead to over/underestimation of \(\beta\). That is why we expect to obtain \(\beta\) (i.e., the “true” \(\beta\)) back.

Estimation

Point estimates in \(\boldsymbol{\beta}\)

library(tidyverse)
dd_lotus <- read.csv("data/lotus_part1.csv") %>% 
  transmute(species = factor(species), agb_g) 
summary(dd_lotus)
 species     agb_g      
 A:24    Min.   :1.265  
 C:24    1st Qu.:1.969  
 D:24    Median :2.355  
         Mean   :2.492  
         3rd Qu.:2.936  
         Max.   :4.354  
m1 <- lm(agb_g ~ 0 + species, data = dd_lotus)

Let’s calculate \(\hat{\boldsymbol{\beta}}_{LSE} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

X <- model.matrix(m1)
y <- dd_lotus$agb_g
(beta_hat <- solve(t(X)%*%X) %*% t(X)%*%y)
             [,1]
speciesA 2.936576
speciesC 2.396695
speciesD 2.143253
coefficients(m1)
speciesA speciesC speciesD 
2.936576 2.396695 2.143253 

Residuals

residuals <- y - X%*%beta_hat

Variance \(\sigma^2\)

  • Perhaps one of the most important parameters we will estimate.
  • Prediction accuracy.
  • Prediction intervals, estimation intervals.
  • Degrees of freedom. From \(n-1\) to \(n-p\).

Let’s calculate \(\hat{\sigma}^2 = \frac{\boldsymbol{\hat{\varepsilon}^T\hat{\varepsilon}}}{n-p} = \frac{RSS}{n-p}\)

RSS <- sum(residuals^2)
n <- nrow(dd_lotus)
p <- length(beta_hat)
(sigma2_hat <- RSS/(n-p)) 
[1] 0.3704
(sigma_hat <- sqrt(sigma2_hat))
[1] 0.6086049
summary(m1)$sigma
[1] 0.6086049

Confidence intervals

Confidence intervals are a measure of uncertainty. They are related to other measures of uncertainty, like \(\sigma^2\). We know that \[CI = \hat{\beta} \pm t_{dfe, 1 - \alpha/2} \ s.e.,\] where \(t_{\alpha/2, dfe}\) is the t-score that corresponds to the confidence level (i.e., \(\alpha\)) with the degrees of freedom of the error \(\text{rdf}\), and \(s.e.\) is the standard error.

  • Why t distribution?

  • Let’s calculate the standard error:

unscaled_covariance <- solve(t(X)%*%X)
(se <- sqrt(diag(unscaled_covariance))*sigma_hat)
speciesA speciesC speciesD 
0.124231 0.124231 0.124231 
alpha <- 0.05
dfe <- n-p
(t_score <- qt(p=1-alpha/2, df = dfe, lower.tail=T))
[1] 1.994945
se*t_score
speciesA speciesC speciesD 
0.247834 0.247834 0.247834 
data.frame(lower = beta_hat - se*t_score, 
           upper = beta_hat + se*t_score)
            lower    upper
speciesA 2.688742 3.184410
speciesC 2.148861 2.644529
speciesD 1.895419 2.391086
confint(m1)
            2.5 %   97.5 %
speciesA 2.688742 3.184410
speciesC 2.148861 2.644529
speciesD 1.895419 2.391086

Interpretation

  • “We are 95% confident that …”

In summary:

summary(m1)

Call:
lm(formula = agb_g ~ 0 + species, data = dd_lotus)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04136 -0.46186  0.03253  0.44846  1.41732 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
speciesA   2.9366     0.1242   23.64   <2e-16 ***
speciesC   2.3967     0.1242   19.29   <2e-16 ***
speciesD   2.1433     0.1242   17.25   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6086 on 69 degrees of freedom
Multiple R-squared:  0.9468,    Adjusted R-squared:  0.9445 
F-statistic: 409.5 on 3 and 69 DF,  p-value: < 2.2e-16

Let’s do the same, from scratch, for continuous data

Live R code here.

How much money would you bet on the estimate? Let’s do a simulation.

R code here.

For next class

  • No reading!
  • Assignment 1.
  • Take a look at the schedule.