Day 10 - 09/11/2024

Statistical Inference

library(tidyverse)
url <- 
"https://raw.githubusercontent.com/jlacasa/stat705_fall2024/main/classes/data/lotus_part1.csv"
data <- read.csv(url)
 
data %>% 
  ggplot(aes(species, agb_g))+
  geom_boxplot()+
  labs(x = "Species", 
       y = "Aboveground biomass (g)")+
  theme_classic()+
  theme(aspect.ratio = 1)

Let’s take a simple statistical model:

The aboveground biomass from the \(i\)th observation, \(y_i\), can be described as \[y_i \sim N(\mu_i , \sigma^2),\]

\[\mu_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i},\] \[x_{1 i} \left \{ \begin{aligned} & 1 && \text{if species A} \\ & 0 && \text{else} \end{aligned} \right.,\] \[x_{2 i} \left \{ \begin{aligned} & 1 && \text{if species C} \\ & 0 && \text{else} \end{aligned} \right.,\] \[x_{3 i} \left \{ \begin{aligned} & 1 && \text{if species D} \\ & 0 && \text{else} \end{aligned} \right.,\]

for \(i =1, 2, ..., n\), where \(n\) is the total number of observations, \(\mu_i\) is the mean of the \(i\)th observation, \(x_{1i}\), \(x_{2i}\), and \(x_{3i}\) are indicator variables that identify the species as A, C, or D, and thus \(\beta_1\), \(\beta_2\), and \(\beta_3\) are the means of genotypes A, C, and D, respectively, and \(\sigma^2\) is the variance of the data.

m <- lm(agb_g ~ 0 + species , data = data)
beta_hat <- coef(m)

m


Call:
lm(formula = agb_g ~ 0 + species, data = data)

Coefficients:
speciesA  speciesC  speciesD  
   2.937     2.397     2.143

95% Confidence intervals

confint(m)

            2.5 %   97.5 %
speciesA 2.688742 3.184410
speciesC 2.148861 2.644529
speciesD 1.895419 2.391086

T-test

X <- model.matrix(m)
sigma_hat <- summary(m)$sigma
unscaled_covariance <- solve(t(X)%*%X)
(se <- sqrt(diag(unscaled_covariance))*sigma_hat)

speciesA speciesC speciesD 
0.124231 0.124231 0.124231

dfe <- (nrow(data)-3)

The t statistic \[T = \frac{\hat\theta}{s.e.(\hat\theta)}\]

is used to compute the p value in hypothesis tests. The p value is the probability of observing the t statistic under the conditions of the null hypothesis. In this formula, \(\hat\theta\) is an estimator of an operation of the parameters, e.g., \(\theta = \beta_1 - 0\) or \(\theta = \beta_1-\beta_2\).

The t distribution is similar to a normal distribution, but with heavier tails.

x <- seq(-5, 5, by = .1)
dt <- dt(x, dfe)
tc <- qt(.975, dfe)

data.frame(x, dt) %>% 
  ggplot(aes(x, dt))+
  geom_ribbon(aes(ymin = 0, ymax = dt), 
              data = . %>% filter(x < -tc))+
  geom_ribbon(aes(ymin = 0, ymax = dt), 
              data = . %>% filter(x > tc))+
  labs(y = "[t]", 
       x= "t score")+
  geom_line()+
  theme(aspect.ratio = 1)+
  coord_cartesian(expand = F)+
  theme_classic()

In the summary output, we have \(\theta = \beta_j-0\), i.e., we are testing if a given \(\beta_j\) is different to zero.

t_A <- (beta_hat[1] - 0)/se[1]
t_A

speciesA 
23.63803

t_C <- (beta_hat[2] - 0)/se[2]
t_C

speciesC 
19.29225

t_D <- (beta_hat[3] - 0)/se[3]
t_D

speciesD 
17.25216

dt(t_A, dfe)

    speciesA 
1.087321e-34

dt(t_C, dfe)

    speciesC 
2.495304e-29

dt(t_D, dfe)

    speciesD 
1.624738e-26

summary(m)


Call:
lm(formula = agb_g ~ 0 + species, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04136 -0.46186  0.03253  0.44846  1.41732 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
speciesA   2.9366     0.1242   23.64   <2e-16 ***
speciesC   2.3967     0.1242   19.29   <2e-16 ***
speciesD   2.1433     0.1242   17.25   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6086 on 69 degrees of freedom
Multiple R-squared:  0.9468,    Adjusted R-squared:  0.9445 
F-statistic: 409.5 on 3 and 69 DF,  p-value: < 2.2e-16

For next class

Homework 2 due.
Project proposals due.