Day 11 - 09/13/2024

Announcements

  • Assignment 2 due today
  • Project proposal due today

T-tests

library(tidyverse)
url <- 
"https://raw.githubusercontent.com/jlacasa/stat705_fall2024/main/classes/data/lotus_part1.csv"
data <- read.csv(url)
data %>% 
  ggplot(aes(species, agb_g))+
  geom_boxplot()+
  geom_jitter(alpha = .5)+
  labs(x = "Species", 
       y = "Aboveground biomass (g)")+
  theme_classic()+
  theme(aspect.ratio = 1)

The aboveground biomass of the \(i\)th observation, \(y_i\), can be described as \[y_i \sim N(\mu_i , \sigma^2),\]

\[\mu_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i},\] \[x_{1 i} = \left \{ \begin{aligned} & 1 && \text{if species A} \\ & 0 && \text{otherwise} \end{aligned} \right.,\] \[x_{2 i} = \left \{ \begin{aligned} & 1 && \text{if species C} \\ & 0 && \text{otherwise} \end{aligned} \right.,\] \[x_{3 i} = \left \{ \begin{aligned} & 1 && \text{if species D} \\ & 0 && \text{otherwise} \end{aligned} \right.,\]

for \(i = 1, 2, ..., n\), where \(n\) is the total number of observations, \(\mu_i\) is the mean of the \(i\)th observation, and \(x_{1i}\), \(x_{2i}\), and \(x_{3i}\) are indicator variables that identify the species as A, C, or D. Consequently, \(\beta_1\), \(\beta_2\), and \(\beta_3\) are the means of species A, C, and D, respectively, and \(\sigma^2\) is the variance of the data.

m <- lm(agb_g ~ 0 + species, data = data)
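
To see the indicator variables \(x_{1i}\), \(x_{2i}\), and \(x_{3i}\) explicitly, we can look at the design matrix that lm() builds from the species factor. A minimal sketch, assuming species is coded with the levels A, C, and D used above:

# each column is a 0/1 indicator for one species
head(model.matrix(~ 0 + species, data = data))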

Define the objects needed for the computations below:

beta_hat <- coef(m)                       # estimated species means
dfe <- m$df.residual                      # residual (error) degrees of freedom
se <- summary(m)[["coefficients"]][, 2]   # standard errors of the estimates
t_A <- (beta_hat[1] - 0)/se[1]            # t statistic for H0: beta_1 = 0
t_C <- (beta_hat[2] - 0)/se[2]            # t statistic for H0: beta_2 = 0
t_D <- (beta_hat[3] - 0)/se[3]            # t statistic for H0: beta_3 = 0
dt(t_A, dfe)
    speciesA 
1.087321e-34 
dt(t_C, dfe)
    speciesC 
2.495304e-29 
dt(t_D, dfe)
    speciesD 
1.624738e-26 
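
Note that dt() returns the density of the t distribution at the observed statistic. The two-sided p-values reported by summary(m) come instead from the tail probability; a minimal sketch of that computation for species A:

# two-sided p value: probability of a t statistic at least as extreme as t_A
2 * pt(abs(t_A), dfe, lower.tail = FALSE)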

More on t-tests

Recall that the t statistic \(T = \frac{\hat\theta - \theta_0}{s.e.(\hat\theta)}\) is used to compute the p value in hypothesis tests (so far with \(\theta_0 = 0\)). We tested \(\beta_1 = 0\) before, but what about \(\beta_1 = \beta_2\)? In other words, we want to test whether the difference between \(\beta_1\) and \(\beta_2\) is different from zero.
In this case, \(s.e.(\hat\beta_1 - \hat\beta_2) = \sqrt{s.e.(\hat\beta_1)^2 + s.e.(\hat\beta_2)^2 - 2\,\text{cov}(\hat\beta_1, \hat\beta_2)}\). Because this model is fit without an intercept, each coefficient is a separate species mean and the estimates are uncorrelated, so the covariance term is zero and drops out of the computations below (see the vcov() check after these comparisons).

c_1 <- (beta_hat[1]-beta_hat[2])/sqrt(se[1]^2 + se[2]^2)
dt(c_1, dfe)
   speciesA 
0.004463487 
c_2 <- (beta_hat[1]-beta_hat[3])/sqrt(se[1]^2 + se[3]^2)
dt(c_2, dfe)
    speciesA 
4.613109e-05 
c_3 <- (beta_hat[2]-beta_hat[3])/sqrt(se[2]^2 + se[3]^2)
dt(c_3, dfe)
 speciesC 
0.1405026 
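
The full formula can also be evaluated from the variance-covariance matrix of the coefficient estimates. A minimal sketch (the off-diagonal entry is zero here, so it reproduces c_1 above):

V <- vcov(m)                                       # variance-covariance matrix of beta_hat
se_diff <- sqrt(V[1, 1] + V[2, 2] - 2 * V[1, 2])   # s.e.(beta_1 - beta_2), full formula
(beta_hat[1] - beta_hat[2]) / se_diff              # same t statistic as c_1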

An interpretation of 95% CI using t-tests

Recall the 95% CI of \(\hat\beta_1\): \(\hat\beta_1 \pm t_{dfe,\, 1-\alpha/2}\, s.e.(\hat\beta_1)\).
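
Computing this interval by hand for \(\hat\beta_1\), with qt() supplying the critical value \(t_{dfe,\, 0.975}\), reproduces the first row of confint(m) below:

# 95% CI for beta_1: estimate +/- critical value * standard error
beta_hat[1] + c(-1, 1) * qt(0.975, dfe) * se[1]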

confint(m)
            2.5 %   97.5 %
speciesA 2.688742 3.184410
speciesC 2.148861 2.644529
speciesD 1.895419 2.391086
dt((beta_hat[1] - 2.67)/( se[1]/sqrt(1)), dfe)
  speciesA 
0.04143843 
dt((beta_hat[1] - 2.68)/( se[1]/sqrt(1)), dfe)
  speciesA 
0.04870437 
dt((beta_hat[1] - 2.7)/( se[1]/sqrt(1)), dfe)
  speciesA 
0.06618161 
comp <- seq(2, 3, by = .01)      # candidate values to compare the estimated mean against
p.save <- numeric(length(comp))  # p values (via dt) for each candidate value

for (i in 1:length(comp)) {
  p.save[i] <- dt((beta_hat[1] - comp[i])/( se[1]/sqrt(1)), dfe)
}

data.frame(comp, p.save) %>% 
  ggplot(aes(comp, p.save))+
  geom_line()+
  labs(y = "p value", 
       x = "comparing the mean with ...")+
  geom_hline(yintercept = .05, linetype =2, color = 'red')+
  geom_vline(xintercept = confint(m)[1], linetype= 2)+
  theme_bw()

summary(m)

Call:
lm(formula = agb_g ~ 0 + species, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04136 -0.46186  0.03253  0.44846  1.41732 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
speciesA   2.9366     0.1242   23.64   <2e-16 ***
speciesC   2.3967     0.1242   19.29   <2e-16 ***
speciesD   2.1433     0.1242   17.25   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6086 on 69 degrees of freedom
Multiple R-squared:  0.9468,    Adjusted R-squared:  0.9445 
F-statistic: 409.5 on 3 and 69 DF,  p-value: < 2.2e-16

The same model can be fit with the default parameterization, where the intercept is the mean of the reference species A and the remaining coefficients are differences from that mean:

summary(lm(agb_g ~ species, data = data))

Call:
lm(formula = agb_g ~ species, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04136 -0.46186  0.03253  0.44846  1.41732 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.9366     0.1242  23.638  < 2e-16 ***
speciesC     -0.5399     0.1757  -3.073  0.00303 ** 
speciesD     -0.7933     0.1757  -4.515 2.54e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6086 on 69 degrees of freedom
Multiple R-squared:  0.2357,    Adjusted R-squared:  0.2135 
F-statistic: 10.64 on 2 and 69 DF,  p-value: 9.399e-05
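
Under this parameterization the species C and D means are recovered by adding each difference to the intercept. A minimal sketch, storing the refit under the new name m2:

m2 <- lm(agb_g ~ species, data = data)
coef(m2)["(Intercept)"] + coef(m2)["speciesC"]  # mean of species C (speciesC in m)
coef(m2)["(Intercept)"] + coef(m2)["speciesD"]  # mean of species D (speciesD in m)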

Summary

Now we know where (almost) everything in the summary output comes from!

summary(m)
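
For example, the residual standard error reported above is \(\sqrt{RSS/dfe}\), which we can reproduce directly:

# residual standard error = sqrt(residual sum of squares / residual degrees of freedom)
sqrt(sum(resid(m)^2) / m$df.residual)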

But why do we care if something is significant?

  • The value of categorizing results as “significant” or “non-significant”
  • What would you say about an effect with p = 0.049 versus one with p = 0.051? Are both significant?
  • What is the difference between ‘significant’ and ‘non-significant’ results? See this paper.
  • A few things to have in mind when evaluating effects:
    • is this difference meaningful?
    • what is the noise in these estimations?
  • Why are we learning about p-values then?
    ASA statement on p-values.

For next class

  • Kahoot on Monday