Day 8 - 09/06/2024

From last class

Be suspicious of point estimates that are reported alone.

Uncertainty

Intro - data/model prep

library(tidyverse)
url <- "https://raw.githubusercontent.com/jlacasa/stat705_fall2024/main/classes/data/corn_example2.csv"
data <- read.csv(url)
data %>% 
  ggplot(aes(plant_density, yield_Mgha))+
  geom_point()+
  labs(x = expression(Plant~Density~(plants~m^{-2})), 
       y = expression(Grain~Yield~(Mg~ha^{-1})))+
  theme_classic()+
  coord_cartesian(xlim = c(0, 15), 
                  ylim = c(0, 16), 
                  expand = F)+
  theme(aspect.ratio = 1)

What would be a good statistical model for these data?

The yield from the \(i\)th observation, \(y_i\), can be described as \[y_i \sim N(\mu_i , \sigma^2),\]

\[\mu_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 ,\] for \(i =1, 2, ..., n\), where \(n\) is the total number of observations, \(\mu_i\) is the mean of the \(i\)th observation, \(\beta_0\) is the intercept (i.e., the yield with zero plants), \(\beta_1\) and \(\beta_2\) determine the shape of the quadratic curve, \(x_i\) is the plant density of the \(i\)th observation, and \(\sigma^2\) is the variance of the data.

m <- lm(yield_Mgha ~ plant_density + I(plant_density^2), data = data)
summary(m)


Call:
lm(formula = yield_Mgha ~ plant_density + I(plant_density^2), 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9426 -0.6571 -0.1711  0.9246  1.6917 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         4.30879    1.38213   3.117 0.005016 ** 
plant_density       1.74562    0.33964   5.140 3.76e-05 ***
I(plant_density^2) -0.08300    0.01862  -4.458 0.000197 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9736 on 22 degrees of freedom
Multiple R-squared:  0.6493,    Adjusted R-squared:  0.6174 
F-statistic: 20.37 on 2 and 22 DF,  p-value: 9.866e-06

Estimation uncertainty

data_new <- data.frame(plant_density = 10)

Let’s look at \[E(y_i) = E(\beta_0 + \beta_1 x_i + \beta_2 x_i^2).\]

predict(m, data_new, type = "response")

       1 
13.46458

Let’s look at the 95% CI for \[E(y_i) = E(\beta_0 + \beta_1 x_i + \beta_2 x_i^2).\]

predict(m, data_new, interval = "confidence")

       fit      lwr      upr
1 13.46458 12.85359 14.07558

Prediction uncertainty

Let’s look at the 95% CI for \[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i,\] where \(\varepsilon_i = y_i- \mu_i\).

predict(m, data_new, interval = "prediction")

       fit      lwr      upr
1 13.46458 11.35501 15.57416

Uncertainty of derived quantities

What is a derived quantity?
Invariance property of Maximum likelihood estimators.
What options are there out there to calculate the uncertainty of a derived quantity?
Check out this paper.

The ‘optimum planting rate’

The first derivative of \(\beta_0 + \beta_1 x_i + \beta_2 x_i^2\) is \(\beta_1 + 2x_i \beta_2\). If we set that to zero,

\[\beta_1 + 2x_i \beta_2 = 0, \]

and

\[OPR = x_i = - \frac{\beta_1}{2 \beta_2}.\]

Using the invariance property of MLE, we can obtain the MLE for this derived quantity.

coef(m)

       (Intercept)      plant_density I(plant_density^2) 
        4.30878773         1.74562330        -0.08300439

(opd_hat <- -coef(m)[2]/(2*coef(m)[3]))

plant_density 
     10.51525

What would be a reasonable confidence interval for this derived quantity?

Can we apply the invariance propert and use the 95% CI from the \(\beta\)s to estimate the 95CI for the derived quantity?

confint(m)

                        2.5 %      97.5 %
(Intercept)         1.4424214  7.17515407
plant_density       1.0412544  2.44999217
I(plant_density^2) -0.1216181 -0.04439072

covariance <- vcov(m)
covariance

                   (Intercept) plant_density I(plant_density^2)
(Intercept)         1.91028888  -0.454746081       0.0237469842
plant_density      -0.45474608   0.115354876      -0.0062400834
I(plant_density^2)  0.02374698  -0.006240083       0.0003466713

Remember the simulation? [R code]

The delta method

opd_se_hat <- msm::deltamethod(g = ~ -x2/(2*x3), mean = coef(m), cov = covariance)
opd_hat - 1.96*opd_se_hat ## 1.96 is typically used for approximate confidence intervals

plant_density 
     9.584308

opd_hat + 1.96*opd_se_hat

plant_density 
     11.44619

Summary

Why are uncertainty estimates important?
What are uncertainty estimates telling us?
Which variables have uncertainty? Which ones don’t?
Where does uncertainty come from?

For next week

Assignment 2 is up.
Semester project proposals are due next Friday.
An example proposal.