Applied data example I (what are outliers anyway?)

The clover example

  • Last day (harvest)

  • Crown biomass

library(tidyverse)
url <-  "https://raw.githubusercontent.com/jlacasa/stat705_fall2024/main/classes/data/dd_finalproj.csv"
dd <- read.csv(url) %>% filter(doy == 237) %>% 
    filter(species %in% c("A", "D", "E")) 

Outlier definitions

With Q1 and Q3 denoting (essentially) the lower and upper quartiles in the sample, observations greater than \(Q3 + k(Q3 - Q1)\) or less than \(Q1 - k(Q3 - Q1)\) are flagged as outliers. These values are sometimes outliers and sometimes not. With the typical value of 1.5 for \(k\), a normal sample of size 100 has more than 50 percent chance of containing one or more of these ‘outliers’!

From International Encyclopedia of the Social & Behavioral Sciences
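
The 50 percent figure in the quote can be checked with a short calculation and a simulation. The sketch below is not part of the original example: the seed, the number of replicates (`n_sims`), and all variable names are arbitrary choices, and the simulated quartiles use R's default `quantile()` definition.

```r
set.seed(705)   # arbitrary seed, for reproducibility only
n <- 100        # sample size from the quote
k <- 1.5        # typical fence multiplier
n_sims <- 10000 # number of simulated samples (assumption)

# Theory: use the population quartiles of a standard normal
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
p_one <- 2 * pnorm(q1 - k * (q3 - q1))  # P(one observation falls outside the fences)
p_theory <- 1 - (1 - p_one)^n           # P(at least one of n is flagged)

# Simulation: use the sample quartiles, as the rule does in practice
has_flag <- replicate(n_sims, {
  x <- rnorm(n)
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  any(x < q[1] - fence | x > q[2] + fence)
})
p_sim <- mean(has_flag)

c(theory = p_theory, simulated = p_sim)
```

Both numbers should land near one half, which is the point of the quote: flagged observations are expected even when every value comes from the same normal distribution.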

Default boxplot

dd %>% 
  ggplot(aes(paste(species, trt), crown_g))+
  theme_classic()+
  labs(x = "Species and treatment", 
       y = expression(Crown~biomass~(grams~plant^{-1})))+
  geom_boxplot(alpha = .6)

boxplot(crown_g ~ species:trt, data = dd,
        xlab = "Species and treatment",
        ylab = expression(Crown~biomass~(grams~plant^{-1})))

Interquartile range

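# Flag observations more than 1.5 * IQR beyond the quartiles,
# computed separately within each treatment-by-species group
dd %>% 
  group_by(trt, species) %>% 
  transmute(crown_g, 
            outlier = crown_g > (quantile(crown_g, probs = .75) + 1.5*IQR(crown_g)) |
              crown_g < (quantile(crown_g, probs = .25) - 1.5*IQR(crown_g))) %>% 
  filter(outlier)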
## # A tibble: 5 × 4
## # Groups:   trt, species [2]
##   trt   species crown_g outlier
##   <chr> <chr>     <dbl> <lgl>  
## 1 flood D         0.409 TRUE   
## 2 flood D         1.09  TRUE   
## 3 flood D         0.409 TRUE   
## 4 flood E         1.22  TRUE   
## 5 flood E         1.23  TRUE
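
The same rule can be wrapped in a small helper so that the multiplier \(k\) from the definition is an explicit argument. This is a minimal sketch: `is_iqr_outlier()` is a hypothetical name, not a function from any package, and the toy vector is made up for illustration.

```r
# Hypothetical helper: TRUE where x lies more than k * IQR beyond the quartiles
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

toy <- c(1, 2, 3, 4, 5, 100)      # made-up values, one obviously extreme
flags <- is_iqr_outlier(toy)      # default k = 1.5
flags_wide <- is_iqr_outlier(toy, k = 3)
toy[flags]
```

Inside a grouped pipeline this could replace the long logical expression, for example `mutate(outlier = is_iqr_outlier(crown_g))` after `group_by(trt, species)`. Changing `k` changes which points are flagged, which is a reminder that the flag is a convention rather than a property of the data.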

Data manipulation goes wrong - a famous example

Let’s work on an example

R script

Announcements