Palmerpenquins tidymodels

By Edward F. Hillenaar

January 1, 0001

Get started with tidymodels and #TidyTuesday Palmer penguins

This is a blogpost of Julia Silge about rstats tidymodels: Julia Silge’s youtube video.

Palmerpenguins dataset

The Palmerpenguins dataset can be found here: Palmerpenguins dataset.

We can build a classification model to distinguish male and female penguins.

Explore data

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
penguins %>% 
  filter(!is.na(sex)) %>% 
  ggplot(aes(flipper_length_mm, bill_length_mm, color = sex, size = body_mass_g)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~species)

penguins_df <- penguins %>% 
  filter(!is.na(sex)) %>% 
  select(-year, -island)

Build a model

set.seed(123)

penguin_split <- initial_split(penguins_df, strata = sex)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

library(rsample)
# Set seed for reproducibility
set.seed(123)

# Ungroup the data if it's grouped
penguin_train <- ungroup(penguin_train)

# Number of bootstrap replicates
num_replicates <- 250

# Create an empty list to store bootstrap samples
penguin_boot <- vector("list", num_replicates)

# Perform bootstrap resampling
for (i in 1:num_replicates) {
  # Generate random indices with replacement
  indices <- sample(nrow(penguin_train), replace = TRUE)
  # Extract bootstrap sample using the random indices
  penguin_boot[[i]] <- penguin_train[indices, ]
}

# View the first bootstrap sample
penguin_boot[[1]]
## # A tibble: 249 × 6
##    species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
##    <fct>              <dbl>         <dbl>             <int>       <int> <fct> 
##  1 Adelie              38.3          19.2               189        3950 male  
##  2 Gentoo              48.6          16                 230        5800 male  
##  3 Adelie              41.5          18.5               201        4000 male  
##  4 Adelie              37            16.9               185        3000 female
##  5 Gentoo              49.6          16                 225        5700 male  
##  6 Adelie              37.7          19.8               198        3500 male  
##  7 Adelie              32.1          15.5               188        3050 female
##  8 Chinstrap           42.5          17.3               187        3350 female
##  9 Adelie              38.8          17.6               191        3275 female
## 10 Chinstrap           51.3          18.2               197        3750 male  
## # ℹ 239 more rows

Setting up a (regression) model to train the data

glm_spec <- logistic_reg() %>% 
  set_engine("glm")

rf_spec <- rand_forest() %>% 
  set_mode("classification") %>% 
  set_engine("ranger")
penguin_wf <- workflow() %>% 
  add_formula(sex ~ .)

penguin_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: None
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## sex ~ .

Evaluate modeling

collect_metrics(rf_rs)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.919    25 0.0164  Preprocessor1_Model1
## 2 roc_auc  binary     0.977    25 0.00807 Preprocessor1_Model1
collect_metrics(glm_rs)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.919    25 0.0174  Preprocessor1_Model1
## 2 roc_auc  binary     0.981    25 0.00770 Preprocessor1_Model1
glm_rs %>% 
  conf_mat_resampled()
## # A tibble: 4 × 3
##   Prediction Truth   Freq
##   <fct>      <fct>  <dbl>
## 1 female     female  4.52
## 2 female     male    0.4 
## 3 male       female  0.4 
## 4 male       male    4.64
glm_rs %>% 
  collect_predictions() %>% 
  group_by(id) %>% 
  roc_curve(sex, .pred_female) %>% 
  ggplot(aes(1 - specificity, sensitivity, color = id)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_path(show.legend = FALSE, alpha = 0.6, size = 1.2) +
  coord_equal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

penguin_final <- penguin_wf %>% 
  add_model(glm_spec) %>% 
  last_fit(penguin_split)

penguin_final
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits           id               .metrics .notes   .predictions .workflow 
##   <list>           <chr>            <list>   <list>   <list>       <list>    
## 1 <split [249/84]> train/test split <tibble> <tibble> <tibble>     <workflow>
collect_metrics(penguin_final)
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.857 Preprocessor1_Model1
## 2 roc_auc  binary         0.938 Preprocessor1_Model1
collect_predictions(penguin_final) %>% conf_mat(sex, .pred_class)
##           Truth
## Prediction female male
##     female     37    7
##     male        5   35
penguin_final$.workflow[[1]] %>% 
  tidy(exponentiate = TRUE) %>% arrange(estimate)
## # A tibble: 7 × 5
##   term              estimate std.error statistic     p.value
##   <chr>                <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)       5.75e-46  19.6        -5.31  0.000000110
## 2 speciesGentoo     1.14e- 5   3.75       -3.03  0.00243    
## 3 speciesChinstrap  1.37e- 4   2.34       -3.79  0.000148   
## 4 body_mass_g       1.01e+ 0   0.00176     4.59  0.00000442 
## 5 flipper_length_mm 1.06e+ 0   0.0611      0.926 0.355      
## 6 bill_length_mm    1.91e+ 0   0.180       3.60  0.000321   
## 7 bill_depth_mm     8.36e+ 0   0.478       4.45  0.00000868
penguins %>% 
  filter(!is.na(sex)) %>% 
  ggplot(aes(bill_depth_mm, bill_length_mm, color = sex, size = body_mass_g)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~species)

References

Gorman, K. B., Williams, T. D., & Fraser, W. R. (2014). Ecological sexual dimorphism and environmental variability within a community of antarctic penguins (genus pygoscelis). PloS One, 9(3), e90081.

LTER, P. S. A., & Gorman, K. (2016). Structural size measurements and isotopic signatures of foraging among adult male and female chinstrap penguins (pygoscelis antarctica) nesting along the palmer archipelago near palmer station, 2007-2009.

Posted on:
January 1, 0001
Length:
5 minute read, 882 words
See Also: