18: Iteration II

Author: Prof Amanda Luby

Affiliation: Carleton College, Stat 220 - Spring 2025

Load the Data

# read_csv, the pipe helpers, map, and ggplot all come from the tidyverse
library(tidyverse)

unscaled_cancer <- read_csv("https://raw.githubusercontent.com/UBC-DSCI/introduction-to-datascience/refs/heads/main/data/wdbc_unscaled.csv")


bakeoff = read_csv("https://stat220-s25.github.io/data/bakeoff-episodes.csv") |>
  filter(series == 14)

# loads the us_castaway_results data frame
load(url("https://stat220-s25.github.io/data/combining-data-examples.Rda"))

# penguins
library(palmerpenguins)

Try it: map

  1. Edit the code chunk below so it returns a numeric vector (remove eval: false)

  2. Edit the code chunk so it only maps over the numeric columns (3-12) of unscaled_cancer

map(unscaled_cancer, mean)

  3. Map the summary function to all columns in the {palmerpenguins} penguins data
# your code here
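
As a point of comparison, here is a minimal sketch of the same ideas on the built-in mtcars data rather than on the worksheet data. It assumes the purrr package is installed; the object names (col_means_list, col_means_dbl, first_three) are made up for illustration.

```r
library(purrr)

# mtcars is a built-in data frame whose 11 columns are all numeric
col_means_list <- map(mtcars, mean)      # map() always returns a list
col_means_dbl  <- map_dbl(mtcars, mean)  # map_dbl() returns a named numeric vector

# Map over a subset of columns chosen by position (columns 1 to 3 here)
first_three <- map_dbl(mtcars[, 1:3], mean)
first_three
```

The typed variants (map_dbl, map_chr, map_lgl) raise an error if the function does not return a single value of the expected type, which is why map_dbl would not work with summary.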

Your turn: penguins and map

Using the penguins data, use map to calculate the range of each numeric variable and the table of each factor variable. (It may be helpful to first write a custom function that produces this output.)

Your result should be a list (it will have length 8).

# your code here
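
One possible shape for such a custom function, sketched on the built-in iris data (four numeric columns and one factor) rather than on penguins. The helper name describe_column is hypothetical, and the na.rm = TRUE choice is an assumption about how missing values should be handled.

```r
library(purrr)

# Hypothetical helper: range for numeric input, frequency table otherwise
describe_column <- function(x) {
  if (is.numeric(x)) {
    range(x, na.rm = TRUE)  # assumption: ignore missing values
  } else {
    table(x)
  }
}

# One list element per column of iris, so the result has length 5
iris_summary <- map(iris, describe_column)
iris_summary
```

Because the elements differ in type and length (a length-2 numeric vector versus a table), map is the right choice here; a typed variant such as map_dbl would fail.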

More practice: survivor data

us_castaway_results 
# A tibble: 870 × 7
   castaway_id castaway season_name      season place jury  finalist
   <chr>       <chr>    <chr>             <dbl> <dbl> <lgl> <lgl>   
 1 US0001      Sonja    Survivor: Borneo      1    16 FALSE FALSE   
 2 US0002      B.B.     Survivor: Borneo      1    15 FALSE FALSE   
 3 US0003      Stacey   Survivor: Borneo      1    14 FALSE FALSE   
 4 US0004      Ramona   Survivor: Borneo      1    13 FALSE FALSE   
 5 US0005      Dirk     Survivor: Borneo      1    12 FALSE FALSE   
 6 US0006      Joel     Survivor: Borneo      1    11 FALSE FALSE   
 7 US0007      Gretchen Survivor: Borneo      1    10 FALSE FALSE   
 8 US0008      Greg     Survivor: Borneo      1     9 TRUE  FALSE   
 9 US0009      Jenna    Survivor: Borneo      1     8 TRUE  FALSE   
10 US0010      Gervase  Survivor: Borneo      1     7 TRUE  FALSE   
# ℹ 860 more rows
  1. Write a function called finalists that takes the input of a survivor season (as a numeric) and outputs a string of the finalists’ names for that season. The finalists’ names should be separated with a comma.

  2. Use map_chr to return a character vector of finalists for seasons 31-40.
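
The filter, pull, paste pattern behind these two steps can be sketched on a small made-up table. Everything here (toy_results, its columns, and the helper finalist_names) is hypothetical and is not the Survivor data; the sketch assumes dplyr and purrr are installed.

```r
library(dplyr)
library(purrr)

# Made-up data for illustration only
toy_results <- tibble(
  season   = c(1, 1, 1, 2, 2, 2),
  player   = c("Ana", "Ben", "Cal", "Dee", "Eli", "Fay"),
  finalist = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Hypothetical helper: one comma-separated string of finalists for a season
finalist_names <- function(season_number) {
  toy_results |>
    filter(season == season_number, finalist) |>
    pull(player) |>
    paste(collapse = ", ")
}

# map_chr() needs each call to return exactly one string
toy_finalists <- map_chr(1:2, finalist_names)
toy_finalists
```

Using paste with collapse (rather than sep) is what turns a vector of several names into a single string, which is the length-one result map_chr requires.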

Your turn: sample_finalists

Some Survivor seasons had only 16 players, while others had 22. Sampling players directly from the full table could therefore leave some seasons slightly under- or overrepresented in our sample. Let’s account for this.

Edit the experiment below so that it instead randomly samples one player from each season (this results in 47 players) and then samples 20 players from those 47.

(Hint: look at the by argument in slice_sample)

# edit this function
sample_finalists = function(n){
  us_castaway_results %>%
    slice_sample(n = n) %>%
    filter(finalist) %>%
    count() %>%
    pull(n)
}
# your iteration code here
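
To see what the by argument does, here is a minimal two-stage sampling sketch on the built-in mtcars data, grouping by cyl instead of season. It assumes dplyr 1.1.0 or later (where slice_sample gained by); the seed and object names are arbitrary.

```r
library(dplyr)

set.seed(220)  # arbitrary seed so the sketch is reproducible

# Stage 1: one randomly chosen car per cylinder group (cyl is 4, 6, or 8)
one_per_group <- mtcars |>
  slice_sample(n = 1, by = cyl)

# Stage 2: a random sample of 2 rows from the 3 rows kept in stage 1
second_stage <- one_per_group |>
  slice_sample(n = 2)

second_stage
```

With by, the grouping applies only to that one call, so the result comes back ungrouped and no ungroup() step is needed before the second sample.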

Permutation test

unscaled_cancer %>%
  group_by(Class) %>%
  summarize(
    mean = mean(Radius)
  )
# A tibble: 2 × 2
  Class  mean
  <chr> <dbl>
1 B      12.1
2 M      17.5
  1. Define a function to run the experiment
    • Create a new column of unscaled_cancer called class_shuffled, which is a permutation of the original Class variable (Hint: see the sample function)
    • Group by class_shuffled and compute the group means
    • Find the difference in the means
  2. Repeat the experiment 1000 times, making sure to save the difference in means
  3. Make a histogram of the simulated differences. How (un)likely is the difference we observed?
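
The three steps above can be sketched on the built-in mtcars data, comparing mean mpg between the two values of am instead of Radius between classes. This is a sketch under assumptions, not the worksheet solution: the helper shuffled_diff, the seed, and the two-sided comparison are all illustrative choices, and it assumes dplyr, purrr, and ggplot2 are installed.

```r
library(dplyr)
library(purrr)
library(ggplot2)

set.seed(220)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: shuffle the labels once, return the difference in group means
shuffled_diff <- function() {
  mtcars |>
    mutate(am_shuffled = sample(am)) |>  # sample() with no size permutes its input
    group_by(am_shuffled) |>
    summarize(mean_mpg = mean(mpg)) |>
    pull(mean_mpg) |>
    diff()                               # mean for am = 1 minus mean for am = 0
}

# Repeat 1000 times; the index i is unused, it only drives the repetition
sim_diffs <- map_dbl(1:1000, \(i) shuffled_diff())

# Observed difference, computed in the same order (am = 1 minus am = 0)
observed_diff <- unname(diff(tapply(mtcars$mpg, mtcars$am, mean)))

# Share of shuffles at least as extreme as the observed difference (two-sided)
p_value <- mean(abs(sim_diffs) >= abs(observed_diff))

perm_plot <- ggplot(tibble(diff = sim_diffs), aes(x = diff)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = observed_diff, linetype = "dashed")
perm_plot
```

Shuffling the labels breaks any real association between group and outcome, so the histogram shows the differences that chance alone would produce; the dashed line marks where the observed difference falls relative to them.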