18: Iteration II
Load the Data
# read_csv(), filter(), and map() come from the tidyverse
library(tidyverse)
unscaled_cancer <- read_csv("https://raw.githubusercontent.com/UBC-DSCI/introduction-to-datascience/refs/heads/main/data/wdbc_unscaled.csv")
bakeoff <- read_csv("https://stat220-s25.github.io/data/bakeoff-episodes.csv") |>
  filter(series == 14)
# us_castaway_results
load(url("https://stat220-s25.github.io/data/combining-data-examples.Rda"))
# penguins
library(palmerpenguins)
Try it: map
- Edit the code chunk below so it returns a numeric vector (remove eval: false).
- Edit the code chunk so it only maps to the numeric columns (3-12) of unscaled_cancer.
map(unscaled_cancer, mean)
- Map the summary function to all columns in the {palmerpenguins} penguins data.
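As a point of comparison for the tasks above, here is a minimal sketch of the difference between map() and map_dbl(). The toy data frame and its column names are hypothetical stand-ins, not the course data.

```r
library(purrr)

# Hypothetical toy data, not the course data set
toy <- data.frame(
  id     = c("a", "b", "c"),
  height = c(1.5, 2.0, 2.5),
  weight = c(10, 20, 30)
)

# map() always returns a list, one element per column
means_list <- map(toy[2:3], mean)

# map_dbl() returns a named numeric vector instead
means_vec <- map_dbl(toy[2:3], mean)
means_vec
```

Selecting columns by position (here 2:3) before mapping keeps the character column out of mean(), which would otherwise return NA with a warning.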
# your code here
Your turn: penguins and map
Using the penguins data, use map to calculate the range of each numeric variable and the table of each factor variable. (It may be helpful to first write a custom function for this output.)
Your result should be a list (it will have length 8).
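One possible shape for such a custom function, sketched on a hypothetical two-column toy data frame rather than on penguins: the helper name describe_column is an assumption made for illustration.

```r
library(purrr)

# Hypothetical toy data with one factor and one numeric column
toy <- data.frame(
  group = factor(c("x", "y", "x", "y")),
  score = c(3, 9, 5, 1)
)

# Hypothetical helper: range for numeric input, counts for anything else
describe_column <- function(x) {
  if (is.numeric(x)) {
    range(x, na.rm = TRUE)  # na.rm guards against missing values
  } else {
    table(x)
  }
}

# One list element per column
toy_summary <- map(toy, describe_column)
```

Because the two branches return different kinds of objects, map() (which returns a list) is the natural choice here rather than a typed variant such as map_dbl().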
# your code here
More practice: survivor data
us_castaway_results
# A tibble: 870 × 7
   castaway_id castaway season_name      season place jury  finalist
   <chr>       <chr>    <chr>             <dbl> <dbl> <lgl> <lgl>
 1 US0001      Sonja    Survivor: Borneo      1    16 FALSE FALSE
 2 US0002      B.B.     Survivor: Borneo      1    15 FALSE FALSE
 3 US0003      Stacey   Survivor: Borneo      1    14 FALSE FALSE
 4 US0004      Ramona   Survivor: Borneo      1    13 FALSE FALSE
 5 US0005      Dirk     Survivor: Borneo      1    12 FALSE FALSE
 6 US0006      Joel     Survivor: Borneo      1    11 FALSE FALSE
 7 US0007      Gretchen Survivor: Borneo      1    10 FALSE FALSE
 8 US0008      Greg     Survivor: Borneo      1     9 TRUE  FALSE
 9 US0009      Jenna    Survivor: Borneo      1     8 TRUE  FALSE
10 US0010      Gervase  Survivor: Borneo      1     7 TRUE  FALSE
# ℹ 860 more rows
- Write a function called finalists that takes the input of a survivor season (as a numeric) and outputs a string of the finalists’ names for that season. The finalists’ names should be separated with a comma.
- Use map_chr to return a character vector of finalists for seasons 31-40.
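The pattern these two steps describe (filter, pull, collapse, then iterate with map_chr) can be sketched on a small made-up table. Everything in toy_results, including the names, is hypothetical, and toy_finalists is an illustrative helper rather than the intended answer.

```r
library(dplyr)
library(purrr)

# Hypothetical toy results table (not the real us_castaway_results)
toy_results <- tibble(
  castaway = c("Ann", "Bo", "Cy", "Di", "Ed", "Flo"),
  season   = c(1, 1, 1, 2, 2, 2),
  finalist = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Keep one season's finalists and collapse their names into one string
toy_finalists <- function(season_number) {
  toy_results |>
    filter(season == season_number, finalist) |>
    pull(castaway) |>
    paste(collapse = ", ")
}

# map_chr() requires each call to return a single string
finalists_by_season <- map_chr(1:2, toy_finalists)
finalists_by_season
```

The collapse argument is what turns a vector of names into the single string that map_chr() expects from each call.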
Your turn: sample_finalists
Some survivor seasons only had 16 players, while others had 22. This could result in some seasons being slightly under/overrepresented in our sample. Let’s account for this.
Edit this experiment to instead randomly sample one player from each season (this results in 47 players) and then sample 20 players from the 47 random ones.
(Hint: look at the by argument in slice_sample)
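A minimal sketch of the two-stage sampling idea on hypothetical data, assuming dplyr 1.1.0 or later (where slice_sample accepts by). The toy table, the seed, and the sample sizes are all made up for illustration.

```r
library(dplyr)

set.seed(220)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data: three seasons with 2, 3, and 4 players
toy_players <- tibble(
  player = paste0("p", 1:9),
  season = rep(1:3, times = c(2, 3, 4))
)

# Stage 1: one random player from each season
one_per_season <- toy_players |>
  slice_sample(n = 1, by = season)

# Stage 2: a smaller random sample from the stage-1 players
final_sample <- one_per_season |>
  slice_sample(n = 2)
```

Sampling within each season first gives every season the same weight in the second stage, regardless of how many players it had.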
# your iteration code here
Permutation test
# A tibble: 2 × 2
Class mean
<chr> <dbl>
1 B 12.1
2 M 17.5
- Define a function to run the experiment
  - Create a new column of unscaled_cancer called class_shuffled, which is a permutation of the original Class variable (Hint: see the sample function)
  - Group by class_shuffled and compute the group means
  - Find the difference in the means
- Repeat the experiment 1000 times, making sure to save the difference in means
- Make a histogram of the simulated differences. How (un)likely is the difference we observed?
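The steps above can be sketched on simulated data. This is an illustration under stated assumptions, not the worksheet's solution: the toy tibble, its value column, the seed, and the helper name shuffled_diff are all hypothetical.

```r
library(dplyr)
library(purrr)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical simulated data standing in for unscaled_cancer
toy <- tibble(
  Class = rep(c("B", "M"), each = 20),
  value = c(rnorm(20, mean = 12), rnorm(20, mean = 17))
)

# One run of the experiment: shuffle labels, then difference the group means
shuffled_diff <- function(data) {
  group_means <- data |>
    mutate(class_shuffled = sample(Class)) |>
    group_by(class_shuffled) |>
    summarize(mean = mean(value))
  # summarize() sorts by group, so row 1 is "B" and row 2 is "M"
  group_means$mean[2] - group_means$mean[1]
}

# Repeat 1000 times and keep each difference
sim_diffs <- map_dbl(1:1000, \(i) shuffled_diff(toy))

hist(sim_diffs, main = "Simulated differences in means")
```

Calling sample() on a vector with no other arguments returns a random permutation of it, which is what breaks any real link between the labels and the measured values.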
