18: Iteration II
Load the Data
# read_csv() and filter() come from the tidyverse
library(tidyverse)
unscaled_cancer <- read_csv("https://raw.githubusercontent.com/UBC-DSCI/introduction-to-datascience/refs/heads/main/data/wdbc_unscaled.csv")
bakeoff <- read_csv("https://stat220-s25.github.io/data/bakeoff-episodes.csv") |>
  filter(series == 14)
# us_castaway_results
load(url("https://stat220-s25.github.io/data/combining-data-examples.Rda"))
# penguins
library(palmerpenguins)
Try it: map
- Edit the code chunk below so it returns a numeric vector (remove eval: false).
- Edit the code chunk so it only maps over the numeric columns (3-12) of unscaled_cancer.
map(unscaled_cancer, mean)
- Map the summary function to all columns in the {palmerpenguins} penguins data.
# your code here
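As a point of comparison, here is a minimal sketch of how map and its typed variant map_dbl differ in what they return. It uses the built-in mtcars data as a stand-in (an assumption for illustration, not the course data), and the object names are hypothetical.

```r
library(purrr)

# mtcars ships with R; it stands in for the course data here (assumption).
# map() always returns a list, with one element per column.
means_list <- map(mtcars, mean)

# map_dbl() returns a named numeric vector instead of a list.
means_vec <- map_dbl(mtcars, mean)

# Subset the columns by position first to iterate over only some of them.
means_subset <- map_dbl(mtcars[, 1:3], mean)
```

The typed variants fail with an error if a result is not a single value of the requested type, which is why non-numeric columns usually need to be dropped before calling map_dbl with mean.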
Your turn: penguins and map
Using the penguins data, use map to calculate the range of a numeric variable and the table of a factor variable. (It may be helpful to first write a custom function for this output.)
Your result should be a list (it will have length 8).
# your code here
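One way to approach a task like this is a helper that checks the column type before deciding what to compute. The sketch below tries the idea on the built-in iris data rather than penguins; the helper name describe_column is hypothetical, and treating every non-numeric column as a candidate for table is an assumption.

```r
library(purrr)

# Hypothetical helper: range for numeric input, frequency table otherwise.
describe_column <- function(x) {
  if (is.numeric(x)) {
    range(x, na.rm = TRUE)
  } else {
    table(x)
  }
}

# iris (built into R) has four numeric columns and one factor column.
iris_summary <- map(iris, describe_column)
```

Because the two branches return objects of different types and lengths, map (which returns a list) is the appropriate choice here rather than a typed variant such as map_dbl.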
More practice: survivor data
us_castaway_results
# A tibble: 870 × 7
   castaway_id castaway season_name      season place jury  finalist
   <chr>       <chr>    <chr>             <dbl> <dbl> <lgl> <lgl>
 1 US0001      Sonja    Survivor: Borneo      1    16 FALSE FALSE
 2 US0002      B.B.     Survivor: Borneo      1    15 FALSE FALSE
 3 US0003      Stacey   Survivor: Borneo      1    14 FALSE FALSE
 4 US0004      Ramona   Survivor: Borneo      1    13 FALSE FALSE
 5 US0005      Dirk     Survivor: Borneo      1    12 FALSE FALSE
 6 US0006      Joel     Survivor: Borneo      1    11 FALSE FALSE
 7 US0007      Gretchen Survivor: Borneo      1    10 FALSE FALSE
 8 US0008      Greg     Survivor: Borneo      1     9 TRUE  FALSE
 9 US0009      Jenna    Survivor: Borneo      1     8 TRUE  FALSE
10 US0010      Gervase  Survivor: Borneo      1     7 TRUE  FALSE
# ℹ 860 more rows
- Write a function called finalists that takes a survivor season (as a numeric) as input and outputs a string of the finalists’ names for that season. The finalists’ names should be separated with a comma.
- Use map_chr to return a character vector of finalists for seasons 31-40.
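The general pattern (filter to one group, collapse the names into a single string, then iterate over groups with map_chr) can be sketched on made-up data. The toy data frame and the helper toy_finalists below are hypothetical and only mirror the column names of us_castaway_results.

```r
library(dplyr)
library(purrr)

# Made-up toy data with the same column names as us_castaway_results.
toy_results <- data.frame(
  season   = c(1, 1, 1, 2, 2, 2),
  castaway = c("Ann", "Bo", "Cy", "Di", "Ed", "Flo"),
  finalist = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Hypothetical helper: one comma-separated string of finalists per season.
toy_finalists <- function(season_number, data = toy_results) {
  data |>
    filter(season == season_number, finalist) |>
    pull(castaway) |>
    paste(collapse = ", ")
}

# map_chr() requires each call to return exactly one string.
toy_output <- map_chr(1:2, toy_finalists)
```

Collapsing with paste(collapse = ", ") is what guarantees a single string per season, which map_chr needs in order to build a character vector.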
Your turn: sample_finalists
Some survivor seasons only had 16 players, while others had 22. This could result in some seasons being slightly under/overrepresented in our sample. Let’s account for this.
Edit this experiment to instead randomly sample one player from each season (this results in 47 players) and then sample 20 players from the 47 random ones.
(Hint: look at the by argument in slice_sample.)
# your iteration code here
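To see what the by argument does, here is a minimal two-stage sampling sketch on made-up data with groups of unequal size. The data frame toy_players, the seed, and the sample sizes are all assumptions for illustration, and the by argument requires dplyr 1.1.0 or later.

```r
library(dplyr)

set.seed(220)  # arbitrary seed so the sketch is reproducible

# Made-up toy data: three groups of unequal size.
toy_players <- data.frame(
  group  = rep(c("a", "b", "c"), times = c(2, 5, 10)),
  player = 1:17
)

# Stage 1: one random row per group.
one_per_group <- toy_players |>
  slice_sample(n = 1, by = group)

# Stage 2: sample from the stage-1 rows, so each group is equally likely.
final_sample <- one_per_group |>
  slice_sample(n = 2)
```

Sampling within groups first removes the advantage that larger groups would otherwise have when rows are drawn directly from the full data.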
Permutation test
# A tibble: 2 × 2
  Class  mean
  <chr> <dbl>
1 B      12.1
2 M      17.5
- Define a function to run the experiment
  - Create a new column of unscaled_cancer called class_shuffled, which is a permutation of the original Class variable (Hint: see the sample function)
  - Group by class_shuffled and compute the group means
  - Find the difference in the means
- Repeat the experiment 1000 times, making sure to save the difference in means
- Make a histogram of the simulated differences. How (un)likely is the difference we observed?
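The steps above can be sketched on a different data set to show the overall shape of a permutation test. This version uses the built-in mtcars data, comparing mean mpg between the two levels of am; the helper name shuffle_diff, the seed, and the two-sided comparison at the end are assumptions, not the worksheet's intended solution.

```r
library(dplyr)
library(purrr)

set.seed(220)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: shuffle the labels once and return the
# difference in group means (group 1 minus group 0).
shuffle_diff <- function(data) {
  data |>
    mutate(am_shuffled = sample(am)) |>
    group_by(am_shuffled) |>
    summarize(mean_mpg = mean(mpg)) |>
    pull(mean_mpg) |>
    diff()
}

# Repeat the experiment 1000 times, keeping each simulated difference.
sim_diffs <- map_dbl(1:1000, \(i) shuffle_diff(mtcars))

# Observed difference on the real labels, in the same direction.
observed <- mean(mtcars$mpg[mtcars$am == 1]) - mean(mtcars$mpg[mtcars$am == 0])

# Histogram of simulated differences with the observed value marked.
hist(sim_diffs, main = "Simulated differences in mean mpg")
abline(v = observed, lty = 2)

# Proportion of shuffles at least as extreme as the observed difference.
p_value <- mean(abs(sim_diffs) >= abs(observed))
```

Shuffling the labels breaks any real association between group and outcome, so the histogram approximates the differences expected by chance alone; an observed value far in the tail is therefore unlikely under that assumption.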