Day 17
Carleton College
Stat 220 - Spring 2025
Here’s a short function that standardizes an input vector
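The function body itself is not shown here. A minimal sketch, assuming standardize() computes a z-score (subtract the mean, divide by the standard deviation), which is consistent with the standardized values printed later in these slides. The na.rm argument is an added assumption, not something the slides confirm.

```r
# Sketch of a standardize() helper: centre on the mean, scale by the sd.
# na.rm = TRUE is an assumption so that missing values do not make every result NA.
standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

standardize(c(2, 4, 6))
# -1  0  1
```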
Now we can easily standardize all of our variables, right?
Rows: 569
Columns: 12
$ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786…
$ Class <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M…
$ Radius <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450, 18.2…
$ Texture <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.98, 20.…
$ Perimeter <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, 119.60…
$ Area <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, 1040.0…
$ Smoothness <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0.12780…
$ Compactness <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0.17000…
$ Concavity <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0.15780…
$ Concave_Points <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0.08089…
$ Symmetry <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087, 0.17…
$ Fractal_Dimension <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0.07613…
scaled_cancer <- unscaled_cancer %>%
  mutate(
    Radius = standardize(Radius),
    Texture = standardize(Texture),
    Perimeter = standardize(Perimeter),
    Area = standardize(Area),
    Smoothness = standardize(Smoothness),
    Compactness = standardize(Compactness),
    Concavity = standardize(Concavity),
    Concave_points = standardize(Concave_Points),
    Symmetry = standardize(Symmetry),
    Fractal_dimension = standardize(Fractal_Dimension)
  )
“You should consider iterating whenever you’ve copied and pasted a block of code more than twice.”
-Amanda Luby (adapting Hadley Wickham’s advice to consider writing a function)
Programmatically repeat the code
We have two options for doing this:
- using a for loop, or similar (imperative programming)
- mapping with functional programming
for loops are the simplest and most common type of loop in R.
Given a vector, iterate through the elements and evaluate the code block for each.
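As a small illustration of that pattern (the vector x and the result name doubled are made up for this sketch), a for loop visits each element in turn and runs the same block of code on it:

```r
x <- c(10, 20, 30)

# Preallocate a result vector with the same length as the input
doubled <- numeric(length(x))

# Visit each position of x and run the code block for it
for (i in seq_along(x)) {
  doubled[i] <- x[i] * 2
}

doubled
# 20 40 60
```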
Goal: Standardize all of the numeric columns via for loops.
scaled_cancer <- unscaled_cancer %>%
  mutate(
    Radius = NA,
    Texture = NA,
    Perimeter = NA,
    Area = NA,
    Smoothness = NA,
    Compactness = NA,
    Concavity = NA,
    Concave_Points = NA,
    Symmetry = NA,
    # lowercase "d": this adds a 13th column instead of overwriting Fractal_Dimension
    Fractal_dimension = NA
  )
scaled_cancer
# A tibble: 569 × 13
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 8.42e5 M NA NA NA NA NA NA NA
2 8.43e5 M NA NA NA NA NA NA NA
3 8.43e7 M NA NA NA NA NA NA NA
4 8.43e7 M NA NA NA NA NA NA NA
5 8.44e7 M NA NA NA NA NA NA NA
6 8.44e5 M NA NA NA NA NA NA NA
7 8.44e5 M NA NA NA NA NA NA NA
8 8.45e7 M NA NA NA NA NA NA NA
9 8.45e5 M NA NA NA NA NA NA NA
10 8.45e7 M NA NA NA NA NA NA NA
# ℹ 559 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
# Fractal_Dimension <dbl>, Fractal_dimension <lgl>
Columns 3 to 12 are numeric, so our index is 3:12
i <- 3
# A tibble: 569 × 13
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <chr> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 842302 M 1.10 NA NA NA NA NA NA
2 842517 M 1.83 NA NA NA NA NA NA
3 84300903 M 1.58 NA NA NA NA NA NA
4 84348301 M -0.768 NA NA NA NA NA NA
5 84358402 M 1.75 NA NA NA NA NA NA
# ℹ 564 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
# Fractal_Dimension <dbl>, Fractal_dimension <lgl>
i <- 4
# A tibble: 569 × 13
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <chr> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 842302 M 1.10 -2.07 NA NA NA NA NA
2 842517 M 1.83 -0.353 NA NA NA NA NA
3 84300903 M 1.58 0.456 NA NA NA NA NA
4 84348301 M -0.768 0.254 NA NA NA NA NA
5 84358402 M 1.75 -1.15 NA NA NA NA NA
# ℹ 564 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
# Fractal_Dimension <dbl>, Fractal_dimension <lgl>
i <- 5
# A tibble: 569 × 13
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <chr> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl>
1 842302 M 1.10 -2.07 1.27 NA NA NA NA
2 842517 M 1.83 -0.353 1.68 NA NA NA NA
3 84300903 M 1.58 0.456 1.57 NA NA NA NA
4 84348301 M -0.768 0.254 -0.592 NA NA NA NA
5 84358402 M 1.75 -1.15 1.78 NA NA NA NA
# ℹ 564 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
# Fractal_Dimension <dbl>, Fractal_dimension <lgl>
i <- 6
# A tibble: 569 × 13
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
1 8.42e5 M 1.10 -2.07 1.27 0.984 NA NA NA
2 8.43e5 M 1.83 -0.353 1.68 1.91 NA NA NA
3 8.43e7 M 1.58 0.456 1.57 1.56 NA NA NA
4 8.43e7 M -0.768 0.254 -0.592 -0.764 NA NA NA
5 8.44e7 M 1.75 -1.15 1.78 1.82 NA NA NA
# ℹ 564 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
# Fractal_Dimension <dbl>, Fractal_dimension <lgl>
# Preallocate storage
scaled_cancer <- unscaled_cancer %>%
  mutate(
    Radius = NA,
    Texture = NA,
    Perimeter = NA,
    Area = NA,
    Smoothness = NA,
    Compactness = NA,
    Concavity = NA,
    Concave_Points = NA,
    Symmetry = NA,
    Fractal_dimension = NA
  )
# Iterate over numeric columns and save
for (i in 3:12) {
  scaled_cancer[, i] <- standardize(unscaled_cancer[[i]])
}
scaled_cancer
# A tibble: 569 × 13
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.42e5 M 1.10 -2.07 1.27 0.984 1.57 3.28 2.65
2 8.43e5 M 1.83 -0.353 1.68 1.91 -0.826 -0.487 -0.0238
3 8.43e7 M 1.58 0.456 1.57 1.56 0.941 1.05 1.36
4 8.43e7 M -0.768 0.254 -0.592 -0.764 3.28 3.40 1.91
5 8.44e7 M 1.75 -1.15 1.78 1.82 0.280 0.539 1.37
6 8.44e5 M -0.476 -0.835 -0.387 -0.505 2.24 1.24 0.866
7 8.44e5 M 1.17 0.161 1.14 1.09 -0.123 0.0882 0.300
8 8.45e7 M -0.118 0.358 -0.0728 -0.219 1.60 1.14 0.0610
9 8.45e5 M -0.320 0.588 -0.184 -0.384 2.20 1.68 1.22
10 8.45e7 M -0.473 1.10 -0.329 -0.509 1.58 2.56 1.74
# ℹ 559 more rows
# ℹ 4 more variables: Concave_Points <dbl>, Symmetry <dbl>,
# Fractal_Dimension <dbl>, Fractal_dimension <lgl>
Load the palmerpenguins package.
Write a for loop that calculates the mean of the numeric variables in the penguins data set and stores the means in a named vector.
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Here are a few useful ways to preallocate storage for a vector of length n:
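The specific calls are not listed here. A sketch of some common base R options, followed by a short loop that fills a preallocated named vector. The built-in mtcars data stands in for a real data set, and col_means is a name made up for this example.

```r
n <- 5

vector("numeric", n)   # n zeros
numeric(n)             # same thing, shorter
character(n)           # n empty strings
vector("list", n)      # a list of n NULLs
rep(NA_real_, n)       # n missing values of type double

# Preallocate a named vector, then fill it one column at a time
col_means <- vector("numeric", ncol(mtcars))
names(col_means) <- names(mtcars)

for (i in seq_along(mtcars)) {
  col_means[i] <- mean(mtcars[[i]])
}

col_means
```

Preallocating means the loop only fills in slots that already exist, rather than growing the vector on every iteration.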
Useful ways to create an index vector to iterate over:
1:n - manual creation if you already have n stored
seq_along(df) - construct an index “along” the columns of your data frame/tibble, e.g. seq_along(unscaled_cancer)
⚠️ Use this instead of 1:ncol(df) or 1:length(x), which count down from 1 to 0 when there is nothing to iterate over
x - pass in a vector; there’s no reason it needs to be an “index”, e.g. colnames(unscaled_cancer)
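A short sketch of why seq_along() is the safer choice, and of iterating over names rather than positions. The data frame df and the list results are made up for this example.

```r
x <- character(0)      # an empty vector

1:length(x)            # 1 0  (counts down, so a loop would run twice)
seq_along(x)           # integer(0)  (a loop would not run at all)

df <- data.frame(a = 1:3, b = c("u", "v", "w"))
seq_along(df)          # 1 2  (one index per column)

# Iterating over column names instead of positions,
# storing each result in a preallocated named list
results <- vector("list", ncol(df))
names(results) <- colnames(df)

for (nm in colnames(df)) {
  results[[nm]] <- class(df[[nm]])
}

results
```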
Revisit the {palmerpenguins} penguins data.
Write a for loop that calculates the summary() of each numeric variable and the table() of each factor variable.
Store the results in a list (it will have length 8).
across()
for loops and across()
across() on unscaled cancer data
# A tibble: 1 × 10
Radius Texture Perimeter Area Smoothness Compactness Concavity Concave_Points
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 14.1 19.3 92.0 655. 0.0964 0.104 0.0888 0.0489
# ℹ 2 more variables: Symmetry <dbl>, Fractal_Dimension <dbl>
# A tibble: 1 × 20
Radius_mean Radius_sd Texture_mean Texture_sd Perimeter_mean Perimeter_sd
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 14.1 3.52 19.3 4.30 92.0 24.3
# ℹ 14 more variables: Area_mean <dbl>, Area_sd <dbl>, Smoothness_mean <dbl>,
# Smoothness_sd <dbl>, Compactness_mean <dbl>, Compactness_sd <dbl>,
# Concavity_mean <dbl>, Concavity_sd <dbl>, Concave_Points_mean <dbl>,
# Concave_Points_sd <dbl>, Symmetry_mean <dbl>, Symmetry_sd <dbl>,
# Fractal_Dimension_mean <dbl>, Fractal_Dimension_sd <dbl>
# A tibble: 569 × 12
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.42e5 M 1.10 -2.07 1.27 0.984 1.57 3.28 2.65
2 8.43e5 M 1.83 -0.353 1.68 1.91 -0.826 -0.487 -0.0238
3 8.43e7 M 1.58 0.456 1.57 1.56 0.941 1.05 1.36
4 8.43e7 M -0.768 0.254 -0.592 -0.764 3.28 3.40 1.91
5 8.44e7 M 1.75 -1.15 1.78 1.82 0.280 0.539 1.37
6 8.44e5 M -0.476 -0.835 -0.387 -0.505 2.24 1.24 0.866
7 8.44e5 M 1.17 0.161 1.14 1.09 -0.123 0.0882 0.300
8 8.45e7 M -0.118 0.358 -0.0728 -0.219 1.60 1.14 0.0610
9 8.45e5 M -0.320 0.588 -0.184 -0.384 2.20 1.68 1.22
10 8.45e7 M -0.473 1.10 -0.329 -0.509 1.58 2.56 1.74
# ℹ 559 more rows
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
# Fractal_Dimension <dbl>
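The calls that produced the three outputs above are not shown. A sketch of the likely pattern, using the built-in mtcars data as a stand-in and a column range (mpg:disp) chosen only for illustration. The standardize() definition here is an assumed z-score helper.

```r
library(dplyr)

# Assumed helper: z-score of a numeric vector
standardize <- function(x) (x - mean(x)) / sd(x)

# One function applied to several columns: one mean per column
means <- mtcars %>%
  summarize(across(mpg:disp, mean))

# A named list of functions: columns are named <column>_<function>
both <- mtcars %>%
  summarize(across(mpg:disp, list(mean = mean, sd = sd)))

# Inside mutate(), across() overwrites each selected column in place
scaled <- mtcars %>%
  mutate(across(mpg:disp, standardize))

names(both)
```

The first argument of across() picks the columns and the second says what to do with each one, so the loop index and the preallocation step both disappear.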
across()
where(is.numeric) selects all numeric columns.
where(is.character) selects all string columns.
where(is.Date) selects all date columns.
where(is.logical) selects all logical columns.
across() on unscaled cancer data
# A tibble: 569 × 12
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.236 M 1.10 -2.07 1.27 0.984 1.57 3.28 2.65
2 -0.236 M 1.83 -0.353 1.68 1.91 -0.826 -0.487 -0.0238
3 0.431 M 1.58 0.456 1.57 1.56 0.941 1.05 1.36
4 0.432 M -0.768 0.254 -0.592 -0.764 3.28 3.40 1.91
5 0.432 M 1.75 -1.15 1.78 1.82 0.280 0.539 1.37
6 -0.236 M -0.476 -0.835 -0.387 -0.505 2.24 1.24 0.866
7 -0.236 M 1.17 0.161 1.14 1.09 -0.123 0.0882 0.300
8 0.433 M -0.118 0.358 -0.0728 -0.219 1.60 1.14 0.0610
9 -0.236 M -0.320 0.588 -0.184 -0.384 2.20 1.68 1.22
10 0.433 M -0.473 1.10 -0.329 -0.509 1.58 2.56 1.74
# ℹ 559 more rows
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
# Fractal_Dimension <dbl>
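Notice that ID was standardized as well in the output above, because it is stored as a number. A sketch on a made-up tibble (toy, with hypothetical columns id, group, and score) showing the same effect and one way to exclude an identifier column. standardize() is again an assumed z-score helper.

```r
library(dplyr)

# Assumed helper: z-score of a numeric vector
standardize <- function(x) (x - mean(x)) / sd(x)

toy <- tibble(
  id    = c(101, 102, 103),
  group = c("a", "b", "a"),
  score = c(2, 4, 6)
)

# where(is.numeric) picks up id too, so the identifier gets rescaled
all_num <- toy %>%
  mutate(across(where(is.numeric), standardize))

# Combine where() with a negated column name to leave id alone
keep_id <- toy %>%
  mutate(across(where(is.numeric) & !id, standardize))

keep_id
```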
across() with a new function
Let’s say we want the range of each quantitative variable (max(x) - min(x)). We could name a new function, or we could do it directly in across().
# A tibble: 1 × 11
ID Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 911311832 21.1 29.6 145. 2358. 0.111 0.326 0.427
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
# Fractal_Dimension <dbl>
Or we can use an anonymous function
# A tibble: 1 × 11
ID Radius Texture Perimeter Area Smoothness Compactness Concavity
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 911311832 21.1 29.6 145. 2358. 0.111 0.326 0.427
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
# Fractal_Dimension <dbl>
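The code behind these two outputs is not shown, so here is a sketch of the options on a made-up tibble (toy). The function name range_width is hypothetical, and which inline form the slides actually used is an assumption.

```r
library(dplyr)

toy <- tibble(
  name   = c("a", "b", "c"),
  height = c(150, 160, 175),
  weight = c(50, 70, 65)
)

# Option 1: name a new function, then pass it to across()
range_width <- function(x) max(x) - min(x)
named <- toy %>%
  summarize(across(where(is.numeric), range_width))

# Option 2: write an anonymous function inside across()
# (in R 4.1 or later, \(x) is shorthand for function(x))
anon <- toy %>%
  summarize(across(where(is.numeric), function(x) max(x) - min(x)))

# Option 3: a formula-style lambda, where .x stands for the current column
lambda <- toy %>%
  summarize(across(where(is.numeric), ~ max(.x) - min(.x)))

named
```

All three return the same one-row summary; a named function pays off once the same calculation is needed in more than one place.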
Use summarize and across to find the range of any quantitative variables, and the number of levels of any factor variables, in the penguins dataset.
The good news: we can use across() to do lots of for-loop-type tasks in our {dplyr} pipelines.
The bad news: across() only works with {dplyr} functions like mutate or summarize.
The good news: there’s a more general-purpose solution in the {tidyverse} (which we’ll see next time!)