Iteration

Day 17

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Recap from last class: Standardizing a vector

Here’s a short function that standardizes an input vector: it subtracts the mean and divides by the standard deviation

standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}
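As a quick check of what the function returns, here is a minimal sketch on a made-up vector (the values in x are invented for illustration and are not from the cancer data). The standardized values should have mean 0 and standard deviation 1, and the missing value should stay missing.

```r
# Same definition as on the slide
standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# Toy input (invented for illustration), including a missing value
x <- c(2, 4, 6, NA)
z <- standardize(x)

z                      # -1  0  1 NA
mean(z, na.rm = TRUE)  # 0
sd(z, na.rm = TRUE)    # 1
```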

Now we can easily standardize all of our variables, right?

Cancer data

  • Set of digitized breast cancer image features
  • Row = an image of a tumor sample
  • Variables include the diagnosis (benign or malignant) and measurements
Rows: 569
Columns: 12
$ ID                <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786…
$ Class             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M…
$ Radius            <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450, 18.2…
$ Texture           <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.98, 20.…
$ Perimeter         <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, 119.60…
$ Area              <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, 1040.0…
$ Smoothness        <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0.12780…
$ Compactness       <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0.17000…
$ Concavity         <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0.15780…
$ Concave_Points    <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0.08089…
$ Symmetry          <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087, 0.17…
$ Fractal_Dimension <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0.07613…

Standardizing cancer data

scaled_cancer <- unscaled_cancer %>%
  mutate(
    Radius = standardize(Radius),
    Texture = standardize(Texture),
    Perimeter = standardize(Perimeter),
    Area = standardize(Area),
    Smoothness = standardize(Smoothness),
    Compactness = standardize(Compactness),
    Concavity = standardize(Concavity),
    Concave_Points = standardize(Concave_Points),
    Symmetry = standardize(Symmetry),
    Fractal_Dimension = standardize(Fractal_Dimension)
  )

You should consider iterating whenever you’ve copied and pasted a
block of code more than twice

Amanda Luby (adapted from Hadley Wickham’s advice on when to write a function)

Iteration

  • Programmatically repeat the code

  • We have two options for doing this:

    • using a for loop, or similar (imperative programming)

    • mapping with functional programming

for loops

  • for loops are the simplest and most common type of loop in R

  • Given a vector, iterate through its elements and evaluate the code block once for each
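A minimal sketch of that pattern, using a made-up vector (the name values and its contents are hypothetical): the loop variable takes each element in turn, and the code in braces runs once per element.

```r
# A made-up vector to iterate over
values <- c(17.99, 20.57, 19.69)

# Accumulate a running total, one element per pass through the loop
total <- 0
for (v in values) {
  total <- total + v
}

total  # same result as sum(values)
```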

Goal: Standardize all of the numeric columns via for loops.

for loops

(1) Set up an object to store results

scaled_cancer <- unscaled_cancer %>%
  mutate(
    Radius = NA,
    Texture = NA,
    Perimeter = NA,
    Area = NA,
    Smoothness = NA,
    Compactness = NA,
    Concavity = NA,
    Concave_Points = NA,
    Symmetry = NA,
    Fractal_dimension = NA
  )

scaled_cancer
# A tibble: 569 × 13
        ID Class Radius Texture Perimeter Area  Smoothness Compactness Concavity
     <dbl> <chr> <lgl>  <lgl>   <lgl>     <lgl> <lgl>      <lgl>       <lgl>    
 1  8.42e5 M     NA     NA      NA        NA    NA         NA          NA       
 2  8.43e5 M     NA     NA      NA        NA    NA         NA          NA       
 3  8.43e7 M     NA     NA      NA        NA    NA         NA          NA       
 4  8.43e7 M     NA     NA      NA        NA    NA         NA          NA       
 5  8.44e7 M     NA     NA      NA        NA    NA         NA          NA       
 6  8.44e5 M     NA     NA      NA        NA    NA         NA          NA       
 7  8.44e5 M     NA     NA      NA        NA    NA         NA          NA       
 8  8.45e7 M     NA     NA      NA        NA    NA         NA          NA       
 9  8.45e5 M     NA     NA      NA        NA    NA         NA          NA       
10  8.45e7 M     NA     NA      NA        NA    NA         NA          NA       
# ℹ 559 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
#   Fractal_Dimension <dbl>, Fractal_dimension <lgl>

for loops

(2) Determine an index over which to iterate

Columns 3 to 12 hold the numeric measurements, so our index is 3:12

# A tibble: 569 × 13
        ID Class Radius Texture Perimeter Area  Smoothness Compactness Concavity
     <dbl> <chr> <lgl>  <lgl>   <lgl>     <lgl> <lgl>      <lgl>       <lgl>    
 1  8.42e5 M     NA     NA      NA        NA    NA         NA          NA       
 2  8.43e5 M     NA     NA      NA        NA    NA         NA          NA       
 3  8.43e7 M     NA     NA      NA        NA    NA         NA          NA       
 4  8.43e7 M     NA     NA      NA        NA    NA         NA          NA       
 5  8.44e7 M     NA     NA      NA        NA    NA         NA          NA       
 6  8.44e5 M     NA     NA      NA        NA    NA         NA          NA       
 7  8.44e5 M     NA     NA      NA        NA    NA         NA          NA       
 8  8.45e7 M     NA     NA      NA        NA    NA         NA          NA       
 9  8.45e5 M     NA     NA      NA        NA    NA         NA          NA       
10  8.45e7 M     NA     NA      NA        NA    NA         NA          NA       
# ℹ 559 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
#   Fractal_Dimension <dbl>, Fractal_dimension <lgl>

for loops

(3) Iterate through the index, standardize and save results
Progress after the iteration with i = 3 (Radius):

for (i in 3:12){
  scaled_cancer[, i] <- standardize(unscaled_cancer[[i]])
}
# A tibble: 569 × 13
        ID Class Radius Texture Perimeter Area  Smoothness Compactness Concavity
     <dbl> <chr>  <dbl> <lgl>   <lgl>     <lgl> <lgl>      <lgl>       <lgl>    
1   842302 M      1.10  NA      NA        NA    NA         NA          NA       
2   842517 M      1.83  NA      NA        NA    NA         NA          NA       
3 84300903 M      1.58  NA      NA        NA    NA         NA          NA       
4 84348301 M     -0.768 NA      NA        NA    NA         NA          NA       
5 84358402 M      1.75  NA      NA        NA    NA         NA          NA       
# ℹ 564 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
#   Fractal_Dimension <dbl>, Fractal_dimension <lgl>

for loops

(3) Iterate through the index, standardize and save results
Progress after the iteration with i = 4 (Texture):

for (i in 3:12){
  scaled_cancer[, i] <- standardize(unscaled_cancer[[i]])
}
# A tibble: 569 × 13
        ID Class Radius Texture Perimeter Area  Smoothness Compactness Concavity
     <dbl> <chr>  <dbl>   <dbl> <lgl>     <lgl> <lgl>      <lgl>       <lgl>    
1   842302 M      1.10   -2.07  NA        NA    NA         NA          NA       
2   842517 M      1.83   -0.353 NA        NA    NA         NA          NA       
3 84300903 M      1.58    0.456 NA        NA    NA         NA          NA       
4 84348301 M     -0.768   0.254 NA        NA    NA         NA          NA       
5 84358402 M      1.75   -1.15  NA        NA    NA         NA          NA       
# ℹ 564 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
#   Fractal_Dimension <dbl>, Fractal_dimension <lgl>

for loops

(3) Iterate through the index, standardize and save results
Progress after the iteration with i = 5 (Perimeter):

for (i in 3:12){
  scaled_cancer[, i] <- standardize(unscaled_cancer[[i]])
}
# A tibble: 569 × 13
        ID Class Radius Texture Perimeter Area  Smoothness Compactness Concavity
     <dbl> <chr>  <dbl>   <dbl>     <dbl> <lgl> <lgl>      <lgl>       <lgl>    
1   842302 M      1.10   -2.07      1.27  NA    NA         NA          NA       
2   842517 M      1.83   -0.353     1.68  NA    NA         NA          NA       
3 84300903 M      1.58    0.456     1.57  NA    NA         NA          NA       
4 84348301 M     -0.768   0.254    -0.592 NA    NA         NA          NA       
5 84358402 M      1.75   -1.15      1.78  NA    NA         NA          NA       
# ℹ 564 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
#   Fractal_Dimension <dbl>, Fractal_dimension <lgl>

for loops

(3) Iterate through the index, standardize and save results
Progress after the iteration with i = 6 (Area):

for (i in 3:12){
  scaled_cancer[, i] <- standardize(unscaled_cancer[[i]])
}
# A tibble: 569 × 13
       ID Class Radius Texture Perimeter   Area Smoothness Compactness Concavity
    <dbl> <chr>  <dbl>   <dbl>     <dbl>  <dbl> <lgl>      <lgl>       <lgl>    
1  8.42e5 M      1.10   -2.07      1.27   0.984 NA         NA          NA       
2  8.43e5 M      1.83   -0.353     1.68   1.91  NA         NA          NA       
3  8.43e7 M      1.58    0.456     1.57   1.56  NA         NA          NA       
4  8.43e7 M     -0.768   0.254    -0.592 -0.764 NA         NA          NA       
5  8.44e7 M      1.75   -1.15      1.78   1.82  NA         NA          NA       
# ℹ 564 more rows
# ℹ 4 more variables: Concave_Points <lgl>, Symmetry <lgl>,
#   Fractal_Dimension <dbl>, Fractal_dimension <lgl>

Putting it all together:

# Preallocate storage
scaled_cancer <- unscaled_cancer %>%
  mutate(
    Radius = NA,
    Texture = NA,
    Perimeter = NA,
    Area = NA,
    Smoothness = NA,
    Compactness = NA,
    Concavity = NA,
    Concave_Points = NA,
    Symmetry = NA,
    Fractal_dimension = NA
  )

# Iterate over numeric columns and save
for (i in 3:12){
  scaled_cancer[, i] <- standardize(unscaled_cancer[[i]])
}

scaled_cancer
# A tibble: 569 × 13
       ID Class Radius Texture Perimeter   Area Smoothness Compactness Concavity
    <dbl> <chr>  <dbl>   <dbl>     <dbl>  <dbl>      <dbl>       <dbl>     <dbl>
 1 8.42e5 M      1.10   -2.07     1.27    0.984      1.57       3.28      2.65  
 2 8.43e5 M      1.83   -0.353    1.68    1.91      -0.826     -0.487    -0.0238
 3 8.43e7 M      1.58    0.456    1.57    1.56       0.941      1.05      1.36  
 4 8.43e7 M     -0.768   0.254   -0.592  -0.764      3.28       3.40      1.91  
 5 8.44e7 M      1.75   -1.15     1.78    1.82       0.280      0.539     1.37  
 6 8.44e5 M     -0.476  -0.835   -0.387  -0.505      2.24       1.24      0.866 
 7 8.44e5 M      1.17    0.161    1.14    1.09      -0.123      0.0882    0.300 
 8 8.45e7 M     -0.118   0.358   -0.0728 -0.219      1.60       1.14      0.0610
 9 8.45e5 M     -0.320   0.588   -0.184  -0.384      2.20       1.68      1.22  
10 8.45e7 M     -0.473   1.10    -0.329  -0.509      1.58       2.56      1.74  
# ℹ 559 more rows
# ℹ 4 more variables: Concave_Points <dbl>, Symmetry <dbl>,
#   Fractal_Dimension <dbl>, Fractal_dimension <lgl>
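The same three steps can be sketched on a tiny made-up data frame. This is a variation, not the approach on the slides: it assumes it is acceptable to set up storage by copying the input, so the columns do not have to be retyped by name, and it assigns with [[i]] rather than [, i]. The data frame toy and its values are invented for illustration.

```r
standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# Toy data (invented): two identifier columns, two measurement columns
toy <- data.frame(
  ID      = c(1, 2, 3),
  Class   = c("M", "B", "M"),
  Radius  = c(10, 20, 30),
  Texture = c(1, 2, 6)
)

# (1) Set up storage: start from a copy so untouched columns carry over
scaled_toy <- toy

# (2) Determine the index: columns 3 and 4 hold the measurements
idx <- 3:4

# (3) Iterate through the index, standardize, and save
for (i in idx) {
  scaled_toy[[i]] <- standardize(toy[[i]])
}

scaled_toy
```

Because the column names are never typed out, there is no opportunity for a capitalization mismatch to create an extra column.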

Your turn

Load the palmerpenguins package.

Write a for loop that calculates the mean of the numeric variables in the penguins data set and stores the means in a named vector.

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Preallocate storage to increase speed

add_to_vector <- function(n) {
  output <- NULL 
  for (i in 1:n) {
    output <- c(output, i)
  }
  output
}

add_to_vector(10000)
0.12 sec elapsed
add_to_vector2 <- function(n) {
  output <- vector("integer", n)  
  for (i in 1:n) {
   output[i] <- i
  }
  output
}

add_to_vector2(10000)
0.049 sec elapsed

The difference is more noticeable for larger n

add_to_vector <- function(n) {
  output <- NULL 
  for (i in 1:n) {
    output <- c(output, i)
  }
  output
}

add_to_vector(100000)
6.38 sec elapsed
add_to_vector2 <- function(n) {
  output <- vector("integer", n)  
  for (i in 1:n) {
   output[i] <- i
  }
  output
}

add_to_vector2(100000)
0.438 sec elapsed
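To reproduce this comparison yourself, here is a minimal sketch using base R's system.time() (the timings above appear to come from a different timing tool, so that is a swap). The function names grow and prealloc are hypothetical stand-ins for the two versions above. Timings depend on your machine, so none are shown here.

```r
# Version 1: grow the vector on every iteration
grow <- function(n) {
  output <- NULL
  for (i in seq_len(n)) {
    output <- c(output, i)
  }
  output
}

# Version 2: preallocate, then fill in place
prealloc <- function(n) {
  output <- integer(n)
  for (i in seq_len(n)) {
    output[i] <- i
  }
  output
}

n <- 20000
t_grow <- system.time(res_grow <- grow(n))
t_pre  <- system.time(res_pre <- prealloc(n))

identical(res_grow, res_pre)  # TRUE: same result either way
t_grow["elapsed"]             # elapsed seconds, growing
t_pre["elapsed"]              # elapsed seconds, preallocated
```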

Preallocating output

Here are a few useful ways to preallocate storage for a vector of length n:

output <- double(n)    # numeric vector
output <- integer(n)   # integer vector
output <- character(n) # character vector
output <- rep(NA, n)   # vector of NAs

You can preallocate a tibble with n rows by combining these vectors via tibble()

output_tbl <- tibble(
  a = double(n),
  b = integer(n),
  c = character(n),
  d = rep(NA, n)
)
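When each result is not a single number (for example, a whole summary or table), a list is a natural container. The following is a minimal sketch of preallocating one with vector("list", n); the loop body is a made-up placeholder.

```r
n <- 3

# A list with n empty (NULL) slots
output <- vector("list", n)
length(output)  # 3

# Fill each slot with [[ ]]; elements may differ in length or type
for (i in seq_len(n)) {
  output[[i]] <- seq_len(i)
}

output[[3]]  # 1 2 3
```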

Create index vector

Useful ways to create an index vector to iterate over:

  • 1:n - manual creation if you already have n stored

  • seq_along(df) - construct an index “along” the columns of your data frame/tibble

    e.g. seq_along(unscaled_cancer)
    ⚠️ Use this instead of 1:ncol(df) or 1:length(x), which count down (1, 0) when the length is 0

  • x - pass in the vector itself; the loop variable doesn’t need to be a numeric “index”

    e.g. colnames(unscaled_cancer)
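A minimal sketch of why seq_along() is the safer choice, followed by a loop over column names instead of positions. The objects empty and toy are made up for illustration.

```r
empty <- numeric(0)

1:length(empty)   # 1 0  (counts down, so a loop would run twice)
seq_along(empty)  # integer(0)  (a loop would not run at all)

count_bad <- 0
for (i in 1:length(empty)) count_bad <- count_bad + 1

count_good <- 0
for (i in seq_along(empty)) count_good <- count_good + 1

count_bad   # 2
count_good  # 0

# Iterating over names rather than positions
toy <- data.frame(a = 1:3, b = c(2, 4, 6))
for (nm in colnames(toy)) {
  cat(nm, ":", mean(toy[[nm]]), "\n")
}
```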

Your turn:

Revisit the {palmerpenguins} penguins data.

Write a for loop that calculates the summary() of each numeric variable and the table() of each factor variable.

Store the results in a list (it will have length 8).


across()

R has lots of alternatives to for loops

  • The basic data type in R is a vector
  • In R, a single integer is simply an integer vector of length 1
  • In more general-purpose programming languages, single items are stored differently than arrays of those items
  • This means that R is highly optimized for vectorized operations
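A minimal sketch of what vectorization buys you: the loop and the one-line vectorized expression below give the same result on a made-up vector.

```r
x <- c(1, 2, 3, 4)

# Loop version: preallocate, then fill element by element
doubled_loop <- double(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# Vectorized version: the operation applies to every element at once
doubled_vec <- x * 2

identical(doubled_loop, doubled_vec)  # TRUE

# A "single" value is already a vector of length 1
is.vector(5L)  # TRUE
length(5L)     # 1
```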

across()

  • .cols: columns to apply function to
  • .fns: function (or named list of functions) to apply
across(.cols, 
       .fns)

across() on unscaled cancer data

unscaled_cancer |>
  summarize(across(Radius:Fractal_Dimension, mean))
# A tibble: 1 × 10
  Radius Texture Perimeter  Area Smoothness Compactness Concavity Concave_Points
   <dbl>   <dbl>     <dbl> <dbl>      <dbl>       <dbl>     <dbl>          <dbl>
1   14.1    19.3      92.0  655.     0.0964       0.104    0.0888         0.0489
# ℹ 2 more variables: Symmetry <dbl>, Fractal_Dimension <dbl>
unscaled_cancer |>
  summarize(across(Radius:Fractal_Dimension, list(mean = mean, 
                                                  sd = sd)))
# A tibble: 1 × 20
  Radius_mean Radius_sd Texture_mean Texture_sd Perimeter_mean Perimeter_sd
        <dbl>     <dbl>        <dbl>      <dbl>          <dbl>        <dbl>
1        14.1      3.52         19.3       4.30           92.0         24.3
# ℹ 14 more variables: Area_mean <dbl>, Area_sd <dbl>, Smoothness_mean <dbl>,
#   Smoothness_sd <dbl>, Compactness_mean <dbl>, Compactness_sd <dbl>,
#   Concavity_mean <dbl>, Concavity_sd <dbl>, Concave_Points_mean <dbl>,
#   Concave_Points_sd <dbl>, Symmetry_mean <dbl>, Symmetry_sd <dbl>,
#   Fractal_Dimension_mean <dbl>, Fractal_Dimension_sd <dbl>
unscaled_cancer |>
  mutate(across(Radius:Fractal_Dimension, standardize))
# A tibble: 569 × 12
       ID Class Radius Texture Perimeter   Area Smoothness Compactness Concavity
    <dbl> <chr>  <dbl>   <dbl>     <dbl>  <dbl>      <dbl>       <dbl>     <dbl>
 1 8.42e5 M      1.10   -2.07     1.27    0.984      1.57       3.28      2.65  
 2 8.43e5 M      1.83   -0.353    1.68    1.91      -0.826     -0.487    -0.0238
 3 8.43e7 M      1.58    0.456    1.57    1.56       0.941      1.05      1.36  
 4 8.43e7 M     -0.768   0.254   -0.592  -0.764      3.28       3.40      1.91  
 5 8.44e7 M      1.75   -1.15     1.78    1.82       0.280      0.539     1.37  
 6 8.44e5 M     -0.476  -0.835   -0.387  -0.505      2.24       1.24      0.866 
 7 8.44e5 M      1.17    0.161    1.14    1.09      -0.123      0.0882    0.300 
 8 8.45e7 M     -0.118   0.358   -0.0728 -0.219      1.60       1.14      0.0610
 9 8.45e5 M     -0.320   0.588   -0.184  -0.384      2.20       1.68      1.22  
10 8.45e7 M     -0.473   1.10    -0.329  -0.509      1.58       2.56      1.74  
# ℹ 559 more rows
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
#   Fractal_Dimension <dbl>

Helper functions for across()

  • where(is.numeric) selects all numeric columns.
  • where(is.character) selects all string columns.
  • where(is.Date) selects all date columns (is.Date() comes from {lubridate}).
  • where(is.logical) selects all logical columns.
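A minimal sketch of these helpers on a made-up tibble with one column of each type. It assumes {dplyr} is installed; the tibble toy and its columns are invented for illustration.

```r
library(dplyr)

# Toy data (invented): one character, one numeric, one logical column
toy <- tibble(
  id     = c("a", "b", "c"),
  score  = c(1, 2, 3),
  passed = c(FALSE, TRUE, TRUE)
)

# Only the numeric column (score) is summarized
numeric_means <- toy |>
  summarize(across(where(is.numeric), mean))

# Only the character column (id) is modified
upper_ids <- toy |>
  mutate(across(where(is.character), toupper))

# Only the logical column (passed) is summed
n_passed <- toy |>
  summarize(across(where(is.logical), sum))

numeric_means
upper_ids
n_passed
```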

across() on unscaled cancer data

unscaled_cancer %>%
  mutate(across(where(is.numeric), standardize))
# A tibble: 569 × 12
       ID Class Radius Texture Perimeter   Area Smoothness Compactness Concavity
    <dbl> <chr>  <dbl>   <dbl>     <dbl>  <dbl>      <dbl>       <dbl>     <dbl>
 1 -0.236 M      1.10   -2.07     1.27    0.984      1.57       3.28      2.65  
 2 -0.236 M      1.83   -0.353    1.68    1.91      -0.826     -0.487    -0.0238
 3  0.431 M      1.58    0.456    1.57    1.56       0.941      1.05      1.36  
 4  0.432 M     -0.768   0.254   -0.592  -0.764      3.28       3.40      1.91  
 5  0.432 M      1.75   -1.15     1.78    1.82       0.280      0.539     1.37  
 6 -0.236 M     -0.476  -0.835   -0.387  -0.505      2.24       1.24      0.866 
 7 -0.236 M      1.17    0.161    1.14    1.09      -0.123      0.0882    0.300 
 8  0.433 M     -0.118   0.358   -0.0728 -0.219      1.60       1.14      0.0610
 9 -0.236 M     -0.320   0.588   -0.184  -0.384      2.20       1.68      1.22  
10  0.433 M     -0.473   1.10    -0.329  -0.509      1.58       2.56      1.74  
# ℹ 559 more rows
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
#   Fractal_Dimension <dbl>
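Notice in the output above that ID was standardized too, because it is stored as a double. If that is not what you want, one option is to combine where() with a negated column name in the selection. The following is a minimal sketch on a made-up tibble, assuming {dplyr} is installed; toy and its values are invented.

```r
library(dplyr)

standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# Toy data (invented) with a numeric identifier column
toy <- tibble(
  ID      = c(101, 102, 103),
  Class   = c("M", "B", "M"),
  Radius  = c(10, 20, 30),
  Texture = c(1, 2, 6)
)

# Select numeric columns, but leave ID out of the selection
scaled <- toy |>
  mutate(across(where(is.numeric) & !ID, standardize))

scaled
```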

across() with a new function

Let’s say we want the range of each quantitative variable (max(x) - min(x)). We could define a named function first, or we could write the function directly inside across()

unscaled_cancer %>%
  summarize(across(where(is.numeric), function(x) max(x) - min(x)))
# A tibble: 1 × 11
         ID Radius Texture Perimeter  Area Smoothness Compactness Concavity
      <dbl>  <dbl>   <dbl>     <dbl> <dbl>      <dbl>       <dbl>     <dbl>
1 911311832   21.1    29.6      145. 2358.      0.111       0.326     0.427
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
#   Fractal_Dimension <dbl>

Or we can write the same anonymous function with the shorthand \(x) syntax (available in R 4.1 and later)

unscaled_cancer %>%
  summarize(across(where(is.numeric), \(x) max(x) - min(x)))
# A tibble: 1 × 11
         ID Radius Texture Perimeter  Area Smoothness Compactness Concavity
      <dbl>  <dbl>   <dbl>     <dbl> <dbl>      <dbl>       <dbl>     <dbl>
1 911311832   21.1    29.6      145. 2358.      0.111       0.326     0.427
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
#   Fractal_Dimension <dbl>
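For comparison, here is a minimal sketch of the named-function route on a made-up tibble. The function name range_width is hypothetical, and the sketch assumes {dplyr} is installed.

```r
library(dplyr)

# A named helper: the width of the range of a numeric vector
range_width <- function(x) max(x) - min(x)

# Toy data (invented for illustration)
toy <- tibble(
  Radius  = c(10, 20, 30),
  Texture = c(1, 2, 6)
)

widths <- toy |>
  summarize(across(where(is.numeric), range_width))

widths  # Radius = 20, Texture = 5
```

A named function is worth the extra line when the same computation is reused in several places; for a one-off, the anonymous forms above are shorter.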

Your turn

Use summarize() and across() to find the range of each quantitative variable and the number of levels of each factor variable in the penguins dataset.


The good news: we can use across to do lots of for-loop-type tasks in our {dplyr} pipelines.

The bad news: across() only works inside {dplyr} verbs such as mutate() or summarize()

The good news: there’s a more general-purpose solution in the {tidyverse} (which we’ll see next time!)