Functions

Day 16

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Today

  1. Writing Functions

  2. Conditional Execution

Warm up

What does the mutate code below do?

df %>% slice_head(n = 3)
# A tibble: 3 × 4
      a       b     c       d
  <dbl>   <dbl> <dbl>   <dbl>
1  1.08 -1.46   0.243  0.239 
2  1.02  0.0941 0.431 -0.0970
3 -1.83  0.310  1.38  -0.841 
df |> mutate(
  a = (a - mean(a, na.rm = TRUE)) / 
    sd(a, na.rm = TRUE),
  b = (b - mean(b, na.rm = TRUE)) / 
    sd(b, na.rm = TRUE),
  c = (c - mean(a, na.rm = TRUE)) / 
    sd(c, na.rm = TRUE),
  d = (d - mean(d, na.rm = TRUE)) / 
    sd(d, na.rm = TRUE)
)

Rescaling variables

Standardizing/rescaling/normalizing variables is often an essential first step in predictive modeling

Example:

  • Set of digitized breast cancer image features

  • Each row in the data set represents an image of a tumor sample, including the diagnosis (benign or malignant) and several other measurements (nucleus texture, perimeter, area, etc.)

  • Diagnosis for each image was conducted by physicians.

# A tibble: 569 × 12
        ID Class Radius Texture Perimeter  Area Smoothness Compactness Concavity
     <dbl> <chr>  <dbl>   <dbl>     <dbl> <dbl>      <dbl>       <dbl>     <dbl>
 1  8.42e5 M       18.0    10.4     123.  1001      0.118       0.278     0.300 
 2  8.43e5 M       20.6    17.8     133.  1326      0.0847      0.0786    0.0869
 3  8.43e7 M       19.7    21.2     130   1203      0.110       0.160     0.197 
 4  8.43e7 M       11.4    20.4      77.6  386.     0.142       0.284     0.241 
 5  8.44e7 M       20.3    14.3     135.  1297      0.100       0.133     0.198 
 6  8.44e5 M       12.4    15.7      82.6  477.     0.128       0.17      0.158 
 7  8.44e5 M       18.2    20.0     120.  1040      0.0946      0.109     0.113 
 8  8.45e7 M       13.7    20.8      90.2  578.     0.119       0.164     0.0937
 9  8.45e5 M       13      21.8      87.5  520.     0.127       0.193     0.186 
10  8.45e7 M       12.5    24.0      84.0  476.     0.119       0.240     0.227 
11  8.46e5 M       16.0    23.2     103.   798.     0.0821      0.0667    0.0330
12  8.46e7 M       15.8    17.9     104.   781      0.0971      0.129     0.0995
13  8.46e5 M       19.2    24.8     132.  1123      0.0974      0.246     0.206 
14  8.46e5 M       15.8    24.0     104.   783.     0.0840      0.100     0.0994
15  8.47e7 M       13.7    22.6      93.6  578.     0.113       0.229     0.213 
16  8.48e7 M       14.5    27.5      96.7  659.     0.114       0.160     0.164 
# ℹ 553 more rows
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
#   Fractal_Dimension <dbl>

Standardizing variables

Goal: standardize the 10 quantitative variables in the data set

Approach: subtract sample mean, divide by sample standard deviation

scaled_cancer <- unscaled_cancer %>%
  mutate(
    Radius = (Radius - mean(Radius)) / sd(Radius),
    Texture = (Texture - mean(Texture)) / sd(Texture),
    Perimeter = (Perimeter - mean(Perimeter)) / sd(Perimeter),
    Area = (Area - mean(Area)) / sd(Area),
    Smoothness = (Smoothness - mean(Smoothness)) / sd(Smoothness),
    Compactness = (Compactness - mean(Compactness)) / sd(Compactness),
    Concavity = (Concavity - mean(Concavity)) / sd(Concavity),
    Concave_Points = (Concave_Points - mean(Concave_Points)) / sd(Concave_Points),
    Symmetry = (Symmetry - mean(Symmetry)) / sd(Symmetry),
    Fractal_Dimension = (Fractal_Dimension - mean(Fractal_Dimension)) / sd(Fractal_Dimension) 
  )

Don’t repeat yourself (DRY)

  • We’ve copied, pasted, and edited the same code chunk many times

  • That’s a lot of work when the only change in the code is the variable name!

scaled_cancer <- unscaled_cancer %>%
  mutate(
    Radius = (Radius - mean(Radius)) / sd(Radius), 
    Texture = (Texture - mean(Texture)) / sd(Texture), 
    Perimeter = (Perimeter - mean(Perimeter)) / sd(Perimeter), 
    Area = (Area - mean(Area)) / sd(Area), 
    Smoothness = (Smoothness - mean(Smoothness)) / sd(Smoothness), 
    Compactness = (Compactness - mean(Compactness)) / sd(Compactness),
    Concavity = (Concavity - mean(Concavity)) / sd(Concavity), 
    Concave_Points = (Concave_Points - mean(Concave_Points)) / sd(Concave_Points), 
    Symmetry = (Symmetry - mean(Symmetry)) / sd(Symmetry), 
    Fractal_Dimension = (Fractal_Dimension - mean(Fractal_Dimension)) / sd(Fractal_Dimension) 
  )

You should consider writing a function
whenever you’ve copied and pasted a
block of code more than twice

—Hadley Wickham

Why functions?

Automate common tasks in a more powerful and more general way than copy-and-pasting

  • You can give a function an evocative name that makes your code easier to understand.

  • As requirements change, you only need to update code in one place, instead of many.

  • You eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another).

Down the line — Improve your reach as a data scientist by writing functions (and packages!) that others use

Anatomy of a function

  • A short but informative name, preferably a verb
  • Arguments of the function inside function()
  • Place the code you have developed in the body of the function, a { block that immediately follows function(...).
calc_this <- function(..arguments..) { 
  # do stuff with arguments 
  # last result will be returned 
}

Building standardize

  • A short but informative name, preferably a verb
  • Arguments of the function inside function()
  • Place the code you have developed in the body of the function, a { block that immediately follows function(...).
standardize <- function(x) { 
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Does it work?

Standardized Compactness

std_compact <- standardize(unscaled_cancer$Compactness)
head(std_compact)
[1]  3.2806281 -0.4866435  1.0519999  3.3999174  0.5388663  1.2432416

Unstandardized Compactness

head(unscaled_cancer$Compactness)
[1] 0.27760 0.07864 0.15990 0.28390 0.13280 0.17000
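
A quick sanity check, sketched here on a small made-up vector (toy is hypothetical and not part of the cancer data): after standardizing, the non-missing values should have mean approximately 0 and standard deviation 1, and missing values should stay missing.

```r
# Same definition as on the slide
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical toy data, including one missing value
toy <- c(2, 4, 6, 8, NA)

z <- standardize(toy)

# The standardized values should be centered at 0 with unit spread
mean(z, na.rm = TRUE)  # approximately 0
sd(z, na.rm = TRUE)    # approximately 1

# The output has the same length as the input, and the NA is preserved
length(z) == length(toy)
is.na(z[5])
```

Checking properties like these on data where you know the answer is often easier than eyeballing the first few values of a real column.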

Your turn

Turn the following code snippets into functions. Think about what each function does before you begin, and be sure to give each function an informative name.

  1. mean(is.na(x))

  2. x / sum(x, na.rm = TRUE)

  3. sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

You can test your functions on variables from the palmerpenguins::penguins dataset (e.g. my_func(penguins$flipper_length_mm)).

Return values

By default, a function returns the value of the last expression it evaluates.

standardize <- function(x) { 
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

You can force a function to return early using return().

This is often paired with conditional execution (more later).

standardize <- function(x) { 
 return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}
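
The return() call above sits on the last line, so it does not change what the function returns. One possible sketch of a genuinely early return is below. The name standardize_safe and the decision to hand back non-numeric input unchanged are assumptions made for illustration, not part of the original slides.

```r
# Hypothetical variant of standardize() with a guard clause
standardize_safe <- function(x) {
  # Nothing sensible to compute for non-numeric input,
  # so exit early and return the input unchanged
  if (!is.numeric(x)) {
    return(x)
  }
  # Only reached when x is numeric
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

standardize_safe(c("a", "b"))  # returned as-is: "a" "b"
standardize_safe(c(1, 2, 3))   # mean 2, sd 1, so the result is -1 0 1
```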

Adding arguments

We can include na.rm as an argument to give the user control over it

standardize <- function(x, na.rm) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}
std_compact <- standardize(unscaled_cancer$Compactness, na.rm = TRUE)
head(std_compact)
[1]  3.2806281 -0.4866435  1.0519999  3.3999174  0.5388663  1.2432416

Setting defaults

Unless a default value is set for an argument, R will require a value to be specified in the function call

std_compact <- standardize(unscaled_cancer$Compactness)
 Error in mean(x, na.rm = na.rm) : 
  argument "na.rm" is missing, with no default 

Setting defaults

To set a default, set the value in the function definition

standardize <- function(x, na.rm = TRUE) { 
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}
std_compact <- standardize(unscaled_cancer$Compactness)
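
A default can still be overridden in the call. The sketch below uses a made-up vector (toy is hypothetical) to show both behaviours side by side.

```r
standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# Hypothetical toy data with one missing value
toy <- c(1, 2, 3, NA)

# Uses the default na.rm = TRUE: the NA is ignored when computing mean and sd
with_default <- standardize(toy)
with_default

# Overrides the default: mean() and sd() both return NA, so every value is NA
without_removal <- standardize(toy, na.rm = FALSE)
without_removal
```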

Your turn

  • Write a function called column_mean that takes a data set and column name (as a string) as inputs and returns the column mean as output. (Hint: access the column using [[)

  • You should also include a na.rm argument and set the default to TRUE so that NAs are removed from the calculation by default.

  • Test your function on the mtcars data set.

column_mean(mtcars, "cyl")
[1] 6.1875

Plotting functions

p1 = unscaled_cancer %>%
  ggplot(aes(x = Radius, fill = Class)) + 
  geom_histogram(col = "white", bins = 20, alpha = .7) 

p2 = unscaled_cancer %>%
  ggplot(aes(x = Texture, fill = Class)) + 
  geom_histogram(col = "white", bins = 20, alpha = .7) 

p3 = unscaled_cancer %>%
  ggplot(aes(x = Perimeter, fill = Class)) + 
  geom_histogram(col = "white", bins = 20, alpha = .7) 

p4 = unscaled_cancer %>%
  ggplot(aes(x = Area, fill = Class)) + 
  geom_histogram(col = "white", bins = 20, alpha = .7) 

(p1 + p2)/(p3 + p4)

Custom histogram() function

histogram <- function(df, var, bins = 20) {
  df %>%
    ggplot(aes(x = var, fill = Class)) + 
    geom_histogram(col = "white", bins = bins, alpha = .7) 
}

histogram(unscaled_cancer, Radius)
Error in `geom_histogram()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error:
! object 'Radius' not found

🤗 embracing 🤗

This issue arises in {dplyr} and {ggplot2} functions because they use a special kind of evaluation (tidy evaluation), which allows us to refer to column names directly, as if they were variables.

To write our own functions that use these packages, we use 🤗 embracing 🤗 (wrap the argument in double braces, {{ }}). This tells R to use the column name stored inside the argument, rather than treating the argument’s name as the literal variable name.

One way to remember what’s happening is to think of {{ }} as looking down a tunnel: {{ var }} will make a dplyr function look inside of var rather than looking for a variable called var.
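
The same embracing pattern applies to {dplyr} verbs. A minimal sketch follows, assuming {dplyr} is installed. The helper name mean_of is hypothetical and not part of the slides.

```r
library(dplyr)

# Hypothetical helper: mean of any column, passed as an unquoted name
mean_of <- function(df, var) {
  df %>%
    summarise(mean = mean({{ var }}, na.rm = TRUE))
}

# Without the double braces, summarise() would look for a column
# literally called `var`; with them, it looks inside var and finds cyl
mean_of(mtcars, cyl)  # the mean of cyl is 6.1875
```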

histogram() function

histogram <- function(df, var, bins = 20) {
  df %>%
    ggplot(aes(x = {{var}}, fill = Class)) + 
    geom_histogram(col = "white", bins = bins, alpha = .7) 
}

histogram(unscaled_cancer, Radius)

Your turn

Write a plotting function that makes a scatterplot of any two quantitative variables, coloring the points by a 3rd categorical variable.

Test your function with the following examples:

scatterplot(unscaled_cancer, 
            Radius, 
            Texture, 
            Class)

scatterplot(penguins, 
            bill_length_mm, 
            bill_depth_mm, 
            species)

Functions with conditional execution

  • Sometimes, we want our function to do different things depending on the arguments we feed it.
  • Example: a histobar function that makes a histogram if the variable is quantitative and a barchart if the variable is categorical.
histobar(unscaled_cancer, Radius)

histobar(unscaled_cancer, Class)

Conditional execution

The if() statement allows us to control which statements are executed.

if(condition) {
  # commands when TRUE
}
if(condition) {
  # commands when TRUE
} else {
  # commands when FALSE
}

Conditional execution

A basic example

x <- 5
if (x > 5) {
  x <- x + 1
}
x
[1] 5
x <- 6
if (x > 5) {
  x <- x + 1
}
x
[1] 7

Conditional execution

Another basic example

x <- 5
if (x > 5) {
  x <- x + 1
} else {
  x <- x - 1
}

x
[1] 4
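
Placing an if/else inside a function lets the function choose its behaviour from its input. The sketch below is one way to illustrate this. The name describe_var is hypothetical, and it works on a plain vector rather than on a column of a data frame.

```r
# Hypothetical helper: label a vector by its type
describe_var <- function(x) {
  if (is.numeric(x)) {
    "quantitative"
  } else {
    "categorical"
  }
}

describe_var(mtcars$mpg)    # a numeric vector: "quantitative"
describe_var(iris$Species)  # a factor: "categorical"
```

Whichever branch runs supplies the last evaluated expression, so that string is the function's return value.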

histobar

histobar <- function(df, var){
  if(is.numeric(class({{var}}))){
    ggplot(df, aes(x = {{var}})) + 
      geom_histogram(col = "white")
  }
  else {
    ggplot(df, aes(y = {{var}})) + 
      geom_bar()
  }
}
histobar(unscaled_cancer, Radius)
Error: object 'Radius' not found

histobar

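{{var}} only works inside tidy evaluation functions such as aes(). Here it is evaluated on its own, so R looks for an object called Radius and fails. is.numeric() needs the column’s values as a vector, so the column has to be pulled out of df first.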

histobar <- function(df, var){
  if(is.numeric(class({{var}}))){
    ggplot(df, aes(x = {{var}})) + 
      geom_histogram(col = "white")
  }
  else {
    ggplot(df, aes(y = {{var}})) + 
      geom_bar()
  }
}

histobar

histobar <- function(df, var){
  if(is.numeric(df %>% pull({{var}}))){
    ggplot(df, aes(x = {{var}})) + 
      geom_histogram(col = "white")
  } else {
    ggplot(df, aes(y = {{var}})) + 
      geom_bar()
  }
}

Testing histobar

histobar(unscaled_cancer, Radius)

histobar(unscaled_cancer, Class)

histobar(palmerpenguins::penguins, species)

histobar(palmerpenguins::penguins, flipper_length_mm)

Your turn

Edit your scatterplot function to include an argument called draw_line. If draw_line is TRUE, your function should add a line of best fit to your scatterplot. Test your function with the following examples:

scatterplot(unscaled_cancer, 
            Radius, 
            Texture, 
            Class, 
            draw_line = FALSE)

scatterplot(palmerpenguins::penguins, 
            bill_length_mm, 
            bill_depth_mm, 
            species, 
            draw_line = TRUE)

Coding style

It’s ok to drop the curly braces if you have a very short if statement that can fit on one line (no more than 80 characters!)

if (y < 20) "Too low" 
if (y < 20) "Too low" else "Too high"
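
Because if is an expression in R, a one-line if/else can be assigned directly. The sketch below uses made-up values (y, label and no_else are hypothetical names) and also shows what happens when there is no else branch.

```r
y <- 12

# The value of the chosen branch is assigned to label
label <- if (y < 20) "Too low" else "Too high"
label  # "Too low"

# With no else branch and a FALSE condition, the result is NULL
no_else <- if (y > 20) "Too high"
is.null(no_else)  # TRUE
```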

Coding style

  • Part of writing reproducible and shareable code is following good style guidelines.

  • Mostly, this means choosing good object names and using white space in a consistent and clear way.

  • The tidyverse style guide has guidelines for writing functions and if statements.