16-functions

Author: Prof Amanda Luby
Affiliation: Carleton College; Stat 220 - Spring 2025

library(tidyverse)
library(palmerpenguins) # load penguins data

Load the Data

unscaled_cancer <- read_csv("https://raw.githubusercontent.com/UBC-DSCI/introduction-to-datascience/refs/heads/main/data/wdbc_unscaled.csv")

unscaled_cancer
# A tibble: 569 × 12
        ID Class Radius Texture Perimeter  Area Smoothness Compactness Concavity
     <dbl> <chr>  <dbl>   <dbl>     <dbl> <dbl>      <dbl>       <dbl>     <dbl>
 1  8.42e5 M       18.0    10.4     123.  1001      0.118       0.278     0.300 
 2  8.43e5 M       20.6    17.8     133.  1326      0.0847      0.0786    0.0869
 3  8.43e7 M       19.7    21.2     130   1203      0.110       0.160     0.197 
 4  8.43e7 M       11.4    20.4      77.6  386.     0.142       0.284     0.241 
 5  8.44e7 M       20.3    14.3     135.  1297      0.100       0.133     0.198 
 6  8.44e5 M       12.4    15.7      82.6  477.     0.128       0.17      0.158 
 7  8.44e5 M       18.2    20.0     120.  1040      0.0946      0.109     0.113 
 8  8.45e7 M       13.7    20.8      90.2  578.     0.119       0.164     0.0937
 9  8.45e5 M       13      21.8      87.5  520.     0.127       0.193     0.186 
10  8.45e7 M       12.5    24.0      84.0  476.     0.119       0.240     0.227 
# ℹ 559 more rows
# ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
#   Fractal_Dimension <dbl>

Your turn: 3 functions

Turn the following code snippets into functions. Think about what each function does before you begin, and be sure to give each function an informative name.

  1. mean(is.na(x))

  2. x / sum(x, na.rm = TRUE)

  3. sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
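One possible set of answers is sketched below. The function names (`prop_missing`, `prop_of_total`, `coef_variation`) are illustrative choices, not names required by the exercise; any name that describes what the function computes would do.

```r
# 1. Proportion of values in x that are missing
prop_missing <- function(x) {
  mean(is.na(x))
}

# 2. Each value as a share of the total (NAs are ignored when summing)
prop_of_total <- function(x) {
  x / sum(x, na.rm = TRUE)
}

# 3. Coefficient of variation: standard deviation relative to the mean
coef_variation <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

prop_missing(c(1, NA, 3, NA))
prop_of_total(c(1, 1, 2))
coef_variation(c(1, 2, 3))
```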

Your turn: column_mean

  • Write a function called column_mean that takes a data set and column name (as a string) as inputs and returns the column mean as output. (Hint: access the column using [[)

  • You should also include a na.rm argument and set the default to TRUE so that NAs are removed from the calculation by default.

  • Test your function on the mtcars data set.

> column_mean(mtcars, "cyl")
[1] 6.1875
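
A minimal sketch of one way to write this function, following the hint to use `[[` with a column name stored as a string. The argument names `data` and `column` are assumptions; only `column_mean` and `na.rm` come from the exercise.

```r
column_mean <- function(data, column, na.rm = TRUE) {
  # data[[column]] extracts the column as a vector, given its name as a string
  mean(data[[column]], na.rm = na.rm)
}

column_mean(mtcars, "cyl")
```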

Your turn: scatterplot

Write a plotting function that makes a scatterplot of any two quantitative variables, coloring the points by a third, categorical variable.

Test your function with the following examples:

scatterplot(unscaled_cancer, Radius, Texture, Class)

scatterplot(penguins, bill_length_mm, bill_depth_mm, species)
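
Because the test calls pass column names unquoted, one approach is to "embrace" each argument with `{{ }}` inside `aes()`. The sketch below assumes that approach; the argument names `x`, `y`, and `color` are illustrative, and the built-in `iris` data stands in for the data sets above so the snippet runs on its own.

```r
library(ggplot2)

scatterplot <- function(data, x, y, color) {
  # {{ }} forwards the unquoted column names into aes()
  ggplot(data, aes(x = {{ x }}, y = {{ y }}, color = {{ color }})) +
    geom_point()
}

p <- scatterplot(iris, Sepal.Length, Sepal.Width, Species)
p
```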

Your turn: scatterplot 2

Edit your scatterplot function to include an argument called draw_line. If draw_line is TRUE, your function should add a line of best fit to your scatterplot. Test your function with the following examples:

scatterplot(unscaled_cancer, Radius, Texture, Class, draw_line = FALSE)

scatterplot(penguins, bill_length_mm, bill_depth_mm, species, draw_line = TRUE)
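
One way to add the optional line is to build the plot, then conditionally add a `geom_smooth()` layer. In this sketch the default `draw_line = FALSE` and the choice of a linear fit without a confidence band (`method = "lm", se = FALSE`) are assumptions; the exercise does not specify either. The built-in `iris` data is again used as a stand-in.

```r
library(ggplot2)

scatterplot <- function(data, x, y, color, draw_line = FALSE) {
  p <- ggplot(data, aes(x = {{ x }}, y = {{ y }}, color = {{ color }})) +
    geom_point()

  # Add the fitted line only when requested
  if (draw_line) {
    p <- p + geom_smooth(method = "lm", se = FALSE)
  }

  p
}

with_line <- scatterplot(iris, Sepal.Length, Sepal.Width, Species, draw_line = TRUE)
without_line <- scatterplot(iris, Sepal.Length, Sepal.Width, Species, draw_line = FALSE)
```

Since the color aesthetic is set in the top-level `aes()`, this version fits a separate line for each group. Moving `color` into `geom_point()` only would give a single line for all points.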

More practice (if time)

1

Consider the following function and the subsequent call to it. The call produces an error. What causes the error, and how can you fix it?

summarize_species <- function(pattern = "Human") {
  starwars |>
    filter(species == pattern) |>
    summarize(
      num_people = n(),
      avg_height = mean(height, na.rm = TRUE)
    )
}

summarize_species(Wookiee)

2

You can tell R to print messages using print("your message here"). Edit the function above to first check if pattern is one of the species in the dataset. If it’s not, your function should print an informative message and then return().
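
A sketch of one possible version is below. It assumes the check is done with `%in%` against `starwars$species`; the wording of the message is made up for illustration. Note that the species is passed as a quoted string in the calls.

```r
library(dplyr)

summarize_species <- function(pattern = "Human") {
  # Exit early if the requested species never appears in the data
  if (!pattern %in% starwars$species) {
    print(paste0("No species named '", pattern, "' found in starwars."))
    return()
  }

  starwars |>
    filter(species == pattern) |>
    summarize(
      num_people = n(),
      avg_height = mean(height, na.rm = TRUE)
    )
}

summarize_species("Wookiee")  # species given as a string
summarize_species("Wookie")   # misspelled: prints the message, returns NULL
```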

3

Given a vector of birthdates, write a function to compute the age in years. See if your function works on the easy vector first, then try the hard vector.

hard <- c("15 Feb 1992", "10 March 1985", "03/28/2024", "09/30/2005")
easy <- mdy(c("2/15/1992", "3/10/1985", "3/28/2024", "9/30/2005"))
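
One possible approach is sketched below using lubridate. The function name `compute_age` and the `as_of` argument are hypothetical, introduced so the reference date can be fixed when checking results. For character input, the sketch assumes that `parse_date_time()` with the orders `"dmy"` and `"mdy"` can resolve each string in the hard vector; dates that are valid under both orders (such as "03/04/2024") would be ambiguous and need more care.

```r
library(lubridate)

compute_age <- function(birthdate, as_of = today()) {
  # Character input: accept day-month-year or month-day-year strings
  if (is.character(birthdate)) {
    birthdate <- as_date(parse_date_time(birthdate, orders = c("dmy", "mdy")))
  }

  # Whole years elapsed between each birthdate and the reference date
  floor(time_length(interval(birthdate, as_of), "years"))
}

easy <- mdy(c("2/15/1992", "3/10/1985", "3/28/2024", "9/30/2005"))
hard <- c("15 Feb 1992", "10 March 1985", "03/28/2024", "09/30/2005")

compute_age(easy)
compute_age(hard)
```

Flooring the interval length in years, rather than dividing a day count by 365, keeps the age from ticking over before the actual birthday.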