Data Wrangling: Tidy Data + Code Style

Day 09

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Last time: Bakeoff ratings

  • Ratings data for each episode in series 1-8
# A tibble: 8 × 11
  series    e1    e2    e3    e4    e5    e6    e7    e8    e9   e10
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1      1  2.24  3     3     2.6   3.03  2.75 NA    NA    NA    NA   
2      2  3.1   3.53  3.82  3.6   3.83  4.25  4.42  5.06 NA    NA   
3      3  3.85  4.6   4.53  4.71  4.61  4.82  5.1   5.35  5.7   6.74
4      4  6.6   6.65  7.17  6.82  6.95  7.32  7.76  7.41  7.41  9.45
5      5  8.51  8.79  9.28 10.2   9.95 10.1  10.3   9.02 10.7  13.5 
6      6 11.6  11.6  12.0  12.4  12.4  12    12.4  11.1  12.6  15.0 
7      7 13.6  13.4  13.0  13.3  13.1  13.1  13.4  13.3  13.4  15.9 
8      8  9.46  9.23  8.68  8.55  8.61  8.61  9.01  8.95  9.03 10.0 

Last time: tidying the data

bakeoff_ratings %>%
  pivot_longer(cols = e1:e10, names_to = "episode", values_to = "rating") %>%
  mutate(
    episode = parse_number(episode) 
  )
# A tibble: 80 × 3
   series episode rating
    <dbl>   <dbl>  <dbl>
 1      1       1   2.24
 2      1       2   3   
 3      1       3   3   
 4      1       4   2.6 
 5      1       5   3.03
 6      1       6   2.75
 7      1       7  NA   
 8      1       8  NA   
 9      1       9  NA   
10      1      10  NA   
# ℹ 70 more rows
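
As an aside, the clean-up step can also be folded into the pivot itself. The sketch below uses a small made-up tibble (`toy_ratings`, an assumption for illustration, not the real Bakeoff data) and the `names_prefix` and `names_transform` arguments of `pivot_longer()` to drop the leading "e" and convert the episode to an integer in one call. It is one possible alternative to `parse_number()`, not a replacement for the approach above.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data in the same wide shape as bakeoff_ratings
toy_ratings <- tibble(
  series = c(1, 2),
  e1 = c(2.24, 3.10),
  e2 = c(3.00, 3.53)
)

toy_long <- toy_ratings |>
  pivot_longer(
    cols = e1:e2,
    names_to = "episode",
    names_prefix = "e",                            # drop the leading "e"
    names_transform = list(episode = as.integer),  # "1" becomes 1L
    values_to = "rating"
  )

toy_long
```

Either approach ends with a numeric episode column; which one reads better is a matter of taste.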

Try it: messy_ratings2

Tidy this data set by

  1. Selecting the series and e*_7day columns
  2. Pivoting the data to add a column for episode and a column for rating (we’ll clean up the episode column later)
messy_ratings2 <- read_csv("https://stat220-s25.github.io/data/messy_ratings2.csv")
messy_ratings2
# A tibble: 8 × 21
  series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
   <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
1      1    2.24    NA       3       NA       3       NA       2.6     NA   
2      2    3.1     NA       3.53    NA       3.82    NA       3.6     NA   
3      3    3.85    NA       4.6     NA       4.53    NA       4.71    NA   
4      4    6.6     NA       6.65    NA       7.17    NA       6.82    NA   
5      5    8.51    NA       8.79    NA       9.28    NA      10.2     NA   
6      6   11.6     11.7    11.6     11.8    12.0     NA      12.4     12.7 
7      7   13.6     13.9    13.4     13.7    13.0     13.4    13.3     13.9 
8      8    9.46     9.72    9.23     9.53    8.68     9.06    8.55     8.87
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
#   e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
#   e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
#   e10_28day <dbl>

Cleaning episode

ratings2 <- messy_ratings2 %>%
  select(series, contains("7day")) %>%
  pivot_longer(contains("7day"), 
               names_to = "episode", 
               values_to = "rating")
ratings2
# A tibble: 80 × 3
   series episode  rating
    <dbl> <chr>     <dbl>
 1      1 e1_7day    2.24
 2      1 e2_7day    3   
 3      1 e3_7day    3   
 4      1 e4_7day    2.6 
 5      1 e5_7day    3.03
 6      1 e6_7day    2.75
 7      1 e7_7day   NA   
 8      1 e8_7day   NA   
 9      1 e9_7day   NA   
10      1 e10_7day  NA   
# ℹ 70 more rows

separate()

  • data (as usual)
  • col: column to separate
  • into: names of new columns to create
  • sep: separator between columns (if omitted, any run of non-alphanumeric characters)
separate(
  data, 
  col, 
  into = c("col1", "col2"),
  sep 
  )
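
To make the arguments above concrete, here is a minimal sketch on a made-up tibble (`toy` and its `label` column are hypothetical, chosen only to mimic the episode names in this deck). It passes `sep = "_"` explicitly so the split point is visible.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one column holding two pieces of information
toy <- tibble(
  id = 1:3,
  label = c("e1_7day", "e2_7day", "e10_28day")
)

toy_sep <- separate(
  toy,
  col = label,                    # column to separate
  into = c("episode", "period"),  # names of the new columns
  sep = "_"                       # split at the underscore
)

toy_sep
```

The original `label` column is replaced by `episode` and `period`, and both new columns are still character columns.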

Cleaning episode

ratings2 %>%
  separate( 
    col = episode, 
    into = c("episode", "period") 
  )
# A tibble: 80 × 4
   series episode period rating
    <dbl> <chr>   <chr>   <dbl>
 1      1 e1      7day     2.24
 2      1 e2      7day     3   
 3      1 e3      7day     3   
 4      1 e4      7day     2.6 
 5      1 e5      7day     3.03
 6      1 e6      7day     2.75
 7      1 e7      7day    NA   
 8      1 e8      7day    NA   
 9      1 e9      7day    NA   
10      1 e10     7day    NA   
# ℹ 70 more rows
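
A possible shortcut, sketched under assumptions: `pivot_longer()` can split the column names while pivoting if `names_to` is given two names and `names_sep` is set. The tibble `toy_messy` below is made up to resemble `messy_ratings2`, and unlike the steps above this version keeps the 28-day columns as well.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data shaped like messy_ratings2
toy_messy <- tibble(
  series = c(6, 7),
  e1_7day = c(11.6, 13.6),
  e1_28day = c(11.7, 13.9),
  e2_7day = c(11.6, 13.4),
  e2_28day = c(11.8, 13.7)
)

toy_tidy <- toy_messy |>
  pivot_longer(
    cols = -series,
    names_to = c("episode", "period"),  # one name per piece of the column name
    names_sep = "_",                    # where to split each column name
    values_to = "rating"
  )

toy_tidy
```

This does the work of `pivot_longer()` followed by `separate()` in a single step; the two-step version is often easier to debug while learning.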

Wrap it up

  • Clean the episode and period columns using parse_number
  • Make a line plot with episode on the x-axis, rating on the y-axis, colored by series. (You may also need to map the group aesthetic to series)
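
One way the two steps above might look, as a sketch rather than the official solution. The tibble `toy_ratings2` is a hypothetical stand-in for the separated ratings data, so the code runs without downloading anything.

```r
library(dplyr)
library(readr)
library(ggplot2)
library(tibble)

# Hypothetical stand-in for the separated ratings2 data
toy_ratings2 <- tibble(
  series = c(1, 1, 2, 2),
  episode = c("e1", "e2", "e1", "e2"),
  period = c("7day", "7day", "7day", "7day"),
  rating = c(2.24, 3.00, 3.10, 3.53)
)

# Step 1: keep only the numeric part of each column
toy_clean <- toy_ratings2 |>
  mutate(
    episode = parse_number(episode),
    period = parse_number(period)
  )

# Step 2: one line per series; group is needed because series is numeric
p <- toy_clean |>
  ggplot(aes(x = episode, y = rating, color = series, group = series)) +
  geom_line()

p
```

Wrapping `series` in `factor()` inside `aes()` is an optional variation that gives a discrete legend instead of a continuous color bar.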

Code Style

Which is easier to read?

group_by(colleges,region) %>% mutate(z_cost=(cost - mean(cost, na.rm=TRUE)) / sd(cost,na.rm = TRUE)) %>% ungroup() 
# A tibble: 187 × 14
   unitid school  type  city  state region admission_rate   act undergrads  cost
    <dbl> <chr>   <chr> <chr> <chr> <chr>           <dbl> <dbl>      <dbl> <dbl>
 1 228343 Southw… priv… Geor… TX    South…          0.490    26       1507 55886
 2 177719 Barnes… priv… Sain… MO    Plains         NA        NA        569    NA
 3 367884 Hodges… priv… Fort… FL    South…          0.612    NA        832 27425
 4 149781 Wheato… priv… Whea… IL    Great…          0.848    29       2358 49214
 5 135364 Luther… priv… Lith… GA    South…          0.5      NA        235    NA
 6 212601 Gannon… priv… Erie  PA    Mid E…          0.755    23       2866 44896
 7 133979 Florid… priv… Miam… FL    South…          0.400    NA       1049 27460
 8 117140 Univer… priv… La V… CA    Far W…          0.548    22       4516 58014
 9 152567 Trine … priv… Ango… IN    Great…          0.816    25       2120 46440
10 237057 Whitma… priv… Wall… WA    Far W…          0.559    31       1545 68082
# ℹ 177 more rows
# ℹ 4 more variables: grad_rate <dbl>, fy_retention <dbl>, fedloan <dbl>,
#   z_cost <dbl>
colleges %>%
  group_by(region) %>%
  mutate(z_cost = (cost - mean(cost, na.rm = TRUE)) / sd(cost, na.rm = TRUE)) %>%
  ungroup() 
# A tibble: 187 × 14
   unitid school  type  city  state region admission_rate   act undergrads  cost
    <dbl> <chr>   <chr> <chr> <chr> <chr>           <dbl> <dbl>      <dbl> <dbl>
 1 228343 Southw… priv… Geor… TX    South…          0.490    26       1507 55886
 2 177719 Barnes… priv… Sain… MO    Plains         NA        NA        569    NA
 3 367884 Hodges… priv… Fort… FL    South…          0.612    NA        832 27425
 4 149781 Wheato… priv… Whea… IL    Great…          0.848    29       2358 49214
 5 135364 Luther… priv… Lith… GA    South…          0.5      NA        235    NA
 6 212601 Gannon… priv… Erie  PA    Mid E…          0.755    23       2866 44896
 7 133979 Florid… priv… Miam… FL    South…          0.400    NA       1049 27460
 8 117140 Univer… priv… La V… CA    Far W…          0.548    22       4516 58014
 9 152567 Trine … priv… Ango… IN    Great…          0.816    25       2120 46440
10 237057 Whitma… priv… Wall… WA    Far W…          0.559    31       1545 68082
# ℹ 177 more rows
# ℹ 4 more variables: grad_rate <dbl>, fy_retention <dbl>, fedloan <dbl>,
#   z_cost <dbl>

Which is easier to read?

palmerpenguins::penguins |>
  filter(species == "Adelie") |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() + 
  scale_color_viridis_d(option = "magma", end = .75) +
  theme_bw(base_family = "Times") + 
  theme(legend.position = "none",
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.title.x = element_text(color = "darkred"))
palmerpenguins::penguins |> filter(species=="Adelie") |> ggplot(aes(x =bill_length_mm, y= bill_depth_mm)) + geom_point() + scale_color_viridis_d(option = "magma",end = .75) + 
  theme_bw(base_family = "Times") + 
  theme(legend.position = "none", panel.grid.minor=element_blank(), panel.grid.major.x = element_blank(),
        axis.title.x = element_text(color="darkred"))

https://style.tidyverse.org

Example: Pipes and whitespace

|> should always have a space before it, and should usually be followed by a new line. After the first step, each line should be indented by two spaces.

Good:

iris |>
  summarize(across(where(is.numeric), mean), .by = Species) |>
  pivot_longer(!Species, names_to = "measure", values_to = "value") |>
  arrange(value)

Bad:

iris|> summarize(across(where(is.numeric), mean), .by = Species) |>
pivot_longer(!Species, names_to = "measure", values_to = "value")|>
arrange(value)

Example: long lines

If the arguments to a function don’t all fit on one line, put each argument on its own line and indent:

Good:

iris |>
  summarise(
    Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
    Sepal.Width = mean(Sepal.Width, na.rm = TRUE),
    .by = Species
  )

Bad:

iris |>
  summarise(Sepal.Length = mean(Sepal.Length, na.rm = TRUE), Sepal.Width = mean(Sepal.Width, na.rm = TRUE), .by = Species)

ggplot2 whitespace and indenting

+ should always have a space before it, and should be followed by a new line. After the first step, each line should be indented by two spaces.

If you are creating a ggplot off of a dplyr pipeline, there should only be one level of indentation.

Good:

iris |>
  filter(Species == "setosa") |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point()

Bad:

iris |>
  filter(Species == "setosa") |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length)) +
    geom_point()

Bad:

iris |>
  filter(Species == "setosa") |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length)) + geom_point()

ggplot2 long lines

If the arguments to a ggplot2 layer don’t all fit on one line, put each argument on its own line and indent:

Good:

iris |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  labs(
    x = "Sepal width, in cm",
    y = "Sepal length, in cm",
    title = "Sepal length vs. width of irises"
  )

Bad:

iris |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  labs(x = "Sepal width, in cm", y = "Sepal length, in cm", title = "Sepal length vs. width of irises")

Code style summary

  • All code style guides are opinionated and subjective
  • Using consistent style makes it easier for collaborators (including future you!) to read and understand your code
  • Try to follow the tidyverse style guide in this class

A shortcut

In RStudio,

  1. Highlight the code that you want to reformat
  2. Go to Code > Reformat Code (Ctrl+Shift+A, or Cmd+Shift+A on a Mac)
  3. Marvel in wonder
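
Reformatting can also be done from code. The sketch below assumes the `styler` package is installed; it is not mentioned in these slides and is offered only as one option. `style_text()` takes code as a character string and returns a restyled version following the tidyverse style guide, without running the code.

```r
library(styler)

# A deliberately cramped snippet, stored as text (hypothetical example)
messy_code <- "x=c(1,2,3)"

# style_text() restyles the text; it does not evaluate it
styled <- style_text(messy_code)

styled
```

`styler` also adds RStudio addins for styling the current selection or file, which work much like the menu shortcut above.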

Try it

Reformat your code from the flights activity so far according to the tidyverse style guide