Data Wrangling: Tidy Data + Code Style

Day 09

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Last time: Bakeoff ratings

Ratings data for each episode in series 1-8

# A tibble: 8 × 11
  series    e1    e2    e3    e4    e5    e6    e7    e8    e9   e10
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1      1  2.24  3     3     2.6   3.03  2.75 NA    NA    NA    NA   
2      2  3.1   3.53  3.82  3.6   3.83  4.25  4.42  5.06 NA    NA   
3      3  3.85  4.6   4.53  4.71  4.61  4.82  5.1   5.35  5.7   6.74
4      4  6.6   6.65  7.17  6.82  6.95  7.32  7.76  7.41  7.41  9.45
5      5  8.51  8.79  9.28 10.2   9.95 10.1  10.3   9.02 10.7  13.5 
6      6 11.6  11.6  12.0  12.4  12.4  12    12.4  11.1  12.6  15.0 
7      7 13.6  13.4  13.0  13.3  13.1  13.1  13.4  13.3  13.4  15.9 
8      8  9.46  9.23  8.68  8.55  8.61  8.61  9.01  8.95  9.03 10.0

Last time: `tidy`ing the data

bakeoff_ratings %>%
  pivot_longer(cols = e1:e10, names_to = "episode", values_to = "rating") %>%
  mutate(
    episode = parse_number(episode) 
  )

# A tibble: 80 × 3
   series episode rating
    <dbl>   <dbl>  <dbl>
 1      1       1   2.24
 2      1       2   3   
 3      1       3   3   
 4      1       4   2.6 
 5      1       5   3.03
 6      1       6   2.75
 7      1       7  NA   
 8      1       8  NA   
 9      1       9  NA   
10      1      10  NA   
# ℹ 70 more rows

Try it: `messy_ratings2`

Tidy this data set by:

Selecting the series and e*_7day columns
Pivoting the data to add a column for episode and a column for rating (we’ll clean up the episode column later)

messy_ratings2 <- read_csv("https://stat220-s25.github.io/data/messy_ratings2.csv")
messy_ratings2

# A tibble: 8 × 21
  series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
   <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
1      1    2.24    NA       3       NA       3       NA       2.6     NA   
2      2    3.1     NA       3.53    NA       3.82    NA       3.6     NA   
3      3    3.85    NA       4.6     NA       4.53    NA       4.71    NA   
4      4    6.6     NA       6.65    NA       7.17    NA       6.82    NA   
5      5    8.51    NA       8.79    NA       9.28    NA      10.2     NA   
6      6   11.6     11.7    11.6     11.8    12.0     NA      12.4     12.7 
7      7   13.6     13.9    13.4     13.7    13.0     13.4    13.3     13.9 
8      8    9.46     9.72    9.23     9.53    8.68     9.06    8.55     8.87
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
#   e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
#   e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
#   e10_28day <dbl>

Cleaning `episode`

ratings2 <- messy_ratings2 %>%
  select(series, contains("7day")) %>%
  pivot_longer(contains("7day"), 
               names_to = "episode", 
               values_to = "rating")
ratings2

# A tibble: 80 × 3
   series episode  rating
    <dbl> <chr>     <dbl>
 1      1 e1_7day    2.24
 2      1 e2_7day    3   
 3      1 e3_7day    3   
 4      1 e4_7day    2.6 
 5      1 e5_7day    3.03
 6      1 e6_7day    2.75
 7      1 e7_7day   NA   
 8      1 e8_7day   NA   
 9      1 e9_7day   NA   
10      1 e10_7day  NA   
# ℹ 70 more rows

`separate()`

data: the data frame (as usual)
col: column to separate
into: names of new columns to create
sep: separator between columns

separate(
  data, 
  col, 
  into = c("col1", "col2"),
  sep 
  )
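As a minimal sketch of the template above, the example below fills in every argument on a small made-up tibble. The name `toy_ratings` and its values are hypothetical and only mimic the shape of the `episode` column. When `sep` is left out, `separate()` splits on any run of non-alphanumeric characters, so `sep = "_"` here just makes that choice explicit.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data (not the real ratings): one column packs
# the episode label and the rating period together as "e1_7day"
toy_ratings <- tibble(
  series = c(1, 1, 8),
  episode = c("e1_7day", "e2_7day", "e10_28day"),
  rating = c(2.24, 3, 10.0)
)

# Split `episode` at the underscore into two new character columns
toy_separated <- toy_ratings |>
  separate(
    col = episode,
    into = c("episode", "period"),
    sep = "_"
  )

toy_separated
```

Both new columns are still character columns at this point, so they need a further cleaning step before they can be treated as numbers.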

Cleaning `episode`

ratings2 %>%
  separate( 
    col = episode, 
    into = c("episode", "period") 
  )

# A tibble: 80 × 4
   series episode period rating
    <dbl> <chr>   <chr>   <dbl>
 1      1 e1      7day     2.24
 2      1 e2      7day     3   
 3      1 e3      7day     3   
 4      1 e4      7day     2.6 
 5      1 e5      7day     3.03
 6      1 e6      7day     2.75
 7      1 e7      7day    NA   
 8      1 e8      7day    NA   
 9      1 e9      7day    NA   
10      1 e10     7day    NA   
# ℹ 70 more rows

Wrap it up

Clean the episode and period columns using parse_number()
Make a line plot with episode on the x-axis, rating on the y-axis, colored by series. (You may also need to map the group aesthetic to series)
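One possible way to approach these two steps is sketched below on a small made-up tibble. The names `toy_ratings`, `toy_clean`, and `rating_plot` are hypothetical, and wrapping `series` in `factor()` for a discrete color scale is one choice among several, not the required answer.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)

# Hypothetical toy data in the same shape as `ratings2`
toy_ratings <- tibble(
  series = c(1, 1, 1, 2, 2, 2),
  episode = c("e1_7day", "e2_7day", "e3_7day", "e1_7day", "e2_7day", "e3_7day"),
  rating = c(2.24, 3, 3, 3.1, 3.53, 3.82)
)

# Separate the packed column, then pull the numbers out of each piece
toy_clean <- toy_ratings |>
  separate(col = episode, into = c("episode", "period"), sep = "_") |>
  mutate(
    episode = parse_number(episode),
    period = parse_number(period)
  )

# One line per series: `group` keeps each series as its own line,
# and factor() gives a discrete rather than continuous color scale
rating_plot <- toy_clean |>
  ggplot(
    aes(x = episode, y = rating, color = factor(series), group = series)
  ) +
  geom_line() +
  labs(color = "series")

rating_plot
```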

Code Style

group_by(colleges,region) %>% mutate(z_cost=(cost - mean(cost, na.rm=TRUE)) / sd(cost,na.rm = TRUE)) %>% ungroup()

# A tibble: 187 × 14
   unitid school  type  city  state region admission_rate   act undergrads  cost
    <dbl> <chr>   <chr> <chr> <chr> <chr>           <dbl> <dbl>      <dbl> <dbl>
 1 228343 Southw… priv… Geor… TX    South…          0.490    26       1507 55886
 2 177719 Barnes… priv… Sain… MO    Plains         NA        NA        569    NA
 3 367884 Hodges… priv… Fort… FL    South…          0.612    NA        832 27425
 4 149781 Wheato… priv… Whea… IL    Great…          0.848    29       2358 49214
 5 135364 Luther… priv… Lith… GA    South…          0.5      NA        235    NA
 6 212601 Gannon… priv… Erie  PA    Mid E…          0.755    23       2866 44896
 7 133979 Florid… priv… Miam… FL    South…          0.400    NA       1049 27460
 8 117140 Univer… priv… La V… CA    Far W…          0.548    22       4516 58014
 9 152567 Trine … priv… Ango… IN    Great…          0.816    25       2120 46440
10 237057 Whitma… priv… Wall… WA    Far W…          0.559    31       1545 68082
# ℹ 177 more rows
# ℹ 4 more variables: grad_rate <dbl>, fy_retention <dbl>, fedloan <dbl>,
#   z_cost <dbl>

colleges %>%
  group_by(region) %>%
  mutate(z_cost = (cost - mean(cost, na.rm = TRUE)) / sd(cost, na.rm = TRUE)) %>%
  ungroup()

# A tibble: 187 × 14
   unitid school  type  city  state region admission_rate   act undergrads  cost
    <dbl> <chr>   <chr> <chr> <chr> <chr>           <dbl> <dbl>      <dbl> <dbl>
 1 228343 Southw… priv… Geor… TX    South…          0.490    26       1507 55886
 2 177719 Barnes… priv… Sain… MO    Plains         NA        NA        569    NA
 3 367884 Hodges… priv… Fort… FL    South…          0.612    NA        832 27425
 4 149781 Wheato… priv… Whea… IL    Great…          0.848    29       2358 49214
 5 135364 Luther… priv… Lith… GA    South…          0.5      NA        235    NA
 6 212601 Gannon… priv… Erie  PA    Mid E…          0.755    23       2866 44896
 7 133979 Florid… priv… Miam… FL    South…          0.400    NA       1049 27460
 8 117140 Univer… priv… La V… CA    Far W…          0.548    22       4516 58014
 9 152567 Trine … priv… Ango… IN    Great…          0.816    25       2120 46440
10 237057 Whitma… priv… Wall… WA    Far W…          0.559    31       1545 68082
# ℹ 177 more rows
# ℹ 4 more variables: grad_rate <dbl>, fy_retention <dbl>, fedloan <dbl>,
#   z_cost <dbl>

Which is easier to read?

Option A

palmerpenguins::penguins |>
  filter(species == "Adelie") |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() + 
  scale_color_viridis_d(option = "magma", end = .75) + 
  theme_bw(base_family = "Times") + 
  theme(legend.position = "none",
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.title.x = element_text(color = "darkred"))

Option B

palmerpenguins::penguins |> filter(species=="Adelie") |> ggplot(aes(x =bill_length_mm, y= bill_depth_mm)) + geom_point() + scale_color_viridis_d(option = "magma",end = .75) + 
  theme_bw(base_family = "Times") + 
  theme(legend.position = "none", panel.grid.minor=element_blank(), panel.grid.major.x = element_blank(),
        axis.title.x = element_text(color="darkred"))

https://style.tidyverse.org

Example: Pipes and whitespace

|> should always have a space before it, and should usually be followed by a new line. After the first step, each line should be indented by two spaces.

Good:

iris |>
  summarize(across(where(is.numeric), mean), .by = Species) |>
  pivot_longer(!Species, names_to = "measure", values_to = "value") |>
  arrange(value)

Bad:

iris|> summarize(across(where(is.numeric), mean), .by = Species) |>
pivot_longer(!Species, names_to = "measure", values_to = "value")|>
arrange(value)

Example: long lines

If the arguments to a function don’t all fit on one line, put each argument on its own line and indent:

Good:

iris |>
  summarise(
    Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
    Sepal.Width = mean(Sepal.Width, na.rm = TRUE),
    .by = Species
  )

Bad:

iris |>
  summarise(Sepal.Length = mean(Sepal.Length, na.rm = TRUE), Sepal.Width = mean(Sepal.Width, na.rm = TRUE), .by = Species)

ggplot2 whitespace and indenting

+ should always have a space before it, and should be followed by a new line. After the first step, each line should be indented by two spaces.

If you are creating a ggplot off of a dplyr pipeline, there should only be one level of indentation.

Good:

iris |>
  filter(Species == "setosa") |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point()

Bad:

iris |>
  filter(Species == "setosa") |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length)) +
    geom_point()

Bad:

iris |>
  filter(Species == "setosa") |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length)) + geom_point()

ggplot2 long lines

If the arguments to a ggplot2 layer don’t all fit on one line, put each argument on its own line and indent:

Good:

iris |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  labs(
    x = "Sepal width, in cm",
    y = "Sepal length, in cm",
    title = "Sepal length vs. width of irises"
  )

Bad:

iris |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  labs(x = "Sepal width, in cm", y = "Sepal length, in cm", title = "Sepal length vs. width of irises")

Code style summary

All code style guides are opinionated and subjective
Using consistent style makes it easier for collaborators (including future you!) to read and understand your code
Try to follow the tidyverse style guide in this class

A shortcut

In RStudio,

Highlight the code that you want to reformat
Go to Code -> Reformat Code
Marvel in wonder
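The same kind of cleanup can also be done from code. The sketch below assumes the styler package is installed (the slides do not mention it) and uses its `style_text()` function, which applies the tidyverse style by default. The code being styled is passed as a string and is never run, so `colleges` does not need to exist; `messy_code` and `styled_code` are hypothetical names.

```r
library(styler)

# The hard-to-read pipeline from the "Code Style" slide, kept as text
messy_code <- "group_by(colleges,region) %>% mutate(z_cost=(cost - mean(cost, na.rm=TRUE)) / sd(cost,na.rm = TRUE)) %>% ungroup()"

# style_text() returns the restyled code; printing it shows the result
styled_code <- style_text(messy_code)
styled_code
```

Automatic tools mainly fix spacing and indentation, so it is still worth reading the result and deciding where long calls should be broken across lines.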

Try it

Reformat your code from the flights activity so far according to the tidyverse style guide
