tidyr 2: Reshaping Data

Author
Affiliation

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Click the “code” button above to copy and paste the source code, or pull a fresh version of the “activities” repo from GitHub.

Last time

Here is a data frame with the ratings for each episode of The Great British Bakeoff for series 1-8.

bakeoff_ratings <- read_csv("https://stat220-s25.github.io/data/bakeoff_messy_ratings.csv")
bakeoff_ratings
# A tibble: 8 × 11
  series    e1    e2    e3    e4    e5    e6    e7    e8    e9   e10
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1      1  2.24  3     3     2.6   3.03  2.75 NA    NA    NA    NA   
2      2  3.1   3.53  3.82  3.6   3.83  4.25  4.42  5.06 NA    NA   
3      3  3.85  4.6   4.53  4.71  4.61  4.82  5.1   5.35  5.7   6.74
4      4  6.6   6.65  7.17  6.82  6.95  7.32  7.76  7.41  7.41  9.45
5      5  8.51  8.79  9.28 10.2   9.95 10.1  10.3   9.02 10.7  13.5 
6      6 11.6  11.6  12.0  12.4  12.4  12    12.4  11.1  12.6  15.0 
7      7 13.6  13.4  13.0  13.3  13.1  13.1  13.4  13.3  13.4  15.9 
8      8  9.46  9.23  8.68  8.55  8.61  8.61  9.01  8.95  9.03 10.0 

and here is the code we used last time to tidy the data:

bakeoff_ratings %>%
  pivot_longer(cols = e1:e10, names_to = "episode", values_to = "rating") %>%
  mutate(
    episode = parse_number(episode) 
  )
# A tibble: 80 × 3
   series episode rating
    <dbl>   <dbl>  <dbl>
 1      1       1   2.24
 2      1       2   3   
 3      1       3   3   
 4      1       4   2.6 
 5      1       5   3.03
 6      1       6   2.75
 7      1       7  NA   
 8      1       8  NA   
 9      1       9  NA   
10      1      10  NA   
# ℹ 70 more rows

messyratings2

messy_ratings2 <- read_csv("https://stat220-s25.github.io/data/messy_ratings2.csv")
messy_ratings2
# A tibble: 8 × 21
  series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
   <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
1      1    2.24    NA       3       NA       3       NA       2.6     NA   
2      2    3.1     NA       3.53    NA       3.82    NA       3.6     NA   
3      3    3.85    NA       4.6     NA       4.53    NA       4.71    NA   
4      4    6.6     NA       6.65    NA       7.17    NA       6.82    NA   
5      5    8.51    NA       8.79    NA       9.28    NA      10.2     NA   
6      6   11.6     11.7    11.6     11.8    12.0     NA      12.4     12.7 
7      7   13.6     13.9    13.4     13.7    13.0     13.4    13.3     13.9 
8      8    9.46     9.72    9.23     9.53    8.68     9.06    8.55     8.87
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
#   e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
#   e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
#   e10_28day <dbl>

Task 1

Tidy this data set by

  1. Selecting the series and e*_7day columns
  2. Pivoting the data to add a column for episode and a column for rating (we’ll clean up the episode column later)
# include your code here

Task 2

  • Clean the episode and period column
  • Make a line plot with episode on the x-axis, rating on the y-axis, colored by series. (You will also need to map the group aesthetic to series)

More Practice

relig_income

The relig_income dataset in the {tidyr} package stores counts based on a survey which (among other things) asked people about their religion and annual income:

relig_income
# A tibble: 18 × 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

Use pivot_longer() to tidy this dataset.

anscombe

Anscombe’s quartet is a built-in dataset in R.

anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

This dataset contains four pairs of variables (x1 and y1, x2 and y2, etc) that underlie Anscombe’s quartet, a collection of four datasets that have the same summary statistics (mean, sd, correlation etc), but have quite different data. We want to produce a dataset with columns set, x and y:

# A tibble: 44 × 3
   set       x     y
   <chr> <dbl> <dbl>
 1 1        10  8.04
 2 2        10  9.14
 3 3        10  7.46

There are (at least) two ways to do this. The first is a little more intuitive, but not as efficient:

  1. First, we’ll create a new “index” column, so we don’t lose track of which x values map to which y values.
anscombe <- anscombe %>%
  mutate(
    index = 1:nrow(anscombe)
  )
  1. Next, use pivot_longer() on all of the columns but index. Use the default for names_to and values_to. The first few rows of your result should look like this:
# A tibble: 88 × 3
   index name  value
   <int> <chr> <dbl>
 1     1 x1    10   
 2     1 x2    10   
 3     1 x3    10  
  1. Next, separate name into variable and set. This is a little tricky, since there’s no separator character (the values of name are x1 and x2 instead of x_1 or x_2). Instead, set sep = 1, which tells R to split the column after the first character. The first few rows of your result should look like this:
# A tibble: 88 × 4
   index variable set   value
   <int> <chr>    <chr> <dbl>
 1     1 x        1     10   
 2     1 x        2     10   
 3     1 x        3     10   
  1. Finally, use pivot_wider() with names_from variable and values_from value. Call your tidy dataset anscombe_tidy.

If all went well, you should be able to run the following two chunks to generate the summary statistics and scatterplots for Anscombe’s quartet.

anscombe_tidy %>%
  group_by(set) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    cor = cor(x,y)
  )
anscombe_tidy %>%
  ggplot(aes(x = x, y = y)) + 
  geom_point() + 
  facet_wrap(~set)

The second way to do this is directly within pivot_longer():

anscombe %>%
  pivot_longer(
    -index,
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )
# A tibble: 44 × 4
   index set       x     y
   <int> <chr> <dbl> <dbl>
 1     1 1        10  8.04
 2     1 2        10  9.14
 3     1 3        10  7.46
 4     1 4         8  6.58
 5     2 1         8  6.95
 6     2 2         8  8.14
 7     2 3         8  6.77
 8     2 4         8  5.76
 9     3 1        13  7.58
10     3 2        13  8.74
# ℹ 34 more rows

pull up the help page for pivot_longer and try to explain the new arguments.