bind_ and _join: Combining Data

Author
Affiliation

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Click the “code” button above to copy and paste the source code, or pull a fresh version of the “activities” repo from GitHub.

Warm Up

load(url("https://stat220-s25.github.io/data/combining-data-examples.Rda"))

Example 1: Star Wars Characters

Dataset 1:

  • Each row in the dataset corresponds to a star wars character
  • The columns include their name, height, mass, homeworld, species, and the first_film they appeared in
starwars_characters
# A tibble: 87 × 6
   name               height  mass homeworld species first_film
   <chr>               <int> <dbl> <chr>     <chr>   <chr>     
 1 Luke Skywalker        172    77 Tatooine  Human   A New Hope
 2 C-3PO                 167    75 Tatooine  Droid   A New Hope
 3 R2-D2                  96    32 Naboo     Droid   A New Hope
 4 Darth Vader           202   136 Tatooine  Human   A New Hope
 5 Leia Organa           150    49 Alderaan  Human   A New Hope
 6 Owen Lars             178   120 Tatooine  Human   A New Hope
 7 Beru Whitesun Lars    165    75 Tatooine  Human   A New Hope
 8 R5-D4                  97    32 Tatooine  Droid   A New Hope
 9 Biggs Darklighter     183    84 Tatooine  Human   A New Hope
10 Obi-Wan Kenobi        182    77 Stewjon   Human   A New Hope
# ℹ 77 more rows

Dataset 2:

The dataset is formatted the same, but includes two characters that were first introduced in *The Last Jedi, a Star Wars movie that came out after the first datset was released.

starwars_lastjedi
# A tibble: 2 × 6
  name         height mass  homeworld species first_film   
  <chr>        <lgl>  <lgl> <chr>     <chr>   <chr>        
1 Rose Tico    NA     NA    Otomok    Human   The Last Jedi
2 Amilyn Holdo NA     NA    <NA>      Human   The Last Jedi

How should these datasets be combined?

Example 2: Stats Sections

Dataset 1:

This dataset includes the number of sections of stats courses offered in Fall 2024 and Winter 2025 terms.

stats_sections_fw
# A tibble: 8 × 3
  class    fall winter
  <chr>   <dbl>  <dbl>
1 stat120     3      3
2 stat220     1      1
3 stat230     1      1
4 stat250     0      1
5 stat270     1      0
6 stat285     1      1
7 stat320     0      0
8 stat330     0      1

Dataset 2:

This dataset includes the number of sections of stats courses offered in the Spring 2025 term.

stats_sections_s
# A tibble: 8 × 1
  spring
   <dbl>
1      4
2      1
3      1
4      1
5      0
6      1
7      1
8      0

How should these datasets be combined?

Example 3: Survivor castaways

Dataset 1:

  • This dataset includes information about each castaway’s performance in a given season of Survivor (US seasons)
  • It includes a castaway ID, first name, season name and number, and the results for that castaway in that season (their overall place, whether they made the jury, and whether they were a finalist)
us_castaway_results
# A tibble: 870 × 7
   castaway_id castaway season_name      season place jury  finalist
   <chr>       <chr>    <chr>             <dbl> <dbl> <lgl> <lgl>   
 1 US0001      Sonja    Survivor: Borneo      1    16 FALSE FALSE   
 2 US0002      B.B.     Survivor: Borneo      1    15 FALSE FALSE   
 3 US0003      Stacey   Survivor: Borneo      1    14 FALSE FALSE   
 4 US0004      Ramona   Survivor: Borneo      1    13 FALSE FALSE   
 5 US0005      Dirk     Survivor: Borneo      1    12 FALSE FALSE   
 6 US0006      Joel     Survivor: Borneo      1    11 FALSE FALSE   
 7 US0007      Gretchen Survivor: Borneo      1    10 FALSE FALSE   
 8 US0008      Greg     Survivor: Borneo      1     9 TRUE  FALSE   
 9 US0009      Jenna    Survivor: Borneo      1     8 TRUE  FALSE   
10 US0010      Gervase  Survivor: Borneo      1     7 TRUE  FALSE   
# ℹ 860 more rows

Dataset 2:

  • This dataset includes information about each castaway who has appeared on Survivor (including non-US seasons)
  • It includes a castaway ID, their full name, date of birth, gender and occupation as presented in the show, and their Myers-Briggs personality type.
cast_details
# A tibble: 1,118 × 6
   castaway_id full_name        date_of_birth gender occupation personality_type
   <chr>       <chr>            <date>        <chr>  <chr>      <chr>           
 1 US0014      Rudy Boesch      1928-01-20    Male   Retired N… ISTJ            
 2 US0002      B.B. Andersen    1936-01-18    Male   Real Esta… ESTJ            
 3 US0001      Sonja Christoph… 1937-01-28    Female Musician   ENFP            
 4 US0075      Jake Billingsley 1941-08-21    Male   Land Brok… ISFJ            
 5 US0151      Jim Lynch        1942-01-07    Male   Retired F… ISTJ            
 6 US0474      Joseph Del Campo 1943-07-04    Male   Former FB… ISTJ            
 7 US0304      Jimmy Johnson    1943-07-16    Male   Former NF… ESFJ            
 8 US0047      Kim Johnson      1944-09-18    Female Retired T… ISFJ            
 9 US0128      Scout Cloud Lee  1944-11-08    Female Rancher    INFJ            
10 US0061      Paschal English  1945-03-05    Male   Judge      ISFJ            
# ℹ 1,108 more rows

We are interested in exploring the relationship between overall place on a Survivor season and personality type. How should these datasets be combined?

Stop here

Bakeoff Ratings (again)

We’ll start with our messy_ratings2 dataset from last week:

messy_ratings2 <- read_csv("https://stat220-s25.github.io/data/messy_ratings2.csv")
messy_ratings2
# A tibble: 8 × 21
  series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
   <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
1      1    2.24    NA       3       NA       3       NA       2.6     NA   
2      2    3.1     NA       3.53    NA       3.82    NA       3.6     NA   
3      3    3.85    NA       4.6     NA       4.53    NA       4.71    NA   
4      4    6.6     NA       6.65    NA       7.17    NA       6.82    NA   
5      5    8.51    NA       8.79    NA       9.28    NA      10.2     NA   
6      6   11.6     11.7    11.6     11.8    12.0     NA      12.4     12.7 
7      7   13.6     13.9    13.4     13.7    13.0     13.4    13.3     13.9 
8      8    9.46     9.72    9.23     9.53    8.68     9.06    8.55     8.87
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
#   e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
#   e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
#   e10_28day <dbl>

On Friday, we tidied the data1 using the following code:

  • select columns of interest
  • pivot_longer to extract the variable encoded in the column names
  • separate to separate episode number from ratings period
  • parse_number to format columns as numeric
tidy_ratings2 <- messy_ratings2 %>%
  select(series, contains("7day")) %>%
  pivot_longer(-series, names_to = "episode", values_to = "viewers") %>%
  separate(episode, into = c("episode", "period")) %>%
  mutate(
    episode = parse_number(episode),
    period = parse_number(period)
  )

tidy_ratings2
# A tibble: 80 × 4
   series episode period viewers
    <dbl>   <dbl>  <dbl>   <dbl>
 1      1       1      7    2.24
 2      1       2      7    3   
 3      1       3      7    3   
 4      1       4      7    2.6 
 5      1       5      7    3.03
 6      1       6      7    2.75
 7      1       7      7   NA   
 8      1       8      7   NA   
 9      1       9      7   NA   
10      1      10      7   NA   
# ℹ 70 more rows

Task: line plot

Finish the code to make a line plot of viewers against episode, colored and grouped by series. (remember to remove #| eval: false)

tidy_ratings2 %>%
  ggplot(aes(x = _____, y = _____, col = _____, group = _____)) +
  geom_line() + 
  scale_color_viridis_c(option = "plasma", end = .8)

Task: Add series 9-14 to the dataset

The code chunk below loads the new dataset

messy_ratings2b <- read_csv("https://stat220-s25.github.io/data/messy_ratings2_9_14.csv")
messy_ratings2b
# A tibble: 6 × 21
  series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
   <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
1      9    9.55     9.92    9.31     9.76    8.91     9.35    8.88     9.41
2     10    9.62    10.0     9.38     9.8     8.94     9.42    8.96     9.49
3     11   11.2     11.8    10.8     11.4    10.7     11.2    10.6     11.2 
4     12    9.57    10.2     8.98     9.64    8.69     9.39    8.75     9.13
5     13    8.3      1       7.6      1       7.35     1       7.76     1   
6     14    7.84     1       7.54     1       7.28     1       7.12     1   
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
#   e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
#   e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
#   e10_28day <dbl>

Combine with bind_rows()

Combine the two messy datasets by using bind_rows(). Be sure to look at your resulting dataset and make sure it has the right format

# your code here

Combine with join()

We’ll practice joining with the tidy versions of the data. First, create tidy_ratings2b using the new dataset. Follow the same steps as what was used to create tidy_ratings2

# your code here

Once you’re satisfied, that the two tidy datasets look similar enough to combine, remove #| eval: false from the following code chunk and run it. Did it work? Why or why not?

tidy_ratings = tidy_ratings2 %>%
  left_join(tidy_ratings2b)

Use the correction _join function here. Save your dataset to the tidy_ratings object.

# your code here

Combine with bind_cols

To join the two datasets using bind_cols, you’ll first have to un-tidy your tidy datasets, so that each episode number is in a row and each series number is in a column. After using bind_cols() the first few rows of your final dataset should look like this:

# A tibble: 10 × 16
   episode period    s1    s2    s3    s4    s5    s6    s7    s8    s9   s10   s11   s12   s13   s14
     <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1       1      7  2.24  3.1   3.85  6.6   8.51  11.6  13.6  9.46  9.55  9.62  11.2  9.57  8.3   7.84
 2       2      7  3     3.53  4.6   6.65  8.79  11.6  13.4  9.23  9.31  9.38  10.8  8.98  7.6   7.54
 3       3      7  3     3.82  4.53  7.17  9.28  12.0  13.0  8.68  8.91  8.94  10.7  8.69  7.35  7.28

Stop here and let Amanda know you’ve finished

join the episodes data

The episodes data has information about each episode of GBBO. In particular, it includes:

  • baker: each baker that competed on that episode
  • signature: the name of the dessert they made for the signature challenge
  • technical: the place the baker earned in the technical challenge
  • showstopper: the name of the dessert they made for the showstopper challenge
  • result: whether the baker was “Safe” or “Eliminated”
episodes = read_csv("https://stat220-s25.github.io/data/bakeoff-episodes.csv")
episodes
# A tibble: 1,007 × 7
   series episode baker     signature               technical showstopper result
    <dbl>   <dbl> <chr>     <chr>                       <dbl> <chr>       <chr> 
 1      1       1 Annetha   Light Jamaican Black C…         2 Red, White… Safe  
 2      1       1 David     Chocolate Orange Cake           3 Black Fore… Safe  
 3      1       1 Edd       Caramel Cinnamon and B…         1 <NA>        Safe  
 4      1       1 Jasminder Fresh Mango and Passio…        NA <NA>        Safe  
 5      1       1 Jonathan  Carrot Cake with Lime …         9 Three Tier… Safe  
 6      1       1 Lea       Cranberry and Pistachi…        10 Raspberrie… Elimi…
 7      1       1 Louise    Carrot and Orange Cake         NA Never Fail… Safe  
 8      1       1 Mark      Sticky Marmalade Tea L…        NA Heart-shap… Elimi…
 9      1       1 Miranda   Triple Layered Brownie…         8 Three Tier… Safe  
10      1       1 Ruth      Lemon Drizzle Cakewith…        NA Classic Ch… Safe  
# ℹ 997 more rows

Join the tidy dataset (all 14 seasons) with the episodes data.

Your task is to print the “signature” desserts that were made on the 10 episodes that had the highest 7-day viewership.

Stop here and let Amanda know you’ve finished

Footnotes

  1. Talk it over: does everybody remember what makes a dataset “tidy”?↩︎