bind_
and _join
: Combining Data
Click the “code” button above to copy and paste the source code, or pull a fresh version of the “activities” repo from GitHub.
Warm Up
Example 1: Star Wars Characters
Dataset 1:
- Each row in the dataset corresponds to a star wars character
- The columns include their
name
,height
,mass
,homeworld
,species
, and thefirst_film
they appeared in
starwars_characters
# A tibble: 87 × 6
name height mass homeworld species first_film
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 Tatooine Human A New Hope
2 C-3PO 167 75 Tatooine Droid A New Hope
3 R2-D2 96 32 Naboo Droid A New Hope
4 Darth Vader 202 136 Tatooine Human A New Hope
5 Leia Organa 150 49 Alderaan Human A New Hope
6 Owen Lars 178 120 Tatooine Human A New Hope
7 Beru Whitesun Lars 165 75 Tatooine Human A New Hope
8 R5-D4 97 32 Tatooine Droid A New Hope
9 Biggs Darklighter 183 84 Tatooine Human A New Hope
10 Obi-Wan Kenobi 182 77 Stewjon Human A New Hope
# ℹ 77 more rows
Dataset 2:
The dataset is formatted the same, but includes two characters that were first introduced in *The Last Jedi, a Star Wars movie that came out after the first datset was released.
starwars_lastjedi
# A tibble: 2 × 6
name height mass homeworld species first_film
<chr> <lgl> <lgl> <chr> <chr> <chr>
1 Rose Tico NA NA Otomok Human The Last Jedi
2 Amilyn Holdo NA NA <NA> Human The Last Jedi
How should these datasets be combined?
Example 2: Stats Sections
Dataset 1:
This dataset includes the number of sections of stats courses offered in Fall 2024 and Winter 2025 terms.
stats_sections_fw
# A tibble: 8 × 3
class fall winter
<chr> <dbl> <dbl>
1 stat120 3 3
2 stat220 1 1
3 stat230 1 1
4 stat250 0 1
5 stat270 1 0
6 stat285 1 1
7 stat320 0 0
8 stat330 0 1
Dataset 2:
This dataset includes the number of sections of stats courses offered in the Spring 2025 term.
stats_sections_s
# A tibble: 8 × 1
spring
<dbl>
1 4
2 1
3 1
4 1
5 0
6 1
7 1
8 0
How should these datasets be combined?
Example 3: Survivor castaways
Dataset 1:
- This dataset includes information about each castaway’s performance in a given season of Survivor (US seasons)
- It includes a castaway ID, first name, season name and number, and the results for that castaway in that season (their overall
place
, whether they made thejury
, and whether they were afinalist
)
us_castaway_results
# A tibble: 870 × 7
castaway_id castaway season_name season place jury finalist
<chr> <chr> <chr> <dbl> <dbl> <lgl> <lgl>
1 US0001 Sonja Survivor: Borneo 1 16 FALSE FALSE
2 US0002 B.B. Survivor: Borneo 1 15 FALSE FALSE
3 US0003 Stacey Survivor: Borneo 1 14 FALSE FALSE
4 US0004 Ramona Survivor: Borneo 1 13 FALSE FALSE
5 US0005 Dirk Survivor: Borneo 1 12 FALSE FALSE
6 US0006 Joel Survivor: Borneo 1 11 FALSE FALSE
7 US0007 Gretchen Survivor: Borneo 1 10 FALSE FALSE
8 US0008 Greg Survivor: Borneo 1 9 TRUE FALSE
9 US0009 Jenna Survivor: Borneo 1 8 TRUE FALSE
10 US0010 Gervase Survivor: Borneo 1 7 TRUE FALSE
# ℹ 860 more rows
Dataset 2:
- This dataset includes information about each castaway who has appeared on Survivor (including non-US seasons)
- It includes a castaway ID, their full name, date of birth, gender and occupation as presented in the show, and their Myers-Briggs personality type.
cast_details
# A tibble: 1,118 × 6
castaway_id full_name date_of_birth gender occupation personality_type
<chr> <chr> <date> <chr> <chr> <chr>
1 US0014 Rudy Boesch 1928-01-20 Male Retired N… ISTJ
2 US0002 B.B. Andersen 1936-01-18 Male Real Esta… ESTJ
3 US0001 Sonja Christoph… 1937-01-28 Female Musician ENFP
4 US0075 Jake Billingsley 1941-08-21 Male Land Brok… ISFJ
5 US0151 Jim Lynch 1942-01-07 Male Retired F… ISTJ
6 US0474 Joseph Del Campo 1943-07-04 Male Former FB… ISTJ
7 US0304 Jimmy Johnson 1943-07-16 Male Former NF… ESFJ
8 US0047 Kim Johnson 1944-09-18 Female Retired T… ISFJ
9 US0128 Scout Cloud Lee 1944-11-08 Female Rancher INFJ
10 US0061 Paschal English 1945-03-05 Male Judge ISFJ
# ℹ 1,108 more rows
We are interested in exploring the relationship between overall place on a Survivor season and personality type. How should these datasets be combined?
Stop here
Bakeoff Ratings (again)
We’ll start with our messy_ratings2
dataset from last week:
messy_ratings2 <- read_csv("https://stat220-s25.github.io/data/messy_ratings2.csv")
messy_ratings2
# A tibble: 8 × 21
series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.24 NA 3 NA 3 NA 2.6 NA
2 2 3.1 NA 3.53 NA 3.82 NA 3.6 NA
3 3 3.85 NA 4.6 NA 4.53 NA 4.71 NA
4 4 6.6 NA 6.65 NA 7.17 NA 6.82 NA
5 5 8.51 NA 8.79 NA 9.28 NA 10.2 NA
6 6 11.6 11.7 11.6 11.8 12.0 NA 12.4 12.7
7 7 13.6 13.9 13.4 13.7 13.0 13.4 13.3 13.9
8 8 9.46 9.72 9.23 9.53 8.68 9.06 8.55 8.87
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
# e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
# e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
# e10_28day <dbl>
On Friday, we tidied the data1 using the following code:
-
select
columns of interest -
pivot_longer
to extract the variable encoded in the column names -
separate
to separate episode number from ratings period -
parse_number
to format columns as numeric
tidy_ratings2 <- messy_ratings2 %>%
select(series, contains("7day")) %>%
pivot_longer(-series, names_to = "episode", values_to = "viewers") %>%
separate(episode, into = c("episode", "period")) %>%
mutate(
episode = parse_number(episode),
period = parse_number(period)
)
tidy_ratings2
# A tibble: 80 × 4
series episode period viewers
<dbl> <dbl> <dbl> <dbl>
1 1 1 7 2.24
2 1 2 7 3
3 1 3 7 3
4 1 4 7 2.6
5 1 5 7 3.03
6 1 6 7 2.75
7 1 7 7 NA
8 1 8 7 NA
9 1 9 7 NA
10 1 10 7 NA
# ℹ 70 more rows
Task: line plot
Finish the code to make a line plot of viewers
against episode
, colored and grouped by series
. (remember to remove #| eval: false
)
%>%
tidy_ratings2 ggplot(aes(x = _____, y = _____, col = _____, group = _____)) +
geom_line() +
scale_color_viridis_c(option = "plasma", end = .8)
Task: Add series 9-14 to the dataset
The code chunk below loads the new dataset
messy_ratings2b <- read_csv("https://stat220-s25.github.io/data/messy_ratings2_9_14.csv")
messy_ratings2b
# A tibble: 6 × 21
series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 9 9.55 9.92 9.31 9.76 8.91 9.35 8.88 9.41
2 10 9.62 10.0 9.38 9.8 8.94 9.42 8.96 9.49
3 11 11.2 11.8 10.8 11.4 10.7 11.2 10.6 11.2
4 12 9.57 10.2 8.98 9.64 8.69 9.39 8.75 9.13
5 13 8.3 1 7.6 1 7.35 1 7.76 1
6 14 7.84 1 7.54 1 7.28 1 7.12 1
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
# e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
# e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
# e10_28day <dbl>
Combine with bind_rows()
Combine the two messy
datasets by using bind_rows()
. Be sure to look at your resulting dataset and make sure it has the right format
# your code here
Combine with join()
We’ll practice join
ing with the tidy versions of the data. First, create tidy_ratings2b
using the new dataset. Follow the same steps as what was used to create tidy_ratings2
# your code here
Once you’re satisfied, that the two tidy datasets look similar enough to combine, remove #| eval: false
from the following code chunk and run it. Did it work? Why or why not?
tidy_ratings = tidy_ratings2 %>%
left_join(tidy_ratings2b)
Use the correction _join
function here. Save your dataset to the tidy_ratings
object.
# your code here
Combine with bind_cols
To join the two datasets using bind_cols
, you’ll first have to un-tidy your tidy datasets, so that each episode number is in a row and each series number is in a column. After using bind_cols()
the first few rows of your final dataset should look like this:
# A tibble: 10 × 16
episode period s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 7 2.24 3.1 3.85 6.6 8.51 11.6 13.6 9.46 9.55 9.62 11.2 9.57 8.3 7.84
2 2 7 3 3.53 4.6 6.65 8.79 11.6 13.4 9.23 9.31 9.38 10.8 8.98 7.6 7.54
3 3 7 3 3.82 4.53 7.17 9.28 12.0 13.0 8.68 8.91 8.94 10.7 8.69 7.35 7.28
Stop here and let Amanda know you’ve finished
join
the episodes data
The episodes
data has information about each episode of GBBO. In particular, it includes:
-
baker
: each baker that competed on that episode -
signature
: the name of the dessert they made for the signature challenge -
technical
: the place the baker earned in the technical challenge -
showstopper
: the name of the dessert they made for the showstopper challenge -
result
: whether the baker was “Safe” or “Eliminated”
episodes = read_csv("https://stat220-s25.github.io/data/bakeoff-episodes.csv")
episodes
# A tibble: 1,007 × 7
series episode baker signature technical showstopper result
<dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
1 1 1 Annetha Light Jamaican Black C… 2 Red, White… Safe
2 1 1 David Chocolate Orange Cake 3 Black Fore… Safe
3 1 1 Edd Caramel Cinnamon and B… 1 <NA> Safe
4 1 1 Jasminder Fresh Mango and Passio… NA <NA> Safe
5 1 1 Jonathan Carrot Cake with Lime … 9 Three Tier… Safe
6 1 1 Lea Cranberry and Pistachi… 10 Raspberrie… Elimi…
7 1 1 Louise Carrot and Orange Cake NA Never Fail… Safe
8 1 1 Mark Sticky Marmalade Tea L… NA Heart-shap… Elimi…
9 1 1 Miranda Triple Layered Brownie… 8 Three Tier… Safe
10 1 1 Ruth Lemon Drizzle Cakewith… NA Classic Ch… Safe
# ℹ 997 more rows
Join the tidy dataset (all 14 seasons) with the episodes
data.
Your task is to print the “signature” desserts that were made on the 10 episodes that had the highest 7-day viewership.
Stop here and let Amanda know you’ve finished
Footnotes
Talk it over: does everybody remember what makes a dataset “tidy”?↩︎