dplyr
2: Pipes and Groups
Wrangling the nycflights23
data
summarize()
Using the flights
data from the {nycflights23} package, use summarize()
to compute statistics about the data:
The lowest and highest
distance
traveledThe lowest and highest
air_time
The median
dep_delay
flights
# A tibble: 435,352 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 1 1 1 2038 203 328 3
2 2023 1 1 18 2300 78 228 135
3 2023 1 1 31 2344 47 500 426
4 2023 1 1 33 2140 173 238 2352
5 2023 1 1 36 2048 228 223 2252
6 2023 1 1 503 500 3 808 815
7 2023 1 1 520 510 10 948 949
8 2023 1 1 524 530 -6 645 710
9 2023 1 1 537 520 17 926 818
10 2023 1 1 547 545 2 845 852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
summarize()
II
Using the flights
data from the {nycflights23} package, extract the rows for Delta flights. (Hint: the carrier code is “DL”)
Then use summarize()
and a summary function to find:
The number of flights in this subset
The median
dep_delay
. How does it compare to the overall median?
flights
# A tibble: 435,352 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 1 1 1 2038 203 328 3
2 2023 1 1 18 2300 78 228 135
3 2023 1 1 31 2344 47 500 426
4 2023 1 1 33 2140 173 238 2352
5 2023 1 1 36 2048 228 223 2252
6 2023 1 1 503 500 3 808 815
7 2023 1 1 520 510 10 948 949
8 2023 1 1 524 530 -6 645 710
9 2023 1 1 537 520 17 926 818
10 2023 1 1 547 545 2 845 852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
group_by()
Use group_by()
, summarize()
, and slice_max()
to display the five dest
airports with the most flights from NYC airports in 2023. Your display should include the median flight delay for these airports.
flights
# A tibble: 435,352 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 1 1 1 2038 203 328 3
2 2023 1 1 18 2300 78 228 135
3 2023 1 1 31 2344 47 500 426
4 2023 1 1 33 2140 173 238 2352
5 2023 1 1 36 2048 228 223 2252
6 2023 1 1 503 500 3 808 815
7 2023 1 1 520 510 10 948 949
8 2023 1 1 524 530 -6 645 710
9 2023 1 1 537 520 17 926 818
10 2023 1 1 547 545 2 845 852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Intro to Tidy Data
bakeoff_ratings <- read_csv("https://stat220-s25.github.io/data/bakeoff_messy_ratings.csv")
bakeoff_ratings
# A tibble: 8 × 11
series e1 e2 e3 e4 e5 e6 e7 e8 e9 e10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.24 3 3 2.6 3.03 2.75 NA NA NA NA
2 2 3.1 3.53 3.82 3.6 3.83 4.25 4.42 5.06 NA NA
3 3 3.85 4.6 4.53 4.71 4.61 4.82 5.1 5.35 5.7 6.74
4 4 6.6 6.65 7.17 6.82 6.95 7.32 7.76 7.41 7.41 9.45
5 5 8.51 8.79 9.28 10.2 9.95 10.1 10.3 9.02 10.7 13.5
6 6 11.6 11.6 12.0 12.4 12.4 12 12.4 11.1 12.6 15.0
7 7 13.6 13.4 13.0 13.3 13.1 13.1 13.4 13.3 13.4 15.9
8 8 9.46 9.23 8.68 8.55 8.61 8.61 9.01 8.95 9.03 10.0
Discuss
Is this dataset in tidy format? Why or why not?
If not, what would a tidy data set look like? Sketch out the first few rows of this data set in tidy format
tidy
ing bakeoff ratings
Follow along with the slides to tidy the bakeoff_ratings
dataset
messyratings2
messy_ratings2 <- read_csv("https://stat220-s25.github.io/data/messy_ratings2.csv")
messy_ratings2
# A tibble: 8 × 21
series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.24 NA 3 NA 3 NA 2.6 NA
2 2 3.1 NA 3.53 NA 3.82 NA 3.6 NA
3 3 3.85 NA 4.6 NA 4.53 NA 4.71 NA
4 4 6.6 NA 6.65 NA 7.17 NA 6.82 NA
5 5 8.51 NA 8.79 NA 9.28 NA 10.2 NA
6 6 11.6 11.7 11.6 11.8 12.0 NA 12.4 12.7
7 7 13.6 13.9 13.4 13.7 13.0 13.4 13.3 13.9
8 8 9.46 9.72 9.23 9.53 8.68 9.06 8.55 8.87
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
# e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
# e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
# e10_28day <dbl>
Task 1
Tidy this data set by
- Selecting the
series
ande*_7day
columns - Pivoting the data to add a column for
episode
and a column forrating
(we’ll clean up the episode column later)
# include your code here
Task 2
- Clean the
episode
andperiod
column - Make a line plot with episode on the x-axis, rating on the y-axis, colored by series. (You will also need to map the
group
aesthetic to series)
When you’re done, remember to submit your .rmd to the google form!