dplyr 2: Pipes and Groups
Wrangling the nycflights23 data
summarize()
Using the flights data from the {nycflights23} package, use summarize() to compute statistics about the data:
The lowest and highest
distancetraveledThe lowest and highest
air_timeThe median
dep_delay
flights# A tibble: 435,352 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 1 1 1 2038 203 328 3
2 2023 1 1 18 2300 78 228 135
3 2023 1 1 31 2344 47 500 426
4 2023 1 1 33 2140 173 238 2352
5 2023 1 1 36 2048 228 223 2252
6 2023 1 1 503 500 3 808 815
7 2023 1 1 520 510 10 948 949
8 2023 1 1 524 530 -6 645 710
9 2023 1 1 537 520 17 926 818
10 2023 1 1 547 545 2 845 852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
summarize() II
Using the flights data from the {nycflights23} package, extract the rows for Delta flights. (Hint: the carrier code is “DL”)
Then use summarize() and a summary function to find:
The number of flights in this subset
The median
dep_delay. How does it compare to the overall median?
flights# A tibble: 435,352 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 1 1 1 2038 203 328 3
2 2023 1 1 18 2300 78 228 135
3 2023 1 1 31 2344 47 500 426
4 2023 1 1 33 2140 173 238 2352
5 2023 1 1 36 2048 228 223 2252
6 2023 1 1 503 500 3 808 815
7 2023 1 1 520 510 10 948 949
8 2023 1 1 524 530 -6 645 710
9 2023 1 1 537 520 17 926 818
10 2023 1 1 547 545 2 845 852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
group_by()
Use group_by(), summarize(), and slice_max() to display the five dest airports with the most flights from NYC airports in 2023. Your display should include the median flight delay for these airports.
flights# A tibble: 435,352 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 1 1 1 2038 203 328 3
2 2023 1 1 18 2300 78 228 135
3 2023 1 1 31 2344 47 500 426
4 2023 1 1 33 2140 173 238 2352
5 2023 1 1 36 2048 228 223 2252
6 2023 1 1 503 500 3 808 815
7 2023 1 1 520 510 10 948 949
8 2023 1 1 524 530 -6 645 710
9 2023 1 1 537 520 17 926 818
10 2023 1 1 547 545 2 845 852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Intro to Tidy Data
bakeoff_ratings <- read_csv("https://stat220-s25.github.io/data/bakeoff_messy_ratings.csv")
bakeoff_ratings# A tibble: 8 × 11
series e1 e2 e3 e4 e5 e6 e7 e8 e9 e10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.24 3 3 2.6 3.03 2.75 NA NA NA NA
2 2 3.1 3.53 3.82 3.6 3.83 4.25 4.42 5.06 NA NA
3 3 3.85 4.6 4.53 4.71 4.61 4.82 5.1 5.35 5.7 6.74
4 4 6.6 6.65 7.17 6.82 6.95 7.32 7.76 7.41 7.41 9.45
5 5 8.51 8.79 9.28 10.2 9.95 10.1 10.3 9.02 10.7 13.5
6 6 11.6 11.6 12.0 12.4 12.4 12 12.4 11.1 12.6 15.0
7 7 13.6 13.4 13.0 13.3 13.1 13.1 13.4 13.3 13.4 15.9
8 8 9.46 9.23 8.68 8.55 8.61 8.61 9.01 8.95 9.03 10.0
Discuss
Is this dataset in tidy format? Why or why not?
If not, what would a tidy data set look like? Sketch out the first few rows of this data set in tidy format
tidying bakeoff ratings
Follow along with the slides to tidy the bakeoff_ratings dataset
messyratings2
messy_ratings2 <- read_csv("https://stat220-s25.github.io/data/messy_ratings2.csv")
messy_ratings2# A tibble: 8 × 21
series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2.24 NA 3 NA 3 NA 2.6 NA
2 2 3.1 NA 3.53 NA 3.82 NA 3.6 NA
3 3 3.85 NA 4.6 NA 4.53 NA 4.71 NA
4 4 6.6 NA 6.65 NA 7.17 NA 6.82 NA
5 5 8.51 NA 8.79 NA 9.28 NA 10.2 NA
6 6 11.6 11.7 11.6 11.8 12.0 NA 12.4 12.7
7 7 13.6 13.9 13.4 13.7 13.0 13.4 13.3 13.9
8 8 9.46 9.72 9.23 9.53 8.68 9.06 8.55 8.87
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
# e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
# e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
# e10_28day <dbl>
Task 1
Tidy this data set by
- Selecting the
seriesande*_7daycolumns - Pivoting the data to add a column for
episodeand a column forrating(we’ll clean up the episode column later)
# include your code hereTask 2
- Clean the
episodeandperiodcolumn - Make a line plot with episode on the x-axis, rating on the y-axis, colored by series. (You will also need to map the
groupaesthetic to series)
When you’re done, remember to submit your .rmd to the google form!
