`dplyr` 2: Pipes and Groups

Author

Affiliation

Stat220 – W25

Carleton College
Stat 220 - Spring 2025

library(tidyverse)
library(ggthemes)
library(nycflights23)

Wrangling the `nycflights23` data

`summarize()`

Using the flights data from the {nycflights23} package, use summarize() to compute statistics about the data:

The lowest and highest distance traveled
The lowest and highest air_time
The median dep_delay

flights

# A tibble: 435,352 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2023     1     1        1           2038       203      328              3
 2  2023     1     1       18           2300        78      228            135
 3  2023     1     1       31           2344        47      500            426
 4  2023     1     1       33           2140       173      238           2352
 5  2023     1     1       36           2048       228      223           2252
 6  2023     1     1      503            500         3      808            815
 7  2023     1     1      520            510        10      948            949
 8  2023     1     1      524            530        -6      645            710
 9  2023     1     1      537            520        17      926            818
10  2023     1     1      547            545         2      845            852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

`summarize()` II

Using the flights data from the {nycflights23} package, extract the rows for Delta flights. (Hint: the carrier code is “DL”)

Then use summarize() and a summary function to find:

The number of flights in this subset
The median dep_delay. How does it compare to the overall median?

flights

# A tibble: 435,352 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2023     1     1        1           2038       203      328              3
 2  2023     1     1       18           2300        78      228            135
 3  2023     1     1       31           2344        47      500            426
 4  2023     1     1       33           2140       173      238           2352
 5  2023     1     1       36           2048       228      223           2252
 6  2023     1     1      503            500         3      808            815
 7  2023     1     1      520            510        10      948            949
 8  2023     1     1      524            530        -6      645            710
 9  2023     1     1      537            520        17      926            818
10  2023     1     1      547            545         2      845            852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

`group_by()`

Use group_by(), summarize(), and slice_max() to display the five dest airports with the most flights from NYC airports in 2023. Your display should include the median flight delay for these airports.

flights

# A tibble: 435,352 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2023     1     1        1           2038       203      328              3
 2  2023     1     1       18           2300        78      228            135
 3  2023     1     1       31           2344        47      500            426
 4  2023     1     1       33           2140       173      238           2352
 5  2023     1     1       36           2048       228      223           2252
 6  2023     1     1      503            500         3      808            815
 7  2023     1     1      520            510        10      948            949
 8  2023     1     1      524            530        -6      645            710
 9  2023     1     1      537            520        17      926            818
10  2023     1     1      547            545         2      845            852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Intro to Tidy Data

bakeoff_ratings <- read_csv("https://stat220-s25.github.io/data/bakeoff_messy_ratings.csv")
bakeoff_ratings

# A tibble: 8 × 11
  series    e1    e2    e3    e4    e5    e6    e7    e8    e9   e10
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1      1  2.24  3     3     2.6   3.03  2.75 NA    NA    NA    NA   
2      2  3.1   3.53  3.82  3.6   3.83  4.25  4.42  5.06 NA    NA   
3      3  3.85  4.6   4.53  4.71  4.61  4.82  5.1   5.35  5.7   6.74
4      4  6.6   6.65  7.17  6.82  6.95  7.32  7.76  7.41  7.41  9.45
5      5  8.51  8.79  9.28 10.2   9.95 10.1  10.3   9.02 10.7  13.5 
6      6 11.6  11.6  12.0  12.4  12.4  12    12.4  11.1  12.6  15.0 
7      7 13.6  13.4  13.0  13.3  13.1  13.1  13.4  13.3  13.4  15.9 
8      8  9.46  9.23  8.68  8.55  8.61  8.61  9.01  8.95  9.03 10.0

Discuss

Is this dataset in tidy format? Why or why not?

If not, what would a tidy data set look like? Sketch out the first few rows of this data set in tidy format

`tidy`ing bakeoff ratings

Follow along with the slides to tidy the bakeoff_ratings dataset

`messyratings2`

messy_ratings2 <- read_csv("https://stat220-s25.github.io/data/messy_ratings2.csv")
messy_ratings2

# A tibble: 8 × 21
  series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day e4_28day
   <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
1      1    2.24    NA       3       NA       3       NA       2.6     NA   
2      2    3.1     NA       3.53    NA       3.82    NA       3.6     NA   
3      3    3.85    NA       4.6     NA       4.53    NA       4.71    NA   
4      4    6.6     NA       6.65    NA       7.17    NA       6.82    NA   
5      5    8.51    NA       8.79    NA       9.28    NA      10.2     NA   
6      6   11.6     11.7    11.6     11.8    12.0     NA      12.4     12.7 
7      7   13.6     13.9    13.4     13.7    13.0     13.4    13.3     13.9 
8      8    9.46     9.72    9.23     9.53    8.68     9.06    8.55     8.87
# ℹ 12 more variables: e5_7day <dbl>, e5_28day <dbl>, e6_7day <dbl>,
#   e6_28day <dbl>, e7_7day <dbl>, e7_28day <dbl>, e8_7day <dbl>,
#   e8_28day <dbl>, e9_7day <dbl>, e9_28day <dbl>, e10_7day <dbl>,
#   e10_28day <dbl>

Task 1

Tidy this data set by

Selecting the series and e*_7day columns
Pivoting the data to add a column for episode and a column for rating (we’ll clean up the episode column later)

# include your code here

Task 2

Clean the episode and period column
Make a line plot with episode on the x-axis, rating on the y-axis, colored by series. (You will also need to map the group aesthetic to series)

Note

When you’re done, remember to submit your .rmd to the google form!

Wrangling the nycflights23 data

summarize()

summarize() II

group_by()

Intro to Tidy Data

Discuss

tidying bakeoff ratings

messyratings2

Task 1

Task 2

Wrangling the `nycflights23` data

`summarize()`

`summarize()` II

`group_by()`

`tidy`ing bakeoff ratings

`messyratings2`