Working with
Factors

Day 12

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Survivor castaways data

# A tibble: 1,351 × 29
   version version_season season_name      season full_name castaway_id castaway
   <fct>   <chr>          <chr>             <dbl> <chr>     <chr>       <chr>   
 1 US      US01           Survivor: Borneo      1 Sonja Ch… US0001      Sonja   
 2 US      US01           Survivor: Borneo      1 B.B. And… US0002      B.B.    
 3 US      US01           Survivor: Borneo      1 Stacey S… US0003      Stacey  
 4 US      US01           Survivor: Borneo      1 Ramona G… US0004      Ramona  
 5 US      US01           Survivor: Borneo      1 Dirk Been US0005      Dirk    
 6 US      US01           Survivor: Borneo      1 Joel Klug US0006      Joel    
 7 US      US01           Survivor: Borneo      1 Gretchen… US0007      Gretchen
 8 US      US01           Survivor: Borneo      1 Greg Buis US0008      Greg    
 9 US      US01           Survivor: Borneo      1 Jenna Le… US0009      Jenna   
10 US      US01           Survivor: Borneo      1 Gervase … US0010      Gervase 
# ℹ 1,341 more rows
# ℹ 22 more variables: age <dbl>, city <chr>, state <chr>, episode <dbl>,
#   day <dbl>, order <dbl>, result <chr>, place <dbl>, jury_status <chr>,
#   original_tribe <chr>, jury <lgl>, finalist <lgl>, winner <lgl>,
#   acknowledge <lgl>, ack_look <lgl>, ack_speak <lgl>, ack_gesture <lgl>,
#   ack_smile <lgl>, ack_quote <chr>, ack_score <dbl>, jury1 <dbl>, jury2 <fct>

Both of these are categorical variables

unique(castaways$version)
[1] US AU SA UK NZ
Levels: AU NZ SA UK US
unique(castaways$season_name) %>% head()
[1] "Survivor: Borneo"                 "Survivor: The Australian Outback"
[3] "Survivor: Africa"                 "Survivor: Marquesas"             
[5] "Survivor: Thailand"               "Survivor: The Amazon"            

But are stored as different types

class(castaways$version)
[1] "factor"
class(castaways$season_name) 
[1] "character"

Same graph, colored by the same categorical variable

Factors

R’s representation of categorical data. Consists of:

  1. A set of values

  2. An ordered set of valid levels

eyes <- factor(x = c("blue", "green", "green"), 
               levels = c("blue", "brown", "green"))
eyes
[1] blue  green green
Levels: blue brown green

Factors

Stored as an integer vector with a levels attribute

unclass(eyes)
[1] 1 3 3
attr(,"levels")
[1] "blue"  "brown" "green"

  • Simple functions for working with factors.

  • Part of the tidyverse

# loaded with tidyverse
library(forcats)

gss_cat

A sample of data from the General Social Survey, a long-running US survey conducted by NORC at the University of Chicago.

# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows

Warm up

Use gss_cat to answer the following questions.

  1. Which religions watch the least TV?

  2. Do married people watch more or less TV than single people?

04:00

Which religions watch the least TV?

gss_cat
# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours)
# A tibble: 11,337 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 3  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 4  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 5  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 6  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
 7  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
 8  2000 Married          53 White $25000 or more Not str d… Prot… Other       2
 9  2000 Married          52 White $25000 or more Strong de… Prot… Sout…       1
10  2000 Divorced         52 White $25000 or more Ind,near … None  Not …       1
# ℹ 11,327 more rows

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig)
# A tibble: 11,337 × 9
# Groups:   relig [15]
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 3  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 4  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 5  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 6  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
 7  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
 8  2000 Married          53 White $25000 or more Not str d… Prot… Other       2
 9  2000 Married          52 White $25000 or more Strong de… Prot… Sout…       1
10  2000 Divorced         52 White $25000 or more Ind,near … None  Not …       1
# ℹ 11,327 more rows

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours))
# A tibble: 15 × 2
   relig                   tvhours
   <fct>                     <dbl>
 1 No answer                  2.72
 2 Don't know                 4.62
 3 Inter-nondenominational    2.87
 4 Native american            3.46
 5 Christian                  2.79
 6 Orthodox-christian         2.42
 7 Moslem/islam               2.44
 8 Other eastern              1.67
 9 Hinduism                   1.89
10 Buddhism                   2.38
11 Other                      2.73
12 None                       2.71
13 Jewish                     2.52
14 Catholic                   2.96
15 Protestant                 3.15

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, relig))

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()

Which do you prefer?

Why is the y-axis in this order?

levels()

Use levels() to access a factor’s levels

gss_cat %>% pull(relig) %>% levels()
 [1] "No answer"               "Don't know"             
 [3] "Inter-nondenominational" "Native american"        
 [5] "Christian"               "Orthodox-christian"     
 [7] "Moslem/islam"            "Other eastern"          
 [9] "Hinduism"                "Buddhism"               
[11] "Other"                   "None"                   
[13] "Jewish"                  "Catholic"               
[15] "Protestant"              "Not applicable"         

Most useful factor skills:

1. Reorder the levels

2. Recode the levels

3. Collapse levels

fct_reorder

  • .f factor vector
  • .x variable to reorder by (in conjunction with .fun)
  • .fun function to reorder by
  • .desc put levels in descending order?
fct_reorder(
  .f, 
  .x, 
  .fun = median, 
  ...,
  .desc = FALSE 
  )

Reorder relig by tvhours

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(
    x = tvhours,
    y = relig
  )) +
    geom_point()

Reorder relig by tvhours

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(
    x = tvhours,
    y = fct_reorder(relig, tvhours)                 #ROTATE
  )) +
    geom_point()

Try it

Use rincome_summary to construct a dotplot of rincome against age.

Reorder rincome by age

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
02:30

Which do you prefer?

fct_reorder2

Reorders the levels of a factor by the Y values associated with the largest X values.

  • .f factor vector
  • .x X variable
  • .y Y variable
  • .fun function to reorder by
  • .desc put levels in descending order?
fct_reorder2(
  .f, 
  .x, 
  .y, 
  .fun = median, 
  ...,
  .desc = FALSE
  )

Reorder marital

gss_cat %>%
  drop_na(age) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(
    aes(age,
        prop,
        color = marital )) +

  geom_line() +
  scale_color_colorblind("")

Reorder marital

gss_cat %>%
  drop_na(age) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(
    aes(age,
        prop,
        color = fct_reorder2(marital, .x = age, .y = prop))) + #ROTATE

  geom_line() +
  scale_color_colorblind("")

Other reordering functions

gss_cat %>%
  ggplot() +
  geom_bar(aes(x = marital))

Other reordering functions

gss_cat %>%
  ggplot() +
  geom_bar(aes(x = fct_infreq(marital)))

Other reordering functions

gss_cat %>%
  ggplot() +
  geom_bar(aes(x = fct_rev(fct_infreq(marital))))

Which political leaning watches more TV?

How could we improve the partyid labels?

fct_recode

Changes values of levels

  • .f factor vector
  • ... new level = old level pairs (as a named character vector)
fct_recode(.f, ...)

Recoding partyid

gss_cat
# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours)
# A tibble: 11,337 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 3  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 4  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 5  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 6  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
 7  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
 8  2000 Married          53 White $25000 or more Not str d… Prot… Other       2
 9  2000 Married          52 White $25000 or more Strong de… Prot… Sout…       1
10  2000 Divorced         52 White $25000 or more Ind,near … None  Not …       1
# ℹ 11,327 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours)
# A tibble: 11,337 × 2
   partyid            tvhours
   <fct>                <int>
 1 Ind,near rep            12
 2 Independent              2
 3 Ind,near rep             4
 4 Not str democrat         1
 5 Not str republican       3
 6 Not str democrat         0
 7 Strong republican        3
 8 Not str democrat         2
 9 Strong democrat          1
10 Ind,near dem             1
# ℹ 11,327 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"))
# A tibble: 11,337 × 2
   partyid               tvhours
   <fct>                   <int>
 1 Independent, near rep      12
 2 Independent                 2
 3 Independent, near rep       4
 4 Democrat, weak              1
 5 Republican, weak            3
 6 Democrat, weak              0
 7 Republican, strong          3
 8 Democrat, weak              2
 9 Democrat, strong            1
10 Independent, near dem       1
# ℹ 11,327 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid)
# A tibble: 11,337 × 2
# Groups:   partyid [10]
   partyid               tvhours
   <fct>                   <int>
 1 Independent, near rep      12
 2 Independent                 2
 3 Independent, near rep       4
 4 Democrat, weak              1
 5 Republican, weak            3
 6 Democrat, weak              0
 7 Republican, strong          3
 8 Democrat, weak              2
 9 Democrat, strong            1
10 Independent, near dem       1
# ℹ 11,327 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours))
# A tibble: 10 × 2
   partyid               tvhours
   <fct>                   <dbl>
 1 No answer                3.22
 2 Don't know               2   
 3 Other party              2.79
 4 Republican, strong       2.72
 5 Republican, weak         2.63
 6 Independent, near rep    2.77
 7 Independent              3.08
 8 Independent, near dem    2.80
 9 Democrat, weak           3.04
10 Democrat, strong         3.52

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours)))

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
  geom_point()

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
  geom_point() +
  labs(y = "partyid")

How can we combine these factor levels?

fct_collapse()

Changes multiple levels into single levels

  • .f factor vector
  • ... named arguments set to a character vector (levels in the vector will be collapsed to the name of the argument)
fct_collapse(.f, ...)

Collapsing partyid

gss_cat
# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours)
# A tibble: 11,337 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 3  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 4  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 5  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 6  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
 7  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
 8  2000 Married          53 White $25000 or more Not str d… Prot… Other       2
 9  2000 Married          52 White $25000 or more Strong de… Prot… Sout…       1
10  2000 Divorced         52 White $25000 or more Ind,near … None  Not …       1
# ℹ 11,327 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours)
# A tibble: 11,337 × 2
   partyid            tvhours
   <fct>                <int>
 1 Ind,near rep            12
 2 Independent              2
 3 Ind,near rep             4
 4 Not str democrat         1
 5 Not str republican       3
 6 Not str democrat         0
 7 Strong republican        3
 8 Not str democrat         2
 9 Strong democrat          1
10 Ind,near dem             1
# ℹ 11,327 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  )
# A tibble: 11,337 × 2
   partyid      tvhours
   <fct>          <int>
 1 conservative      12
 2 Independent        2
 3 conservative       4
 4 liberal            1
 5 conservative       3
 6 liberal            0
 7 conservative       3
 8 liberal            2
 9 liberal            1
10 liberal            1
# ℹ 11,327 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid)
# A tibble: 11,337 × 2
# Groups:   partyid [6]
   partyid      tvhours
   <fct>          <int>
 1 conservative      12
 2 Independent        2
 3 conservative       4
 4 liberal            1
 5 conservative       3
 6 liberal            0
 7 conservative       3
 8 liberal            2
 9 liberal            1
10 liberal            1
# ℹ 11,327 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours))
# A tibble: 6 × 2
  partyid      tvhours
  <fct>          <dbl>
1 No answer       3.22
2 Don't know      2   
3 Other party     2.79
4 conservative    2.69
5 Independent     3.08
6 liberal         3.15

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours)))

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
  geom_point()

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
  geom_point() +
  labs(y = "partyid")

Your turn

Collapse the marital variable to have levels Married, Not married, and No answer

Include "Never married", "Divorced", and “Widowed" in Not married

02:00

There are relatively few points in each of these groups

There are relatively few points in each of these groups

fct_lump()

Collapses levels with fewest values into a single level.

By default collapses as many levels as possible such that the new level is still the smallest.

  • f factor vector
  • n preserve the most common n levels
  • other_level = "Other"
fct_lump(
  f, 
  n, 
  other_level = "Other", 
  ...
)

Lumping parytid

gss_cat %>%
  mutate(partyid = partyid) %>%
  ggplot(aes(partyid)) +
  geom_bar() +
  labs(x = "partyid")

Lumping parytid

gss_cat %>%
  mutate(partyid = fct_lump(partyid, n = 2)) %>%
  ggplot(aes(partyid)) +
  geom_bar() +
  labs(x = "partyid")

Lumping parytid

gss_cat %>%
  mutate(partyid = fct_lump(partyid, n = 3)) %>%
  ggplot(aes(partyid)) +
  geom_bar() +
  labs(x = "partyid")

Your turn: hotel bookings

The remainder of the activity file has you fixing up some plots using hotel bookings data from Tidy Tuesday