Working with
Factors

Day 12

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Survivor `castaways` data

# A tibble: 1,351 × 29
   version version_season season_name      season full_name castaway_id castaway
   <fct>   <chr>          <chr>             <dbl> <chr>     <chr>       <chr>   
 1 US      US01           Survivor: Borneo      1 Sonja Ch… US0001      Sonja   
 2 US      US01           Survivor: Borneo      1 B.B. And… US0002      B.B.    
 3 US      US01           Survivor: Borneo      1 Stacey S… US0003      Stacey  
 4 US      US01           Survivor: Borneo      1 Ramona G… US0004      Ramona  
 5 US      US01           Survivor: Borneo      1 Dirk Been US0005      Dirk    
 6 US      US01           Survivor: Borneo      1 Joel Klug US0006      Joel    
 7 US      US01           Survivor: Borneo      1 Gretchen… US0007      Gretchen
 8 US      US01           Survivor: Borneo      1 Greg Buis US0008      Greg    
 9 US      US01           Survivor: Borneo      1 Jenna Le… US0009      Jenna   
10 US      US01           Survivor: Borneo      1 Gervase … US0010      Gervase 
# ℹ 1,341 more rows
# ℹ 22 more variables: age <dbl>, city <chr>, state <chr>, episode <dbl>,
#   day <dbl>, order <dbl>, result <chr>, place <dbl>, jury_status <chr>,
#   original_tribe <chr>, jury <lgl>, finalist <lgl>, winner <lgl>,
#   acknowledge <lgl>, ack_look <lgl>, ack_speak <lgl>, ack_gesture <lgl>,
#   ack_smile <lgl>, ack_quote <chr>, ack_score <dbl>, jury1 <dbl>, jury2 <fct>

Both of these are categorical variables

unique(castaways$version)

[1] US AU SA UK NZ
Levels: AU NZ SA UK US

unique(castaways$season_name) %>% head()

[1] "Survivor: Borneo"                 "Survivor: The Australian Outback"
[3] "Survivor: Africa"                 "Survivor: Marquesas"             
[5] "Survivor: Thailand"               "Survivor: The Amazon"

But are stored as different types

class(castaways$version)

[1] "factor"

class(castaways$season_name)

[1] "character"

Same graph, colored by the same categorical variable

Factors

R’s representation of categorical data. Consists of:

A set of values
An ordered set of valid levels

eyes <- factor(x = c("blue", "green", "green"), 
               levels = c("blue", "brown", "green"))
eyes

[1] blue  green green
Levels: blue brown green

Factors

Stored as an integer vector with a levels attribute

unclass(eyes)

[1] 1 3 3
attr(,"levels")
[1] "blue"  "brown" "green"

Simple functions for working with factors.
Part of the tidyverse

# loaded with tidyverse
library(forcats)

`gss_cat`

A sample of data from the General Social Survey, a long-running US survey conducted by NORC at the University of Chicago.

# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows

Warm up

Use gss_cat to answer the following questions.

Which religions watch the least TV?
Do married people watch more or less TV than single people?

04:00

Which religions watch the least TV?

gss_cat

# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours)

# A tibble: 11,337 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 3  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 4  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 5  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 6  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
 7  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
 8  2000 Married          53 White $25000 or more Not str d… Prot… Other       2
 9  2000 Married          52 White $25000 or more Strong de… Prot… Sout…       1
10  2000 Divorced         52 White $25000 or more Ind,near … None  Not …       1
# ℹ 11,327 more rows

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig)

# A tibble: 11,337 × 9
# Groups:   relig [15]
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 3  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 4  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 5  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 6  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
 7  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
 8  2000 Married          53 White $25000 or more Not str d… Prot… Other       2
 9  2000 Married          52 White $25000 or more Strong de… Prot… Sout…       1
10  2000 Divorced         52 White $25000 or more Ind,near … None  Not …       1
# ℹ 11,327 more rows

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours))

# A tibble: 15 × 2
   relig                   tvhours
   <fct>                     <dbl>
 1 No answer                  2.72
 2 Don't know                 4.62
 3 Inter-nondenominational    2.87
 4 Native american            3.46
 5 Christian                  2.79
 6 Orthodox-christian         2.42
 7 Moslem/islam               2.44
 8 Other eastern              1.67
 9 Hinduism                   1.89
10 Buddhism                   2.38
11 Other                      2.73
12 None                       2.71
13 Jewish                     2.52
14 Catholic                   2.96
15 Protestant                 3.15

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, relig))

Which religions watch the least TV?

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()

Which do you prefer?

Why is the y-axis in this order?

`levels()`

Use levels() to access a factor’s levels

gss_cat %>% pull(relig) %>% levels()

 [1] "No answer"               "Don't know"             
 [3] "Inter-nondenominational" "Native american"        
 [5] "Christian"               "Orthodox-christian"     
 [7] "Moslem/islam"            "Other eastern"          
 [9] "Hinduism"                "Buddhism"               
[11] "Other"                   "None"                   
[13] "Jewish"                  "Catholic"               
[15] "Protestant"              "Not applicable"

Most useful factor skills:

1. Reorder the levels

2. Recode the levels

3. Collapse levels

`fct_reorder`

.f factor vector
.x variable to reorder by (in conjunction with .fun)
.fun function to reorder by
.desc put levels in descending order?

fct_reorder(
  .f, 
  .x, 
  .fun = median, 
  ...,
  .desc = FALSE 
  )

Reorder relig by tvhours

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(
    x = tvhours,
    y = relig
  )) +
    geom_point()

Reorder relig by tvhours

gss_cat %>%
  drop_na(tvhours) %>%
  group_by(relig) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(
    x = tvhours,
    y = fct_reorder(relig, tvhours)                 #ROTATE
  )) +
    geom_point()

Try it

Use rincome_summary to construct a dotplot of rincome against age.

Reorder rincome by age

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

02:30

Which do you prefer?

`fct_reorder2`

Reorders the levels of a factor by the Y values associated with the largest X values.

.f factor vector
.x X variable
.y Y variable
.fun function to reorder by
.desc put levels in descending order?

fct_reorder2(
  .f, 
  .x, 
  .y, 
  .fun = median, 
  ...,
  .desc = FALSE
  )

Reorder marital

gss_cat %>%
  drop_na(age) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(
    aes(age,
        prop,
        color = marital )) +

  geom_line() +
  scale_color_colorblind("")

Reorder marital

gss_cat %>%
  drop_na(age) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(
    aes(age,
        prop,
        color = fct_reorder2(marital, .x = age, .y = prop))) + #ROTATE

  geom_line() +
  scale_color_colorblind("")

Other reordering functions

gss_cat %>%
  ggplot() +
  geom_bar(aes(x = marital))

Other reordering functions

gss_cat %>%
  ggplot() +
  geom_bar(aes(x = fct_infreq(marital)))

Other reordering functions

gss_cat %>%
  ggplot() +
  geom_bar(aes(x = fct_rev(fct_infreq(marital))))

Which political leaning watches more TV?

How could we improve the partyid labels?

`fct_recode`

Changes values of levels

.f factor vector
... new level = old level pairs (as a named character vector)

fct_recode(.f, ...)

Recoding partyid

gss_cat

# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours)

# A tibble: 11,337 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 3  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 4  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 5  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 6  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
 7  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
 8  2000 Married          53 White $25000 or more Not str d… Prot… Other       2
 9  2000 Married          52 White $25000 or more Strong de… Prot… Sout…       1
10  2000 Divorced         52 White $25000 or more Ind,near … None  Not …       1
# ℹ 11,327 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours)

# A tibble: 11,337 × 2
   partyid            tvhours
   <fct>                <int>
 1 Ind,near rep            12
 2 Independent              2
 3 Ind,near rep             4
 4 Not str democrat         1
 5 Not str republican       3
 6 Not str democrat         0
 7 Strong republican        3
 8 Not str democrat         2
 9 Strong democrat          1
10 Ind,near dem             1
# ℹ 11,327 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"))

# A tibble: 11,337 × 2
   partyid               tvhours
   <fct>                   <int>
 1 Independent, near rep      12
 2 Independent                 2
 3 Independent, near rep       4
 4 Democrat, weak              1
 5 Republican, weak            3
 6 Democrat, weak              0
 7 Republican, strong          3
 8 Democrat, weak              2
 9 Democrat, strong            1
10 Independent, near dem       1
# ℹ 11,327 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid)

# A tibble: 11,337 × 2
# Groups:   partyid [10]
   partyid               tvhours
   <fct>                   <int>
 1 Independent, near rep      12
 2 Independent                 2
 3 Independent, near rep       4
 4 Democrat, weak              1
 5 Republican, weak            3
 6 Democrat, weak              0
 7 Republican, strong          3
 8 Democrat, weak              2
 9 Democrat, strong            1
10 Independent, near dem       1
# ℹ 11,327 more rows

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours))

# A tibble: 10 × 2
   partyid               tvhours
   <fct>                   <dbl>
 1 No answer                3.22
 2 Don't know               2   
 3 Other party              2.79
 4 Republican, strong       2.72
 5 Republican, weak         2.63
 6 Independent, near rep    2.77
 7 Independent              3.08
 8 Independent, near dem    2.80
 9 Democrat, weak           3.04
10 Democrat, strong         3.52

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours)))

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
  geom_point()

Recoding partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
    mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat")) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
  geom_point() +
  labs(y = "partyid")

How can we combine these factor levels?

`fct_collapse()`

Changes multiple levels into single levels

.f factor vector
... named arguments set to a character vector (levels in the vector will be collapsed to the name of the argument)

fct_collapse(.f, ...)

Collapsing partyid

gss_cat

# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours)

# A tibble: 11,337 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 3  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 4  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 5  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 6  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
 7  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
 8  2000 Married          53 White $25000 or more Not str d… Prot… Other       2
 9  2000 Married          52 White $25000 or more Strong de… Prot… Sout…       1
10  2000 Divorced         52 White $25000 or more Ind,near … None  Not …       1
# ℹ 11,327 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours)

# A tibble: 11,337 × 2
   partyid            tvhours
   <fct>                <int>
 1 Ind,near rep            12
 2 Independent              2
 3 Ind,near rep             4
 4 Not str democrat         1
 5 Not str republican       3
 6 Not str democrat         0
 7 Strong republican        3
 8 Not str democrat         2
 9 Strong democrat          1
10 Ind,near dem             1
# ℹ 11,327 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  )

# A tibble: 11,337 × 2
   partyid      tvhours
   <fct>          <int>
 1 conservative      12
 2 Independent        2
 3 conservative       4
 4 liberal            1
 5 conservative       3
 6 liberal            0
 7 conservative       3
 8 liberal            2
 9 liberal            1
10 liberal            1
# ℹ 11,327 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid)

# A tibble: 11,337 × 2
# Groups:   partyid [6]
   partyid      tvhours
   <fct>          <int>
 1 conservative      12
 2 Independent        2
 3 conservative       4
 4 liberal            1
 5 conservative       3
 6 liberal            0
 7 conservative       3
 8 liberal            2
 9 liberal            1
10 liberal            1
# ℹ 11,327 more rows

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours))

# A tibble: 6 × 2
  partyid      tvhours
  <fct>          <dbl>
1 No answer       3.22
2 Don't know      2   
3 Other party     2.79
4 conservative    2.69
5 Independent     3.08
6 liberal         3.15

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours)))

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
  geom_point()

Collapsing partyid

gss_cat %>%
  drop_na(tvhours) %>%
  select(partyid, tvhours) %>%
  mutate(
    partyid =
      fct_collapse(
        partyid,
        conservative = c("Strong republican",
                         "Not str republican",
                         "Ind,near rep"),
        liberal = c("Strong democrat",
                    "Not str democrat",
                    "Ind,near dem"))
  ) %>%
  group_by(partyid) %>%
  summarize(tvhours = mean(tvhours)) %>%
  ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) +
  geom_point() +
  labs(y = "partyid")

Your turn

Collapse the marital variable to have levels Married, Not married, and No answer

Include "Never married", "Divorced", and “Widowed" in Not married

02:00

There are relatively few points in each of these groups

There are relatively few points in each of these groups

`fct_lump()`

Collapses levels with fewest values into a single level.

By default collapses as many levels as possible such that the new level is still the smallest.

f factor vector
n preserve the most common n levels
other_level = "Other"

fct_lump(
  f, 
  n, 
  other_level = "Other", 
  ...
)

Lumping parytid

gss_cat %>%
  mutate(partyid = partyid) %>%
  ggplot(aes(partyid)) +
  geom_bar() +
  labs(x = "partyid")

Lumping parytid

gss_cat %>%
  mutate(partyid = fct_lump(partyid, n = 2)) %>%
  ggplot(aes(partyid)) +
  geom_bar() +
  labs(x = "partyid")

Lumping parytid

gss_cat %>%
  mutate(partyid = fct_lump(partyid, n = 3)) %>%
  ggplot(aes(partyid)) +
  geom_bar() +
  labs(x = "partyid")

Your turn: hotel bookings

The remainder of the activity file has you fixing up some plots using hotel bookings data from Tidy Tuesday

Working with Factors

Survivor castaways data

Both of these are categorical variables

But are stored as different types

Same graph, colored by the same categorical variable

Factors

Factors

gss_cat

Warm up

Which religions watch the least TV?

Which religions watch the least TV?

Which religions watch the least TV?

Which religions watch the least TV?

Which religions watch the least TV?

Which religions watch the least TV?

Which do you prefer?

Why is the y-axis in this order?

levels()

Most useful factor skills:

1. Reorder the levels

2. Recode the levels

3. Collapse levels

fct_reorder

Reorder relig by tvhours

Reorder relig by tvhours

Try it

Which do you prefer?

fct_reorder2

Reorder marital

Reorder marital

Other reordering functions

Other reordering functions

Other reordering functions

Which political leaning watches more TV?

fct_recode

Recoding partyid

Recoding partyid

Recoding partyid

Recoding partyid

Recoding partyid

Recoding partyid

Recoding partyid

Recoding partyid

Recoding partyid

How can we combine these factor levels?

fct_collapse()

Collapsing partyid

Collapsing partyid

Collapsing partyid

Collapsing partyid

Collapsing partyid

Collapsing partyid

Collapsing partyid

Collapsing partyid

Collapsing partyid

Your turn

There are relatively few points in each of these groups

There are relatively few points in each of these groups

fct_lump()

Lumping parytid

Lumping parytid

Lumping parytid

Your turn: hotel bookings

Working with
Factors

Survivor `castaways` data

`gss_cat`

`levels()`

`fct_reorder`

`fct_reorder2`

`fct_recode`

`fct_collapse()`

`fct_lump()`