14-text-case-study

Author: Prof Amanda Luby

Affiliation: Carleton College, Stat 220 - Spring 2025

library(tidyverse)
library(tidytext) # functions for doing text analysis

Load the Data

en_coursera_reviews <- read_csv("https://stat220-s25.github.io/data/en_coursera_sample.csv")
glimpse(en_coursera_reviews)

Rows: 26,882
Columns: 5
$ CourseId  <chr> "nurture-market-strategies", "nand2tetris2", "schedule-proje…
$ Review    <chr> "It would be better if the instructors cared to respond to q…
$ Label     <dbl> 1, 5, 5, 5, 5, 5, 5, 3, 5, 1, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, …
$ cld2      <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", …
$ review_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…

Tokenize

library(tidytext)
en_coursera_reviews |>
  unnest_tokens(output = word, input = Review) |>
  select(CourseId, word)

# A tibble: 666,493 × 2
   CourseId                  word       
   <chr>                     <chr>      
 1 nurture-market-strategies it         
 2 nurture-market-strategies would      
 3 nurture-market-strategies be         
 4 nurture-market-strategies better     
 5 nurture-market-strategies if         
 6 nurture-market-strategies the        
 7 nurture-market-strategies instructors
 8 nurture-market-strategies cared      
 9 nurture-market-strategies to         
10 nurture-market-strategies respond    
# ℹ 666,483 more rows
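To see what unnest_tokens() does to a single review, it can help to run it on a tiny made-up example first. The sketch below uses a hypothetical two-row tibble (toy_reviews is not part of the Coursera data) and relies on the function's default word tokenizer, which lowercases the text and strips punctuation.

```r
library(tibble)
library(tidytext)

# Hypothetical two-review example (not taken from the Coursera data)
toy_reviews <- tibble(
  review_id = c(1, 2),
  Review    = c("Great course, LOVED it!", "Too fast; not enough examples.")
)

# One row per word; the original Review column is dropped by default
toy_tokens <- toy_reviews |>
  unnest_tokens(output = word, input = Review)

toy_tokens
```

Each review becomes several rows, one per word, while review_id is carried along so every token can still be traced back to its review.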

Count the tokens

en_coursera_reviews |>
  unnest_tokens(output = word, input = Review) |>
  anti_join(get_stopwords(source = "stopwords-iso"), by = "word") |>
  filter(str_detect(CourseId, "data-science")) |> 
  group_by(CourseId) |>
  count(word) |>
  slice_max(n, n = 10) |>
  ggplot(aes(x = n, y = fct_reorder(word, n))) + 
  geom_col() + 
  facet_wrap(~CourseId, scales = "free_y", nrow = 2) + 
  labs(
    y = "Word"
  )
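One thing to watch in faceted bar charts like this: fct_reorder() computes a single ordering across all facets, so bars inside an individual panel may not appear sorted when the same word shows up in several courses. A possible alternative, sketched here on hypothetical counts (toy_counts and the course names are made up), is tidytext's reorder_within() paired with scale_y_reordered().

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Hypothetical word counts for two made-up courses
toy_counts <- tibble::tibble(
  CourseId = c("course-a", "course-a", "course-a",
               "course-b", "course-b", "course-b"),
  word     = c("data", "python", "great", "data", "great", "easy"),
  n        = c(10, 7, 3, 2, 9, 5)
)

# reorder_within() orders words by n separately within each CourseId
toy_ordered <- toy_counts |>
  mutate(word = reorder_within(word, n, CourseId))

toy_plot <- ggplot(toy_ordered, aes(x = n, y = word)) +
  geom_col() +
  # scale_y_reordered() removes the internal suffix so labels show the plain word
  scale_y_reordered() +
  facet_wrap(~CourseId, scales = "free_y", nrow = 2) +
  labs(y = "Word")

toy_plot
```

This is an optional refinement, not a change to the analysis above; the counts being plotted are the same either way.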

Create “clean” reviews

coursera_reviews <- en_coursera_reviews |>
  unnest_tokens(word, Review) |>
  anti_join(get_stopwords(source = "stopwords-iso"), by = "word") |>
  group_by(CourseId, review_id) |>
  summarize(review_clean = paste(word, collapse = " "), .groups = "drop")
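A small example can show what the "clean" reviews look like and point out a side effect: a review made up only of stop words loses every token in the anti_join() and therefore disappears from the result. The sketch below uses hypothetical data (toy_reviews) and, as an assumption for brevity, the default Snowball list from get_stopwords() instead of the stopwords-iso list used above.

```r
library(dplyr)
library(tidytext)

# Hypothetical reviews (not taken from the Coursera data)
toy_reviews <- tibble::tibble(
  CourseId  = c("course-a", "course-a"),
  review_id = c(1, 2),
  Review    = c("It would be better if the instructors responded", "It is")
)

# Assumption: default Snowball English list, not stopwords-iso
stops <- get_stopwords()

toy_clean <- toy_reviews |>
  unnest_tokens(word, Review) |>
  anti_join(stops, by = "word") |>          # drop stop words, keep word order
  group_by(CourseId, review_id) |>
  summarize(review_clean = paste(word, collapse = " "), .groups = "drop")

toy_clean
```

If you need one row per original review, a left_join() back onto the original review_id values would restore the dropped reviews with a missing review_clean.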

Your turn

Explore the en_coursera_reviews dataset with the folks around you. Can you replicate my analyses and then try a few of your own? Here are some ideas:

Explore another subject (besides data-science)
Label contains the numeric rating attached to each review. How does the review text differ between high and low ratings?
stopwords::stopwords_getsources() gives the available stopword dictionaries. How do these differ, and how do they change your results?
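As a possible starting point for the last two ideas, here is a minimal sketch on made-up data. The toy_reviews tibble, the high/low cutoff of 4, and the choice of three stop word sources are all assumptions for illustration; swap in en_coursera_reviews and your own choices.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini data set with the same column names as en_coursera_reviews
toy_reviews <- tibble::tibble(
  review_id = 1:4,
  Label     = c(5, 4, 1, 2),
  Review    = c("Excellent lectures", "Clear and excellent",
                "Boring lectures", "Confusing and boring")
)

# Rating idea: word counts split by an assumed high/low cutoff of 4
toy_by_rating <- toy_reviews |>
  mutate(rating_group = if_else(Label >= 4, "high", "low")) |>
  unnest_tokens(word, Review) |>
  count(rating_group, word, sort = TRUE)

# Stop word idea: how many English stop words does each source contain?
sources <- c("snowball", "smart", "stopwords-iso")
stop_sizes <- sapply(sources, function(src) {
  length(stopwords::stopwords(language = "en", source = src))
})

toy_by_rating
stop_sizes
```

Not every source returned by stopwords::stopwords_getsources() includes an English list, which is why the sketch names three sources explicitly.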