Intro to ggplot2

Author: Prof Amanda Luby
Affiliation: Carleton College, Stat 220 - Spring 2025

Note: the setup chunk includes the line eval = FALSE. Make sure to delete this when you are ready to knit your file.

The data we’re using today contains information about all seasons of Survivor and comes from the survivoR R package. In the show, a group of people (called castaways) are placed in an isolated location, where they must provide food, fire, and shelter for themselves. The castaways compete in challenges for rewards and immunity from elimination. The challenges test physical abilities, like running and swimming, or mental abilities, like puzzles and endurance challenges. The castaways are progressively eliminated from the game as they are voted out by their fellow contestants until only two or three remain. At that point, the players who were eliminated (the “jury”) vote for the winner. The winner is given the title of “Sole Survivor” and is awarded the grand prize of $1,000,000.

season_summary = readr::read_csv("https://math.carleton.edu/aluby/stat220/survivor_season_summary.csv")

Speed Groupwork

Round 1: View the data and summary statistics

1. To get started, load the tidyverse and take a glimpse at the dataset. How many rows and columns are there? What does each row represent?

Loading the tidyverse loads its core packages (9 as of tidyverse 2.0.0), one of which is ggplot2. You can certainly load each package individually, but it’s often easier to just load the whole thing.
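As a rough sketch of what glimpse() reports, here it is on mpg, a small dataset bundled with ggplot2. mpg is only a stand-in so the chunk runs on its own; it is not today's data, and the names n_rows and n_cols are just illustrative.

```r
library(ggplot2)  # provides the mpg example dataset
library(dplyr)    # provides glimpse()

glimpse(mpg)      # one line per column: name, type, and the first few values

n_rows <- nrow(mpg)  # number of rows (observations)
n_cols <- ncol(mpg)  # number of columns (variables)
dim(mpg)             # both at once: rows first, then columns
```

The same three calls work on any data frame, so you can swap in season_summary once it is loaded.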

2. To access a column in the dataset, we can use $. For example, the chunk below selects the viewers_reunion column.

season_summary$viewers_reunion

Run summary(season_summary$viewers_reunion) and describe what you find.

3. Another handy function is unique(). Run unique() on the timeslot variable to see all the different times that Survivor has aired over the years.
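If it helps to see the pattern first, here is a minimal sketch of $, summary(), and unique() on the mpg dataset that ships with ggplot2. mpg and the object names below are stand-ins chosen for illustration, not part of the Survivor data.

```r
library(ggplot2)  # provides the mpg example dataset

hwy_values  <- mpg$hwy             # $ pulls one column out as a vector
hwy_summary <- summary(mpg$hwy)    # min, quartiles, mean, and max of a numeric column
drv_values  <- unique(mpg$drv)     # each distinct value, listed once

hwy_summary
drv_values
```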

Round 2: Scatterplots

First, let’s create a scatterplot of viewers_finale (the number of viewers for the last episode of the season) vs. viewers_premiere (the number of viewers for the first episode of the season).

A note on wording: when we say viewers_finale vs. viewers_premiere, this should be interpreted as “variable on the y-axis” vs. “variable on the x-axis”.

4. Fill in the data and aesthetic mapping in the code chunk below. What is returned? What’s missing?

# Fill in the blanks
ggplot(data = ___, mapping = aes(x = ___, y = ___))

5. Add the appropriate geometric object to create the scatterplot. This is called adding a layer to a plot. Remember to always put the + at the end of a line, never at the start.

# Copy your code from the previous chunk and add a geom

What do you notice? Write a sentence or two describing your findings.
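The idea of building a plot up in layers can be sketched on mpg (a stand-in dataset from ggplot2, not the Survivor data). The object names base and scatter are illustrative, and inspecting the layers element is just one informal way to see what + has added.

```r
library(ggplot2)

# Step 1: data and aesthetic mapping only
base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Step 2: add a point layer; the + ends the line so R knows more is coming
scatter <- base +
  geom_point()

length(base$layers)     # number of layers before adding a geom
length(scatter$layers)  # number of layers after adding a geom
scatter                 # printing the object draws the plot
```

If the + were moved to the start of the geom_point() line, R would treat the first line as a complete statement, which is why the + always goes at the end.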

6. You must remember to put the aesthetic mappings in the aes() function! What happens if you forget?

# Add a layer and see what happens
ggplot(data = ___, x = ___, y = ___)

7. The aesthetic mappings can be specified in the geom layer if you prefer, instead of the main ggplot() call. Give it a try:

# Rebuild the scatterplot with your aesthetic mapping in the geom layer
ggplot(data = ___)

Round 3: Additional Aesthetics

x and y are not the only aesthetic mappings possible. In this section you’ll explore the color, size, shape, and alpha (i.e. transparency) aesthetics.

8. Create a scatterplot of viewers_finale vs. viewers_premiere. Add the color aesthetic to map country2 to the point color.

ggplot(data = ___) +
  geom_point(aes(x = ___, y = ___, color = ___))

9. Create a scatterplot of viewers_finale vs. viewers_premiere. Use shape to represent country2. Is this plot easier or harder to interpret than the previous plot?

10. Create a scatterplot of viewers_finale vs. viewers_premiere. Use both shape and color to represent country2. Is this plot easier or harder to interpret than the previous two plots?

11. Create a scatterplot of viewers_finale vs. viewers_premiere. Use color to represent the season. What did you learn from the plot?

12. Create a scatterplot of viewers_finale vs. viewers_premiere. Use size to represent the season.

13. Look back at your scatterplots from the last few questions. Explain the differences when you map aesthetics to discrete and continuous variables.

14. Create a scatterplot of viewers_finale vs. viewers_premiere. Use alpha to represent the season.

Round 4: Visualizing Distributions

15. Build a histogram of viewers_mean using geom_histogram(). Don’t hesitate to look at the ggplot2 cheat sheet for help!

# Fill in the blanks
ggplot(___) +
  geom_histogram(aes(x = ___))

What have you learned about the distribution of average viewership?

16. By default, ggplot2 uses 30 bins. To change the number of bins to, say, 15, add the argument bins = 15 to geom_histogram(). Note: this is not an aesthetic mapping.

# Fill in the blanks
ggplot(___) +
  geom_histogram(aes(x = ___), bins = ___)
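One way to convince yourself that bins is an ordinary argument rather than an aesthetic mapping is to look at the bars ggplot2 computes. The sketch below uses mpg (a stand-in dataset from ggplot2) and layer_data(), which returns the data a layer draws; the names p15 and bars15 are illustrative.

```r
library(ggplot2)

p15 <- ggplot(mpg) +
  geom_histogram(aes(x = hwy), bins = 15)  # bins sits outside aes()

bars15 <- layer_data(p15)  # computed bars, one row per bin
nrow(bars15)               # how many bins were drawn
```

Here x = hwy is inside aes() because it points at a column of the data, while bins = 15 is a fixed setting for the whole layer.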

17. Instead of a histogram, let’s create a kernel density plot. To do this, substitute geom_density() into your code for question 15.

18. Now, let’s make side-by-side boxplots of viewers_mean for each country2.

# Fill in the blanks
ggplot(___) +
  geom_boxplot(aes(x = ___, y = ___))

19. A violin plot is a kernel density on its side, made symmetric. Change your code from question 18 to use geom_violin(). Which plot do you prefer, boxplots or violin plots? Why?

# Put your violin plot code here

Round 5: Bar and column charts + Labeling

How many seasons were filmed in each country? Let’s find out!

20. Make a bar chart of the number of seasons filmed in each country2 using geom_bar().

# Fill in the blanks
ggplot(___) +
  geom_bar(aes(x = ___))

21. country2 has a category called “other”. Make a bar chart of country (which includes all individual countries) instead.

# Fill in the blanks
ggplot(___) +
  geom_bar(aes(x = ___))

22. When you have lots of categories, it’s sometimes hard to read the labels on the x-axis. One trick is to flip the axes. Change geom_bar() to use the y aesthetic from your code in question 21 instead.

23. Sometimes, the value we want to show as the bar length already appears explicitly as a column in our data. Put n_tribes on the x-axis, and season_name on the y-axis. Use geom_col() (column) as the geom. How is this different from a bar plot?

# Fill in the blanks
ggplot(___) +
  geom_col(aes(x = ___, y = ___))
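For comparison, here is a small sketch of the two geoms on mpg (a stand-in dataset from ggplot2, not the Survivor data). It assumes dplyr is available for count(); class_counts, bar_plot, and col_plot are illustrative names.

```r
library(ggplot2)
library(dplyr)

# Raw data: one row per car, so the bar lengths are not a column yet
bar_plot <- ggplot(mpg) +
  geom_bar(aes(y = class))

# Summarized data: the bar lengths are already stored in the column n
class_counts <- count(mpg, class)
col_plot <- ggplot(class_counts) +
  geom_col(aes(x = n, y = class))

class_counts  # one row per class, with its count
bar_plot
col_plot
```

Compare the data going into each plot, and which aesthetics each geom needs, when you answer the question above.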

24. In ggplot2 you can add/change the title, subtitle, caption, and x- and y-axis labels by adding a labs() layer. Below is an example illustrating its use. Choose one graph from today and add all labels.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  labs(
    title = "Put your informative title here",
    subtitle = "and your subtitle here",
    x = "New x label",
    y = "New y label",
    caption = "Put a caption here"
  )

When you’re done: make sure to submit your .Rmd at the in-class activity submission Google Form.