Intro to ggplot2
Note: the setup chunk includes the line eval = FALSE. Make sure to delete this when you are ready to knit your file.
The data we’re using today contains information about all seasons of Survivor and comes from the survivoR R package. In the show, a group of people (called castaways) are placed in an isolated location, where they must provide food, fire, and shelter for themselves. The castaways compete in challenges for rewards and immunity from elimination. These challenges test physical abilities, like running and swimming, or mental abilities, like puzzles and endurance challenges. The castaways are progressively eliminated from the game as they are voted out by their fellow contestants until only two or three remain. At that point, the players who were eliminated (the “jury”) vote for the winner. The winner is given the title of “Sole Survivor” and is awarded the grand prize of $1,000,000.
season_summary = readr::read_csv("https://math.carleton.edu/aluby/stat220/survivor_season_summary.csv")
- season: the number of each season
- country: the country each season was filmed in
- winner: the eventual winner of the season (spoilers!!!)
- viewers_premiere: the number of viewers of the first episode of the season
- viewers_finale: the number of viewers for the last episode of the season
- imdb_mean: the average IMDB rating of the season
Speed Groupwork
Round 1: View the data and summary statistics
1. To get started, load the tidyverse and take a glimpse at the dataset. How many rows and columns are there? What does each row represent?
Loading the tidyverse loads its core packages (8 or 9, depending on your tidyverse version), one of which is ggplot2. You can certainly load each package individually, but it’s often easier to just load the whole thing.
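As a rough sketch of this step on a different dataset, the chunk below loads the tidyverse and glimpses mpg, the fuel economy data that ships with ggplot2. The choice of mpg is only for illustration; for the activity you would use season_summary instead.

```r
# Load the core tidyverse packages (ggplot2, dplyr, readr, ...)
library(tidyverse)

# glimpse() prints one line per column: its name, type, and first few values
glimpse(mpg)

# The dimensions are also available directly
n_rows <- nrow(mpg)  # number of rows
n_cols <- ncol(mpg)  # number of columns
```

The same three calls should work on any data frame or tibble.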
2. To access a column in the dataset, we can use $. For example, the chunk below selects the viewers_reunion column.
season_summary$viewers_reunion
Run summary(season_summary$viewers_reunion) and describe what you find.
3. Another handy function is unique(). Run unique() on the timeslot variable to see all the different times that Survivor has aired over the years.
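To see how $, summary(), and unique() behave together, here is a small sketch using mtcars, a dataset that ships with base R. The dataset and the object names mpg_summary and cyl_values are stand-ins chosen for illustration, not part of the Survivor data.

```r
# mtcars ships with base R, so no packages are needed here

# $ extracts a single column as a vector
mtcars$mpg

# summary() on a numeric column gives the min, quartiles, mean, and max
mpg_summary <- summary(mtcars$mpg)
mpg_summary

# unique() returns each distinct value once, in order of first appearance
cyl_values <- unique(mtcars$cyl)
cyl_values
```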
Round 2: Scatterplots
First, let’s create a scatterplot of viewers_finale (the number of viewers for the last episode of the season) vs. viewers_premiere (the number of viewers for the first episode of the season).
A note on wording: when we say viewers_finale vs. viewers_premiere, this should be interpreted as “variable on the y-axis” vs. “variable on the x-axis”.
4. Fill in the data and aesthetic mapping in the code chunk below. What is returned? What’s missing?
# Fill in the blanks
ggplot(data = ___, mapping = aes(x = ___, y = ___))
5. Add the appropriate geometric object (geom) to create the scatterplot. This is called adding a layer to a plot. Remember to always put the + at the end of a line, never at the start.
# Copy your code from the previous chunk and add a geom
What do you notice? Write a sentence or two describing your findings.
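The rule about where the + goes is easiest to see in a short example. The sketch below uses mtcars (from base R) rather than the Survivor data, so the column names wt and mpg are placeholders for illustration only.

```r
library(ggplot2)

# Ending the first line with + tells R that the plot continues on the next line.
# If the + started the second line instead, R would treat the first line as a
# finished expression, and the geom would not be added to the plot.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

p  # printing the object draws the plot
```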
6. You must remember to put the aesthetic mappings in the aes() function! What happens if you forget?
# Add a layer and see what happens
ggplot(data = ___, x = ___, y = ___)
7. The aesthetic mappings can be specified in the geom layer if you prefer, instead of the main ggplot() call. Give it a try:
# Rebuild the scatterplot with your aesthetic mapping in the geom layer
ggplot(data = ___)
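For comparison, the two placements of aes() can be sketched side by side. This again uses mtcars as a stand-in dataset, and the names p_global and p_local are made up for the example.

```r
library(ggplot2)

# Mapping set in ggplot(): inherited by every layer added afterwards
p_global <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Mapping set in the geom: applies to that layer only
p_local <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))
```

With a single layer the two plots look the same. The placement starts to matter once a plot has several geoms that should (or should not) share a mapping.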
Round 3: Additional Aesthetics
x and y are not the only aesthetic mappings possible. In this section you’ll explore the color, size, shape, and alpha (i.e. transparency) aesthetics.
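As a minimal sketch of how extra aesthetics are written, the chunk below maps color and size to columns of mtcars (a stand-in dataset, not the Survivor data) and sets a fixed alpha for every point.

```r
library(ggplot2)

# color and size are mapped to columns, so they go inside aes() like x and y.
# alpha = 0.7 sits outside aes(): it is a fixed setting applied to all points,
# not a mapping from a variable.
p <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg, color = factor(cyl), size = hp), alpha = 0.7)

p
```

Note that ggplot2 accepts both the color and colour spellings.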
8. Create a scatterplot of viewers_finale vs. viewers_premiere. Add the color aesthetic to map country2 to the point color.
ggplot(data = ___) +
  geom_point(aes(x = ___, y = ___, color = ___))
9. Create a scatterplot of viewers_finale vs. viewers_premiere. Use shape to represent country2. Is this plot easier or harder to interpret than the previous plot?
10. Create a scatterplot of viewers_finale vs. viewers_premiere. Use both shape and color to represent country2. Is this plot easier or harder to interpret than the previous two plots?
11. Create a scatterplot of viewers_finale vs. viewers_premiere. Use color to represent the season. What did you learn from the plot?
12. Create a scatterplot of viewers_finale vs. viewers_premiere. Use size to represent the season.
13. Look back at your scatterplots from the last few questions. Explain the differences when you map aesthetics to discrete and continuous variables.
14. Create a scatterplot of viewers_finale vs. viewers_premiere. Use alpha to represent the season.
Round 4: Visualizing Distributions
15. Build a histogram of viewers_mean using geom_histogram(). Don’t hesitate to look at the ggplot2 cheat sheet for help!
# Fill in the blanks
ggplot(___) +
  geom_histogram(aes(x = ___))
What have you learned about the distribution of average viewership?
16. By default, ggplot2 uses 30 bins. To change the number of bins, to say 15, add the argument bins = 15 to geom_histogram(). Note: this is not an aesthetic mapping.
# Fill in the blanks
ggplot(___) +
  geom_histogram(aes(x = ___), bins = ___)
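The distinction between an aesthetic mapping and a plain argument can be sketched as follows. The example uses mtcars and an arbitrary bin count of 10, both chosen purely for illustration.

```r
library(ggplot2)

# x is an aesthetic (mapped from a column), so it goes inside aes();
# bins is a fixed setting for the geom, so it goes outside aes()
p <- ggplot(mtcars) +
  geom_histogram(aes(x = mpg), bins = 10)

# layer_data() returns the computed bars, one row per bin
bars <- layer_data(p)
nrow(bars)
```

Inspecting the result of layer_data() is one way to check that the bin count took effect without counting bars by eye.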
17. Instead of a histogram, let’s create a kernel density plot. To do this, substitute geom_density() into your code for question 15.
18. Now, let’s make side-by-side boxplots of viewers_mean for each country2.
# Fill in the blanks
ggplot(___) +
  geom_boxplot(aes(x = ___, y = ___))
19. A violin plot is a kernel density on its side, made symmetric. Change your code from question 18 to use geom_violin(). Which plot do you prefer, boxplots or violin plots? Why?
# Put your violin plot code here
Round 5: Bar and column charts + Labeling
How many seasons were filmed in each country? Let’s find out!
20. Make a bar chart of the number of seasons filmed in each country2 using geom_bar().
# Fill in the blanks
ggplot(___) +
  geom_bar(aes(x = ___))
21. country2 has a category called “other”. Make a bar chart of country (which includes all individual countries) instead.
# Fill in the blanks
ggplot(___) +
  geom_bar(aes(x = ___))
22. When you have lots of categories, it’s sometimes hard to read the labels on the x-axis. One trick is to flip the axes. Change your code from question 21 so that geom_bar() uses the y aesthetic instead.
23. Sometimes, the value we want the bars to show already appears explicitly as a column in our data. Put n_tribes on the x-axis, and season_name on the y-axis. Use geom_col() (column) as the geom. How is this different from a bar plot?
# Fill in the blanks
ggplot(___) +
  geom_col(aes(x = ___, y = ___))
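One way to explore the two geoms side by side is with a tiny made-up dataset. Everything in the sketch below (the fruit_counts and fruit_raw data frames and their columns) is hypothetical and exists only for this illustration.

```r
library(ggplot2)

# Hypothetical toy data: one row per fruit, with the count already computed
fruit_counts <- data.frame(
  fruit = c("apple", "banana", "cherry"),
  n = c(5, 2, 8)
)

# geom_col() takes bar lengths straight from a column (here n)
p_col <- ggplot(fruit_counts) +
  geom_col(aes(x = n, y = fruit))

# The same information stored as one row per observation
fruit_raw <- data.frame(
  fruit = rep(fruit_counts$fruit, times = fruit_counts$n)
)

# geom_bar() counts the rows in each category itself
p_bar <- ggplot(fruit_raw) +
  geom_bar(aes(y = fruit))
```

Printing p_col and p_bar and comparing the inputs each one needed is a useful check on your answer.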
24. In ggplot2 you can add/change the title, subtitle, caption, and x- and y-axis labels by adding a labs() layer. Below is an example illustrating its use. Choose one graph from today and add all labels.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  labs(
    title = "Put your informative title here",
    subtitle = "and your subtitle here",
    x = "New x label",
    y = "New y label",
    caption = "Put a caption here"
  )
When you’re done: make sure to submit your .Rmd file using the in-class activity submission Google Form.