21-scraping
paths_allowed
library(robotstxt)
paths_allowed("http://www.zillow.com")
[1] TRUE
Box Office Mojo
page <- read_html("https://www.boxofficemojo.com/year/2024/")
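When the data on a page already sits in an HTML table, rvest's html_table() can often pull it out in one step. The sketch below shows the idea on a small made-up table built with minimal_html(), so it runs offline; the column names and values are hypothetical, and the real Box Office Mojo markup may need a more specific selector.

```r
library(rvest)

# Toy page standing in for a real one; the table contents are invented.
page_demo <- minimal_html('
  <table>
    <tr><th>Rank</th><th>Release</th><th>Gross</th></tr>
    <tr><td>1</td><td>Film A</td><td>$100</td></tr>
    <tr><td>2</td><td>Film B</td><td>$50</td></tr>
  </table>
')

# Grab the first <table> element and convert it to a data frame
box_demo <- page_demo %>%
  html_element("table") %>%
  html_table()

box_demo
```

Columns such as Gross come back as text because of the dollar sign, so they still need cleaning before analysis.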
Carleton Class Schedule
https://www.carleton.edu/catalog/current/search/?subject=STAT&term=25WI
View the page source and find the HTML elements (e.g. ‘h1’, ‘p’, ‘table’) that hold each of the following:
- Course number
- Course title
- Course description
- Course meetings
- Faculty
listings <- read_html("https://www.carleton.edu/catalog/current/search/?subject=STAT&term=25WI")
SelectorGadget
Use the SelectorGadget to explore http://www.imdb.com/chart/top
What should the columns of our target dataset be? Does each one correspond to a specific CSS selector?
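One way to think about the question above is "one selector per column". The sketch below illustrates that on a toy page built with minimal_html(); the class names title and year are made up for illustration and are not the selectors used on IMDb.

```r
library(rvest)

# Toy page; the class names "title" and "year" are hypothetical.
toy_page <- minimal_html('
  <ul>
    <li><span class="title">Movie A</span> <span class="year">1994</span></li>
    <li><span class="title">Movie B</span> <span class="year">1972</span></li>
  </ul>
')

# Each selector returns one character vector, which becomes one column
toy_titles <- toy_page %>% html_elements(".title") %>% html_text()
toy_years  <- toy_page %>% html_elements(".year") %>% html_text()

toy_movies <- data.frame(
  title = toy_titles,
  year  = as.numeric(toy_years)  # html_text() always returns character
)
toy_movies
```

This only lines up if every selector matches the same number of elements in the same order, which is worth checking with length() before combining the vectors.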
Scraping IMDb Movie Page
imdb <- read_html("http://www.imdb.com/chart/top")
_______ <- imdb %>%
  html_elements(".with-margin .ipc-title__text") %>%
  html_text()
_______ <- imdb %>%
  html_elements(".cli-title-metadata-item:nth-child(1)") %>%
  html_text()
_______ <- imdb %>%
  html_elements(".cli-title-metadata-item:nth-child(2)") %>%
  html_text()
_______ <- imdb %>%
  html_elements(".cli-title-metadata-item:nth-child(3)") %>%
  html_text()
imdb_top_250 <- tibble(
  _______
)
imdb_top_250
Your Turn: IMDb TV Shows Page
In an R script:
Scrape the names, scores, and years of the most popular TV shows on IMDb: www.imdb.com/chart/tvmeter
Create a data frame called tvshows with the variables: rank, title, stars, year, episodes, n_ratings
Wrangle your resulting data so that all variable types are imported correctly
Use write_csv to save your file. If time permits, read it into 21-scraping.rmd and make a graph.
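The wrangling and saving steps can be sketched as follows. The raw strings are made up to stand in for html_text() output, and the real IMDb text may be formatted differently. The object name tvshows_demo and the temporary file path are likewise hypothetical. The cleaning uses base R so that only readr is needed for the save and reload.

```r
library(readr)

# Made-up strings standing in for scraped text; not real IMDb data.
raw_title <- c("1. Show A", "2. Show B: Part 2", "3. Show C")
raw_stars <- c("8.7", "9.1", "7.4")
raw_year  <- c("2019", "2021", "2008")
raw_votes <- c("(1.2M)", "(350K)", "(980)")

# Split "1. Show A" into an integer rank and a clean title
rank  <- as.integer(sub("\\..*$", "", raw_title))
title <- sub("^[0-9]+\\. *", "", raw_title)

# Turn "(1.2M)" / "(350K)" / "(980)" into plain numbers
votes      <- gsub("[()]", "", raw_votes)
multiplier <- ifelse(grepl("M$", votes), 1e6,
                     ifelse(grepl("K$", votes), 1e3, 1))
n_ratings  <- as.numeric(sub("[KM]$", "", votes)) * multiplier

tvshows_demo <- data.frame(
  rank      = rank,
  title     = title,
  stars     = as.numeric(raw_stars),
  year      = as.integer(raw_year),
  n_ratings = n_ratings,
  stringsAsFactors = FALSE
)

# Save to a temporary file, then read it back to confirm the column types
csv_path <- file.path(tempdir(), "tvshows_demo.csv")
write_csv(tvshows_demo, csv_path)
tvshows_back <- read_csv(csv_path, show_col_types = FALSE)
tvshows_back
```

Reading the file back is a quick check that numeric columns were written as numbers rather than as text with leftover symbols.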