Web Scraping

Day 21

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Ways to access data from the web:

  1. Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.

  2. Screen scraping: extract data from the source code of the website, with an HTML parser (easy) or regular expression matching (less easy).
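
An API response usually arrives as JSON text that still has to be turned into a data frame. A minimal sketch, assuming the {jsonlite} package is installed; the payload below is made up for illustration and does not come from a real API:

```r
library(jsonlite)

# Hypothetical JSON text, standing in for the body of an API response
json_txt <- '[
  {"title": "Film A", "year": 2001},
  {"title": "Film B", "year": 2014}
]'

# fromJSON() simplifies an array of records into a data frame
films <- fromJSON(json_txt)
films
```

With a real API, the same call could be pointed at the request URL instead of a string, subject to that API's terms.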

Check the terms of use/service first!

  • Can you query this webpage?

  • Are there restrictions on the use of the data?

  • How many requests can you make per minute?

  • …and more…
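
One way to respect a per-minute request limit is to pause between requests. A minimal sketch in base R: fetch_politely() is a hypothetical helper (not part of any package), and the one-second default delay is an assumption, not a rule from any particular site.

```r
# Hypothetical helper: apply `fetch` to each URL, pausing between requests
fetch_politely <- function(urls, fetch, delay_sec = 1) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay_sec)  # wait before every request after the first
    pages[[i]] <- fetch(urls[[i]])
  }
  names(pages) <- urls
  pages
}

# Dry run with a stand-in fetch function, so nothing is downloaded
demo <- fetch_politely(c("page-1", "page-2"), fetch = toupper, delay_sec = 0)
demo
```

With real pages, fetch would be something like rvest::read_html, and delay_sec would be chosen from the site's stated limits.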

Checking for permission to scrape

Use robotstxt::paths_allowed() to see if you can scrape the web page.

You can scrape Zillow

library(robotstxt)
paths_allowed("http://www.zillow.com")
[1] TRUE

But not Facebook

paths_allowed("http://www.facebook.com")
[1] FALSE

What websites have data about you? Think of 1-2 and see if scraping is allowed on those sites.

Hypertext Markup Language

  • Lots of data on the web is still available as HTML

  • It is structured (hierarchical / tree based), but it’s often not available in a form useful for analysis (flat / tidy).

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
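
The tree structure can be explored directly in R. A small sketch that parses the document above from a character string, so no web request is needed, and then walks down to individual pieces:

```r
library(rvest)

# The same small document as above, passed to read_html() as a string
doc <- read_html('
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>')

# Text inside the <title> element
title_text <- doc |> html_element("title") |> html_text()

# Names of the elements nested directly inside <body>
body_children <- doc |> html_element("body") |> html_children() |> html_name()

# The paragraph's text and the value of its align attribute
para <- doc |> html_element("p")
para_text <- html_text(para)
para_align <- html_attr(para, "align")
```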

HTML tags

HTML uses tags to describe different aspects of document content

Tag                      Example
heading                  <h1>My Title</h1>
paragraph                <p>A paragraph of content...</p>
table                    <table> ... </table>
anchor (with attribute)  <a href="http://www.mysite.net">click here for link</a>

{rvest}

  • Pronounced like “harvest”

  • Processing and manipulation of HTML data

  • Installed with the {tidyverse} but not loaded automatically

library(tidyverse)  # provides glimpse(), parse_number(), str_squish(), tibble()
library(rvest)

Core rvest functions

Function        Description
read_html       Read HTML data from a URL or character string
html_element    Select the first matching element from an HTML document
html_elements   Select all matching elements from an HTML document
html_table      Parse an HTML table into a data frame
html_text       Extract the text content of each element
html_name       Extract each element’s tag name
html_attrs      Extract all of each element’s attributes
html_attr       Extract the value of one attribute, by name
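
A quick way to get a feel for these functions is to run them on a tiny document. A sketch using rvest's minimal_html() to build a page from the tag examples shown earlier (the page itself is a toy, not a real site):

```r
library(rvest)

# A toy page built from the tag examples above
snippet <- minimal_html('
  <h1>My Title</h1>
  <p>A paragraph of content...</p>
  <a href="http://www.mysite.net">click here for link</a>
')

# Select several kinds of elements at once (returned in document order)
nodes <- snippet |> html_elements("h1, p, a")
tag_names <- html_name(nodes)   # the tag name of each element
tag_text  <- html_text(nodes)   # the text inside each element

# Attributes of the single <a> element
link      <- snippet |> html_element("a")
link_href <- html_attr(link, "href")   # one attribute, by name
link_all  <- html_attrs(link)          # all attributes, as a named vector
```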

Example: box office mojo

https://www.boxofficemojo.com/year/2024/

  • Take a look at the web page and the html source code

    Chrome or Firefox: right click -> View page source

  • Look for the <table> tag, or a div whose id or class mentions "table"

Read HTML into R

page <- read_html("https://www.boxofficemojo.com/year/2024/")
page
{html_document}
<html class="a-no-js" data-19ax5a9jf="dingo">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body id="body" class="mojo-page-id-yld a-m-us a-aui_72554-c a-aui_a11y_6 ...
str(page)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

HTML elements

There are over 100 HTML elements:

  • Every HTML page must be in an <html> element, and it must have two children: <head> and <body>
  • Block tags like <h1>, <p>, <ol> form the structure of the page
  • Inline tags like <b>, <i>, and <a> format text inside block tags

We’ll often work with tables. HTML tables are composed of four main elements: <table>, <tr> (table row), <th> (table heading), and <td> (table data).
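
To see how these four elements map onto rows and columns, here is a sketch that parses a toy table. The film names and amounts are invented for illustration:

```r
library(rvest)

# A toy table: one header row (<th>) and two data rows (<td>)
toy <- minimal_html('
<table>
  <tr><th>Film</th><th>Gross</th></tr>
  <tr><td>Film A</td><td>$1,200</td></tr>
  <tr><td>Film B</td><td>$950</td></tr>
</table>')

# Each <tr> becomes a row; the <th> cells become column names
toy_table <- toy |> html_element("table") |> html_table()
toy_table
```

Note that Gross comes back as character because of the "$" and ",", which is why scraped tables usually need a wrangling step.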

Extract tables

Use html_element() or html_elements() to extract pieces out of HTML documents

tables <- page %>% html_elements("table")
str(tables)
List of 1
 $ :List of 2
  ..$ node:<externalptr> 
  ..$ doc :<externalptr> 
  ..- attr(*, "class")= chr "xml_node"
 - attr(*, "class")= chr "xml_nodeset"

Parse a table into a data frame/tibble

top2024 <- html_table(tables[[1]])
glimpse(top2024)
Rows: 200
Columns: 11
$ Rank           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ Release        <chr> "Inside Out 2", "Deadpool & Wolverine", "Wicked", "Moan…
$ Genre          <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ Budget         <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ `Running Time` <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ Gross          <chr> "$652,980,194", "$636,745,858", "$432,943,285", "$404,0…
$ Theaters       <chr> "4,440", "4,330", "3,888", "4,200", "4,449", "4,575", "…
$ `Total Gross`  <chr> "$652,980,194", "$636,745,858", "$473,231,120", "$460,4…
$ `Release Date` <chr> "Jun 14", "Jul 26", "Nov 22", "Nov 27", "Jul 3", "Sep 6…
$ Distributor    <chr> "Walt Disney Studios Motion Pictures", "Walt Disney Stu…
$ Estimated      <chr> "false", "false", "false", "false", "false", "false", "…

Scrape then wrangle

top2024 <- top2024 %>%
  mutate(
    Gross = parse_number(Gross),
    Theaters = parse_number(Theaters),
    `Total Gross` = parse_number(`Total Gross`)
  ) %>%
  separate(`Release Date`, into = c("Month", "Day"))

glimpse(top2024)
Rows: 200
Columns: 12
$ Rank           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ Release        <chr> "Inside Out 2", "Deadpool & Wolverine", "Wicked", "Moan…
$ Genre          <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ Budget         <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ `Running Time` <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ Gross          <dbl> 652980194, 636745858, 432943285, 404017489, 361004205, …
$ Theaters       <dbl> 4440, 4330, 3888, 4200, 4449, 4575, 4074, 4170, 3948, 4…
$ `Total Gross`  <dbl> 652980194, 636745858, 473231120, 460405297, 361004205, …
$ Month          <chr> "Jun", "Jul", "Nov", "Nov", "Jul", "Sep", "Mar", "Jul",…
$ Day            <chr> "14", "26", "22", "27", "3", "6", "1", "19", "29", "8",…
$ Distributor    <chr> "Walt Disney Studios Motion Pictures", "Walt Disney Stu…
$ Estimated      <chr> "false", "false", "false", "false", "false", "false", "…

Scraped data will almost always need wrangling/cleaning

  • Are numeric columns numeric?
  • Are date columns dates?
  • Are factor and string columns treated correctly?
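
A sketch of how that checklist might be applied, using three rows copied from the output above. Two assumptions to flag: "-" is treated as a missing value, and every release date is assumed to fall in 2024, which the page itself does not guarantee.

```r
library(dplyr)
library(readr)

# Three rows copied from the scraped table above
raw <- tibble(
  Release = c("Inside Out 2", "Deadpool & Wolverine", "Wicked"),
  Genre = c("-", "-", "-"),
  Gross = c("$652,980,194", "$636,745,858", "$432,943,285"),
  `Release Date` = c("Jun 14", "Jul 26", "Nov 22")
)

clean <- raw |>
  mutate(
    Genre = na_if(Genre, "-"),       # assumption: "-" means missing
    Gross = parse_number(Gross),     # drop "$" and "," then convert
    month = match(substr(`Release Date`, 1, 3), month.abb),
    day   = as.integer(sub("^[A-Za-z]+ ", "", `Release Date`)),
    # assumption: the year is 2024 for every row
    release_date = as.Date(sprintf("2024-%02d-%02d", month, day))
  ) |>
  select(Release, Genre, Gross, release_date)

glimpse(clean)
```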

Data aren’t always stored as tables

https://www.carleton.edu/catalog/current/search/?subject=STAT&term=25WI

Where is the data stored?

View the page source to try to find the html elements where this data is located (e.g. ‘h1’, ‘p’, ‘table’)

  • Course number
  • Course title
  • Course description
  • Course meetings
  • Faculty

listings <- read_html("https://www.carleton.edu/catalog/current/search/?subject=STAT&term=25WI")
listings |>
  html_elements("h3")
{xml_nodeset (22)}
 [1] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
 [2] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
 [3] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
 [4] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
 [5] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
 [6] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
 [7] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
 [8] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
 [9] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[10] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[11] <h3 class="courseSearchResultsHeading relatedCourses" id="relatedCourses ...
[12] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[13] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[14] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[15] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[16] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[17] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[18] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[19] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
[20] <h3 class="courseTitleBar">\n            <span class="courseNumber" data ...
...

listings |>
  html_elements("h3") |>
  html_text()
 [1] "\n            STAT 120\n            Introduction to Statistics\n            \n                            6 credits\n                        \n        "                                   
 [2] "\n            STAT 220\n            Introduction to Data Science\n            \n                            6 credits\n                        \n        "                                 
 [3] "\n            STAT 230\n            Applied Regression Analysis\n            \n                            6 credits\n                        \n        "                                  
 [4] "\n            STAT 250\n            Introduction to Statistical Inference\n            \n                            6 credits\n                        \n        "                        
 [5] "\n            STAT 285\n            Statistical Consulting\n            \n                            2 credits\n                        \n        "                                       
 [6] "\n            STAT 297\n            Assessment and Communication of External Statistical Activity\n            \n                            1 credits\n                        \n        "
 [7] "\n            STAT 330\n            Advanced Statistical Modeling\n            \n                            6 credits\n                        \n        "                                
 [8] "\n            STAT 394\n            Directed Research in Statistics\n            \n                            1 – 6 credits\n                        \n        "                          
 [9] "\n            STAT 399\n            Senior Seminar\n            \n                            6 credits\n                        \n        "                                               
[10] "\n            STAT 400\n            Integrative Exercise\n            \n                            3 – 6 credits\n                        \n        "                                     
[11] "Related Courses"                                                                                                                                                                           
[12] "\n            CS 111\n            Introduction to Computer Science\n            \n                            6 credits\n                        \n        "                               
[13] "\n            CS 314\n            Data Visualization\n            \n                            6 credits\n                        \n        "                                             
[14] "\n            MATH 101\n            Calculus with Problem Solving\n            \n                            6 credits\n                        \n        "                                
[15] "\n            MATH 111\n            Introduction to Calculus\n            \n                            6 credits\n                        \n        "                                     
[16] "\n            MATH 120\n            Calculus 2\n            \n                            6 credits\n                        \n        "                                                   
[17] "\n            MATH 210\n            Calculus 3\n            \n                            6 credits\n                        \n        "                                                   
[18] "\n            MATH 211\n            Introduction to Multivariable Calculus\n            \n                            6 credits\n                        \n        "                       
[19] "\n            MATH 232\n            Linear Algebra\n            \n                            6 credits\n                        \n        "                                               
[20] "\n            MATH 240\n            Probability\n            \n                            6 credits\n                        \n        "                                                  
[21] "Liberal Arts Requirements"                                                                                                                                                                 
[22] "Other Course Tags"                                                                                                                                                                         

listings |>
  html_elements("h3") |>
  html_text() |> 
  str_squish()
 [1] "STAT 120 Introduction to Statistics 6 credits"                                   
 [2] "STAT 220 Introduction to Data Science 6 credits"                                 
 [3] "STAT 230 Applied Regression Analysis 6 credits"                                  
 [4] "STAT 250 Introduction to Statistical Inference 6 credits"                        
 [5] "STAT 285 Statistical Consulting 2 credits"                                       
 [6] "STAT 297 Assessment and Communication of External Statistical Activity 1 credits"
 [7] "STAT 330 Advanced Statistical Modeling 6 credits"                                
 [8] "STAT 394 Directed Research in Statistics 1 – 6 credits"                          
 [9] "STAT 399 Senior Seminar 6 credits"                                               
[10] "STAT 400 Integrative Exercise 3 – 6 credits"                                     
[11] "Related Courses"                                                                 
[12] "CS 111 Introduction to Computer Science 6 credits"                               
[13] "CS 314 Data Visualization 6 credits"                                             
[14] "MATH 101 Calculus with Problem Solving 6 credits"                                
[15] "MATH 111 Introduction to Calculus 6 credits"                                     
[16] "MATH 120 Calculus 2 6 credits"                                                   
[17] "MATH 210 Calculus 3 6 credits"                                                   
[18] "MATH 211 Introduction to Multivariable Calculus 6 credits"                       
[19] "MATH 232 Linear Algebra 6 credits"                                               
[20] "MATH 240 Probability 6 credits"                                                  
[21] "Liberal Arts Requirements"                                                       
[22] "Other Course Tags"                                                               

CSS selectors

Selecting courseNumber class

Course numbers are between <span class="courseNumber"> ... </span> tags

Elements with a given class can be selected with a CSS class selector: a . followed by the name of the class (here, .courseNumber)

listings %>% 
  html_elements(".courseNumber")
{xml_nodeset (19)}
 [1] <span class="courseNumber" data-terms="25/WI">STAT 120</span>
 [2] <span class="courseNumber" data-terms="25/WI">STAT 220</span>
 [3] <span class="courseNumber" data-terms="25/WI">STAT 230</span>
 [4] <span class="courseNumber" data-terms="25/WI">STAT 250</span>
 [5] <span class="courseNumber" data-terms="25/WI">STAT 285</span>
 [6] <span class="courseNumber" data-terms="25/WI">STAT 297</span>
 [7] <span class="courseNumber" data-terms="25/WI">STAT 330</span>
 [8] <span class="courseNumber" data-terms="24/FA 25/WI 25/SP">STAT 394</span>
 [9] <span class="courseNumber" data-terms="25/WI">STAT 399</span>
[10] <span class="courseNumber" data-terms="25/WI">STAT 400</span>
[11] <span class="courseNumber" data-terms="25/WI">CS 111</span>
[12] <span class="courseNumber" data-terms="25/WI">CS 314</span>
[13] <span class="courseNumber" data-terms="25/WI">MATH 101</span>
[14] <span class="courseNumber" data-terms="25/WI">MATH 111</span>
[15] <span class="courseNumber" data-terms="25/WI">MATH 120</span>
[16] <span class="courseNumber" data-terms="25/WI">MATH 210</span>
[17] <span class="courseNumber" data-terms="25/WI">MATH 211</span>
[18] <span class="courseNumber" data-terms="25/WI">MATH 232</span>
[19] <span class="courseNumber" data-terms="25/WI">MATH 240</span>

Scraping courseNumbers

listings %>% 
  html_elements(".courseNumber") %>%
  html_text()
 [1] "STAT 120" "STAT 220" "STAT 230" "STAT 250" "STAT 285" "STAT 297"
 [7] "STAT 330" "STAT 394" "STAT 399" "STAT 400" "CS 111"   "CS 314"  
[13] "MATH 101" "MATH 111" "MATH 120" "MATH 210" "MATH 211" "MATH 232"
[19] "MATH 240"
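
Class selectors are one of several kinds of CSS selector. A sketch comparing a few of them on a simplified stand-in for the catalog markup; only the class names and data-terms values are taken from the output above, and the rest of the structure is assumed:

```r
library(rvest)

# Simplified stand-in for the catalog page (structure is an assumption)
toy <- minimal_html('
<h3 class="courseTitleBar">
  <span class="courseNumber" data-terms="25/WI">STAT 120</span>
  <span class="courseTitle">Introduction to Statistics</span>
</h3>
<h3 class="courseTitleBar">
  <span class="courseNumber" data-terms="24/FA 25/WI 25/SP">STAT 394</span>
  <span class="courseTitle">Directed Research in Statistics</span>
</h3>
<h3 id="relatedCourses">Related Courses</h3>
')

n_h3    <- toy |> html_elements("h3") |> length()                # by tag
numbers <- toy |> html_elements(".courseNumber") |> html_text()  # by class
related <- toy |> html_elements("#relatedCourses") |> html_text()  # by id

# Descendant selector: .courseTitle elements inside an h3.courseTitleBar
titles <- toy |> html_elements("h3.courseTitleBar .courseTitle") |> html_text()

# Attributes hold data too: the terms in which each course is offered
terms <- toy |> html_elements(".courseNumber") |> html_attr("data-terms")
```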

Scraping credits

listings %>% 
  html_elements(".credits") %>%
  html_text()
 [1] "\n                            6 credits\n                        "    
 [2] "\n                            6 credits\n                        "    
 [3] "\n                            6 credits\n                        "    
 [4] "\n                            6 credits\n                        "    
 [5] "\n                            2 credits\n                        "    
 [6] "\n                            1 credits\n                        "    
 [7] "\n                            6 credits\n                        "    
 [8] "\n                            1 – 6 credits\n                        "
 [9] "\n                            6 credits\n                        "    
[10] "\n                            3 – 6 credits\n                        "
[11] "\n                            6 credits\n                        "    
[12] "\n                            6 credits\n                        "    
[13] "\n                            6 credits\n                        "    
[14] "\n                            6 credits\n                        "    
[15] "\n                            6 credits\n                        "    
[16] "\n                            6 credits\n                        "    
[17] "\n                            6 credits\n                        "    
[18] "\n                            6 credits\n                        "    
[19] "\n                            6 credits\n                        "    

Scraping credits, cleaned with str_squish()

listings %>% 
  html_elements(".credits") %>%
  html_text() %>%
  str_squish()
 [1] "6 credits"     "6 credits"     "6 credits"     "6 credits"    
 [5] "2 credits"     "1 credits"     "6 credits"     "1 – 6 credits"
 [9] "6 credits"     "3 – 6 credits" "6 credits"     "6 credits"    
[13] "6 credits"     "6 credits"     "6 credits"     "6 credits"    
[17] "6 credits"     "6 credits"     "6 credits"    

stat_winter2025 <- tibble(
  course = listings %>% html_elements(".courseNumber") %>% html_text(),
  title = listings %>% html_elements(".courseTitle") %>% html_text(),
  credits = listings %>% html_elements(".credits") %>% html_text() %>% str_squish(),
  description = listings %>% html_elements(".courseDetailWrapper") %>% html_text() %>% str_squish()
)

stat_winter2025
# A tibble: 19 × 4
   course   title                                            credits description
   <chr>    <chr>                                            <chr>   <chr>      
 1 STAT 120 Introduction to Statistics                       6 cred… Introducti…
 2 STAT 220 Introduction to Data Science                     6 cred… This cours…
 3 STAT 230 Applied Regression Analysis                      6 cred… A second c…
 4 STAT 250 Introduction to Statistical Inference            6 cred… Introducti…
 5 STAT 285 Statistical Consulting                           2 cred… Students w…
 6 STAT 297 Assessment and Communication of External Statis… 1 cred… An indepen…
 7 STAT 330 Advanced Statistical Modeling                    6 cred… Topics inc…
 8 STAT 394 Directed Research in Statistics                  1 – 6 … Spatial pr…
 9 STAT 399 Senior Seminar                                   6 cred… As part of…
10 STAT 400 Integrative Exercise                             3 – 6 … Either a s…
11 CS 111   Introduction to Computer Science                 6 cred… This cours…
12 CS 314   Data Visualization                               6 cred… Understand…
13 MATH 101 Calculus with Problem Solving                    6 cred… An introdu…
14 MATH 111 Introduction to Calculus                         6 cred… An introdu…
15 MATH 120 Calculus 2                                       6 cred… Inverse fu…
16 MATH 210 Calculus 3                                       6 cred… Vectors, c…
17 MATH 211 Introduction to Multivariable Calculus           6 cred… Vectors, c…
18 MATH 232 Linear Algebra                                   6 cred… Linear alg…
19 MATH 240 Probability                                      6 cred… Introducti…

What about sections?

listings %>% 
  html_elements(".course-section") %>%
  html_element(".courseSectionNumber") %>% 
  html_text() %>% 
  str_squish()
 [1] "STAT 120.01 Winter 2025" "STAT 120.02 Winter 2025"
 [3] "STAT 120.03 Winter 2025" "STAT 220.00 Winter 2025"
 [5] "STAT 230.00 Winter 2025" "STAT 250.00 Winter 2025"
 [7] "STAT 285.00 Winter 2025" "STAT 297.00 Winter 2025"
 [9] "STAT 330.00 Winter 2025" "STAT 394.11 Winter 2025"
[11] "STAT 394.12 Winter 2025" "STAT 399.00 Winter 2025"
[13] "STAT 400.01 Winter 2025" "STAT 400.02 Winter 2025"
[15] "STAT 400.03 Winter 2025" "CS 111.01 Winter 2025"  
[17] "CS 111.02 Winter 2025"   "CS 314.00 Winter 2025"  
[19] "MATH 101.00 Winter 2025" "MATH 111.00 Winter 2025"
[21] "MATH 120.01 Winter 2025" "MATH 120.02 Winter 2025"
[23] "MATH 120.03 Winter 2025" "MATH 210.01 Winter 2025"
[25] "MATH 210.02 Winter 2025" "MATH 211.00 Winter 2025"
[27] "MATH 232.01 Winter 2025" "MATH 232.02 Winter 2025"
[29] "MATH 240.00 Winter 2025" "MATH 240.02 Winter 2025"
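
The html_elements() then html_element() pattern used above matters when some items are missing a field. html_element() returns one result per parent, with NA where nothing matches, so columns stay aligned. A sketch on a toy page; the .course container class is hypothetical, not taken from the catalog:

```r
library(rvest)
library(dplyr)

# Toy page: the second course has no credits element
toy <- minimal_html('
<div class="course">
  <span class="courseNumber">STAT 120</span>
  <span class="credits">6 credits</span>
</div>
<div class="course">
  <span class="courseNumber">STAT 297</span>
</div>')

# Selecting from the whole page silently drops the missing value
n_credits_found <- toy |> html_elements(".credits") |> length()

# Selecting within each course keeps one entry per course
courses <- toy |> html_elements(".course")
by_course <- tibble(
  course  = courses |> html_element(".courseNumber") |> html_text2(),
  credits = courses |> html_element(".credits") |> html_text2()
)
by_course
```

html_text2() is used here because it trims surrounding whitespace, similar in effect to html_text() followed by str_squish().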

Sometimes, we can’t get around regex :(

listings %>% 
  html_elements(".classMeetings") %>% 
  html_text() %>% 
  str_squish()
 [1] "STAT 120.01 Winter 2025 Faculty:Claire Kelling 🏫 👤 Size:32 M, WCMC 102 11:10am-12:20pm FCMC 102 12:00pm-1:00pm Sophomore Priority; Not open to students who have already received credit for Psychology 200/201, Sociology/Anthropology 239 or Statistics 250 Sophomore Priority."
 [2] "STAT 120.02 Winter 2025 Faculty:Spencer Wadsworth 🏫 👤 Size:32 M, WCMC 102 12:30pm-1:40pm FCMC 102 1:10pm-2:10pm"                                                                                                                                                                  
 [3] "STAT 120.03 Winter 2025 Faculty:Rebecca Terry 🏫 👤 Size:32 M, WCMC 102 1:50pm-3:00pm FCMC 102 2:20pm-3:20pm"                                                                                                                                                                       
 [4] "STAT 220.00 Winter 2025 Faculty:Amanda Luby 🏫 👤 Size:30 M, WCMC 102 9:50am-11:00am FCMC 102 9:40am-10:40am"                                                                                                                                                                       
 [5] "STAT 230.00 Winter 2025 Faculty:Claire Kelling 🏫 👤 Size:28 M, WCMC 306 1:50pm-3:00pm FCMC 306 2:20pm-3:20pm"                                                                                                                                                                      
 [6] "STAT 250.00 Winter 2025 Faculty:Adam Loy 🏫 👤 Size:28 M, WCMC 301 1:50pm-3:00pm FCMC 301 2:20pm-3:20pm"                                                                                                                                                                            
 [7] "STAT 285.00 Winter 2025 Faculty:Katie St. Clair 🏫 👤 Grading:S/CR/NC TCMC 304 10:10am-11:55am"                                                                                                                                                                                     
 [8] "STAT 297.00 Winter 2025 Faculty:Katie St. Clair 🏫 👤 · Claire Kelling 🏫 👤 Grading:S/CR/NC"                                                                                                                                                                                       
 [9] "STAT 330.00 Winter 2025 Faculty:Katie St. Clair 🏫 👤 Size:20 M, WCMC 306 9:50am-11:00am FCMC 306 9:40am-10:40am"                                                                                                                                                                   
[10] "STAT 394.11 Winter 2025 Faculty:Claire Kelling 🏫 👤 Grading:S/CR/NC Credits:2"                                                                                                                                                                                                     
[11] "STAT 394.12 Winter 2025 Faculty:Claire Kelling 🏫 👤 Grading:S/CR/NC"                                                                                                                                                                                                               
[12] "STAT 399.00 Winter 2025 Faculty:Amanda Luby 🏫 👤 Size:4 Grading:S/CR/NC THCMC 328 10:10am-11:55am"                                                                                                                                                                                 
[13] "STAT 400.01 Winter 2025 Faculty:Katie St. Clair 🏫 👤 Size:10 Grading:S/NC Credits:6"                                                                                                                                                                                               
[14] "STAT 400.02 Winter 2025 Faculty:Adam Loy 🏫 👤 Size:12 Grading:S/NC Credits:3 T, THCMC 304 1:15pm-3:00pm"                                                                                                                                                                           
[15] "STAT 400.03 Winter 2025 Faculty:Katie St. Clair 🏫 👤 Size:8 Grading:S/NC Credits:3 TCMC 328 1:15pm-3:00pm"                                                                                                                                                                         
[16] "CS 111.01 Winter 2025 Faculty:Tom Finzell 🏫 👤 Size:38 M, WOlin 310 8:30am-9:40am FOlin 310 8:30am-9:30am"                                                                                                                                                                         
[17] "CS 111.02 Winter 2025 Faculty:Tom Finzell 🏫 👤 Size:38 M, WOlin 310 11:10am-12:20pm FOlin 310 12:00pm-1:00pm Sophomore Priority Sophomore Priority."                                                                                                                               
[18] "CS 314.00 Winter 2025 Faculty:Bridger Herman 🏫 👤 Size:34 M, WLeighton 304 12:30pm-1:40pm FLeighton 304 1:10pm-2:10pm"                                                                                                                                                             
[19] "MATH 101.00 Winter 2025 Faculty:Deanna Haunsperger 🏫 👤 Size:30 M, WCMC 209 9:50am-11:00am FCMC 209 9:40am-10:40am"                                                                                                                                                                
[20] "MATH 111.00 Winter 2025 Faculty:Rob Thompson 🏫 👤 Size:30 M, WCMC 210 11:10am-12:20pm FCMC 210 12:00pm-1:00pm"                                                                                                                                                                     
[21] "MATH 120.01 Winter 2025 Faculty:Rebecca Terry 🏫 👤 Size:30 M, WCMC 301 11:10am-12:20pm FCMC 301 12:00pm-1:00pm"                                                                                                                                                                    
[22] "MATH 120.02 Winter 2025 Faculty:Corey Brooke 🏫 👤 Size:30 M, WCMC 210 12:30pm-1:40pm FCMC 210 1:10pm-2:10pm"                                                                                                                                                                       
[23] "MATH 120.03 Winter 2025 Faculty:Mike Adams 🏫 👤 Size:30 M, WCMC 209 1:50pm-3:00pm FCMC 209 2:20pm-3:20pm"                                                                                                                                                                          
[24] "MATH 210.01 Winter 2025 Faculty:Corey Brooke 🏫 👤 Size:30 M, WCMC 210 9:50am-11:00am FCMC 210 9:40am-10:40am"                                                                                                                                                                      
[25] "MATH 210.02 Winter 2025 Faculty:Caroline Turnage-Butterbaugh 🏫 👤 Size:30 M, WCMC 209 11:10am-12:20pm FCMC 209 12:00pm-1:00pm"                                                                                                                                                     
[26] "MATH 211.00 Winter 2025 Faculty:Kate Meyer 🏫 👤 Size:30 M, WCMC 206 8:30am-9:40am FCMC 206 8:30am-9:30am"                                                                                                                                                                          
[27] "MATH 232.01 Winter 2025 Faculty:Rafe Jones 🏫 👤 Size:30 M, WCMC 206 11:10am-12:20pm FCMC 206 12:00pm-1:00pm"                                                                                                                                                                       
[28] "MATH 232.02 Winter 2025 Faculty:MurphyKate Montee 🏫 👤 Size:30 M, WCMC 209 12:30pm-1:40pm FCMC 209 1:10pm-2:10pm"                                                                                                                                                                  
[29] "MATH 240.00 Winter 2025 Faculty:Adam Loy 🏫 👤 Size:30 M, WCMC 306 12:30pm-1:40pm FCMC 306 1:10pm-2:10pm"                                                                                                                                                                           
[30] "MATH 240.02 Winter 2025 Faculty:Rob Thompson 🏫 👤 Size:30 M, WCMC 301 12:30pm-1:40pm FCMC 301 1:10pm-2:10pm"                                                                                                                                                                       
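
A sketch of what that regex step might look like with {stringr}, using two of the strings above. The patterns are one possible choice and are not guaranteed to cover every listing (for example, only the first faculty name is captured when there are several):

```r
library(stringr)
library(dplyr)

# Two strings copied from the output above
meetings <- c(
  "STAT 220.00 Winter 2025 Faculty:Amanda Luby 🏫 👤 Size:30 M, WCMC 102 9:50am-11:00am FCMC 102 9:40am-10:40am",
  "STAT 285.00 Winter 2025 Faculty:Katie St. Clair 🏫 👤 Grading:S/CR/NC TCMC 304 10:10am-11:55am"
)

sections <- tibble(
  # Department, course number, and section at the start of the string
  section = str_extract(meetings, "^[A-Z]+ \\d+\\.\\d+"),
  # Letters, spaces, periods, apostrophes, and hyphens after "Faculty:"
  faculty = str_trim(str_extract(meetings, "(?<=Faculty:)[A-Za-z.' -]+")),
  # Digits after "Size:" (NA when no size is listed)
  size = as.integer(str_extract(meetings, "(?<=Size:)\\d+")),
  # First time range of the form h:mmam-h:mmpm
  first_time = str_extract(meetings, "\\d{1,2}:\\d{2}[ap]m-\\d{1,2}:\\d{2}[ap]m")
)
sections
```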

SelectorGadget

  • Click on the app logo next to the search bar
  • A box will open in the bottom right of the website
  • Click on a page element (it will turn green). SelectorGadget will generate a minimal CSS selector for that element and highlight (in yellow) everything else that the selector matches
  • Click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector

Try it

  • Use SelectorGadget to explore http://www.imdb.com/chart/top

  • What should the columns of our target dataset be? Do they correspond to any specific CSS selectors?

Extract title

imdb <- read_html("http://www.imdb.com/chart/top")
titles <- imdb %>%
  html_elements(".with-margin .ipc-title__text") %>%
  html_text()

head(titles)
[1] "1. The Shawshank Redemption"                     
[2] "2. The Godfather"                                
[3] "3. The Dark Knight"                              
[4] "4. The Godfather Part II"                        
[5] "5. 12 Angry Men"                                 
[6] "6. The Lord of the Rings: The Return of the King"

Extract year

years <- imdb %>%
  html_elements(".cli-title-metadata-item:nth-child(1)") %>%
  html_text()

head(years)
[1] "1994" "1972" "2008" "1974" "1957" "2003"

Extract runtime

runtimes <- imdb %>%
  html_elements(".cli-title-metadata-item:nth-child(2)") %>%
  html_text()

head(runtimes)
[1] "2h 22m" "2h 55m" "2h 32m" "3h 22m" "1h 36m" "3h 21m"

Extract MPAA rating

mpaas <- imdb %>%
  html_elements(".cli-title-metadata-item:nth-child(3)") %>%
  html_text()

head(mpaas)
[1] "R"        "R"        "PG-13"    "R"        "Approved" "PG-13"   
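
The :nth-child(n) part of these selectors keeps an element only if it is the n-th child of its parent. A sketch on a toy list; the .meta class is hypothetical, and the values are copied from the output above:

```r
library(rvest)

# Toy page: each <li> holds year, runtime, and rating as sibling spans
toy <- minimal_html('
<ul>
  <li><span class="meta">1994</span><span class="meta">2h 22m</span><span class="meta">R</span></li>
  <li><span class="meta">1957</span><span class="meta">1h 36m</span><span class="meta">Approved</span></li>
</ul>')

first  <- toy |> html_elements(".meta:nth-child(1)") |> html_text()  # years
second <- toy |> html_elements(".meta:nth-child(2)") |> html_text()  # runtimes
third  <- toy |> html_elements(".meta:nth-child(3)") |> html_text()  # ratings
```

Because the selection is positional, a movie with a missing field would shift or shorten these vectors, so it is worth checking that they all have the same length before combining them.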

Put the pieces together

imdb_top_250 <- tibble(
  title = titles, 
  year = years, 
  runtime = runtimes,
  mpaa = mpaas
  )

imdb_top_250
# A tibble: 25 × 4
   title                                                year  runtime mpaa    
   <chr>                                                <chr> <chr>   <chr>   
 1 1. The Shawshank Redemption                          1994  2h 22m  R       
 2 2. The Godfather                                     1972  2h 55m  R       
 3 3. The Dark Knight                                   2008  2h 32m  PG-13   
 4 4. The Godfather Part II                             1974  3h 22m  R       
 5 5. 12 Angry Men                                      1957  1h 36m  Approved
 6 6. The Lord of the Rings: The Return of the King     2003  3h 21m  PG-13   
 7 7. Schindler's List                                  1993  3h 15m  R       
 8 8. Pulp Fiction                                      1994  2h 34m  R       
 9 9. The Lord of the Rings: The Fellowship of the Ring 2001  2h 58m  PG-13   
10 10. The Good, the Bad and the Ugly                   1966  2h 58m  R       
# ℹ 15 more rows
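
Every column in this tibble is still text, and the rank is glued to the title. One possible cleanup is sketched below on three rows copied from the output above. The column names rank and runtime_min, and the choice of regular expressions, are assumptions for illustration rather than the only way to do it:

```r
library(dplyr)
library(stringr)
library(tibble)

# Three rows copied from the scraped output above
raw <- tibble(
  title = c("1. The Shawshank Redemption", "2. The Godfather", "5. 12 Angry Men"),
  year = c("1994", "1972", "1957"),
  runtime = c("2h 22m", "2h 55m", "1h 36m")
)

clean <- raw %>%
  mutate(
    # Leading digits are the rank; compute it before the title is overwritten
    rank = as.integer(str_extract(title, "^\\d+")),
    # Drop only the leading "N. " so titles like "12 Angry Men" stay intact
    title = str_remove(title, "^\\d+\\.\\s+"),
    year = as.integer(year),
    # Pull hours and minutes separately; either part may be absent
    hours = as.integer(str_extract(runtime, "\\d+(?=h)")),
    minutes = as.integer(str_extract(runtime, "\\d+(?=m)")),
    runtime_min = coalesce(hours, 0L) * 60L + coalesce(minutes, 0L)
  ) %>%
  select(rank, title, year, runtime_min)

clean
```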

Wait a second… there aren’t 250 movies here

imdb_top_250
# A tibble: 25 × 4
   title                                                year  runtime mpaa    
   <chr>                                                <chr> <chr>   <chr>   
 1 1. The Shawshank Redemption                          1994  2h 22m  R       
 2 2. The Godfather                                     1972  2h 55m  R       
 3 3. The Dark Knight                                   2008  2h 32m  PG-13   
 4 4. The Godfather Part II                             1974  3h 22m  R       
 5 5. 12 Angry Men                                      1957  1h 36m  Approved
 6 6. The Lord of the Rings: The Return of the King     2003  3h 21m  PG-13   
 7 7. Schindler's List                                  1993  3h 15m  R       
 8 8. Pulp Fiction                                      1994  2h 34m  R       
 9 9. The Lord of the Rings: The Fellowship of the Ring 2001  2h 58m  PG-13   
10 10. The Good, the Bad and the Ugly                   1966  2h 58m  R       
# ℹ 15 more rows

Many modern web pages load their tables dynamically: more rows are added only as you scroll down. read_html() reads only the HTML that the server initially sends; it does not run JavaScript or scroll, so it sees only the rows that were loaded at first.
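
A small sketch of this behaviour, using a made-up toy page (the id movies and the script are assumptions for illustration): in a browser the script would add a third row after the page loads, but the parsed document contains only the two rows written in the HTML itself.

```r
library(rvest)

# Toy page: two rows in the HTML, a third that a browser would add via JavaScript
page <- minimal_html('
  <ul id="movies">
    <li>Film 1</li>
    <li>Film 2</li>
  </ul>
  <script>
    var li = document.createElement("li");
    li.textContent = "Film 3";
    document.getElementById("movies").appendChild(li);
  </script>
')

# The script is parsed as text but never executed, so only two rows are found
items <- page %>%
  html_elements("#movies li") %>%
  html_text()

length(items)
```

Checking length() against the number of rows you expect is a quick way to notice that part of a page was loaded dynamically.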

What do we do?

  1. Is there an API available? Can I access it as a student/researcher at no/low cost?
    • IMDb: Yes, but cost-prohibitive
  2. Is there a different scraping tool available?
    • Yes, {RSelenium} is one that might work, but it is beyond the scope of this course
  3. Can I get the information I need from a different website?
  4. If I download the page, is more information available?
    • In this case, yes, but this might not always work

With local copy of website

imdb_local <- read_html("IMDb Top 250 Movies.html")
titles <- imdb_local %>%
  html_elements(".with-margin .ipc-title__text") %>%
  html_text()


titles
  [1] "1. The Shawshank Redemption"                                             
  [2] "2. The Godfather"                                                        
  [3] "3. The Dark Knight"                                                      
  [4] "4. The Godfather Part II"                                                
  [5] "5. 12 Angry Men"                                                         
  [6] "6. The Lord of the Rings: The Return of the King"                        
  [7] "7. Schindler's List"                                                     
  [8] "8. Pulp Fiction"                                                         
  [9] "9. The Lord of the Rings: The Fellowship of the Ring"                    
 [10] "10. The Good, the Bad and the Ugly"                                      
 [11] "11. Forrest Gump"                                                        
 [12] "12. The Lord of the Rings: The Two Towers"                               
 [13] "13. Fight Club"                                                          
 [14] "14. Inception"                                                           
 [15] "15. Star Wars: Episode V - The Empire Strikes Back"                      
 [16] "16. The Matrix"                                                          
 [17] "17. Goodfellas"                                                          
 [18] "18. One Flew Over the Cuckoo's Nest"                                     
 [19] "19. Interstellar"                                                        
 [20] "20. Se7en"                                                               
 [21] "21. It's a Wonderful Life"                                               
 [22] "22. Seven Samurai"                                                       
 [23] "23. The Silence of the Lambs"                                            
 [24] "24. Saving Private Ryan"                                                 
 [25] "25. City of God"                                                         
 [26] "26. The Green Mile"                                                      
 [27] "27. Life Is Beautiful"                                                   
 [28] "28. Terminator 2: Judgment Day"                                          
 [29] "29. Star Wars: Episode IV - A New Hope"                                  
 [30] "30. Back to the Future"                                                  
 [31] "31. Spirited Away"                                                       
 [32] "32. The Pianist"                                                         
 [33] "33. Gladiator"                                                           
 [34] "34. Parasite"                                                            
 [35] "35. Psycho"                                                              
 [36] "36. The Lion King"                                                       
 [37] "37. Grave of the Fireflies"                                              
 [38] "38. The Departed"                                                        
 [39] "39. Whiplash"                                                            
 [40] "40. Harakiri"                                                            
 [41] "41. American History X"                                                  
 [42] "42. The Prestige"                                                        
 [43] "43. Léon: The Professional"                                              
 [44] "44. Spider-Man: Across the Spider-Verse"                                 
 [45] "45. Casablanca"                                                          
 [46] "46. The Usual Suspects"                                                  
 [47] "47. The Intouchables"                                                    
 [48] "48. Cinema Paradiso"                                                     
 [49] "49. Modern Times"                                                        
 [50] "50. Alien"                                                               
 [51] "51. Rear Window"                                                         
 [52] "52. Once Upon a Time in the West"                                        
 [53] "53. Django Unchained"                                                    
 [54] "54. City Lights"                                                         
 [55] "55. Dune: Part Two"                                                      
 [56] "56. Apocalypse Now"                                                      
 [57] "57. Memento"                                                             
 [58] "58. WALL·E"                                                              
 [59] "59. Raiders of the Lost Ark"                                             
 [60] "60. The Lives of Others"                                                 
 [61] "61. Avengers: Infinity War"                                              
 [62] "62. Sunset Boulevard"                                                    
 [63] "63. Spider-Man: Into the Spider-Verse"                                   
 [64] "64. Paths of Glory"                                                      
 [65] "65. Witness for the Prosecution"                                         
 [66] "66. The Shining"                                                         
 [67] "67. The Great Dictator"                                                  
 [68] "68. 12th Fail"                                                           
 [69] "69. Aliens"                                                              
 [70] "70. Inglourious Basterds"                                                
 [71] "71. The Dark Knight Rises"                                               
 [72] "72. Coco"                                                                
 [73] "73. Amadeus"                                                             
 [74] "74. Toy Story"                                                           
 [75] "75. Avengers: Endgame"                                                   
 [76] "76. Oldboy"                                                              
 [77] "77. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"
 [78] "78. Good Will Hunting"                                                   
 [79] "79. American Beauty"                                                     
 [80] "80. Das Boot"                                                            
 [81] "81. Braveheart"                                                          
 [82] "82. Princess Mononoke"                                                   
 [83] "83. Your Name."                                                          
 [84] "84. High and Low"                                                        
 [85] "85. 3 Idiots"                                                            
 [86] "86. Joker"                                                               
 [87] "87. Once Upon a Time in America"                                         
 [88] "88. Capernaum"                                                           
 [89] "89. Singin' in the Rain"                                                 
 [90] "90. Come and See"                                                        
 [91] "91. Requiem for a Dream"                                                 
 [92] "92. Toy Story 3"                                                         
 [93] "93. Star Wars: Episode VI - Return of the Jedi"                          
 [94] "94. The Hunt"                                                            
 [95] "95. Eternal Sunshine of the Spotless Mind"                               
 [96] "96. Ikiru"                                                               
 [97] "97. 2001: A Space Odyssey"                                               
 [98] "98. Reservoir Dogs"                                                      
 [99] "99. The Apartment"                                                       
[100] "100. Lawrence of Arabia"                                                 
[101] "101. Incendies"                                                          
[102] "102. Scarface"                                                           
[103] "103. Double Indemnity"                                                   
[104] "104. North by Northwest"                                                 
[105] "105. Heat"                                                               
[106] "106. Citizen Kane"                                                       
[107] "107. M"                                                                  
[108] "108. Up"                                                                 
[109] "109. Full Metal Jacket"                                                  
[110] "110. Vertigo"                                                            
[111] "111. Amélie"                                                             
[112] "112. A Clockwork Orange"                                                 
[113] "113. Oppenheimer"                                                        
[114] "114. To Kill a Mockingbird"                                              
[115] "115. A Separation"                                                       
[116] "116. Die Hard"                                                           
[117] "117. The Sting"                                                          
[118] "118. Like Stars on Earth"                                                
[119] "119. Indiana Jones and the Last Crusade"                                 
[120] "120. Metropolis"                                                         
[121] "121. I'm Still Here"                                                     
[122] "122. Snatch"                                                             
[123] "123. 1917"                                                               
[124] "124. L.A. Confidential"                                                  
[125] "125. Bicycle Thieves"                                                    
[126] "126. Downfall"                                                           
[127] "127. Dangal"                                                             
[128] "128. Taxi Driver"                                                        
[129] "129. Hamilton"                                                           
[130] "130. The Wolf of Wall Street"                                            
[131] "131. Batman Begins"                                                      
[132] "132. Green Book"                                                         
[133] "133. For a Few Dollars More"                                             
[134] "134. Some Like It Hot"                                                   
[135] "135. The Truman Show"                                                    
[136] "136. Judgment at Nuremberg"                                              
[137] "137. The Kid"                                                            
[138] "138. The Father"                                                         
[139] "139. Shutter Island"                                                     
[140] "140. All About Eve"                                                      
[141] "141. There Will Be Blood"                                                
[142] "142. Jurassic Park"                                                      
[143] "143. Casino"                                                             
[144] "144. The Sixth Sense"                                                    
[145] "145. Ran"                                                                
[146] "146. Top Gun: Maverick"                                                  
[147] "147. No Country for Old Men"                                             
[148] "148. The Thing"                                                          
[149] "149. Pan's Labyrinth"                                                    
[150] "150. Unforgiven"                                                         
[151] "151. A Beautiful Mind"                                                   
[152] "152. Kill Bill: Vol. 1"                                                  
[153] "153. The Treasure of the Sierra Madre"                                   
[154] "154. Yojimbo"                                                            
[155] "155. Prisoners"                                                          
[156] "156. Finding Nemo"                                                       
[157] "157. The Great Escape"                                                   
[158] "158. Monty Python and the Holy Grail"                                    
[159] "159. Howl's Moving Castle"                                               
[160] "160. The Elephant Man"                                                   
[161] "161. Dial M for Murder"                                                  
[162] "162. Gone with the Wind"                                                 
[163] "163. Rashomon"                                                           
[164] "164. The Wild Robot"                                                     
[165] "165. Chinatown"                                                          
[166] "166. Klaus"                                                              
[167] "167. The Secret in Their Eyes"                                           
[168] "168. Lock, Stock and Two Smoking Barrels"                                
[169] "169. V for Vendetta"                                                     
[170] "170. Inside Out"                                                         
[171] "171. Three Billboards Outside Ebbing, Missouri"                          
[172] "172. Trainspotting"                                                      
[173] "173. The Bridge on the River Kwai"                                       
[174] "174. Raging Bull"                                                        
[175] "175. Catch Me If You Can"                                                
[176] "176. Fargo"                                                              
[177] "177. Warrior"                                                            
[178] "178. Harry Potter and the Deathly Hallows: Part 2"                       
[179] "179. Gran Torino"                                                        
[180] "180. Million Dollar Baby"                                                
[181] "181. Spider-Man: No Way Home"                                            
[182] "182. My Neighbor Totoro"                                                 
[183] "183. Mad Max: Fury Road"                                                 
[184] "184. Ben-Hur"                                                            
[185] "185. Children of Heaven"                                                 
[186] "186. Barry Lyndon"                                                       
[187] "187. 12 Years a Slave"                                                   
[188] "188. Before Sunrise"                                                     
[189] "189. Blade Runner"                                                       
[190] "190. The Grand Budapest Hotel"                                           
[191] "191. Dead Poets Society"                                                 
[192] "192. Hacksaw Ridge"                                                      
[193] "193. Gone Girl"                                                          
[194] "194. Memories of Murder"                                                 
[195] "195. In the Name of the Father"                                          
[196] "196. Monsters, Inc."                                                     
[197] "197. Ratatouille"                                                        
[198] "198. The Gold Rush"                                                      
[199] "199. Wild Tales"                                                         
[200] "200. How to Train Your Dragon"                                           
[201] "201. Sherlock Jr."                                                       
[202] "202. Jaws"                                                               
[203] "203. The Deer Hunter"                                                    
[204] "204. Mary and Max"                                                       
[205] "205. The General"                                                        
[206] "206. Ford v Ferrari"                                                     
[207] "207. The Wages of Fear"                                                  
[208] "208. On the Waterfront"                                                  
[209] "209. Mr. Smith Goes to Washington"                                       
[210] "210. Wild Strawberries"                                                  
[211] "211. Maharaja"                                                           
[212] "212. Logan"                                                              
[213] "213. The Third Man"                                                      
[214] "214. Rocky"                                                              
[215] "215. Tokyo Story"                                                        
[216] "216. The Big Lebowski"                                                   
[217] "217. Spotlight"                                                          
[218] "218. The Seventh Seal"                                                   
[219] "219. The Terminator"                                                     
[220] "220. Room"                                                               
[221] "221. Pirates of the Caribbean: The Curse of the Black Pearl"             
[222] "222. Hotel Rwanda"                                                       
[223] "223. La haine"                                                           
[224] "224. Platoon"                                                            
[225] "225. Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansion Arc"               
[226] "226. Jai Bhim"                                                           
[227] "227. Before Sunset"                                                      
[228] "228. The Best Years of Our Lives"                                        
[229] "229. The Exorcist"                                                       
[230] "230. The Passion of Joan of Arc"                                         
[231] "231. The Wizard of Oz"                                                   
[232] "232. The Incredibles"                                                    
[233] "233. Rush"                                                               
[234] "234. The Sound of Music"                                                 
[235] "235. Hachi: A Dog's Tale"                                                
[236] "236. Stand by Me"                                                        
[237] "237. Network"                                                            
[238] "238. My Father and My Son"                                               
[239] "239. The Handmaiden"                                                     
[240] "240. The Iron Giant"                                                     
[241] "241. To Be or Not to Be"                                                 
[242] "242. The Battle of Algiers"                                              
[243] "243. Into the Wild"                                                      
[244] "244. The Grapes of Wrath"                                                
[245] "245. Groundhog Day"                                                      
[246] "246. The Help"                                                           
[247] "247. A Silent Voice: The Movie"                                          
[248] "248. Amores Perros"                                                      
[249] "249. Rebecca"                                                            
[250] "250. A Man Escaped"                                                      
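
read_html() accepts a file path as well as a URL, which is what makes the local copy work. The sketch below writes a tiny stand-in file to a temporary location (the file contents and the class name title are assumptions for illustration) and then scrapes it the same way:

```r
library(rvest)

# Stand-in for a page saved from the browser after scrolling to the bottom
local_path <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body>",
  "<h3 class='title'>1. Film A</h3>",
  "<h3 class='title'>2. Film B</h3>",
  "</body></html>"
), local_path)

# A relative path is resolved against the working directory, so check it first
stopifnot(file.exists(local_path))

saved_page <- read_html(local_path)
saved_titles <- saved_page %>%
  html_elements(".title") %>%
  html_text()

saved_titles
```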

.rmd vs .R

R Markdown:

  • Integrate code, text, and graphs
  • Output is a “report”
  • Code is run interactively (in chunks) and when knitting your final document

R script:

  • Think of it as a file that contains only R chunks
  • No “knitting”: all code must be run explicitly
  • Useful for longer chunks of code
    • data-cleaning.R
    • fit-models.R
    • scrape-data.R

For scraping:

  • We don’t want to scrape a website more than we need to
  • For HW, it’s OK to continue to use .Rmd unless specified
  • For projects that involve intensive data-gathering:
    • Use an R script to read in the “raw” data, clean it, and save it to a tidy csv
    • Read your “clean” data into your .rmd and proceed as usual
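
A minimal sketch of this two-file workflow, assuming a hypothetical output file shows-clean.csv and using a hand-typed tibble as a stand-in for the scraped data. The file.exists() guard is one way to avoid scraping again when the clean file is already on disk:

```r
library(dplyr)
library(readr)
library(tibble)

# Hypothetical output path; a real project would use a project folder, not tempdir()
csv_path <- file.path(tempdir(), "shows-clean.csv")

# ---- scrape-data.R: run explicitly, only when the data is needed ----
if (!file.exists(csv_path)) {
  # Stand-in for the scraped "raw" data (everything arrives as text)
  shows_raw <- tibble(
    title = c("Show A", "Show B"),
    year = c("2019", "2021")
  )
  shows_clean <- shows_raw %>%
    mutate(year = as.integer(year))
  write_csv(shows_clean, csv_path)
}

# ---- analysis .Rmd: reads the saved file and never touches the website ----
shows <- read_csv(csv_path, show_col_types = FALSE)
shows
```

Because knitting reruns every chunk, keeping the scraping step out of the .Rmd means the website is not hit each time the report is rebuilt.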

Example

Your turn:

  • In an R script:

    • Scrape the names, scores, and years of the most popular TV shows on IMDb: www.imdb.com/chart/tvmeter

    • Create a data frame called tvshows with the variables: rank, title, stars, year, episodes, n_ratings

    • Wrangle your resulting data so that all variable types are imported correctly

    • Use write_csv to save your file. If time allows, read it into the 21-scraping.rmd and make a graph

Dataset includes variables like:

  • Political leanings
  • Religion
  • Drug usage
  • Sexual preferences
  • Zodiac sign
  • Over 2,000 variables in total (although not all users had all variables recorded)

Follow-up study (and article correction)

https://www.tandfonline.com/doi/abs/10.1080/10691898.2015.11889737