Day 21
Carleton College
Stat 220 - Spring 2025
Web APIs (application programming interfaces): a website offers a set of structured HTTP requests that return JSON or XML files.
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
Can you query this webpage?
Are there restrictions on the use of the data?
How many requests can you make per minute?
…and more…
Use robotstxt::paths_allowed() to see if you can scrape the web page.
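As a minimal sketch of what such a check does, the example below feeds a made-up robots.txt to the robotstxt package as text, so it runs without a network connection. It uses the `robotstxt(text = ...)` constructor and its `check()` method rather than `paths_allowed()`, which downloads the real file from a live site. The rules and paths are hypothetical.

```r
library(robotstxt)

# Hypothetical robots.txt rules, supplied as text (no download needed)
rules <- "User-agent: *\nDisallow: /private/\n"
rt <- robotstxt(text = rules)

# Hypothetical paths to test; "*" asks about any bot
paths <- c("/", "/private/data.html")

# Check one path at a time; TRUE means scraping that path is permitted
allowed <- vapply(
  paths,
  function(p) rt$check(paths = p, bot = "*"),
  logical(1)
)
allowed
```

For a live site, a call such as `robotstxt::paths_allowed("https://www.boxofficemojo.com/year/2024/")` downloads the site's robots.txt and runs the check in one step.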
What websites have data about you? Think of 1-2 and see if scraping is allowed on those sites.
Lots of data on the web is still available as HTML
It is structured (hierarchical / tree based), but it’s often not available in a form useful for analysis (flat / tidy).
HTML uses tags to describe different aspects of document content
Tag | Example |
---|---|
heading | <h1>My Title</h1> |
paragraph | <p>A paragraph of content...</p> |
table | <table> ... </table> |
anchor (with attribute) | <a href="http://www.mysite.net">click here for link</a> |
rvest functions
Function | Description |
---|---|
read_html | Read HTML data from a url or character string |
html_element | Select a specified element from HTML document |
html_elements | Select specified elements from HTML document |
html_table | Parse an HTML table into a data frame |
html_text | Extract tag pairs’ content |
html_name | Extract tags’ names |
html_attrs | Extract all of each tag’s attributes |
html_attr | Extract tags’ attribute value by name |
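The functions in this table can be tried without visiting a website, because `read_html()` also accepts literal HTML. The sketch below parses a small made-up page that reuses the tag examples from the earlier table; the second link is a hypothetical addition.

```r
library(rvest)

# A small, made-up page supplied as a character string
page <- read_html('
  <html><body>
    <h1>My Title</h1>
    <p>A paragraph of content...</p>
    <a href="http://www.mysite.net">click here for link</a>
    <a href="http://www.example.com">another link</a>
  </body></html>')

# html_element() returns the first match; html_text() extracts its content
heading <- page %>% html_element("h1") %>% html_text()

# html_elements() returns every match
links <- page %>% html_elements("a")

link_names <- html_name(links)          # tag name of each node
link_urls  <- html_attr(links, "href")  # one attribute, selected by name
link_text  <- html_text(links)          # text between the tag pairs
```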
https://www.boxofficemojo.com/year/2024/
Take a look at the web page and the html source code
Chrome or Firefox: right click -> View page source
Look for the "table" div ID or tag
{html_document}
<html class="a-no-js" data-19ax5a9jf="dingo">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body id="body" class="mojo-page-id-yld a-m-us a-aui_72554-c a-aui_a11y_6 ...
List of 2
$ node:<externalptr>
$ doc :<externalptr>
- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
There are over 100 HTML elements:
Every HTML page must be in an <html> element, and it must have two children: <head> and <body>
Block tags like <h1>, <p>, and <ol> form the structure of the page
Inline tags like <b>, <i>, and <a> format text inside block tags
We’ll often work with tables. HTML tables are composed of four main elements: <table>, <tr> (table row), <th> (table heading), and <td> (table data).
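To see how those four table elements map onto a data frame, here is a small sketch that parses a hand-written table with `html_table()`. The markup is a simplified stand-in, not the actual Box Office Mojo source; only the two rows of values are taken from the scraped output in these slides.

```r
library(rvest)

# Simplified, hand-written table: one header row (<th>) and two data rows (<td>)
page <- read_html('
  <table>
    <tr><th>Rank</th><th>Release</th><th>Gross</th></tr>
    <tr><td>1</td><td>Inside Out 2</td><td>$652,980,194</td></tr>
    <tr><td>2</td><td>Deadpool &amp; Wolverine</td><td>$636,745,858</td></tr>
  </table>')

# Select the <table> node, then convert it to a data frame (tibble)
movies <- page %>%
  html_element("table") %>%
  html_table()

movies
```

The Gross column comes back as character because of the dollar signs and commas, so it still needs a cleaning step before analysis.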
Use html_element() or html_elements() to extract pieces out of HTML documents
Rows: 200
Columns: 11
$ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ Release <chr> "Inside Out 2", "Deadpool & Wolverine", "Wicked", "Moan…
$ Genre <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ Budget <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ `Running Time` <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ Gross <chr> "$652,980,194", "$636,745,858", "$432,943,285", "$404,0…
$ Theaters <chr> "4,440", "4,330", "3,888", "4,200", "4,449", "4,575", "…
$ `Total Gross` <chr> "$652,980,194", "$636,745,858", "$473,231,120", "$460,4…
$ `Release Date` <chr> "Jun 14", "Jul 26", "Nov 22", "Nov 27", "Jul 3", "Sep 6…
$ Distributor <chr> "Walt Disney Studios Motion Pictures", "Walt Disney Stu…
$ Estimated <chr> "false", "false", "false", "false", "false", "false", "…
top2024 <- top2024 %>%
  mutate(
    Gross = parse_number(Gross),
    Theaters = parse_number(Theaters),
    `Total Gross` = parse_number(`Total Gross`)
  ) %>%
  separate(`Release Date`, into = c("Month", "Day"))
glimpse(top2024)
Rows: 200
Columns: 12
$ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ Release <chr> "Inside Out 2", "Deadpool & Wolverine", "Wicked", "Moan…
$ Genre <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ Budget <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ `Running Time` <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ Gross <dbl> 652980194, 636745858, 432943285, 404017489, 361004205, …
$ Theaters <dbl> 4440, 4330, 3888, 4200, 4449, 4575, 4074, 4170, 3948, 4…
$ `Total Gross` <dbl> 652980194, 636745858, 473231120, 460405297, 361004205, …
$ Month <chr> "Jun", "Jul", "Nov", "Nov", "Jul", "Sep", "Mar", "Jul",…
$ Day <chr> "14", "26", "22", "27", "3", "6", "1", "19", "29", "8",…
$ Distributor <chr> "Walt Disney Studios Motion Pictures", "Walt Disney Stu…
$ Estimated <chr> "false", "false", "false", "false", "false", "false", "…
https://www.carleton.edu/catalog/current/search/?subject=STAT&term=25WI
View the page source to try to find the html elements where this data is located (e.g. ‘h1’, ‘p’, ‘table’)
{xml_nodeset (22)}
[1] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[2] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[3] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[4] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[5] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[6] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[7] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[8] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[9] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[10] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[11] <h3 class="courseSearchResultsHeading relatedCourses" id="relatedCourses ...
[12] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[13] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[14] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[15] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[16] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[17] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[18] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[19] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
[20] <h3 class="courseTitleBar">\n <span class="courseNumber" data ...
...
[1] "\n STAT 120\n Introduction to Statistics\n \n 6 credits\n \n "
[2] "\n STAT 220\n Introduction to Data Science\n \n 6 credits\n \n "
[3] "\n STAT 230\n Applied Regression Analysis\n \n 6 credits\n \n "
[4] "\n STAT 250\n Introduction to Statistical Inference\n \n 6 credits\n \n "
[5] "\n STAT 285\n Statistical Consulting\n \n 2 credits\n \n "
[6] "\n STAT 297\n Assessment and Communication of External Statistical Activity\n \n 1 credits\n \n "
[7] "\n STAT 330\n Advanced Statistical Modeling\n \n 6 credits\n \n "
[8] "\n STAT 394\n Directed Research in Statistics\n \n 1 – 6 credits\n \n "
[9] "\n STAT 399\n Senior Seminar\n \n 6 credits\n \n "
[10] "\n STAT 400\n Integrative Exercise\n \n 3 – 6 credits\n \n "
[11] "Related Courses"
[12] "\n CS 111\n Introduction to Computer Science\n \n 6 credits\n \n "
[13] "\n CS 314\n Data Visualization\n \n 6 credits\n \n "
[14] "\n MATH 101\n Calculus with Problem Solving\n \n 6 credits\n \n "
[15] "\n MATH 111\n Introduction to Calculus\n \n 6 credits\n \n "
[16] "\n MATH 120\n Calculus 2\n \n 6 credits\n \n "
[17] "\n MATH 210\n Calculus 3\n \n 6 credits\n \n "
[18] "\n MATH 211\n Introduction to Multivariable Calculus\n \n 6 credits\n \n "
[19] "\n MATH 232\n Linear Algebra\n \n 6 credits\n \n "
[20] "\n MATH 240\n Probability\n \n 6 credits\n \n "
[21] "Liberal Arts Requirements"
[22] "Other Course Tags"
[1] "STAT 120 Introduction to Statistics 6 credits"
[2] "STAT 220 Introduction to Data Science 6 credits"
[3] "STAT 230 Applied Regression Analysis 6 credits"
[4] "STAT 250 Introduction to Statistical Inference 6 credits"
[5] "STAT 285 Statistical Consulting 2 credits"
[6] "STAT 297 Assessment and Communication of External Statistical Activity 1 credits"
[7] "STAT 330 Advanced Statistical Modeling 6 credits"
[8] "STAT 394 Directed Research in Statistics 1 – 6 credits"
[9] "STAT 399 Senior Seminar 6 credits"
[10] "STAT 400 Integrative Exercise 3 – 6 credits"
[11] "Related Courses"
[12] "CS 111 Introduction to Computer Science 6 credits"
[13] "CS 314 Data Visualization 6 credits"
[14] "MATH 101 Calculus with Problem Solving 6 credits"
[15] "MATH 111 Introduction to Calculus 6 credits"
[16] "MATH 120 Calculus 2 6 credits"
[17] "MATH 210 Calculus 3 6 credits"
[18] "MATH 211 Introduction to Multivariable Calculus 6 credits"
[19] "MATH 232 Linear Algebra 6 credits"
[20] "MATH 240 Probability 6 credits"
[21] "Liberal Arts Requirements"
[22] "Other Course Tags"
Course numbers are between <span class="courseNumber"> ... </span> tags
These tags can be selected with a CSS class selector: a . followed by the name of the class (here, .courseNumber)
{xml_nodeset (19)}
[1] <span class="courseNumber" data-terms="25/WI">STAT 120</span>
[2] <span class="courseNumber" data-terms="25/WI">STAT 220</span>
[3] <span class="courseNumber" data-terms="25/WI">STAT 230</span>
[4] <span class="courseNumber" data-terms="25/WI">STAT 250</span>
[5] <span class="courseNumber" data-terms="25/WI">STAT 285</span>
[6] <span class="courseNumber" data-terms="25/WI">STAT 297</span>
[7] <span class="courseNumber" data-terms="25/WI">STAT 330</span>
[8] <span class="courseNumber" data-terms="24/FA 25/WI 25/SP">STAT 394</span>
[9] <span class="courseNumber" data-terms="25/WI">STAT 399</span>
[10] <span class="courseNumber" data-terms="25/WI">STAT 400</span>
[11] <span class="courseNumber" data-terms="25/WI">CS 111</span>
[12] <span class="courseNumber" data-terms="25/WI">CS 314</span>
[13] <span class="courseNumber" data-terms="25/WI">MATH 101</span>
[14] <span class="courseNumber" data-terms="25/WI">MATH 111</span>
[15] <span class="courseNumber" data-terms="25/WI">MATH 120</span>
[16] <span class="courseNumber" data-terms="25/WI">MATH 210</span>
[17] <span class="courseNumber" data-terms="25/WI">MATH 211</span>
[18] <span class="courseNumber" data-terms="25/WI">MATH 232</span>
[19] <span class="courseNumber" data-terms="25/WI">MATH 240</span>
[1] "\n 6 credits\n "
[2] "\n 6 credits\n "
[3] "\n 6 credits\n "
[4] "\n 6 credits\n "
[5] "\n 2 credits\n "
[6] "\n 1 credits\n "
[7] "\n 6 credits\n "
[8] "\n 1 – 6 credits\n "
[9] "\n 6 credits\n "
[10] "\n 3 – 6 credits\n "
[11] "\n 6 credits\n "
[12] "\n 6 credits\n "
[13] "\n 6 credits\n "
[14] "\n 6 credits\n "
[15] "\n 6 credits\n "
[16] "\n 6 credits\n "
[17] "\n 6 credits\n "
[18] "\n 6 credits\n "
[19] "\n 6 credits\n "
[1] "6 credits" "6 credits" "6 credits" "6 credits"
[5] "2 credits" "1 credits" "6 credits" "1 – 6 credits"
[9] "6 credits" "3 – 6 credits" "6 credits" "6 credits"
[13] "6 credits" "6 credits" "6 credits" "6 credits"
[17] "6 credits" "6 credits" "6 credits"
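The class-selector pattern behind the output above can be sketched on a single listing. The fragment below is a simplified, hand-written imitation of one course heading, not the catalog's real markup; the class names match those used in these slides.

```r
library(rvest)
library(stringr)

# Simplified imitation of one course heading from the catalog page
listing <- read_html('
  <h3 class="courseTitleBar">
    <span class="courseNumber" data-terms="25/WI">STAT 220</span>
    <span class="courseTitle">Introduction to Data Science</span>
    <span class="credits">
      6 credits
    </span>
  </h3>')

# ".courseNumber" selects every element whose class is courseNumber
course <- listing %>% html_elements(".courseNumber") %>% html_text()

# str_squish() trims the surrounding whitespace and newlines
credits <- listing %>%
  html_elements(".credits") %>%
  html_text() %>%
  str_squish()

# Attributes of the same element are available through html_attr()
terms <- listing %>% html_elements(".courseNumber") %>% html_attr("data-terms")
```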
stat_winter2025 <- tibble(
  course = listings %>% html_elements(".courseNumber") %>% html_text(),
  title = listings %>% html_elements(".courseTitle") %>% html_text(),
  credits = listings %>% html_elements(".credits") %>% html_text() %>% str_squish(),
  description = listings %>% html_elements(".courseDetailWrapper") %>% html_text() %>% str_squish()
)
stat_winter2025
# A tibble: 19 × 4
course title credits description
<chr> <chr> <chr> <chr>
1 STAT 120 Introduction to Statistics 6 cred… Introducti…
2 STAT 220 Introduction to Data Science 6 cred… This cours…
3 STAT 230 Applied Regression Analysis 6 cred… A second c…
4 STAT 250 Introduction to Statistical Inference 6 cred… Introducti…
5 STAT 285 Statistical Consulting 2 cred… Students w…
6 STAT 297 Assessment and Communication of External Statis… 1 cred… An indepen…
7 STAT 330 Advanced Statistical Modeling 6 cred… Topics inc…
8 STAT 394 Directed Research in Statistics 1 – 6 … Spatial pr…
9 STAT 399 Senior Seminar 6 cred… As part of…
10 STAT 400 Integrative Exercise 3 – 6 … Either a s…
11 CS 111 Introduction to Computer Science 6 cred… This cours…
12 CS 314 Data Visualization 6 cred… Understand…
13 MATH 101 Calculus with Problem Solving 6 cred… An introdu…
14 MATH 111 Introduction to Calculus 6 cred… An introdu…
15 MATH 120 Calculus 2 6 cred… Inverse fu…
16 MATH 210 Calculus 3 6 cred… Vectors, c…
17 MATH 211 Introduction to Multivariable Calculus 6 cred… Vectors, c…
18 MATH 232 Linear Algebra 6 cred… Linear alg…
19 MATH 240 Probability 6 cred… Introducti…
listings %>%
  html_elements(".course-section") %>%
  html_element(".courseSectionNumber") %>%
  html_text() %>%
  str_squish()
[1] "STAT 120.01 Winter 2025" "STAT 120.02 Winter 2025"
[3] "STAT 120.03 Winter 2025" "STAT 220.00 Winter 2025"
[5] "STAT 230.00 Winter 2025" "STAT 250.00 Winter 2025"
[7] "STAT 285.00 Winter 2025" "STAT 297.00 Winter 2025"
[9] "STAT 330.00 Winter 2025" "STAT 394.11 Winter 2025"
[11] "STAT 394.12 Winter 2025" "STAT 399.00 Winter 2025"
[13] "STAT 400.01 Winter 2025" "STAT 400.02 Winter 2025"
[15] "STAT 400.03 Winter 2025" "CS 111.01 Winter 2025"
[17] "CS 111.02 Winter 2025" "CS 314.00 Winter 2025"
[19] "MATH 101.00 Winter 2025" "MATH 111.00 Winter 2025"
[21] "MATH 120.01 Winter 2025" "MATH 120.02 Winter 2025"
[23] "MATH 120.03 Winter 2025" "MATH 210.01 Winter 2025"
[25] "MATH 210.02 Winter 2025" "MATH 211.00 Winter 2025"
[27] "MATH 232.01 Winter 2025" "MATH 232.02 Winter 2025"
[29] "MATH 240.00 Winter 2025" "MATH 240.02 Winter 2025"
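The code above calls html_elements() first and html_element() second. One likely reason for that order is alignment: html_element() returns exactly one result per section and gives NA when a section lacks the piece, so columns built this way stay the same length. The sketch below illustrates this on made-up markup; the `.size` class is hypothetical and only stands in for a field that some sections lack.

```r
library(rvest)

# Two made-up sections; the second has no size field
page <- read_html('
  <div class="course-section">
    <span class="courseSectionNumber">STAT 120.01</span>
    <span class="size">Size:32</span>
  </div>
  <div class="course-section">
    <span class="courseSectionNumber">STAT 297.00</span>
  </div>')

sections <- page %>% html_elements(".course-section")

# One result per section, NA where the element is missing
sizes_aligned <- sections %>% html_element(".size") %>% html_text()

# All matches on the page: the missing entry silently disappears
sizes_all <- page %>% html_elements(".size") %>% html_text()

length(sizes_aligned)  # same as the number of sections
length(sizes_all)      # shorter, so it no longer lines up with the sections
```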
[1] "STAT 120.01 Winter 2025 Faculty:Claire Kelling 🏫 👤 Size:32 M, WCMC 102 11:10am-12:20pm FCMC 102 12:00pm-1:00pm Sophomore Priority; Not open to students who have already received credit for Psychology 200/201, Sociology/Anthropology 239 or Statistics 250 Sophomore Priority."
[2] "STAT 120.02 Winter 2025 Faculty:Spencer Wadsworth 🏫 👤 Size:32 M, WCMC 102 12:30pm-1:40pm FCMC 102 1:10pm-2:10pm"
[3] "STAT 120.03 Winter 2025 Faculty:Rebecca Terry 🏫 👤 Size:32 M, WCMC 102 1:50pm-3:00pm FCMC 102 2:20pm-3:20pm"
[4] "STAT 220.00 Winter 2025 Faculty:Amanda Luby 🏫 👤 Size:30 M, WCMC 102 9:50am-11:00am FCMC 102 9:40am-10:40am"
[5] "STAT 230.00 Winter 2025 Faculty:Claire Kelling 🏫 👤 Size:28 M, WCMC 306 1:50pm-3:00pm FCMC 306 2:20pm-3:20pm"
[6] "STAT 250.00 Winter 2025 Faculty:Adam Loy 🏫 👤 Size:28 M, WCMC 301 1:50pm-3:00pm FCMC 301 2:20pm-3:20pm"
[7] "STAT 285.00 Winter 2025 Faculty:Katie St. Clair 🏫 👤 Grading:S/CR/NC TCMC 304 10:10am-11:55am"
[8] "STAT 297.00 Winter 2025 Faculty:Katie St. Clair 🏫 👤 · Claire Kelling 🏫 👤 Grading:S/CR/NC"
[9] "STAT 330.00 Winter 2025 Faculty:Katie St. Clair 🏫 👤 Size:20 M, WCMC 306 9:50am-11:00am FCMC 306 9:40am-10:40am"
[10] "STAT 394.11 Winter 2025 Faculty:Claire Kelling 🏫 👤 Grading:S/CR/NC Credits:2"
[11] "STAT 394.12 Winter 2025 Faculty:Claire Kelling 🏫 👤 Grading:S/CR/NC"
[12] "STAT 399.00 Winter 2025 Faculty:Amanda Luby 🏫 👤 Size:4 Grading:S/CR/NC THCMC 328 10:10am-11:55am"
[13] "STAT 400.01 Winter 2025 Faculty:Katie St. Clair 🏫 👤 Size:10 Grading:S/NC Credits:6"
[14] "STAT 400.02 Winter 2025 Faculty:Adam Loy 🏫 👤 Size:12 Grading:S/NC Credits:3 T, THCMC 304 1:15pm-3:00pm"
[15] "STAT 400.03 Winter 2025 Faculty:Katie St. Clair 🏫 👤 Size:8 Grading:S/NC Credits:3 TCMC 328 1:15pm-3:00pm"
[16] "CS 111.01 Winter 2025 Faculty:Tom Finzell 🏫 👤 Size:38 M, WOlin 310 8:30am-9:40am FOlin 310 8:30am-9:30am"
[17] "CS 111.02 Winter 2025 Faculty:Tom Finzell 🏫 👤 Size:38 M, WOlin 310 11:10am-12:20pm FOlin 310 12:00pm-1:00pm Sophomore Priority Sophomore Priority."
[18] "CS 314.00 Winter 2025 Faculty:Bridger Herman 🏫 👤 Size:34 M, WLeighton 304 12:30pm-1:40pm FLeighton 304 1:10pm-2:10pm"
[19] "MATH 101.00 Winter 2025 Faculty:Deanna Haunsperger 🏫 👤 Size:30 M, WCMC 209 9:50am-11:00am FCMC 209 9:40am-10:40am"
[20] "MATH 111.00 Winter 2025 Faculty:Rob Thompson 🏫 👤 Size:30 M, WCMC 210 11:10am-12:20pm FCMC 210 12:00pm-1:00pm"
[21] "MATH 120.01 Winter 2025 Faculty:Rebecca Terry 🏫 👤 Size:30 M, WCMC 301 11:10am-12:20pm FCMC 301 12:00pm-1:00pm"
[22] "MATH 120.02 Winter 2025 Faculty:Corey Brooke 🏫 👤 Size:30 M, WCMC 210 12:30pm-1:40pm FCMC 210 1:10pm-2:10pm"
[23] "MATH 120.03 Winter 2025 Faculty:Mike Adams 🏫 👤 Size:30 M, WCMC 209 1:50pm-3:00pm FCMC 209 2:20pm-3:20pm"
[24] "MATH 210.01 Winter 2025 Faculty:Corey Brooke 🏫 👤 Size:30 M, WCMC 210 9:50am-11:00am FCMC 210 9:40am-10:40am"
[25] "MATH 210.02 Winter 2025 Faculty:Caroline Turnage-Butterbaugh 🏫 👤 Size:30 M, WCMC 209 11:10am-12:20pm FCMC 209 12:00pm-1:00pm"
[26] "MATH 211.00 Winter 2025 Faculty:Kate Meyer 🏫 👤 Size:30 M, WCMC 206 8:30am-9:40am FCMC 206 8:30am-9:30am"
[27] "MATH 232.01 Winter 2025 Faculty:Rafe Jones 🏫 👤 Size:30 M, WCMC 206 11:10am-12:20pm FCMC 206 12:00pm-1:00pm"
[28] "MATH 232.02 Winter 2025 Faculty:MurphyKate Montee 🏫 👤 Size:30 M, WCMC 209 12:30pm-1:40pm FCMC 209 1:10pm-2:10pm"
[29] "MATH 240.00 Winter 2025 Faculty:Adam Loy 🏫 👤 Size:30 M, WCMC 306 12:30pm-1:40pm FCMC 306 1:10pm-2:10pm"
[30] "MATH 240.02 Winter 2025 Faculty:Rob Thompson 🏫 👤 Size:30 M, WCMC 301 12:30pm-1:40pm FCMC 301 1:10pm-2:10pm"
SelectorGadget: open source tool that eases CSS selector generation and discovery
Easiest to use with the Chrome Extension
Find out more on the SelectorGadget vignette
Use the SelectorGadget to explore http://www.imdb.com/chart/top
What should the columns of our target dataset be? Do they correspond to any specific css selectors?
imdb <- read_html("http://www.imdb.com/chart/top")
titles <- imdb %>%
  html_elements(".with-margin .ipc-title__text") %>%
  html_text()
head(titles)
[1] "1. The Shawshank Redemption"
[2] "2. The Godfather"
[3] "3. The Dark Knight"
[4] "4. The Godfather Part II"
[5] "5. 12 Angry Men"
[6] "6. The Lord of the Rings: The Return of the King"
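The scraped titles carry their rank as a prefix ("1. ", "2. ", ...). One possible way to split them, sketched here with stringr on three titles copied from the Top 250 list, is to pull out the leading digits and then strip the prefix. The variable names are illustrative and are not used elsewhere in these slides.

```r
library(stringr)

# Three titles as scraped, including two whose names start with digits
titles <- c("1. The Shawshank Redemption", "5. 12 Angry Men", "123. 1917")

# Leading run of digits is the rank
rank <- as.integer(str_extract(titles, "^\\d+"))

# Remove the rank, the period, and the following whitespace
title <- str_remove(titles, "^\\d+\\.\\s+")

rank
title
```

Anchoring the pattern at the start of the string (`^`) keeps digits inside the title itself, such as "12 Angry Men" or "1917", from being removed.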
imdb_top_250 <- tibble(
  title = titles,
  year = years,
  runtime = runtimes,
  mpaa = mpaas
)
imdb_top_250
# A tibble: 25 × 4
title year runtime mpaa
<chr> <chr> <chr> <chr>
1 1. The Shawshank Redemption 1994 2h 22m R
2 2. The Godfather 1972 2h 55m R
3 3. The Dark Knight 2008 2h 32m PG-13
4 4. The Godfather Part II 1974 3h 22m R
5 5. 12 Angry Men 1957 1h 36m Approved
6 6. The Lord of the Rings: The Return of the King 2003 3h 21m PG-13
7 7. Schindler's List 1993 3h 15m R
8 8. Pulp Fiction 1994 2h 34m R
9 9. The Lord of the Rings: The Fellowship of the Ring 2001 2h 58m PG-13
10 10. The Good, the Bad and the Ugly 1966 2h 58m R
# ℹ 15 more rows
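Many tables on modern web pages are loaded dynamically (they wait for you to scroll down before loading more rows). rvest can’t scroll, so it can only see the data in the initial page load, which is why the scrape above returned only 25 of the 250 titles.
One workaround: scroll to the bottom of the page in the browser, save the page, and read the saved HTML file.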
imdb_local <- read_html("IMDb Top 250 Movies.html")
titles <- imdb_local %>%
  html_elements(".with-margin .ipc-title__text") %>%
  html_text()
titles
[1] "1. The Shawshank Redemption"
[2] "2. The Godfather"
[3] "3. The Dark Knight"
[4] "4. The Godfather Part II"
[5] "5. 12 Angry Men"
[6] "6. The Lord of the Rings: The Return of the King"
[7] "7. Schindler's List"
[8] "8. Pulp Fiction"
[9] "9. The Lord of the Rings: The Fellowship of the Ring"
[10] "10. The Good, the Bad and the Ugly"
[11] "11. Forrest Gump"
[12] "12. The Lord of the Rings: The Two Towers"
[13] "13. Fight Club"
[14] "14. Inception"
[15] "15. Star Wars: Episode V - The Empire Strikes Back"
[16] "16. The Matrix"
[17] "17. Goodfellas"
[18] "18. One Flew Over the Cuckoo's Nest"
[19] "19. Interstellar"
[20] "20. Se7en"
[21] "21. It's a Wonderful Life"
[22] "22. Seven Samurai"
[23] "23. The Silence of the Lambs"
[24] "24. Saving Private Ryan"
[25] "25. City of God"
[26] "26. The Green Mile"
[27] "27. Life Is Beautiful"
[28] "28. Terminator 2: Judgment Day"
[29] "29. Star Wars: Episode IV - A New Hope"
[30] "30. Back to the Future"
[31] "31. Spirited Away"
[32] "32. The Pianist"
[33] "33. Gladiator"
[34] "34. Parasite"
[35] "35. Psycho"
[36] "36. The Lion King"
[37] "37. Grave of the Fireflies"
[38] "38. The Departed"
[39] "39. Whiplash"
[40] "40. Harakiri"
[41] "41. American History X"
[42] "42. The Prestige"
[43] "43. Léon: The Professional"
[44] "44. Spider-Man: Across the Spider-Verse"
[45] "45. Casablanca"
[46] "46. The Usual Suspects"
[47] "47. The Intouchables"
[48] "48. Cinema Paradiso"
[49] "49. Modern Times"
[50] "50. Alien"
[51] "51. Rear Window"
[52] "52. Once Upon a Time in the West"
[53] "53. Django Unchained"
[54] "54. City Lights"
[55] "55. Dune: Part Two"
[56] "56. Apocalypse Now"
[57] "57. Memento"
[58] "58. WALL·E"
[59] "59. Raiders of the Lost Ark"
[60] "60. The Lives of Others"
[61] "61. Avengers: Infinity War"
[62] "62. Sunset Boulevard"
[63] "63. Spider-Man: Into the Spider-Verse"
[64] "64. Paths of Glory"
[65] "65. Witness for the Prosecution"
[66] "66. The Shining"
[67] "67. The Great Dictator"
[68] "68. 12th Fail"
[69] "69. Aliens"
[70] "70. Inglourious Basterds"
[71] "71. The Dark Knight Rises"
[72] "72. Coco"
[73] "73. Amadeus"
[74] "74. Toy Story"
[75] "75. Avengers: Endgame"
[76] "76. Oldboy"
[77] "77. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"
[78] "78. Good Will Hunting"
[79] "79. American Beauty"
[80] "80. Das Boot"
[81] "81. Braveheart"
[82] "82. Princess Mononoke"
[83] "83. Your Name."
[84] "84. High and Low"
[85] "85. 3 Idiots"
[86] "86. Joker"
[87] "87. Once Upon a Time in America"
[88] "88. Capernaum"
[89] "89. Singin' in the Rain"
[90] "90. Come and See"
[91] "91. Requiem for a Dream"
[92] "92. Toy Story 3"
[93] "93. Star Wars: Episode VI - Return of the Jedi"
[94] "94. The Hunt"
[95] "95. Eternal Sunshine of the Spotless Mind"
[96] "96. Ikiru"
[97] "97. 2001: A Space Odyssey"
[98] "98. Reservoir Dogs"
[99] "99. The Apartment"
[100] "100. Lawrence of Arabia"
[101] "101. Incendies"
[102] "102. Scarface"
[103] "103. Double Indemnity"
[104] "104. North by Northwest"
[105] "105. Heat"
[106] "106. Citizen Kane"
[107] "107. M"
[108] "108. Up"
[109] "109. Full Metal Jacket"
[110] "110. Vertigo"
[111] "111. Amélie"
[112] "112. A Clockwork Orange"
[113] "113. Oppenheimer"
[114] "114. To Kill a Mockingbird"
[115] "115. A Separation"
[116] "116. Die Hard"
[117] "117. The Sting"
[118] "118. Like Stars on Earth"
[119] "119. Indiana Jones and the Last Crusade"
[120] "120. Metropolis"
[121] "121. I'm Still Here"
[122] "122. Snatch"
[123] "123. 1917"
[124] "124. L.A. Confidential"
[125] "125. Bicycle Thieves"
[126] "126. Downfall"
[127] "127. Dangal"
[128] "128. Taxi Driver"
[129] "129. Hamilton"
[130] "130. The Wolf of Wall Street"
[131] "131. Batman Begins"
[132] "132. Green Book"
[133] "133. For a Few Dollars More"
[134] "134. Some Like It Hot"
[135] "135. The Truman Show"
[136] "136. Judgment at Nuremberg"
[137] "137. The Kid"
[138] "138. The Father"
[139] "139. Shutter Island"
[140] "140. All About Eve"
[141] "141. There Will Be Blood"
[142] "142. Jurassic Park"
[143] "143. Casino"
[144] "144. The Sixth Sense"
[145] "145. Ran"
[146] "146. Top Gun: Maverick"
[147] "147. No Country for Old Men"
[148] "148. The Thing"
[149] "149. Pan's Labyrinth"
[150] "150. Unforgiven"
[151] "151. A Beautiful Mind"
[152] "152. Kill Bill: Vol. 1"
[153] "153. The Treasure of the Sierra Madre"
[154] "154. Yojimbo"
[155] "155. Prisoners"
[156] "156. Finding Nemo"
[157] "157. The Great Escape"
[158] "158. Monty Python and the Holy Grail"
[159] "159. Howl's Moving Castle"
[160] "160. The Elephant Man"
[161] "161. Dial M for Murder"
[162] "162. Gone with the Wind"
[163] "163. Rashomon"
[164] "164. The Wild Robot"
[165] "165. Chinatown"
[166] "166. Klaus"
[167] "167. The Secret in Their Eyes"
[168] "168. Lock, Stock and Two Smoking Barrels"
[169] "169. V for Vendetta"
[170] "170. Inside Out"
[171] "171. Three Billboards Outside Ebbing, Missouri"
[172] "172. Trainspotting"
[173] "173. The Bridge on the River Kwai"
[174] "174. Raging Bull"
[175] "175. Catch Me If You Can"
[176] "176. Fargo"
[177] "177. Warrior"
[178] "178. Harry Potter and the Deathly Hallows: Part 2"
[179] "179. Gran Torino"
[180] "180. Million Dollar Baby"
[181] "181. Spider-Man: No Way Home"
[182] "182. My Neighbor Totoro"
[183] "183. Mad Max: Fury Road"
[184] "184. Ben-Hur"
[185] "185. Children of Heaven"
[186] "186. Barry Lyndon"
[187] "187. 12 Years a Slave"
[188] "188. Before Sunrise"
[189] "189. Blade Runner"
[190] "190. The Grand Budapest Hotel"
[191] "191. Dead Poets Society"
[192] "192. Hacksaw Ridge"
[193] "193. Gone Girl"
[194] "194. Memories of Murder"
[195] "195. In the Name of the Father"
[196] "196. Monsters, Inc."
[197] "197. Ratatouille"
[198] "198. The Gold Rush"
[199] "199. Wild Tales"
[200] "200. How to Train Your Dragon"
[201] "201. Sherlock Jr."
[202] "202. Jaws"
[203] "203. The Deer Hunter"
[204] "204. Mary and Max"
[205] "205. The General"
[206] "206. Ford v Ferrari"
[207] "207. The Wages of Fear"
[208] "208. On the Waterfront"
[209] "209. Mr. Smith Goes to Washington"
[210] "210. Wild Strawberries"
[211] "211. Maharaja"
[212] "212. Logan"
[213] "213. The Third Man"
[214] "214. Rocky"
[215] "215. Tokyo Story"
[216] "216. The Big Lebowski"
[217] "217. Spotlight"
[218] "218. The Seventh Seal"
[219] "219. The Terminator"
[220] "220. Room"
[221] "221. Pirates of the Caribbean: The Curse of the Black Pearl"
[222] "222. Hotel Rwanda"
[223] "223. La haine"
[224] "224. Platoon"
[225] "225. Demon Slayer: Kimetsu no Yaiba - Tsuzumi Mansion Arc"
[226] "226. Jai Bhim"
[227] "227. Before Sunset"
[228] "228. The Best Years of Our Lives"
[229] "229. The Exorcist"
[230] "230. The Passion of Joan of Arc"
[231] "231. The Wizard of Oz"
[232] "232. The Incredibles"
[233] "233. Rush"
[234] "234. The Sound of Music"
[235] "235. Hachi: A Dog's Tale"
[236] "236. Stand by Me"
[237] "237. Network"
[238] "238. My Father and My Son"
[239] "239. The Handmaiden"
[240] "240. The Iron Giant"
[241] "241. To Be or Not to Be"
[242] "242. The Battle of Algiers"
[243] "243. Into the Wild"
[244] "244. The Grapes of Wrath"
[245] "245. Groundhog Day"
[246] "246. The Help"
[247] "247. A Silent Voice: The Movie"
[248] "248. Amores Perros"
[249] "249. Rebecca"
[250] "250. A Man Escaped"
Rmarkdown:
R script:
data-cleaning.R
fit-models.R
scrape-data.R
In an R script:
Scrape the names, scores, and years of the most popular TV shows on IMDb: www.imdb.com/chart/tvmeter
Create a data frame called tvshows with the variables: rank, title, stars, year, episodes, n_ratings
Wrangle your resulting data so that all variable types are imported correctly
Use write_csv to save your file. If time, read it into 21-scraping.rmd and make a graph
Dataset includes variables like:
https://www.tandfonline.com/doi/abs/10.1080/10691898.2015.11889737