GitHub + Reproducible Reporting

Day 02

Prof Amanda Luby

Carleton College
Stat 220 - Spring 2025

Plan for today:

  1. Questions from syllabus quiz
  2. GitHub
    • Accessing your private repos
    • knit 🧶 commit ✅ push ⤴️
  3. Reproducible Reporting

Questions from syllabus quiz

  • Will the course build on Stat 120 concepts?
  • Clarification of what gets submitted to Gradescope and what to GitHub
  • RMarkdown versus Quarto?
  • Examples of previous final projects?
  • Info about portfolio projects?

GitHub

Git + GitHub

  • Git is a version control system - like “Track Changes”, on steroids
  • It’s not the only version control system, but it’s a very popular one
  • GitHub is the home for your Git-based projects on the internet: like Dropbox, but much, much better
  • We will use GitHub as a platform for web hosting and collaboration

Why do we need it?

Versioning

Versioning (with human-readable messages)

How does it work for Stat220?

Let’s try it!

  • Follow the “Individual Assignment” directions at https://stat220-s25.github.io/computing/git-stat220.html to access your day02 repo and create an R project
  • Edit the .Rmd file:
    • Change “author” to your name
    • Use # to add descriptive section headers for each code chunk
    • Add a sentence or two describing the summary statistics of the dataset
  • knit 🧶 commit ✅ push ⤴️
  • View on github.com and confirm you can see your changes

Reproducible Reporting

Why do we need it?

Oops! I gave you the wrong set of data.

Other examples:

  • The results in Table 1 don’t seem to correspond to those in Figure 2.
  • In what order do I run these scripts?
  • Where did we get this data file?
  • Why did I omit those samples?
  • How did I make that figure?
  • “Your script is now giving an error.”
  • “The attached is similar to the code we used.”

Reproducible data science

Short Term Impact

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)

Long Term Impact

  • Can the code be used for other data?
  • Can you extend the code to other things?

The toolkit

  • Scriptability → R

  • Code environment → RStudio

  • Literate programming (code, narrative, output in one place) → R Markdown

  • Version control → Git / GitHub

What is R Markdown?

  1. An authoring framework for data science.

  2. A document format (.Rmd).

  3. An R package named rmarkdown.

  4. A file format for making dynamic documents with R.

  5. A tool for integrating prose, code, and results.

  6. A computational document.

What about Quarto?

  • “Next Gen” RMarkdown
  • More compatibility with other languages (Python, Observable JS, etc.)
  • Separate software, not an R package
  • I’m going to distribute HW, etc. via RMarkdown, but you are welcome to use either!

The setup chunk

```{r setup, include=FALSE}
knitr::opts_chunk$set(
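  collapse = TRUE,    # merge each chunk's code and output into one block
  comment = "#>",     # prefix printed output lines with #>
  out.width = "100%"  # scale figures to the full width of the output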
)
```
  • A special chunk label: setup
  • Typically the first chunk in the document
  • Sets global chunk options: all following chunks will use these options
  • Tip: set include=FALSE so the chunk's code and output stay out of the report
  • You can (and should) use individual chunk options too
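
As a rough sketch of how global and individual chunk options combine, the example below knits a tiny document held in a character vector. The chunk labels (shown, hidden-code) and the arithmetic are made up for illustration; the sketch assumes the knitr package is installed.

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, built here to keep the example readable

# A tiny, hypothetical R Markdown source held in a character vector.
rmd <- c(
  paste0(fence, "{r setup, include=FALSE}"),
  "knitr::opts_chunk$set(comment = '#>')",       # global option for later chunks
  fence,
  "",
  paste0(fence, "{r shown}"),                    # uses only the global options
  "1 + 1",
  fence,
  "",
  paste0(fence, "{r hidden-code, echo=FALSE}"),  # chunk-level option: hide the code
  "2 + 2",
  fence
)

# With `text =`, knit() returns the knitted Markdown instead of writing a file.
md <- knit(text = rmd, quiet = TRUE, envir = new.env())
cat(md)
```

Both results should carry the global #> prefix, while the source code of the second chunk is left out because of its own echo=FALSE option.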

Parameters

---
title: Survivor
output:
  html_document:
    toc: true
    toc_float: true
    theme: flatly
params:
  season: '20'
---
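
Inside the document, values declared under params are available as a list named params. The sketch below builds that list by hand so it can run on its own; the castaways data frame and the file name survivor.Rmd are hypothetical stand-ins, not part of the course materials.

```r
# Stand-in for the `params` list that R Markdown creates from the YAML header
# when the document is knitted; built by hand here so the sketch runs alone.
params <- list(season = "20")

# Hypothetical data: one row per contestant, with the season they appeared in.
castaways <- data.frame(
  name   = c("Ana", "Ben", "Cal", "Dee"),
  season = c(19, 20, 20, 21)
)

# The YAML value '20' is quoted, so it arrives as a character string.
season_number <- as.integer(params$season)
this_season <- castaways[castaways$season == season_number, ]
print(this_season)

# To knit the same report for another season without editing the file
# (the file name is hypothetical), one could call:
# rmarkdown::render("survivor.Rmd", params = list(season = "21"))
```

Keeping the season in the header means a single change there, or a single params argument to render(), updates every place the report uses it.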

Your Task

There’s an example HTML report on the schedule

Your task is to reproduce it in 02-example-lego.rmd (or .qmd, if you prefer).

To be as reproducible as possible, you’ll need to use:

  • YAML metadata
  • YAML parameters
  • Code chunks with appropriate options
  • Inline R code
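
Inline R code, the last item above, places a computed value directly in a sentence. As a minimal sketch (assuming knitr is installed, and using a three-row stand-in for the real data set), the following knits one sentence that contains an inline expression:

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, delimiting a code chunk
tick  <- "`"             # one backtick, delimiting inline code

rmd <- c(
  paste0(fence, "{r load, include=FALSE}"),
  "# Hypothetical stand-in for the LEGO data used in the report",
  "sets <- data.frame(name = c('Castle', 'Space', 'Town'))",
  fence,
  "",
  # In the .Rmd file this line is plain text with `r nrow(sets)` typed inline.
  paste0("This data set contains ", tick, "r nrow(sets)", tick, " LEGO sets.")
)

# Knit in a fresh environment so `sets` does not leak into the workspace.
md <- knit(text = rmd, quiet = TRUE, envir = new.env())
cat(md)
```

If the data change, the number in the sentence changes with them the next time the report is knitted.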

Hints

  1. To begin, use inline R code to replace “hard coding” the quantities that are highlighted below.

For example, instead of typing 19798 you would include nrow(sets) as inline R code. Make sure the report knits and you get the right values.

  2. Next, add a parameter to your YAML header that stores the location of the data set (call it data). Make sure the report knits.

  3. Change the code chunk where you load the data set to use the data parameter you just defined rather than the hard-coded URL. Make sure the report knits.

  4. Now, let’s make a parameter for the source of the data set, so you don’t have to hunt for every mention of it in the report; it will live with the other metadata (where it belongs). To do this, add a parameter that gives the source of your data (call it data_source) and set it equal to “the 2022-09-09 repository on Tidy Tuesday.” Make sure the report knits.

  5. At this point it looks like everything is working. Awesome job! To put it to the test, let’s update the parameters of your report and knit it to see if everything changes as we would expect. Here are the new parameter values:

data: "http://math.carleton.edu/aluby/stat220/lego_subset.csv"
data_source: "Amanda's website"

  6. (optional) If you have time, or would like to try outside of class, I’ve created a practice Gradescope assignment space. First, add PDF output to your document and knit to PDF. Commit the PDF and push to GitHub. Then, log into Gradescope (you may have to go through the link on Moodle the first time) and link your GitHub repo to the submission space.
output:
  html_document: default
  pdf_document: default
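
The parameter hints above (storing the data location and the data source in the YAML header) can be sketched as follows. This is only an illustration under stated assumptions: the params list is built by hand rather than by knitting, and a small hypothetical CSV written to a temporary file stands in for the real URL so the example runs offline.

```r
# Hypothetical two-row CSV written to a temp path; in the report,
# params$data would hold the URL given in the YAML header instead.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(name = c("Castle", "Space"), year = c(1978, 1979)),
  csv_path,
  row.names = FALSE
)

# Stand-in for the `params` list created from the YAML header at knit time.
params <- list(data = csv_path, data_source = "Amanda's website")

# The loading chunk reads from the parameter, not from a hard-coded location.
sets <- read.csv(params$data)

# The narrative pulls the source and the row count from the same place.
sentence <- paste0(
  "The data come from ", params$data_source,
  " and contain ", nrow(sets), " sets."
)
print(sentence)
```

Swapping in new parameter values then changes both the data that get loaded and the text that describes them, with no edits to the body of the report.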