Portfolio Project 2

Wrangling weather forecasts

Overview

For your second portfolio project, you’ll apply what you’ve learned about wrangling data using the tidyverse. Your goal is to learn which areas of the U.S. struggle with weather prediction and explore possible reasons why. Specifically, you will focus on the error in high and low temperature forecasting, and may wish to also consider precipitation and outlook.
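
As a starting point, forecast error can be expressed as the forecast temperature minus the observed temperature. The sketch below is only an illustration under stated assumptions: the rows are made-up toy values that mimic a few columns of the forecast data, and the names `error`, `mean_error`, `mean_abs_error`, and `n_forecasts` are hypothetical choices, not part of the data set.

```r
library(dplyr)

# Toy rows mimicking a few columns of the forecast data (values are made up)
forecasts <- tibble(
  city          = c("A", "A", "B", "B"),
  state         = c("XX", "XX", "YY", "YY"),
  high_or_low   = c("high", "low", "high", "low"),
  observed_temp = c(70L, 50L, 90L, 60L),
  forecast_temp = c(73L, 49L, 84L, 60L)
)

error_summary <- forecasts %>%
  # Positive error: the forecast was too warm; negative: too cold
  mutate(error = forecast_temp - observed_temp) %>%
  group_by(city, state, high_or_low) %>%
  summarize(
    mean_error     = mean(error, na.rm = TRUE),       # direction of the bias
    mean_abs_error = mean(abs(error), na.rm = TRUE),  # typical size of the miss
    n_forecasts    = sum(!is.na(error)),              # rows behind each mean
    .groups = "drop"
  )

error_summary
```

Keeping both a signed and an absolute summary is a design choice: the signed mean shows whether forecasts run warm or cold, while the absolute mean shows how large the misses are regardless of direction.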

You should be careful about summarizing and joining data, and be on the lookout for data quality issues!
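
One way to be careful about a join is to check the keys before trusting the result. The following is a minimal sketch with made-up toy tables; `city_errors` and `cities` are hypothetical names, and joining on both `city` and `state` is an assumption based on both variables appearing in each file.

```r
library(dplyr)

# Toy per-city summary and toy city lookup table (values are made up)
city_errors <- tibble(
  city           = c("A", "B", "C"),
  state          = c("XX", "YY", "ZZ"),
  mean_abs_error = c(2, 3, 1.5)
)
cities <- tibble(
  city      = c("A", "B", "D"),
  state     = c("XX", "YY", "WW"),
  elevation = c(10, 250, 1200)
)

# Rows in the summary that have no matching city information
unmatched <- anti_join(city_errors, cities, by = c("city", "state"))

# Repeated keys in the lookup table would duplicate rows after the join
duplicate_keys <- cities %>%
  count(city, state) %>%
  filter(n > 1)

# left_join keeps every summary row; unmatched rows get NA for city columns
joined <- left_join(city_errors, cities, by = c("city", "state"))
```

Comparing `nrow(joined)` with `nrow(city_errors)` after the join is a quick way to confirm that no rows were duplicated or lost.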

You should write a short report describing your findings. I envision an introductory paragraph that provides some context to your data, and a couple of paragraphs outlining your findings. That’s it. I’m looking for something that is insightful and well-crafted, rather than long and exhaustive.

You should write your blog post in R Markdown, create any graphics using ggplot2, and use tools from this class for data wrangling. To submit your work, push both your R Markdown (.Rmd) file and knitted output document to GitHub. Do not forget to give your post an informative title!

Data

The data for this portfolio problem is from the National Weather Service. The data includes sixteen months of forecasts and observations from 167 cities, as well as a separate data set with information about those cities and some other American cities.

Your repos will contain the following files:

  • data/forecast_cities.csv
  • data/outlook_meanings.csv
  • data/weather_forecasts.csv

weather_forecasts.csv

| variable | class | description |
|---|---|---|
| date | date | date described by the forecast |
| city | factor | observation city |
| state | factor | state or territory |
| high_or_low | factor | whether the forecast is for the high temperature or the low temperature |
| forecast_hours_before | integer | the number of hours before the observation (one of 12, 24, 36, or 48) |
| observed_temp | integer | the actual observed temperature on that date (high or low) |
| forecast_temp | integer | the predicted temperature on that date (high or low) |
| observed_precip | double | the observed precipitation on that date, in inches; note that some observations lack an indication of precipitation, while others explicitly report 0 |
| forecast_outlook | factor | an abbreviation for the general outlook, such as precipitation type |
| possible_error | factor | either (1) “none” if the row contains no potential errors or (2) the name of the variable that is the cause of the potential error |
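
The `possible_error` and `observed_precip` descriptions both hint at data quality decisions. The sketch below shows one possible way to handle them, using made-up toy rows; dropping flagged rows is an assumption about what suits a given analysis, not a requirement, and `precip_status` is a hypothetical column name.

```r
library(dplyr)

# Toy rows mimicking two columns of the forecast data (values are made up)
forecasts <- tibble(
  observed_precip = c(0, NA, 0.25, 0),
  possible_error  = c("none", "none", "observed_temp", "none")
)

# See how many rows are flagged, and for which variable, before deciding
flag_counts <- forecasts %>% count(possible_error)

clean <- forecasts %>%
  # Keep only rows with no flagged potential error
  filter(possible_error == "none") %>%
  # Distinguish a missing precipitation value from an explicit zero
  mutate(precip_status = case_when(
    is.na(observed_precip) ~ "not reported",
    observed_precip == 0   ~ "none observed",
    TRUE                   ~ "precipitation"
  ))
```

Treating a missing precipitation value as zero would quietly change any precipitation summary, so keeping the two cases separate makes that choice explicit.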

forecast_cities.csv

| variable | class | description |
|---|---|---|
| city | character | city |
| state | character | state or territory |
| lon | double | longitude |
| lat | double | latitude |
| koppen | character | Köppen climate classification |
| elevation | double | elevation in meters |
| distance_to_coast | double | distance to the coast in miles |
| wind | double | mean wind speed |
| elevation_change_four | double | greatest elevation change in meters out of the four closest points to this city in a collection of elevations used by the team at Saint Louis University |
| elevation_change_eight | double | greatest elevation change in meters out of the eight closest points to this city in a collection of elevations used by the team at Saint Louis University |
| avg_annual_precip | double | average annual precipitation in inches |

outlook_meanings.csv

| variable | class |
|---|---|
| forecast_outlook | character |
| meaning | character |

Submission

Your submission will be a short report detailing your findings.

Rubric

A successful project will meet the following requirements:

    • Very few grammatical mistakes, spelling mistakes, or typos
    • Informative title for your report is included
    • Any graphs are readable with appropriate titles and labels
    • The rendered document does not contain any unnecessary content (package loading messages, warnings, etc.)
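
For the last requirement, one common approach is to set knitr chunk options once in a setup chunk near the top of the .Rmd file. This is a sketch of that approach, not the only way to do it; the same options can also be set on individual chunks.

```r
# Intended for a setup chunk near the top of the .Rmd file.
# Applies to every later chunk unless a chunk overrides it.
knitr::opts_chunk$set(
  message = FALSE,  # hide messages, such as package loading notes
  warning = FALSE   # hide warnings in the knitted output
)
```

Hiding warnings only changes what the knitted document shows, so it is still worth reading them in the console while you work.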

An excellent project will meet all of the requirements for a successful project, plus

    • No grammatical mistakes, spelling mistakes, or typos
    • Graphs have been customized (theme, color palette, scales, etc.)

Can I work with someone?

This is an individual portfolio project. You may brainstorm with other people in the class, get feedback on any graphs or output, and get conceptual help with debugging or errors, but you should not be sharing code. All work that you submit should be your own.

From the syllabus: You are expected to collaborate with your group, but cannot rely on external sources other than to help motivate the questions or provide other background information (including online forums like StackExchange or Reddit). You may use any resources from class and package documentation, but getting answers on significant parts of solutions from outside resources is not allowed.

A note/reminder on AI: Large language models (e.g., ChatGPT or Gemini) should only be used for coding or debugging help after you’ve attempted to solve the problem on your own. You should never copy and paste any course materials into a large language model, and you should never copy and paste anything out of a large language model into your course materials. Copying, paraphrasing, summarizing, or submitting work generated by anyone but yourself without proper attribution is considered academic dishonesty (this includes output from LLMs). You are not allowed to upload datasets or assignment prompts into a large language model.

FAQ

If you have any questions, please post them to the #portfolio-projects channel on Slack.