Overview

For your second portfolio project, you’ll apply what you’ve learned about wrangling data using the tidyverse. Your goal is to learn which areas of the U.S. struggle with weather prediction and explore possible reasons why. Specifically, you will focus on the error in high and low temperature forecasting, and may wish to also consider precipitation and outlook.

You should be careful about summarizing and joining data, and be on the lookout for data quality issues!

You should write a short report describing your findings. I envision an introductory paragraph that provides some context to your data, and a couple of paragraphs outlining your findings. That’s it. I’m looking for something that is insightful and well-crafted, rather than long and exhaustive.

You should write your blog post in R Markdown, create any graphics using ggplot2, and use tools from this class for data wrangling. To submit your work, push both your R Markdown (.Rmd) file and the knitted output document to GitHub. Do not forget to give your post an informative title!

Data

The data for this portfolio problem is from the National Weather Service. The data includes sixteen months of forecasts and observations from 167 cities, as well as a separate data set with information about those cities and some other American cities.

Your repos will contain the following files:

data/forecast_cities.csv
data/outlook_meanings.csv
data/weather_forecasts.csv

`weather_forecasts.csv`

variable	class	description
date	date	date described by the forecast
city	factor	observation city
state	factor	state or territory
high_or_low	factor	whether the forecast is for the high temperature or the low temperature
forecast_hours_before	integer	the number of hours before the observation (one of 12, 24, 36, or 48)
observed_temp	integer	the actual observed temperature on that date (high or low)
forecast_temp	integer	the predicted temperature on that date (high or low)
observed_precip	double	the observed precipitation on that date, in inches; note that some observations lack an indication of precipitation, while others explicitly report 0
forecast_outlook	factor	an abbreviation for the general outlook, such as precipitation type
possible_error	factor	either (1) “none” if the row contains no potential errors or (2) the name of the variable that is the cause of the potential error
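
As a rough illustration of how forecast error might be summarized from these columns, here is a minimal sketch. The rows below are invented toy values that only mimic the column layout of weather_forecasts.csv, and the names `forecasts`, `error`, `mean_abs_error`, and `n_obs` are hypothetical choices, not part of the data set. Mean absolute error is one possible measure; others may suit your question better.

```r
library(dplyr)

# Toy rows invented for illustration only; the real data would come from
# data/weather_forecasts.csv (e.g. via readr::read_csv()).
forecasts <- tibble(
  city = c("A", "A", "B", "B"),
  state = c("XX", "XX", "YY", "YY"),
  high_or_low = c("high", "low", "high", "low"),
  forecast_hours_before = c(12L, 12L, 48L, 48L),
  observed_temp = c(70L, 50L, 81L, NA),
  forecast_temp = c(68L, 53L, 85L, 60L)
)

error_summary <- forecasts %>%
  # Signed error: positive means the forecast was too warm.
  mutate(error = forecast_temp - observed_temp) %>%
  group_by(city, state, high_or_low) %>%
  summarize(
    # Drop rows with a missing observation or forecast before averaging.
    mean_abs_error = mean(abs(error), na.rm = TRUE),
    # Count how many rows actually contributed to the average.
    n_obs = sum(!is.na(error)),
    .groups = "drop"
  )

print(error_summary)
```

Keeping a count such as `n_obs` next to each average makes it easier to notice groups whose summary rests on few or no usable rows.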

`forecast_cities.csv`

variable	class	description
city	character	city
state	character	state or territory
lon	double	longitude
lat	double	latitude
koppen	character	Köppen climate classification
elevation	double	elevation in meters
distance_to_coast	double	distance to the coast in miles
wind	double	mean wind speed
elevation_change_four	double	greatest elevation change in meters out of the four closest points to this city in a collection of elevations used by the team at Saint Louis University
elevation_change_eight	double	greatest elevation change in meters out of the eight closest points to this city in a collection of elevations used by the team at Saint Louis University
avg_annual_precip	double	average annual precipitation in inches
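
Because both tables carry `city` and `state`, city-level attributes can be attached to a per-city summary with a join on those two columns. The sketch below uses invented toy rows and hypothetical names (`city_errors`, `cities`, `joined`, `unmatched`); it is one possible approach, not a prescribed one. Joining on both keys, rather than `city` alone, avoids mixing up cities that share a name across states.

```r
library(dplyr)

# Toy rows invented for illustration only.
city_errors <- tibble(
  city = c("A", "B", "C"),
  state = c("XX", "YY", "ZZ"),
  mean_abs_error = c(2, 4, 3)
)

cities <- tibble(
  city = c("A", "B", "D"),
  state = c("XX", "YY", "WW"),
  elevation = c(10, 1500, 300),
  distance_to_coast = c(5, 600, 40)
)

# Keep every row of the summary and attach city attributes where a match exists.
joined <- city_errors %>%
  left_join(cities, by = c("city", "state"))

# List the summary rows that found no match in the city table.
unmatched <- city_errors %>%
  anti_join(cities, by = c("city", "state"))

print(joined)
print(unmatched)
```

Checking the row count before and after a join, and inspecting the `anti_join()` result, is a quick way to confirm that the join behaved as intended.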

`outlook_meanings.csv`

variable	class
forecast_outlook	character
meaning	character

Submission

Your submission will be a short report detailing your findings.

Rubric

A successful project will:

An excellent project will meet all of the requirements for a successful project, plus

The rendered document is a github_document (this means that it is rendered to Markdown output that GitHub can format). You can check that it worked by opening the rendered file on GitHub. If you can see your plots, it worked!
Identify the data quality issue in weather_forecasts.csv (Hint: looking through forecast_cities first might give you an idea of what to look for)
Meet high submission quality standards
- No grammatical mistakes, spelling mistakes, or typos
- Graphs have been customized (theme, color palette, scales, etc.)

Can I work with someone?

This is an individual portfolio project. You may brainstorm with other people in the class, get feedback on any graphs or output, and get conceptual help with debugging or errors, but you should not be sharing code. All work that you submit should be your own.

From the syllabus: You are expected to collaborate with your group, but cannot rely on external sources other than to help motivate the questions or provide other background information (including online forums like StackExchange or Reddit). You may use any resources from class and package documentation, but getting answers on significant parts of solutions from outside resources is not allowed.

A note/reminder on AI: Large language models (e.g., ChatGPT, Gemini) should only be used for coding or debugging help after you’ve attempted to solve the problem on your own. You should never copy and paste any course materials into a large language model, and you should never copy and paste anything out of a large language model into your course materials. Copying, paraphrasing, summarizing, or submitting work generated by anyone but yourself without proper attribution is considered academic dishonesty (this includes output from LLMs). You are not allowed to upload datasets or assignment prompts into a large language model.

FAQ

If you have any questions, please post them to the #portfolio-projects channel on Slack.