Day 12
Carleton College
Stat 220 - Spring 2025
https://stat220-w25.github.io/computing/rstudio-stat220.html https://stat220-w25.github.io/computing/git-stat220.html
any finite sequence of characters (i.e., letters, numerals, symbols and punctuation marks).
Anything surrounded by quotes(") or single quotes(').
str_viewstr_view is a handy function that prints the underlying representation of a string and can also be used to check pattern matching (which we’ll see in a bit!)
The “escape” backslash is used to escape the special use of certain characters
Error: '\S' is an unrecognized escape in character string (<input>:1:16)
str_length() determines the length of a string.
str_c() allows us to easily create strings from variables/vectors.
[1] "STAT 220 meets from 9:50 a.m. to 11:00 a.m. MWF in CMC 102"
str_c() works with vectors
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
str_to_lower() and str_to_upper() can help “fix” the case
# A tibble: 1,924,665 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
5 1880 F Minnie 1746 0.0179
6 1880 F Margaret 1578 0.0162
7 1880 F Ida 1472 0.0151
8 1880 F Alice 1414 0.0145
9 1880 F Bertha 1320 0.0135
10 1880 F Sarah 1288 0.0132
# ℹ 1,924,655 more rows

Replicate this plot with your own name. If your name doesn’t have enough data, try a friend or professor’s name
# A tibble: 1,924,665 × 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
5 1880 F Minnie 1746 0.0179
6 1880 F Margaret 1578 0.0162
7 1880 F Ida 1472 0.0151
8 1880 F Alice 1414 0.0145
9 1880 F Bertha 1320 0.0135
10 1880 F Sarah 1288 0.0132
# ℹ 1,924,655 more rows
Example questions:
string = character vectorstart / end = position of the first and last charactersWe can pull apart strings from the start…
… or the end
What will the following commands return?
02:00
Confer with folks around you. Fill in the blanks of the .Rmd file to…
Isolate the last letter of every name
Create a logical variable that displays whether the last letter is one of “a”, “e”, “i”, “o”, “u”, or “y”.
Use a weighted mean to calculate the proportion of children whose name ends in a vowel by year (see ?weighted.mean)
and then display the results as a line plot.
06:00
What about a vector of strings?
[1] "ppl" "ine" "ear" "ran" "eac" "ana"
We can add character(s) to the beginning or end of a string
Use the courses dataset in the .rmd file to answer the following:
How many of the course numbers end in .00? Use str_detect() or str_count() to help you answer this question.
The section number appears after the decimal point. Use mutate() and str_sub() to create a section column containing this number.
How many courses contain the word Introduction? Does case matter here?
What is the longest course name (in terms of characters)? What is the shortest course name? Use str_length() to help you answer this question.
Which course name is comprised of the most words? To do this, create a new column containing the words in each title using mutate() and str_split(). Then, create another column calculating the length() of the values in column you just created.
Use str_subset() to return the course names that contain exclamation points (!).
Sometimes the patterns we wish to detect, extract, etc. too complex for exact matching
HH:MM:SS
hw01-aluby)Regular expressions (regexps) are a very terse language that allow you to describe patterns in strings
Confusing at first, but extremely useful
Suppose we wish to anonymize phone numbers in survey results
[1] "Home: 507-645-5489" "Cell: 219.917.9871"
[3] "My work phone is 507-202-2332" "I don't have a phone"
The helper function str_view() finds regex matches
. match any characterFind a “-” and any (.) character that follows
[] match any occurenceFind any numbers between 0 and 9
[] match any occurenceFind any numbers between 2 and 7
Detect either “.” or “-” in the info vector.
02:00
There are a number of special patterns that match more than one character
\\d - digit\\s - white space\\w - word\\t - tab\\n - newline[^] match any occurence exceptANYTHING BUT numbers between 2 and 7
Anchors look for matches at the start ^ or end $
[1] │ Home: 507-645-5489
[2] │ Cell: 219.917.9871
[3] │ My work phone is 507-202-2332
[4] │ I don't have a phone
str_detect, str_sub, etc:[1] │ Home: 507-645-5489
[2] │ Cell: 219.917.9871
[3] │ My work phone is 507-202-2332
[4] │ I don't have a phone
Fill in the code to determine how many baby names in 2015 ended with a vowel.
Use a regular expression to specify the pattern.
03:00