Stat 220 – Working with Strings

Strings in R

Anything surrounded by double quotes (") or single quotes (').

"Carleton College"
"2025"
'"Hello World"'

`str_view`

str_view() is a handy function that prints the actual contents of a string, with any escapes resolved. It can also be used to check pattern matching (which we’ll see in a bit!).

str_view("Carleton College")

[1] │ Carleton College

str_view("2025")

[1] │ 2025

str_view('"Hello World"')

[1] │ "Hello World"

The backslash (\) is the “escape” character: it turns off the special meaning of the character that follows it. To get a literal backslash, type two (\\).

str_view("Math\Stats")

Error: '\S' is an unrecognized escape in character string (<input>:1:16)

str_view("\"")

[1] │ "

str_view("\\")

[1] │ \

str_view("Math\\Stats")

[1] │ Math\Stats
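To see the difference between what you type and what the string contains, it can help to compare print(), writeLines(), and str_length(). The sketch below also uses R's raw-string syntax, which assumes R 4.0.0 or later and is not part of the original slides.

```r
library(stringr)

x <- "Math\\Stats"   # typed with two backslashes
print(x)             # print() shows the escaped form: "Math\\Stats"
writeLines(x)        # writeLines() shows the contents: Math\Stats
str_length(x)        # 10, because the backslash is a single character

# Raw strings (R >= 4.0.0) avoid doubling the backslash
y <- r"(Math\Stats)"
identical(x, y)      # TRUE
```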

stringr: simple, consistent functions for working with strings.
Part of the tidyverse

# loaded with tidyverse
library(stringr)

String length

str_length() determines the length of a string.

cc <- "Carleton College"
str_length(cc)

[1] 16
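str_length() also works on a whole vector at once. A small sketch with a made-up vector (the names here are only for illustration): spaces and punctuation count as characters, the empty string has length 0, and a missing value stays missing.

```r
library(stringr)

schools <- c("Carleton College", "St. Olaf", "", NA)
str_length(schools)
# [1] 16  8  0 NA
```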

Combine strings

str_c() allows us to easily create strings from variables/vectors.

building <- "CMC"
room <- "102"
begin_time <- "9:50 a.m."
end_time <- "11:00 a.m."
days <- "MWF"
class <- "STAT 220"
str_c(class, "meets from", begin_time, "to", end_time, 
      days, "in", building, room, sep=" ")

[1] "STAT 220 meets from 9:50 a.m. to 11:00 a.m. MWF in CMC 102"

Concatenate Strings

str_c() works with vectors, combining them element by element

letters

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

str_c(letters, 1:26)

 [1] "a1"  "b2"  "c3"  "d4"  "e5"  "f6"  "g7"  "h8"  "i9"  "j10" "k11" "l12"
[13] "m13" "n14" "o15" "p16" "q17" "r18" "s19" "t20" "u21" "v22" "w23" "x24"
[25] "y25" "z26"

str_c(letters, 1:26, sep = "")

 [1] "a1"  "b2"  "c3"  "d4"  "e5"  "f6"  "g7"  "h8"  "i9"  "j10" "k11" "l12"
[13] "m13" "n14" "o15" "p16" "q17" "r18" "s19" "t20" "u21" "v22" "w23" "x24"
[25] "y25" "z26"

str_c(letters, 1:26, sep = "-")

 [1] "a-1"  "b-2"  "c-3"  "d-4"  "e-5"  "f-6"  "g-7"  "h-8"  "i-9"  "j-10"
[11] "k-11" "l-12" "m-13" "n-14" "o-15" "p-16" "q-17" "r-18" "s-19" "t-20"
[21] "u-21" "v-22" "w-23" "x-24" "y-25" "z-26"
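The sep argument goes between the pieces of each element. To join a whole vector into a single string, str_c() also has a collapse argument. The sketch below uses a hypothetical meeting_days vector to show both, plus how a missing value propagates.

```r
library(stringr)

meeting_days <- c("Mon", "Wed", "Fri")

# sep is placed between the arguments, element by element
by_element <- str_c("Day", meeting_days, sep = ": ")
by_element
# [1] "Day: Mon" "Day: Wed" "Day: Fri"

# collapse joins the resulting vector into one string
one_string <- str_c(meeting_days, collapse = ", ")
one_string
# [1] "Mon, Wed, Fri"

# a missing piece makes the whole result missing
str_c("Room", NA, sep = " ")
# [1] NA
```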

Case conversion

str_to_lower() and str_to_upper() can help “fix” the case

text <- "NoRthFieLd, mN"
str_to_lower(text)

[1] "northfield, mn"

str_to_upper(text)

[1] "NORTHFIELD, MN"
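One common reason to “fix” the case is that R treats differently capitalized spellings as different values. A minimal sketch with a made-up towns vector:

```r
library(stringr)

towns <- c("Northfield", "NORTHFIELD", "northfield", "NoRthFieLd")

length(unique(towns))          # 4 distinct spellings
unique(str_to_lower(towns))    # only one value after standardizing
# [1] "northfield"
```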

Babynames

library(babynames)
babynames

# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows

Popularity of “Amanda” over time

Replicate this plot with your own name. If your name doesn’t have enough data, try a friend’s or professor’s name.
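The plot itself is not reproduced here. One possible way to build a similar plot is sketched below. It assumes the dplyr and ggplot2 packages are loaded and that the plot shows the prop column for female births, which may differ from the original figure.

```r
library(dplyr)
library(ggplot2)
library(babynames)

# keep one name (and one sex, so there is a single line)
amanda <- babynames %>%
  filter(name == "Amanda", sex == "F")

# proportion of female births named Amanda, by year
ggplot(amanda, aes(x = year, y = prop)) +
  geom_line() +
  labs(x = "Year", y = "Proportion of births")
```

Swapping "Amanda" for another name (and "F" for "M" where appropriate) is the only change needed to try a different name.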

Babynames

Example questions:

How many names end in a vowel?
How many names contain the pattern “stat”?
How many names contain 3 A’s?

Extract substrings

str_sub(string, start = 1, end = -1)

string = character vector
start / end = positions of the first and last characters to extract

Extract substrings

We can pull apart strings from the start…

cc <- "Carleton College"
str_sub(cc, 10)  # end defaults to last character

[1] "College"

str_sub(cc, 1, 8)

[1] "Carleton"

# match the elements of each vector for positions
str_sub(cc, c(1, 10), c(8, 16))

[1] "Carleton" "College"

Extract substrings

… or the end

cc <- "Carleton College"
str_sub(cc, -3)

[1] "ege"

str_sub(cc, -8, -3)

[1] " Colle"
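Negative positions count backwards from the end of the string, so -1 is the last character. The sketch below checks this against the equivalent positive positions computed from str_length().

```r
library(stringr)

cc <- "Carleton College"
n <- str_length(cc)   # 16

# -3 is the third character from the end, i.e. position n - 2
str_sub(cc, -3) == str_sub(cc, n - 2)              # TRUE

# -8 and -3 correspond to positions n - 7 and n - 2
str_sub(cc, -8, -3) == str_sub(cc, n - 7, n - 2)   # TRUE
```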

Your turn:

What will the following commands return?

str_sub("Amanda Luby", 1, 4)
str_sub("Amanda Luby", 4)
str_sub("Amanda Luby", -5)

Your turn (again):

Confer with folks around you. Fill in the blanks of the .Rmd file to…

Isolate the last letter of every name
Create a logical variable that indicates whether the last letter is one of “a”, “e”, “i”, “o”, “u”, or “y”.
Use a weighted mean to calculate the proportion of children whose name ends in a vowel by year (see ?weighted.mean), and then display the results as a line plot.
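Why a weighted mean? A plain mean of a logical column gives the share of names, while weighting by the count gives the share of children. The toy data frame below is entirely made up and the logical column is typed in by hand, so this is only a sketch of how weighted.mean() behaves.

```r
# hypothetical counts for a single year
toy <- data.frame(
  name = c("Emma", "Noah", "Olivia", "Liam"),
  n    = c(300, 100, 500, 100),
  ends_in_vowel = c(TRUE, FALSE, TRUE, FALSE)
)

mean(toy$ends_in_vowel)                       # 0.5: half of the names
weighted.mean(toy$ends_in_vowel, w = toy$n)   # 0.8: 800 of 1000 children
```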

Extract substrings

What about a vector of strings?

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")
str_sub(fruits, 2, 4)

[1] "ppl" "ine" "ear" "ran" "eac" "ana"

str_sub(fruits, 2, 2)

[1] "p" "i" "e" "r" "e" "a"

str_sub(fruits, 4, 4)

[1] "l" "e" "r" "n" "c" "a"

str_sub(fruits, c(2, 4, 2, 4, 2, 4), c(2, 4, 2, 4, 2, 4))

[1] "p" "e" "e" "n" "e" "a"

Pad strings

We can add character(s) to the beginning or end of a string

nums <- 1:10
as.character(nums)

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

str_pad(nums, 2, pad = "0")

 [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"

str_pad(nums, 3, pad = "0")

 [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010"

str_pad(nums, 3, pad = "0", side = "right")

 [1] "100" "200" "300" "400" "500" "600" "700" "800" "900" "100"

str_pad(nums, 3, pad = "0", side = "both")

 [1] "010" "020" "030" "040" "050" "060" "070" "080" "090" "100"
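A typical use for left-padding with zeros is making numbers sort correctly once they are stored as text. A minimal sketch with a made-up ids vector:

```r
library(stringr)

ids <- c(10L, 2L, 1L)

# as plain text, "10" sorts before "2" because text is compared character by character
sort(as.character(ids))

# zero-padded to a common width, text order matches numeric order
padded <- sort(str_pad(ids, 2, pad = "0"))
padded
# [1] "01" "02" "10"
```

Padding on the right or on both sides with "0" changes how the value reads (1 becomes "100" or "010"), so side = "left", the default, is usually the one you want for numbers.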

Use the courses dataset in the .Rmd file to answer the following:

How many of the course numbers end in .00? Use str_detect() or str_count() to help you answer this question.
The section number appears after the decimal point. Use mutate() and str_sub() to create a section column containing this number.
How many courses contain the word Introduction? Does case matter here?
What is the longest course name (in terms of characters)? What is the shortest course name? Use str_length() to help you answer this question.
Which course name is composed of the most words? To do this, create a new column containing the words in each title using mutate() and str_split(). Then, create another column calculating the length() of each value in the column you just created.
Use str_subset() to return the course names that contain exclamation points (!).
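Several of the functions named in these questions have not been shown yet, so here is a small sketch of how each one behaves. The title vector is made up, and the real courses data may look different.

```r
library(stringr)

# hypothetical course titles, not the real courses data
title <- c("Introduction to Statistics",
           "Data Science!",
           "Calculus with Problem Solving")

# str_detect(): does each string contain the pattern? (case sensitive)
str_detect(title, "Intro")
# [1]  TRUE FALSE FALSE
str_detect(title, "intro")
# [1] FALSE FALSE FALSE

# str_count(): how many times does the pattern occur in each string?
str_count(title, "a")
# [1] 1 2 1

# str_split() returns a list with one character vector per string;
# lengths() gives the number of pieces in each list element
words <- str_split(title, " ")
lengths(words)
# [1] 3 2 4

# str_subset() keeps only the strings that match
str_subset(title, fixed("!"))
# [1] "Data Science!"
```

Because str_split() returns a list, the word count comes from applying length() to each element, which is what lengths() does.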
