Agenda

  • Basics of string data
  • String operators
    • Cover both stringr and base functions
  • Special characters
  • Pattern matching

Before we get started

  • In many cases, I’ve opted to show both base and stringr functions
  • stringr is part of the tidyverse, and has some nice functionality, but many of the base functions are so common I think I’d be doing you a disservice if I didn’t also introduce them.
  • There are some stringr functions that are far easier than the base alternative, so I’ll skip base on them (e.g., str_extract()).

Some properties of strings

  • Strings can be anything wrapped in quotes
  • All of the below are strings
"TRUE"
"1"
"a"
"4.78"
"purple"

  • Strings are the most flexible data type. Can be coerced to other types if it makes sense (the below spits out warnings)
as.double(c("TRUE", "1", "a", "4.78", "purple"))
## [1]   NA 1.00   NA 4.78   NA
as.logical(c("TRUE", "1", "a", "4.78", "purple"))
## [1] TRUE   NA   NA   NA   NA

Vectors must be of the same type

  • This implies that if you have a character element in an atomic vector, all will be coerced to character (because it’s the most flexible)
c("string", 1.45, TRUE, 5L)
## [1] "string" "1.45"   "TRUE"   "5"

Factors

  • Factors may not behave as you’d expect
c(factor("a"), "b", 1, 4.59)
## [1] "1"    "b"    "1"    "4.59"
  • Why do you think this is happening?
  • How could we get this to do what we intend? (i.e., return “a”)

Overriding factors

c(as.character(factor("a")), "b", 1, 4.59)
## [1] "a"    "b"    "1"    "4.59"

String data for today


Strings to process

library(stringr) # loaded with the tidyverse as of version 1.2.0
fruit
sentences
words

fruit

head(fruit)
##  [1] "apple"             "apricot"           "avocado"          
##  [4] "banana"            "bell pepper"       "bilberry"         

sentences

head(sentences)
##   [1] "The birch canoe slid on the smooth planks."               
##   [2] "Glue the sheet to the dark blue background."              
##   [3] "It's easy to tell the depth of a well."                   
##   [4] "These days a chicken leg is a rare dish."                 
##   [5] "Rice is often served in round bowls."                     
##   [6] "The juice of lemons makes fine punch."                    

words

head(words)
##   [1] "a"           "able"        "about"       "absolute"    "accept"     
##   [6] "account"   

String operators


Make everything upper case

stringr

str_to_upper(fruit[1:5])
##  [1] "APPLE"             "APRICOT"           "AVOCADO"          
##  [4] "BANANA"            "BELL PEPPER"       

base

toupper(fruit[1:5])
##  [1] "APPLE"             "APRICOT"           "AVOCADO"          
##  [4] "BANANA"            "BELL PEPPER"       

Make everything lower case

stringr

str_to_lower(sentences[1:5])
##   [1] "the birch canoe slid on the smooth planks."               
##   [2] "glue the sheet to the dark blue background."              
##   [3] "it's easy to tell the depth of a well."                   
##   [4] "these days a chicken leg is a rare dish."                 
##   [5] "rice is often served in round bowls."                     

base

tolower(sentences[1:5])
##   [1] "the birch canoe slid on the smooth planks."               
##   [2] "glue the sheet to the dark blue background."              
##   [3] "it's easy to tell the depth of a well."                   
##   [4] "these days a chicken leg is a rare dish."                 
##   [5] "rice is often served in round bowls."                     

Make title case

Notice these are slightly different

stringr

str_to_title("big movie that is really amazing")
## [1] "Big Movie That Is Really Amazing"

base (tools package - comes pre-installed)

tools::toTitleCase("big movie that is really amazing")
## [1] "Big Movie that is Really Amazing"

Other options?

Look at ?toupper

.simpleCap <- function(x) {
    s <- strsplit(x, " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = "", collapse = " ")
}
.simpleCap("the quick brown fox jumps over the lazy brown dog")
## [1] "The Quick Brown Fox Jumps Over The Lazy Brown Dog"

Which mimics str_to_title rather than tools::toTitleCase.

tools::toTitleCase("the quick brown fox jumps over the lazy brown dog")
## [1] "The Quick Brown Fox Jumps over the Lazy Brown Dog"

Join strings together

stringr

str_c("green", "apple")
## [1] "greenapple"
str_c("green", "apple", sep = " ")
## [1] "green apple"
str_c("green", "apple", sep = " : ")
## [1] "green : apple"

base

paste0("green", "apple")
## [1] "greenapple"
paste("green", "apple")
## [1] "green apple"
paste("green", "apple", sep = " : ")
## [1] "green : apple"

Joining strings w/vectors

str_c("a", c("b", "c", "d"), 1:3)
## [1] "ab1" "ac2" "ad3"
str_c("a", c("b", "c", "d"), c(1, 1, 1, 2, 2, 2, 3, 3, 3))
## [1] "ab1" "ac1" "ad1" "ab2" "ac2" "ad2" "ab3" "ac3" "ad3"
  • Note, the last vector could be created with rep(1:3, each = 3)
  • Base version is the same but with paste0

Collapsing strings

str_c("a", c("b", "c", "d"), 1:3, collapse = "|")
## [1] "ab1|ac2|ad3"
str_c("a", c("b", "c", "d"), c(1, 1, 1, 2, 2, 2, 3, 3, 3), collapse = ":")
## [1] "ab1:ac1:ad1:ab2:ac2:ad2:ab3:ac3:ad3"

Calculate string length

words[1:3]
## [1] "a"     "able"  "about"

stringr

str_length(words[1:3])
## [1] 1 4 5

base

nchar(words[1:3])
## [1] 1 4 5

substrings: stringr

words[10:13]
## [1] "active"  "actual"  "add"     "address"
str_sub(words[10:13], 3)
## [1] "tive"  "tual"  "d"     "dress"
str_sub(words[10:13], 3, 5)
## [1] "tiv" "tua" "d"   "dre"
str_sub(words[10:13], -3)
## [1] "ive" "ual" "add" "ess"

substrings: base

substr(words[10:13], 3, nchar(words[10:13]))
## [1] "tive"  "tual"  "d"     "dress"
substr(words[10:13], 3, 5)
## [1] "tiv" "tua" "d"   "dre"
substr(words[10:13], nchar(words[10:13]) - 2, nchar(words[10:13]))
## [1] "ive" "ual" "add" "ess"

A few more substrings with stringr

words[10:13]
## [1] "active"  "actual"  "add"     "address"

Extract the second to second to last characters

str_sub(words[10:13], 2, -2)
## [1] "ctiv"  "ctua"  "d"     "ddres"

Use to modify

str_sub(words[10:13], 2, 4) <- "XX"
words[10:13]
## [1] "aXXve"  "aXXal"  "aXX"    "aXXess"

Locate where strings occur

fruit[c(1, 62, 2:5)]
## [1] "apple"       "pineapple"   "apricot"     "avocado"     "banana"     
## [6] "bell pepper"
str_locate(fruit[c(1, 62, 2:5)], "ap")
##      start end
## [1,]     1   2
## [2,]     5   6
## [3,]     1   2
## [4,]    NA  NA
## [5,]    NA  NA
## [6,]    NA  NA

Trim white space

white_space <- c(" before", "after ", " both ")

stringr

str_trim(white_space)
## [1] "before" "after"  "both"
str_trim(white_space, side = "left")
## [1] "before" "after " "both "
str_trim(white_space, side = "right")
## [1] " before" "after"   " both"

base

trimws(white_space)
## [1] "before" "after"  "both"
trimws(white_space, which = "left")
## [1] "before" "after " "both "
trimws(white_space, which = "right")
## [1] " before" "after"   " both"

Pad white space

stringr

(we won’t talk about base, but it’s sprintf if you’re interested)

strings <- c("abc", "abcdefg")
str_pad(strings, 10)
## [1] "       abc" "   abcdefg"
str_pad(strings, 10, side = "right")
## [1] "abc       " "abcdefg   "
str_pad(strings, 10, side = "both")
## [1] "   abc    " " abcdefg  "

Pad w/something else

string_nums <- as.character(1:15)
string_nums
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15"
str_pad(string_nums, 3, pad = "0")
##  [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011"
## [12] "012" "013" "014" "015"

Special characters


What you see isn’t always what R sees

fox <- "the quick \nbrown fox \n \t jumps over the \t lazy dog"
fox
## [1] "the quick \nbrown fox \n \t jumps over the \t lazy dog"
  • The above code has special characters telling R to break for a new line \n and to tab over \t.
  • You can see how R “sees” the data using the base::writeLines() function

writeLines(fox)
## the quick 
## brown fox 
##  	 jumps over the 	 lazy dog
  • \n and \t are probably the two most common.
  • Use ?"'" to see others

Special symbols

symbols <- c("\u03B1", "\u03B2", "\u03B3", "\u03B4", "\u03B5", "\u03B6")
symbols
## [1] "α" "β" "γ" "δ" "ε" "ζ"

These are called unicode characters and are consistent across programming languages. You can do plenty of other symbols outside of greek too. For example "\u0807" turns into ࠇ.

See http://graphemica.com/unicode/characters/ to find specific numbers


You might be thinking… Plots!

  • Better ways to do that
  • Let’s create some normal distributions and annotate the plot
x <- seq(-5, 5, .1)
l1 <- dnorm(x, 0, 1)
l2 <- dnorm(x, 0, 2)
l3 <- dnorm(x, 0, 3)

d <- data.frame(x = x, 
				likelihood = c(l1, l2, l3), 
				sd = rep(1:3, each = length(x)))

Let’s annotate this plot

library(tidyverse)
theme_set(theme_classic())
ggplot(d, aes(x, likelihood, color = factor(sd))) + geom_line()

plot of chunk basic_plot


ggplot(d, aes(x, likelihood, color = factor(sd))) + 
	geom_line() +
	annotate("text", x = 0, y = 0.41, label = "sigma == 1", parse = TRUE, 
		color = "#F8766D") +
	annotate("text", x = 0, y = 0.22, label = "sigma == 2", parse = TRUE, 
		color = "#00BA38") +
	annotate("text", x = 0, y = 0.15, label = "sigma == 3", parse = TRUE, 
		color = "#619CFF")

plot of chunk plot_annotate


Base plot annotations

  • Use expression with no quotes
splt <- split(d, d$sd)

plot(splt[[1]]$x, splt[[1]]$likelihood, 
	type = "l", 
	col = "#F8766D", 
	bty = "n", 
	xlab = "data", 
	ylab = "likelihood", 
	lwd = 2)
lines(splt[[2]]$x, splt[[2]]$likelihood, col = "#00BA38", lwd = 2)
lines(splt[[3]]$x, splt[[3]]$likelihood, col = "#619CFF", lwd = 2)

text(0, 0.41, expression(sigma == 1), col = "#F8766D")
text(0, 0.21, expression(sigma == 2), col = "#00BA38")
text(0, 0.14, expression(sigma == 3), col = "#619CFF")

plot of chunk expression_noecho


Escaping special characters

  • If you want the literal text to show up, instead of the symbol, you have to escape it it \.
  • Because \ itself is a special character, that means you need two: \\.
show_symbols <- c("\\u03B1", "\\u03B2", "\\u03B3", "\\u03B4", "\\u03B5", "\\u03B6")
show_symbols
## [1] "\\u03B1" "\\u03B2" "\\u03B3" "\\u03B4" "\\u03B5" "\\u03B6"
  • There are also a host of regular expressions that have special characters that you’ll need to escape if you want them to print too. We’ll talk more about these later.