Lesser Known R Features

A collection of lesser known R tricks and features.

built in constants

R has a small number of built-in numeric constants, including Inf and pi. There are also several useful vectors of often-used names and abbreviations, which include letters, month names, and various information about the United States.

letters          ##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
                 ## [14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"


LETTERS          ##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
                 ## [14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"


month.name       ## [1] "January"   "February"  "March"     "April"
                 ## [5] "May"       "June"      "July"      "August"
                 ## [9] "September" "October"   "November"  "December"

month.abb        ##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep"
                 ## [10] "Oct" "Nov" "Dec"


state.name       ## [1] "Alabama"        "Alaska"         "Arizona"
                 ## [4] "Arkansas"       "California"     "Colorado"
                 ## [7] "Connecticut"    "Delaware"       "Florida"
                 ## ................................................

state.abb        ##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI"
                 ## [12] "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI"
                 ## [23] "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC"
                 ## ...........................................................

Also available: state.region, state.division, state.area and state.center.
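
As a small illustration of how these constants combine, the sketch below builds a lookup from month abbreviations to full names and uses one state vector to index another. The names month_lookup and west are made up for this example.

```r
# lookup table: abbreviation -> full month name (month_lookup is an illustrative name)
month_lookup <- setNames(month.name, month.abb)
month_lookup[c("Mar", "Oct")]

# the state.* vectors are aligned, so one can be used to index another
west <- state.name[state.region == "West"]
head(west, 3)

# number of states in each region
table(state.region)
```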

initiating a matrix

Creating a placeholder matrix that later gets filled in is a recurring procedure. Below are several ways to create differently pre-filled 3x3 matrices.

matrix(1, 3, 3)   ##      [,1] [,2] [,3]
                  ## [1,]    1    1    1
                  ## [2,]    1    1    1
                  ## [3,]    1    1    1


mat.or.vec(3, 3)  ##      [,1] [,2] [,3]
                  ## [1,]    0    0    0
                  ## [2,]    0    0    0
                  ## [3,]    0    0    0


diag(3)           ##      [,1] [,2] [,3]
                  ## [1,]    1    0    0
                  ## [2,]    0    1    0
                  ## [3,]    0    0    1


.row(c(3,3))      ##      [,1] [,2] [,3]
                  ## [1,]    1    1    1
                  ## [2,]    2    2    2
                  ## [3,]    3    3    3


.col(c(3,3))      ##      [,1] [,2] [,3]
                  ## [1,]    1    2    3
                  ## [2,]    1    2    3
                  ## [3,]    1    2    3


1:3 %o% 1:3       ##      [,1] [,2] [,3]
                  ## [1,]    1    2    3
                  ## [2,]    2    4    6
                  ## [3,]    3    6    9
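
The %o% operator in the last example is shorthand for outer() with multiplication. The sketch below, with the made-up names add_tab and placeholder, shows that outer() accepts other vectorised functions as well, and adds an NA-filled matrix as one more possible starting point.

```r
# %o% is shorthand for outer() with multiplication
identical(1:3 %o% 1:3, outer(1:3, 1:3))

# any vectorised function can be supplied instead, here addition
add_tab <- outer(1:3, 1:3, FUN = "+")
add_tab

# an NA-filled placeholder makes unfilled cells easy to spot later
placeholder <- matrix(NA_real_, nrow = 3, ncol = 3)
```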

matrix element names

Each element inside a matrix can have its own name, and those names can be used for selecting matrix elements.

x <- matrix(1:9, ncol=3)
names(x) <- paste0("e", 1:9)

x         ##      [,1] [,2] [,3]
          ## [1,]    1    4    7
          ## [2,]    2    5    8
          ## [3,]    3    6    9
          ## attr(,"names")
          ## [1] "e1" "e2" "e3" "e4" "e5" "e6" "e7" "e8" "e9"


x["e3"]   ## e3
          ##  3

array index format

Indices from a matrix can be obtained in a <row, column> form. And this special format can also be used to select elements from a matrix.

x <- matrix(1:6, nrow=2)           ##      [,1] [,2] [,3]
                                   ## [1,]    1    3    5
                                   ## [2,]    2    4    6

which(x > 3, arr.ind=TRUE)         ##      row col
                                   ## [1,]   2   2
                                   ## [2,]   1   3
                                   ## [3,]   2   3



inds <- rbind(c(1,2), c(2,1))      ##      [,1] [,2]
                                   ## [1,]    1    2
                                   ## [2,]    2    1

x[inds]                            ## [1] 3 2
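
The two steps can be chained: the index matrix returned by which(..., arr.ind=TRUE) can be passed straight back to the matrix, for reading as well as for assignment. A minimal sketch, where idx and vals are illustrative names:

```r
x <- matrix(1:6, nrow = 2)

idx <- which(x > 3, arr.ind = TRUE)   # two-column matrix of (row, col) pairs
vals <- x[idx]                        # values at those positions
vals

x[idx] <- 0L                          # the same index matrix works for assignment
x
```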

elements in a nested list

A single vector of indices can be used instead of multiple subset operations to select an element from a nested list.

a <- list(list(list(list("element"))))


a[[1]]                   ## [[1]]
                         ## [[1]][[1]]
                         ## [[1]][[1]][[1]]
                         ## [1] "element"


a[[1]][[1]][[1]][[1]]    ## [1] "element"


a[[c(1,1,1,1)]]          ## [1] "element"
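
The same recursive indexing also accepts names. The sketch below uses a hypothetical configuration list cfg; the element names are invented for illustration.

```r
# hypothetical nested configuration list
cfg <- list(db = list(host = "localhost", port = 5432))

cfg[[c("db", "port")]]           # same as cfg[["db"]][["port"]]

cfg[[c("db", "port")]] <- 5433   # recursive indexing also works for assignment
cfg$db$port
```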

matrix of lists

A matrix can hold elements of various classes. Below is an example of a matrix of data frames.

mat <- matrix(list(iris, mtcars, USArrests, chickwts), ncol=2)

mat           ##      [,1]    [,2]
              ## [1,] List,5  List,4
              ## [2,] List,11 List,2


mat[[2,2]]    ##   weight      feed
              ## 1    179 horsebean
              ## 2    160 horsebean
              ## 3    136 horsebean
              ## 4    227 horsebean
              ## 5    217 horsebean
              ## 6    168 horsebean
              ## ..................

means of rows and columns

Taking means of the rows or columns of a matrix is a frequently repeated operation. R also has handy functions for doing this on a flattened matrix (a plain vector), given that the dimensions are known.

mat <- matrix(round(rnorm(4), 2), ncol=2)
vec <- as.numeric(mat)


mat                         ##      [,1]  [,2]
                            ## [1,] 2.19  0.03
                            ## [2,] 0.76 -0.41

vec                         ## [1]  2.19  0.76  0.03 -0.41


colMeans(mat)               ## [1] 1.475 -0.190

.colMeans(vec, m=2, n=2)    ## [1] 1.475 -0.190

Equivalents also exist for the other summaries: .rowMeans, .colSums and .rowSums.
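
A quick check of two of those equivalents, using the same values as the example above but entered by hand so the result is reproducible:

```r
mat <- matrix(c(2.19, 0.76, 0.03, -0.41), ncol = 2)
vec <- as.numeric(mat)

# vec is treated as a 2 x 2 matrix stored column by column
.rowMeans(vec, m = 2, n = 2)
rowMeans(mat)

.colSums(vec, m = 2, n = 2)
colSums(mat)
```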

split / unsplit

split() and unsplit() are a somewhat convenient way to do split-apply-combine tasks in base R. During this procedure the data frame is first split into a list of data frames, one for each group. Then a function is applied to every data frame in the list. Finally the list is recombined into a single data frame.

dfs <- split(iris, iris$Species)
dfs <- lapply(dfs, transform, Sepal.Length=as.vector(scale(Sepal.Length)))
dfs <- unsplit(dfs, iris$Species)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   0.26667447         3.5          1.4         0.2  setosa
## 2  -0.30071802         3.0          1.4         0.2  setosa
## 3  -0.86811050         3.2          1.3         0.2  setosa
## 4  -1.15180675         3.1          1.5         0.2  setosa
## 5  -0.01702177         3.6          1.4         0.2  setosa
## 6   1.11776320         3.9          1.7         0.4  setosa
## ...........................................................

However, it is possible to do all of this with a single call to the replacement function split<-:

df <- iris
split(df$Sepal.Length, df$Species) <- tapply(df$Sepal.Length, df$Species, scale)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   0.26667447         3.5          1.4         0.2  setosa
## 2  -0.30071802         3.0          1.4         0.2  setosa
## 3  -0.86811050         3.2          1.3         0.2  setosa
## 4  -1.15180675         3.1          1.5         0.2  setosa
## 5  -0.01702177         3.6          1.4         0.2  setosa
## 6   1.11776320         3.9          1.7         0.4  setosa
## ...........................................................

Or for all the columns in one go[1]:

df <- iris
split(df[,1:4], df$Species) <- Map(scale, split(df[,1:4], df$Species))

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   0.26667447   0.1899414   -0.3570112  -0.4364923  setosa
## 2  -0.30071802  -1.1290958   -0.3570112  -0.4364923  setosa
## 3  -0.86811050  -0.6014810   -0.9328358  -0.4364923  setosa
## 4  -1.15180675  -0.8652884    0.2188133  -0.4364923  setosa
## 5  -0.01702177   0.4537488   -0.3570112  -0.4364923  setosa
## 6   1.11776320   1.2451711    1.3704625   1.4613004  setosa
## ...........................................................
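
The assignment form of split() may be easier to follow on a plain vector. The sketch below centers a small made-up vector within each group; x and g are illustrative names.

```r
x <- c(1, 2, 3, 10, 20, 30)
g <- rep(c("a", "b"), each = 3)

# split x by group, center each piece, and write the pieces back in place
split(x, g) <- lapply(split(x, g), function(v) v - mean(v))
x
```

The pieces on the right-hand side are matched to the groups by position, so they should be in the same order that split(x, g) produces.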

approximate pattern matching

grep() is an often-used function for finding strings that match a specified pattern. There is also agrep(), which allows approximate matching that tolerates mistakes.

agrep("Nortx", state.name, value=TRUE)
## [1] "North Carolina" "North Dakota"
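
How many mistakes are tolerated is controlled by the max.distance argument, and ignore.case relaxes the match further. A hedged sketch of both follows; the test strings are made up, and the exact matches depend on the edit-distance rules described in ?agrep.

```r
# max.distance = 0 tolerates no mistakes, so the misspelled pattern finds nothing
strict <- agrep("Nortx", state.name, max.distance = 0, value = TRUE)
strict

# allow up to 2 edits (insertions, deletions or substitutions)
loose <- agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2)
loose

# the same search, ignoring case
loose_ic <- agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2,
                  ignore.case = TRUE)
loose_ic
```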

repeating expressions

Taking an average of 10 random numbers 10 times can be done with a for loop or, perhaps more elegantly, with an sapply() call. However, R also has a dedicated function for exactly this kind of task: replicate(). (The three outputs below match only when the same random seed is set before each run.)

res <- numeric(10)
for(i in 1:10) {
  res[i] <- mean(rnorm(10))
}

res
## [1] -0.273649279  0.403281624 -0.296831611  0.064649686 -0.079948623
## [6]  0.015131656  0.033084290  0.152712287  0.040878931  0.007198655


sapply(1:10, function(x) mean(rnorm(10)))
## [1] -0.273649279  0.403281624 -0.296831611  0.064649686 -0.079948623
## [6]  0.015131656  0.033084290  0.152712287  0.040878931  0.007198655


replicate(10, mean(rnorm(10)))
## [1] -0.273649279  0.403281624 -0.296831611  0.064649686 -0.079948623
## [6]  0.015131656  0.033084290  0.152712287  0.040878931  0.007198655
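
The identical outputs shown above presuppose the same random seed before each variant. A minimal sketch that makes this explicit; the seed value 42 and the res_* names are arbitrary choices for this example.

```r
set.seed(42)
res_loop <- numeric(10)
for (i in 1:10) res_loop[i] <- mean(rnorm(10))

set.seed(42)
res_sapply <- sapply(1:10, function(i) mean(rnorm(10)))

set.seed(42)
res_rep <- replicate(10, mean(rnorm(10)))

# all three approaches draw the same random numbers in the same order
identical(res_loop, res_sapply) && identical(res_sapply, res_rep)
```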

obtaining combinations

Starting with 5 letters, how many different 2-letter combinations can be obtained, if order does not matter and without repeats?

choose(5, 2)
## [1] 10

Here they are:

combn(letters[1:5], 2)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] "a"  "a"  "a"  "a"  "b"  "b"  "b"  "c"  "c"  "d"
## [2,] "b"  "c"  "d"  "e"  "c"  "d"  "e"  "d"  "e"  "e"

Applying a function to each combination:

combn(letters[1:5], 2, FUN=function(x) paste(x, collapse="+"))
## [1] "a+b" "a+c" "a+d" "a+e" "b+c" "b+d" "b+e" "c+d" "c+e" "d+e"

And if the order does matter and repeats are allowed:

expand.grid(letters[1:5], letters[1:5])
##   Var1 Var2
## 1    a    a
## 2    b    a
## 3    c    a
## 4    d    a
## 5    e    a
## 6    a    b
## ...........
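
The remaining case, where order matters but repeats are not allowed, can be sketched by filtering the expand.grid() result. The column names first and second are chosen for this example only.

```r
pairs <- expand.grid(first = letters[1:5], second = letters[1:5],
                     stringsAsFactors = FALSE)
nrow(pairs)                                     # 5 * 5 ordered pairs, repeats included

# drop the pairs where both letters are the same
no_rep <- pairs[pairs$first != pairs$second, ]
nrow(no_rep)                                    # 5 * 4 ordered pairs without repeats
```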

changing values to NA

In a vector of 5 numbers, the typical way to change all values above 3 to NA is shown first. The second form, which assigns to is.na(), is a rarely used alternative.

x <- sample(5)       ## [1] 5 1 2 3 4
y <- x               # keep a copy for the second approach


x[x > 3] <- NA
x                    ## [1] NA  1  2  3 NA


is.na(y) <- y > 3
y                    ## [1] NA  1  2  3 NA

assigning operators

The possibility of creating custom infix operators with the %...% syntax is well known. Here is an example of an operator that does the opposite of %in%:

`%out%` <- function(x, y) !(x %in% y)

LETTERS[LETTERS %out% c("A", "E", "I", "O", "U")]
##  [1] "B" "C" "D" "F" "G" "H" "J" "K" "L" "M" "N"
## [12] "P" "Q" "R" "S" "T" "V" "W" "X" "Y" "Z"

It is also possible to create a custom replacement function, similar to names(x)<-. As an example, here is a function that replaces the first element of a vector.

`first<-` <- function(x, value) c(value, x[-1])

x <- 1:10
first(x) <- 0

x
## [1]  0  2  3  4  5  6  7  8  9 10
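
It may help to see what such a call roughly expands to. The sketch below spells out the equivalent explicit call and adds a variant with an extra argument; `nth<-` is a made-up name for this illustration.

```r
`first<-` <- function(x, value) c(value, x[-1])

x <- 1:10
first(x) <- 0                       # shorthand for the explicit call below

y <- 1:10
y <- `first<-`(y, value = 0)        # roughly what R evaluates behind the scenes
same <- identical(x, y)
same

# extra arguments go between the object and `value`
`nth<-` <- function(x, n, value) { x[n] <- value; x }
nth(y, 3) <- 99
y
```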

However, a more surprising construct is a combination of the two. Here is an example of a function that replaces all elements falling outside of a specified set.

`%out%<-` <- function(x, y, value) {x[!(x %in% y)] <- value; x}

x <- 1:10
x %out% c(4,5,6,7) <- 0

x
## [1] 0 0 0 4 5 6 7 0 0 0

Maybe even more surprising is that this can be used on standard operators (those without %...%). Below is a function that modifies the first argument of a product so that the product is equal to the given value.

`*<-` <- function(x, y, value) x*value/(x*y)

x <- 5
y <- 2

x * y        ## [1] 10

x * y <- 1
x            ## [1] 0.5

x * y        ## [1] 1

And here is an even bigger contraption - assignment from both sides:

`<-<-` <- function(x, y, value) x <- paste0(y, "_", value)

"start" -> x <- "end"

x
## [1] "start_end"

multiple linear regressions

A somewhat hidden feature of lm() is that it accepts Y as a matrix and fits a regression for each column separately. This is also a lot faster than making a separate lm() call for each column.

Below, each numeric variable in the iris dataset is regressed against Species, which estimates the coefficients of 4 separate linear models.

lm(data.matrix(iris[,-5]) ~ iris$Species)

## Call:
## lm(formula = data.matrix(iris[, -5]) ~ iris$Species)
##
## Coefficients:
##                         Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
## (Intercept)              5.006         3.428        1.462         0.246
## iris$Speciesversicolor   0.930        -0.658        2.798         1.080
## iris$Speciesvirginica    1.582        -0.454        4.090         1.780
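
One way to convince yourself that the columns really are fitted independently is to compare one column of the joint fit with an ordinary single-response fit. A minimal sketch, where Y, fit_all and fit_one are illustrative names:

```r
Y <- data.matrix(iris[, -5])

fit_all <- lm(Y ~ Species, data = iris)             # one call, four responses
fit_one <- lm(Sepal.Length ~ Species, data = iris)  # ordinary single-response fit

class(fit_all)                                      # a multi-response "mlm" object

# coefficients for Sepal.Length agree between the two fits
all.equal(unname(coef(fit_all)[, "Sepal.Length"]),
          unname(coef(fit_one)))
```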

color palette

R has over 650 named colors. Here are 20 random colors from that list:

sample(colors(), 20)

##  [1] "grey56"         "pink3"          "lightslategrey" "grey94"
##  [5] "royalblue1"     "grey39"         "grey97"         "gray94"
##  [9] "gray80"         "seashell1"      "turquoise3"     "rosybrown1"
## [13] "cyan"           "gray2"          "sienna1"        "gray25"
## [17] "grey99"         "green4"         "deepskyblue3"   "gray99"

palette() allows changing the colors represented by numbers.

palette(c("cornflowerblue", "orange", "limegreen", "pink", "purple", "grey"))
pie(table(chickwts$feed), col=1:6)

And to restore the colors:

palette("default")
pie(table(chickwts$feed), col=1:6)

color interpolation

Sometimes it is necessary to color points by the value of a numeric variable. For this purpose colorRamp() creates a function that interpolates a given set of colors over the [0,1] interval. A color corresponding to any number between 0 and 1 can then be obtained[2].

pal <- colorRamp(c("blue", "green", "orange", "red"))

rgb(pal(0.5), max=255)
## [1] "#7FD200"

And here it is used to color the points by horse power:

# first - transform hp to a range 0-1
hp01 <- (mtcars$hp - min(mtcars$hp)) / diff(range(mtcars$hp))

plot(mtcars$hp, mtcars$mpg, pch=19, col=rgb(pal(hp01), max=255))
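
When only a fixed number of evenly spaced colors is needed, the related colorRampPalette() may be more convenient, since the function it returns takes a count and gives back hex strings directly. A short sketch with the illustrative names pal_hex and cols:

```r
pal_hex <- colorRampPalette(c("blue", "green", "orange", "red"))

cols <- pal_hex(5)      # five evenly spaced colors as hex strings
cols[c(1, 5)]           # the endpoints: pure blue and pure red
```

For coloring by a continuous variable, as in the horse power plot, colorRamp() with rgb() remains the more direct route.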

screens

Sometimes it is convenient to place a plot within a plot. One way to achieve this is with split.screen():

figs <- rbind(c(0.0, 1.0, 0.0, 1.0),
              c(0.3, 0.5, 0.6, 0.8)
              )
screenIDs <- split.screen(figs)

screen(screenIDs[1])
barplot(1:10, col="lightslategrey")

screen(screenIDs[2])
par(mar=c(0,0,0,0))
pie(1:5)

hooks

Hooks are a mechanism for injecting a function after a certain action takes place. They are sparsely used within R. The plot.new hook[3] is used here for demonstration.

This hook allows the user to insert an action at the end of the plot.new() function. Here it is used to add a date stamp to every created plot.

setHook("plot.new", function() {mtext(Sys.Date(), 3, adj=1, xpd=TRUE)}, "append")

Now all plots should have a date:

par(mfrow=c(1,2))

plot(density(iris$Sepal.Width), lwd=2, col="lightslategrey", main="density")
pie(table(mtcars$gear))
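
A hook stays registered for the rest of the session, so it is worth knowing how to inspect and remove it. A minimal sketch using getHook() and setHook(); stamp, n_hooks and n_after are names chosen for this example.

```r
# stamp() is an illustrative name for the hook function
stamp <- function() mtext(format(Sys.Date()), side = 3, adj = 1, xpd = TRUE)

setHook("plot.new", stamp, "append")
n_hooks <- length(getHook("plot.new"))      # hooks currently registered

setHook("plot.new", NULL, "replace")        # drop every plot.new hook
n_after <- length(getHook("plot.new"))
```

Note that replacing with NULL removes all user hooks for plot.new, including any registered by other code in the session.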

the dollar operator

The dollar operator $ is used to select elements from a list by name. However, it is a generic function and its behaviour can be redefined.

Here $ is rewritten to select rows, instead of columns, from data frames[4]:

`$.data.frame` <- function(x, name) {x[rownames(x)==name,]}

USArrests$Utah
##      Murder Assault UrbanPop Rape
## Utah    3.2     120       80 22.9

Auto-completion after pressing tab can also be added by rewriting the .DollarNames method:

.DollarNames.data.frame <- function(x, pattern="") {
  grep(pattern, rownames(x), value=TRUE)
}


> USArrests$A <tab>
...labama   ...laska    ...rizona   ...rkansas

To add more weirdness, tab auto-completion can be made to auto-correct row name mistakes:

.DollarNames.data.frame <- function(x, pattern="") {
  agrep(pattern, rownames(x), value=TRUE, max.distance=0.25)
}

> USArrests$Kali <tab>
> USArrests$California
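
A safer way to experiment with the same idea is to attach the behaviour to a custom S3 class instead of data.frame. The sketch below wraps a data frame in a hypothetical class called rowview; the class name and the constructor are invented for this example.

```r
# hypothetical class "rowview" that wraps a data frame
rowview <- function(df) structure(list(data = df), class = "rowview")

# `$` selects rows by name for this class only
`$.rowview` <- function(x, name) {
  df <- unclass(x)$data              # unclass() avoids dispatching back to this method
  df[rownames(df) == name, ]
}

# tab completion candidates for rowview objects
.DollarNames.rowview <- function(x, pattern = "") {
  grep(pattern, rownames(unclass(x)$data), value = TRUE)
}

arrests <- rowview(USArrests)
arrests$Utah
```

Data frames themselves keep their usual $ behaviour, so base R functions that rely on it are not affected.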

  1. This works because scale() by default scales each column separately.

  2. Sadly, the result first has to be converted to an acceptable format using rgb().

  3. This mechanism is used and abused in the basetheme package.

  4. The dollar operator is a nice way to implement element selection for custom S3 classes. But do not change the dollar behaviour for data frames, as it is used in a lot of base R functions.