A collection of lesser known R tricks and features.
R has a small number of built-in numeric constants, including Inf and pi.
There are also several useful lists of often-used names and abbreviations, including letters, month names, and various information about the United States.
letters ## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
## [14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS ## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
## [14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
month.name ## [1] "January" "February" "March" "April"
## [5] "May" "June" "July" "August"
## [9] "September" "October" "November" "December"
month.abb ## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep"
## [10] "Oct" "Nov" "Dec"
state.name ## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "Florida"
## ................................................
state.abb ## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI"
## [12] "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI"
## [23] "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC"
## ...........................................................
Also available: state.region, state.division, state.area and state.center.
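Because these constants are parallel vectors, one can serve as a lookup table for another. A minimal sketch (the name abb_of is invented for this illustration and is not part of R):

```r
# Named vector: state names as names, abbreviations as values
abb_of <- setNames(state.abb, state.name)
abb_of[c("Utah", "Texas")]

# month.abb and month.name line up position by position in the same way
month.name[match("Sep", month.abb)]
```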
Creating a placeholder matrix that later gets filled in is a recurring procedure. Below are several ways to create differently prepared 3x3 matrices.
matrix(1, 3, 3) ## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
## [3,] 1 1 1
mat.or.vec(3, 3) ## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
diag(3) ## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
.row(c(3,3)) ## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 3 3 3
.col(c(3,3)) ## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 2 3
## [3,] 1 2 3
1:3 %o% 1:3 ## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 2 4 6
## [3,] 3 6 9
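For completeness, here is a rough sketch of the "fill it in later" step itself. It assumes an NA-filled placeholder (so that cells that were never written stay visible) and a toy fill rule, i * j, chosen only for illustration:

```r
# Placeholder filled with NA, so untouched cells are easy to spot
res <- matrix(NA_real_, nrow = 3, ncol = 3)

# Fill each cell; the rule i * j is just an example
for (i in 1:3) {
  for (j in 1:3) {
    res[i, j] <- i * j
  }
}

# Compare against the outer product shown above
all.equal(res, 1:3 %o% 1:3)
```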
Each element inside the matrix can have its own name. And those names can be used for selecting matrix elements.
x <- matrix(1:9, ncol=3)
names(x) <- paste0("e", 1:9)
x ## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## attr(,"names")
## [1] "e1" "e2" "e3" "e4" "e5" "e6" "e7" "e8" "e9"
x["e3"] ## e3
## 3
Indices from a matrix can be obtained in a <row, column> form.
And this special format can also be used to select elements from a matrix.
x <- matrix(1:6, nrow=2) ## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
which(x > 3, arr.ind=TRUE) ## row col
## [1,] 2 2
## [2,] 1 3
## [3,] 2 3
inds <- rbind(c(1,2), c(2,1)) ## [,1] [,2]
## [1,] 1 2
## [2,] 2 1
x[inds] ## [1] 3 2
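The same two-column index matrix also works on the left-hand side of an assignment. A small sketch, reusing the matrix from above (the variable name pos is arbitrary):

```r
x <- matrix(1:6, nrow = 2)

# Two-column matrix of (row, column) positions where x > 3
pos <- which(x > 3, arr.ind = TRUE)

# Overwrite exactly those cells
x[pos] <- 0L
x
```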
A single vector of indices, instead of multiple subset operations, can be used to select an element from a nested list.
a <- list(list(list(list("element"))))
a[[1]] ## [[1]]
## [[1]][[1]]
## [[1]][[1]][[1]]
## [1] "element"
a[[1]][[1]][[1]][[1]] ## [1] "element"
a[[c(1,1,1,1)]] ## [1] "element"
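The recursive index can also be used for replacement. A short sketch, with the replacement value "changed" picked arbitrarily:

```r
a <- list(list(list(list("element"))))

# Recursive index on the left-hand side replaces the innermost value
a[[c(1, 1, 1, 1)]] <- "changed"
a[[1]][[1]][[1]][[1]]
```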
A matrix can contain elements of various classes. Below is an example of a matrix of data frames.
mat <- matrix(list(iris, mtcars, USArrests, chickwts), ncol=2)
mat ## [,1] [,2]
## [1,] List,5 List,4
## [2,] List,11 List,2
mat[[2,2]] ## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
## ..................
Taking means of rows or columns of a matrix is an often repeated operation, usually done with rowMeans() and colMeans(). But R also has handy functions for repeating this operation on a flattened matrix, given that the dimensions are known.
mat <- matrix(round(rnorm(4), 2), ncol=2)
vec <- as.numeric(mat)
mat ## [,1] [,2]
## [1,] 2.19 0.03
## [2,] 0.76 -0.41
vec ## [1] 2.19 0.76 0.03 -0.41
colMeans(mat) ## [1] 1.475 -0.190
.colMeans(vec, m=2, n=2) ## [1] 1.475 -0.190
Equivalents also exist: .rowMeans, .colSums and .rowSums.
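As a quick check of one of these equivalents, the sketch below compares .rowSums on a flattened vector with rowSums on the matrix it came from. The 2x3 matrix is made up for the example:

```r
mat <- matrix(1:6, nrow = 2)
vec <- as.numeric(mat)

# m = number of rows, n = number of columns of the original matrix
flat_sums <- .rowSums(vec, m = 2, n = 3)
flat_sums
rowSums(mat)
```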
split() and unsplit() are a somewhat convenient way to do split-apply-combine tasks in base R.
During this procedure the data frame is first split into a list of data frames, one for each group.
Then a function is applied to every data frame in the list.
And finally the list is recombined into a single data frame.
dfs <- split(iris, iris$Species)
dfs <- lapply(dfs, transform, Sepal.Length=as.vector(scale(Sepal.Length)))
dfs <- unsplit(dfs, iris$Species)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 0.26667447 3.5 1.4 0.2 setosa
## 2 -0.30071802 3.0 1.4 0.2 setosa
## 3 -0.86811050 3.2 1.3 0.2 setosa
## 4 -1.15180675 3.1 1.5 0.2 setosa
## 5 -0.01702177 3.6 1.4 0.2 setosa
## 6 1.11776320 3.9 1.7 0.4 setosa
## ...........................................................
However, it is possible to do all of this with a single call to the split()<- replacement function:
df <- iris
split(df$Sepal.Length, df$Species) <- tapply(df$Sepal.Length, df$Species, scale)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 0.26667447 3.5 1.4 0.2 setosa
## 2 -0.30071802 3.0 1.4 0.2 setosa
## 3 -0.86811050 3.2 1.3 0.2 setosa
## 4 -1.15180675 3.1 1.5 0.2 setosa
## 5 -0.01702177 3.6 1.4 0.2 setosa
## 6 1.11776320 3.9 1.7 0.4 setosa
## ...........................................................
Or for all the columns in one go [1]:
df <- iris
split(df[,1:4], df$Species) <- Map(scale, split(df[,1:4], df$Species))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 0.26667447 0.1899414 -0.3570112 -0.4364923 setosa
## 2 -0.30071802 -1.1290958 -0.3570112 -0.4364923 setosa
## 3 -0.86811050 -0.6014810 -0.9328358 -0.4364923 setosa
## 4 -1.15180675 -0.8652884 0.2188133 -0.4364923 setosa
## 5 -0.01702177 0.4537488 -0.3570112 -0.4364923 setosa
## 6 1.11776320 1.2451711 1.3704625 1.4613004 setosa
## ...........................................................
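For a single column, base R's ave() offers another route to the same result. It is not mentioned above, so treat the following as a side-by-side sketch rather than part of the original trick:

```r
df <- iris
split(df$Sepal.Length, df$Species) <- tapply(df$Sepal.Length, df$Species, scale)

# ave() applies FUN within each group and returns values in the original order
alt <- ave(iris$Sepal.Length, iris$Species,
           FUN = function(v) as.vector(scale(v)))

all.equal(df$Sepal.Length, alt)
```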
grep() is an often used function to search for strings matching a specified pattern.
But there also exists agrep(), which allows approximate matching that tolerates mistakes.
agrep("Nortx", state.name, value=TRUE)
## [1] "North Carolina" "North Dakota"
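How many mistakes are tolerated is controlled by the max.distance argument of agrep(). A brief sketch; the value 0 is used here only to contrast with the default:

```r
# Default tolerance: the typo in "Nortx" is forgiven
loose <- agrep("Nortx", state.name, value = TRUE)
loose

# max.distance = 0 demands an exact match, so the typo is no longer forgiven
strict <- agrep("Nortx", state.name, value = TRUE, max.distance = 0)
length(strict)
```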
Taking an average of 10 random numbers 10 times can be done with a for loop.
And, perhaps more elegantly, with an sapply statement.
However, R also has a dedicated function, replicate(), just for a task like this.
res <- numeric(10)
for(i in 1:10) {
res[i] <- mean(rnorm(10))
}
res
## [1] -0.273649279 0.403281624 -0.296831611 0.064649686 -0.079948623
## [6] 0.015131656 0.033084290 0.152712287 0.040878931 0.007198655
sapply(1:10, function(x) mean(rnorm(10)))
## [1] -0.273649279 0.403281624 -0.296831611 0.064649686 -0.079948623
## [6] 0.015131656 0.033084290 0.152712287 0.040878931 0.007198655
replicate(10, mean(rnorm(10)))
## [1] -0.273649279 0.403281624 -0.296831611 0.064649686 -0.079948623
## [6] 0.015131656 0.033084290 0.152712287 0.040878931 0.007198655
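The three approaches only print identical numbers when the random number generator is reset before each one. A sketch of that check, with the seed 42 chosen arbitrarily:

```r
set.seed(42)                      # arbitrary seed, for illustration only
by_loop <- numeric(10)
for (i in 1:10) by_loop[i] <- mean(rnorm(10))

set.seed(42)
by_sapply <- sapply(1:10, function(i) mean(rnorm(10)))

set.seed(42)
by_replicate <- replicate(10, mean(rnorm(10)))

# All three draw the same random numbers in the same order
identical(by_loop, by_sapply) && identical(by_sapply, by_replicate)
```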
Starting with 5 letters, how many different 2-letter combinations can be obtained, if order does not matter and without repeats?
choose(5, 2)
## [1] 10
Here they are:
combn(letters[1:5], 2)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] "a" "a" "a" "a" "b" "b" "b" "c" "c" "d"
## [2,] "b" "c" "d" "e" "c" "d" "e" "d" "e" "e"
Applying a function to each combination:
combn(letters[1:5], 2, FUN=function(x) paste(x, collapse="+"))
## [1] "a+b" "a+c" "a+d" "a+e" "b+c" "b+d" "b+e" "c+d" "c+e" "d+e"
And if the order does matter and repeats are allowed:
expand.grid(letters[1:5], letters[1:5])
## Var1 Var2
## 1 a a
## 2 b a
## 3 c a
## 4 d a
## 5 e a
## 6 a b
## ...........
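The remaining case, where order matters but repeats are not allowed, can be obtained by filtering the grid. A minimal sketch; the column names first and second are invented here:

```r
grid <- expand.grid(first = letters[1:5], second = letters[1:5],
                    stringsAsFactors = FALSE)

# Drop the rows where the same letter appears twice
ordered_pairs <- grid[grid$first != grid$second, ]
nrow(ordered_pairs)   # 5 * 4 ordered pairs
```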
In a vector of 5 numbers, the typical way to change all values above 3 to NA is shown first. The second assignment, using is.na()<-, is a rarely used alternative that gives the same result.
x <- sample(5) ## [1] 5 1 2 3 4
y <- x
x[x>3] <- NA
x ## [1] NA 1 2 3 NA
is.na(y) <- y > 3
y ## [1] NA 1 2 3 NA
The possibility of creating custom infix operators using the %...% syntax is well known.
Here is an example of an operator that is the opposite of %in%:
`%out%` <- function(x, y) !(x %in% y)
LETTERS[LETTERS %out% c("A", "E", "I", "O", "U")]
## [1] "B" "C" "D" "F" "G" "H" "J" "K" "L" "M" "N"
## [12] "P" "Q" "R" "S" "T" "V" "W" "X" "Y" "Z"
It is also possible to create a custom assignment function, similar to names(x)<-.
As an example, here is a function that replaces the first element of a vector.
`first<-` <- function(x, value) c(value, x[-1])
x <- 1:10
first(x) <- 0
x
## [1] 0 2 3 4 5 6 7 8 9 10
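It may help to see what the assignment form stands for. As a sketch, first(x) <- 0 behaves like an ordinary call to the backquoted function, with the result assigned back to x:

```r
`first<-` <- function(x, value) c(value, x[-1])

x <- 1:10
# Roughly what first(x) <- 0 is shorthand for
x <- `first<-`(x, value = 0)
x
```

This is also why the last argument of such a function has to be named value.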
However, a more surprising construct is a combination of the two. Here is an example of a function that replaces all elements falling outside of a specified set.
`%out%<-` <- function(x, y, value) {x[!(x %in% y)] <- value; x}
x <- 1:10
x %out% c(4,5,6,7) <- 0
x
## [1] 0 0 0 4 5 6 7 0 0 0
Maybe even more surprising is that this can be used on standard operators (those without %...%).
Below is a function that modifies the first argument of a product so that the product is equal to the given value.
`*<-` <- function(x, y, value) x*value/(x*y)
x <- 5
y <- 2
x * y ## [1] 10
x * y <- 1
x ## [1] 0.5
x * y ## [1] 1
And here is an even bigger contraption - assignment from both sides:
`<-<-` <- function(x, y, value) x <- paste0(y, "_", value)
"start" -> x <- "end"
x
## [1] "start_end"
A somewhat hidden feature of lm() is that it accepts Y in a matrix format and does a regression for each column separately.
Doing it this way is also a lot faster than performing a separate lm() call for each column.
Here is an example of regressing each variable in the iris dataset against Species.
This results in estimating the coefficients of 4 separate linear models.
lm(data.matrix(iris[,-5]) ~ iris$Species)
## Call:
## lm(formula = data.matrix(iris[, -5]) ~ iris$Species)
##
## Coefficients:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## (Intercept) 5.006 3.428 1.462 0.246
## iris$Speciesversicolor 0.930 -0.658 2.798 1.080
## iris$Speciesvirginica 1.582 -0.454 4.090 1.780
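To confirm that each column really gets its own model, one column of the coefficient matrix can be compared with a separate fit. A sketch, using the data argument instead of iris$Species purely for readability:

```r
# One call, four responses
fit_all <- lm(data.matrix(iris[, -5]) ~ Species, data = iris)

# A separate model for just one of the responses
fit_one <- lm(Sepal.Length ~ Species, data = iris)

# The matching column of the coefficient matrix agrees with the single fit
all.equal(unname(coef(fit_all)[, "Sepal.Length"]), unname(coef(fit_one)))
```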
R has over 650 named colors. Here are 20 random colors from that list:
sample(colors(), 20)
## [1] "grey56" "pink3" "lightslategrey" "grey94"
## [5] "royalblue1" "grey39" "grey97" "gray94"
## [9] "gray80" "seashell1" "turquoise3" "rosybrown1"
## [13] "cyan" "gray2" "sienna1" "gray25"
## [17] "grey99" "green4" "deepskyblue3" "gray99"
palette() allows changing the colors represented by numbers.
palette(c("cornflowerblue", "orange", "limegreen", "pink", "purple", "grey"))
pie(table(chickwts$feed), col=1:6)
And to restore the colors:
palette("default")
pie(table(chickwts$feed), col=1:6)
Sometimes it is necessary to color a numeric variable by its value.
For this purpose colorRamp() can create a function that will interpolate a given set of colors over the [0,1] interval.
Then we can obtain a color corresponding to any number between 0 and 1 [2].
pal <- colorRamp(c("blue", "green", "orange", "red"))
rgb(pal(0.5), max=255)
## [1] "#7FD200"
And here it is used to color the points by horsepower:
# first - transform hp to a range 0-1
hp01 <- (mtcars$hp - min(mtcars$hp)) / diff(range(mtcars$hp))
plot(mtcars$hp, mtcars$mpg, pch=19, col=rgb(pal(hp01), max=255))
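A quick way to sanity-check such a ramp is to look at its two ends, which should be the first and last colors it was given. A small sketch reusing the same four colors:

```r
pal <- colorRamp(c("blue", "green", "orange", "red"))

# pal() accepts a vector and returns one RGB row per input value
ends <- rgb(pal(c(0, 1)), max = 255)
ends
## [1] "#0000FF" "#FF0000"
```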
Sometimes it is convenient to place a plot within a plot.
One way to achieve this is with split.screen():
figs <- rbind(c(0.0, 1.0, 0.0, 1.0),
c(0.3, 0.5, 0.6, 0.8)
)
screenIDs <- split.screen(figs)
screen(screenIDs[1])
barplot(1:10, col="lightslategrey")
screen(screenIDs[2])
par(mar=c(0,0,0,0))
pie(1:5)
Hooks are a mechanism for injecting a function after a certain action takes place.
They are rarely used within R.
For the demonstration, the plot.new hook will be used here.
This hook allows the user to insert an action at the end of the plot.new() function.
Here it will be used for adding a date stamp to every created plot.
setHook("plot.new", function() {mtext(Sys.Date(), 3, adj=1, xpd=TRUE)}, "append")
Now all plots should have a date:
par(mfrow=c(1,2))
plot(density(iris$Sepal.Width), lwd=2, col="lightslategrey", main="density")
pie(table(mtcars$gear))
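The hook stays active for the rest of the session, so it is worth knowing how to switch it off. A sketch; note that this clears every user hook registered for plot.new, not only the date stamp:

```r
# Replace the list of plot.new hooks with nothing
setHook("plot.new", NULL, "replace")

# No user hooks should remain for plot.new
length(getHook("plot.new"))
```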
The dollar operator $ is used to select elements from a list by name.
However, it is a generic method and can be modified.
Here is a rewriting of the $ operator to select rows, instead of columns, from data.frames [3]:
`$.data.frame` <- function(x, name) {x[rownames(x)==name,]}
USArrests$Utah
## Murder Assault UrbanPop Rape
## Utah 3.2 120 80 22.9
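A less invasive variant is to attach the row-selecting behaviour to a class of your own, leaving plain data frames untouched. The sketch below assumes a hypothetical wrapper class "byrow" and helper as_byrow(); both names are invented for this illustration:

```r
# Hypothetical wrapper class "byrow"; not part of base R
as_byrow <- function(df) structure(df, class = c("byrow", class(df)))

`$.byrow` <- function(x, name) {
  class(x) <- "data.frame"               # drop the wrapper so base methods apply
  x[rownames(x) == name, , drop = FALSE]
}

arrests <- as_byrow(USArrests)
utah <- arrests$Utah    # row lookup on the wrapped object
utah
head(USArrests$Murder)  # plain data frames keep the usual column lookup
```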
Auto-completion after pressing tab can also be added by rewriting the .DollarNames method:
.DollarNames.data.frame <- function(x, pattern="") {
grep(pattern, rownames(x), value=TRUE)
}
> USArrests$A <tab>
...labama ...laska ...rizona ...rkansas
To add more weirdness, tab autocompletion can be made to auto-correct row name mistakes:
.DollarNames.data.frame <- function(x, pattern="") {
agrep(pattern, rownames(x), value=TRUE, max.distance=0.25)
}
> USArrests$Kali <tab>
> USArrests$California
[1] Scaling all the columns in one go with split()<- works because scale() by default scales each column separately.
[2] Sadly, the output of the colorRamp function first needs to be transformed into an acceptable color format using rgb().
[3] The dollar operator is a nice way to implement element selection for custom S3 classes. But do not change the dollar behaviour for data.frames, as it is used in a lot of base R functions.