R Grouping functions: sapply vs. lapply vs. apply vs. tapply vs. by vs. aggregate

R has many *apply functions, which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning users may have difficulty deciding which one is appropriate for their situation, or even remembering them all. They may have a general sense that “I should be using an *apply function here”, but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new users to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames, as it will coerce to a matrix first.
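As a rough sketch of what that looks like in practice (the matrix and array below are made-up illustrative values, and the variable names are just for this example):

```r
# Illustrative 4 x 4 matrix, filled column by column
M <- matrix(1:16, nrow = 4, ncol = 4)

# MARGIN = 1 applies the function to each row
row_min <- apply(M, 1, min)    # 1 2 3 4

# MARGIN = 2 applies the function to each column
col_max <- apply(M, 2, max)    # 4 8 12 16

# Higher-dimensional analogue: a 4 x 4 x 2 array
A <- array(1:32, dim = c(4, 4, 2))

# Sum over the 2nd and 3rd dimensions, one value per index of the 1st
row_total <- apply(A, 1, sum)  # 120 128 136 144
```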

 

If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums and rowSums.
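A small sketch, using an illustrative matrix, of how these shortcuts line up with the equivalent apply calls:

```r
# Illustrative 2 x 3 matrix: rows are (1, 3, 5) and (2, 4, 6)
M <- matrix(1:6, nrow = 2)

rs <- rowSums(M)    # 9 12
cm <- colMeans(M)   # 1.5 3.5 5.5

# The same numbers via apply, just without the optimized code path
rs_apply <- apply(M, 1, sum)
cm_apply <- apply(M, 2, mean)
```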

lapply - When you want to apply a function to each element of a list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.
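For instance, a minimal sketch with a made-up list (the names x, lens and sums are only for illustration):

```r
# Illustrative list whose elements have different lengths
x <- list(a = 1, b = 1:3, c = 10:100)

# One result per element, returned as a list with the same names
lens <- lapply(x, length)   # a: 1, b: 3, c: 91
sums <- lapply(x, sum)      # a: 1, b: 6, c: 5005
```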

sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
If you find yourself typing unlist(lapply(...)), stop and consider sapply.
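A quick sketch of the difference, again with a made-up list:

```r
# Illustrative list whose elements have different lengths
x <- list(a = 1, b = 1:3, c = 10:100)

# sapply simplifies the per-element results to a named vector
lens <- sapply(x, length)               # a = 1, b = 3, c = 91

# The long way round that sapply saves you from typing
lens_long <- unlist(lapply(x, length))
```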

In more advanced uses, sapply will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
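A minimal sketch of that case; the inputs and the anonymous function are made up for illustration:

```r
# Each call returns a length-2 vector, so the results become columns
m <- sapply(1:3, function(i) c(i, i^2))

dim(m)    # 2 3
m[, 3]    # 3 9
```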

If our function returns a 2-dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
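A sketch of that flattening, with an illustrative function that returns a 2 x 2 matrix:

```r
# Each call returns a 2 x 2 matrix filled with i
flat <- sapply(1:3, function(i) matrix(i, 2, 2))

# Each matrix has been flattened into one column of length 4
dim(flat)    # 4 3
flat[, 2]    # 2 2 2 2
```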

Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
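The same illustrative function as a sketch, this time keeping the matrix shape:

```r
# Each 2 x 2 matrix becomes one slice along a third dimension
arr <- sapply(1:3, function(i) matrix(i, 2, 2), simplify = "array")

dim(arr)      # 2 2 3
arr[, , 3]    # a 2 x 2 matrix of 3s
```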

Each of these behaviours is of course contingent on our function returning vectors or matrices of the same length or dimension.

vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.
For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
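A minimal sketch with a made-up list. The template is passed as FUN.VALUE; as a side effect of declaring it, vapply raises an error when a result does not match the template, which the second call demonstrates:

```r
# Illustrative list whose elements have different lengths
x <- list(a = 1, b = 1:3, c = 10:100)

# FUN.VALUE is the "example" result: here, a single integer per element
lens <- vapply(x, length, FUN.VALUE = integer(1))   # a = 1, b = 3, c = 91

# A template that does not match what length() returns triggers an error
bad <- tryCatch(
  vapply(x, length, FUN.VALUE = character(1)),
  error = function(e) "error"
)
```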

mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.
This is multivariate in the sense that your function must accept multiple arguments.
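Two short sketches with illustrative inputs: one where the results simplify to a vector, and one where they cannot:

```r
# sum(1, 1, 1), sum(2, 2, 2), ... simplified to a vector
sums <- mapply(sum, 1:5, 1:5, 1:5)   # 3 6 9 12 15

# rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1): the lengths differ,
# so the result stays a list
reps <- mapply(rep, 1:4, 4:1)
```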

Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
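A sketch using the same kind of illustrative input as you might pass to mapply; note that the result stays a list even though it could be simplified:

```r
# Same element-wise sums, but never simplified to a vector
sums <- Map(sum, 1:5, 1:5, 1:5)

is.list(sums)   # TRUE
unlist(sums)    # 3 6 9 12 15
```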

rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I’m sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:
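A sketch along those lines; the function name bump and the nested list are illustrative choices, not anything built into R:

```r
# Hypothetical helper: append "!" to strings, otherwise add 1
bump <- function(x) {
  if (is.character(x)) paste0(x, "!") else x + 1
}

# Illustrative nested list structure
nested <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
               b = 3,
               c = "Yikes",
               d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Default how = "unlist": a named vector, coerced to character
flat <- rapply(nested, bump)

# how = "replace": a nested list shaped like the input, values altered
same_shape <- rapply(nested, bump, how = "replace")
```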

tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file’s use of the phrase “ragged array” can be a bit confusing, but it is actually quite simple.
Start with a vector x and a factor y (of the same length!) that defines groups; you can then add up the values in x within each subgroup defined by y.
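A minimal sketch of those steps, with illustrative values for x and y:

```r
# A vector
x <- 1:20

# A factor (of the same length!) defining five groups of four
y <- factor(rep(letters[1:5], each = 4))

# Add up the values in x within each subgroup defined by y
group_sums <- tapply(x, y, sum)
#  a  b  c  d  e
# 10 26 42 58 74
```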

More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
