@arunsrinivasan
Last active August 29, 2015 14:01
A suggestion on Hadley's point about "Performance", "Premature optimisation" and "vectorise"

Under the section Vectorise (and also briefly mentioned under the section Do as little as possible), one point I think would be worth adding is to be aware of the data structure a vectorised function is implemented for. Using vectorised code without understanding that is also a form of "premature optimisation", IMHO.

For example, consider the case of rowSums on a data.frame. Some issues to consider here are:

  • Memory - rowSums on a data.frame first coerces it to a matrix. With a huge (> 1 GB) data.frame this can be a bad idea: the conversion makes a full copy of the data, which can exhaust memory and push the machine into swapping.

Note: I personally think any discussion of performance should weigh the trade-offs between "speed" and "memory".

  • Data structure - We can do much better in terms of speed (and memory) by taking advantage of the data structure here. Here's an example:
set.seed(1L)
library(data.table)
## 1e6 rows x 100 numeric columns, i.e. about 800 MB of doubles
DF <- as.data.frame(setDT(lapply(1:1e2, function(x) as.numeric(sample(10, 1e6, TRUE)))))

## using vectorised rowSums
system.time(ans1 <- rowSums(DF))
#   user  system elapsed 
#  2.029   1.154   3.660 

## using a simple for-loop that adds the columns one at a time
foo <- function(x) {
    ## skipping checks here just for illustration
    ans <- x[[1L]]
    for (i in seq_len(ncol(x))[-1L]) {
        ans <- ans + x[[i]]   # vectorised addition, no matrix coercion
    }
    ans
}
system.time(ans2 <- foo(DF))
#   user  system elapsed 
#  0.565   0.570   1.172 

identical(ans1, ans2) ## [1] TRUE

The for-loop avoids the coercion (so it does not need twice the memory) and is ~3x faster in elapsed time (1.172s vs 3.660s). By choosing not to use rowSums on a data.frame, we improve performance in terms of both "speed" and "memory".
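
To check the coercion point on your own machine, here is a minimal sketch on a much smaller data.frame. The names `cols`, `DF_small`, `m`, `ans_reduce` and `ans_rowsums`, and the `V1`..`V10` column names, are made up for illustration. `Reduce()` is used here only as a compact base-R stand-in for the loop in `foo()`, not as a claim that it is faster.

```r
## Smaller data than the timing example, so this runs in well under a second.
set.seed(1L)
cols <- lapply(1:10, function(x) as.numeric(sample(10, 1e4, TRUE)))
names(cols) <- paste0("V", seq_along(cols))   # hypothetical column names
DF_small <- as.data.frame(cols)

## rowSums() on a data.frame works on as.matrix(x), i.e. a second copy of the data.
m <- as.matrix(DF_small)
print(object.size(DF_small), units = "Kb")
print(object.size(m), units = "Kb")

## Column-wise accumulation without coercion: same idea as foo(), via Reduce().
ans_reduce  <- Reduce(`+`, DF_small)
ans_rowsums <- unname(rowSums(DF_small))
all.equal(ans_reduce, ans_rowsums)
```

The two `object.size()` calls should report sizes of the same order, which is the extra copy that the column-wise approach avoids allocating.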

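Even better would be to write this for-loop in C. But that shouldn't matter much, since the loop runs only once per column and each iteration is itself a vectorised addition. It would only start to matter with a very large number of columns (which is rarely the case).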
