Here's a simple timing test of aggregation functions in R, using 1.3 million rows and 80,000 groups of real data on a 1.8GHz Intel Core i5. Thanks to Arun Srinivasan for helpful comments.
The fastest way to aggregate the data frame in this benchmark is data.table, which runs twice as fast as dplyr, which in turn runs ten times faster than base R.
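For reference, the calls being timed look roughly like the following. This is a minimal sketch: the simulated data frame `d`, with columns `group` and `value`, stands in for the real data, which is not included here.

```r
library(data.table)
library(dplyr)

# Simulated stand-in for the real data: 1.3 million rows, 80,000 groups.
n <- 1.3e6
d <- data.frame(group = sample(8e4, n, replace = TRUE),
                value = rnorm(n))

# Base R: aggregate()
system.time(aggregate(value ~ group, data = d, FUN = mean))

# dplyr: group_by() + summarise()
system.time(d %>% group_by(group) %>% summarise(mean_value = mean(value)))

# data.table: grouping inside [ ]
dt <- as.data.table(d)
system.time(dt[, .(mean_value = mean(value)), by = group])
```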
For a benchmark that includes plyr, see this earlier Gist, a computationally more intensive test on half a million rows in which dplyr still runs 1.5 times faster than aggregate in base R.
Both tests confirm what W. Andrew Barr blogged about dplyr:
the 2 most important improvements in dplyr are
- a MASSIVE increase in speed, making dplyr useful on big data sets
- the ability to chain operations together in a natural order
Tony Fischetti has clear examples of the latter, and Erick Gregory shows that easy access to SQL databases should also be added to the list.
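As an illustration of that chaining style, here is a short sketch. The first part uses the built-in `mtcars` data rather than the benchmark data; the second part shows the same verbs running against a SQL backend, assuming an SQLite file `mydata.sqlite` with a table `measurements` exists and that the DBI, RSQLite and dbplyr packages are installed.

```r
library(dplyr)

# Chaining: each verb reads in the order it is applied,
# instead of being nested inside the next call.
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))

# The same verbs work against a SQL database (sketch only;
# the file and table names are made up for illustration).
con <- DBI::dbConnect(RSQLite::SQLite(), "mydata.sqlite")
tbl(con, "measurements") %>%
  group_by(group) %>%
  summarise(n = n())
```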
Arun Srinivasan adds in the comments:

"Yes, it wouldn't make sense if you don't want to stick to the data.table object, of course; that was the misunderstanding on my side. Then you'd have to convert back and forth. My reason for requesting this benchmark is, since people are interested in speed, to show that there are much faster options than dplyr. I've also benchmarked data.table against dplyr (for quite some time now) and will be putting it up on the webpage tonight or tomorrow; it's on half a billion rows. When you see the speed-up there against dplyr, it's quite enticing ;). Anyhow, thanks a lot for taking the time (especially out of your weekend) and accepting my request to do this benchmark."
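To illustrate the back-and-forth conversion the comment refers to, a data frame can be converted to a data.table and back by reference, without copying. This is a sketch only; `d` is the simulated data frame from the first example.

```r
library(data.table)

setDT(d)                         # data.frame -> data.table, in place
setkey(d, group)                 # key on the grouping column for fast grouping
res <- d[, .(mean_value = mean(value)), keyby = group]
setDF(d)                         # back to a plain data.frame, in place
```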