Checking for exact equality of FPs
require(dplyr)
DF = data.frame(a=seq(0, 1, by=0.2), b=1:2)
# the 0.6 in DF$a comes from seq() and is not bit-identical to the literal 0.6,
# so the join finds no match:
merge(data.frame(a=0.6), DF, all.x=TRUE)
#     a  b
# 1 0.6 NA
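To see where that NA comes from, and one common workaround (a hedged sketch of my own, not part of the original snippet: rounding the key on both sides to an arbitrary fixed precision before joining):
seq(0, 1, by=0.2)[4] == 0.6                      # FALSE: the seq() value differs by ~1e-16
DFr <- transform(DF, a = round(a, 10))           # round the key column first
merge(data.frame(a = round(0.6, 10)), DFr, all.x=TRUE)
#     a b
# 1 0.6 2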
Suppose I've two data.frames DF1 and DF2 as shown below:
require(dplyr)
set.seed(1L)
DF1 = data.frame(x=sample(3, 10, TRUE), y1=1:10, y2=11:20)
# rows of DF1 where x == 1:
#    x y1 y2
# 1: 1  1 11
# 2: 1  5 15
# 3: 1 10 20
Here's another comparison between dplyr and data.table on relatively large data, with different numbers of unique groups.
require(dplyr)
require(data.table)
N = 20e6L   # 20 million rows, UPDATE: also ran for 50 million rows (see table below)
K = 10L     # other values tested are 25L, 50L, 100L
DT <- data.table(Insertion = sample(K, N, TRUE),
                 Unit      = sample(paste("V", 1:K, sep=""), N, TRUE),
                 Channel   = sample(K, N, TRUE))
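As a hedged sketch only (the expressions actually benchmarked are not shown in this excerpt), here is one way such a comparison could be timed on the DT defined above, counting rows per group with both packages:
# count rows per (Insertion, Unit, Channel) group, timed with system.time()
system.time(ans_dt <- DT[, .(n = .N), by = .(Insertion, Unit, Channel)])
DTdf <- as.data.frame(DT)
system.time(ans_dp <- DTdf %>% group_by(Insertion, Unit, Channel) %>% summarise(n = n()))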
Benchmarking for this gist: https://gist.github.com/PeteHaitch/75d6f7fd0566767e1e80
sim_data <- function(n, m, d, sim_strand = FALSE) {
  if (d >= n) {
    stop("Require d < n")
  }
  # sample d positions at random from 1..(n - d)
  i <- sample(n - d, d)
A small note on this tweet from @KevinUshey and this tweet from @ChengHLee:
The number of rows, while important, is only one of the factors that influence the time taken to perform a join. From my benchmarking experience, the two features I have found to influence join speed much more, especially on hash-table based approaches (e.g. dplyr), are:
That is, these features influence join speed even when the number of rows is the same.
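As a small illustration of that point (my own sketch, not the original benchmark; the inner_join() call and column names are assumptions): two joins with the same number of rows but very different numbers of unique key values, timed with system.time().
require(dplyr)
N <- 2e6L
time_join <- function(K) {
  x <- data.frame(key = sample(K, N, TRUE), vx = runif(N))
  y <- data.frame(key = seq_len(K), vy = runif(K))
  c(unique_keys = K,
    seconds = system.time(inner_join(x, y, by = "key"))[["elapsed"]])
}
rbind(time_join(10L), time_join(1e5L))   # same number of rows, very different key cardinality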
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
Under the section Vectorise (and also briefly mentioned under the section Do as little as possible), one point I think would be nice to have is to be aware of the data structure the vectorised functions are implemented for. Using vectorised code without understanding that is a form of "premature optimisation" as well, IMHO.
For example, consider the case of rowSums on a data.frame. Some issues to consider here are:
rowSums on a data.frame will coerce it into a matrix first. Imagine a huge (> 1 GB) data.frame; this might turn out to be a bad idea if the conversion drains memory and starts swapping.
Note: I personally think discussions about performance should weigh the trade-offs between "speed" and "memory".