A common data manipulation task is that of making 'wide' data 'long':
for example, using the reshape2 library,
library(reshape2)
wide_df <- data.frame( stringsAsFactors=FALSE,
x=c('a', 'b'),
y=c('A', 'B'),
za=c(1, 2),
zb=c(3, 4),
zc=c(5, 6)
)
print(wide_df)## x y za zb zc
## 1 a A 1 3 5
## 2 b B 2 4 6
print(melt(wide_df, id.vars=c('x', 'y')))## x y variable value
## 1 a A za 1
## 2 b B za 2
## 3 a A zb 3
## 4 b B zb 4
## 5 a A zc 5
## 6 b B zc 6
We will try to re-implement this melt function using Rcpp.
Let's define some variables:
id_indis the indices corresponding to ouridvariables (1 and 2 in this case),value_indis the indices corresponding to ourvaluevariables (3, 4, 5 in this case),n_idis the number ofidvariables (2),n_valueis the number ofvaluevariables (3).
There are, essentially, three different operations constituting a melt:
- We repeat each
idvectorn_valuetimes, - We repeat each of the names of the
valuevectorsn_idtimes, - We concatenate the
valuevectors, coercing the type of each vector to the 'highest' type if necessary.
We store the output in a data.frame with our repeated id variables,
a variable column built up of repeated value names, and a value
column made up of the concatenated value vectors.
reshape2::melt does this all in R code, but this forces some unnecessary
copies of the data.frame and is hence very slow for large data sets. Let's
try implementing an Rcpp version of melt for data.frames.
We'll start by simply implementing functions for each of our 3 cases above.
#include <Rcpp.h>
using namespace Rcpp;
// a define used later for error handling
#define TYPE2STR(x) (CHAR(Rf_type2str(TYPEOF(x))))
// 1 is easy since we already have the sugar 'rep' function; we'll see its use
// later in the code
// 2. We repeat the names of the `value` vectors `n_id` times,
template <int RTYPE>
Vector<RTYPE> rep_each( const Vector<RTYPE>& x, int each ) {
int nn = x.size(); // length of x
int n = nn * each; // length of our output vector
Vector<RTYPE> output = no_init(n); // allocate the vector w/out initialization
// now, loop through x, and assign elements to output as needed
int counter = 0;
for (int i=0; i < nn; ++i) {
for (int j=0; j < each; ++j) {
output[counter] = x[i];
++counter;
}
}
return output;
}
// 3. We concatenate the `value` vectors, coercing the type of each vector
// to the 'highest' type if necessary.
// note the signature: we pass a DataFrame by reference, as well as the indices
// for the vectors we care about. This way we avoid passing in copies of the
// vectors; instead, we pass in references from which we can build our output.
// n_value and n_row let us calculate the size of our output vector
template <int RTYPE>
Vector<RTYPE> concatenate( const DataFrame& x,
const IntegerVector& value_ind,
int n_value,
int n_row ) {
int n = n_value * n_row;
Vector<RTYPE> output = no_init(n);
int counter = 0;
for (int i=0; i < n_value; ++i) {
// generate a reference to each vector within x, as tmp
Vector<RTYPE> tmp = as< Vector<RTYPE> >(x[ value_ind[i] ]);
for (int j=0; j < n_row; ++j) {
output[counter] = tmp[j];
++counter;
}
}
return output;
}
// now, we put everything together into a melt function
// [[Rcpp::export]]
List meltRcpp( DataFrame x,
IntegerVector id_ind,
IntegerVector value_ind,
String variable_name,
String value_name ) {
int n_id = id_ind.length();
int n_value = value_ind.length();
int n_row = x.nrows();
// Allocate the output data.frame
// size is equal to the number of id.vars + 2 (variable, value columns)
List output(n_id+2);
// id.vars
// We repeat each 'id' vector n_value times, and assign them immediately
// after generation
// A define to handle the different possible types
#define REP(RTYPE) \
case RTYPE: { \
output[ id_ind[i] ] = rep( as< Vector<RTYPE> >(x[ id_ind[i] ]), n_value); \
break; \
} \
for (int i=0; i < n_id; ++i) {
switch( TYPEOF(x[id_ind[i]]) ) {
REP(LGLSXP);
REP(INTSXP);
REP(REALSXP);
REP(STRSXP);
default: {
Rf_error("Unsupported RTYPE '%s'", TYPE2STR( x[id_ind[i]] ));
}
}
}
// names
// We repeat each 'name' for the value variables n_row times
// first, we do a bit of work to pull out the names we want
CharacterVector names = as<CharacterVector>(x.attr("names"));
CharacterVector val_names = no_init(n_value);
for (int i=0; i < n_value; ++i) {
val_names[i] = names[ value_ind[i] ];
}
// now, we can repeat each of these names
output[n_id] = rep_each(val_names, n_row);
// value.vars
// We concatenate all of the value variables
// we assume all the value variables are of the same type;
// an exercise is to coerce the value vectors
#define CONCATENATE(RTYPE) \
case RTYPE: { \
output[n_id+1] = concatenate<RTYPE>(x, value_ind, n_value, n_row); \
break; \
} \
switch( TYPEOF(x[value_ind[0]]) ) {
CONCATENATE(LGLSXP);
CONCATENATE(INTSXP);
CONCATENATE(REALSXP);
CONCATENATE(STRSXP);
default: {
Rf_error("Unsupported vector type '%s'", TYPE2STR( x[value_ind[0]] ));
}
}
// set some attributes
CharacterVector out_names = no_init(n_id + 2);
for (int i=0; i < n_id; ++i) {
out_names[i] = names[ id_ind[i] ];
}
out_names[n_id] = variable_name;
out_names[n_id+1] = value_name;
output.attr("names") = out_names;
int out_nrow = n_row * n_value;
IntegerVector row_names = no_init(out_nrow);
for (int i=0; i < out_nrow; ++i) {
row_names[i] = i+1;
}
output.attr("row.names") = row_names;
output.attr("class") = "data.frame";
// finally, we return
return output;
}Note one major deficiency is that we do not check or coerce our value variables; this is left as an exercise.
Now, let's test and benchmark our meltRcpp implementation. Note that, since
arrays in C++ are zero indexed, we have to subtract one from the
regular indexes we pass. Ideally, we would wrap this into an R function
that provides a similar melt interface, but that's outside of Rcpp.
all.equal(
melt(wide_df, id.vars=c('x', 'y')),
meltRcpp(wide_df, (0:1), (2:4), "variable", "value")
)## [1] "Component 3: 'current' is not a factor"
## note that we don't coerce the 'variable' column to factor, hence
## all.equal not being TRUE -- all the values do match, howeverAnd a benchmark, for a larger DF:
library(microbenchmark)
n <- 1E5
big_df <- data.frame( stringsAsFactors=FALSE,
x=sample(letters, n, TRUE),
y=sample(LETTERS, n, TRUE),
za=rnorm(n),
zb=rnorm(n),
zc=rnorm(n)
)
all.equal(
melt(big_df, id.vars=c('x', 'y')),
meltRcpp(big_df, (0:1), (2:4), "variable", "value")
)## [1] "Component 3: 'current' is not a factor"
microbenchmark( times=10,
melt=melt(big_df, id.vars=c('x', 'y')),
meltRcpp=meltRcpp(big_df, (0:1), (2:4), "variable", "value")
) ## Unit: milliseconds
## expr min lq median uq max neval
## melt 92.469 96.801 128.821 131.92 156.56 10
## meltRcpp 6.617 6.934 8.311 15.35 17.78 10
There are still a lot of holes in the implementation we have (e.g. more error checking is warranted, type coercion for the 'value' variables is necessary, some extra work is needed to properly handle factors, we might want to copy attributes from our original vectors / df), but we can see that our efforts have been very fruitful -- we see a ~10x increase in speed in our test case.