The goal is to expose the vectorized reduce (group vectorization) feature of rmr2, which processes multiple groups per reduce call, in a plyrmr-idiomatic way.
- Operations should encapsulate the knowledge of whether they can handle multiple groups. vectorized.reduce should be set accordingly.
- vectorized.reduce should be propagated along a pipe when possible. Rules TBD
- A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++)
- Wordcount is our guiding app here.
- Leverage dplyr and its handler system for fast aggregation
- group_by the data frames before passing them to the aggregator (see the first sketch after this list).
- That restricts us to dplyr as the aggregator, though. Alternatives?
- How do we do this for ops that don't have a dplyr equivalent, such as bind.cols, transmute, etc.? We can try to simulate them with dplyr operations, but there is no equivalent for transmute. We can try to simulate it with do, but that is difficult and slow (~5k rows/s). We could introduce summarize, and so we did.
- How do we do this in SparkR? Maybe a lapplyPartition after a groupByKey (see the second sketch after this list)?
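
A minimal sketch of the wordcount case, assuming rmr2's mapreduce accepts vectorized.reduce = TRUE and then hands the reduce function keys and values spanning several groups in a single call; the per-group work is pushed down to dplyr's group_by/summarise. The function name wordcount.vr and the keyval layout are illustrative, not a committed plyrmr API:

```r
library(rmr2)
library(dplyr)

# Illustrative sketch: with vectorized.reduce = TRUE the reduce function sees
# keys and values for many groups at once, so the per-group aggregation is
# delegated to dplyr's grouped summarise instead of one reduce call per key.
wordcount.vr = function(input, output = NULL, pattern = " ") {
  mapreduce(
    input  = input,
    output = output,
    map = function(., lines) {
      words = unlist(strsplit(lines, split = pattern))
      keyval(words, rep(1L, length(words)))
    },
    reduce = function(words, counts) {
      agg =
        data.frame(word = words, count = counts, stringsAsFactors = FALSE) %>%
        group_by(word) %>%
        summarise(count = sum(count))
      keyval(agg$word, agg$count)
    },
    vectorized.reduce = TRUE)
}
```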
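
For the SparkR question, a rough sketch under the assumption that the RDD-level API of the early AMPLab SparkR package (textFile, flatMap, lapply, groupByKey, lapplyPartition, collect) is available; exact signatures would need checking against the current release:

```r
library(SparkR)

# Illustrative sketch: after groupByKey each partition holds whole groups as
# (key, list-of-values) pairs, so a partition-level function can aggregate
# several groups in one call, approximating a vectorized reduce.
sc      = sparkR.init(master = "local")
lines   = textFile(sc, "input.txt")
words   = flatMap(lines, function(line) strsplit(line, " ")[[1]])
pairs   = lapply(words, function(word) list(word, 1L))
grouped = groupByKey(pairs, 2L)
counts  = lapplyPartition(grouped, function(part)
  lapply(part, function(kv) list(kv[[1]], sum(unlist(kv[[2]])))))
collect(counts)
```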