The goal is to expose the vectorized reduce (group vectorization) feature of rmr2, which processes multiple groups per reduce call, in a plyrmr-idiomatic way.
- Operations should encapsulate the knowledge of whether they can handle multiple groups. vectorized.reduce should be set accordingly.
- vectorized.reduce should be propagated along a pipe when possible. Rules TBD
- A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++)
- Wordcount is our guiding app here.
- Leverage dplyr and its handler system for fast aggregation
- group_by the data frames before passing them to the aggregator (see the first sketch after this list).
- That restricts us to dplyr as the aggregator, though. Alternatives?
- How do we do this for ops that don't have a dplyr equivalent, such as bind.cols, transmute, etc.? We can try to simulate them with dplyr operations, but there is no equivalent for transmute. We can try to simulate it with do, but that is difficult and slow (~5k rows/s). We could introduce summarize, and so we did.
- How do we do this in SparkR? Maybe a lapplyPartition after a groupByKey (see the second sketch after this list)?
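
A minimal sketch of the wordcount case, assuming rmr2's mapreduce accepts vectorized.reduce = TRUE and then hands the reduce function keys and values spanning several groups in a single call; the per-group work is pushed down to dplyr's group_by/summarise. The function name wordcount.vr and the keyval layout are illustrative, not a committed plyrmr API:

```r
library(rmr2)
library(dplyr)

# Illustrative sketch: with vectorized.reduce = TRUE the reduce function sees
# keys and values for many groups at once, so the per-group aggregation is
# delegated to dplyr's grouped summarise instead of one reduce call per key.
wordcount.vr = function(input, output = NULL, pattern = " ") {
  mapreduce(
    input  = input,
    output = output,
    map = function(., lines) {
      words = unlist(strsplit(lines, split = pattern))
      keyval(words, rep(1L, length(words)))
    },
    reduce = function(words, counts) {
      agg =
        data.frame(word = words, count = counts, stringsAsFactors = FALSE) %>%
        group_by(word) %>%
        summarise(count = sum(count))
      keyval(agg$word, agg$count)
    },
    vectorized.reduce = TRUE)
}
```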
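
For the SparkR question, a rough sketch under the assumption that the RDD-level API of the early AMPLab SparkR package (textFile, flatMap, lapply, groupByKey, lapplyPartition, collect) is available; exact signatures would need checking against the current release:

```r
library(SparkR)

# Illustrative sketch: after groupByKey each partition holds whole groups as
# (key, list-of-values) pairs, so a partition-level function can aggregate
# several groups in one call, approximating a vectorized reduce.
sc      = sparkR.init(master = "local")
lines   = textFile(sc, "input.txt")
words   = flatMap(lines, function(line) strsplit(line, " ")[[1]])
pairs   = lapply(words, function(word) list(word, 1L))
grouped = groupByKey(pairs, 2L)
counts  = lapplyPartition(grouped, function(part)
  lapply(part, function(kv) list(kv[[1]], sum(unlist(kv[[2]])))))
collect(counts)
```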