This comparison is inspired by the one found here: https://gist.github.com/conormm/fd8b1980c28dd21cfaf6975c86c74d07
The Nim data frame implementation we use here is the Datamancer data frame.
Note that due to UFCS in Nim we can write the commands presented here either in a style similar to the Python notation, as if the function were a method of the object (method call syntax: df.filter(...)), or using regular function call syntax similar to R (filter(df, ...)). In R the %>% pipe operator achieves something that looks a bit like method call syntax. In the R examples in which both ways are presented, we replace them by a single Nim equivalent.
We will use the method call syntax for all Nim examples, with the exceptions mentioned. Also, unless an expression does not return anything, we will discard all Nim statements.
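For example, assuming a data frame df with an integer column "var", the following two lines are equivalent:
discard filter(df, f{`var` > 20000})  # function call syntax
discard df.filter(f{`var` > 20000})   # method call syntax via UFCS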
As stated in the gist linked above, take the Python examples with a grain of salt in terms of notation.
R
filter(df, var > 20000 & var < 30000)
filter(df, var == 'string')
df %>% filter(var != 'string')
df %>% group_by(group) %>% filter(sum(var) > 2000000)
Python
df[(df['var'] > 20000) & (df['var'] < 30000)]
df[df['var'] == 'string']
df[df['var'] != 'string']
df.groupby('group').filter(lambda x: sum(x['var']) > 2000000)
Nim
discard df.filter(f{`var` > 20000 and `var` < 30000})
# Note: `"string"` here refers to a specific string literal and not the string
# type (which may very well be the case in languages like Python).
# Using function call syntax:
discard filter(df, f{`var` == "string"})
# or using method call syntax with a similar example
discard df.filter(f{`var` != "string"})
# this returns all subgroups that pass the reducing predicate
discard df.group_by("group").filter(f{sum(`var`) > 2000000})
In these four cases the differences between the R and Nim implementations stem from the fact that R is an interpreted language and Nim a compiled one. The “pseudo code” written in the R examples needs to be converted into something that can be compiled as regular Nim code. Nim’s powerful macro system allows us to achieve the elegance of R with a small amount of additional fluff.
The two differences are:
- We need to enter an untyped macro that allows us to rewrite code. This is done by using the f{} macro.
- References to columns are not done via untyped identifiers as (in essence) is done in R. There would be no way to differentiate between a variable foo and a column of name foo:

let foo = 5
# let `df` be a data frame with a column "foo"
discard df.filter(f{ foo > 10 })
For our implementation this code snippet means that foo will be interpreted as the local variable, resulting in a formula that is constantly false. The resulting data frame will thus be empty.
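To actually refer to the column "foo" in such a situation, an explicit column reference has to be used, i.e. accented quotes or the idx function introduced further below:
discard df.filter(f{`foo` > 10})        # compares each element of the column "foo"
discard df.filter(f{idx("foo") > 10})   # equivalent, using an explicit idx call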
For the latter we could abuse Nim’s macro system to have a bare identifier refer to a column if and only if no symbol of that name is defined. However, that is too ambiguous and makes it undesirable to use column names that might clash with typical symbols in code. There are different ways to refer to columns, which won’t be discussed in detail here (a couple of them are mentioned below though, as there are certain restrictions).
Side note: Nim’s implementation could be handled in a way similar to Pandas, essentially by providing broadcast operations for DF column + scalar operations. However, this has the significant downside of introducing multiple for loops in code such as:
df["var"] > 200 and df["var"] < 500
Each comparison operation then needs to be broadcast and evaluated individually using a full loop over the data, which is wasted work. That’s the reason we take the full macro path in Nim (which has the additional benefit of being able to emulate the R syntax closely) and convert the DSL into a single for loop over the data, independent of the complexity of the formula. (Multiple formulas after one another cannot be fused unfortunately. That would add significant complexity requiring analysis of the loop bodies, which in addition would require even the procedures filter, mutate, ... to be macros so that multiple formulas can be analyzed together.) One currently still missing feature is the lifting of constant expressions. Something like f{`var` < mean(`var`)} currently recomputes the mean in each iteration, just as it would in a regular for loop. Soon such expressions will be lifted out of the loop body.
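To make the single loop point concrete, the following is a rough, hand-written sketch of what such a fused comparison amounts to (not the actual macro output; data frame and numbers are made up):

import datamancer

let df = toDf({"var": @[15000, 25000, 42000]})
# f{`var` > 20000 and `var` < 30000} evaluates both comparisons in a single
# pass over the data instead of one broadcast loop per comparison:
let t = df["var", int]          # underlying tensor of the column "var"
var keep = newSeq[bool](t.size)
for i in 0 ..< t.size:
  keep[i] = t[i] > 20000 and t[i] < 30000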
The last formula (in R):
df %>% group_by(group) %>% filter(sum(var) > 2000000)
in Nim:
df.group_by("group").filter(f{sum(`var`) > 2000000})
Depending on your expectations this case might be a bit surprising. The formula argument to filter is a reducing formula that returns a boolean. If that boolean is true, the full subgroup (determined by the group_by call) will be kept, otherwise it is dropped.
One might on the other hand think it would return the group labels that match the predicate. We will now take a detour to do exactly that instead, because it allows us to explain a few more things.
We will deconstruct (one of) the ways to determine the labels that pass the above predicate.
The Nim code to do that is (line breaks for readability):
df.group_by("group")
.summarize(f{int: sum(`var`)})
.filter(f{idx("(sum var)") > 2000000})
From a “high-level” point of view the only real difference is that the reduce + filter is split into individual calls: first a reducing operation of the kind we wish (summing `var` within each group of “group”) and then a filter on the result.
There are now 2 things in particular that stand out.
- In the summarize formula we see the int: at the beginning. This is a type hint for the formula macro. It means to read the column (“var” in this case) as a column of integers. It is required, because from sum(`var`) alone it isn’t possible to tell which type needs to be read, as there are multiple overloads of the symbol sum.
- The filter call now uses the idx “function”. It’s not really a function, but a different way to refer to data frame columns. There are again 2 things to untangle here:
  - Why is the argument the string "(sum var)" when the original data frame contains a column “var”? Because the summarize operation returns a new data frame that contains those columns that are described by the formulas. As we did not give a name for the formula, the default naming scheme is used. That scheme is simply a lisp-like representation of the code in the formula: a function call sum with the argument var becomes “(sum var)”. For people used to lisp this should be comfortable, and it allows writing the default formula name by hand even if more complex expressions are used (though they get lengthy). A name is given to the formula by:
    df.summarize(f{int: "summed" << sum(`var`)})
    where the new column is called “summed”. The << refers to a reducing formula. The other two operators are:
    - ~ for a mapping operation T -> T. For any formula that is not reducing in nature a "name" ~ RHS can be used to name the newly created column. See the additional mutate example below.
    - <- for a constant assignment, typically only used in the context of the rename procedure to rename DF columns. See the section on rename below.
  - Why do we use idx all of a sudden? The default naming scheme of formulas contains spaces. The accented quotes in Nim (identifiers in `) cannot properly handle spaces, as we lose information. We can either reconstruct expressions in accented quotes without any spaces (what we do) or with spaces between every identifier. E.g. `x/y` and `x / y` cannot be distinguished, and it’s up to us to convert such an expression into a string. We choose to convert both to "x/y". The idx function simply refers to a column and can take (see the example below):
    - something in accented quotes
    - a literal string
    - any Nim expression returning a string
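For illustration, assume the result of the group_by + summarize call above is stored in a variable dfSum (a name used only for this example):
discard dfSum.filter(f{idx("(sum var)") > 2000000})   # a literal string
let sumCol = "(sum var)"
discard dfSum.filter(f{idx(sumCol) > 2000000})        # any Nim expression returning a string
# for column names without spaces, something in accented quotes works as well,
# e.g. idx(`group`)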
So the following is equivalent to the summarize + filter version above.
df.group_by("group")
.summarize(f{int: "sumVar" << sum(`var`)}) # name new column
.filter(f{`sumVar` > 2000000}) # filter by new column
R
select(df, var1, var2)
select(df, -var3)
Python
df[['var1', 'var2']]
df.drop('var3', axis=1)
Nim:
discard df.select("var1", "var2")
# or
discard df.select(@["var1", "var2"])
discard df.select(df.getKeys().filterIt(it != "var3"))
# or more naturally
discard df.drop("var3")
R
arrange(df, var1)
arrange(df, desc(var1))
Python
df.sort_values('var1')
df.sort_values('var1', ascending=False)
Nim
discard df.arrange("var1")
discard df.arrange("var1", order = SortOrder.Descending)
R
df %>% group_by(group)
df %>% group_by(group1, group2)
df %>% ungroup()
Python
df.groupby('group1')
df.groupby(['group1', 'group2'])
df.reset_index()  # or when grouping: df.groupby('group1', as_index=False)
Nim
discard df.group_by("group")
# the following is currently broken because of Nim issue #17429
# (we could introduce another overload w/o `add` argument to work around)
discard df.group_by("group1", "group2")
discard df.group_by(@["group1", "group2"])
# ungroup is just
discard df.group_by()
R
df %>% group_by(group) %>% summarise(mean_var1 = mean(var1))
df %>%
group_by(group1, group2) %>%
summarise(mean_var1 = mean(var1),
sum_var1 = sum(var1),
count_var1 = n())
df %>%
group_by(group1, group2) %>%
summarise(mean_var1 = mean(var1),
sum_2 = sum(var2),
var3 = first(var3))
Python
df.groupby('group1').agg(mean_col=('var1', 'mean')) # named aggregation to specify the column name
df.groupby(['group1', 'group2'])['var1'].agg(['mean', 'sum', 'count']) # for count also consider 'size'. size will return n for NaN values also, whereas 'count' will not.
# first perform the aggregation
group_agg = df.groupby(["group1", "group2"]).agg({
"var1" : ["mean"],
"var2" : ["sum"],
"var3" : ["first"]
})
# second rename the columns by joining the column name with the agg function (e.g. "var1_mean")
group_agg.columns = ["_".join(x) for x in group_agg.columns.ravel()]
# You can also pass multiple functions to aggregate the same column e.g:
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean", "std", "sum"]})
Nim
discard df.group_by("group").summarize(f{float: "mean_var1" << mean(`var1`)})
discard df.group_by(@["group1", "group2"])
.summarize(f{float: "mean_var1" << mean(`var1`)},
f{float: "sum_var1" << sum(`var1`)},
f{int: "count_var1" << col(`var1`).len})
discard df.group_by(@["group1", "group2"])
.summarize(f{float: "mean_var1" << mean(`var1`)},
f{float: "sum_2" << sum(`var2`)},
f{float: "var3" << col(`var3`)[0]})
After the explanations above, these should be mostly straightforward. We need the type hints in all cases, as we simply call procedures that are all overloaded. The correct types to use of course depend on the individual data frame. It might be int in each case, depending on data and use case (even if that’s weird for a mean).
The col “function” used here is the equivalent of the idx function introduced above for the case in which one wishes to explicitly refer to a full data frame column. In this specific use case we can use it to essentially access the underlying tensor and use regular Nim procedures on it.
- In the first case, col(`var1`).len will be turned into var1T.len, where var1T is the underlying tensor of the column “var1” (essentially let var1T = df["var1", float]). Given that col(...) is converted to a simple tensor, any procedure that’s valid for tensors should work as expected (for readers who know arraymancer well: ggplotnim contains a len overload for tensors that checks the dimension and returns the tensor size).
- In the second case, col(`var3`)[0], it’s the same idea. We simply get the regular tensor and access the 0-th element, i.e. the equivalent of a hypothetical first procedure. We could of course define such a procedure, too (see the sketch below).
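As a minimal sketch (assuming the Tensor type is available via the datamancer import; first is not part of Datamancer, the name merely mirrors dplyr’s first):

import datamancer

# hypothetical helper returning the first element of a tensor
proc first[T](t: Tensor[T]): T = t[0]

# which could then be used in a reducing formula, e.g.
# discard df.summarize(f{float: "var3" << first(col(`var3`))})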
This possibly leaves the question “why do we sometimes just write `var1` and other times explicitly col(...) or idx(...)?”. The macro magic behind the formula tries to determine the correct usage of an accented quote column reference. Typically it will be interpreted as an access to an individual element of a column. However, in calls to functions we try to determine the signature and choose accordingly (e.g. for mean we know it’s reducing and thus the input is the full column). There are cases though where one needs to be explicit, either due to ambiguous overloads or, as in the case shown here, because we want to work on the actual tensor in a more atypical way.
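For example, assuming a data frame with a numeric column "x", the same accented quote reference is read differently depending on context:
discard df.mutate(f{float: "xPlusOne" ~ `x` + 1.0})    # `x` acts like idx("x"): element-wise access
discard df.summarize(f{float: "meanX" << mean(`x`)})   # `x` acts like col("x"): the full column is passed to mean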
The following example is, in my opinion, rather bad, as in dplyr it should be done using summarise instead, since mutate is normally used for mapping operations. The point I think it is trying to get across is that one can mutate on a group, i.e. include operations that are affected by the grouping (the mean of one group is probably different from the mean of the whole dataset).
R
df %>% group_by(group) %>% mutate(mean_var1 = mean(var1))
Python
df.assign(mean_var1 = df.groupby('group')['var1'].transform('mean'))
Nim
discard df.group_by("group").mutate(f{"mean_var1" << mean(var1)})
We need to clarify in Nim that it’s a reducing operation. If we replaced the << by a tilde ~ instead, we would get a CT error saying that the operation on the RHS is reducing in nature.
A more typical example utilizing mutate
might be something like:
discard df.mutate(f{float: "sumA+B" ~ `A` + `B`})
which creates a new column named “sumA+B” of type float.
R
df %>% distinct()
df %>% distinct(col1) # returns dataframe with unique values of col1
Python
df.drop_duplicates()
df.drop_duplicates(subset='col1') # returns dataframe with unique values of col1
Nim
discard df.unique()
discard df.unique("col1") # returns dataframe with unique values of col1
In Nim we replaced the word distinct, because it’s a keyword. unique seems like a good alternative, and I like it much better than sequtils’ deduplicate.
rename
is used to (guess what) rename different columns. For this we
use a formula with <-
. Let df
be a data frame with columns “col1”
and “col2”:
discard df.rename(f{"foo" <- "col1"},
f{"bar" <- "col2"})
which simply renames both columns.
Sampling from a data frame is not implemented in ggplotnim’s data frame so far.
R
sample_n(df, 100)
sample_frac(df, 0.5)
Python
df.sample(100)
df.sample(frac=0.5)