In R, the functions for fitting statistical models based on a linear predictor usually allow for a formula/data specification of the model. The data part is a data.frame
object in R
. In Julia the corresponding type is a DataFrame
as implemented in the DataFrames package.
A formula object in R
is an expression whose top-level operator is a unary or binary tilde, ~
. The interpretation of many operators, including +
, *
, ^
, \
and :
are overridden in the formula language. See the result of
?formula
in R
for the details.
Model-fitting functions in R
produce the model matrix from a formula/data specification and other optional arguments in two stages. First the formula and the data frame are used to create a model.frame
object as
-
Create a
terms
object from theformula
anddata
arguments. The specificR
method isterms.formula
which, in turn, calls a C functiontermsform
in$RSRC/src/library/stats/models.c
. Stages within that function are- Check for a valid formula
- Determine the names of variables used in the formula
- Check if the intercept term is present. Because it is present by default this amounts to checking if it has been suppressed, either by a term of the form
0 + ...
or... - 1
or-1 + ...
- Recode the model in a binary form, expanding various operators. After this stage the model terms are either names of variables, interactions of variables or function calls. The binary form is a binary matrix with rows corresponding to the variables and columns corresponding to the terms. The bitcount for each column is the
order
of the term, which is the number of variables in the interaction. That is, an intercept term has order zero, amain-effects
term has order 1, a second-order interaction has order 2, etc. - Reorder the model terms by bitcount, otherwise preserving the original order.
- Compute the factor pattern, 0 indicates that the term does not appear in the model, 1 indicates it will appear as contrasts and 2 indicates it will appear as indicators. This is essentially the binary matrix described above except for the case where the intercept term is suppressed and the first term that would be a contrast becomes the indicators.
-
A few fixups are generally performed after this. See the code for the
R
functionterms.formula
. -
Extract the variables from the data frame and apply any subsetting specification.
-
Apply the NA handler, such as
na.omit
to the reduced data frame. -
Drop unused levels in any
factor
columns. In Julia these are of typePooledDataArray
. -
Create the model matrix. The
R
functionmodel.matrix
delegates most of the work to the C functionmodelmatrix
in the same file asmodelframe
. Most of the work here is in handling the contrasts for any factor variables and creating interaction terms from sets of contrasts. After the symbolic analysis to create theTerms
object in aModelFrame
, creation of aModelMatrix
depends only on the factor pattern.
In Julia the formula specification is as a Julia expression. For example
julia> :(y ~ a + b + log(c))
:(~(y,+(a,b,log(c))))
which has many benefits such as not needing to write a parser, but also a drawback related to operator precedence. If we use the colon, :
, operator to specify an interaction, as in R
, the term will need to be enclosed in parentheses. A formula that could be written in R
as
julia> :(y ~ a + b + a:b)
:(~(y,+(a,b,a):b))
must be written as
julia> :(y ~ a + b + (a:b))
:(~(y,+(a,b,a:b)))
We don't actually need to define a colon operator on columns of a data frame to be able to evaluate the model matrix because the information on interactions is encoded in the factor
matrix.
Other than that, I think the pieces are getting into place in the db/formula
branch of the DataFrames
package. Extraction of variable names is working, I think
julia> ff = Formula(:(y ~ a + b + (a:b)))
Formula: y ~ :(+(a,b,a:b))
julia> DataFrames.allvars(ff)
3-element Symbol Array:
:b
:y
:a
as is dropUnusedLevels!
and a basic na_omit
NA handler. The construction of the Terms
object still needs more work. Expansion from the PooledDataArray
refs field to the contrasts columns is done through a contrasts generator. At present the only contrasts generator is contr_treatment
but it is easy to add others.
More parenthesis magic. For a one-sided formula the rhs must be enclosed in parentheses
is not the intended result. It should be