First, the h in dhcast is for hierarchy. The function is useful when you need to (A) cast (go from long to wide data), (B) have the casting observe a given variable hierarchy, and (optionally) aggregate your data (summarize related records in the long dataset) at the same time.
Let's say, for example, that we want to predict some time-varying attribute of a grocery store's customers based on their produce purchases. Furthermore, we want to aggregate total spending by type of fruit, and also model the effect of certain attributes (e.g. organic vs. conventional) within each type of fruit. The effect of purchasing organic produce may vary by type of fruit, so we nest the "organic" indicator within the type of fruit.
The reason that you can't just call model.matrix(~ fruit / is_organic) to create a hierarchical dataset is that (A) customers may purchase both conventional and organic apples in the same week, and (B) customers may make multiple purchases in the same week, so you may need to aggregate records as well as cast them.
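To make the limitation concrete, here is a minimal sketch using base R only. The purchases data frame and its Week column are hypothetical stand-ins for the toy data introduced later; the point is simply that model.matrix() returns one design row per purchase record rather than one per customer-week.

```r
# Hypothetical purchase log: customer A buys apples twice in week 1,
# once conventional and once organic.
purchases <- data.frame(
  Customer   = c("A", "A", "B"),
  Week       = c(1L, 1L, 1L),
  item       = factor(c("apple", "apple", "banana")),
  is_organic = c(FALSE, TRUE, TRUE)
)

# model.matrix() expands the nested formula row by row.
mm <- model.matrix(~ item / is_organic, data = purchases)

# Three design rows (one per purchase) versus two customer-weeks.
n_design_rows    <- nrow(mm)
n_customer_weeks <- nrow(unique(purchases[, c("Customer", "Week")]))
```

Collapsing those rows to one per customer-week is the aggregation step that model.matrix() does not perform.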
Note, it's important that you understand data.table's dcast(), as dhcast() is built around the dcast() API, and simply adds the / (nesting) and : (interaction) operators to the right-hand side of the casting formula.
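As a refresher, the following sketch shows a plain data.table dcast() call that casts and aggregates in one step, with no hierarchy involved. The purchases table and its Week column are hypothetical; only dcast() and its fun.aggregate and value.var arguments come from data.table itself.

```r
library(data.table)

# Hypothetical long data: customer A buys apples twice in week 1.
purchases <- data.table(
  Customer = c("A", "A", "A", "B"),
  Week     = c(1L, 1L, 1L, 1L),
  item     = c("apple", "apple", "banana", "apple"),
  spending = c(1.32, 3.65, 5.09, 1.32)
)

# Left-hand side: the variables that identify a row of the wide result.
# Right-hand side: the variable whose values become columns.
# fun.aggregate collapses multiple purchases that fall in the same cell.
wide <- dcast(purchases, Customer + Week ~ item,
              fun.aggregate = sum, value.var = "spending")
```

The result has one row per customer-week, and cells with no purchases are filled by applying the aggregation function to an empty vector (0 for sum).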
Note that in this toy dataset, potatoes are never organic and broccoli is always organic. As a result, we don't need variables for organic within potatoes or broccoli:
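library(data.table)

# Lookup table of the (item, is_organic) combinations that are sold
items <-
  as.data.table(read.csv(textConnection(c(
"item,is_organic
apple,FALSE
apple,TRUE
banana,FALSE
banana,TRUE
potato,FALSE
broccoli,TRUE"))))

set.seed(10101L)
# 5 customers x 5 weeks x 3 spending amounts = 75 rows, each paired with a
# randomly drawn item; then keep a random sample of 15 purchases
DT <- cbind(
  as.data.table(
    expand.grid(
      Customer = LETTERS[1:5],
      Date = seq.Date(Sys.Date(), length.out = 5, by = "week"),
      spending = c(1.32, 3.65, 5.09),
      stringsAsFactors = FALSE)),
  items[sample(.N, 75, replace = TRUE)])[sample(.N, 15)]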
Now for the real work:
# CREATE THE INTERCEPT FIELDS
intercepts <-
  dhcast(DT,
         Customer + Date ~ item/is_organic,
         value.var = "spending")

# CREATE THE VALUE FIELDS
values <-
  dhcast(DT,
         Customer + Date ~ item/is_organic,
         sum,
         value.var = "spending")

# MERGE THE TWO DATASETS. This is possible because the result of dhcast is
# key()'d on the left hand side variables (Customer and Date).
data <- intercepts[values]
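The merge in the last step relies on data.table's keyed join, where X[Y] matches rows on the shared key. A minimal sketch, using two hand-built tables as stand-ins for the dhcast results (the Week, n and total columns are hypothetical):

```r
library(data.table)

# Hypothetical stand-ins for two wide tables sharing the same id columns.
left  <- data.table(Customer = c("A", "B"), Week = c(1L, 1L), n = c(2L, 1L))
right <- data.table(Customer = c("A", "B"), Week = c(1L, 1L),
                    total = c(4.97, 1.32))

# Key both tables on the identifying columns.
setkey(left, Customer, Week)
setkey(right, Customer, Week)

# X[Y] on keyed tables joins on the key columns and keeps the
# non-key columns from both sides.
joined <- left[right]
```

Because the non-key column names differ between the two tables, no name disambiguation is needed in the joined result.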
Inspecting the variable names of the returned intercepts data.table via cat(names(intercepts), sep = "\n"), we see that the interaction term is_organic only appears for apples and bananas, as expected:
Customer
Date
(intercept)
item=apple:(intercept)
item=banana:(intercept)
item=broccoli:(intercept)
item=potato:(intercept)
item=apple:is_organic=FALSE:(intercept)
item=apple:is_organic=TRUE:(intercept)
item=banana:is_organic=FALSE:(intercept)
item=banana:is_organic=TRUE:(intercept)
Similarly, inspecting the data in intercepts shows that a banana:is_organic interaction indicator is only true (1) when the banana indicator is also true:
intercepts[,
  .(banana = `item=banana:(intercept)`,
    conventional_banana = `item=banana:is_organic=FALSE:(intercept)`,
    organic_banana = `item=banana:is_organic=TRUE:(intercept)`
  )]
>>     banana conventional_banana organic_banana
>>  1:      1                   1              0
>>  2:      1                   0              1
>>  3:      1                   1              0
>>  4:      0                   0              0
>>  5:      0                   0              0
>>  6:      1                   0              1
>>  7:      0                   0              0
>>  8:      0                   0              0
>>  9:      1                   0              1
>> 10:      1                   1              0
>> 11:      0                   0              0
>> 12:      1                   1              0
Finally, note that the name attributed to the value of the aggregation function defaults to value.var when fun.aggregate is supplied and (intercept) otherwise. This can be overridden via the parameter value.name.