I see our project as a set of two transformations: one that changes a ggplot2 object into a ggspec, then another that changes a ggspec into a Vega-Lite spec.
The analogy may not be exact, but I see these transformations in terms of linear algebra, where a ggplot2 object is a vector in "ggplot2" space, a ggspec is a vector in "ggspec" space, and a Vega-Lite spec is a vector in "Vega-Lite" space.
One of our goals is that "ggspec"-space should be a faithful representation of "ggplot2"-space. One way of doing this is to make sure that the "transformation-matrix" is as close to diagonal as we can make it. As Haley likes to point out, the ggplot2 object is a "list of 9"; therefore the "ggspec" object will have no more than 9 elements (maybe it will not have a "theme"?).
If we wanted to (and we don't want to), we could reproduce the ggplot2 object using the ggspec.
I have included below a proposal for a modification to ggspec - let me go through each of the elements.
Our idea is that data will be "promoted" to the top of the spec from within the layers, have duplicates removed, and be named. The change here is that the `metadata` is now defined.
I propose `metadata` to be an object, where the names are the names of the variables in the particular dataset, and the values are objects themselves. Each of these value objects would have one mandatory field, `type`, and two optional fields, `levels` and `timezone`:
- `type`: string, the Vega-Lite type, where we would map from R:
  - `numeric`, `integer`: `"quantitative"`
  - `character`: `"nominal"`
  - `factor`: `"nominal"`
  - `ordered`: `"ordinal"`
  - `POSIXct`, `Date`: `"temporal"`
- `levels`: array of strings with the levels of the `factor` or `ordered`
- `timezone`: string, timezone of `POSIXct`
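To make these field definitions concrete, here is a minimal sketch of what one `metadata` object might look like. The variable names (`price`, `cut`, `day`, `sold_at`) and all of their values are hypothetical, invented purely for illustration; only the field names `type`, `levels`, and `timezone` come from the proposal.

```json
{
  "price": {"type": "quantitative"},
  "cut": {
    "type": "ordinal",
    "levels": ["Fair", "Good", "Very Good", "Premium", "Ideal"]
  },
  "day": {
    "type": "nominal",
    "levels": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
  },
  "sold_at": {"type": "temporal", "timezone": "America/Chicago"}
}
```

In this sketch, `cut` stands in for an R `ordered` variable, `day` for a plain `factor`, and `sold_at` for a `POSIXct`; a `numeric` such as `price` needs only the mandatory `type`.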
Here is my understanding of how ggplot works with color-scales and factors: for "regular" factors, it uses a "nominal" scale; for ordered factors (those with class `"ordered"`), it uses an ordinal scale. I think we should follow this.
Part of our goal here is to make sure that the Vega-Lite spec that we produce will be generalizable to new data. As such, I think that if the data arrives at ggplot as a factor, that means that the levels are the only possible values the variable can take, e.g. day-of-the-week.
Another use of factors is within ggplot, perhaps to order a variable according to the value of another variable. For example, consider a bar chart where we want to order cities by their population. In ggplot, we would use an "internal" factor, where the levels are determined when the plot is built. Here, the cities could change, and the population (hence ordering) could change.
In this situation, we could do a similar thing, by specifying `city` as a nominal variable. It remains to figure out how to "decode" something like `forcats::fct_reorder()` and how to denote that in the ggspec, but that can be a problem for later.
It remains to be determined how to deal with `POSIXct` and `Date`. I have some ideas, but they are not yet completely formed. The challenge is that R has a notion of timezones, while Vega-Lite (like native JavaScript) does not.
The `observations` would be the usual d3-format array of objects for the data-frame.
The ggspec data object would contain all the datasets from the ggplot object in one place. It would reserve `data-00` for the dataset in the ggplot2 `data` element, then `data-01`, `data-02`, ..., for data-frames specified in the `layers`.
In summary, the ggspec `data` element would be a function of the ggplot2 `data` element and the ggplot2 `layers` element.
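As a rough illustration of the naming scheme, a ggspec `data` element for a plot with one top-level dataset and one layer-specific dataset might look something like the sketch below. The nesting of `metadata` and `observations` under each dataset name is an assumption about the layout, and the variables and values are made up.

```json
{
  "data": {
    "data-00": {
      "metadata": {
        "city": {"type": "nominal"},
        "population": {"type": "quantitative"}
      },
      "observations": [
        {"city": "Springfield", "population": 116000},
        {"city": "Shelbyville", "population": 42000}
      ]
    },
    "data-01": {
      "metadata": {
        "city": {"type": "nominal"},
        "label": {"type": "nominal"}
      },
      "observations": [
        {"city": "Springfield", "label": "capital"}
      ]
    }
  }
}
```

Here `data-00` plays the role of the data-frame passed to the ggplot2 `data` element, and `data-01` that of a data-frame supplied directly to a layer.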
At present, we are not concerned with the ggplot2 `mapping` object, as our thought is to support, initially, mappings that are defined in the layers.
Here, `layers` is an array of `layer` objects. I think each ggspec `layer` object will be a function of the ggplot `layer` and the `data`. It is a function of `data` only to be able to include the name of the dataset. With apologies to Wenyu, this is a significant change from the previous proposal: we propose not to include the `type` here; instead, it would be provided in the `metadata` in the ggspec `data` object.
If it is OK with Wenyu, he could determine the `type` of the particular Vega-Lite `encoding` using the `type` from the `metadata`, to be overridden by the ggspec `scales` if need be. However, I think it would be OK, initially, just to use the value from `metadata`.
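A minimal sketch of a layer under this change might look as follows. The key names `data`, `mapping`, and `field` are assumptions made for illustration, not settled parts of the proposal; the point is only that the layer refers to its dataset by name and carries no `type`.

```json
{
  "layers": [
    {
      "data": "data-00",
      "mapping": {
        "x": {"field": "city"},
        "y": {"field": "population"}
      }
    }
  ]
}
```

To build the Vega-Lite `encoding` for `x`, the translator would look up `city` in the `metadata` of the dataset named `data-00` and use the `type` recorded there.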
Maybe there will not be a need to override the `type` according to the scale. Consider this example (paste into the editor):
{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {"url": "data/cars.json"},
  "mark": "point",
  "encoding": {
    "x": {"field": "Horsepower", "type": "quantitative"},
    "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
    "color": {
      "field": "Cylinders",
      "type": "ordinal",
      "scale": {"range": "category"}
    }
  }
}
Here, `scales` is an array of `scale` objects; each ggspec `scale` would be a function of the ggplot2 `scale`. We introduce the `name` field to this proposal.
The ggspec `labels` is a function only of the ggplot2 `labels`. For Wenyu, if the scale for an aesthetic/encoding is named, we can use that name; otherwise we can look for it in `labels`.
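The fallback described here can be illustrated with a small sketch. Only the `name` field comes from the proposal; the `aesthetics` key is an assumption (it mirrors the field of the same name on a ggplot2 scale object), and the values are hypothetical.

```json
{
  "scales": [
    {"name": "Population (thousands)", "aesthetics": ["y"]}
  ],
  "labels": {
    "x": "city",
    "y": "population"
  }
}
```

In this sketch the `y` encoding would take its title from the named scale, "Population (thousands)", while `x` has no scale of its own and so would fall back to the entry in `labels`, "city".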