This is a simple example demonstrating why you might want to use IbisML instead of just plain Ibis in an ML preprocessing pipeline.
Suppose you're training an ML model that achieves better accuracy when the floating point columns in the training data are normalized (by subtracting the mean and dividing by the standard deviation). Your data contains multiple such columns.
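Concretely, this standardization (often called z-scoring) can be sketched in a few lines of plain Python; the numbers here are illustrative, not taken from the dataset:

```python
# Standardize a column of values: subtract the mean, divide by the
# sample standard deviation. Illustrative values, not real iris data.
values = [4.9, 5.1, 5.8, 6.4]
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
normalized = [(v - mean) / std for v in values]
# The normalized values have mean 0 and sample standard deviation 1.
```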
To demonstrate this, we can use the iris flower dataset.
First, import the libraries we need, and load the iris flower dataset into an Ibis table:
import ibis
from ibis import _
import ibis.selectors as s
import ibis.expr.datatypes as dt
import ibis_ml as ml
import seaborn as sns
iris = ibis.memtable(sns.load_dataset("iris"))
This data contains four floating point number columns (named sepal_length, sepal_width, petal_length, and petal_width).
Next, split the data into training and test samples. IbisML has a function for doing this. There is no unique key column in the iris flower dataset, and IbisML needs one to split the data, so we add a row-number column, split the data, then drop it from both samples:
iris = iris.mutate(unique_id=ibis.row_number())
iris_train, iris_test = ml.train_test_split(iris, "unique_id")
iris_train = iris_train.drop("unique_id")
iris_test = iris_test.drop("unique_id")
Now we're ready to normalize the floating point columns.
It's straightforward to normalize all the floating point columns in a table with Ibis. For example, you could do this for the iris_train table:
iris_train.mutate(
    s.across(s.of_type(dt.float()), (_ - _.mean()) / _.std())
)
We could apply the same operation to the iris_test table. However, that is not what we want to do. In an ML preprocessing pipeline, you almost always want to calculate the parameters for the normalization (the mean and standard deviation in this case) on the training data, then use those same parameters to normalize the test data and inference data.
In other words, we do not want to compute a new mean and standard deviation for the test data and for each batch of inference data and use those to normalize that data. We want to use the mean and standard deviation of the training data to normalize the test and inference data.
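To make the distinction concrete, here is a minimal sketch with plain Python lists (illustrative numbers, not real iris data): the mean and standard deviation are computed on the training values only, then reused on the test values:

```python
# Fit the normalization parameters on the training data only...
train = [4.0, 5.0, 6.0]
train_mean = sum(train) / len(train)
train_std = (sum((v - train_mean) ** 2 for v in train) / (len(train) - 1)) ** 0.5

# ...then reuse those parameters on the test data, rather than
# recomputing a mean and standard deviation from the test data itself.
test = [5.0, 7.0]
test_normalized = [(v - train_mean) / train_std for v in test]
```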
So, using plain Ibis, we would have to compute the mean and standard deviation of all the floating point columns:
params = iris_train.aggregate(
    s.across(
        s.of_type(dt.float()),
        dict(mean=_.mean(), std=_.std())
    )
)
The result params contains eight values (four means and four standard deviations). We can put this in a pandas DataFrame:
params = params.to_pandas()
Then we would have to apply these eight values to the test data. This can be done manually like this:
iris_test.mutate(
    sepal_length=(_.sepal_length - params.sepal_length_mean[0]) / params.sepal_length_std[0],
    sepal_width=(_.sepal_width - params.sepal_width_mean[0]) / params.sepal_width_std[0],
    petal_length=(_.petal_length - params.petal_length_mean[0]) / params.petal_length_std[0],
    petal_width=(_.petal_width - params.petal_width_mean[0]) / params.petal_width_std[0],
)
Or it can be done using selectors like this:
iris_test.mutate(
    s.across(
        s.of_type(dt.float()),
        lambda x: (x - params[x.get_name() + "_mean"][0]) / params[x.get_name() + "_std"][0],
    )
)
We would also have to persist these eight values so we can use them to normalize the inference data.
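One way to persist them is to serialize the parameters with pickle (a sketch; the file name and the stand-in dict below are hypothetical, and in the example above params would be the pandas DataFrame we computed):

```python
import pickle
import tempfile

# Stand-in for the fitted parameters; in the example above this would
# be the pandas DataFrame of means and standard deviations.
params = {"sepal_length_mean": 5.84, "sepal_length_std": 0.83}

# Persist the parameters so inference code can reload and reuse them.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump(params, f)
    path = f.name

# Later, at inference time:
with open(path, "rb") as f:
    restored = pickle.load(f)
```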
Even in this very simple example, you can see that this type of operation is complicated to do with plain Ibis.
IbisML handles all of this for you. First, create a recipe that normalizes the floating point number columns, and fit it on the training data:
recipe = ml.Recipe(ml.ScaleStandard(ml.floating())).fit(iris_train)
Then use the fitted recipe to transform the test data:
result = recipe.to_ibis(iris_test)
You can also persist the recipe and use it to normalize the inference data.
This operation is simple and straightforward with IbisML.