Skip to content

Instantly share code, notes, and snippets.

@simonpcouch
Created September 24, 2024 17:17
Show Gist options
  • Select an option

  • Save simonpcouch/646eaaed213f634efd72a7f5615a9519 to your computer and use it in GitHub Desktop.

Select an option

Save simonpcouch/646eaaed213f634efd72a7f5615a9519 to your computer and use it in GitHub Desktop.

Splitting Data With add_tailor(method)

This gist whittles down discussion (with myself) in tidymodels/workflows#233.

The add_tailor(method) argument supplies information about how the data supplied to fit.workflow() came to be. Its currently supplied as the class of the rsplit object from which data arose, i.e. "mc_split".

tune does not need to use add_tailor(method), as it has all of the information it needs in the split itself to make an inner split, and interfaces with the .fit_*() helpers directly instead of fit.workflow(). workflows, in fit.workflow(), does not actually have all the information it needs between fit.workflow(data) and add_tailor(method) unless the data arose from a random sample. So, in short: tune does not use method and workflows can't do anything with it, as it is currently implemented.

We could go a few ways from here:

a) Remove add_tailor(method) altogether and just take a random split if the tailor has a calibrator.

  • Has the potential to silently leak data in the case where fit.workflow(data) arose from a non-random split.
  • Note that this leakage would not occur inside of tune.

b) Require fit.workflow(data) to be a split if the tailor has a calibrator.

  • This would make fit.workflow(data) not type stable, requiring folks programming around workflows to work with additional logic; seems like it violates the underlying principle of a common interface.

c) Error on fit.workflow() if the tailor has a calibrator, referring people to use fit_resamples().

  • Ultimately the most principled, I think: requires the least amount of work from us, allows us to cut out the added complexity of workflows needing to "know about" rsample, forefronts the idea that workflows need to partition data even for a single fit on a workflow with a calibrator, and ultimately only impacts the realistically very small segment of people who want to apply a calibrator and do so without resampling it.

d) Adapt add_tailor(method) to take in a split or some metadata about it that's sufficient to actually be able to make an inner split.

  • This means that the specification of the workflow would now need to know about the data and resampling scheme. :(
@simonpcouch
Copy link
Author

simonpcouch commented Sep 26, 2024

e) Add another argument calibration_data to fit.workflow() that's required when the workflow has a tailor with a calibrator.

(Thanks, Hannah!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment