@simonpcouch · Created February 3, 2025 16:02
Prompt for predictive modeling with databot

When asked to build a machine learning model, use the tidymodels framework. Rather than generating code that preprocesses, fits, and evaluates models all at once, carry out the analysis step by step (a minimal end-to-end sketch follows the list):

  1. Data splitting with rsample. Split into training and testing, and then split the training data into resamples. Do not touch the testing data until the very end of the analysis.
  2. Feature engineering with recipes. After defining the preprocessing recipe, stop generating and ask for user input. At this point, you can also ask what type of model the user would like to try.
  3. Resampling models with parsnip and tune. Based on the user's suggestions, decide on a parsnip model and tune important parameters across resamples using tune_grid().
    • Let tidymodels use its default performance metrics and parameter grids, and inspect the results with collect_metrics() and autoplot(); do not generate a custom grid in the first tune_grid() call.
    • Evaluate against resamples sequentially unless the user specifically asks for parallelism; if you do evaluate in parallel, use plan(multisession, workers = 4) rather than foreach.
    • Once the code finishes running, recommend some ways to improve the model.
  4. Iterate on the modeling workflow by repeating steps 2 and 3. At this point, you can update parameter ranges and introduce non-default performance metrics (to measure how well new modeling approaches address specific shortcomings), as you see fit.
  5. Once the user is satisfied, train the workflow on the whole training data and evaluate on the testing data using last_fit(), again using collect_metrics() to calculate the final metrics.
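
As a concrete reference, the sketch below runs the workflow above end to end. It assumes the built-in mtcars data, a normalization-only recipe, and a tuned rpart decision tree purely as placeholders; the real data, recipe steps, and model choice come from the user at each step.

```r
library(tidymodels)

# 1. Data splitting with rsample
set.seed(1)
car_split <- initial_split(mtcars)
car_train <- training(car_split)
car_test  <- testing(car_split)

set.seed(2)
car_folds <- vfold_cv(car_train, v = 5)

# 2. Feature engineering with recipes
car_rec <- recipe(mpg ~ ., data = car_train) %>%
  step_normalize(all_numeric_predictors())

# 3. Resample a parsnip model with tune, keeping the default metrics and grid
car_spec <- decision_tree(cost_complexity = tune(), min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

car_wflow <- workflow() %>%
  add_recipe(car_rec) %>%
  add_model(car_spec)

set.seed(3)
car_res <- tune_grid(car_wflow, resamples = car_folds)

collect_metrics(car_res)
autoplot(car_res)

# 5. Once the user is satisfied: finalize, train on the full training set,
#    and evaluate once on the testing set
car_final <- car_wflow %>%
  finalize_workflow(select_best(car_res, metric = "rmse")) %>%
  last_fit(car_split)

collect_metrics(car_final)
```

The pauses for user input (after the recipe, and between iterations of steps 2 and 3) happen between these chunks rather than inside them.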

In general, use tidymodels' default argument values at first. Also, set the seed any time you rely on random number generation: generating resamples, generating parameter grids, and running tuning functions.
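
For instance, continuing with the hypothetical objects from the sketch above, a seed goes immediately before each step that draws random numbers; the explicit space-filling grid is shown only as an example of grid generation, which belongs to later iterations rather than the first tune_grid() call.

```r
set.seed(123)
car_folds <- vfold_cv(car_train, v = 5)      # generating resamples

set.seed(456)
car_grid <- grid_latin_hypercube(            # generating a parameter grid
  extract_parameter_set_dials(car_wflow),    # (only in later iterations)
  size = 10
)

set.seed(789)
car_res <- tune_grid(car_wflow, resamples = car_folds, grid = car_grid)  # tuning
```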
