@JosiahParry
Created August 1, 2022 01:25
Okay, so the goal is to have access to tidymodels from Databricks. My answer is a more general approach to package access in Databricks. This approach will lead to slightly slower spin up time.
The idea is to have persistent storage in the form of an ADLS blob storage container that packages are installed to. Then, when you spin up a cluster, install any required system dependencies and add the mounted ADLS container to your library paths with `.libPaths()`.
You can mount the container using one of these two approaches:
- [directly to the workspace](https://docs.microsoft.com/en-us/azure/databricks/data/mounts)
- [using blobfuse](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-how-to-mount-container-linux)
If using blobfuse, it needs to be mounted in an init script.
Then, in a Databricks notebook with the storage container mounted, install packages like so:
```r
install.packages(
  pkgs = c("tidymodels", "other", "pkgs"),
  repos = "https://packagemanager.rstudio.com/cran/__linux__/focal/latest",
  lib = "/mnt/blob/container/pack"
)
```
Then, you will need to ensure that your cluster has the required system dependencies on startup. I personally use an `install-system-requirements.sh` script, which I created using {pak}. Find the system requirements for the desired packages with pak like so:
```r
pak::pkg_system_requirements("tidymodels", "ubuntu", "20.04")
```
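If you have more packages, iterate over them.
```r
# Each call can return zero or more install commands, so use lapply(), not vapply()
installs <- unique(unlist(lapply(
  c("tidymodels", "stringr"),
  pak::pkg_system_requirements,
  os = "ubuntu", os_release = "20.04"
)))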
```
Then write the results to a shell script with `writeLines(c("#!/bin/bash", installs, ""), "install-system-requirements.sh")`. Make that one of your init scripts.
Additionally, you'll need to change your `.libPaths()` in an R profile, either `.Rprofile` or `Rprofile.site` (what I use), so that R can find the packages installed in the container.
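As a minimal sketch of that last step, the profile entry could look something like the following. It assumes the container is mounted at the same `/mnt/blob/container/pack` path used in the install step. The `pkg_lib` variable name and the `dir.exists()` guard are my own additions for illustration, so adjust them to your setup.

```r
# Rprofile.site (or .Rprofile): put the mounted package library first on the search path.
# Assumption: the container is mounted at the path used for `lib` when installing.
pkg_lib <- "/mnt/blob/container/pack"

# Only prepend the library if the mount is present, so R still starts cleanly
# on a cluster where the container failed to mount.
if (dir.exists(pkg_lib)) {
  .libPaths(c(pkg_lib, .libPaths()))
}
```

After a cluster restart, running `.libPaths()` in a notebook should list the mounted path first if the profile was picked up.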