@arose13
Created January 9, 2020 14:46
How to get Dask to run on Databricks
#!/bin/bash
# CREATE DIRECTORY ON DBFS FOR LOGS
LOG_DIR=/dbfs/databricks/scripts/logs/$DB_CLUSTER_ID/dask/
HOSTNAME=$(hostname)
mkdir -p $LOG_DIR
# INSTALL DASK AND OTHER DEPENDENCIES
set -ex
/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python
conda install -y dask
conda install -y pandas=0.23.0
# START DASK – ON DRIVER NODE START THE SCHEDULER PROCESS
# ON WORKER NODES START WORKER PROCESSES
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  # Driver node: start the scheduler and record its PID for later cleanup
  dask-scheduler &>/dev/null &
  echo $! > $LOG_DIR/dask-scheduler.$HOSTNAME.pid
else
  # Worker nodes: connect to the scheduler on the driver (default port 8786)
  dask-worker tcp://$DB_DRIVER_IP:8786 --nprocs 4 --nthreads 8 &>/dev/null &
  echo $! > $LOG_DIR/dask-worker.$HOSTNAME.pid
fi
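
Once the cluster comes up with this init script attached, the scheduler listens on the driver at port 8786 (the default, and the port the workers connect to above). A minimal smoke test from a Databricks notebook might look like the sketch below; it assumes the notebook runs on the driver node, so the scheduler is reachable on localhost.

# Python (Databricks notebook): connect a Dask client to the scheduler
# started by the init script above and run a small distributed computation.
from dask.distributed import Client
import dask.array as da

client = Client("tcp://localhost:8786")
print(client)  # should show the workers started on the worker nodes

x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())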