willirath/SLURMCluster_vs_Singularity.ipynb

Last active December 8, 2024 16:08

Star (11) You must be signed in to star a gist
Fork (3) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/willirath/2176a9fa792577b269cb393995f43dda.js"></script>
Save willirath/2176a9fa792577b269cb393995f43dda to your computer and use it in GitHub Desktop.

Download ZIP

Dask-Jobqueue SLURMCluster with Singularity

Raw

SLURMCluster_vs_Singularity.ipynb

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

vsoch commented May 8, 2020

This is so neat - I must try it out! Do you have the rest of the code (e.g., recipe for the container and requirements.txt) to reproduce this? The notebook is run from a host machine via ssh - have you tried other ways to run the notebook (possibly from a head node directly?)

Author

willirath commented May 8, 2020

Here's an overview of all the moving parts:

Container: I used this Docker image built with jupyter-repo2docker in https://github.com/esm-vfc/esm-vfc-stacks/) and launched it with singularity run -B/tmp:/tmp -B $(mktemp -d):/run/user docker://esmvfc/esm-vfc-stacks:2020.05.06-52cf4f6 jupyter lab --no-browser --ip="0.0.0.0" on a head node of a Bull/Atos machine in Göttingen. A cleaner way would be to pull / convert the image before starting the container. Also, I'm not sure if pulling from the internet would work on compute nodes of many HPC centres.

Requirements: The environment is defined in the environment.yaml there. Because I was lazy and didn't have dask-jobqueue installed in the container, I just ran a %pip install dask_jobqueue in the notebook to get it. (Dask-Jobqueue is only needed by the process that orchestrates the Dask cluster. The workers living on the batch nodes are using vanilla Dask distributed to connect back to the Dask scheduler.)

SSH setup and sbatch adapters: I started the container with the JupyterLab and the Dask scheduler on a head node where I had a running SSH agent (that was forwarded from my laptop...), because I wanted to avoid setting up HPC-cluster internal SSH keys for this test. Agent was forwarded into the container by setting SINGULARITY_SSH_AUTH_SOCKET=$SSH_AUTH_SOCKET. In the sbatch etc. adapters above, I need to ssh back into the same host that is also running the container, because Dask-Jobqueue writes a job-script to /tmp/<jobscript> and then runs sbatch on this. Without the -B /tmp:/tmp, this would have been more involved.

Networking Laptop to Jupyter: Uses this approach that tunnels into a head node of the HPC cluster using dynamic port forwarding. With--ip="0.0.0.0" for JupyterLab, there is, however, no difference to running Jupyter on a compute node. People around me usually do this with this job script (which is for a non-container Jupyter installation managed with Conda).

Here, I've had everything in containers, but there's no big blockers against having all or parts of the notebook stack directly on the host system. Dask, however, generally assumes that the environment seen by the Dask scheduler is identical or at least very similar to the one seen by the workers.

vsoch commented May 8, 2020

Beautiful, thank you!

Author

willirath commented May 8, 2020

Thanks! I'd love to know if you have any remarks on this.

vsoch commented Jul 29, 2020

hey @willirath - sorry it took me so long to test this out! I just tested this on our cluster, and my first shot (using an OnDemand notebook) I wasn't able to even start the dask cluster - this error in particular: dask/distributed#3156. I then moved to an interactive session and could confirm that I could start the cluster outside of the nodebook, so I decided to go with that. I took notes here. It seemed to do okay when I wanted a fairly small cluster, but when I tried to scale it, the interface was (and still is) essentially frozen, because getting nodes allocated just takes that long. It's been over 30 minutes and still hanging on that submit, so I don't think (at least for our crowded cluster) this would be an approach that many folks would jump into. This makes me sad, because it's another failure on my part to find some modern way for our users to submit jobs. The cloud is so easy and fast, and local resources are always painful I'm afraid. :/

Author

willirath commented Jul 30, 2020

Hi @vsoch, I've left a few comments in your repo (not sure if in-line comments on the commit are a good way to do this).

In working with Dask on HPC, I've found that the typical scheduler configs often are not made with many small jobs in mind. As Dask jobqueue just submits one job per node, this often is not a good fit. An alternative that might be closer to how the HPC admins want their machine to be used is dask-mpi.

vsoch commented Jul 30, 2020

Oh interesting! I'm going to try the dask-mpi right now. We definitely have mpi on our cluster.

Author

willirath commented Jul 30, 2020 via email

You could also skip the automatic deployment methods entirely and use `srun` to start a scheduler using the CLI `dask-scheduler` with the scheduler file arg and point `srun ... dask-worker` calls to the scheduler via the scheduler file.

vsoch commented Jul 30, 2020

I could try that - I just did a search for "srun" in the dask docs and came up empty. Is there an example somewhere of this?

Author

willirath commented Jul 31, 2020

See this gist for using salloc and srun with the Dask cli.

vsoch commented Jul 31, 2020

Okay testing this out! When you say

Within the allocation, make sure you have dask etc (e.g. by activating a virtual or conda env).

Do you mean to have the environment sourced in my profile or something else?

Author

willirath commented Jul 31, 2020

No, I mean right after the sallloc you get a fresh shell where you need to, e.g., conda activate the env that has Dask installed.

(I think it's rarely advisable to activate envs automatically.)

vsoch commented Jul 31, 2020

haha okay, I haven't gotten my allocation yet so that's why I haven't seen that yet :)

Author

willirath commented Jul 31, 2020 via email

You might be able to speed that up by explicitly setting a relatively short wall time (1 hour or so). Depends on how the SLURM scheduler is configured but I was told they often like short jobs that can be used for backfilling while the big jobs wait for a large chunk of the cluster to become available.

thomasarsouze commented Dec 14, 2020

Hi @willirath.
Thanks a lot for this exemple, that's really something I want to make work. I've been trying to reproduce on a small local cluster that uses SLURM.

So:

I've created an image of the container: singularity build --remote esm-vfc-stacks_latest.sif docker://esmvfc/esm-vfc-stacks:latest
then run it: singularity run esm-vfc-stacks_latest.sif jupyter lab --no-browser --ip=login0.frioul
Running inside the notebook your steps: pip install dask_jobqueue, overwritting sbatch, squeue, scancel. So I have a job_script so looks correct:

#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -e /home/arsouze/pangeo/tests/logs/dask-worker-%J.err
#SBATCH -o /home/arsouze/pangeo/tests/logs/dask-worker-%J.out
#SBATCH -n 1
#SBATCH --cpus-per-task=40
#SBATCH --mem=159G
#SBATCH -t 00:20:00
#SBATCH -C quad,cache
export I_MPI_DOMAIN=auto
export I_MPI_PIN_RESPECT_CPUSET=0
module load intel intelmpi
singularity run /home/arsouze/pangeo/esm-vfc-stacks_latest.sif python -m distributed.cli.dask_worker tcp://195.83.183.81:36584 --nthreads 10 --nprocs 4 --memory-limit 42.50GB --name name --nanny --death-timeout 60 --local-directory $SCRATCHDIR --host ${SLURMD_NODENAME}.frioul

but when I try to scale I have the following error message:

Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py:623> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 255\nCommand:\nsbatch /tmp/tmplbeliirt.sh\nstdout:\n\nstderr:\n\n')>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/deploy/spec.py", line 50, in _
    await self.start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_jobqueue/core.py", line 310, in start
    out = await self._submit_job(fn)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_jobqueue/core.py", line 293, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_jobqueue/core.py", line 393, in _call
    "stderr:\n{}\n".format(proc.returncode, cmd_str, out, err)
RuntimeError: Command exited with non-zero exit code.
Exit code: 255
Command:
sbatch /tmp/tmplbeliirt.sh
stdout:

stderr:

and the slurm job doesn't get submitted...
I think that at this point, I'm not clear on what is failing in the workflow. Do you have any hints ?

Author

willirath commented Dec 14, 2020 •

edited

Loading

I'd debug this with manual slurm commands:

Can you use the sbatch over SSH workaround? I'd check by trying if,

ssh $(hostname) -q -t ". /etc/profile && squeue [maybe more args]"

gives the same output as the equivalend squeue command directly run on the host machine.

Then, the question is, if the job script that is written to some tmp location by the SLURMCluster running within the container is readable on the host machine.
Finally, can you sumbit the job script from the tmp file with sbatch via SSH?

(edited: Add third point)

Author

willirath commented Dec 14, 2020

Can you SSH back into the host machine? The error could stem from ssh $(hostname) not being possible (at least without a password). You'd need to set up SSH keys for this.

thomasarsouze commented Dec 15, 2020

Thx for your quick answer. Indeed, thanks to a local security feature provided by our admins, ssh $(hostname) is refused ! Will resume once they remove that.

kathoef commented Mar 9, 2021

We have come up with a little convenience tool that provides a structured way of bind mounting host-system SLURM libraries into a Singularity container session and thus enables the batch scheduler commands. This approach omits the SSH restrictions that system administrators might have set up (we also use such an HPC system, which has motivated that development).

All you need is to come up with a system-specific "configuration file" (which needs basically a one-time exploratory session with a few strace commands to isolate the necessary batch scheduler shared libraries and configuration files). Make sure you have read the compatibility section, though, as there are a few limitations: https://github.com/ExaESM-WP4/Batch-scheduler-Singularity-bindings

/cc @vsoch and @thomasarsouze

poplarShift commented May 9, 2022

This is great, thank you so much for this! Only needed minor tweaks to account for differences in my cluster and the singularity version.

willirath/SLURMCluster_vs_Singularity.ipynb

vsoch commented May 8, 2020

willirath commented May 8, 2020

vsoch commented May 8, 2020

willirath commented May 8, 2020

vsoch commented Jul 29, 2020

willirath commented Jul 30, 2020

vsoch commented Jul 30, 2020

willirath commented Jul 30, 2020 via email

vsoch commented Jul 30, 2020

willirath commented Jul 31, 2020

vsoch commented Jul 31, 2020

willirath commented Jul 31, 2020

vsoch commented Jul 31, 2020

willirath commented Jul 31, 2020 via email

thomasarsouze commented Dec 14, 2020

willirath commented Dec 14, 2020 • edited Loading

willirath commented Dec 14, 2020

thomasarsouze commented Dec 15, 2020

kathoef commented Mar 9, 2021

poplarShift commented May 9, 2022

willirath commented Dec 14, 2020 •

edited

Loading