I'd debug this with manual SLURM commands:
- Can you use the `sbatch`-over-SSH workaround? I'd check by testing whether `ssh $(hostname) -q -t ". /etc/profile && squeue [maybe more args]"` gives the same output as the equivalent `squeue` command run directly on the host machine. (A sketch of the wrapper scripts used for this workaround follows the list below.)
- Then the question is whether the job script that the `SLURMCluster` running inside the container writes to some tmp location is readable on the host machine.
- Finally, can you submit the job script from the tmp file with `sbatch` via SSH?
- Can you SSH back into the host machine? The error could stem from `ssh $(hostname)` not being possible (at least without a password). You'd need to set up SSH keys for this.
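
For reference, a minimal sketch of what the `sbatch`-over-SSH workaround can look like: a small wrapper script placed early on the container's `PATH` that forwards the SLURM command to the host via SSH. The file location and extra arguments here are assumptions and will need adjusting for a specific cluster.

```bash
#!/bin/bash
# Hypothetical wrapper, e.g. saved as ~/bin/sbatch inside the container and
# put first on PATH. It forwards the call to the host machine over SSH so the
# host's real sbatch handles the submission. Requires passwordless SSH back
# to the host (see the last point above).
exec ssh -q "$(hostname)" -t ". /etc/profile && sbatch $*"
```

Analogous wrappers for `squeue` and `scancel` would cover the other commands that `SLURMCluster` calls.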
Thx for your quick answer. Indeed, thanks to a local security feature provided by our admins, `ssh $(hostname)` is refused! Will resume once they remove that.
We have come up with a little convenience tool that provides a structured way of bind mounting host-system SLURM libraries into a Singularity container session and thus enables the batch scheduler commands inside the container. This approach avoids the SSH restrictions that system administrators might have set up (we also use such an HPC system, which motivated the development).
All you need is a system-specific "configuration file" (which basically takes a one-time exploratory session with a few `strace` commands to isolate the necessary batch scheduler shared libraries and configuration files). Make sure you read the compatibility section, though, as there are a few limitations: https://github.com/ExaESM-WP4/Batch-scheduler-Singularity-bindings
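
To illustrate the general idea only: a bind-mount invocation might look roughly like the following. All paths are placeholders; the real ones are cluster-specific and are exactly what the configuration file in the linked repository captures (found via `strace` on the host's `sbatch`).

```bash
# Hypothetical example: bind the host's SLURM client commands, shared
# libraries and configuration into the container so that sbatch/squeue/
# scancel work inside it. Paths below are placeholders for illustration.
singularity shell \
  --bind /usr/bin/sbatch,/usr/bin/squeue,/usr/bin/scancel \
  --bind /usr/lib64/slurm \
  --bind /etc/slurm \
  --bind /var/run/munge \
  esm-vfc-stacks_latest.sif
```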
/cc @vsoch and @thomasarsouze
This is great, thank you so much for this! Only needed minor tweaks to account for differences in my cluster and the singularity version.
Hi @willirath.
Thanks a lot for this example, that's really something I want to make work. I've been trying to reproduce it on a small local cluster that uses SLURM.
So:

```
singularity build --remote esm-vfc-stacks_latest.sif docker://esmvfc/esm-vfc-stacks:latest
singularity run esm-vfc-stacks_latest.sif jupyter lab --no-browser --ip=login0.frioul
```
Then, in the JupyterLab session, I `pip install dask_jobqueue` and set up the `SLURMCluster`, overwriting `sbatch`, `squeue`, and `scancel`. The generated job script looks correct, but when I try to scale, I get an error message and the SLURM job doesn't get submitted...
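
A minimal sketch of the kind of setup described above, assuming the standard `dask_jobqueue` API (queue name and resource values are placeholders, not the actual configuration used on this cluster):

```python
# Minimal sketch of a SLURMCluster created from inside the container.
# Queue name and resources are placeholders.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="batch",        # placeholder partition name
    cores=4,
    memory="8GB",
    walltime="00:30:00",
)

print(cluster.job_script())  # inspect the generated job script

cluster.scale(jobs=1)        # this is the step that triggers sbatch
client = Client(cluster)
```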
At this point, I'm not clear on what is failing in the workflow. Do you have any hints?