Being able to SSH directly into a compute job lets you use your full remote development toolchain, including your IDE's debugger (VS Code, PyCharm, ...), even for GPU jobs.
- Slurm: the job scheduling system used by many HPC clusters
- Enroot: NVIDIA's container runtime, similar to Docker
General problem:
Containerized compute jobs are not directly reachable via SSH from your local machine (your laptop or PC). In addition, many HPC clusters do not provide internet access on their compute nodes (for security reasons).
Proposed solution:
Run your own SSH server within your compute job and make an SSH tunnel from your local machine through the login node, through the compute node, and finally into the compute job.
LOCAL MACHINE -> LOGIN NODE -> COMPUTE NODE -> CONTAINERIZED COMPUTE JOB
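Conceptually, this chain is just nested SSH jumps. As a sketch, the one-off equivalent of the config below (host names are placeholders, `<SSH port>` is the port sshd listens on inside the job) would be:

```shell
# Hop through the login node and the compute node, then connect to the
# sshd running inside the job, which listens on localhost:<SSH port>
# of the compute node.
ssh -J <user>@<login node>,<user>@<compute node> \
    -p <SSH port> <user>@localhost
```

The ssh config further below automates exactly this, including looking up the compute node from the running job's name.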
This Docker image installs an OpenSSH server into NVIDIA's PyTorch image (depending on your setup you may want to change the base image or install additional software):
```dockerfile
# ./Dockerfile
FROM nvcr.io/nvidia/pytorch:22.08-py3
USER root
RUN apt-get update && apt-get install -y openssh-server sudo
# change the port (pick something unique, see notes below) and log verbosely
RUN echo "Port <SSH port>" >> /etc/ssh/sshd_config
RUN echo "LogLevel DEBUG3" >> /etc/ssh/sshd_config
# create sshd's privilege separation directory and generate host keys
RUN mkdir -p /run/sshd
RUN ssh-keygen -A
# init conda env
RUN conda init
EXPOSE <SSH port>
CMD ["/usr/sbin/sshd", "-D", "-e"]
```
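Before involving the cluster, you can smoke-test the image locally. This is a sketch assuming Docker is available on your machine and you substituted a concrete port (here 2222) in the Dockerfile; the image tag is arbitrary:

```shell
# Build the image and start a throwaway container.
docker build -t ssh-devcontainer .
docker run -d --rm --name ssh-test -p 2222:2222 ssh-devcontainer
# If sshd is up, this prints the container's SSH host keys.
ssh-keyscan -p 2222 localhost
docker stop ssh-test
```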
In order to use the Docker image with Slurm, you need to push it to Docker Hub and then import it with Enroot:
```shell
# build the image
docker build -t <your username>/<your image>:latest .
# push the image
docker push <your username>/<your image>:latest
# import with enroot
srun enroot import -o <your image path>.sqsh docker://<your username>/<your image>:latest
```
```
# add to ~/.ssh/config
# replace <user> with your username
# replace <job name> with your job name
Host devcontainer.dfki
    User <user>
    Port <SSH port>
    HostName localhost
    ProxyJump devnode.dfki
    CheckHostIP no
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

Host devnode.dfki
    User <user>
    CheckHostIP no
    ProxyCommand ssh slurm.dfki "nc \$(squeue --me --name=<job name> --states=R -h -O NodeList) 22"
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

Host slurm.dfki
    User <user>
    HostName <login node>
```
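Before attempting the full connection, you can check that the node lookup used by the `ProxyCommand` actually finds your running job (the job name is a placeholder):

```shell
# Prints the node your running job was scheduled on; empty output means
# no running job with that name was found.
ssh slurm.dfki "squeue --me --name=<job name> --states=R -h -O NodeList"
```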
When starting the job you must set `--no-container-remap-root`:
```shell
srun -K \
  --container-mounts=/home/$USER:/home/$USER \
  --container-workdir=$(pwd) \
  --container-image=<your image path>.sqsh \
  --ntasks=1 --nodes=1 -p <your partition> \
  --gpus=1 \
  --job-name <your job name> --no-container-remap-root \
  --time 12:00:00 /usr/sbin/sshd -D -e
```
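Since your home directory is mounted into the container, the sshd in the job authenticates against your usual `~/.ssh/authorized_keys` on the cluster. If you have no key set up yet, a minimal sketch (key type and path are just common defaults):

```shell
# Generate a key pair on your local machine (skip if you already have one).
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
# Append the public key to ~/.ssh/authorized_keys on the cluster.
ssh-copy-id -i ~/.ssh/id_ed25519.pub slurm.dfki
```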
Now connect from your local machine:

```shell
ssh devcontainer.dfki
```

That's it!
- The sshd port is hard-coded into the image. This will cause conflicts as soon as multiple people use this setup on the same node, so make sure to change the port to something unique.
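One simple way to pick a port that is unlikely to collide with other users is to derive it from your numeric user id; the range below is an arbitrary choice in the unprivileged port space:

```shell
# Derive a per-user port between 20000 and 29999 from the numeric user id.
SSH_PORT=$((20000 + $(id -u) % 10000))
echo "$SSH_PORT"
```

Use the resulting value both in the Dockerfile's `Port` line and in the `Port` entry of your ssh config.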
Note that the `EXPOSE` directive is completely irrelevant for enroot; sshd listens on whatever port you specify in `sshd_config`. Also, if you're already using enroot, consider `enroot batch`, like enroot uses for its build process, with your `devimage.batch` being: