Connect via SSH to a Slurm compute job that runs as an Enroot container

Being able to SSH directly into a compute job lets you use remote development tools, such as your IDE's debugger (VS Code, PyCharm, ...), for GPU jobs as well.

  • Slurm: Scheduling system that many HPC clusters use
  • Enroot: Container system like Docker for NVIDIA GPUs

General problem:

Containerized compute jobs are not directly reachable via SSH from your local machine (your notebook or PC). Also, many HPC clusters do not provide internet access on their compute nodes (for security reasons).

Proposed solution:

Run your own SSH server within your compute job and make an SSH tunnel from your local machine through the login node, through the compute node, and finally into the compute job.

LOCAL MACHINE -> LOGIN NODE -> COMPUTE NODE -> CONTAINERIZED COMPUTE JOB  
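
The hop chain above can also be written as a single ssh command using OpenSSH's -J (ProxyJump) option. The following is only a sketch, assuming OpenSSH 7.3 or newer on the local machine and that the compute node accepts SSH connections from the login node. The helper name tunnel_cmd and its four arguments are hypothetical, and the function only prints the command instead of running it:

```shell
#!/usr/bin/env bash
# tunnel_cmd is a hypothetical helper; all four arguments are placeholders you supply.
# It prints (does not run) an ssh command that hops
# login node -> compute node -> sshd inside the compute job.
tunnel_cmd() {
    local user="$1" login_node="$2" compute_node="$3" ssh_port="$4"
    # -J lists the jump hosts in order. The final "localhost" is resolved on the
    # last jump host (the compute node), where the job's sshd is expected to listen.
    printf 'ssh -J %s@%s,%s@%s -p %s %s@localhost\n' \
        "$user" "$login_node" "$user" "$compute_node" "$ssh_port" "$user"
}
```

Calling tunnel_cmd with your username, login node, compute node and SSH port prints a command you can paste into a terminal once the job is running.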

Custom image with SSHD installed

This Dockerfile installs an OpenSSH server into NVIDIA's PyTorch image (depending on your setup, you may want to change the base image or install additional software):

# ./Dockerfile
FROM nvcr.io/nvidia/pytorch:22.08-py3

USER root
# update the package index and install in one layer so the index is never stale
RUN apt-get update && apt-get install -y openssh-server sudo

# change port and enable verbose logging
RUN echo "Port <SSH port>" >> /etc/ssh/sshd_config
RUN echo "LogLevel DEBUG3" >> /etc/ssh/sshd_config

RUN mkdir -p /run/sshd
RUN ssh-keygen -A

# note: this only starts sshd during the build; at runtime sshd is started by CMD or by srun
RUN service ssh start

# init conda env
RUN conda init

EXPOSE <SSH port>

CMD ["/usr/sbin/sshd","-D", "-e"]

To use the Docker image with Slurm, push it to Docker Hub and then import it with Enroot:

# build the image
docker build -t <your username>/<your image>:latest .

# push the image
docker push <your username>/<your image>:latest 

# import with enroot
srun enroot import -o <your image path>.sqsh docker://<your username>/<your image>:latest

Adjust your own SSH config (~/.ssh/config)

# add to ~/.ssh/config
# replace <user> with your username
# replace <job name> with your job name
# replace <SSH port> and <login node> with the values for your setup

Host devcontainer.dfki
	User <user>
	Port <SSH port>
	HostName localhost
	ProxyJump devnode.dfki
	CheckHostIP no
	StrictHostKeyChecking=no
	UserKnownHostsFile=/dev/null

Host devnode.dfki
	User <user>
	CheckHostIP no
	ProxyCommand ssh slurm.dfki "nc \$(squeue --me --name=<job name> --states=R -h -O NodeList) 22"
	StrictHostKeyChecking=no
	UserKnownHostsFile=/dev/null

Host slurm.dfki
	User <user>
	HostName <login node>

Start compute job

You must set --no-container-remap-root.

srun -K \
    --container-mounts=/home/$USER:/home/$USER \
    --container-workdir=$(pwd) \
    --container-image=<your image path>.sqsh \
    --ntasks=1 --nodes=1 -p <your partition> \
    --gpus=1 \
    --job-name <your job name> --no-container-remap-root \
    --time 12:00:00 /usr/sbin/sshd -D -e
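
The ProxyCommand in the SSH config only works once the job is actually running. The following sketch, meant to be run on the login node where squeue is available, waits until the job is in state R and prints its node. It reuses the squeue query from the SSH config; job_node and wait_for_job are hypothetical helper names, and the retry count and delay are arbitrary defaults:

```shell
#!/usr/bin/env bash
# job_node and wait_for_job are hypothetical helpers, not part of Slurm.

# Print the node of the running job with the given name (empty if not running).
job_node() {
    local job_name="$1"
    # Same query as in the ProxyCommand. -O pads the column with blanks,
    # so awk keeps only the first field of the first non-empty line.
    squeue --me --name="$job_name" --states=R -h -O NodeList | awk 'NF {print $1; exit}'
}

# Poll until the job is running; print its node, or return 1 after all tries.
wait_for_job() {
    local job_name="$1" tries="${2:-30}" delay="${3:-10}" node i
    for ((i = 0; i < tries; i++)); do
        node="$(job_node "$job_name")"
        if [ -n "$node" ]; then
            echo "$node"
            return 0
        fi
        sleep "$delay"
    done
    return 1
}
```

For example, wait_for_job "<your job name>" prints the node name as soon as the job has started, after which the SSH connection can be opened.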

Connect to compute job

ssh devcontainer.dfki

That's it!

Issues

  • The sshd port is hard-coded. This causes conflicts as soon as several people use this setup on the same compute node, so change the port to something unique to you.
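
One way to get a port that differs per user is to derive it from the numeric user ID. This is only a sketch under assumptions: the helper name user_ssh_port and the range 20000-29999 are hypothetical choices, and two users whose UIDs share the last four digits would still collide:

```shell
#!/usr/bin/env bash
# user_ssh_port is a hypothetical helper: map a numeric UID to a port in 20000-29999.
# With no argument it uses the UID of the current user.
user_ssh_port() {
    local uid="${1:-$(id -u)}"
    echo $(( 20000 + uid % 10000 ))
}
```

The resulting number has to be used consistently on both sides: as Port in ~/.ssh/config and for sshd itself. sshd accepts -p <port> on the command line, which takes precedence over Port entries in sshd_config, so the port does not have to be baked into the image.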

krono commented Nov 14, 2024

Note that EXPOSE is completely irrelevant for Enroot. sshd will listen on whatever port you specify in the sshd config.
Also, if you're already using Enroot, consider enroot batch, which Enroot itself uses for its build process,

e.g.:

  • import your image, modify it and then export it as new.
enroot import -o devbase.sqsh 'docker://nvcr.io#nvidia/pytorch:24.10-py3'
enroot create devbase.sqsh
SSH_PORT=22002 enroot batch devimage.batch
enroot export -o devimage.sqsh devbase

with your devimage.batch being:

#! /usr/bin/enroot batch
#ENROOT_REMAP_ROOT=y
#ENROOT_ROOTFS_WRITABLE=y
#ENROOT_ROOTFS=${ENROOT_ROOTFS:-devbase}

environ() {
    echo "SHELL=/bin/bash"
}

rc() {
    apt-get -y update && apt-get -y install --no-install-recommends openssh-server sudo
    echo "Port ${SSH_PORT-22}" >> /etc/ssh/sshd_config.d/port.conf
    echo "LogLevel DEBUG3" >> /etc/ssh/sshd_config.d/logging.conf
    echo "service ssh start" >> /opt/nvidia/entrypoint.d/90-ssh.sh
    echo "conda init" >> /opt/nvidia/entrypoint.d/99-conda.sh
}
