@ckandoth
Last active May 19, 2024 07:49
Install Slurm 19.05 on a standalone machine running Ubuntu 20.04

Use apt to install the necessary packages:

sudo apt install -y slurm-wlm slurm-wlm-doc

Load file:///usr/share/doc/slurm-wlm/html/configurator.html in a browser (or file://wsl%24/Ubuntu/usr/share/doc/slurm-wlm/html/configurator.html on WSL2), and:

  1. Set your machine's hostname in SlurmctldHost and NodeName.
  2. Set CPUs as appropriate, and optionally Sockets, CoresPerSocket, and ThreadsPerCore. Use the lscpu command to see what you have.
  3. Set RealMemory to the number of megabytes you want to allocate to Slurm jobs.
  4. Set StateSaveLocation to /var/spool/slurm-llnl.
  5. Set ProctrackType to LinuxProc (written as proctrack/linuxproc in the generated file), because processes are less likely to escape Slurm control on a single-machine setup.
  6. Make sure SelectType is set to cons_res (select/cons_res), and set SelectTypeParameters to CR_Core_Memory.
  7. Set JobAcctGatherType to Linux (jobacct_gather/linux) to gather resource usage per job, and set AccountingStorageType to FileTxt (accounting_storage/filetxt).
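For reference, after these steps the relevant lines of the generated slurm.conf should look roughly like the sketch below. The hostname and hardware numbers are placeholders for your own machine, not values from this guide; the full plugin names match what later comments in this thread report Slurm actually requires:

```
SlurmctldHost=mybox
NodeName=mybox CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=16000 State=UNKNOWN
StateSaveLocation=/var/spool/slurm-llnl
ProctrackType=proctrack/linuxproc
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/filetxt
```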

Hit Submit, and save the resulting text into /etc/slurm-llnl/slurm.conf i.e. the configuration file referred to in /lib/systemd/system/slurmctld.service and /lib/systemd/system/slurmd.service.

Load /etc/slurm-llnl/slurm.conf in a text editor, uncomment DefMemPerCPU, and set it to 8192 or whatever number of megabytes you want each job to request if not explicitly requested using --mem during job submission. Read the docs and edit other defaults as you see fit.
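If you're unsure what number to give RealMemory, a small sketch like this reads /proc/meminfo and leaves some headroom for the OS. The 2048 MB reservation is just an assumption to tune for your machine:

```shell
# Suggest a RealMemory value (MB) for slurm.conf: total RAM minus OS headroom.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
total_mb=$((total_kb / 1024))
headroom_mb=2048   # assumption: reserve ~2 GB for the OS itself
echo "RealMemory=$((total_mb - headroom_mb))"
```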

Create /var/spool/slurm-llnl and /var/log/slurm_jobacct.log, then set ownership appropriately:

sudo mkdir -p /var/spool/slurm-llnl
sudo touch /var/log/slurm_jobacct.log
sudo chown slurm:slurm /var/spool/slurm-llnl /var/log/slurm_jobacct.log

Install mailutils so that Slurm won't complain about /bin/mail missing:

sudo apt install -y mailutils

Make sure munge is installed and running, and that /etc/munge/munge.key exists with user-only read permissions, owned by munge:munge:

sudo service munge start
sudo ls -l /etc/munge/munge.key
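If the key's ownership or mode looks wrong, a fix along these lines should bring it in line with what the step above expects (0400, owned by munge:munge), and a local round trip confirms munge works:

```shell
# Tighten munge.key to user-only read access, owned by munge:munge.
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
# Round-trip check: encode a credential and decode it locally.
munge -n | unmunge
```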

Start services slurmctld and slurmd:

sudo service slurmd start
sudo service slurmctld start
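A quick way to confirm the single-node cluster is actually up (standard Slurm commands; the exact output depends on your hostname and partition names):

```shell
# The node should appear as idle in the default partition...
sinfo
# ...and a trivial one-task job should run and print your hostname.
srun -N1 -n1 hostname
```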
@Lihua1990

Thanks a lot, it worked!

Best,
Lihua

@hmamine

hmamine commented Sep 30, 2021

Hi, and thank you for sharing! I have run into a somewhat similar issue. I changed the hostname, and also set ControlMachine to the same exact name, and now I'm hitting the following issue (see below). I'd appreciate any input on how to fix this one.
slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Thu 2021-09-30 12:02:02 EDT; 13s ago
Process: 26000 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)

Sep 30 12:01:11 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:21 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:31 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:41 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:51 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:52 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Start operation timed out. Terminating.
Sep 30 12:02:01 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: Failed to start Slurm node daemon.
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Unit entered failed state.
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Failed with result 'timeout'.

@mkasemer

mkasemer commented Sep 30, 2021

All,

I had similar issues. I found the following to be helpful: https://blog.llandsmeer.com/tech/2020/03/02/slurm-single-instance.html

I can confirm that this works on Ubuntu 20, with Slurm 19.05.5-1 (which installs through apt).

I have copied the steps below:

Set up munge

$ sudo apt install munge

Test if it works:

$ munge -n | unmunge
STATUS:           Success (0)
[...]

Set up MariaDB

$ sudo apt install mariadb-server
$ sudo mysql -u root
create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit

Set up SLURM

$ sudo apt install slurmd slurm-client slurmctld

Use configurator.html to create the SLURM config file. There is one online, but it only matches the latest version.

Find out which version you have (dpkg -l | grep slurm; mine was 17.11.2). Go to https://www.schedmd.com/archives.php and download the package corresponding to your version (I ended up with a small version mismatch, but it worked out anyway).

Unpack and enter the directory, then build and open the Configuration Tool:

$ cd slurm-17.11.10
$ ./configure
$ make html
$ xdg-open doc/html/configurator.html

(mkasemer: you may need to edit more here depending on your machine, but these basics worked for me)
Fill in all NodeName/Hostname fields with your own hostname (from hostname(1)).
For testing, fill in root for SlurmUser.
Make sure the slurmd and slurmctld PID file paths are the same as listed in the systemd unit files (e.g., /lib/systemd/system/slurmd.service).
You might want to look at the Number of CPUs setting (mkasemer: I edited sockets, cores per socket, and threads per core; I left number of CPUs blank and let Slurm figure it out from those values).
Copy-paste the result into /etc/slurm-llnl/slurm.conf.

Next, create a file /etc/slurm-llnl/cgroup.conf:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes

Restart daemons

sudo systemctl restart slurmctld
sudo systemctl restart slurmd

Running sinfo should show no errors:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle a715

Test an actual job

Run sleep 1 on 8 processors:

$ time srun -n8 sleep 1
srun -n8 sleep 1  -- 1,20s (0,01s(u) + 0,00s(s) 6kb 0+49 ctx)

Some useful debugging commands:

$ slurmctld -D
$ slurmd -D
$ sinfo

Note from mkasemer
When attempting to use this with OpenMPI installed via apt, it had issues (see here for a complete description of the problem). Specifically, when using a submission script of the following form (using srun, as is often suggested):

#!/bin/bash
#SBATCH -J jobname
#SBATCH -e error.%A
#SBATCH -o output.%A
#SBATCH -N 1
#SBATCH -n 2

srun binary

srun would not work properly. There would be MPI Initialization errors all over, despite MPI being installed with Slurm support. The fix that works is:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -e error.%A
#SBATCH -o output.%A
#SBATCH -N 1
#SBATCH -n 2

mpirun -np 2 binary

You have to specify both the number of slots for Slurm to reserve (using #SBATCH -n) and the number of slots for mpirun to use (using mpirun -np). A little annoying, but it works properly.

@hmamine

hmamine commented Sep 30, 2021

Thank you, I was able to follow it up to this point. Would you have any suggestions for debugging? Thanks
$ sinfo
slurm_load_partitions: Unable to contact slurm controller (connect failure)

@mkasemer

I am not an expert at all, and unfortunately cannot help. I only followed the instructions given above and they worked.

@gangadharsingh056

When running apt install munge I get this error: "Errors were encountered while processing: postfix / E: Sub-process /usr/bin/dpkg returned an error code (1)". Below is the full output, followed by the status of slurmd.service:
root@:/etc/slurm-llnl# sudo apt install munge
Reading package lists... Done
Building dependency tree
Reading state information... Done
munge is already the newest version (0.5.13-2build1).
munge set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 52 not upgraded.
1 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] Y
Setting up postfix (3.4.13-0ubuntu1.2) ...

Postfix (main.cf) configuration was not changed. If you need to make changes,
edit /etc/postfix/main.cf (and others) as needed. To view Postfix
configuration values, see postconf(1).

After modifying main.cf, be sure to run 'systemctl reload postfix'.

Running newaliases
newaliases: warning: valid_hostname: misplaced hyphen: gpunode1-wlp0s20f3.--
newaliases: fatal: file /etc/postfix/main.cf: parameter myhostname: bad parameter value: gpunode1-wlp0s20f3.--
dpkg: error processing package postfix (--configure):
installed postfix package post-installation script subprocess returned error exit status 75
Processing triggers for libc-bin (2.31-0ubuntu9.2) ...
Errors were encountered while processing:
postfix
E: Sub-process /usr/bin/dpkg returned an error code (1)

root@gpunode1:/etc/slurm-llnl# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-10-01 11:12:34 IST; 3min 49s ago
Docs: man:slurmd(8)
Main PID: 26727 (slurmd)
Tasks: 2
Memory: 2.9M
CGroup: /system.slice/slurmd.service
└─26727 /usr/sbin/slurmd

Oct 01 11:16:09 gpunode1 slurmd-gpunode1[26727]: error: Unable to register: Resource temporarily unavailable
Oct 01 11:16:10 gpunode1 slurmd-gpunode1[26727]: error: Unable to resolve "linuxK": Host name lookup failure
Oct 01 11:16:10 gpunode1 slurmd-gpunode1[26727]: error: Unable to establish control machine address
Oct 01 11:16:10 gpunode1 slurmd-gpunode1[26727]: error: Unable to register: Resource temporarily unavailable
Oct 01 11:16:12 gpunode1 slurmd-gpunode1[26727]: error: Unable to resolve "linuxK": Host name lookup failure
Oct 01 11:16:12 gpunode1 slurmd-gpunode1[26727]: error: Unable to establish control machine address
Oct 01 11:16:12 gpunode1 slurmd-gpunode1[26727]: error: Unable to register: Resource temporarily unavailable
Oct 01 11:16:13 gpunode1 slurmd-gpunode1[26727]: error: Unable to resolve "linuxK": Host name lookup failure
Oct 01 11:16:13 gpunode1 slurmd-gpunode1[26727]: error: Unable to establish control machine address
Oct 01 11:16:13 gpunode1 slurmd-gpunode1[26727]: error: Unable to register: Resource temporarily unavailable

Any suggestions on how to resolve this?
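Not from the original thread, but one way to narrow down the "Unable to resolve" errors above is to check that the controller name in slurm.conf matches a resolvable hostname (the path assumes the slurm-llnl layout used in this guide):

```shell
# Compare the machine's hostname with the controller name in slurm.conf.
hostname
grep -iE '^(SlurmctldHost|ControlMachine)' /etc/slurm-llnl/slurm.conf
# The controller name should resolve; if this prints nothing, add it to /etc/hosts.
getent hosts "$(hostname)"
```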

@SmallPackage

SmallPackage commented Oct 14, 2021

(quoting @hmamine's slurmd.service timeout report above)

@hmamine I had this issue, and using configurator.html instead of configurator.easy.html solved it.

@Lihua1990

Sorry, I'm not an expert on this. After a reboot, I restarted the services and I also have the problem:

"slurm_load_partitions: Unable to contact slurm controller (connect failure)"

No idea why; still not solved.

@frankliuao

Thanks for kindly sharing.
I had some trouble working through the instructions you provided, and couldn't find similar issues elsewhere. My problem was that whenever I tried to start the slurmd service, it gave me this error:

(base) frankliuao@Lorentz:~$ sudo service slurmd start
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.

Then I figured out from their website that the actual log of Slurm is being stored in

/var/log/slurm

So I read the errors

error: cannot find proctrack plugin for linuxproc
and realized that ProctrackType should be set to proctrack/linuxproc, instead of just linuxproc.

Same for JobAcctGatherType: not Linux, but jobacct_gather/linux (see the official docs).

After that, the service could be started.

@frankliuao

To add to my comment above, one also needs to change AccountingStorageType to accounting_storage/filetxt instead of filetxt. The recommended value accounting_storage/slurmdbd didn't work because it relies on slurmdbd service, which I wasn't able to get running in a short time.
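The two comments above can be folded into one hedged sed sketch. The path assumes the slurm-llnl layout from this guide, and you may want to try it on a copy of the file first:

```shell
# Rewrite the three short plugin names into the full forms Slurm expects.
conf=/etc/slurm-llnl/slurm.conf
sudo sed -i \
  -e 's|^ProctrackType=linuxproc$|ProctrackType=proctrack/linuxproc|' \
  -e 's|^JobAcctGatherType=Linux$|JobAcctGatherType=jobacct_gather/linux|' \
  -e 's|^AccountingStorageType=FileTxt$|AccountingStorageType=accounting_storage/filetxt|' \
  "$conf"
```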

@v-iashin

v-iashin commented Apr 14, 2022

To add GPUs to your cluster, do the following (I assume your machine has NVIDIA drivers):

  1. Open /etc/slurm-llnl/slurm.conf. Uncomment and change #GresTypes= to GresTypes=gpu, and add Gres=gpu:3 among the specifications of a compute node at the bottom, e.g.:
NodeName=<your node name> CPUs=XX RealMemory=XXXXX Gres=gpu:3 Sockets=X CoresPerSocket=XX ThreadsPerCore=X State=XXXXXXX
  2. Create /etc/slurm-llnl/gres.conf and add:
# e.g. for 3 GPUs
NodeName=<your node name> Name=gpu Type=2080ti File=/dev/nvidia0
NodeName=<your node name> Name=gpu Type=2080ti File=/dev/nvidia1
NodeName=<your node name> Name=gpu Type=2080ti File=/dev/nvidia2
  3. Restart the cluster: sudo service slurmd restart && sudo service slurmctld restart
  4. Test that a job allocates the GPUs: srun -N 1 --gres=gpu:2080ti:2 env | grep CUDA --> CUDA_VISIBLE_DEVICES=0,1
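To double-check that the controller actually picked up the GRES after the restart, these standard Slurm commands should help (output format will vary with your setup):

```shell
# The node's Gres line should list gpu:3 (or gpu:2080ti:3).
scontrol show node "$(hostname)" | grep -i gres
# %G prints the generic resources configured per node.
sinfo -o '%N %G'
```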

@sandeep1143

sandeep1143 commented Apr 30, 2022

Hi,
Thanks, this worked successfully, but when I check the status I see:

sandeep@sandeep-VirtualBox:~$ systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2022-04-30 22:50:47 IST; 9min ago
Docs: man:slurmd(8)
Process: 12931 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 12933 (slurmd)
Tasks: 1
Memory: 4.1M
CGroup: /system.slice/slurmd.service
└─12933 /usr/sbin/slurmd

Apr 30 22:50:47 sandeep-VirtualBox systemd[1]: Starting Slurm node daemon...
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12931]: Node reconfigured socket/core boundaries SocketsPerBoard=4:1(hw) CoresPe>
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12931]: Message aggregation disabled
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12931]: CPU frequency setting not configured for this node
Apr 30 22:50:47 sandeep-VirtualBox systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not permitted
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: slurmd version 19.05.5 started
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: slurmd started on Sat, 30 Apr 2022 22:50:47 +0530
Apr 30 22:50:47 sandeep-VirtualBox systemd[1]: Started Slurm node daemon.
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: CPUs=4 Boards=1 Sockets=1 Cores=4 Threads=1 Memory=7951 TmpDisk=82909 Up>
Apr 30 22:50:56 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: error: Unable to register: Unable to contact slurm controller (connect f>

At the end I get a "can't open PID file" error and "unable to contact slurm controller". Can you help with what to do about this?
