Use apt to install the necessary packages:
sudo apt install -y slurm-wlm slurm-wlm-doc
Load file:///usr/share/doc/slurm-wlm/html/configurator.html in a browser (or file://wsl%24/Ubuntu/usr/share/doc/slurm-wlm/html/configurator.html on WSL2), and:
- Set your machine's hostname in SlurmctldHost and NodeName.
- Set CPUs as appropriate, and optionally Sockets, CoresPerSocket, and ThreadsPerCore. Use the command lscpu to find what you have.
- Set RealMemory to the number of megabytes you want to allocate to Slurm jobs.
- Set StateSaveLocation to /var/spool/slurm-llnl.
- Set ProctrackType to linuxproc because processes are less likely to escape Slurm control on a single machine config.
- Make sure SelectType is set to Cons_res, and set SelectTypeParameters to CR_Core_Memory.
- Set JobAcctGatherType to Linux to gather resource use per job, and set AccountingStorageType to FileTxt.
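For the CPU fields, the values can be read off lscpu by hand, or pulled out with a small helper like the sketch below. The function name lscpu_field is made up for this example, and it assumes the usual "Label:   value" layout of lscpu output, which may be indented differently on newer util-linux versions (leading whitespace is stripped for that reason).

```shell
# lscpu_field LABEL
# Read lscpu-style "Label:   value" lines on stdin and print the value for LABEL.
# Hypothetical helper for illustration; not part of Slurm or util-linux.
lscpu_field() {
    awk -F': *' -v label="$1" '
        { key = $1; sub(/^[ \t]+/, "", key) }   # tolerate indented labels
        key == label { print $2; exit }
    '
}
```

For example, `lscpu | lscpu_field 'Socket(s)'`, `lscpu | lscpu_field 'Core(s) per socket'`, and `lscpu | lscpu_field 'Thread(s) per core'` give candidate values for Sockets, CoresPerSocket, and ThreadsPerCore.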
Click Submit, and save the resulting text to /etc/slurm-llnl/slurm.conf, i.e. the configuration file referred to in /lib/systemd/system/slurmctld.service and /lib/systemd/system/slurmd.service.
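For orientation, the lines of the generated file that correspond to the settings above might look roughly like the fragment below. This is only a sketch: the host name myhost, the CPU and memory figures, and the partition name debug are placeholders rather than values from this guide, and the configurator shipped with your Slurm version may emit different or additional lines.

```
# Placeholder values; replace myhost and the hardware figures with your own.
# Older Slurm releases use ControlMachine instead of SlurmctldHost.
SlurmctldHost=myhost
ProctrackType=proctrack/linuxproc
StateSaveLocation=/var/spool/slurm-llnl
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/filetxt
NodeName=myhost CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=16000 State=UNKNOWN
PartitionName=debug Nodes=myhost Default=YES MaxTime=INFINITE State=UP
```

Once the packages are installed, `slurmd -C` prints the detected hardware as a NodeName line, which can be compared against what you entered in the form.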
Open /etc/slurm-llnl/slurm.conf in a text editor, uncomment DefMemPerCPU, and set it to 8192 or whatever number of megabytes per allocated CPU you want a job to receive when no memory is requested explicitly with --mem at submission. Read the docs and edit other defaults as you see fit.
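If you would rather script the DefMemPerCPU edit than do it in an editor, something along these lines should work. It is a sketch: set_conf_option is a made-up helper name, it relies on GNU sed's -i option (as shipped with Ubuntu), and it only rewrites a KEY= line that already exists, commented out or not.

```shell
# set_conf_option FILE KEY VALUE
# Replace an existing "KEY=..." line (optionally commented out with #)
# by "KEY=VALUE". Hypothetical helper; does nothing if KEY is absent.
set_conf_option() {
    file=$1 key=$2 value=$3
    sed -i "s|^#* *${key}=.*|${key}=${value}|" "$file"
}
```

Because the file is owned by root, the function has to run in a root shell; the one-off equivalent is `sudo sed -i 's|^#* *DefMemPerCPU=.*|DefMemPerCPU=8192|' /etc/slurm-llnl/slurm.conf`. A job that needs a different amount can still override the default at submission, e.g. `sbatch --mem=4096 job.sh`, where job.sh stands for your own batch script.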
Create /var/spool/slurm-llnl and /var/log/slurm_jobacct.log, then set ownership appropriately:
sudo mkdir -p /var/spool/slurm-llnl
sudo touch /var/log/slurm_jobacct.log
sudo chown slurm:slurm /var/spool/slurm-llnl /var/log/slurm_jobacct.log
Install mailutils so that Slurm won't complain about /bin/mail missing:
sudo apt install -y mailutils
Make sure munge is installed and running, and a munge.key was created with user-only read-only permissions, owned by munge:munge:
sudo service munge start
sudo ls -l /etc/munge/munge.key
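The ls -l output has to be read by eye. A more compact check is sketched below; mode_and_owner is a made-up helper name, and it assumes GNU stat (the coreutils version on Ubuntu).

```shell
# mode_and_owner PATH
# Print the octal permission bits and owner:group of PATH, e.g. "400 munge:munge".
# Hypothetical helper built on GNU stat's -c format option.
mode_and_owner() {
    stat -c '%a %U:%G' "$1"
}
```

Since /etc/munge is not readable by ordinary users, run the underlying command with sudo: `sudo stat -c '%a %U:%G' /etc/munge/munge.key`. With the permissions described above, the output should be `400 munge:munge`. A round trip such as `munge -n | unmunge` is a common way to confirm that the daemon can encode and decode a credential.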
Start services slurmctld and slurmd:
sudo service slurmctld start
sudo service slurmd start
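slurmd has to reach the controller under the name given in slurm.conf, so if it fails to register, one thing worth checking is that this name matches the machine's hostname. The sketch below does that comparison. Both function names are made up for this example; the parsing is deliberately simple (case-sensitive keys, no support for the host(address) form of SlurmctldHost), and the optional second argument exists only so the check can be tried against an arbitrary name.

```shell
# conf_value FILE KEY
# Print the value of the first uncommented "KEY=value" line in FILE.
conf_value() {
    sed -n "s/^$2=\([^ #]*\).*/\1/p" "$1" | head -n 1
}

# controller_matches_host FILE [HOSTNAME]
# Succeed if SlurmctldHost (or the older ControlMachine) in FILE equals
# HOSTNAME, which defaults to this machine's short hostname.
controller_matches_host() {
    ctl=$(conf_value "$1" SlurmctldHost)
    [ -n "$ctl" ] || ctl=$(conf_value "$1" ControlMachine)
    host=${2:-$(hostname -s)}
    [ "$ctl" = "$host" ]
}
```

For example, `controller_matches_host /etc/slurm-llnl/slurm.conf || echo "controller name differs from hostname"`. Once both services are up, `sinfo` and `scontrol show node` should list the node, and `srun hostname` runs a trivial test job.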
Hi, and thank you for sharing! I have run into a somewhat similar issue. I changed the hostname and set ControlMachine to exactly the same name, and now I am running into the following issue (see below). I would appreciate any input on how to fix it.
slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Thu 2021-09-30 12:02:02 EDT; 13s ago
Process: 26000 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Sep 30 12:01:11 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:21 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:31 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:41 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:51 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:52 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Start operation timed out. Terminating.
Sep 30 12:02:01 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: Failed to start Slurm node daemon.
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Unit entered failed state.
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Failed with result 'timeout'.