
TengdaHan / ddp_notes.md (last active April 21, 2025)

Multi-node-training on slurm with PyTorch

What's this?

  • A simple note on how to start multi-node training on the Slurm scheduler with PyTorch.
  • Especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated together, or when you need more than 4 GPUs for a single job.
  • Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this purpose (a minimal setup sketch follows this list).
  • Warning: you might need to refactor your own code.
  • Warning: you might be secretly condemned by your colleagues for using too many GPUs.
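
A minimal sketch of the idea, assuming one Slurm task per GPU launched with `srun`: each process reads its global rank, world size, and local rank from the standard Slurm environment variables and initializes a `torch.distributed` process group before wrapping the model in DDP. The helper name `setup_distributed` and the port number below are illustrative, not part of the original note.

```python
import os
import subprocess

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_distributed(backend="nccl", port="29500"):
    """Initialize torch.distributed from standard Slurm environment variables.

    Assumes the job was launched with `srun` so that SLURM_PROCID,
    SLURM_NTASKS, and SLURM_LOCALID are set, with one task per GPU.
    """
    rank = int(os.environ["SLURM_PROCID"])         # global rank across all nodes
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of processes
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

    # Expand the (possibly compressed) node list, e.g. "node[01-04]",
    # and use the first node as the rendezvous host.
    hostnames = subprocess.run(
        ["scontrol", "show", "hostnames", os.environ["SLURM_NODELIST"]],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    os.environ.setdefault("MASTER_ADDR", hostnames[0])
    os.environ.setdefault("MASTER_PORT", port)

    dist.init_process_group(backend, rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, local_rank, world_size


if __name__ == "__main__":
    rank, local_rank, world_size = setup_distributed()
    # Wrap any model in DDP; gradients are synchronized across all processes.
    model = torch.nn.Linear(10, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
```

Such a script would typically be launched from an sbatch file requesting, for example, `--nodes=2 --ntasks-per-node=4 --gres=gpu:4`, with `srun python train.py` as the job step, so that Slurm starts one process per GPU on every node.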
rxwei / ad-manifesto.md (last active December 6, 2024)
First-Class Automatic Differentiation in Swift: A Manifesto
repeatedly / fizzbuzz.d (created August 9, 2012)
FizzBuzz based on tanakh's Haskell
import std.algorithm, std.conv, std.range, std.stdio;
// FizzBuzz from http://ideone.com/ciKtm
// I cannot port 'f <> b <|> n'
void main()
{
    auto fizz = cycle([null, null, "Fizz"]);
    auto buzz = cycle([null, null, null, null, "Buzz"]);
    auto nums = map!(to!string)(iota(1, 101));