Emin Orhan eminorhan

Multi-node training on Slurm with PyTorch

A short note on how to start multi-node training with PyTorch under the Slurm scheduler.
This is especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated, or when a single job needs more than 4 GPUs.
Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this purpose.
Warning: you might need to refactor your own code.
Warning: you might be secretly condemned by your colleagues for using too many GPUs.
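
As a rough illustration of the DDP requirement, here is a minimal sketch of how a training script might pick up its rank from Slurm and wrap a model. It is not the code from the note itself. The helper names `slurm_dist_env` and `setup_ddp` are hypothetical, and the sketch assumes the job is launched with one task per GPU and that the job script exports `MASTER_ADDR` and `MASTER_PORT`.

```python
import os


def slurm_dist_env(environ=os.environ):
    """Map Slurm's per-task variables to the values DDP needs.

    Assumes one Slurm task per GPU (hypothetical helper, not a PyTorch API).
    """
    return {
        "rank": int(environ["SLURM_PROCID"]),         # global rank of this task
        "world_size": int(environ["SLURM_NTASKS"]),   # total number of tasks
        "local_rank": int(environ["SLURM_LOCALID"]),  # task index within its node
    }


def setup_ddp(model):
    """Initialise the process group and wrap a model in DDP (hypothetical helper)."""
    # Imported lazily so slurm_dist_env can be used without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    env = slurm_dist_env()
    # Assumption: MASTER_ADDR and MASTER_PORT are exported by the job script,
    # which the env:// init method reads to find the rendezvous address.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=env["rank"],
        world_size=env["world_size"],
    )
    # Bind this process to its own GPU before moving the model there.
    torch.cuda.set_device(env["local_rank"])
    model = model.cuda(env["local_rank"])
    return DDP(model, device_ids=[env["local_rank"]])
```

Passing the environment as an argument keeps the rank mapping easy to check without a cluster, for example `slurm_dist_env({"SLURM_PROCID": "5", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"})`.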

	"""Stream a response from the OpenAI completion API."""
	import os
	import re
	import sys
	import time
	import random

	import openai
	# Read the API key from ~/.openai, closing the file afterwards.
	with open(os.path.expanduser("~/.openai")) as f:
		openai.api_key = f.read().strip()

	Pretty-print tables summarizing properties of tensor arrays in NumPy, PyTorch, JAX, etc.

	Now on pip! `pip install arrgh` https://github.com/nmwsharp/arrgh