richardrl

Multi-node-training on slurm with PyTorch

What's this?

A short note on how to start multi-node training with PyTorch on the Slurm scheduler.
Especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated on a single node, or when you need more than 4 GPUs for a single job.
Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this purpose.
Warning: you might need to refactor your own code.
Warning: you might be secretly condemned by your colleagues for using too many GPUs.
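
As a rough sketch of what the DDP requirement involves: each Slurm task needs to know its global rank, the total number of tasks, and its index on the local node. The snippet below reads these from the `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` environment variables. It assumes the job is launched with `srun` using one task per GPU, and that `MASTER_ADDR` and `MASTER_PORT` are exported in the job script; the helper names `slurm_dist_env` and `setup_ddp` are made up for this illustration.

```python
import os


def slurm_dist_env(environ=os.environ):
    """Derive DDP rank information from variables Slurm sets for each task."""
    rank = int(environ["SLURM_PROCID"])         # global rank of this task
    world_size = int(environ["SLURM_NTASKS"])   # total tasks across all nodes
    local_rank = int(environ["SLURM_LOCALID"])  # task index within this node
    return {"rank": rank, "world_size": world_size, "local_rank": local_rank}


def setup_ddp(model, environ=os.environ):
    """Wrap `model` in DistributedDataParallel (needs torch and one GPU per task)."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    info = slurm_dist_env(environ)
    # Assumption: MASTER_ADDR and MASTER_PORT are already set in the environment.
    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=info["rank"], world_size=info["world_size"])
    torch.cuda.set_device(info["local_rank"])
    model = model.cuda(info["local_rank"])
    return DDP(model, device_ids=[info["local_rank"]])
```

Keeping the environment parsing separate from the torch calls makes it easy to check the rank logic on a login node before submitting a job.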

Preamble

There is a longstanding issue/missing feature/bug with sockets on Docker on macOS; it may never work; you'll need to use a network connection between Docker containers and X11 on macOS for the foreseeable future.

I started from this gist and made some adjustments:

the volume mappings aren't relevant/used, due to the socket issue above.
this method only allows X11 connections from your Mac, not the entire local network, which would include everyone on the café/airport WiFi.
updated to use the host.docker.internal name for the container host instead.
you have to restart XQuartz after the config change.
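
The adjustments above boil down to pointing the container's `DISPLAY` at the Mac over the network rather than mounting the X11 socket. A minimal sketch of building such a `docker run` command follows; the image name `my-x11-image`, the program `xeyes`, and the helper `x11_docker_command` are placeholders, and display number 0 is an assumption.

```python
def x11_docker_command(image, program, display_number=0):
    """Build a `docker run` argument list that points X11 clients at the Mac host."""
    # host.docker.internal resolves to the container host from inside the container.
    display = f"host.docker.internal:{display_number}"
    return ["docker", "run", "--rm", "-e", f"DISPLAY={display}", image, program]


# Assumption: on the Mac side, XQuartz is configured to accept network clients
# and access has been limited to this machine (for example with `xhost +localhost`).
cmd = x11_docker_command("my-x11-image", "xeyes")
print(" ".join(cmd))
```

No volume mapping appears in the command because the socket is not used.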

	import torch
	import torch.nn as nn
	import torch.nn.functional as F


	class SpatialSoftArgmax(nn.Module):
	    """Spatial softmax as defined in [1].

	    Concretely, the spatial softmax of each feature
	    map is used to compute a weighted mean of the pixel
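
The preview above is cut off mid-docstring. To illustrate the idea it describes (a softmax over a feature map used as weights for a mean pixel location), here is a dependency-free sketch for a single 2D map. It is not the module's implementation: the function name, the `temperature` parameter, and the normalization of coordinates to [-1, 1] are assumptions made for this example.

```python
import math


def spatial_soft_argmax(feature_map, temperature=1.0):
    """Return the softmax-weighted mean (x, y) pixel location of a 2D feature map.

    Coordinates are normalized to [-1, 1] (an assumption of this sketch).
    """
    height, width = len(feature_map), len(feature_map[0])
    flat = [v / temperature for row in feature_map for v in row]
    peak = max(flat)  # subtract the max before exponentiating for stability
    weights = [math.exp(v - peak) for v in flat]
    total = sum(weights)

    def coord(index, size):
        # Map a pixel index in [0, size - 1] to [-1, 1].
        return 2.0 * index / (size - 1) - 1.0 if size > 1 else 0.0

    x = sum(w * coord(i % width, width) for i, w in enumerate(weights)) / total
    y = sum(w * coord(i // width, height) for i, w in enumerate(weights)) / total
    return x, y
```

A map with one strong activation yields coordinates close to that pixel, while a flat map yields the image center.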

	if args.enable_torques:
	    print("ASSUMING THE WITNESS POINTS ARE DEFINED WITH RESPECT TO THE CENTER OF MASS!!! CHANGE BODY FRAME ORIGIN IF THIS IS NOT TRUE")
	    # for each contact point, setup the product of binary-continuous for w*b
	    # sum all the torques for a single object

	    # setup obj-obj first
	    for i in range(config.num_internal_bodies):
	        net_torque_to_object_i = np.zeros(3)
	        for j in range(config.num_internal_bodies):
	            net_torque_to_i_from_j = np.zeros(3)

	import os
	import torch
	from tqdm import tqdm
	import time

	# declare which gpu device to use
	cuda_device = '0'

	def check_mem(cuda_device):
	    devices_info = os.popen('"/usr/bin/nvidia-smi" --query-gpu=memory.total,memory.used --format=csv,nounits,noheader').read().strip().split("\n")
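
The `nvidia-smi` query above prints one `total, used` line per GPU, with values in MiB. The preview stops before showing how the lines are consumed, so the following is only a sketch of one way to parse them; `parse_gpu_memory` is a hypothetical helper and the sample numbers are made up.

```python
def parse_gpu_memory(csv_output):
    """Turn 'total, used' CSV lines (one per GPU, in MiB) into integer pairs."""
    pairs = []
    for line in csv_output.strip().split("\n"):
        total, used = (int(field) for field in line.split(","))
        pairs.append((total, used))
    return pairs


sample = "16160, 1024\n16160, 0\n"  # made-up output for two GPUs
memory = parse_gpu_memory(sample)

cuda_device = "0"
total, used = memory[int(cuda_device)]
print(f"GPU {cuda_device}: {total - used} MiB free of {total} MiB")
```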

	import pyglet
	import pyglet.gl as gl

	import numpy as np

	import os
	import tempfile
	import subprocess
	import collections

	from multiprocessing import Pool
	from functools import partial

	def _pickle_method(method):
	    func_name = method.im_func.__name__
	    obj = method.im_self
	    cls = method.im_class
	    if func_name.startswith('__') and not func_name.endswith('__'): #deal with mangled names
	        cls_name = cls.__name__.lstrip('_')
	        func_name = '_' + cls_name + func_name

	#!/usr/bin/python
	# -*- coding:utf-8 -*-
	"""
	Idea and code were taken from stackoverflow().

	This sample illustrates
	+ how to pass an instance method
	to multiprocessing (idea and code were introduced
	at http://goo.gl/tRHN1D by torek).

	from graphviz import Digraph
	import torch
	from torch.autograd import Variable, Function

	def iter_graph(root, callback):
	    queue = [root]
	    seen = set()
	    while queue:
	        fn = queue.pop()
	        if fn in seen:
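
The preview is cut off right after the `seen` check. The pattern it starts (a work list plus a set of visited nodes) can be sketched on a small stand-in graph as below. This is one common way to complete such a traversal, not the author's code: the `Node` class, its `parents` attribute, and `iter_nodes` are hypothetical names used instead of real autograd objects.

```python
class Node:
    """Hypothetical stand-in for a node in a computation graph."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)


def iter_nodes(root, callback):
    """Call `callback` once on every node reachable from `root`."""
    stack = [root]
    seen = set()
    while stack:
        node = stack.pop()  # pop() takes from the end, so the order is depth-first
        if node in seen:
            continue  # shared ancestors are visited only once
        seen.add(node)
        stack.extend(node.parents)
        callback(node)


# A diamond-shaped graph: d depends on b and c, which both depend on a.
a = Node("a")
b = Node("b", [a])
c = Node("c", [a])
d = Node("d", [b, c])

visited = []
iter_nodes(d, lambda node: visited.append(node.name))
print(visited)
```

The `seen` set is what keeps the shared node `a` from being reported twice.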

	import open3d
	import numpy as np
	def main():
	    vis = open3d.visualization.Visualizer()

	    vis.create_window("Pose Visualizer")
	    vis.get_render_option().line_width = 10.0

	    obb = open3d.geometry.OrientedBoundingBox(center=np.array([0.0,0.0,0.0]), R=np.eye(3), extent=np.array([1.0, 1.0, 1.0]))