Arjun Ashok ashok-arjun


Multi-node-training on slurm with PyTorch

What's this?

A short note on how to start multi-node training with PyTorch on the Slurm scheduler.
Especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated, or when you need more than 4 GPUs for a single job.
Requirement: you must use PyTorch DistributedDataParallel (DDP) for this purpose.
Warning: you might need to refactor your own code.
Warning: you might be secretly condemned by your colleagues for using too many GPUs.
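
The rank bookkeeping for such a job could look roughly like the sketch below. It assumes the job is launched with `srun`, one task per GPU, so that each task sees Slurm's `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` variables. The helper name `slurm_ddp_env` is hypothetical and is not taken from the gist itself.

```python
import os


def slurm_ddp_env(environ=None):
    """Map Slurm task variables to the rank info that DDP needs.

    Assumption: one Slurm task per GPU. Falls back to a single-process
    setup when the variables are absent (e.g. when run outside Slurm).
    """
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),         # global task index
        "world_size": int(env.get("SLURM_NTASKS", 1)),   # total number of tasks
        "local_rank": int(env.get("SLURM_LOCALID", 0)),  # task index on this node
    }


if __name__ == "__main__":
    print(slurm_ddp_env())
```

The returned `rank` and `world_size` could then be passed to `torch.distributed.init_process_group`, with `local_rank` selecting the GPU on each node.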

(Internal Training Material)

Usually the first step in performance optimization is profiling, e.g. identifying the performance hotspots of a workload. This gist covers the basics of performance profiling in PyTorch. You will learn:

How to find the bottleneck operator?
How to trace source file of a particular operator?
How do I identify threading issues? (oversubscription)
How do I tell whether a specific operator is running efficiently?

This tutorial uses one of my recent projects, pssp-transformer, as an example to guide you through PyTorch CPU performance optimization. The focus is on Part 1 & Part 2.
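
As a rough illustration of the oversubscription question above: threads are oversubscribed when the processes on a machine together run more compute threads than there are cores. The sketch below is only that arithmetic, not the gist's profiling method. The function name `is_oversubscribed` is hypothetical, and the thread count per process is assumed to be known (in PyTorch it could be read with `torch.get_num_threads()`).

```python
import os


def is_oversubscribed(num_processes, threads_per_process, num_cores=None):
    """Return True when the total thread count exceeds the available cores.

    Assumption: every process runs `threads_per_process` compute threads
    at the same time, which is the worst case for contention.
    """
    cores = num_cores if num_cores is not None else (os.cpu_count() or 1)
    return num_processes * threads_per_process > cores


if __name__ == "__main__":
    # Example: 4 worker processes with 8 threads each on this machine.
    print(is_oversubscribed(num_processes=4, threads_per_process=8))
```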

	function ClickConnect() {
	    console.log('Working')
	    document
	        .querySelector('#top-toolbar > colab-connect-button')
	        .shadowRoot.querySelector('#connect')
	        .click()
	}
	intervalTiming = setInterval(ClickConnect, 60000)

	import imageio
	import numpy as np
	import os

	from collections import defaultdict
	from torch.utils.data import Dataset

	from tqdm.autonotebook import tqdm

	dir_structure_help = r"""

	from __future__ import print_function, absolute_import

	__all__ = ['accuracy']

	def accuracy(output, target, topk=(1,)):
	    """Computes the precision@k for the specified values of k"""
	    maxk = max(topk)
	    batch_size = target.size(0)

	    _, pred = output.topk(maxk, 1, True, True)
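
The preview above is cut off after the `topk` call. To show what precision@k measures, here is a minimal plain-Python sketch that works on nested lists rather than tensors. It is not a reconstruction of the gist's PyTorch code, and the name `topk_accuracy` is hypothetical.

```python
def topk_accuracy(scores, targets, topk=(1,)):
    """Percentage of samples whose target class is among the k highest scores.

    scores:  list of per-sample score lists (one score per class)
    targets: list of true class indices, one per sample
    """
    results = []
    for k in topk:
        hits = 0
        for row, target in zip(scores, targets):
            # Class indices sorted from highest to lowest score.
            ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
            if target in ranked[:k]:
                hits += 1
        results.append(100.0 * hits / len(targets))
    return results


if __name__ == "__main__":
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    targets = [1, 1, 0]
    print(topk_accuracy(scores, targets, topk=(1, 2)))
```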

	"""CODE FOR ANALYSIS"""
	def eigenDecompositionAnalysis(self, b1_model, X_train_cumuls, Y_train_cumuls, T_train_cumuls,
	        X_valid_cumuls, Y_valid_cumuls, T_valid_cumuls, X_protoset_cumuls,
	        Y_protoset_cumuls, T_protoset_cumuls,
	        iteration_index, start_iter, end_iter, order_list, device,
	        num_classes, num_phases, model_list, threshold=0.9):
	"""SOME CUSTOM FUNCTIONS"""

	def torch_cat(main_array, new_array):
	    """Custom `cat` function"""

	#!/usr/bin/env python
	# coding: utf-8

	# You need PIL <http://www.pythonware.com/products/pil/> to run this script
	# Download unifont.ttf from <http://unifoundry.com/unifont.html> (or use
	# any TTF you have)
	# Copyright 2011 Álvaro Justen [alvarojusten at gmail dot com]
	# License: GPL <http://www.gnu.org/copyleft/gpl.html>

	from image_utils import ImageText

	INFO 11-06 16:59:41 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.2%, CPU KV cache usage: 0.0%.
	INFO 11-06 16:59:41 metrics.py:367] Prefix cache hit rate: GPU: 11.35%, CPU: 0.00%
	INFO: Waiting for background tasks to complete. (CTRL+C to force quit)
	INFO: Waiting for application shutdown.
	INFO: Application shutdown complete.
	INFO: Finished server process [1]
	INFO 11-06 16:59:50 server.py:228] vLLM ZMQ RPC Server was interrupted.
	Future exception was never retrieved
	future: <Future finished exception=TimeoutError()>
	Traceback (most recent call last):