- A short note on how to start multi-node training on a SLURM scheduler with PyTorch.
- Especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated on a single node, or when one job needs more than 4 GPUs.
- Requirement: your training code has to use PyTorch DistributedDataParallel (DDP); see the sketch after this list.
- Warning: you might need to refactor your own code.
- Warning: you might be secretly condemned by your colleagues for using too many GPUs.
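Below is a minimal sketch of what the DDP setup can look like under SLURM. It assumes the job is launched with `srun` (so every task sees `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID`) and that `MASTER_ADDR`/`MASTER_PORT` are exported in the batch script; the helper name `setup_distributed` and the toy model are illustrative, not from the original gist.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_distributed():
    # `srun` sets one environment block per launched task.
    rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

    # MASTER_ADDR / MASTER_PORT are assumed to be exported in the sbatch
    # script (e.g. the hostname of the first node in the allocation).
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    return rank, local_rank


if __name__ == "__main__":
    rank, local_rank = setup_distributed()
    model = torch.nn.Linear(10, 10).cuda(local_rank)  # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])
    # ... build the DataLoader with a DistributedSampler and train as usual ...
    dist.destroy_process_group()
```

In the batch script, `MASTER_ADDR` is typically taken from the first hostname reported by `scontrol show hostnames "$SLURM_JOB_NODELIST"`, and the script is launched with one task per GPU, e.g. `srun --ntasks-per-node=4 python train.py`.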
```r
library(repr)

# Change plot size to 4 x 3
options(repr.plot.width = 4, repr.plot.height = 3)
curve(sin(x), from = 0, to = 2 * pi, n = 100)

# Change plot size to 8 x 3
options(repr.plot.width = 8, repr.plot.height = 3)
curve(sin(x), from = 0, to = 4 * pi, n = 200)
```
Summary of basic MacPorts usage
1. MacPorts download page:
http://www.macports.org/install.php
2. MacPorts documentation:
http://guide.macports.org/
3. The default path where MacPorts stores downloaded third-party package archives is /opt/local/var/macports/distfiles/.
To speed up installation, you can copy the files from this directory into the same directory of a fresh MacPorts installation; the port command will then reuse anything it already finds there instead of downloading it again (see the sketch below).
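A small sketch of that copy step, assuming the old cache sits at the default location and using a hypothetical destination path for the new installation; a plain `cp -R` achieves the same thing.

```python
import shutil
from pathlib import Path

# Existing MacPorts distfiles cache (default location).
src = Path("/opt/local/var/macports/distfiles")
# Hypothetical distfiles directory of the new MacPorts installation.
dst = Path("/Volumes/NewDisk/opt/local/var/macports/distfiles")

# Copy every cached archive; `port install` checks this directory
# before downloading, so matching files are reused instead of fetched.
shutil.copytree(src, dst, dirs_exist_ok=True)
```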