| Slurm Concept | Kubernetes Equivalent | Description |
|---|---|---|
| Cluster | Cluster | Overall compute infrastructure |
| Node | Node | Physical or virtual machine in the cluster |
| Partition | Namespace + ResourceQuota | Logical division of resources |
| Account | RBAC Roles and RoleBindings | Access control mechanisms |
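As an illustration of the Partition row, a Slurm partition named `a100` could be approximated by a dedicated namespace carrying a ResourceQuota that caps GPU, CPU, and memory requests. All names and limits below are hypothetical; the GPU quota key assumes the NVIDIA device plugin is installed:

```yaml
# Hypothetical sketch: a Slurm "a100" partition re-expressed as a quota
# inside a namespace of the same name.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: a100-quota
  namespace: a100
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.cpu: "64"
    limits.memory: 256Gi
```

Users granted a RoleBinding in the `a100` namespace then play the role a Slurm account plays for the partition.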
```python
#!/usr/bin/env python3
import subprocess
import time
import logging
from datetime import datetime
import pynvml
import os

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
```
```bash
#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -N 1
#SBATCH -p a100
#SBATCH --gpus-per-node=2

GCC_VERSION="10.3.0"
CUDA_VERSION="11.6"
TORCH_VERSION="1.13.1"
MV2_VERSION="release-plus-3.0a2"
```
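The version variables are typically used to select environment modules or install prefixes later in the script. A minimal sketch of that continuation; the module naming scheme is an assumption and is site-specific:

```shell
#!/bin/bash
# Hypothetical continuation: derive module names from the version variables.
GCC_VERSION="10.3.0"
CUDA_VERSION="11.6"

GCC_MODULE="gcc/${GCC_VERSION}"
CUDA_MODULE="cuda/${CUDA_VERSION}"
# module load "$GCC_MODULE" "$CUDA_MODULE"   # uncomment on a cluster with Lmod/Tmod
echo "$GCC_MODULE $CUDA_MODULE"
```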
""" | |
To run the benchmark, you would use mpirun_rsh like this: | |
For single-node multi-GPU: | |
mpirun_rsh <ENV_PARAMS> -np 2 python distributed_benchmark.py --task text --parallel_mode ddp | |
and for multi-node: | |
mpirun_rsh <ENV_PARAMS> -hostfile hosts -np 4 python distributed_benchmark.py --task vision --parallel_mode fsdp_full |
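The `-hostfile` argument points at a plain text file listing one hostname per line; with MVAPICH2's `mpirun_rsh`, repeating a host places multiple ranks on it. A sketch of `hosts` for the `-np 4` run over two nodes with two GPUs each (hostnames are hypothetical):

```text
node01
node01
node02
node02
```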
```bash
#!/bin/bash

# Set tokenizer
TOKENIZER_TYPE=<TODO>
TOKENIZER_MODEL=<TODO>

# Set up distributed training
GPUS_PER_NODE=<TODO>
NNODES=<TODO>
export MASTER_ADDR=localhost  # ONLY FOR SINGLE-NODE. CHANGE FOR MULTI-NODE.
```
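Launchers built on this pattern generally derive the total number of ranks from these two settings. A minimal sketch with placeholder values standing in for the `<TODO>`s above (the values are assumptions for illustration only):

```shell
#!/bin/bash
# Hypothetical values; replace with your cluster's actual topology.
GPUS_PER_NODE=4
NNODES=2
WORLD_SIZE=$((GPUS_PER_NODE * NNODES))
echo "world size: $WORLD_SIZE"   # world size: 8
```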
```python
import argparse
import math

# Helper function to pretty-print FLOP counts
def convert_flops(params):
    if params == 0:
        return "0"
    size_name = ("", "KFLOPs", "MFLOPs", "GFLOPs", "TFLOPs", "PFLOPs", "EFLOPs", "ZFLOPs", "YFLOPs")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)
    return f"{s} {size_name[i]}"
```
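A quick standalone check of the helper (the function is reproduced and completed here so the snippet runs on its own; the input values are just illustrative FLOP counts):

```python
import math

# Completed version of the convert_flops helper above.
def convert_flops(params):
    if params == 0:
        return "0"
    size_name = ("", "KFLOPs", "MFLOPs", "GFLOPs", "TFLOPs", "PFLOPs", "EFLOPs", "ZFLOPs", "YFLOPs")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)
    return f"{s} {size_name[i]}"

print(convert_flops(1.5e12))   # 1.5 TFLOPs
print(convert_flops(312e12))   # 312.0 TFLOPs
```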
```python
import argparse
import math

# Helper function to pretty-print parameter counts
def convert_params(params):
    if params == 0:
        return "0"
    size_name = ("", "K", "M", "B", "T", "P", "E", "Z", "Y")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)
    return f"{s} {size_name[i]}"
```
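The parameter-count variant behaves the same way, just with unit suffixes like B for billions. A self-contained, completed copy with example inputs (7e9 is an illustrative model size, not one from the source):

```python
import math

# Completed version of the convert_params helper above.
def convert_params(params):
    if params == 0:
        return "0"
    size_name = ("", "K", "M", "B", "T", "P", "E", "Z", "Y")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)
    return f"{s} {size_name[i]}"

print(convert_params(7e9))      # 7.0 B
print(convert_params(125_000))  # 125.0 K
```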
```python
import torch
from safetensors.torch import save_file, load_file
import numpy as np
import argparse
import os
import time

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-save", dest="save", action="store_false",
                        help="disable saving initial tensors")
```
Thank you for your interest in contributing to open source software projects (“Projects”) made available by the Network-Based Computing Laboratory (NBCL) or its affiliates (“NBCL”). This Individual Contributor License Agreement (“Agreement”) sets out the terms governing any source code, object code, bug fixes, configuration changes, tools, specifications, documentation, data, materials, feedback, information or other works of authorship that you submit or have submitted, in any form and in any manner, to NBCL in respect of any of the Projects (collectively “Contributions”). If you have any questions respecting this Agreement, please contact [email protected].
You agree that the following terms apply to all of your past, present and future Contributions. Except for the licenses granted in this Agreement, you retain all of your right, title and interest in and to your Contributions.
Copyright License. You hereby grant, and agree to grant, to NB