| Slurm Concept | Kubernetes Equivalent | Description |
|---|---|---|
| Cluster | Cluster | Overall compute infrastructure |
| Node | Node | Physical or virtual machine in the cluster |
| Partition | Namespace + ResourceQuota | Logical division of resources |
| Account | RBAC Roles and RoleBindings | Access control mechanisms |
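As an illustration of the Partition row, a Slurm partition named `a100` could be approximated by a dedicated namespace carrying a ResourceQuota that caps GPU, CPU, and memory requests. All names and limits below are hypothetical; the GPU quota key assumes the NVIDIA device plugin is installed:

```yaml
# Hypothetical sketch: a Slurm "a100" partition re-expressed as a quota
# inside a namespace of the same name.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: a100-quota
  namespace: a100
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.cpu: "64"
    limits.memory: 256Gi
```

Users granted a RoleBinding in the `a100` namespace then play the role a Slurm account plays for the partition.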
```python
#!/usr/bin/env python3
import subprocess
import time
import logging
from datetime import datetime
import pynvml
import os

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
```
```bash
#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -N 1
#SBATCH -p a100
#SBATCH --gpus-per-node=2

GCC_VERSION="10.3.0"
CUDA_VERSION="11.6"
TORCH_VERSION="1.13.1"
MV2_VERSION="release-plus-3.0a2"
```
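The version variables are typically used to select environment modules or install prefixes later in the script. A minimal sketch of that continuation; the module naming scheme is an assumption and is site-specific:

```shell
#!/bin/bash
# Hypothetical continuation: derive module names from the version variables.
GCC_VERSION="10.3.0"
CUDA_VERSION="11.6"

GCC_MODULE="gcc/${GCC_VERSION}"
CUDA_MODULE="cuda/${CUDA_VERSION}"
# module load "$GCC_MODULE" "$CUDA_MODULE"   # uncomment on a cluster with Lmod/Tmod
echo "$GCC_MODULE $CUDA_MODULE"
```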
""" | |
To run the benchmark, you would use mpirun_rsh like this: | |
For single-node multi-GPU: | |
mpirun_rsh <ENV_PARAMS> -np 2 python distributed_benchmark.py --task text --parallel_mode ddp | |
and for multi-node: | |
mpirun_rsh <ENV_PARAMS> -hostfile hosts -np 4 python distributed_benchmark.py --task vision --parallel_mode fsdp_full |
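The `-hostfile` argument points at a plain text file listing one hostname per line; with MVAPICH2's `mpirun_rsh`, repeating a host places multiple ranks on it. A sketch of `hosts` for the `-np 4` run over two nodes with two GPUs each (hostnames are hypothetical):

```text
node01
node01
node02
node02
```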
```bash
#!/bin/bash

# Set tokenizer
TOKENIZER_TYPE=<TODO>
TOKENIZER_MODEL=<TODO>

# Set up distributed training
GPUS_PER_NODE=<TODO>
NNODES=<TODO>
export MASTER_ADDR=localhost  # ONLY FOR SINGLE-NODE. CHANGE FOR MULTI-NODE.
```
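Launchers built on this pattern generally derive the total number of ranks from these two settings. A minimal sketch with placeholder values standing in for the `<TODO>`s above (the values are assumptions for illustration only):

```shell
#!/bin/bash
# Hypothetical values; replace with your cluster's actual topology.
GPUS_PER_NODE=4
NNODES=2
WORLD_SIZE=$((GPUS_PER_NODE * NNODES))
echo "world size: $WORLD_SIZE"   # world size: 8
```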
```python
import argparse
import math

# Helper function to pretty-print FLOP counts
def convert_flops(params):
    if params == 0:
        return "0"
    size_name = ("", "KFLOPs", "MFLOPs", "GFLOPs", "TFLOPs", "PFLOPs", "EFLOPs", "ZFLOPs", "YFLOPs")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)
    return f"{s} {size_name[i]}"
```
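A quick standalone check of the helper (the function is reproduced and completed here so the snippet runs on its own; the input values are just illustrative FLOP counts):

```python
import math

# Completed version of the convert_flops helper above.
def convert_flops(params):
    if params == 0:
        return "0"
    size_name = ("", "KFLOPs", "MFLOPs", "GFLOPs", "TFLOPs", "PFLOPs", "EFLOPs", "ZFLOPs", "YFLOPs")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)
    return f"{s} {size_name[i]}"

print(convert_flops(1.5e12))   # 1.5 TFLOPs
print(convert_flops(312e12))   # 312.0 TFLOPs
```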
```python
import argparse
import math

# Helper function to pretty-print parameter counts
def convert_params(params):
    if params == 0:
        return "0"
    size_name = ("", "K", "M", "B", "T", "P", "E", "Z", "Y")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)
    return f"{s} {size_name[i]}"
```
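The parameter-count variant behaves the same way, just with unit suffixes like B for billions. A self-contained, completed copy with example inputs (7e9 is an illustrative model size, not one from the source):

```python
import math

# Completed version of the convert_params helper above.
def convert_params(params):
    if params == 0:
        return "0"
    size_name = ("", "K", "M", "B", "T", "P", "E", "Z", "Y")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)
    return f"{s} {size_name[i]}"

print(convert_params(7e9))      # 7.0 B
print(convert_params(125_000))  # 125.0 K
```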
```python
import torch
from safetensors.torch import save_file, load_file
import numpy as np
import argparse
import os
import time

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-save", dest="save", action="store_false",
                        help="disable saving initial tensors")
```
Thank you for your interest in contributing to open source software projects (“Projects”) made available by the Network-Based Computing Laboratory (NBCL) or its affiliates (“NBCL”). This Individual Contributor License Agreement (“Agreement”) sets out the terms governing any source code, object code, bug fixes, configuration changes, tools, specifications, documentation, data, materials, feedback, information or other works of authorship that you submit or have submitted, in any form and in any manner, to NBCL in respect of any of the Projects (collectively “Contributions”). If you have any questions respecting this Agreement, please contact [email protected].
You agree that the following terms apply to all of your past, present and future Contributions. Except for the licenses granted in this Agreement, you retain all of your right, title and interest in and to your Contributions.
Copyright License. You hereby grant, and agree to grant, to NB