Quentin Anthony Quentin-Anthony

Slurm to Kubernetes Cheat Sheet

Slurm Concept	Kubernetes Equivalent	Description
Cluster	Cluster	Overall compute infrastructure
Node	Node	Physical/virtual machine in the cluster
Partition	Namespace + Resource Quotas	Logical division of resources
Account	RBAC Roles and RoleBindings	Access control mechanisms

Below I describe how Megatron statically builds datasets, and then how models pull from those datasets at training time. In order:

How GPT datasets are produced inside Megatron‑Core;
Exactly what a training step receives (__getitem__ --> DataLoader --> model);
How to host the finished .bin / .idx pair in an S3‑compatible bucket and stream it lazily during training. I think this is the desired end-state for templar's training needs.
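To make the last two points concrete, here is a minimal sketch of lazy, range-based sample reads. It is not Megatron-Core's on-disk format or API: the flat int32 token layout, the `read_range` callable, the `LazyTokenDataset` class, and the `tokens` / `labels` key names are all assumptions made for illustration. The idea is only that `__getitem__` asks for the exact byte range of one sample, so a local file read can later be swapped for a ranged GET against an S3-compatible bucket without touching the dataset logic.

```python
import os
import struct
import tempfile
from typing import Callable, Dict, List

TOKEN_BYTES = 4  # assumption: tokens stored as little-endian int32


def write_toy_bin(path: str, tokens: List[int]) -> None:
    """Write tokens as a flat int32 array (toy stand-in for a .bin file)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(tokens)}i", *tokens))


def local_range_reader(path: str) -> Callable[[int, int], bytes]:
    """Return read_range(offset, length) backed by a local file.

    A bucket-backed version would fetch the same byte range remotely.
    """
    def read_range(offset: int, length: int) -> bytes:
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
    return read_range


class LazyTokenDataset:
    """Hypothetical dataset that reads one sample's bytes per __getitem__."""

    def __init__(self, read_range: Callable[[int, int], bytes],
                 num_tokens: int, seq_len: int) -> None:
        self.read_range = read_range
        self.seq_len = seq_len
        # Each sample needs seq_len + 1 tokens so labels can be shifted by one.
        self.num_samples = (num_tokens - 1) // seq_len

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int) -> Dict[str, List[int]]:
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)
        start = idx * self.seq_len
        count = self.seq_len + 1
        raw = self.read_range(start * TOKEN_BYTES, count * TOKEN_BYTES)
        tokens = list(struct.unpack(f"<{count}i", raw))
        return {"tokens": tokens[:-1], "labels": tokens[1:]}


def demo() -> Dict[str, List[int]]:
    """Build a 10-token toy file and fetch the second sample."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        write_toy_bin(path, list(range(10)))
        ds = LazyTokenDataset(local_range_reader(path), num_tokens=10, seq_len=3)
        return ds[1]


if __name__ == "__main__":
    print(demo())
```

Keeping the byte-range reader as a plain callable is a design choice for the sketch: the dataset never learns whether the bytes came from disk or from a bucket, which is what makes streaming a drop-in change.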

	#!/usr/bin/env python3

	# Standard library
	import logging
	import os
	import subprocess
	import time
	from datetime import datetime

	# Third-party: NVIDIA Management Library bindings
	import pynvml

	# Configure logging