@Quentin-Anthony
Quentin-Anthony / metrics_collector.py
Created February 4, 2025 18:46
Collects system metrics
#!/usr/bin/env python3
import subprocess
import time
import logging
from datetime import datetime
import pynvml
import os
# Configure logging
Quentin-Anthony / slurm_to_kub.md
Created April 30, 2025 19:06
cheatsheet for migrating from slurm to kubernetes

Slurm to Kubernetes Cheat Sheet

Conceptual Mapping

| Slurm Concept | Kubernetes Equivalent | Description |
| --- | --- | --- |
| Cluster | Cluster | Overall compute infrastructure |
| Node | Node | Physical/virtual machine in the cluster |
| Partition | Namespace + Resource Quotas | Logical division of resources |
| Account | RBAC Roles and RoleBindings | Access control mechanisms |

How Megatron Builds and Pulls from Datasets

Below I describe how Megatron statically builds datasets, and then how models pull from those datasets at training time. In order:

  1. How GPT datasets are produced inside Megatron‑Core;
  2. Exactly what a training step receives (__getitem__ --> DataLoader --> model);
  3. How to host the finished .bin / .idx pair in an S3‑compatible bucket and stream it lazily during training. I think this is the desired end-state for templar's training needs.
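
To make the lazy-read idea in item 3 concrete, here is a minimal sketch of a dataset whose `__getitem__` reads only the byte range it needs. This is not Megatron's actual on-disk format or API: the flat little-endian uint16 layout, `write_tokens`, `LazyTokenDataset`, and the `tokens`/`labels` keys are all hypothetical names assumed for illustration, and a local file stands in for the bucket.

```python
import os
import struct
import tempfile

TOKEN_BYTES = 2  # assumption: each token id is a little-endian uint16


def write_tokens(path, tokens):
    """Write token ids as a flat uint16 array (toy stand-in for a .bin file)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(tokens)}H", *tokens))


class LazyTokenDataset:
    """Hypothetical dataset: sample i covers tokens [i*seq_len, (i+1)*seq_len + 1)."""

    def __init__(self, path, seq_len):
        self.path = path
        self.seq_len = seq_len
        n_tokens = os.path.getsize(path) // TOKEN_BYTES
        # Each sample needs seq_len + 1 tokens: the inputs plus the shifted labels.
        self.n_samples = (n_tokens - 1) // seq_len

    def __len__(self):
        return self.n_samples

    def _read_range(self, start, length):
        # Only this byte range is read. Against an object store, the same
        # request could be served by a ranged GET instead of seek + read.
        with open(self.path, "rb") as f:
            f.seek(start)
            return f.read(length)

    def __getitem__(self, idx):
        if not 0 <= idx < self.n_samples:
            raise IndexError(idx)
        count = self.seq_len + 1
        raw = self._read_range(idx * self.seq_len * TOKEN_BYTES, count * TOKEN_BYTES)
        window = list(struct.unpack(f"<{count}H", raw))
        # Labels are the inputs shifted left by one position.
        return {"tokens": window[:-1], "labels": window[1:]}


def demo():
    """Build a 10-token toy file and read two samples from it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        write_tokens(path, list(range(10)))
        ds = LazyTokenDataset(path, seq_len=4)
        return len(ds), ds[0], ds[1]
```

Because the class only defines `__len__` and `__getitem__`, it has the shape a map-style PyTorch `DataLoader` expects, which is the `__getitem__ --> DataLoader --> model` path from item 2. Swapping `_read_range` for a ranged request against the bucket is the only change the streaming case should need under these assumptions.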