| Slurm Concept | Kubernetes Equivalent | Description |
|---|---|---|
| Cluster | Cluster | Overall compute infrastructure |
| Node | Node | Physical/virtual machine in the cluster |
| Partition | Namespace + Resource Quotas | Logical division of resources |
| Account | RBAC Roles and RoleBindings | Access control mechanisms |
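To make the Partition → Namespace + Resource Quotas row concrete, here is a minimal sketch using the official Kubernetes Python client: a namespace plus a `ResourceQuota` caps what a team can consume, roughly like a Slurm partition's limits. The namespace name and the CPU/memory/GPU figures below are illustrative assumptions, not values from any particular cluster.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod
core = client.CoreV1Api()

# Namespace standing in for a Slurm partition (name is a made-up example)
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name="gpu-partition"))
)

# ResourceQuota playing the role of the partition's resource limits
# (the CPU/memory/GPU figures are placeholders)
core.create_namespaced_resource_quota(
    namespace="gpu-partition",
    body=client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="gpu-partition-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.cpu": "256",
                "requests.memory": "1Ti",
                "requests.nvidia.com/gpu": "32",
            }
        ),
    ),
)
```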
```python
#!/usr/bin/env python3
import subprocess
import time
import logging
from datetime import datetime
import pynvml
import os

# Configure logging
```
Below I describe how Megatron statically builds datasets and how models pull from those datasets at training time. In order:

- How GPT datasets are produced inside Megatron‑Core;
- Exactly what a training step receives (`__getitem__` → `DataLoader` → model);
- How to host the finished `.bin`/`.idx` pair in an S3‑compatible bucket and stream it lazily during training (see the sketch after this list). I think this is the desired end-state for templar's training needs.
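As a rough sketch of the last point: the `.idx` file is small enough to load eagerly, while token data in the `.bin` file can be fetched on demand with HTTP byte-range reads. The endpoint, bucket, key, token dtype, and offsets below are illustrative assumptions, not Megatron‑Core's actual index layout; in practice the offsets would come from the `.idx` file.

```python
import boto3
import numpy as np

# Hypothetical example: lazily read a slice of tokens from a .bin file
# hosted in an S3-compatible bucket, without downloading the whole object.
s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")  # assumed endpoint

BUCKET = "pretraining-data"     # assumed bucket name
BIN_KEY = "gpt/my_corpus.bin"   # assumed object key
TOKEN_DTYPE = np.uint16         # assumed token dtype (depends on vocab size)

def read_tokens(start_token: int, num_tokens: int) -> np.ndarray:
    """Fetch num_tokens tokens starting at start_token via a byte-range GET."""
    itemsize = np.dtype(TOKEN_DTYPE).itemsize
    start_byte = start_token * itemsize
    end_byte = start_byte + num_tokens * itemsize - 1  # Range header is inclusive
    resp = s3.get_object(Bucket=BUCKET, Key=BIN_KEY, Range=f"bytes={start_byte}-{end_byte}")
    return np.frombuffer(resp["Body"].read(), dtype=TOKEN_DTYPE)

# e.g. pull a 2048-token sample lazily at training time
sample = read_tokens(start_token=0, num_tokens=2048)
```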