@yejingxin
Last active November 9, 2024

Goodputdefinition.md
GPU Training Goodput Explained

A typical cloud GPU customer's journey involves training a model toward a specific goal: processing a fixed number of tokens with a model of a given size.

Setup

  • Model size: $P$ parameters
  • Training data: $D$ tokens
  • Availability goal (SLO): $A$, expressed as a fraction (e.g., $0.99$)
  • Effective training speed: $F$ FLOPs/second across $C$ chips

Theoretical Training Time Calculation

To complete the training, users need a certain number of floating-point operations (FLOPs), which can be estimated with the standard approximation of $6$ FLOPs per parameter per token:

$$ 6 \times P \times D $$

With the available speed $F$, this would take:

$$ \text{Time} = \frac{6PD}{F} \text{ seconds} $$

Given the availability target $A$, users need to allocate slightly more chips:

$$ \text{Required chips} = \frac{C}{A} $$

The theoretical minimum chip-seconds required for training is therefore:

$$ \frac{6PDC}{FA} $$
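The derivation above can be sketched numerically. A minimal Python sketch, using hypothetical values for $P$, $D$, $F$, $C$, and $A$ (none of these numbers come from the text):

```python
def theoretical_min_chip_seconds(p, d, f, c, a):
    """Theoretical minimum chip-seconds: (6*P*D / F) * (C / A).

    p: model parameters, d: training tokens,
    f: effective FLOPs/s across c chips, a: availability SLO (fraction).
    """
    flops_needed = 6 * p * d          # total training FLOPs (6PD rule)
    train_seconds = flops_needed / f  # wall-clock time at effective speed F
    required_chips = c / a            # over-allocate chips to meet the SLO
    return train_seconds * required_chips

# Hypothetical example: 70B params, 1T tokens, 1e18 FLOPs/s on 4096 chips, 99% SLO
min_cs = theoretical_min_chip_seconds(70e9, 1e12, 1e18, 4096, 0.99)
print(f"Theoretical minimum: {min_cs:.3e} chip-seconds")
```

The same result follows directly from $6PDC/(FA)$; the function just makes each intermediate step explicit.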

Practical Realities and Additional Time

However, real-world factors often increase the actual time needed. These include:

  • Cluster provisioning time: time to get resources ready
  • Job scheduling time: time for the job to enter the queue and begin
  • Initialization time: time from job start to the first training step
  • Checkpointing: time to save and restore model progress
  • Job recovery: time spent restoring state after interruptions or failures
  • SLO pauses: time when the job is paused due to GPU unavailability
  • Holdback chip-seconds: additional chips are held back to ensure GPUs remain available
  • ...

Adding up all these, the actual total chip-seconds used for training, $T$, is often higher than the theoretical minimum.
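One way to account for these overheads is to track each category of lost chip-seconds and add it to the ideal total. The category names and numbers below are illustrative, not a standard taxonomy:

```python
def actual_chip_seconds(ideal_chip_seconds, overheads):
    """Actual chip-seconds T = ideal chip-seconds + sum of overhead chip-seconds."""
    return ideal_chip_seconds + sum(overheads.values())

# Illustrative overhead budget, in chip-seconds (hypothetical values)
overheads = {
    "provisioning":  2.0e6,
    "scheduling":    5.0e5,
    "initialization": 3.0e5,
    "checkpointing": 8.0e6,
    "recovery":      1.2e7,
    "slo_pauses":    4.0e6,
    "holdback":      9.0e6,
}

t = actual_chip_seconds(1.738e9, overheads)  # T exceeds the ideal minimum
```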

Goodput Calculation

The GPU training goodput is the efficiency ratio of theoretical minimum chip-seconds to actual chip-seconds used:

$$ \text{Goodput} = \frac{6PDC}{FAT} $$

This metric helps gauge how effectively the resources are being used relative to ideal conditions.
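Putting it together, goodput is a simple ratio of ideal to actual chip-seconds; a sketch assuming all quantities are in consistent units:

```python
def goodput(p, d, f, c, a, actual_chip_seconds):
    """Goodput = 6PDC / (F * A * T): theoretical minimum over actual chip-seconds."""
    ideal = 6 * p * d * c / (f * a)
    return ideal / actual_chip_seconds

# Hypothetical: same setup as the earlier example, with T slightly above the ideal
g = goodput(70e9, 1e12, 1e18, 4096, 0.99, 1.774e9)
print(f"Goodput: {g:.1%}")
```

A goodput near 1.0 means almost every allocated chip-second went toward useful training FLOPs; the gap below 1.0 is the combined cost of the overheads listed above.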
