A typical journey for a cloud GPU customer involves training a model with a specific goal: processing a set number of tokens with a model of a certain size.
- Model size: $P$ billion parameters
- Training data: $D$ tokens
- Availability goal (SLO): $A$
- Effective training speed: $F$ FLOPs/second across $C$ chips
To complete the training, the job needs a total number of floating-point operations (FLOPs), which, using the standard $\approx 6PD$ approximation for dense transformer training, can be estimated as:

$$\text{FLOPs}_{\text{required}} \approx 6 \cdot (P \times 10^9) \cdot D$$

With the available speed of $F$ FLOPs/second across the $C$ chips, the ideal training time is:

$$T_{\text{ideal}} = \frac{\text{FLOPs}_{\text{required}}}{F}$$

Given the availability target $A$, the chips can be expected to be usable for only a fraction $A$ of the provisioned time. The theoretical minimum chip-seconds required for training is therefore:

$$\text{ChipSeconds}_{\text{min}} = \frac{C \cdot T_{\text{ideal}}}{A} = \frac{C \cdot \text{FLOPs}_{\text{required}}}{F \cdot A}$$
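As a sketch of this arithmetic in Python, the snippet below computes the required FLOPs, the ideal training time, and the minimum chip-seconds. The $6PD$ approximation, the function name, and the example inputs (a 70B-parameter model, 2T tokens, 1024 chips) are illustrative assumptions, not figures from this article.

```python
def min_chip_seconds(p_billion: float, d_tokens: float, f_flops_per_sec: float,
                     c_chips: int, availability: float) -> float:
    """Theoretical minimum chip-seconds for a training run.

    p_billion       -- model size P in billions of parameters
    d_tokens        -- training data D in tokens
    f_flops_per_sec -- effective aggregate speed F across all chips (FLOPs/s)
    c_chips         -- number of chips C
    availability    -- availability SLO A (fraction of time chips are usable)
    """
    flops_required = 6 * (p_billion * 1e9) * d_tokens  # ~6PD approximation
    t_ideal = flops_required / f_flops_per_sec         # ideal wall-clock seconds
    return c_chips * t_ideal / availability            # chip-seconds under the SLO


# Example: a 70B-parameter model trained on 2T tokens with 1024 chips that
# deliver a combined 4e17 effective FLOP/s, under a 99% availability SLO.
print(f"{min_chip_seconds(70, 2e12, 4e17, 1024, 0.99):.3e} chip-seconds")
```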
However, real-world factors often increase the actual time needed. These include:
- Cluster provisioning time: time to get resources ready
- Job scheduling time: time for the job to enter the queue and begin
- Initialization time: time from job start to the first training step
- Checkpointing: time to save and restore model progress
- Job recovery: time spent restoring progress after interruptions or failures
- SLO pauses: time when the job is paused due to GPU unavailability
- Holdback chip-seconds: additional chips are held back to ensure GPUs are available when needed
- ...
Adding all of these up gives the actual total chip-seconds used for training, $\text{ChipSeconds}_{\text{actual}}$.
The GPU training goodput is the efficiency ratio of theoretical minimum chip-seconds to actual chip-seconds used:

$$\text{Goodput} = \frac{\text{ChipSeconds}_{\text{min}}}{\text{ChipSeconds}_{\text{actual}}}$$
This metric helps gauge how effectively the resources are being used relative to ideal conditions.
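As a rough illustration of the goodput calculation, the sketch below adds hypothetical overhead and holdback chip-seconds to the theoretical minimum from the earlier snippet; all names and numbers are assumed for the example rather than measured values.

```python
def training_goodput(min_chip_seconds: float, actual_chip_seconds: float) -> float:
    """Goodput = theoretical minimum chip-seconds / actual chip-seconds used."""
    return min_chip_seconds / actual_chip_seconds


# Hypothetical accounting for a run on 1024 chips. Each overhead is wall-clock
# hours during which the whole cluster was held, converted to chip-seconds.
c_chips = 1024
overhead_hours = {
    "cluster_provisioning": 2.0,
    "job_scheduling": 1.0,
    "initialization": 0.5,
    "checkpointing": 6.0,
    "job_recovery": 12.0,
    "slo_pauses": 8.0,
}
overhead_chip_seconds = c_chips * sum(overhead_hours.values()) * 3600
holdback_chip_seconds = 64 * 30 * 24 * 3600  # 64 spare chips held back for 30 days

min_cs = 2.17e9  # theoretical minimum from the earlier sketch
actual_cs = min_cs + overhead_chip_seconds + holdback_chip_seconds

print(f"goodput = {training_goodput(min_cs, actual_cs):.1%}")  # roughly 89%
```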