A typical journey for a cloud GPU customer involves training a model with a specific goal: processing a set number of tokens with a model of a certain size.
- Model size: $P$ billion parameters
- Training data: $D$ tokens
- Availability goal (SLO): $A$
- Effective training speed: $F$ FLOPs/second across $C$ chips
To complete the training, the job needs a total number of floating-point operations (FLOPs), which, using the standard $\approx 6PD$ approximation for dense transformer training, can be estimated as:

$$\text{FLOPs}_{\text{required}} \approx 6 \cdot (P \times 10^9) \cdot D$$

With the available speed of $F$ FLOPs/second across the $C$ chips, the ideal training time is:

$$T_{\text{ideal}} = \frac{\text{FLOPs}_{\text{required}}}{F}$$

Given the availability target $A$, the chips can be expected to be usable for only a fraction $A$ of the provisioned time. The theoretical minimum chip-seconds required for training is therefore:

$$\text{ChipSeconds}_{\text{min}} = \frac{C \cdot T_{\text{ideal}}}{A} = \frac{C \cdot \text{FLOPs}_{\text{required}}}{F \cdot A}$$
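As a sketch of this arithmetic in Python, the snippet below computes the required FLOPs, the ideal training time, and the minimum chip-seconds. The $6PD$ approximation, the function name, and the example inputs (a 70B-parameter model, 2T tokens, 1024 chips) are illustrative assumptions, not figures from this article.

```python
def min_chip_seconds(p_billion: float, d_tokens: float, f_flops_per_sec: float,
                     c_chips: int, availability: float) -> float:
    """Theoretical minimum chip-seconds for a training run.

    p_billion       -- model size P in billions of parameters
    d_tokens        -- training data D in tokens
    f_flops_per_sec -- effective aggregate speed F across all chips (FLOPs/s)
    c_chips         -- number of chips C
    availability    -- availability SLO A (fraction of time chips are usable)
    """
    flops_required = 6 * (p_billion * 1e9) * d_tokens  # ~6PD approximation
    t_ideal = flops_required / f_flops_per_sec         # ideal wall-clock seconds
    return c_chips * t_ideal / availability            # chip-seconds under the SLO


# Example: a 70B-parameter model trained on 2T tokens with 1024 chips that
# deliver a combined 4e17 effective FLOP/s, under a 99% availability SLO.
print(f"{min_chip_seconds(70, 2e12, 4e17, 1024, 0.99):.3e} chip-seconds")
```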
However, real-world factors often increase the actual time needed. These include:
- Cluster provisioning time: time to get resources ready
- Job scheduling time: time for the job to enter the queue and begin
- Initialization time: time from job start to the first training step
- Checkpointing: time to save and restore model progress
- Job recovery: time spent restoring progress after interruptions or failures
- SLO pauses: time when the job is paused due to GPU unavailability
- Holdback chip-seconds: additional chips are held back to ensure GPUs are available when needed
- ...
Adding all of these up gives the actual total chip-seconds used for training, $\text{ChipSeconds}_{\text{actual}}$.
The GPU training goodput is the efficiency ratio of theoretical minimum chip-seconds to actual chip-seconds used:

$$\text{Goodput} = \frac{\text{ChipSeconds}_{\text{min}}}{\text{ChipSeconds}_{\text{actual}}}$$
This metric helps gauge how effectively the resources are being used relative to ideal conditions.
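As a rough illustration of the goodput calculation, the sketch below adds hypothetical overhead and holdback chip-seconds to the theoretical minimum from the earlier snippet; all names and numbers are assumed for the example rather than measured values.

```python
def training_goodput(min_chip_seconds: float, actual_chip_seconds: float) -> float:
    """Goodput = theoretical minimum chip-seconds / actual chip-seconds used."""
    return min_chip_seconds / actual_chip_seconds


# Hypothetical accounting for a run on 1024 chips. Each overhead is wall-clock
# hours during which the whole cluster was held, converted to chip-seconds.
c_chips = 1024
overhead_hours = {
    "cluster_provisioning": 2.0,
    "job_scheduling": 1.0,
    "initialization": 0.5,
    "checkpointing": 6.0,
    "job_recovery": 12.0,
    "slo_pauses": 8.0,
}
overhead_chip_seconds = c_chips * sum(overhead_hours.values()) * 3600
holdback_chip_seconds = 64 * 30 * 24 * 3600  # 64 spare chips held back for 30 days

min_cs = 2.17e9  # theoretical minimum from the earlier sketch
actual_cs = min_cs + overhead_chip_seconds + holdback_chip_seconds

print(f"goodput = {training_goodput(min_cs, actual_cs):.1%}")  # roughly 89%
```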