Training GPT-2 124M

@aidando73 · Last active November 21, 2024 22:05

I trained the GPT-2 124M parameter model on 3 different GPU setups, following karpathy/llm.c#481. Here's how long each run took and how much it cost:

                 1x A10G    1x A100 40GB    8x A100 40GB
Training time    48h        15h             2h
Cost             $163       $27             $45
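These times line up with the per-run throughput figures reported below. A quick sanity check, taking the token count from the training command at the bottom of this gist (19,560 steps at 524,288 tokens per step):

```python
# Back-of-the-envelope: wall-clock hours implied by each setup's tokens/sec.
total_tokens = 19_560 * 524_288  # steps x tokens per step, ~10.26B tokens

for setup, tok_per_s in [("1x A10G", 62_000),
                         ("1x A100", 210_000),
                         ("8x A100", 1_630_000)]:
    hours = total_tokens / tok_per_s / 3600
    print(f"{setup}: ~{hours:.1f}h")  # roughly 46h / 14h / 1.7h
```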

Most of the completions are fairly nonsensical, but here are some interesting ones:

Prompt → Completion

  • The GitHub project llm.c is a... → project of The Leapfrog Group, which was founded in October 2003 to develop and develop hyper-centralized, distributed software.
  • In 2029, humanity's first mission to Mars began. As they landed, the crew faced unexpected challenges when... → it was revealed by NASA that the planet was entirely consistent with Earth and S equinox.
  • To make spaghetti bolognese, you must first... → make the meatballs with the sea salt.
  • The largest desert in the world is the... → front of the Milky Way, and it's getting worse.
  • System.out.println("Hello → x> 4.")"

Model weights + checkpoints for each of the training runs can be found at https://huggingface.co/aidando73/repro-gpt-2-124M/tree/main.

Dataset used is 10B tokens of fineweb: https://huggingface.co/datasets/HuggingFaceFW/fineweb

The rest of this gist describes the details of each training run.

Training run 1 - 1x A10G 24GB

  • 47h 35m

    • 8:44am 20th Nov AEDT to 8:19am 22nd Nov AEDT
  • Cost ~$163 ($1.39/hour)

    • Only including training time
    • AWS g5.8xlarge instances
    • Much more expensive than lambdalabs.com
      • An analogous A10 box costs around $0.75/hr on lambdalabs.com

plot

Note: my box was stopped just before the 19,000-step mark; the dev boxes at work run on EC2 spot instances.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   65C    P0            266W /  300W |   13740MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       909      G   /usr/lib/xorg/Xorg                            106MiB |
|    0   N/A  N/A      9413      C   ./train_gpt2cu                              13616MiB |
+-----------------------------------------------------------------------------------------+
  • ~60k tokens per second
step   77/19560 | loss 7.626959 (+nanz)| norm 1.3845 (+nanz)| lr 6.60e-05 | 8436.04 ms | -100.0% bf16 MFU | 62134 tok/s
step   78/19560 | loss 7.596109 (+nanz)| norm 1.0311 (+nanz)| lr 6.69e-05 | 8432.50 ms | -100.0% bf16 MFU | 62136 tok/s
step   79/19560 | loss 7.550291 (+nanz)| norm 1.0829 (+nanz)| lr 6.77e-05 | 8435.31 ms | -100.0% bf16 MFU | 62137 tok/s
  • Had to reduce batch size to 32 (-b 32). Otherwise, ran into OOM issues.
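For reference, the only change from the full launch command at the bottom of this gist was the micro-batch flag; since the total token batch (-d 524288) is held fixed, llm.c makes up the difference with extra gradient-accumulation steps. An abridged sketch:

```shell
# Run 1 (A10G 24GB): halve the micro-batch to fit in 24GB; -d stays at
# 524288, so llm.c increases gradient accumulation and the effective
# batch per optimizer step is unchanged.
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M -e "d12" \
    -b 32 -t 1024 -d 524288
    # ...plus the remaining flags from the full command at the bottom
```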

Training run 2 - 1x A100 SXM 40GB

  • 15h 18m
    • 1:26pm 20th Nov 2024 AEDT - 4:44am 21st Nov 2024 AEDT
  • Cost ~$27.29 ($1.29/hr on lambdalabs.com)
    • includes installation, dataset download and preprocessing

plot

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:08:00.0 Off |                    0 |
| N/A   74C    P0            405W /  400W |   25203MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     19862      C   ./train_gpt2cu                              25194MiB |
+-----------------------------------------------------------------------------------------+
  • ~200k tokens per second
step   24/19560 | loss 9.491743 (+nanz)| norm 2.1357 (+nanz)| lr 2.06e-05 | 2510.29 ms | 53.8% bf16 MFU | 209731 tok/s
step   25/19560 | loss 9.461581 (+nanz)| norm 2.0905 (+nanz)| lr 2.14e-05 | 2510.24 ms | 53.8% bf16 MFU | 209669 tok/s
step   26/19560 | loss 9.447474 (+nanz)| norm 1.9947 (+nanz)| lr 2.23e-05 | 2510.92 ms | 53.8% bf16 MFU | 209609 tok/s
  • Had to compile with make train_gpt2cu USE_CUDNN=1 NO_MULTI_GPU=1; otherwise I got the error "MPI support is disabled. Please enable MPI support to use MPI-based NCCL-init method."
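The MFU column is easy to sanity-check from the tok/s figure, assuming the usual ~6N FLOPs-per-token estimate (this ignores attention FLOPs, so it comes in slightly under llm.c's reported 53.8%):

```python
# Rough MFU from reported throughput: ~6 * params FLOPs per token
# (forward + backward), against A100 SXM bf16 dense peak of 312 TFLOPS.
n_params = 124e6
tok_per_s = 209_700            # from the run-2 log above
mfu = 6 * n_params * tok_per_s / 312e12
print(f"~{mfu:.0%} MFU")       # → ~50% MFU
```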

Training run 3 - 8x A100 40GB

  • Training took 2h 11m
    • 4:45pm 20th Nov AEDT - 6:56pm 20th Nov AEDT
  • Cost $44.95
    • Includes installation, dataset download and preprocessing. Also ran into some errors (see below)

image

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:08:00.0 Off |                    0 |
| N/A   76C    P0            395W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:09:00.0 Off |                    0 |
| N/A   68C    P0            423W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   67C    P0            409W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:0B:00.0 Off |                    0 |
| N/A   74C    P0            406W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  |   00000000:0C:00.0 Off |                    0 |
| N/A   72C    P0            431W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  |   00000000:0D:00.0 Off |                    0 |
| N/A   66C    P0            353W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  |   00000000:0E:00.0 Off |                    0 |
| N/A   66C    P0            329W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  |   00000000:0F:00.0 Off |                    0 |
| N/A   76C    P0            423W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     24393      C   ./train_gpt2cu                              24540MiB |
|    1   N/A  N/A     24394      C   ./train_gpt2cu                              24540MiB |
|    2   N/A  N/A     24395      C   ./train_gpt2cu                              24540MiB |
|    3   N/A  N/A     24396      C   ./train_gpt2cu                              24540MiB |
|    4   N/A  N/A     24397      C   ./train_gpt2cu                              24540MiB |
|    5   N/A  N/A     24398      C   ./train_gpt2cu                              24540MiB |
|    6   N/A  N/A     24399      C   ./train_gpt2cu                              24540MiB |
|    7   N/A  N/A     24400      C   ./train_gpt2cu                              24540MiB |
+-----------------------------------------------------------------------------------------+
  • Had to prepend nice to the command (nice nohup bash -c 'echo "start $(date)" && mpirun -np 8 ./train_gpt2cu ...); otherwise the process was killed after I exited the terminal [1]
  • ~1.6M tokens/s
step  179/19560 | loss 6.443990 (-1.13z)| norm 0.9910 (-0.26z)| lr 1.53e-04 | 321.80 ms | 52.4% bf16 MFU | 1630269 tok/s
step  180/19560 | loss 6.446496 (-1.12z)| norm 1.2475 (+0.62z)| lr 1.54e-04 | 321.94 ms | 52.4% bf16 MFU | 1630183 tok/s
step  181/19560 | loss 6.426447 (-1.15z)| norm 1.1282 (+0.24z)| lr 1.55e-04 | 321.83 ms | 52.4% bf16 MFU | 1630129 tok/s
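Assuming the throughput figures above, scaling from run 2's single A100 was close to linear:

```python
# Scaling efficiency of the 8-GPU run vs. 8x the single-A100 throughput.
single_gpu = 209_700    # tok/s, run 2
eight_gpu = 1_630_200   # tok/s, run 3
efficiency = eight_gpu / (8 * single_gpu)
print(f"scaling efficiency: ~{efficiency:.0%}")   # → ~97%
```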

Notes

  • Dollar amounts are in USD
  • Got to the end of training and ran into this error: karpathy/llm.c#786
    • So I didn't get the final weights, but I still have the 1500 checkpoint
    • Can probably work around this by setting the step count to 20001 and grabbing the checkpoint at step 20000, since checkpoints are saved properly. Might follow up later if I have time
  • Inference was performed following this method: https://gist.github.com/aidando73/cbc3ef69b21ad292bf021059a3fd0f06
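The step-count workaround mentioned in the notes above would look something like this; note that using -x as the max-steps override is my reading of train_gpt2cu's usage text and should be double-checked:

```shell
# Hypothetical workaround for karpathy/llm.c#786: run one step past 20000 so
# the step-20000 checkpoint (written on the -n 5000 interval) lands before
# the crashing final-step code path. -x as max-steps is an assumption here.
./train_gpt2cu -x 20001 -n 5000
# ...with the remaining data/optimizer flags as in the full command below
```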

Additional setup

Alongside instructions in: karpathy/llm.c#481

# After installing conda
# Avoid installing into base environment - gets a bit messy
conda create -n myenv python=3.10
conda activate myenv

# If cuda not installed yet, install cuda 12.4 (12.6 ran into some pytorch issues when initializing hellaswag)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4

sudo apt-get install -y nvidia-driver-550-open
sudo apt-get install -y cuda-drivers-550

echo 'export PATH=$PATH:/usr/local/cuda-12.4/bin' >> ~/.bashrc
echo 'export NVCC=/usr/local/cuda-12.4/bin/nvcc' >> ~/.bashrc

# Match cuda version
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
# Test if torch is compiled with cuda. Required for HellaSwag evaluation.
python -c "import torch; print(torch.cuda.is_available())"

# Download fineweb dataset in the background:
pip install -r requirements.txt
nohup python dev/data/fineweb.py --version 10B &
# Note: whilst downloading the dataset, nohup.out will be empty, but you can use the following command to track progress:
watch -c "du -ah ~/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb && tail ~/llm.c/nohup.out" # Downloads ~29GB

# Run this for hellaswag eval
# Might run into https://stackoverflow.com/questions/66371130/cuda-initialization-unexpected-error-from-cudagetdevicecount
# or https://github.com/karpathy/llm.c/issues/785
python dev/data/hellaswag.py

nohup bash -c 'echo "start $(date)" && ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1 && echo "end $(date)"' &
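# My reading of the flags above (from llm.c's usage output and discussion
# karpathy/llm.c#481; best-effort gloss, not authoritative):
#   -i/-j  train/val data file patterns       -o  output/log directory
#   -e d12 init a depth-12 GPT-2 (124M) from scratch
#   -b     micro-batch size                   -t  max sequence length
#   -d     total tokens per step (~0.5M, via gradient accumulation)
#   -r 1   recompute GeLU to save memory      -z 1  ZeRO-1 optimizer sharding
#   -c     weight decay                       -l  max learning rate
#   -q     final LR fraction (0 = decay to 0) -u  warmup steps
#   -n     checkpoint every n steps           -v  val loss every n steps
#   -s     sample every n steps               -h 1  run HellaSwag eval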

# To check progress:
tail -f ~/work/llm.c/nohup.out

# To generate graphs install this script https://gist.github.com/aidando73/f1a4966305af05699c38d7a64bd48922
# Then run:
(cd dev && python ./vislog.py)