Training GPT-2 124M

@aidando73 · Last active November 21, 2024 22:05

I trained the GPT-2 124M parameter model on 3 different GPU setups, following karpathy/llm.c#481. Here's how long each run took and how much it cost:

                 1x A10G    1x A100 40GB    8x A100 40GB
Training time    48h        15h             2h
Cost             $163       $27             $45
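These times line up with the per-run throughput figures reported below. A quick sanity check, taking the token count from the training command at the bottom of this gist (19,560 steps at 524,288 tokens per step):

```python
# Back-of-the-envelope: wall-clock hours implied by each setup's tokens/sec.
total_tokens = 19_560 * 524_288  # steps x tokens per step, ~10.26B tokens

for setup, tok_per_s in [("1x A10G", 62_000),
                         ("1x A100", 210_000),
                         ("8x A100", 1_630_000)]:
    hours = total_tokens / tok_per_s / 3600
    print(f"{setup}: ~{hours:.1f}h")  # roughly 46h / 14h / 1.7h
```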

Most of the completions are fairly nonsensical, but here are some interesting ones:

Prompt → Completion

  • The GitHub project llm.c is a... → project of The Leapfrog Group, which was founded in October 2003 to develop and develop hyper-centralized, distributed software.
  • In 2029, humanity's first mission to Mars began. As they landed, the crew faced unexpected challenges when... → it was revealed by NASA that the planet was entirely consistent with Earth and S equinox.
  • To make spaghetti bolognese, you must first... → make the meatballs with the sea salt.
  • The largest desert in the world is the... → front of the Milky Way, and it's getting worse.
  • System.out.println("Hello → x> 4.")"

Model weights + checkpoints for each of the training runs can be found at https://huggingface.co/aidando73/repro-gpt-2-124M/tree/main.

Dataset used is 10B tokens of fineweb: https://huggingface.co/datasets/HuggingFaceFW/fineweb

The rest of this gist describes the details of each training run.

Training run 1 - 1x A10G 24GB

  • 47h 35m

    • 8:44am 20th Nov AEDT to 8:19am 22nd Nov AEDT
  • Cost ~$163 ($1.39/hour)

    • Only including training time
    • AWS g5.8xlarge instances
    • Much more expensive than lambdalabs.com
      • An analogous A10 box costs around $0.75/hr on lambdalabs.com

plot

Note: my box was stopped just before the 19,000-step mark; the dev boxes at work run on EC2 spot instances.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   65C    P0            266W /  300W |   13740MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       909      G   /usr/lib/xorg/Xorg                            106MiB |
|    0   N/A  N/A      9413      C   ./train_gpt2cu                              13616MiB |
+-----------------------------------------------------------------------------------------+
  • ~60k tokens per second
step   77/19560 | loss 7.626959 (+nanz)| norm 1.3845 (+nanz)| lr 6.60e-05 | 8436.04 ms | -100.0% bf16 MFU | 62134 tok/s
step   78/19560 | loss 7.596109 (+nanz)| norm 1.0311 (+nanz)| lr 6.69e-05 | 8432.50 ms | -100.0% bf16 MFU | 62136 tok/s
step   79/19560 | loss 7.550291 (+nanz)| norm 1.0829 (+nanz)| lr 6.77e-05 | 8435.31 ms | -100.0% bf16 MFU | 62137 tok/s
  • Had to reduce batch size to 32 (-b 32). Otherwise, ran into OOM issues.
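For reference, the only change from the full launch command at the bottom of this gist was the micro-batch flag; since the total token batch (-d 524288) is held fixed, llm.c makes up the difference with extra gradient-accumulation steps. An abridged sketch:

```shell
# Run 1 (A10G 24GB): halve the micro-batch to fit in 24GB; -d stays at
# 524288, so llm.c increases gradient accumulation and the effective
# batch per optimizer step is unchanged.
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M -e "d12" \
    -b 32 -t 1024 -d 524288
    # ...plus the remaining flags from the full command at the bottom
```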

Training run 2 - 1x A100 SXM 40GB

  • 15h 18m
    • 1:26pm 20th Nov 2024 AEDT - 4:44am 21st Nov 2024 AEDT
  • Cost ~$27.29 ($1.29/hr on lambdalabs.com)
    • includes installation, dataset download and preprocessing

plot

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:08:00.0 Off |                    0 |
| N/A   74C    P0            405W /  400W |   25203MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     19862      C   ./train_gpt2cu                              25194MiB |
+-----------------------------------------------------------------------------------------+
  • ~200k tokens per second
step   24/19560 | loss 9.491743 (+nanz)| norm 2.1357 (+nanz)| lr 2.06e-05 | 2510.29 ms | 53.8% bf16 MFU | 209731 tok/s
step   25/19560 | loss 9.461581 (+nanz)| norm 2.0905 (+nanz)| lr 2.14e-05 | 2510.24 ms | 53.8% bf16 MFU | 209669 tok/s
step   26/19560 | loss 9.447474 (+nanz)| norm 1.9947 (+nanz)| lr 2.23e-05 | 2510.92 ms | 53.8% bf16 MFU | 209609 tok/s
  • Had to compile with make train_gpt2cu USE_CUDNN=1 NO_MULTI_GPU=1; otherwise I got the error "MPI support is disabled. Please enable MPI support to use MPI-based NCCL-init method."
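The MFU column is easy to sanity-check from the tok/s figure, assuming the usual ~6N FLOPs-per-token estimate (this ignores attention FLOPs, so it comes in slightly under llm.c's reported 53.8%):

```python
# Rough MFU from reported throughput: ~6 * params FLOPs per token
# (forward + backward), against A100 SXM bf16 dense peak of 312 TFLOPS.
n_params = 124e6
tok_per_s = 209_700            # from the run-2 log above
mfu = 6 * n_params * tok_per_s / 312e12
print(f"~{mfu:.0%} MFU")       # → ~50% MFU
```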

Training run 3 - 8x A100 40GB

  • Training took 2h 11m
    • 4:45pm 20th Nov AEDT - 6:56pm 20th Nov AEDT
  • Cost $44.95
    • Includes installation, dataset download and preprocessing. Also ran into some errors (see below)

image

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:08:00.0 Off |                    0 |
| N/A   76C    P0            395W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:09:00.0 Off |                    0 |
| N/A   68C    P0            423W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   67C    P0            409W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:0B:00.0 Off |                    0 |
| N/A   74C    P0            406W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  |   00000000:0C:00.0 Off |                    0 |
| N/A   72C    P0            431W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  |   00000000:0D:00.0 Off |                    0 |
| N/A   66C    P0            353W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  |   00000000:0E:00.0 Off |                    0 |
| N/A   66C    P0            329W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  |   00000000:0F:00.0 Off |                    0 |
| N/A   76C    P0            423W /  400W |   24677MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     24393      C   ./train_gpt2cu                              24540MiB |
|    1   N/A  N/A     24394      C   ./train_gpt2cu                              24540MiB |
|    2   N/A  N/A     24395      C   ./train_gpt2cu                              24540MiB |
|    3   N/A  N/A     24396      C   ./train_gpt2cu                              24540MiB |
|    4   N/A  N/A     24397      C   ./train_gpt2cu                              24540MiB |
|    5   N/A  N/A     24398      C   ./train_gpt2cu                              24540MiB |
|    6   N/A  N/A     24399      C   ./train_gpt2cu                              24540MiB |
|    7   N/A  N/A     24400      C   ./train_gpt2cu                              24540MiB |
+-----------------------------------------------------------------------------------------+
  • Had to prepend nice to the command (nice nohup bash -c 'echo "start $(date)" && mpirun -np 8 ./train_gpt2cu ...); otherwise the process was killed after I exited the terminal [1]
  • ~1.6M tokens/s
step  179/19560 | loss 6.443990 (-1.13z)| norm 0.9910 (-0.26z)| lr 1.53e-04 | 321.80 ms | 52.4% bf16 MFU | 1630269 tok/s
step  180/19560 | loss 6.446496 (-1.12z)| norm 1.2475 (+0.62z)| lr 1.54e-04 | 321.94 ms | 52.4% bf16 MFU | 1630183 tok/s
step  181/19560 | loss 6.426447 (-1.15z)| norm 1.1282 (+0.24z)| lr 1.55e-04 | 321.83 ms | 52.4% bf16 MFU | 1630129 tok/s
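Assuming the throughput figures above, scaling from run 2's single A100 was close to linear:

```python
# Scaling efficiency of the 8-GPU run vs. 8x the single-A100 throughput.
single_gpu = 209_700    # tok/s, run 2
eight_gpu = 1_630_200   # tok/s, run 3
efficiency = eight_gpu / (8 * single_gpu)
print(f"scaling efficiency: ~{efficiency:.0%}")   # → ~97%
```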

Notes

  • Dollar amounts are in USD
  • Got to the end of training and ran into this error: karpathy/llm.c#786
    • So I didn't get the final weights, but I still have the 1500 checkpoint
    • Can probably work around this by setting the step count to 20001 and grabbing the checkpoint at step 20000, since checkpoints are saved properly. Might follow up later if I have time
  • Inference was performed following this method: https://gist.github.com/aidando73/cbc3ef69b21ad292bf021059a3fd0f06
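The step-count workaround mentioned in the notes above would look something like this; note that using -x as the max-steps override is my reading of train_gpt2cu's usage text and should be double-checked:

```shell
# Hypothetical workaround for karpathy/llm.c#786: run one step past 20000 so
# the step-20000 checkpoint (written on the -n 5000 interval) lands before
# the crashing final-step code path. -x as max-steps is an assumption here.
./train_gpt2cu -x 20001 -n 5000
# ...with the remaining data/optimizer flags as in the full command below
```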

Additional setup

Alongside instructions in: karpathy/llm.c#481

# After installing conda
# Avoid installing into base environment - gets a bit messy
conda create -n myenv python=3.10
conda activate myenv

# If cuda not installed yet, install cuda 12.4 (12.6 ran into some pytorch issues when initializing hellaswag)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4

sudo apt-get install -y nvidia-driver-550-open
sudo apt-get install -y cuda-drivers-550

echo 'export PATH=$PATH:/usr/local/cuda-12.4/bin' >> ~/.bashrc
echo 'export NVCC=/usr/local/cuda-12.4/bin/nvcc' >> ~/.bashrc

# Match cuda version
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
# Test if torch is compiled with cuda. Required for HellaSwag evaluation.
python -c "import torch; print(torch.cuda.is_available())"

# Download fineweb dataset in the background:
pip install -r requirements.txt
nohup python dev/data/fineweb.py --version 10B &
# Note: whilst downloading the dataset, nohup.out will be empty, but you can use the following command to track progress:
watch -c "du -ah ~/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb && tail ~/llm.c/nohup.out" # Downloads ~29GB

# Run this for hellaswag eval
# Might run into https://stackoverflow.com/questions/66371130/cuda-initialization-unexpected-error-from-cudagetdevicecount
# or https://github.com/karpathy/llm.c/issues/785
python dev/data/hellaswag.py

nohup bash -c 'echo "start $(date)" && ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1 && echo "end $(date)"' &
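# My reading of the flags above (from llm.c's usage output and discussion
# karpathy/llm.c#481; best-effort gloss, not authoritative):
#   -i/-j  train/val data file patterns       -o  output/log directory
#   -e d12 init a depth-12 GPT-2 (124M) from scratch
#   -b     micro-batch size                   -t  max sequence length
#   -d     total tokens per step (~0.5M, via gradient accumulation)
#   -r 1   recompute GeLU to save memory      -z 1  ZeRO-1 optimizer sharding
#   -c     weight decay                       -l  max learning rate
#   -q     final LR fraction (0 = decay to 0) -u  warmup steps
#   -n     checkpoint every n steps           -v  val loss every n steps
#   -s     sample every n steps               -h 1  run HellaSwag eval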

# To check progress:
tail -f ~/work/llm.c/nohup.out

# To generate graphs install this script https://gist.github.com/aidando73/f1a4966305af05699c38d7a64bd48922
# Then run:
(cd dev && python ./vislog.py)