I tried training the GPT-2 124M parameter model on 3 different GPU setups, following karpathy/llm.c#481. Here's how long each run took and how much it cost:
| | 1x A10G | 1x A100 40GB | 8x A100 40GB |
|---|---|---|---|
| Training time | 48h | 15h | 2h |
| Cost | $163 | $27 | $45 |
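As a rough sanity check, the throughput and cost-efficiency implied by the table can be derived from the run configuration (19,560 steps at 524,288 tokens per step, i.e. one pass over the 10B-token dataset). A back-of-the-envelope sketch; the hours and dollar figures are the ones reported in this gist:

```python
# Total tokens processed = steps x tokens-per-step (from the training command: -d 524288)
total_tokens = 19_560 * 524_288  # ~10.25B tokens

runs = {
    # name: (wall-clock hours, cost in USD)
    "1x A10G":      (47.58, 163.00),
    "1x A100 40GB": (15.30,  27.29),
    "8x A100 40GB": ( 2.18,  44.95),
}

for name, (hours, cost) in runs.items():
    tok_per_s = total_tokens / (hours * 3600)
    tok_per_dollar = total_tokens / cost
    print(f"{name}: ~{tok_per_s / 1e3:.0f}k tok/s avg, ~{tok_per_dollar / 1e6:.0f}M tokens/$")
```

The 8x A100 box is the fastest but the single A100 is the cheapest per token, since the multi-GPU run pays for eight GPUs at ~97% scaling rather than >100%.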
Most of the completions are fairly nonsensical, but here are some interesting ones:
| Prompt | Completion |
|---|---|
| The GitHub project llm.c is a... | project of The Leapfrog Group, which was founded in October 2003 to develop and develop hyper-centralized, distributed software. |
| In 2029, humanity's first mission to Mars began. As they landed, the crew faced unexpected challenges when... | it was revealed by NASA that the planet was entirely consistent with Earth and S equinox. |
| To make spaghetti bolognese, you must first... | make the meatballs with the sea salt. |
| The largest desert in the world is the... | front of the Milky Way, and it's getting worse. |
| System.out.println("Hello | x> 4.")" |
Model weights + checkpoints for each of the training runs can be found at https://huggingface.co/aidando73/repro-gpt-2-124M/tree/main.
The dataset used is 10B tokens of FineWeb: https://huggingface.co/datasets/HuggingFaceFW/fineweb
The rest of this gist describes the details of each training run.
## 1x A10G

- Training took 47h 35m
  - 8:44am 20th Nov AEDT - 8:19am 22nd Nov AEDT
- Cost ~$163 ($1.39/hour)
  - Only includes training time
- Ran on an AWS `g5.8xlarge` instance - much more expensive than lambdalabs.com
  - An analogous A10 box costs around $0.75/hr on lambdalabs.com
- Note: my box got stopped right before the 19000-step mark - the dev boxes at work are based on EC2 spot instances.
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08 Driver Version: 550.127.08 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 65C P0 266W / 300W | 13740MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 909 G /usr/lib/xorg/Xorg 106MiB |
| 0 N/A N/A 9413 C ./train_gpt2cu 13616MiB |
+-----------------------------------------------------------------------------------------+
```
- ~60k tokens per second
```
step 77/19560 | loss 7.626959 (+nanz)| norm 1.3845 (+nanz)| lr 6.60e-05 | 8436.04 ms | -100.0% bf16 MFU | 62134 tok/s
step 78/19560 | loss 7.596109 (+nanz)| norm 1.0311 (+nanz)| lr 6.69e-05 | 8432.50 ms | -100.0% bf16 MFU | 62136 tok/s
step 79/19560 | loss 7.550291 (+nanz)| norm 1.0829 (+nanz)| lr 6.77e-05 | 8435.31 ms | -100.0% bf16 MFU | 62137 tok/s
```
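The `-100.0% bf16 MFU` here just means llm.c doesn't know the A10G's peak throughput, so it can't compute utilization. One way to still get a rough number is to calibrate flops-per-token from the single-A100 run below (53.8% MFU at ~209.7k tok/s), using the A100's widely quoted dense bf16 tensor peak of 312 TFLOPS; treat the peak figure as an assumption:

```python
# Implied compute per token, calibrated from the A100 log line
# (53.8% MFU * 312 TFLOPS peak, at 209,731 tok/s):
a100_peak = 312e12
flops_per_token = 0.538 * a100_peak / 209_731   # ~800 MFLOP per token

# Achieved throughput on the A10G at 62,134 tok/s:
a10g_achieved = flops_per_token * 62_134
print(f"~{a10g_achieved / 1e12:.1f} TFLOPS achieved on the A10G")  # ~49.7 TFLOPS
```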
- Had to reduce the batch size to 32 (`-b 32`); otherwise I ran into OOM issues.
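If I read llm.c's batching correctly, the total batch size is pinned at `-d 524288` tokens and the trainer derives the number of gradient-accumulation micro-steps from `-b`, `-t`, and the GPU count, so reducing `-b` trades speed for memory without changing the optimization. A sketch of that arithmetic (the function name is mine):

```python
def grad_accum_steps(d=524_288, b=64, t=1024, num_gpus=1):
    """Micro-steps per optimizer update given llm.c-style batch flags."""
    micro_batch_tokens = b * t * num_gpus
    assert d % micro_batch_tokens == 0, "-d must divide evenly into micro-batches"
    return d // micro_batch_tokens

print(grad_accum_steps(b=32))              # 16 micro-steps per update on the A10G
print(grad_accum_steps(b=64))              # 8 on a single A100
print(grad_accum_steps(b=64, num_gpus=8))  # 1 on the 8x A100 box
```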
## 1x A100 40GB

- Training took 15h 18m
  - 1:26pm 20th Nov 2024 AEDT - 4:44am 21st Nov 2024 AEDT
- Cost ~$27.29 ($1.29/hr on lambdalabs.com)
  - Includes installation, dataset download and preprocessing
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12 Driver Version: 550.90.12 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 74C P0 405W / 400W | 25203MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 19862 C ./train_gpt2cu 25194MiB |
+-----------------------------------------------------------------------------------------+
```
- ~200k tokens per second
```
step 24/19560 | loss 9.491743 (+nanz)| norm 2.1357 (+nanz)| lr 2.06e-05 | 2510.29 ms | 53.8% bf16 MFU | 209731 tok/s
step 25/19560 | loss 9.461581 (+nanz)| norm 2.0905 (+nanz)| lr 2.14e-05 | 2510.24 ms | 53.8% bf16 MFU | 209669 tok/s
step 26/19560 | loss 9.447474 (+nanz)| norm 1.9947 (+nanz)| lr 2.23e-05 | 2510.92 ms | 53.8% bf16 MFU | 209609 tok/s
```
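The per-step time in the log lets you cross-check the 15h 18m wall clock; roughly 13.6h of it is pure training, with the rest being setup:

```python
# 19,560 steps at ~2,510 ms each:
train_hours = 19_560 * 2_510 / 1000 / 3600
print(f"~{train_hours:.1f}h of training")  # ~13.6h; the remaining ~1.7h was
                                           # installation, download and preprocessing
```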
- Had to compile with `make train_gpt2cu USE_CUDNN=1 NO_MULTI_GPU=1`, otherwise I got an `MPI support is disabled. Please enable MPI support to use MPI-based NCCL-init method.` error
## 8x A100 40GB

- Training took 2h 11m
  - 4:45pm 20th Nov AEDT - 6:56pm 20th Nov AEDT
- Cost $44.95
  - Includes installation, dataset download and preprocessing; also ran into some errors (see below)
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12 Driver Version: 550.90.12 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:08:00.0 Off | 0 |
| N/A 76C P0 395W / 400W | 24677MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:09:00.0 Off | 0 |
| N/A 68C P0 423W / 400W | 24677MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:0A:00.0 Off | 0 |
| N/A 67C P0 409W / 400W | 24677MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:0B:00.0 Off | 0 |
| N/A 74C P0 406W / 400W | 24677MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-40GB On | 00000000:0C:00.0 Off | 0 |
| N/A 72C P0 431W / 400W | 24677MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-40GB On | 00000000:0D:00.0 Off | 0 |
| N/A 66C P0 353W / 400W | 24677MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-40GB On | 00000000:0E:00.0 Off | 0 |
| N/A 66C P0 329W / 400W | 24677MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-40GB On | 00000000:0F:00.0 Off | 0 |
| N/A 76C P0 423W / 400W | 24677MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 24393 C ./train_gpt2cu 24540MiB |
| 1 N/A N/A 24394 C ./train_gpt2cu 24540MiB |
| 2 N/A N/A 24395 C ./train_gpt2cu 24540MiB |
| 3 N/A N/A 24396 C ./train_gpt2cu 24540MiB |
| 4 N/A N/A 24397 C ./train_gpt2cu 24540MiB |
| 5 N/A N/A 24398 C ./train_gpt2cu 24540MiB |
| 6 N/A N/A 24399 C ./train_gpt2cu 24540MiB |
| 7 N/A N/A 24400 C ./train_gpt2cu 24540MiB |
+-----------------------------------------------------------------------------------------+
```
- Had to prepend `nice` to the command: `nice nohup bash -c 'echo "start $(date)" && mpirun -np 8 ./train_gpt2cu ...'` - otherwise the process was killed after I exited the terminal [1]
- ~1.6M tokens/s
```
step 179/19560 | loss 6.443990 (-1.13z)| norm 0.9910 (-0.26z)| lr 1.53e-04 | 321.80 ms | 52.4% bf16 MFU | 1630269 tok/s
step 180/19560 | loss 6.446496 (-1.12z)| norm 1.2475 (+0.62z)| lr 1.54e-04 | 321.94 ms | 52.4% bf16 MFU | 1630183 tok/s
step 181/19560 | loss 6.426447 (-1.15z)| norm 1.1282 (+0.24z)| lr 1.55e-04 | 321.83 ms | 52.4% bf16 MFU | 1630129 tok/s
```
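Comparing against the single-A100 run, the 8-GPU box scales almost linearly:

```python
# Throughputs from the two log excerpts in this gist:
single_a100 = 209_731    # tok/s
eight_a100 = 1_630_269   # tok/s
print(f"{eight_a100 / (8 * single_a100):.1%} scaling efficiency")  # ~97.2%
```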
- Dollar amounts are in USD
- Got to the end of training and ran into this error: karpathy/llm.c#786
  - So I didn't manage to get the final weights, but I still have the 15000 checkpoint
  - Can probably work around this by setting the step count to 20001 and taking the checkpoint at step 20000, since checkpoints are being saved properly. Might follow up later if I have the time
- Inference was performed following this method: https://gist.github.com/aidando73/cbc3ef69b21ad292bf021059a3fd0f06, alongside the instructions in karpathy/llm.c#481
```bash
# After installing conda
# Avoid installing into the base environment - it gets a bit messy
conda create -n myenv python=3.10
conda activate myenv

# If CUDA is not installed yet, install CUDA 12.4 (12.6 ran into some pytorch issues when initializing hellaswag)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
sudo apt-get install -y nvidia-driver-550-open
sudo apt-get install -y cuda-drivers-550
echo 'export PATH=$PATH:/usr/local/cuda-12.4/bin' >> ~/.bashrc
echo 'export NVCC=/usr/local/cuda-12.4/bin/nvcc' >> ~/.bashrc

# Match the CUDA version
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

# Test that torch is compiled with CUDA. Required for the HellaSwag evaluation.
python -c "import torch; print(torch.cuda.is_available())"
```
```bash
# Download the fineweb dataset in the background:
pip install -r requirements.txt
nohup python dev/data/fineweb.py --version 10B &
# Note: whilst downloading the dataset, nohup.out will be empty, but you can use the following command to track progress:
watch -c "du -ah ~/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb && tail ~/llm.c/nohup.out" # Downloads about ~29GB
```
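For a sense of the disk footprint: assuming fineweb.py stores GPT-2 token ids as uint16 (2 bytes per token) - which is my reading of llm.c's data format - the tokenized shards come to about 20 GB, on top of the ~29 GB raw parquet cache noted above:

```python
tokens = 10e9        # fineweb 10B
bytes_per_token = 2  # uint16 token ids (assumption about llm.c's .bin format)
print(f"~{tokens * bytes_per_token / 1e9:.0f} GB of tokenized .bin shards")  # ~20 GB
```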
```bash
# Run this for the hellaswag eval
# Might run into https://stackoverflow.com/questions/66371130/cuda-initialization-unexpected-error-from-cudagetdevicecount
# or https://github.com/karpathy/llm.c/issues/785
python dev/data/hellaswag.py
```
```bash
nohup bash -c 'echo "start $(date)" && ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1 && echo "end $(date)"' &
```
```bash
# To check progress:
tail -f ~/work/llm.c/nohup.out
```
```bash
# To generate graphs, install this script: https://gist.github.com/aidando73/f1a4966305af05699c38d7a64bd48922
# Then run:
(cd dev && python ./vislog.py)
```