# L40S vs A40 Benchmarks
*Last active: December 7, 2024*
**Goal**: Run benchmarks of SDXL, SVD, and Llama 2 13B on an L40S test node.

**TL;DR**:
- L40S has the same inference speed as the A40 for SDXL
- L40S has ~10% faster inference than the A40 for Llama 2
- L40S is ~9% faster than the A40 at video rendering

**Process**: Run non-docker/cog Python code at fp16.
- SDXL:
  - https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
  - https://gist.github.com/lucataco/5cace44ef82c616fa7795bb6c202fb14
- Llama2-13b-chat:
  - https://gist.github.com/lucataco/8f30c8cb6beb239fba9adb2237a90959
- SVD:
  - https://github.com/replicate/cog-svd/commit/09bf70cbc2fcb5a52db23cf433f14c625448ac4e
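The back-to-back runs referenced above can be timed with a small harness along these lines. This is a minimal sketch, not code from the linked gists; the workload callable and the `sync` hook (e.g. `torch.cuda.synchronize` when timing GPU work) are assumptions:

```python
import time

def benchmark(fn, runs, warmup=1, sync=None):
    """Time `fn` over `runs` back-to-back calls, returning total seconds.

    `sync` is an optional callable (e.g. torch.cuda.synchronize) invoked
    before each clock read so queued GPU work is included in the timing.
    Warm-up calls are excluded from the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync:
        sync()
    return time.perf_counter() - start

# Stand-in CPU workload in place of a real model call:
elapsed = benchmark(lambda: sum(i * i for i in range(10_000)), runs=10)
print(f"{elapsed:.3f} s total, {elapsed / 10:.4f} s per run")
```

Reporting total seconds (rather than per-run means) matches how the result tables below are presented.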
**Systems**:
3 VMs, each with a different GPU (and VRAM capacity):
- L4 (24 GB)
- A40 (48 GB)
- L40S (45 GB)

All running CUDA 12.2.
Conda setup: | |
conda create -n bench python=3.10 | |
conda activate bench | |
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia | |
pip install -r requirements.txt | |
requirements-sdxl.txt:

```
diffusers==0.19.2
torch==2.0.1
transformers==4.31.0
invisible-watermark==0.2.0
accelerate==0.21.0
pandas==2.0.3
torchvision==0.15.2
numpy==1.25.1
fire==0.5.0
opencv-python>=4.1.0.25
mediapipe==0.10.2
```
requirements-svd.txt:

```
black==23.7.0
chardet==5.1.0
clip @ git+https://github.com/openai/CLIP.git
einops>=0.6.1
fairscale>=0.4.13
fsspec>=2023.6.0
invisible-watermark>=0.2.0
kornia==0.6.9
matplotlib>=3.7.2
natsort>=8.4.0
ninja>=1.11.1
numpy>=1.24.4
omegaconf>=2.3.0
open-clip-torch>=2.20.0
opencv-python==4.6.0.66
pandas>=2.0.3
pillow>=9.5.0
pudb>=2022.1.3
pytorch-lightning==2.0.1
pyyaml>=6.0.1
scipy>=1.10.1
streamlit>=0.73.1
tensorboardx==2.6
timm>=0.9.2
tokenizers==0.12.1
torch>=2.0.1
torchdata==0.6.1
torchmetrics>=1.0.1
torchvision>=0.15.2
tqdm>=4.65.0
transformers==4.19.1
triton==2.0.0
urllib3<1.27,>=1.25.4
wandb>=0.15.6
webdataset>=0.2.33
wheel>=0.41.0
xformers>=0.0.20
git+https://github.com/Stability-AI/generative-models.git
```
requirements-llama.txt:

```
accelerate==0.23.0
bitsandbytes==0.41.1
protobuf==3.20.3
scipy==1.11.2
sentencepiece==0.1.99
spaces==0.16.1
torch==2.0.0
transformers==4.34.0
```
**Runs**:

**SDXL**:
Single-image tests run back to back.

| Runs | L4 | A40 | L40S |
| --- | --- | --- | --- |
| 1x | 31.935 s | 10.193 s | 9.676 s |
| 10x | 315.453 s | 91.027 s | 91.678 s |
| 100x | 3124.300 s | 907.273 s | 915.423 s |

*Runs are measured in seconds (lower is better).*

L40S is the same speed as the A40 for SDXL txt2img inference.
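The "same speed" claim checks out from the 100x row: the per-image gap between the two cards is under 1%. Plain arithmetic over the table values, not from the original benchmark scripts:

```python
# Total seconds for 100 SDXL images, from the table above
a40_total, l40s_total, n = 907.273, 915.423, 100

a40_per = a40_total / n    # seconds per image on the A40
l40s_per = l40s_total / n  # seconds per image on the L40S
gap = (l40s_total - a40_total) / a40_total * 100  # relative gap in %

print(f"A40: {a40_per:.3f} s/img, L40S: {l40s_per:.3f} s/img, gap: {gap:.1f}%")
```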
**Llama2-13b-chat**:
Single-prompt test with varying max new tokens.

| Max tokens | L4 | A40 | L40S |
| --- | --- | --- | --- |
| 512 | 1.86 t/s | 52.66 t/s | 58.02 t/s |
| 1024 | 1.84 t/s | 53.72 t/s | 59.28 t/s |
| 2048 | N/A | 53.48 t/s | 59.42 t/s |

*Runs are measured in tokens per second (higher is better).*

L40S is ~10.5% faster than the A40 for Llama 2 inference.
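The 10.5% figure is the mean throughput gain across the three max-token settings. Recomputed from the table (plain arithmetic, not from the benchmark gist):

```python
# (A40 t/s, L40S t/s) per max-token setting, from the table above
pairs = {512: (52.66, 58.02), 1024: (53.72, 59.28), 2048: (53.48, 59.42)}

# Throughput gain of the L40S over the A40, in percent
speedups = {k: (l40s / a40 - 1) * 100 for k, (a40, l40s) in pairs.items()}
mean = sum(speedups.values()) / len(speedups)

for k, s in speedups.items():
    print(f"{k} max tokens: +{s:.1f}%")
print(f"mean: +{mean:.1f}%")
```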
**Stable Video Diffusion**:
Single-video tests run back to back.

| Runs | L4 | A40 | L40S |
| --- | --- | --- | --- |
| 1x | 183.330 s | 66.176 s | 59.425 s |
| 10x | 1798.206 s | 630.390 s | 584.991 s |

*Runs are measured in seconds (lower is better).*

L40S is ~9% (7.2% to 10.2% across runs) faster than the A40 for video rendering.
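The per-run range comes from the wall-time reduction on each row. Recomputed from the table (plain arithmetic over the reported totals, not from the benchmark code):

```python
# (A40 s, L40S s) per run count, from the table above
runs = {"1x": (66.176, 59.425), "10x": (630.390, 584.991)}

# Percent less wall time on the L40S relative to the A40
reductions = {label: (a40 - l40s) / a40 * 100 for label, (a40, l40s) in runs.items()}

for label, r in reductions.items():
    print(f"{label}: {r:.1f}% faster")
```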