# L40S vs A40 Benchmarks
*Last active: December 7, 2024*
**Goal**: Run benchmarks of SDXL, SVD, and Llama 2 13B on an L40S test node.

**TL;DR**:
- L40S has the same inference speed as the A40 for SDXL
- L40S has ~10% faster inference than the A40 for Llama 2
- L40S is ~9% faster than the A40 at video rendering

**Process**: Run non-docker/cog Python code at fp16.
- SDXL:
  - https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
  - https://gist.github.com/lucataco/5cace44ef82c616fa7795bb6c202fb14
- Llama2-13b-chat:
  - https://gist.github.com/lucataco/8f30c8cb6beb239fba9adb2237a90959
- SVD:
  - https://github.com/replicate/cog-svd/commit/09bf70cbc2fcb5a52db23cf433f14c625448ac4e
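The back-to-back runs referenced above can be timed with a small harness along these lines. This is a minimal sketch, not code from the linked gists; the workload callable and the `sync` hook (e.g. `torch.cuda.synchronize` when timing GPU work) are assumptions:

```python
import time

def benchmark(fn, runs, warmup=1, sync=None):
    """Time `fn` over `runs` back-to-back calls, returning total seconds.

    `sync` is an optional callable (e.g. torch.cuda.synchronize) invoked
    before each clock read so queued GPU work is included in the timing.
    Warm-up calls are excluded from the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync:
        sync()
    return time.perf_counter() - start

# Stand-in CPU workload in place of a real model call:
elapsed = benchmark(lambda: sum(i * i for i in range(10_000)), runs=10)
print(f"{elapsed:.3f} s total, {elapsed / 10:.4f} s per run")
```

Reporting total seconds (rather than per-run means) matches how the result tables below are presented.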
**Systems**:
3 VMs, each with a different GPU (and VRAM capacity):
- L4 (24 GB)
- A40 (48 GB)
- L40S (45 GB)

All running CUDA 12.2.
Conda setup: | |
conda create -n bench python=3.10 | |
conda activate bench | |
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia | |
pip install -r requirements.txt | |
requirements-sdxl.txt:

```
diffusers==0.19.2
torch==2.0.1
transformers==4.31.0
invisible-watermark==0.2.0
accelerate==0.21.0
pandas==2.0.3
torchvision==0.15.2
numpy==1.25.1
fire==0.5.0
opencv-python>=4.1.0.25
mediapipe==0.10.2
```
requirements-svd.txt:

```
black==23.7.0
chardet==5.1.0
clip @ git+https://github.com/openai/CLIP.git
einops>=0.6.1
fairscale>=0.4.13
fsspec>=2023.6.0
invisible-watermark>=0.2.0
kornia==0.6.9
matplotlib>=3.7.2
natsort>=8.4.0
ninja>=1.11.1
numpy>=1.24.4
omegaconf>=2.3.0
open-clip-torch>=2.20.0
opencv-python==4.6.0.66
pandas>=2.0.3
pillow>=9.5.0
pudb>=2022.1.3
pytorch-lightning==2.0.1
pyyaml>=6.0.1
scipy>=1.10.1
streamlit>=0.73.1
tensorboardx==2.6
timm>=0.9.2
tokenizers==0.12.1
torch>=2.0.1
torchdata==0.6.1
torchmetrics>=1.0.1
torchvision>=0.15.2
tqdm>=4.65.0
transformers==4.19.1
triton==2.0.0
urllib3<1.27,>=1.25.4
wandb>=0.15.6
webdataset>=0.2.33
wheel>=0.41.0
xformers>=0.0.20
git+https://github.com/Stability-AI/generative-models.git
```
requirements-llama.txt:

```
accelerate==0.23.0
bitsandbytes==0.41.1
protobuf==3.20.3
scipy==1.11.2
sentencepiece==0.1.99
spaces==0.16.1
torch==2.0.0
transformers==4.34.0
```
**Runs**:

**SDXL**:
Single-image tests run back to back.

| Runs | L4 | A40 | L40S |
| --- | --- | --- | --- |
| 1x | 31.935 s | 10.193 s | 9.676 s |
| 10x | 315.453 s | 91.027 s | 91.678 s |
| 100x | 3124.300 s | 907.273 s | 915.423 s |

*Runs are measured in seconds (lower is better).*

L40S is the same speed as the A40 for SDXL txt2img inference.
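The "same speed" claim checks out from the 100x row: the per-image gap between the two cards is under 1%. Plain arithmetic over the table values, not from the original benchmark scripts:

```python
# Total seconds for 100 SDXL images, from the table above
a40_total, l40s_total, n = 907.273, 915.423, 100

a40_per = a40_total / n    # seconds per image on the A40
l40s_per = l40s_total / n  # seconds per image on the L40S
gap = (l40s_total - a40_total) / a40_total * 100  # relative gap in %

print(f"A40: {a40_per:.3f} s/img, L40S: {l40s_per:.3f} s/img, gap: {gap:.1f}%")
```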
**Llama2-13b-chat**:
Single-prompt test with varying max new tokens.

| Max tokens | L4 | A40 | L40S |
| --- | --- | --- | --- |
| 512 | 1.86 t/s | 52.66 t/s | 58.02 t/s |
| 1024 | 1.84 t/s | 53.72 t/s | 59.28 t/s |
| 2048 | N/A | 53.48 t/s | 59.42 t/s |

*Runs are measured in tokens per second (higher is better).*

L40S is ~10.5% faster than the A40 for Llama 2 inference.
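The 10.5% figure is the mean throughput gain across the three max-token settings. Recomputed from the table (plain arithmetic, not from the benchmark gist):

```python
# (A40 t/s, L40S t/s) per max-token setting, from the table above
pairs = {512: (52.66, 58.02), 1024: (53.72, 59.28), 2048: (53.48, 59.42)}

# Throughput gain of the L40S over the A40, in percent
speedups = {k: (l40s / a40 - 1) * 100 for k, (a40, l40s) in pairs.items()}
mean = sum(speedups.values()) / len(speedups)

for k, s in speedups.items():
    print(f"{k} max tokens: +{s:.1f}%")
print(f"mean: +{mean:.1f}%")
```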
**Stable Video Diffusion**:
Single-video tests run back to back.

| Runs | L4 | A40 | L40S |
| --- | --- | --- | --- |
| 1x | 183.330 s | 66.176 s | 59.425 s |
| 10x | 1798.206 s | 630.390 s | 584.991 s |

*Runs are measured in seconds (lower is better).*

L40S is ~9% (7.2% to 10.2% across runs) faster than the A40 for video rendering.
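The per-run range comes from the wall-time reduction on each row. Recomputed from the table (plain arithmetic over the reported totals, not from the benchmark code):

```python
# (A40 s, L40S s) per run count, from the table above
runs = {"1x": (66.176, 59.425), "10x": (630.390, 584.991)}

# Percent less wall time on the L40S relative to the A40
reductions = {label: (a40 - l40s) / a40 * 100 for label, (a40, l40s) in runs.items()}

for label, r in reductions.items():
    print(f"{label}: {r:.1f}% faster")
```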