Birch-san
Birch-san / slerp.py
Last active February 15, 2025 15:12
PyTorch implementation of spherical linear interpolation
from torch import FloatTensor, LongTensor, Tensor, Size, lerp, zeros_like
from torch.linalg import norm
# adapted to PyTorch from:
# https://gist.github.com/dvschultz/3af50c40df002da3b751efab1daddf2c
# most of the extra complexity is to support:
# - many-dimensional vectors
# - v0 or v1 with last dim all zeroes, or v0 ~colinear with v1
#   (falls back to lerp() in those cases)
# - conditional logic implemented with parallelism rather than Python loops
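For reference, a minimal single-vector version of the same idea (my own sketch, reusing the imports above; it omits the batching and all-zeroes handling the full gist provides):

def slerp_single(v0: Tensor, v1: Tensor, t: float, dot_threshold: float = 0.9995) -> Tensor:
    # measure the angle between the vectors via their normalized dot product
    v0_unit = v0 / norm(v0)
    v1_unit = v1 / norm(v1)
    dot = (v0_unit * v1_unit).sum()
    if dot.abs() > dot_threshold:
        # ~colinear: slerp is numerically unstable here, so fall back to lerp
        return lerp(v0, v1, t)
    theta = dot.arccos()
    sin_theta = theta.sin()
    # standard slerp weights: sin((1-t)*theta)/sin(theta) and sin(t*theta)/sin(theta)
    return ((1.0 - t) * theta).sin() / sin_theta * v0 + (t * theta).sin() / sin_theta * v1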
Birch-san / README.md
Last active April 26, 2023 20:05
Using diffusers + pytorch + torchvision with Python 3.11 on Linux with Conda and CUDA 11.8 on Ubuntu 22.10 with Nvidia driver 525
Birch-san / zfs-home-encryption-ubuntu.md
Last active February 20, 2025 00:12
ZFS home encryption Ubuntu 22.10

I started with a basic Ubuntu 22.10 installation, choosing ZFS as the volume manager in the installer.
I wanted to encrypt my home folder.

I followed the article (and comments, including Christoph Hagemann's) from:
https://talldanestale.dk/2020/04/06/zfs-and-homedir-encryption/

To achieve:

  • Home directory (a ZFS rpool mount) is encrypted
  • You are only prompted for the password when logging in as that user
Birch-san / separated_matmul.py
Created January 19, 2023 23:07
Matrix multiplication, by computing mantissae and exponents separately
import numpy as np
# suppose we want to matmul some matrix against some column vector..
# the usual way is this:
mat = np.array([[1.3e2, 8.e2], [1.6e1, 5.e-1]])
vec = np.array([2.1e2, 3.6e-4])
mat @ vec
# array([27300.288 , 3360.00018])
# but what if we wanted to exploit some efficiencies..
# - addition can use fewer cycles / less energy / silicon / time than multiplication
# - for machine learning training: we want to represent a wide range of exponents, but don't need such range on the mantissa
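The preview stops there; as a hedged illustration of the mantissa/exponent split (my own sketch, not the gist's actual implementation), numpy's frexp/ldexp make the idea concrete: multiplies become mantissa products plus integer exponent additions.

import numpy as np

def separated_matmul(mat: np.ndarray, vec: np.ndarray) -> np.ndarray:
    m_mat, e_mat = np.frexp(mat)  # mat == m_mat * 2**e_mat, with m_mat in [0.5, 1)
    m_vec, e_vec = np.frexp(vec)
    prod_m = m_mat * m_vec        # mantissa products stay in a narrow range
    prod_e = e_mat + e_vec        # exponents need only integer addition
    # recombine each elementwise product, then reduce along the contraction axis
    return np.ldexp(prod_m, prod_e).sum(axis=-1)

mat = np.array([[1.3e2, 8.e2], [1.6e1, 5.e-1]])
vec = np.array([2.1e2, 3.6e-4])
assert np.allclose(separated_matmul(mat, vec), mat @ vec)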
Birch-san / gist:4a85dcfba2923547cd52527b89fe1203
Created November 19, 2022 19:10
Asking justinpinkney how stable-diffusion image variations works (i.e. how to finetune SD to condition on CLIP image embeddings)
Screenshots in comments below
https://canary.discord.com/channels/930499730843250783/950914873800396841/1026450454068084798
Birch-san / dynthresh.py
Last active November 6, 2022 18:39
Dynamic thresholding of stable-diffusion latents, by referring to known-good CFG7.5's dynamic range
from torch import Tensor, FloatTensor
from typing import Protocol, Optional
from k_diffusion.external import CompVisDenoiser
from k_diffusion.sampling import sample_heun

class DiffusionModel(Protocol):
    def __call__(self, x: Tensor, sigma: Tensor, **kwargs) -> Tensor: ...

class DiffusionModelMixin(DiffusionModel):
    inner_model: DiffusionModel
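The preview ends before the thresholding itself; here's a hedged sketch of what the title describes (my illustration — `ref_latents`, a known-good CFG7.5 prediction of the same shape, is an assumed name, not the gist's):

def dynamic_threshold(latents: FloatTensor, ref_latents: FloatTensor) -> FloatTensor:
    # per-channel dynamic range of the reference CFG7.5 prediction: [batch, channels]
    ref_absmax = ref_latents.flatten(start_dim=-2).abs().max(dim=-1).values
    ref_absmax = ref_absmax.unsqueeze(-1).unsqueeze(-1)  # broadcast over height, width
    # clamp the over-saturated high-CFG latents into the reference's range
    return latents.clamp(-ref_absmax, ref_absmax)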
Birch-san / benchmark2.py
Created November 5, 2022 01:02
benchmark: batched matmul
import torch
from torch import einsum, matmul, bmm
import time

repeats = 10
batch_duration = 0
for ix in range(repeats):
    attn = torch.rand(16, 4096, 4096, dtype=torch.float, device="mps")
    v = torch.rand(16, 4096, 40, dtype=torch.float, device="mps")
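    # (the preview ends mid-loop; a plausible, hedged continuation timing bmm per iteration.
    #  torch.mps.synchronize() waits for the async Metal kernel before stopping the clock)
    start = time.perf_counter()
    out = bmm(attn, v)
    torch.mps.synchronize()
    batch_duration += time.perf_counter() - start
print('bmm avg per iteration: %.4fs' % (batch_duration / repeats))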
Birch-san / benchmark.py
Last active November 22, 2022 21:10
benchmark: batched matmul with scale factor
import torch
from torch import einsum, tensor, matmul, bmm, baddbmm, empty
import time

scale = 2
repeats = 10
# both einsums use the same plan, so whichever batch runs first has to pay the price of warmup
# uncomment this to run a warmup before either batch runs, for fairer comparison of batch avg time
# q = torch.rand(16, 4096, 40, dtype=torch.float, device="mps")
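# (the preview ends here; a hedged sketch of the comparison the title suggests:
#  folding the scale factor into baddbmm's alpha, vs a separate multiply after einsum)
q = torch.rand(16, 4096, 40, dtype=torch.float, device="mps")
k = torch.rand(16, 4096, 40, dtype=torch.float, device="mps")
attn_einsum = einsum('b i d, b j d -> b i j', q, k) * scale
bias = empty(1, 1, 1, dtype=torch.float, device="mps")  # beta=0 means its values are ignored
attn_baddbmm = baddbmm(bias, q, k.transpose(1, 2), beta=0, alpha=scale)
assert torch.allclose(attn_einsum, attn_baddbmm, rtol=1e-3)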
Birch-san / histogram.py
Created October 17, 2022 00:12
plotting da histograms (partial snippet from Jupyter notebook)
import torch
from torch import Tensor, FloatTensor
import matplotlib.pyplot as plt
# …
# (snippet is from inside a denoiser wrapper's method: `self` provides inner_model and scale_factor)
latents: FloatTensor = self.inner_model(x, sigma, cond=cond, **kwargs)
unscaled: Tensor = latents / self.scale_factor
# histogram each latent channel separately, plus all channels pooled together
chs = [torch.histogram(c) for c in unscaled[0].flatten(1)]
h = torch.histogram(unscaled[0].ravel())
plt.figure(figsize=(10, 2))
plt.title('Per-channel latent values after denoising sigma %.3f at CFG scale %d' % (sigma.item(), cfg_scale))
Birch-san / gist:f4ae5c843f2f31319cabd9b40c94e4be
Created October 10, 2022 18:20
denoised latents returned by k-diffusion callback at each sampler step (8 steps, Heun, sigma_min=0.092)
CFG20
sigma: 14.615 absmax: 6.942 std: 2.517 min: -6.840 max: 6.942 shape: [1, 4, 64, 64]
sigma: 8.560 absmax: 16.164 std: 3.112 min: -16.164 max: 12.644 shape: [1, 4, 64, 64]
sigma: 8.560 absmax: 28.488 std: 4.065 min: -19.071 max: 28.488 shape: [1, 4, 64, 64]
sigma: 4.797 absmax: 12.463 std: 1.584 min: -10.655 max: 12.463 shape: [1, 4, 64, 64]
sigma: 4.797 absmax: 24.024 std: 1.559 min: -24.024 max: 11.797 shape: [1, 4, 64, 64]
sigma: 2.551 absmax: 5.675 std: 1.151 min: -5.675 max: 4.937 shape: [1, 4, 64, 64]
sigma: 2.551 absmax: 10.035 std: 1.201 min: -10.035 max: 5.773 shape: [1, 4, 64, 64]
sigma: 1.274 absmax: 4.725 std: 1.095 min: -3.790 max: 4.725 shape: [1, 4, 64, 64]
sigma: 1.274 absmax: 4.997 std: 1.104 min: -4.997 max: 4.111 shape: [1, 4, 64, 64]