Let's say we're trying to load a LLaMA model via AutoModelForCausalLM.from_pretrained with 4-bit quantization, in order to run inference with it:
# generate.py (run it with: python generate.py)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, LlamaTokenizerFast, LlamaForCausalLM
import transformers
# model_name_or_path='/path/to/local/llama-hf'
model_name_or_path='huggyllama/llama-7b'
model: LlamaForCausalLM = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    cache_dir=None,
    load_in_4bit=True,
    load_in_8bit=False,
    # device_map='auto' works too, but I have found that on lower-VRAM GPUs it can unnecessarily move some layers to CPU
    device_map={'': 0},
    # max_memory expects a mapping from device to memory budget, not a bare string
    max_memory={0: '24000MB'},
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    ),
    torch_dtype=torch.bfloat16,
    trust_remote_code=False,
    use_safetensors=True,
)
tokenizer: LlamaTokenizerFast = AutoTokenizer.from_pretrained(model_name_or_path)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "The talisman I bought from that merchant seems to be",
    max_new_tokens=100,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    temperature=1.,
)
seq, *_ = sequences
print(f"Result:\n{seq['generated_text']}")
Which model can we use? Will we have to create our own?
decapoda-research/llama-7b-hf doesn't seem to work
If you searched Huggingface for a LLaMA model, you may have found the decapoda-research/llama-7b-hf distribution, but there are a few problems with it (you can check these fields yourself; see the sketch after this list):
- tokenizer_config.json specifies "tokenizer_class": "LLaMATokenizer" rather than "tokenizer_class": "LlamaTokenizer"
- tokenizer_config.json specifies bos_token, eos_token and unk_token as empty strings, which I expect is incorrect
- config.json doesn't specify a pad_token_id
- no safetensors distribution
- dubious provenance (redistribution of LLaMA weights is probably disallowed)
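If you want to verify these yourself, here's a minimal sketch (assuming the huggingface_hub package is installed, which it will be if you've installed transformers) that downloads just those two small JSON files and prints the relevant fields:
import json
from huggingface_hub import hf_hub_download

repo_id = 'decapoda-research/llama-7b-hf'
for filename in ('tokenizer_config.json', 'config.json'):
    # fetch (or reuse from the local cache) just this one small JSON file
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(path) as f:
        cfg = json.load(f)
    fields = ('tokenizer_class', 'bos_token', 'eos_token', 'unk_token', 'pad_token_id')
    print(filename, {k: cfg[k] for k in fields if k in cfg})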
huggyllama/llama-7b seems to work
The huggyllama/llama-7b distribution solves all of these issues except the "dubious provenance" one.
We can solve this by converting the weights ourselves.
Assuming you are a researcher and applied for the model weights legitimately, or you found that they fell onto your computer somehow: here is how to convert the official LLaMA weights into a Huggingface + safetensors format compatible with AutoModelForCausalLM.from_pretrained(). This should give you a result similar to the huggyllama/llama-7b distribution.
I'll assume you have a folder of LLaMA weights (grouped into directories by model size) in ~/ml-weights/llama that looks like this:
.
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
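Optional: 7B/params.json is just a small JSON description of the architecture. If you're curious what's inside, a quick sketch to print it:
import json
from pathlib import Path

# print the architecture description shipped with the 7B weights
print(json.loads((Path.home() / 'ml-weights/llama/7B/params.json').read_text()))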
We'll want to copy some files from Huggingface's Llama tokenizer distribution into our Llama root directory:
# leave the llama directory. we're gonna clone a git repository elsewhere
cd ..
git clone https://huggingface.co/hf-internal-testing/llama-tokenizer
cd llama-tokenizer
cp special_tokens_map.json tokenizer_config.json tokenizer.json tokenizer.model ~/ml-weights/llama
Your Llama root directory should now look like this:
.
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
If you want to check whether you have an identical starting point to mine, the checksums from running md5sum * **/* in this directory are:
6b2e0a735969660e720c27061ef3f3d3 special_tokens_map.json
db83911b2ef8ac676e45c9edbca106ab tokenizer_config.json
36b57b689cc01b527b943d3dc8f43cef tokenizer.json
eeec4125e9c7560836b4873b6f8e3025 tokenizer.model
af0068274b59ff5bccb4c874c01e886f 7B/checklist.chk
6efc8dab194ab59e49cd24be5574d85e 7B/consolidated.00.pth
7596560e011154b90eb51a1b15739763 7B/params.json
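If you're on a platform without md5sum (macOS ships a md5 command instead), here is a rough Python equivalent that hashes each file in chunks, so the multi-gigabyte .pth file isn't read into memory all at once:
import hashlib
from pathlib import Path

def md5(path: Path, chunk_size: int = 1 << 20) -> str:
    # hash in 1 MiB chunks to keep memory usage low
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

root = Path.home() / 'ml-weights/llama'
for p in sorted(root.rglob('*')):
    if p.is_file():
        print(md5(p), p.relative_to(root))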
We're going to install some Python libraries now. You might want to create a clean Conda environment (or virtualenv, if you prefer and know how) for this:
conda create -n p311 python=3.11
Activate the new conda environment:
conda activate p311
Install transformers, and packages which its conversion script needs:
pip install transformers sentencepiece protobuf==3.20.3 safetensors torch accelerate
Let's visit the directory into which we installed transformers (because the conversion script is in there):
cd ~/anaconda3/envs/p311/lib/python3.11/site-packages/transformers
We'll be outputting our converted weights to a new folder, llama-hf; let's make it now:
mkdir -p ~/ml-weights/llama-hf
Let's make a quick change to the conversion script, to ensure that it outputs .safetensors weights.
Open the file ~/anaconda3/envs/p311/lib/python3.11/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py in your favourite text editor.
Change the model.save_pretrained() call to use safe serialization:
- model.save_pretrained(model_path)
+ model.save_pretrained(model_path, safe_serialization=True)
Save your modified convert_llama_weights_to_hf.py.
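Alternatively, if you'd rather not edit an installed package at all, you can run the unmodified script in the next step and then re-save the result with safe serialization afterwards. A sketch (note: this loads the whole model into CPU RAM, and you'd want to delete the leftover pytorch_model-*.bin shards afterwards):
from os import environ
import torch
from transformers import LlamaForCausalLM

model_dir = f'{environ["HOME"]}/ml-weights/llama-hf'
# load the .bin checkpoint produced by the unmodified script, then re-save it as safetensors
model = LlamaForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)
model.save_pretrained(model_dir, safe_serialization=True)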
Finally, let's run the conversion script:
python models/llama/convert_llama_weights_to_hf.py --input_dir ~/ml-weights/llama --model_size 7B --output_dir ~/ml-weights/llama-hf
This should have output the following files into ~/ml-weights/llama-hf:
.
├── config.json
├── generation_config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
If you want to check whether you have an identical result to mine, the checksums from running md5sum * in this directory are:
4545ca1c3123db5f40649d838315d1a5 config.json
185162afdfbe7b61b786b1556233efcb generation_config.json
a4669a7c3dba870c3ce07d02fdf5e51f model-00001-of-00002.safetensors
6031c2c2ca7c072abd98bd64c10f16a5 model-00002-of-00002.safetensors
406284c6a66bc16d2a81cafb7d4dfdb5 model.safetensors.index.json
6b2e0a735969660e720c27061ef3f3d3 special_tokens_map.json
edd1a5897748864768b1fab645b31491 tokenizer_config.json
8bc8ad3b8256b780d0917e72d257c176 tokenizer.json
eeec4125e9c7560836b4873b6f8e3025 tokenizer.model
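Before loading the full 4-bit model, a lightweight sanity check is to load just the config and tokenizer from the converted folder; a sketch:
from os import environ
from transformers import AutoConfig, AutoTokenizer

model_dir = f'{environ["HOME"]}/ml-weights/llama-hf'
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# expect 'llama', 32 hidden layers for 7B, and a LlamaTokenizerFast
print(config.model_type, config.num_hidden_layers, type(tokenizer).__name__)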
You should now be able to run the script from the top of this gist against your locally-converted model, with:
from os import environ
model_name_or_path=f'{environ["HOME"]}/ml-weights/llama-hf'
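If you prefer to skip the pipeline wrapper, here's a roughly equivalent sketch that calls model.generate directly (reusing the model and tokenizer loaded at the top of the script):
prompt = "The talisman I bought from that merchant seems to be"
# tokenize and move the input ids onto the same device as the model
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_k=10,
    temperature=1.,
)
print(f"Result:\n{tokenizer.decode(output_ids[0], skip_special_tokens=True)}")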