Let's say we're trying to load a LLaMA model via AutoModelForCausalLM.from_pretrained with 4-bit quantization, in order to run inference with it:
# generate.py (run it with: python generate.py)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, LlamaTokenizerFast, LlamaForCausalLM
import transformers
# model_name_or_path='/path/to/local/llama-hf'
model_name_or_path='huggyllama/llama-7b'
model: LlamaForCausalLM = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    cache_dir=None,
    load_in_4bit=True,
    load_in_8bit=False,
    # device_map='auto' works too, but I have found that on lower-VRAM GPUs it can unnecessarily move some layers to CPU
    device_map={'': 0},
    # max_memory expects a mapping from device to memory budget, not a bare string
    max_memory={0: '24000MB'},
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    ),
    torch_dtype=torch.bfloat16,
    trust_remote_code=False,
    use_safetensors=True,
)
tokenizer: LlamaTokenizerFast = AutoTokenizer.from_pretrained(model_name_or_path)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "The talisman I bought from that merchant seems to be",
    max_new_tokens=100,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    temperature=1.,
)
seq, *_ = sequences
print(f"Result:\n{seq['generated_text']}")
Which model can we use? Will we have to create our own?
decapoda-research/llama-7b-hf doesn't seem to work
If you searched Huggingface for a LLaMA model, you may have found the decapoda-research/llama-7b-hf distribution, but there are a few problems with it (you can check these fields yourself; see the sketch after this list):
- tokenizer_config.json specifies "tokenizer_class": "LLaMATokenizer" rather than "tokenizer_class": "LlamaTokenizer"
- tokenizer_config.json specifies bos_token, eos_token and unk_token as empty strings, which I expect is incorrect
- config.json doesn't specify a pad_token_id
- no safetensors distribution
- dubious provenance (redistribution of LLaMA weights is probably disallowed)
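If you want to verify these yourself, here's a minimal sketch (assuming the huggingface_hub package is installed, which it will be if you've installed transformers) that downloads just those two small JSON files and prints the relevant fields:
import json
from huggingface_hub import hf_hub_download

repo_id = 'decapoda-research/llama-7b-hf'
for filename in ('tokenizer_config.json', 'config.json'):
    # fetch (or reuse from the local cache) just this one small JSON file
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(path) as f:
        cfg = json.load(f)
    fields = ('tokenizer_class', 'bos_token', 'eos_token', 'unk_token', 'pad_token_id')
    print(filename, {k: cfg[k] for k in fields if k in cfg})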
huggyllama/llama-7b seems to work
The huggyllama/llama-7b distribution solves all of these issues except the "dubious provenance" one.
We can solve this by converting the weights ourselves.
Assuming you are a researcher and applied for the model weights legitimately, or you found that they fell onto your computer somehow: here is how to convert the official LLaMA weights into a Huggingface + safetensors format compatible with AutoModelForCausalLM.from_pretrained(). This should give you a result similar to the huggyllama/llama-7b distribution.
I'll assume you have a folder of LLaMA weights (grouped into directories by model size) in ~/ml-weights/llama that looks like this:
.
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
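Optional: 7B/params.json is just a small JSON description of the architecture. If you're curious what's inside, a quick sketch to print it:
import json
from pathlib import Path

# print the architecture description shipped with the 7B weights
print(json.loads((Path.home() / 'ml-weights/llama/7B/params.json').read_text()))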
We'll want to copy some files from Huggingface's Llama tokenizer distribution into our Llama root directory:
# leave the llama directory. we're gonna clone a git repository elsewhere
cd ..
git clone https://huggingface.co/hf-internal-testing/llama-tokenizer
cd llama-tokenizer
cp special_tokens_map.json tokenizer_config.json tokenizer.json tokenizer.model ~/ml-weights/llama
Your Llama root directory should now look like this:
.
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
If you want to check whether you have an identical starting point to mine, the checksums from running md5sum * **/* in this directory are:
6b2e0a735969660e720c27061ef3f3d3 special_tokens_map.json
db83911b2ef8ac676e45c9edbca106ab tokenizer_config.json
36b57b689cc01b527b943d3dc8f43cef tokenizer.json
eeec4125e9c7560836b4873b6f8e3025 tokenizer.model
af0068274b59ff5bccb4c874c01e886f 7B/checklist.chk
6efc8dab194ab59e49cd24be5574d85e 7B/consolidated.00.pth
7596560e011154b90eb51a1b15739763 7B/params.json
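If you're on a platform without md5sum (macOS ships a md5 command instead), here is a rough Python equivalent that hashes each file in chunks, so the multi-gigabyte .pth file isn't read into memory all at once:
import hashlib
from pathlib import Path

def md5(path: Path, chunk_size: int = 1 << 20) -> str:
    # hash in 1 MiB chunks to keep memory usage low
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

root = Path.home() / 'ml-weights/llama'
for p in sorted(root.rglob('*')):
    if p.is_file():
        print(md5(p), p.relative_to(root))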
We're going to install some Python libraries now. You might want to create a clean Conda environment (or virtualenv, if you prefer and know how) for this:
conda create -n p311 python=3.11
Activate the new conda environment:
conda activate p311
Install transformers, and packages which its conversion script needs:
pip install transformers sentencepiece protobuf==3.20.3 safetensors torch accelerate
Let's visit the directory into which we installed transformers (because the conversion script is in there):
cd ~/anaconda3/envs/p311/lib/python3.11/site-packages/transformers
We'll be outputting our converted weights to a new folder, llama-hf; let's make it now:
mkdir -p ~/ml-weights/llama-hf
Let's make a quick change to the conversion script, to ensure that it outputs .safetensors weights.
Open the file ~/anaconda3/envs/p311/lib/python3.11/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py in your favourite text editor.
Change the model.save_pretrained() call to use safe serialization:
- model.save_pretrained(model_path)
+ model.save_pretrained(model_path, safe_serialization=True)
Save your modified convert_llama_weights_to_hf.py.
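Alternatively, if you'd rather not edit an installed package at all, you can run the unmodified script in the next step and then re-save the result with safe serialization afterwards. A sketch (note: this loads the whole model into CPU RAM, and you'd want to delete the leftover pytorch_model-*.bin shards afterwards):
from os import environ
import torch
from transformers import LlamaForCausalLM

model_dir = f'{environ["HOME"]}/ml-weights/llama-hf'
# load the .bin checkpoint produced by the unmodified script, then re-save it as safetensors
model = LlamaForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)
model.save_pretrained(model_dir, safe_serialization=True)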
Finally, let's run the conversion script:
python models/llama/convert_llama_weights_to_hf.py --input_dir ~/ml-weights/llama --model_size 7B --output_dir ~/ml-weights/llama-hf
This should have output the following files into ~/ml-weights/llama-hf:
.
├── config.json
├── generation_config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
If you want to check whether you have an identical result to mine, the checksums from running md5sum * in this directory are:
4545ca1c3123db5f40649d838315d1a5 config.json
185162afdfbe7b61b786b1556233efcb generation_config.json
a4669a7c3dba870c3ce07d02fdf5e51f model-00001-of-00002.safetensors
6031c2c2ca7c072abd98bd64c10f16a5 model-00002-of-00002.safetensors
406284c6a66bc16d2a81cafb7d4dfdb5 model.safetensors.index.json
6b2e0a735969660e720c27061ef3f3d3 special_tokens_map.json
edd1a5897748864768b1fab645b31491 tokenizer_config.json
8bc8ad3b8256b780d0917e72d257c176 tokenizer.json
eeec4125e9c7560836b4873b6f8e3025 tokenizer.model
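Before loading the full 4-bit model, a lightweight sanity check is to load just the config and tokenizer from the converted folder; a sketch:
from os import environ
from transformers import AutoConfig, AutoTokenizer

model_dir = f'{environ["HOME"]}/ml-weights/llama-hf'
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# expect 'llama', 32 hidden layers for 7B, and a LlamaTokenizerFast
print(config.model_type, config.num_hidden_layers, type(tokenizer).__name__)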
You should now be able to run the script from the top of this gist against your locally-converted model, with:
from os import environ
model_name_or_path=f'{environ["HOME"]}/ml-weights/llama-hf'
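If you prefer to skip the pipeline wrapper, here's a roughly equivalent sketch that calls model.generate directly (reusing the model and tokenizer loaded at the top of the script):
prompt = "The talisman I bought from that merchant seems to be"
# tokenize and move the input ids onto the same device as the model
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_k=10,
    temperature=1.,
)
print(f"Result:\n{tokenizer.decode(output_ids[0], skip_special_tokens=True)}")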