The LLaMA model weights may be converted from Hugging Face PyTorch format back to GGML in two steps:
- download the model from `decapoda-research/llama-7b-hf` and save the state dict as a PyTorch `.pth` file
- use the `convert-pth-to-ggml.py` script from `ggerganov/llama.cpp` to convert the `.pth` file to GGML

This process produces a GGML model with float16 (`fp16`) precision.
You need the LLaMA tokenizer configuration and the model configuration files. There currently isn't a good conversion from Hugging Face back to the original PyTorch format (the tokenizer files are the same, but the model's `checklist.chk` and `params.json` are missing). The best way to obtain them is to:
- install `pyllama` and `transformers`:

```
pip install -U pyllama transformers
```
- you will also need to install the requirements from `ggerganov/llama.cpp`:

```
llama.cpp $ pip install -r requirements.txt
```
- download the 7B configuration (let the `consolidated.00.pth` model-weights download fail):

```
python -m llama.download --model_size=7B --folder=llama
```
This will download a directory structure like the following (a quick check of these files is sketched after the listing):

```
llama/
    config.json
    ggml-vocab.bin
    tokenizer.model
    tokenizer_checklist.chk
    tokenizer_config.json
    7B/
        checklist.chk
        params.json
```
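Before continuing, it may be worth confirming that the pieces the later steps rely on actually landed on disk. This is only a convenience sketch; the file names are taken from the listing above:

```python
from pathlib import Path

# Files the remaining steps expect under llama/ (per the listing above).
root = Path("llama")
expected = [
    "tokenizer.model",
    "tokenizer_checklist.chk",
    "tokenizer_config.json",
    "7B/checklist.chk",
    "7B/params.json",
]
for rel in expected:
    print(rel, "->", "ok" if (root / rel).exists() else "MISSING")
```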
Your remaining task is to convert the Hugging Face PyTorch pickle file to a PyTorch state dict, and then convert that to GGML.
- load the Hugging Face model and save the state dict as a PyTorch `.pth` file (in EMP, ensure you have the SSO Proxy on); a quick check of the saved file is sketched after this snippet:

```python
from transformers import AutoModelForCausalLM
import torch

# Download/load the Hugging Face LLaMA checkpoint and save its weights
# where the original consolidated checkpoint would live.
model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
torch.save(model.state_dict(), "llama/7B/consolidated.00.pth")
```
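Optionally, check that the saved checkpoint loads cleanly. This is only a sketch; the printed key names are whatever the Hugging Face export uses, not the original Meta names:

```python
import torch

# Load the checkpoint we just wrote onto the CPU and inspect it briefly.
state_dict = torch.load("llama/7B/consolidated.00.pth", map_location="cpu")
print(len(state_dict), "tensors")
print("first key:", next(iter(state_dict)))
```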
- consolidate the 7B files into a single directory (a minimal copy script is sketched after this layout), so you have:

```
llama_7b/
    config.json
    ggml-vocab.bin
    tokenizer.model
    tokenizer_checklist.chk
    tokenizer_config.json
    checklist.chk
    consolidated.00.pth
    params.json
```
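One way to do the consolidation is a small copy script. This is a hypothetical helper, assuming the `llama/` layout produced by `python -m llama.download` above and the `llama_7b/` target shown in the listing:

```python
import shutil
from pathlib import Path

src = Path("llama")     # output of python -m llama.download
dst = Path("llama_7b")  # consolidated directory for the conversion step
dst.mkdir(exist_ok=True)

# Tokenizer, vocab and config files sit at the top level of llama/ ...
for name in ["config.json", "ggml-vocab.bin", "tokenizer.model",
             "tokenizer_checklist.chk", "tokenizer_config.json"]:
    shutil.copy(src / name, dst / name)

# ... while the model-specific files (including the state dict saved above)
# sit under llama/7B/.
for name in ["checklist.chk", "params.json", "consolidated.00.pth"]:
    shutil.copy(src / "7B" / name, dst / name)
```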
- convert the `consolidated.00.pth` file to `ggml-model-fp16.bin` using the `convert-transformers-to-ggml.py` script from llama.cpp:

```
python convert-transformers-to-ggml.py llama_7b 1
```
When you are done, you will have a file you can use with llama.cpp, but you have to move it back into the `llama/7B/` directory:
```
llama/
    config.json
    ggml-vocab.bin
    tokenizer.model
    tokenizer_checklist.chk
    tokenizer_config.json
    7B/
        checklist.chk
        params.json
        ggml-model-fp16.bin   # <-- added here
```
Now you can use this with llama.cpp (after building llama.cpp). Adjust the `-m` path to wherever you placed `ggml-model-fp16.bin`:

```
./main -m ./models/7B/ggml-model-fp16.bin -n 128
```