- Request access to one of the Llama 2 model repositories from Meta's HuggingFace organization, for example `Llama-2-13b-chat-hf`.
- Generate a HuggingFace read-only access token from your user profile settings page. (A programmatic alternative using `huggingface_hub` is sketched after this section.)
- Set up a Python 3.10 environment with the following dependencies installed: `transformers`, `huggingface_hub`.
- Run the following code to download and load the model in HuggingFace `transformers`:
```python
TOKEN = ""  # copy-paste your HuggingFace access token here

### Option 1

# Replace 'XXX' with the model variant of your choosing
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-XXX",
    use_auth_token=TOKEN,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-XXX",
    device_map="auto",
    # Double quantization
    # quantization_config=BitsAndBytesConfig(
    #     load_in_4bit=True,
    #     bnb_4bit_use_double_quant=True,
    # ),
    use_auth_token=TOKEN,
)

# Generate
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")
generated_ids = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.01,
    max_length=250,
)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

### Option 2

# Replace 'XXX' with the model variant of your choosing
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-XXX",
    device_map="auto",
    # model_kwargs={"load_in_8bit": True},  # quantize to 8-bit
    # model_kwargs={"load_in_4bit": True},  # quantize to 4-bit
    use_auth_token=TOKEN,
)

# Generate
pipe(
    "Hello world",
    do_sample=True,
    top_p=0.95,
    temperature=0.01,
    max_length=250,
)
```
Note: This will store a cached version of the model in `~/.cache/huggingface/hub/`.
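As an alternative to passing `use_auth_token` on every call, you can authenticate once with the `huggingface_hub` package and let `transformers` pick up the stored credentials automatically. A minimal sketch, assuming the read-only token generated above:

```python
# Store the HuggingFace credentials locally (one-time step); afterwards the
# `use_auth_token=TOKEN` arguments above can be omitted.
from huggingface_hub import login

login(token="<your-read-only-access-token>")  # placeholder; paste your own token
```

The same can be done from the command line with `huggingface-cli login`.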
- Request access from Meta here. This is needed to download the model weights.
- Clone `llama` and download the model(s) with `sh download.sh`, using the access URL obtained from (1).
  - Note: The access URL expires in 24hrs. Read the announcement blogpost for more information.
- Move the downloaded model files to a subfolder named with the corresponding parameter count (e.g. `llama-2-7b-chat/7B/` if you downloaded `llama-2-7b-chat`).
  - Note: This is the expected format for the HuggingFace conversion script.
- Download the relevant `tokenizer.model` from Meta's HuggingFace organization; see here for the `llama-2-7b-chat` reference.
- Set up a Python 3.10 environment with the following dependencies installed: `torch`, `tokenizers`, `transformers`, `accelerate`, `xformers`, `bitsandbytes`, `scipy`.
  - Note: If running on MacOS, you may need to upgrade to MacOS 13.3 and install a nightly release of `torch`; see here for reference.
- Convert the Meta model checkpoints to HuggingFace format using the helper script `convert_llama_weights_to_hf.py`, which can be found here. Run the following: `python convert_llama_weights_to_hf.py --input_dir path/to/llama/model --model_size <model_size> --output_dir path/to/output`.
- If necessary, perform quantization of the model using `bitsandbytes`, see here; a quantized-loading sketch follows this list.
- Run inference (i.e. text generation) of the model using the `transformers.pipeline`, as outlined here.
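To make the last two steps concrete, here is a minimal sketch that loads the converted checkpoint with on-the-fly 4-bit `bitsandbytes` quantization and generates text through `transformers.pipeline`. It assumes the conversion step wrote the HuggingFace-format model to `path/to/output` and that a CUDA GPU with `bitsandbytes` installed is available:

```python
# Load the locally converted checkpoint with 4-bit quantization (bitsandbytes)
# and run text generation via the transformers pipeline.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

model_dir = "path/to/output"  # directory produced by convert_llama_weights_to_hf.py

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # double quantization, as in the commented-out option above
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    quantization_config=quant_config,
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Hello world", do_sample=True, top_p=0.95, temperature=0.01, max_length=250))
```

Because the model is loaded from a local directory rather than the Hub, no access token is needed here.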
- Clone the `llama2.c` repository.
- Install the package dependencies with `pip install -r requirements.txt`.
- Export the model weights into the `llama2.c` format using the helper script: `python export.py llama2.bin --meta-llama path/to/llama/model`.
- Compile the `run` binary with `make run`, then run inference (i.e. text generation) of the model with `./run llama2.bin <temperature> <generation_tokens> <input_prompt>`.
- `FastChat`: Open platform for training, serving, and evaluating large language model based chatbots.
- `text-generation-inference`: A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power Hugging Chat, the Inference API and Inference Endpoint.
- `llama-recipes`: Meta's recipes and tools for using llama-2.