Download and run llama-2 locally

Option 1 (easy): HuggingFace Hub Download

  1. Request access to one of the Llama 2 model repositories from Meta's HuggingFace organization, for example Llama-2-13b-chat-hf.
  2. Generate a HuggingFace read-only access token from your user profile settings page.
  3. Set up a Python 3.10 environment with the following dependencies installed: transformers, huggingface_hub.
  4. Run the following code to download and load the model in HuggingFace transformers:
TOKEN = ""  # copy-paste your HuggingFace access token here

### Option 1
# Replace 'XXX' with the model variant of your choosing
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-XXX",
    use_auth_token=TOKEN
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-XXX",
    device_map="auto",
    # Optional: load in 4-bit with double quantization
    # quantization_config=BitsAndBytesConfig(
    #     load_in_4bit=True,
    #     bnb_4bit_use_double_quant=True,
    # ),
    use_auth_token=TOKEN
)

# Generate (assumes a CUDA GPU is available)
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")
generated_ids = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.01,
    max_length=250,
)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

### Option 2
# Replace 'XXX' with the model variant of your choosing
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-XXX",
    device_map="auto",
    # model_kwargs={"load_in_8bit": True}, # quantize to 8-bit
    # model_kwargs={"load_in_4bit": True}, # quantize to 4-bit
    use_auth_token=TOKEN
)

# Generate
pipe(
    "Hello world",
    do_sample=True,
    top_p=0.95,
    temperature=0.01,
    max_length=250,
)

Note: This will store a cached version of the model in ~/.cache/huggingface/hub/.
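To see what is currently cached, huggingface_hub ships a cache-scanning helper. The short sketch below is illustrative and assumes a reasonably recent huggingface_hub release:

# Inspect the local HuggingFace cache (assumes huggingface_hub >= 0.8)
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
print(f"Total cache size: {cache_info.size_on_disk / 1e9:.2f} GB")
for repo in cache_info.repos:
    print(repo.repo_id, f"{repo.size_on_disk / 1e9:.2f} GB")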

Option 2 (hard): Meta Direct Download

  1. Request access from Meta here. This is needed to download the model weights.
  2. Clone the llama repository and download the model(s) with sh download.sh, using the access code obtained from (1).
    • Note: The access URL expires in 24 hours.

Run with HuggingFace transformers

Read the announcement blogpost for more information.

  1. Move the downloaded model files to a subfolder named with the corresponding parameter count (e.g. llama-2-7b-chat/7B/ if you downloaded llama-2-7b-chat).
    • Note: This is the expected format for the HuggingFace conversion script.
  2. Download the relevant tokenizer.model from Meta's HuggingFace organization, see here for the llama-2-7b-chat reference.
  3. Set up a Python 3.10 environment with the following dependencies installed: torch, tokenizers, transformers, accelerate, xformers, bitsandbytes, scipy.
    • Note: If running on macOS, you may need to upgrade to macOS 13.3 and install a nightly release of torch, see here for reference.
  4. Convert the Meta model checkpoints to HuggingFace format using the helper script convert_llama_weights_to_hf.py, which can be found here. Run the following: python convert_llama_weights_to_hf.py --input_dir path/to/llama/model --model_size <model_size> --output_dir path/to/output.
  5. If necessary, perform quantization of the model using bitsandbytes, see here.
  6. Run inference (i.e. text generation) of the model using transformers.pipeline, as outlined here; a minimal sketch follows below.
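Putting steps 4-6 together, the following is a minimal sketch: it assumes the conversion script in step 4 wrote the HuggingFace-format checkpoint to path/to/output (substitute your actual --output_dir), and the commented-out 4-bit quantization is optional.

# Load the converted checkpoint from step 4 and run text generation.
from transformers import pipeline, BitsAndBytesConfig

pipe = pipeline(
    "text-generation",
    model="path/to/output",  # the --output_dir passed to the conversion script
    device_map="auto",
    # Optional 4-bit quantization via bitsandbytes (step 5):
    # model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)
pipe(
    "Hello world",
    do_sample=True,
    top_p=0.95,
    temperature=0.01,
    max_length=250,
)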

Run with Karpathy's llama2.c

  1. Clone the llama2.c repository.
  2. Install the package dependencies with pip install -r requirements.txt.
  3. Export the model weights into the llama2.c format using the helper script: python export.py llama2.bin --meta-llama path/to/llama/model
  4. Run inference (i.e. text generation) of the model with ./run llama2.bin <temperature> <generation_tokens> <input_prompt>.

Model Serving Options

  1. FastChat: Open platform for training, serving, and evaluating large language model based chatbots.
  2. text-generation-inference: A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power Hugging Chat, the Inference API and Inference Endpoints.
  3. llama-recipes: Meta's recipes and tools for using llama-2.

Useful Links

  1. https://www.philschmid.de/llama-2