Download and run llama-2 locally

Option 1 (easy): HuggingFace Hub Download

  1. Request access to one of the Llama 2 model repositories from Meta's HuggingFace organization, for example Llama-2-13b-chat-hf.
  2. Generate a HuggingFace read-only access token from your user profile settings page.
  3. Set up a Python 3.10 environment with the following dependencies installed: transformers, huggingface_hub.
  4. Run the following code to download and load the model in HuggingFace transformers:
TOKEN = ""  # copy-paste your HuggingFace access token here

### Option 1
# Replace 'XXX' with the model variant of your choosing
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-XXX",
    use_auth_token=TOKEN
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-XXX",
    device_map="auto",
    # Optional: load in 4-bit with double quantization
    # quantization_config=BitsAndBytesConfig(
    #     load_in_4bit=True,
    #     bnb_4bit_use_double_quant=True,
    # ),
    use_auth_token=TOKEN
)

# Generate (assumes a CUDA GPU is available)
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")
generated_ids = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.01,
    max_length=250,
)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

### Option 2
# Replace 'XXX' with the model variant of your choosing
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-XXX",
    device_map="auto",
    # model_kwargs={"load_in_8bit": True}, # quantize to 8-bit
    # model_kwargs={"load_in_4bit": True}, # quantize to 4-bit
    use_auth_token=TOKEN
)

# Generate
pipe(
    "Hello world",
    do_sample=True,
    top_p=0.95,
    temperature=0.01,
    max_length=250,
)

Note: This will store a cached version of the model in ~/.cache/huggingface/hub/.
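To see what is currently cached, huggingface_hub ships a cache-scanning helper. The short sketch below is illustrative and assumes a reasonably recent huggingface_hub release:

# Inspect the local HuggingFace cache (assumes huggingface_hub >= 0.8)
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
print(f"Total cache size: {cache_info.size_on_disk / 1e9:.2f} GB")
for repo in cache_info.repos:
    print(repo.repo_id, f"{repo.size_on_disk / 1e9:.2f} GB")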

Option 2 (hard): Meta Direct Download

  1. Request access from Meta here. This is needed to download the model weights.
  2. Clone the llama repository and download the model(s) with sh download.sh, using the access code obtained from (1).
    • Note: The access URL expires in 24 hours.

Run with HuggingFace transformers

Read the announcement blogpost for more information.

  1. Move the downloaded model files to a subfolder named with the corresponding parameter count (e.g. llama-2-7b-chat/7B/ if you downloaded llama-2-7b-chat).
    • Note: This is the expected format for the HuggingFace conversion script.
  2. Download the relevant tokenizer.model from Meta's HuggingFace organization, see here for the llama-2-7b-chat reference.
  3. Set up a Python 3.10 environment with the following dependencies installed: torch, tokenizers, transformers, accelerate, xformers, bitsandbytes, scipy.
    • Note: If running on macOS, you may need to upgrade to macOS 13.3 and install a nightly release of torch, see here for reference.
  4. Convert the Meta model checkpoints to HuggingFace format using the helper script convert_llama_weights_to_hf.py, which can be found here. Run the following: python convert_llama_weights_to_hf.py --input_dir path/to/llama/model --model_size <model_size> --output_dir path/to/output.
  5. If necessary, perform quantization of the model using bitsandbytes, see here.
  6. Run inference (i.e. text generation) of the model using transformers.pipeline, as outlined here; a minimal sketch follows below.
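Putting steps 4-6 together, the following is a minimal sketch: it assumes the conversion script in step 4 wrote the HuggingFace-format checkpoint to path/to/output (substitute your actual --output_dir), and the commented-out 4-bit quantization is optional.

# Load the converted checkpoint from step 4 and run text generation.
from transformers import pipeline, BitsAndBytesConfig

pipe = pipeline(
    "text-generation",
    model="path/to/output",  # the --output_dir passed to the conversion script
    device_map="auto",
    # Optional 4-bit quantization via bitsandbytes (step 5):
    # model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)
pipe(
    "Hello world",
    do_sample=True,
    top_p=0.95,
    temperature=0.01,
    max_length=250,
)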

Run with Karpathy's llama2.c

  1. Clone the llama2.c repository.
  2. Install the package dependencies with pip install -r requirements.txt.
  3. Export the model weights into the llama2.c format using the helper script: python export.py llama2.bin --meta-llama path/to/llama/model
  4. Run inference (i.e. text generation) of the model with ./run llama2.bin <temperature> <generation_tokens> <input_prompt>.

Model Serving Options

  1. FastChat: Open platform for training, serving, and evaluating large language model based chatbots.
  2. text-generation-inference: A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power Hugging Chat, the Inference API and Inference Endpoints.
  3. llama-recipes: Meta's recipes and tools for using llama-2.

Useful Links

  1. https://www.philschmid.de/llama-2