The following instructions were used to get Facebook's Llama-2 up and running on Ubuntu 22.04 (70B model) and an M1 MacBook Air (7B model).
Divided into 2 parts:
- Part 1: Download models from facebook's repo: https://github.com/facebookresearch/llama
- Part 2: Use the llama.cpp repo to convert the model and run inference: https://github.com/ggerganov/llama.cpp
Important: if you are trying to work with the 70B model and have 500 GB or less of free space, note that this process requires a lot of room. Even with 500 GB free, I ran out of space mid-way because of all the intermediate files being generated.
Use
df -h
to keep checking free space, or run
watch df -h
in a separate terminal to re-check the space every 2 seconds (watch's default interval).
After downloading the model (Part 1) and converting it to a ggml model (Part 2, step 4), I moved the downloaded consolidated.xx.pth files from Part 1 to another hard drive to free some space before running the quantize command (Part 2, step 5).
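If you need to do the same, a minimal sketch, assuming the model sits in ../llama/llama-2-70b-chat/ and /mnt/backup is a hypothetical mount point for the second drive:
# move the original PyTorch checkpoints out; the converted .gguf files stay behind
$ mv ../llama/llama-2-70b-chat/consolidated.*.pth /mnt/backup/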
Part 1: Download the models

1. Install Python 3.9 or above. Most recent Linux distros ship with it.
2. Set up a virtualenv (optional but recommended).
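A minimal sketch of the standard setup (the directory name venv is arbitrary):
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ python --version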
3. Go to https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and request access.
It should take around 5-10 minutes to receive an email from Meta AI. Meanwhile, you can complete steps 4 to 7.
4. Install git:
# for ubuntu
(venv) $ sudo apt update
(venv) $ sudo apt install git
5. Also install wget and md5sum if you don't have them already.
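On Ubuntu, for example (md5sum ships with coreutils, so it is usually already installed):
(venv) $ sudo apt install wget coreutils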
6. Clone Facebook's repo:
(venv) $ git clone https://github.com/facebookresearch/llama.git
7. Once the clone is complete, go inside the llama directory and install the requirements. This takes around 10 minutes.
(venv) $ cd llama
(venv) $ pip install -e .
8. Make download.sh executable and run it:
(venv) $ chmod +x download.sh
(venv) $ ./download.sh
9. After running the script, you will be prompted to enter the link you received in step 3. The link starts with https://download.llamameta.net/*?Policy=eyJTdG... Copy it exactly, paste it, and hit enter.
10. Then you will be asked to choose a model, with a prompt like:
Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all:
- 70B is the largest, at around 129 GB. It took me around 14 hrs to download it completely.
- So make sure you have sufficient space, good internet speed, and extra time to work on it.
- If you are just starting out, use the 7B model to test things out.
- Alternatively, look into Hugging Face transformers, which, I think, may require a paid account.
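Once a download finishes, it is worth verifying the files before moving on. A minimal sketch, assuming the download left a checklist.chk checksum file inside the model directory (adjust the directory name for the model you chose; skip this if the file isn't there):
(venv) $ cd llama-2-70b-chat
(venv) $ md5sum -c checklist.chk
(venv) $ cd ..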
Part 2: Convert the model and run inference with llama.cpp

1. In a new directory outside the llama dir from above, clone this repo:
(venv) $ cd ..
(venv) $ git clone https://github.com/ggerganov/llama.cpp.git
2. Go inside the directory and run the make command:
(venv) $ cd llama.cpp
(venv) $ make
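One tip not in the original notes: make accepts a -j flag to compile in parallel, which speeds this step up noticeably.
# use all available cores; on macOS, replace $(nproc) with $(sysctl -n hw.ncpu)
(venv) $ make -j$(nproc)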
3. Install the requirements:
(venv) $ pip install -r requirements.txt
4. Run convert.py on the downloaded model folder:
(venv) $ python convert.py <path to downloaded model folder>
For example, in my case:
(venv) $ python convert.py ../llama/llama-2-70b-chat/
I run this script from inside the llama.cpp directory, and my directory structure looks like this:
parent_folder/
    llama/ ---- cloned facebook's llama repo
        llama-2-70b-chat/ ---- downloaded model from facebook
        other files in that repo
    llama.cpp/ ---- cloned ggerganov/llama.cpp repo
        convert.py ---- script to run
        other files in that repo
Read through convert.py's main function (around line 1282) to learn about the other parameters.
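At the time of writing, convert.py also accepted flags such as --outfile and --outtype to control where the converted model is written and at what precision; these change between llama.cpp versions, so treat the following as a sketch and confirm against the script itself:
(venv) $ python convert.py ../llama/llama-2-70b-chat/ --outtype f16 --outfile ../llama/llama-2-70b-chat/ggml-model-f16.gguf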
5. Quantize the model:
(venv) $ ./quantize ../llama/llama-2-70b-chat/ggml-model-f16.gguf ../llama/llama-2-70b-chat/ggml-model-q4_0.gguf q4_0
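q4_0 is just one of the supported quantization types; running the binary with no arguments prints the full list your build supports. For example, the k-quant preset q4_K_M trades slightly more disk space for better output quality (preset availability depends on your llama.cpp version):
(venv) $ ./quantize ../llama/llama-2-70b-chat/ggml-model-f16.gguf ../llama/llama-2-70b-chat/ggml-model-q4_K_M.gguf q4_K_M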
6. Run the inference:
# for the 70B model we need to add -gqa 8
(venv) $ ./main -m ../llama/llama-2-70b-chat/ggml-model-q4_0.gguf -n 128 -gqa 8
# for the 7B model
(venv) $ ./main -m ../llama/llama-2-7b-chat/ggml-model-q4_0.gguf -n 128
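Without a prompt, main just samples unconditionally. To steer the output, pass a prompt with -p (the prompt text below is a hypothetical example); -t sets the number of threads:
(venv) $ ./main -m ../llama/llama-2-7b-chat/ggml-model-q4_0.gguf -n 128 -t 8 -p "Building a website can be done in 10 simple steps:"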