This is how I got https://github.com/facebookresearch/llama working with Llama 2 on a Windows 11 machine with a 4080 (16GB VRAM).
- Download a modern version of wget (with support for TLS 1.2), e.g. https://eternallybored.org/misc/wget/
- (If necessary) Modify `download.sh` to call your version of `wget` instead of the default one.
- Run `./download.sh` via Git Bash and give it the URL from your email (it should start with `https://download.llamameta.net`, not `https://l.facebook.com/`). Warning: the 70B-parameter models are big - figure roughly 2GB of download per 1B parameters.
- Create a virtual environment - `python -m venv .venv`
- Activate the virtual environment - `. .\.venv\scripts\Activate.ps1`
- Install prerequisites - `pip install -r requirements.txt`
- Remove the CPU-only torch - `pip uninstall torch`
- Install CUDA-enabled torch - `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118` (this takes a while - it's ~2.6GB just for torch)
- Run `python -m torch.utils.collect_env` to confirm you're running the CUDA-supporting version and that it detects CUDA. (There's also a quicker sanity check after this list.)
- Modify `example_text_completion.py` to add `import torch` and put `torch.distributed.init_process_group("gloo")` in `main()` - I couldn't find a Windows build of torch with both CUDA and NCCL. (A sketch of the patched file follows the list.)
- Run `torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 1280 --max_batch_size 4 --max_gen_len 1024`
- Edit `example_text_completion.py` and re-run the above to play with the model.
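
As a quicker sanity check than `collect_env` (this snippet is mine, not part of the repo), you can ask torch directly whether it sees the GPU:

```python
import torch

# True means the CUDA build is installed and a GPU is visible;
# False means you're still on the CPU-only wheel.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4080"
```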
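
And here's roughly what the patched `example_text_completion.py` ends up looking like. Only the `import torch` and the `init_process_group` call are my additions; the rest is paraphrased from the stock example, so treat it as a sketch rather than a verbatim copy:

```python
import torch  # added
import fire
from llama import Llama


def main(
    ckpt_dir: str,
    tokenizer_path: str,
    max_seq_len: int = 128,
    max_gen_len: int = 64,
    max_batch_size: int = 4,
):
    # Added: use the gloo backend, since I couldn't find a Windows
    # build of torch that ships with both CUDA and NCCL.
    torch.distributed.init_process_group("gloo")

    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )
    # ...the rest of the stock main() (building prompts and calling
    # generator.text_completion) is unchanged.


if __name__ == "__main__":
    fire.Fire(main)
```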
I wasn't able to get the 13B model to work: I was getting `AssertionError: Loading a checkpoint for MP=2 but world size is 1`, but setting `--nproc_per_node 2` gave `RuntimeError: CUDA error: invalid device ordinal` because I only have one GPU. meta-llama/llama#101 (comment) looks like a possible option to "reshard" the 13B model to run on a single GPU, but I haven't investigated.
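
I haven't tried it, but the gist of that approach would presumably be to load both MP=2 shards and concatenate each tensor back together along the dimension it was sharded on. A rough, untested sketch - the column- vs row-parallel classification below is my assumption about how the checkpoint is split, not something I've verified, and the file paths are made up:

```python
import torch

# Untested: merge the two MP=2 shards of the 13B checkpoint into one.
shards = [
    torch.load("llama-2-13b/consolidated.00.pth", map_location="cpu"),
    torch.load("llama-2-13b/consolidated.01.pth", map_location="cpu"),
]

merged = {}
for key, value in shards[0].items():
    if key.endswith("norm.weight") or key == "rope.freqs":
        # Replicated on every rank - keep a single copy.
        merged[key] = value
    elif any(key.endswith(s) for s in (
            "wq.weight", "wk.weight", "wv.weight",
            "w1.weight", "w3.weight", "output.weight")):
        # Assumed column-parallel: each shard holds a slice of the rows.
        merged[key] = torch.cat([s[key] for s in shards], dim=0)
    else:
        # Assumed row-parallel (wo, w2) or the token embedding:
        # each shard holds a slice of the columns.
        merged[key] = torch.cat([s[key] for s in shards], dim=1)

torch.save(merged, "llama-2-13b-merged/consolidated.00.pth")
```

You'd then point `--ckpt_dir` at the merged folder (with `params.json` copied over) and keep `--nproc_per_node 1` - again, untested.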