This is how I got https://github.com/facebookresearch/llama working with Llama2 on a Windows 11 machine with a 4080 (16GB VRAM).
- Download a modern version of wget (one with TLS 1.2 support), e.g. https://eternallybored.org/misc/wget/
- (if necessary) Modify `download.sh` to call your version of `wget` instead of the default one.
- Run `./download.sh` via Git Bash and give it the URL from your email (it should start with `https://download.llamameta.net`, not `https://l.facebook.com/`). Warning: the 70B-parameter models are big - the weights are 16-bit floats, so figure roughly 2GB of download per 1B parameters.
- Create a virtual environment - `python -m venv .venv`
- Activate the virtual environment - `. .\.venv\scripts\Activate.ps1`
- Install prereqs - `pip install -r requirements.txt`
- Remove the CPU-only torch - `pip uninstall torch`
- Install CUDA-enabled torch - `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118` (this takes a while - it's ~2.6GB just for torch)
- Run `python -m torch.utils.collect_env` to confirm you're running the CUDA-supporting build and that it detects CUDA (a quicker check appears after this list).
- Modify `example_text_completion.py` to add `import torch` and put `torch.distributed.init_process_group("gloo")` at the top of `main()` - I couldn't find a Windows build of torch with both CUDA and NCCL. A sketch of the edited file follows this list.
- Run `torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 1280 --max_batch_size 4 --max_gen_len 1024`
- Edit the prompts in `example_text_completion.py` and re-run the above to play with the model.
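
If you want a quicker sanity check than `collect_env`'s full report, a two-line snippet (my own shorthand, not anything from the repo) confirms the same thing:

```python
import torch

print(torch.__version__)          # expect a CUDA build, e.g. "2.0.1+cu118"
print(torch.cuda.is_available())  # expect True if torch can see the GPU
```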
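
For the Gloo change, here is roughly what the top of `example_text_completion.py` looks like after the edit. The parameter defaults are from the copy of the repo I used, so yours may differ:

```python
import fire
import torch  # added

from llama import Llama


def main(
    ckpt_dir: str,
    tokenizer_path: str,
    temperature: float = 0.6,
    top_p: float = 0.9,
    max_seq_len: int = 128,
    max_gen_len: int = 64,
    max_batch_size: int = 4,
):
    # Added: Windows torch+CUDA builds ship without NCCL, so create the
    # default process group with the Gloo backend before Llama.build()
    # tries to initialize distributed state itself.
    torch.distributed.init_process_group("gloo")

    # ... rest of main() unchanged (Llama.build, prompts, text_completion)


if __name__ == "__main__":
    fire.Fire(main)
```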
I wasn't able to get the 13B model to work: with `--nproc_per_node 1` it failed with `AssertionError: Loading a checkpoint for MP=2 but world size is 1`, and setting `--nproc_per_node 2` gave `RuntimeError: CUDA error: invalid device ordinal` because I only have one GPU. meta-llama/llama#101 (comment) looks like a possible option to "reshard" the 13B checkpoint to run on a single GPU, but I haven't investigated.
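
For reference, that reshard would look something like the sketch below. I haven't run this - the dim-0 vs. dim-1 merge rules are my reading of how llama splits its model-parallel layers (column-parallel weights along dim 0; row-parallel and embedding weights along dim 1; norms replicated), so verify before trusting the output:

```python
# reshard_13b.py - UNTESTED sketch: merge the two 13B shards (MP=2) into a
# single checkpoint (MP=1), along the lines of the issue comment above.
import os

import torch


def merge_dim(key: str) -> int | None:
    """Return the dim a tensor is sharded along (None = replicated)."""
    # Column-parallel weights: split along dim 0 (my assumption from
    # reading llama/model.py - double-check against your checkout).
    if key == "output.weight" or key.endswith(
        ("wq.weight", "wk.weight", "wv.weight", "w1.weight", "w3.weight")
    ):
        return 0
    # Row-parallel and embedding weights: split along dim 1.
    if key == "tok_embeddings.weight" or key.endswith(("wo.weight", "w2.weight")):
        return 1
    return None  # norms, rope.freqs, etc. are identical in both shards


shards = [
    torch.load(f"llama-2-13b/consolidated.{i:02d}.pth", map_location="cpu")
    for i in range(2)
]
merged = {}
for key, value in shards[0].items():
    dim = merge_dim(key)
    merged[key] = value if dim is None else torch.cat([s[key] for s in shards], dim=dim)

os.makedirs("llama-2-13b-merged", exist_ok=True)
# Copy params.json and tokenizer.model alongside, then point --ckpt_dir here.
torch.save(merged, "llama-2-13b-merged/consolidated.00.pth")
```

Even with a merged checkpoint, the 13B weights are ~26GB in fp16, so they still wouldn't fit in 16GB of VRAM without quantization or offloading.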