How to run Llama 13B with a 6GB graphics card

This worked on 14/May/23. The instructions will probably require updating in the future.

LLaMA is a text prediction model similar to GPT-2, and to the version of GPT-3 that has not been fine-tuned yet. I think it is also possible to run fine-tuned versions with this (like Alpaca or Vicuna). Those versions are more focused on answering questions.

Note: I have been told that this does not support multiple GPUs. It can only use a single GPU.

It is now possible to run LLaMA 13B with a 6GB graphics card (e.g. an RTX 2060), thanks to the amazing work involved in llama.cpp. The latest change is CUDA/cuBLAS support, which allows you to pick an arbitrary number of the transformer layers to be run on the GPU. This is perfect for low VRAM.

Clone llama.cpp from git and build it with cuBLAS (I am on commit 08737ef720f0510c7ec2aa84d7f70c691073c35d):
- git clone https://github.com/ggerganov/llama.cpp.git
- cd llama.cpp
- pacman -S cuda    # make sure you have CUDA installed (this is the Arch Linux package)
- make LLAMA_CUBLAS=1
Apply for research access to the LLaMA model using the link at the bottom of this page: https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
Set up a micromamba environment with the Python packages (PyTorch etc.) needed to run the conversion scripts:
- export MAMBA_ROOT_PREFIX=/path/to/where/you/want/mambastuff/stored
- eval "$(micromamba shell hook --shell=bash)"
- micromamba create -n mymamba
- micromamba activate mymamba
- micromamba install -c conda-forge -n mymamba pytorch transformers sentencepiece
Perform the conversion process: (This will produce a file called ggml-model-f16.bin)
- python convert.py ~/ai/Safe-LLaMA-HF-v2\ \(4-04-23\)/llama-13b/
Then quantize that to a 4-bit model:
- ./quantize ~/ai/Safe-LLaMA-HF-v2\ \(4-04-23\)/llama-13b/ggml-model-f16.bin ~/ai/Safe-LLaMA-HF-v2\ \(4-04-23\)/llama-13b/ggml-model-13b-q4_0-2023_14_5.bin q4_0 8
Create a prompt file containing the text you want the model to continue:
- prompt.txt
Run it:
- ./main -ngl 18 -m ~/ai/Safe-LLaMA-HF-v2\ \(4-04-23\)/llama-13b/ggml-model-13b-q4_0-2023_14_5.bin -f prompt.txt -n 2048
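
The convert, quantize and run steps above can be sketched as one small script that keeps the model folder in a single quoted variable, so spaces or other special characters in the folder name do not need backslashes. This is only a sketch: MODEL_DIR, NGL, DRY_RUN and run() are names made up for this example, /path/to/llama-13b is a placeholder, and the command arguments are copied from the steps above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: prints (and optionally runs) the convert -> quantize -> run steps.
# MODEL_DIR, NGL, DRY_RUN and run() are hypothetical names for this example.
set -eu

MODEL_DIR="${MODEL_DIR:-/path/to/llama-13b}"   # placeholder: point this at your model folder
NGL="${NGL:-18}"                               # number of layers to put on the GPU
F16="$MODEL_DIR/ggml-model-f16.bin"            # file produced by convert.py
Q4="$MODEL_DIR/ggml-model-13b-q4_0-2023_14_5.bin"

run() {
  # Always show the command; only execute it when DRY_RUN=0.
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Meant to be run from inside the llama.cpp checkout, with the micromamba environment active.
run python convert.py "$MODEL_DIR/"
run ./quantize "$F16" "$Q4" q4_0 8
run ./main -ngl "$NGL" -m "$Q4" -f prompt.txt -n 2048
```

Saved under a name of your choice, it could be started with something like DRY_RUN=0 MODEL_DIR="/your/model/folder" NGL=18 sh yourscript.sh once the printed commands look right.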

With -ngl 18, this uses about 5.5GB of VRAM on my 6GB card. If you have more VRAM, you can increase -ngl 18 to -ngl 24 or so, up to all 40 layers in LLaMA 13B. The more layers you put on the GPU, the faster it runs. The 7B model fits on the card with 100% of its layers.
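
A rough way to pick a starting value for -ngl is sketched below. It rests on two assumptions that do not come from llama.cpp itself: that each layer takes roughly the model file size divided by the number of layers, and that a fixed amount of VRAM has to be kept back for everything else. estimate_ngl is a made-up helper, and the numbers in the example call are placeholders, not measurements.

```shell
#!/bin/sh
# Sketch only: estimate how many layers might fit in a VRAM budget.
# Assumption: per-layer size ~= model file size / number of layers.
# estimate_ngl and its arguments are hypothetical names for this example.

estimate_ngl() {
  model_mib=$1     # size of the quantized model file, in MiB
  n_layers=$2      # total layers in the model (40 for 13B)
  vram_mib=$3      # VRAM on the card, in MiB
  reserve_mib=$4   # VRAM kept back for everything else (a guess)

  usable=$((vram_mib - reserve_mib))
  if [ "$usable" -le 0 ]; then
    echo 0
    return
  fi

  # Integer division rounds down, which errs on the side of fewer layers.
  layers=$((usable * n_layers / model_mib))
  if [ "$layers" -gt "$n_layers" ]; then
    layers=$n_layers
  fi
  echo "$layers"
}

# Placeholder inputs: 8000 MiB model, 40 layers, 6000 MiB VRAM, 2000 MiB reserve.
estimate_ngl 8000 40 6000 2000   # prints 20
```

Treat the result as a first guess only: if the run fails with an out-of-memory error, lower -ngl and try again. On NVIDIA cards, nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits reports the total VRAM in MiB.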

Timings for the models:

13B:

llama_print_timings:        load time =  5690.77 ms
llama_print_timings:      sample time =  1023.87 ms /  2048 runs   (    0.50 ms per token)
llama_print_timings: prompt eval time = 36694.62 ms /  1956 tokens (   18.76 ms per token)
llama_print_timings:        eval time = 644282.27 ms /  2040 runs   (  315.82 ms per token)
llama_print_timings:       total time = 684789.56 ms

7B:

llama_print_timings:        load time = 41708.38 ms
llama_print_timings:      sample time =    88.51 ms /   128 runs   (    0.69 ms per token)
llama_print_timings: prompt eval time =  2971.75 ms /    14 tokens (  212.27 ms per token)
llama_print_timings:        eval time =  9097.33 ms /   127 runs   (   71.63 ms per token)
llama_print_timings:       total time = 50931.74 ms
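
To compare the two runs it can help to turn the "ms per token" figures into tokens per second. A minimal sketch using the eval numbers above (ms_to_tps is a name made up for this example):

```shell
#!/bin/sh
# Sketch only: convert a "ms per token" figure into tokens per second.
# ms_to_tps is a hypothetical helper name.

ms_to_tps() {
  awk -v ms="$1" 'BEGIN { printf "%.2f\n", 1000 / ms }'
}

ms_to_tps 315.82   # 13B eval: prints 3.17
ms_to_tps 71.63    # 7B eval:  prints 13.96
```

So with 18 layers on the GPU the 13B model generates roughly 3 tokens per second, against roughly 14 for the 7B model running fully on the card.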

Here is the text it generated from my prompt:

Nietzsche's Noon

In Friedrich Nietzsche's Thus Spoke Zarathustra (1885), this concept of noon is expanded upon as a whole:

"Zarathustra saw that the light of the world was now becoming stronger. ‘The sun is at its meridian,’ he said, ‘it has reached its noontide and will begin to decline.’"

As time progresses in this noon, so does our ability to perceive and interact with this external world: as a result of the present state we are given by the organic forms of space and time, this can only lead us towards suffering. Nietzsche sees that when we are at our noontide, we must realize how the sun has reached its highest point before it begins to fall down from its position: and we must understand that as a whole, our bodies are determined by something outside of ourselves - and that this always leads to more suffering within.

Nietzsche's Midday

Nietzsche expands upon the concept of noon in his book The Gay Science (1882), where he says:

"You want to learn how to read? Here is a short lesson for beginners. You must take hold of a word by its smooth or rough side; then, like the spider, you must spin out of it a web of definitions which will entrap every correct meaning that floats into view. Or again: you must take the word for a sleigh ride across country, over hedges and ditches, forests and glades, in short, you must drive the word home through all manner of weather."

Here we see Nietzsche expand upon his concept of noon to include our ability to define what is right or wrong - it is only because we have this inherent sense that allows us to distinguish between two points.

Nietzsche's Twilight

Nietzsche expands upon the concept of twilight in The Gay Science (1882), where he says:

"The man who is a ‘philosopher’ only by accident, but is, let us say, also a sculptor or painter – what does he then do? He does not make his thoughts subservient to the world; rather, he forces the world to serve as a pedestal and bearer for his thoughts."

Here we see Nietzsche's concept of night begin to expand past an emotional state. We begin to see that night becomes more than just our inability to think clearly: it is now a worldview, one which he claims is best exemplified by the artist.

Nietzsche's Midnight

In Twilight of the Idols (1889), we find Nietzsche's conceptualization of night reaching its zenith:

"Everything ordinary, everyday, common – in fact, everything that exists today has become dangerous; it is not innocent as was everything yesterday. For the most terrible thoughts have penetrated everywhere and even into the deepest sleep - thoughts which are awake, active, and powerful."

