tl;dr: use Linux, install bitsandbytes (either globally or in KAI's conda env), and add `load_in_8bit=True, device_map="auto"` to the model pipeline creation calls.
Many people are unable to load models due to their GPU's limited VRAM. These models contain billions of parameters (model weights and biases), each of which is a 32-bit (or 16-bit) float. Thanks to the hard work of some researchers [1], it's possible to run these models using 8-bit numbers, which halves the required amount of VRAM compared to running in half-precision. For example, if a model needs 16GB of VRAM in half-precision, running it with 8-bit inference requires only about 8GB.
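To put rough numbers on this, here is a small sketch (not one of the guide's required steps) that estimates the VRAM needed just to hold the weights of a hypothetical 6-billion-parameter model at different precisions; activations and other runtime overhead are ignored:

```python
# Back-of-the-envelope VRAM estimate for model weights only.
# Assumes a hypothetical model with 6 billion parameters.
params = 6e9

for precision, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{precision}: ~{gib:.1f} GiB")

# Prints roughly: fp32 ~22.4 GiB, fp16 ~11.2 GiB, int8 ~5.6 GiB
```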
This guide was written for KoboldAI 1.19.1 and tested with Ubuntu 20.04. These instructions are based on work by Gmin in KoboldAI's Discord server, and on Hugging Face's efficient LM inference guide.
- KoboldAI (KAI) must be running on Linux
- Must use an NVIDIA GPU that supports 8-bit tensor cores (Turing, Ampere, or newer architectures, e.g. T4, RTX 20-series, RTX 30-series, A40, A100); a quick check is shown after this list
- CPU RAM must be large enough to load the entire model into memory (KAI has some optimizations to incrementally load the model, but 8-bit mode seems to break this)
- The GPU must have roughly half of the model's recommended VRAM requirement; the model cannot be split between GPU and CPU
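If you are unsure whether your GPU qualifies, one quick way to check (assuming PyTorch with CUDA support is already installed) is to look at its compute capability; Turing corresponds to 7.5 and Ampere to 8.0/8.6:

```python
# Rough check for 8-bit tensor core support: Turing (7.5) or newer.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"- compute capability {major}.{minor}")
print("8-bit tensor cores likely supported:", (major, minor) >= (7, 5))
```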
bitsandbytes is a Python library that manages low-level 8-bit operations for model inference. Add bitsandbytes to the `environments/huggingface.yml` file, under the `pip` section. Your file should look something like this:
```yaml
name: koboldai
channels:
  # ...
dependencies:
  # ...
  - pip:
    - bitsandbytes # <---- add this
    # ...
```
Next, install bitsandbytes in KoboldAI's environment with `bin/micromamba install -f environments/huggingface.yml -r runtime -n koboldai`. The output should look something like this:
```
...
Requirement already satisfied: MarkupSafe>=2.0 in /home/...
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.35.1
```
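If you want to confirm the install is visible to KoboldAI's runtime environment, a quick sanity check looks like the sketch below (the `check_bnb.py` filename and the `micromamba run` invocation in the comment are just illustrative):

```python
# check_bnb.py -- run with KoboldAI's runtime Python, e.g. something like:
#   bin/micromamba run -r runtime -n koboldai python check_bnb.py
from importlib.metadata import version

import bitsandbytes  # the import itself fails if the install didn't work
import torch

print("bitsandbytes version:", version("bitsandbytes"))
print("CUDA available:", torch.cuda.is_available())
```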
Make the following changes to `aiserver.py`:
- Under `class vars:`, set `lazy_load` to `False`:

  ```python
  class vars:
      # ...
      debug = False  # If set to true, will send debug information to the client for display
      lazy_load = False  # <--- change this
      # ...
  ```
- Under `reset_model_settings()`, also set `vars.lazy_load` to `False`:

  ```python
  def reset_model_settings():
      # ...
      vars.lazy_load = False  # <--- change this
  ```
- Edit this line to add `load_in_8bit=True` and `device_map="auto"`:

  ```python
  #                                                                                               vvvvvvvvvvv add these vvvvvvvvvvvvvvv
  model = AutoModelForCausalLM.from_pretrained("models/{}".format(vars.model.replace('/', '_')), load_in_8bit=True, device_map="auto", revision=vars.revision, cache_dir="cache", **lowmem)
  ```
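For reference, the snippet below is a minimal standalone sketch of the same 8-bit loading path using transformers and bitsandbytes directly, outside of KoboldAI. The model name is only a placeholder, and accelerate must also be installed for `device_map="auto"` to work:

```python
# Standalone 8-bit inference sketch (not a KoboldAI code path).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # placeholder; any supported causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize weights to int8 via bitsandbytes
    device_map="auto",   # let accelerate place the layers on the GPU
)

inputs = tokenizer("KoboldAI is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```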
Start KoboldAI normally. Set all model layers to GPU, as we cannot split the model between CPU and GPU.
The changes we made do not apply to GPT-2 models, nor to models loaded from custom directories (though you can enable 8-bit mode for custom directories by adding the `load_in_8bit`/`device_map` parameters to the appropriate `AutoModelForCausalLM.from_pretrained()` calls, as sketched below).
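For the custom-directory case, the call would look something like this (the `/path/to/my-model` path is a placeholder):

```python
# Hypothetical example: 8-bit loading from a local model directory.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/my-model",  # placeholder path to a locally stored model
    load_in_8bit=True,
    device_map="auto",
)
```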
Note: generation may still crash on non-RTX NVIDIA GPUs, which lack the 8-bit tensor core support listed in the requirements above.