tl;dr: use Linux, install bitsandbytes (either globally or in KAI's conda env), and add load_in_8bit=True, device_map="auto" to the model/pipeline creation calls.
Many people are unable to load models due to their GPU's limited VRAM. These models contain billions of parameters (model weights and biases), each of which is stored as a 32-bit (or 16-bit) float. Thanks to the hard work of some researchers [1], it's possible to run these models using 8-bit numbers, which halves the required amount of VRAM compared to running in half-precision. For example, a model that needs 16GB of VRAM in half-precision needs only about 8GB with 8-bit inference.
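To make the tl;dr concrete, here is a minimal sketch of what an 8-bit load looks like with Huggingface transformers once bitsandbytes (and accelerate) are installed. The model name is just an example, not something this guide prescribes; KoboldAI does the equivalent internally once its loading calls are patched as described later.

```python
# Minimal sketch: loading a causal LM in 8-bit with transformers + bitsandbytes.
# Assumes bitsandbytes and accelerate are installed; the model name is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # example model, swap in the one you actually use

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place layers across GPU/CPU automatically
    load_in_8bit=True,   # quantize the weights to int8 via bitsandbytes
)

prompt = "KoboldAI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```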
This guide was written for KoboldAI 1.19.1 and tested with Ubuntu 20.04. These instructions are based on work by Gmin in KoboldAI's Discord server, and on Huggingface's efficient LM inference guide.