tl;dr: use Linux, install bitsandbytes (either globally or in KAI's conda env), and add `load_in_8bit=True, device_map="auto"` to the model pipeline creation calls.
Many people are unable to load models due to their GPU's limited VRAM. These models contain billions of parameters (model weights and biases), each of which is a 32-bit (or 16-bit) float. Thanks to the hard work of some researchers [1], it's possible to run these models using 8-bit numbers, which halves the required amount of VRAM compared to running in half-precision. For example, if a model needs 16GB of VRAM in half-precision, running it with 8-bit inference requires only about 8GB.
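To put rough numbers on this, here is a small sketch (not one of the guide's required steps) that estimates the VRAM needed just to hold the weights of a hypothetical 6-billion-parameter model at different precisions; activations and other runtime overhead are ignored:

```python
# Back-of-the-envelope VRAM estimate for model weights only.
# Assumes a hypothetical model with 6 billion parameters.
params = 6e9

for precision, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{precision}: ~{gib:.1f} GiB")

# Prints roughly: fp32 ~22.4 GiB, fp16 ~11.2 GiB, int8 ~5.6 GiB
```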
This guide was written for KoboldAI 1.19.1 and tested with Ubuntu 20.04. These instructions are based on work by Gmin in KoboldAI's Discord server, and on Hugging Face's efficient LM inference guide.
- KoboldAI (KAI) must be running on Linux
- Must use an NVIDIA GPU that supports 8-bit tensor cores (Turing, Ampere, or newer architectures, e.g. T4, RTX 20-series, RTX 30-series, A40, A100); a quick check is shown after this list
- CPU RAM must be large enough to load the entire model into memory (KAI has some optimizations to incrementally load the model, but 8-bit mode seems to break this)
- The GPU must have roughly half of the model's recommended VRAM requirement; the model cannot be split between GPU and CPU
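If you are unsure whether your GPU qualifies, one quick way to check (assuming PyTorch with CUDA support is already installed) is to look at its compute capability; Turing corresponds to 7.5 and Ampere to 8.0/8.6:

```python
# Rough check for 8-bit tensor core support: Turing (7.5) or newer.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"- compute capability {major}.{minor}")
print("8-bit tensor cores likely supported:", (major, minor) >= (7, 5))
```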
bitsandbytes is a Python library that manages low-level 8-bit operations for model inference. Add bitsandbytes to the `environments/huggingface.yml` file, under the `pip` section. Your file should look something like this:
```yaml
name: koboldai
channels:
  # ...
dependencies:
  # ...
  - pip:
    - bitsandbytes # <---- add this
    # ...
```
Next, install bitsandbytes in KoboldAI's environment with `bin/micromamba install -f environments/huggingface.yml -r runtime -n koboldai`. The output should look something like this:
```
...
Requirement already satisfied: MarkupSafe>=2.0 in /home/...
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.35.1
```
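If you want to confirm the install is visible to KoboldAI's runtime environment, a quick sanity check looks like the sketch below (the `check_bnb.py` filename and the `micromamba run` invocation in the comment are just illustrative):

```python
# check_bnb.py -- run with KoboldAI's runtime Python, e.g. something like:
#   bin/micromamba run -r runtime -n koboldai python check_bnb.py
from importlib.metadata import version

import bitsandbytes  # the import itself fails if the install didn't work
import torch

print("bitsandbytes version:", version("bitsandbytes"))
print("CUDA available:", torch.cuda.is_available())
```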
Make the following changes to `aiserver.py`:
- Under `class vars:`, set `lazy_load` to `False`:

  ```python
  class vars:
      # ...
      debug = False  # If set to true, will send debug information to the client for display
      lazy_load = False  # <--- change this
      # ...
  ```
- Under `reset_model_settings()`, also set `vars.lazy_load` to `False`:

  ```python
  def reset_model_settings():
      # ...
      vars.lazy_load = False  # <--- change this
  ```
- Edit this line to add `load_in_8bit=True` and `device_map="auto"`:

  ```python
  #                                                                                               vvvvvvvvvvv add these vvvvvvvvvvvvvvv
  model = AutoModelForCausalLM.from_pretrained("models/{}".format(vars.model.replace('/', '_')), load_in_8bit=True, device_map="auto", revision=vars.revision, cache_dir="cache", **lowmem)
  ```
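For reference, the snippet below is a minimal standalone sketch of the same 8-bit loading path using transformers and bitsandbytes directly, outside of KoboldAI. The model name is only a placeholder, and accelerate must also be installed for `device_map="auto"` to work:

```python
# Standalone 8-bit inference sketch (not a KoboldAI code path).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # placeholder; any supported causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize weights to int8 via bitsandbytes
    device_map="auto",   # let accelerate place the layers on the GPU
)

inputs = tokenizer("KoboldAI is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```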
Start KoboldAI normally. Set all model layers to GPU, as we cannot split the model between CPU and GPU.
The changes we made do not apply to GPT-2 models, nor to models loaded from custom directories (though you can enable 8-bit mode for custom directories by adding the `load_in_8bit`/`device_map` parameters to the appropriate `AutoModelForCausalLM.from_pretrained()` calls, as sketched below).
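For the custom-directory case, the call would look something like this (the `/path/to/my-model` path is a placeholder):

```python
# Hypothetical example: 8-bit loading from a local model directory.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/my-model",  # placeholder path to a locally stored model
    load_in_8bit=True,
    device_map="auto",
)
```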
Note: generation may still crash on non-RTX NVIDIA GPUs, which lack the 8-bit tensor core support listed in the requirements above.