@whjms
Last active April 7, 2023 16:35
Instructions for running KoboldAI in 8-bit mode

Running KoboldAI in 8-bit mode

tl;dr use Linux, install bitsandbytes (either globally or in KAI's conda env), then add load_in_8bit=True, device_map="auto" to the model pipeline creation calls.

Many people are unable to load models because of their GPU's limited VRAM. These models contain billions of parameters (model weights and biases), each stored as a 32-bit (or 16-bit) float. Thanks to the hard work of some researchers [1], it's possible to run these models using 8-bit numbers, which halves the required amount of VRAM compared to running in half precision. For example, a model that needs 16GB of VRAM in half precision needs only about 8GB with 8-bit inference.
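
For a rough sense of the savings, here is a back-of-the-envelope sketch in Python (weights only; activations and bitsandbytes overhead add a bit more on top, and the model sizes are only illustrative):

def weight_vram_gib(params_billion, bytes_per_param):
    # Memory needed just to hold the weights, in GiB.
    return params_billion * 1e9 * bytes_per_param / 1024**3

for name, size_b in [("GPT-J-6B", 6.0), ("OPT-13B", 13.0)]:
    fp16 = weight_vram_gib(size_b, 2)   # half precision: 2 bytes per parameter
    int8 = weight_vram_gib(size_b, 1)   # 8-bit: 1 byte per parameter
    print(f"{name}: fp16 ~{fp16:.1f} GiB, int8 ~{int8:.1f} GiB")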

This guide was written for KoboldAI 1.19.1, and tested with Ubuntu 20.04. These instructions are based on work by Gmin in KoboldAI's Discord server, and Huggingface's efficient LM inference guide.

Requirements

  • KoboldAI (KAI) must be running on Linux
  • Must use an NVIDIA GPU that supports 8-bit tensor cores (Turing, Ampere, or newer architectures, e.g. T4, RTX 20-series, RTX 30-series, A40-A100); see the check sketched after this list
  • CPU RAM must be large enough to load the entire model in memory (KAI has some optimizations to incrementally load the model, but 8-bit mode seems to break this)
  • The GPU must have roughly half of the model's recommended VRAM requirement; the model cannot be split between GPU and CPU.
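
A quick way to check whether a card meets the 8-bit tensor-core requirement is to query its compute capability with PyTorch, which KAI's environment already includes (a sketch; the 7.5 cutoff corresponds to Turing, and newer bitsandbytes releases relax this requirement, as noted in the comments below):

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")
    # Turing (T4, RTX 20-series) is 7.5; Ampere (RTX 30-series, A40/A100) is 8.x.
    print("8-bit tensor cores:", "yes" if (major, minor) >= (7, 5) else "no")
else:
    print("No CUDA GPU detected")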

Getting started

Installing bitsandbytes

bitsandbytes is a Python library that manages low-level 8-bit operations for model inference. Add bitsandbytes to the environments/huggingface.yml file, under the pip section. Your file should look something like this:

name: koboldai
channels:
  # ...
dependencies:
  # ...
  - pip:
    - bitsandbytes  # <---- add this
    # ...

Next, install bitsandbytes in KoboldAI's environment with bin/micromamba install -f environments/huggingface.yml -r runtime -n koboldai. The output should look something like this:

...
Requirement already satisfied: MarkupSafe>=2.0 in /home/...
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.35.1
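
To confirm the package is visible to KAI's Python, you can run a small check inside the same environment (a sketch; the micromamba invocation in the comment is assumed to mirror the install command above):

# check_bnb.py - run with, e.g.: bin/micromamba run -r runtime -n koboldai python check_bnb.py
from importlib.metadata import version

import bitsandbytes  # the import fails loudly if the package or its CUDA setup is broken
print("bitsandbytes", version("bitsandbytes"))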

Code changes

Make the following changes to aiserver.py:

  1. Under class vars:, set lazy_load to False:

    class vars:
      # ...
      debug       = False # If set to true, will send debug information to the client for display
      lazy_load   = False # <--- change this
      # ...
  2. Under reset_model_settings(), set vars.lazy_load to False also:

    def reset_model_settings():
      # ...
      vars.lazy_load = False # <--- change this
  3. Edit the AutoModelForCausalLM.from_pretrained() call in aiserver.py to add load_in_8bit=True and device_map="auto":

                                                                                                     # vvvvvvvvvvv add these vvvvvvvvvvvvvvv #
    model     = AutoModelForCausalLM.from_pretrained("models/{}".format(vars.model.replace('/', '_')), load_in_8bit=True, device_map="auto", revision=vars.revision, cache_dir="cache", **lowmem)

Go!

Start KoboldAI normally. Set all model layers to GPU, as we cannot split the model between CPU and GPU.

The changes we made do not apply to GPT-2 models, nor to models loaded from custom directories (but you can enable 8-bit mode for custom directories by adding the load_in_8bit/device_map parameters to the appropriate AutoModelForCausalLM.from_pretrained() calls, as sketched below).
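
For reference, a custom-directory load with 8-bit enabled would look roughly like this (a sketch, not the exact call in aiserver.py; the path is a placeholder):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/custom/model",  # placeholder: local directory with config.json and weights
    load_in_8bit=True,        # quantize linear-layer weights to int8 at load time
    device_map="auto",        # let accelerate place every layer on the GPU
    cache_dir="cache",
)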


[1] LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer)

@anonymous721

anonymous721 commented Jan 25, 2023

For the record, if anyone is wondering how firm the Nvidia requirement is (since some guides say so even when AMD has workarounds), it seems this can't readily be made to work on non-Nvidia cards. I built bitsandbytes-rocm, and in KoboldAI's environments/rocm.yml I pointed pip to that local package instead of standard bitsandbytes. It looks like it works at first, since it successfully loads the models at the reduced size, but when you actually attempt to generate, it errors out because Int8 Matmul is not supported. My understanding is that while RDNA2 cards do support Int8 operations, the specific operations used here are only available on Nvidia.

@gururise

gururise commented Feb 6, 2023

The latest version of bitsandbytes was just released and supports Int8 Matmul on non-RTX Nvidia cards. Should be easy to port this to the ROCm version to get all supported AMD cards working.

@Ph0rk0z

Ph0rk0z commented Feb 8, 2023

It is still crashing when generating on my non-RTX Nvidia card.

@anonymous721

In reply to "Should be easy to port this to the ROCm version to get all supported AMD cards working":

I'm not so sure. The change uses cublaslt, which seems to not have direct equivalent support in ROCm yet, so it's definitely not as simple as just running hipify.
I have no idea whether the technique the workaround uses could be re-implemented using something ROCm does have, since I know nothing about GPU programming and my attempts were just groping in the dark, but there doesn't seem to be a direct 1:1 translation.

@archytasos

archytasos commented Feb 18, 2023

Thank you for sharing the gist!

I have tried the code on a specific branch and was able to successfully load both the OPT-6.7B-Erebus and pygmalion-6b models on my 1080Ti with 11GB VRAM (using the Kepler architecture and CUDA 7.5+). However, I did encounter a minor issue that must be worked around. I tested two OPT-based models, and both appear to function correctly with the default threshold. On the other hand, the pygmalion-6b model and other GPT-J-based models fail to perform inference, and I received an error message regarding the probability values in the last layer:

ERROR      | __main__:generate:4944 - Traceback (most recent call last):
  File "/home/user/aipainter/local_chat/KoboldAI-Client/aiserver.py", line 4933, in generate
    genout, already_generated = tpool.execute(_generate, txt, minimum, maximum, found_entries)
  File "/home/user/.pyenv/versions/kobold/lib/python3.10/site-packages/eventlet/tpool.py", line 132, in execute
    six.reraise(c, e, tb)
  File "/home/user/.pyenv/versions/kobold/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/user/.pyenv/versions/kobold/lib/python3.10/site-packages/eventlet/tpool.py", line 86, in tworker
    rv = meth(*args, **kwargs)
  File "/home/user/aipainter/local_chat/KoboldAI-Client/aiserver.py", line 4856, in _generate
    genout = generator(
  File "/home/user/.pyenv/versions/kobold/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context 
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/kobold/lib/python3.10/site-packages/transformers/generation_utils.py", line 1543, in generate
    return self.sample(
  File "/home/user/aipainter/local_chat/KoboldAI-Client/aiserver.py", line 1994, in new_sample
    return new_sample.old_sample(self, *args, **kwargs)
  File "/home/user/.pyenv/versions/kobold/lib/python3.10/site-packages/transformers/generation_utils.py", line 2518, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Since the model has been fine-tuned, it may be necessary to set a lower threshold for the maximum int8 value in order to improve its stability. After that modification, fine-tuned 6.7B GPT-J models load on my 1080Ti in about 7GiB. While this works, setting a lower threshold results in more VRAM being needed during inference, because more weights are represented as fp16 instead of int8.
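
With recent transformers versions, the int8 threshold can be lowered via BitsAndBytesConfig (a sketch under that assumption; 6.0 is the default, and 2.0 below is only an example value):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# A lower llm_int8_threshold keeps more outlier features in fp16, which can be more
# stable for some fine-tuned models but uses more VRAM during inference.
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=2.0)

model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",       # example model from the comment above
    quantization_config=quant_config,
    device_map="auto",
)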

@Ph0rk0z

Ph0rk0z commented Feb 19, 2023

Thanks for that. I can finally generate something in 8 bit.

GPT-4chan 29sec 10gb/16gb generating
lotus-12b 60sec 13gb/24gb out of memory - a few generations in normal fp16 but one in 8bit
opt-13b   64sec 13gb/20gb generating - next generation is an error as memory isn't freed


Bloomz 7B loads with default weights. Replies in 20 seconds; FP16 replies in 9.
Opt-6b loads with default weights. Replies in 18 seconds; FP16 replies in 5.

Performance isn't that great, and it seems like memory is never freed on the larger models.

@Ph0rk0z

Ph0rk0z commented Mar 9, 2023

Threshold of 1+ allows me to use this and keep generating.
