
@ochafik
Created May 16, 2024 03:50
llama.cpp quantization for the disk-poor

https://github.com/ochafik/llama.cpp/commits/fast-quant/

convert: (hacky) on-the-fly quantization to save disk (no intermediate f32.gguf)

I was trying to convert & quantize files myself (couldn't find a memory-mappable Mixtral 8x7B since ggml-org/llama.cpp#6387) and realized I didn't have enough disk space left 😓

So here's a dirty hack that does the trick (Unix-only for now): I've updated convert.py to quantize the model on the fly using subprocess calls to a lightly modified ./quantize:

  • First, write a bogus GGUF file temp-empty-f32.gguf that has all the KVs and the tensor metadata, but no tensor data.
  • Then, call ./quantize --skeleton temp-empty-f32.gguf out-Q6_K.gguf Q6_K: this writes everything to out-Q6_K.gguf except the actual quantized tensors (left as zeroes).
  • And finally, for each tensor, do the actual quantization. This is peak hacky: I write each unquantized tensor to a temp-single-f32.gguf file (which needlessly also contains all the KVs) and call ./quantize --single-tensor <tensor-name> temp-single-f32.gguf out-Q6_K.gguf Q6_K. That --single-tensor mode just memory-maps the output GGUF in writable mode and writes the quantized data of that one tensor.
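The --single-tensor write-in-place trick can be sketched with a numpy memmap (a toy file layout with fixed-size tensor slots, not the real GGUF format; sizes and offsets here are assumptions for illustration):

```python
import numpy as np

# Toy stand-in for a GGUF file: one fixed-size byte slot per tensor (not real GGUF).
TENSOR_BYTES = 64
N_TENSORS = 3

# Like "quantize --skeleton": create the full-size output file, tensor data all zeroes.
out = np.memmap("out.bin", dtype=np.uint8, mode="w+", shape=(N_TENSORS * TENSOR_BYTES,))
out.flush()

def write_single_tensor(path, index, data):
    """Like "quantize --single-tensor": map the existing output writable and fill one slot."""
    m = np.memmap(path, dtype=np.uint8, mode="r+")  # r+ = map existing file read/write
    m[index * TENSOR_BYTES:(index + 1) * TENSOR_BYTES] = data
    m.flush()

# Fill in just tensor #1, leaving the others untouched.
write_single_tensor("out.bin", 1, np.full(TENSOR_BYTES, 7, dtype=np.uint8))
```

Since each slot is written independently, the full f32 model never has to exist on disk at once.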

So it's not mergeable as it is, but...

  • Allows quantization of very large models w/ limited disk space now.
  • I think the cleanest way forward might be to bind a quantization function in Python (which is what I had in mind when I wrote these example ggml Python bindings tbh) and let it be parallelized; happy to explore if that sounds useful & right.
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/quantization-for-the-disk-poor

make clean && make quantize

python convert.py \
  --outfile Nous-Hermes-2-Mixtral-8x7B-DPO-Q4_K_M \
  --quant Q4_K_M \
  `huggingface-cli download NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO`

One more thing: if you're wary of wearing out your SSD by repeatedly writing 2GB GGUF files w/ just a single tensor, you might want to create them... in RAM. Also, it's probably faster.

On Mac, the following creates a RAM-backed 4GB volume at /Volumes/RAM Disk:

diskutil erasevolume HFS+ 'RAM Disk' `hdiutil attach -nobrowse -nomount ram://8388608`  # 8388608 512-byte sectors = 4 GiB

So just replace "temp-single-f32.gguf" with "/Volumes/RAM Disk/temp-single-f32.gguf" in convert.py and you're good to go.
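In convert.py the temp path could even be picked automatically; here's a sketch (the /dev/shm tmpfs path is a Linux assumption, and the macOS path assumes the RAM disk above was created):

```python
import os
import tempfile

def ram_backed_dir() -> str:
    """Prefer a RAM-backed directory for the temp single-tensor file, if one exists."""
    for candidate in ("/Volumes/RAM Disk", "/dev/shm"):  # macOS RAM disk, Linux tmpfs
        if os.path.isdir(candidate):
            return candidate
    return tempfile.gettempdir()  # fall back to the regular (disk-backed) temp dir

temp_single = os.path.join(ram_backed_dir(), "temp-single-f32.gguf")
```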
