
@ochafik
Created May 16, 2024 03:50
llama.cpp quantization for the disk-poor

https://github.com/ochafik/llama.cpp/commits/fast-quant/

convert: (hacky) on-the-fly quantization to save disk (no intermediate f32.gguf)

I was trying to convert & quantize files myself (couldn't find a memory-mappable Mixtral 8x7B since ggml-org/llama.cpp#6387) and realized I didn't have enough disk space left 😓

So here's a dirty hack that does the trick (Unix-only for now): I've updated convert.py to quantize the model on the fly using subprocess calls to a lightly modified ./quantize:

  • First, write a bogus GGUF file temp-empty-f32.gguf that has all the KVs and the tensor metadata, but no tensor data.
  • Then, call ./quantize --skeleton temp-empty-f32.gguf out-Q6_K.gguf Q6_K: this writes everything to out-Q6_K.gguf except the actual quantized tensors (left as zeroes).
  • And finally, for each tensor, do the actual quantization. This is peak hacky: I write each unquantized tensor to a temp-single-f32.gguf file (which needlessly also contains all the KVs) and call ./quantize --single-tensor <tensor-name> temp-single-f32.gguf out-Q6_K.gguf Q6_K. That --single-tensor mode just memory-maps the output GGUF in writable mode and writes the quantized data of that one tensor.
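The --single-tensor write-in-place trick can be sketched with a numpy memmap (a toy file layout with fixed-size tensor slots, not the real GGUF format; sizes and offsets here are assumptions for illustration):

```python
import numpy as np

# Toy stand-in for a GGUF file: one fixed-size byte slot per tensor (not real GGUF).
TENSOR_BYTES = 64
N_TENSORS = 3

# Like "quantize --skeleton": create the full-size output file, tensor data all zeroes.
out = np.memmap("out.bin", dtype=np.uint8, mode="w+", shape=(N_TENSORS * TENSOR_BYTES,))
out.flush()

def write_single_tensor(path, index, data):
    """Like "quantize --single-tensor": map the existing output writable and fill one slot."""
    m = np.memmap(path, dtype=np.uint8, mode="r+")  # r+ = map existing file read/write
    m[index * TENSOR_BYTES:(index + 1) * TENSOR_BYTES] = data
    m.flush()

# Fill in just tensor #1, leaving the others untouched.
write_single_tensor("out.bin", 1, np.full(TENSOR_BYTES, 7, dtype=np.uint8))
```

Since each slot is written independently, the full f32 model never has to exist on disk at once.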

So it's not mergeable as it is, but...

  • Allows quantization of very large models w/ limited disk space now.
  • I think the cleanest way forward might be to bind a quantization function in Python (which is what I had in mind when I wrote these example ggml Python bindings tbh) and let it be parallelized; happy to explore if that sounds useful & right.
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/quantization-for-the-disk-poor

make clean && make quantize

python convert.py \
  --outfile Nous-Hermes-2-Mixtral-8x7B-DPO-Q4_K_M \
  --quant Q4_K_M \
  `huggingface-cli download NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO`

One more thing: if you're wary of wearing out your SSD by repeatedly writing 2GB GGUF files w/ just a single tensor, you might want to create them... in RAM. Also, it's probably faster.

On Mac, the following creates a RAM-backed 4GB volume at /Volumes/RAM Disk:

diskutil erasevolume HFS+ 'RAM Disk' `hdiutil attach -nobrowse -nomount ram://8388608`  # 8388608 512-byte sectors = 4 GiB

So just replace "temp-single-f32.gguf" with "/Volumes/RAM Disk/temp-single-f32.gguf" in convert.py and you're good to go.
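In convert.py the temp path could even be picked automatically; here's a sketch (the /dev/shm tmpfs path is a Linux assumption, and the macOS path assumes the RAM disk above was created):

```python
import os
import tempfile

def ram_backed_dir() -> str:
    """Prefer a RAM-backed directory for the temp single-tensor file, if one exists."""
    for candidate in ("/Volumes/RAM Disk", "/dev/shm"):  # macOS RAM disk, Linux tmpfs
        if os.path.isdir(candidate):
            return candidate
    return tempfile.gettempdir()  # fall back to the regular (disk-backed) temp dir

temp_single = os.path.join(ram_backed_dir(), "temp-single-f32.gguf")
```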
