https://github.com/ochafik/llama.cpp/commits/fast-quant/
convert: (hacky) on the fly quantization to save disk (no intermediate f32.gguf)
I was trying to convert & quantize a model myself (couldn't find a memory-mappable Mixtral 8x7B since ggml-org/llama.cpp#6387) and realized I didn't have enough disk space left 😓
So here's a dirty hack that does the trick (Unix-only for now): I've updated `convert.py` to quantize the model on the fly using subprocess calls to a lightly modified `./quantize`:
- First, write a bogus GGUF file `temp-empty-f32.gguf` that has all the KVs and the tensor metadata, but no tensor data.
- Then, call `./quantize --skeleton temp-empty-f32.gguf out-Q6_K.gguf Q6_K`: this writes everything to `out-Q6_K.gguf` except the actual quantized tensors (left as zeroes).
- And finally, for each tensor, do the actual quantization. This is peak hacky: I'm writing each unquantized tensor to a `temp-single-f32.gguf` file (which needlessly also contains all the KVs) and calling `./quantize --single-tensor <tensor-name> temp-single-f32.gguf out-Q6_K.gguf Q6_K`. That `--single-tensor` mode memory-maps the output GGUF in writable mode and writes the quantized data of just that one tensor.
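The `--single-tensor` trick boils down to opening the already-sized output file with a writable memory map and patching bytes in place at one tensor's data offset. A minimal Python sketch of that pattern (the file name, offset, and payload are made up for illustration; the real code computes offsets from the GGUF tensor metadata):

```python
import mmap

path = "out.bin"  # stand-in for the output GGUF
with open(path, "wb") as f:
    f.write(b"\x00" * 1024)  # skeleton pass: file has its final size, data left as zeroes

offset, payload = 128, b"quantized-bytes"  # hypothetical tensor data offset + contents
with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    mm[offset:offset + len(payload)] = payload  # patch just this tensor's region
    mm.flush()  # push the dirty pages back to the file
```

The rest of the file is never rewritten, which is what makes the per-tensor pass cheap on disk I/O.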
So it's not mergeable as it is, but...
- It now allows quantization of very large models w/ limited disk space.
- I think the cleanest way forward might be to bind a quantization function in Python (which is what I had in mind when I wrote these example ggml Python bindings tbh) and let it be parallelized; happy to explore if that sounds useful & right.
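For the "bind a quantization function in Python" idea, the shape could look like this: a per-tensor quantize function fanned out over a worker pool. The codec below is a toy absmax 8-bit scheme standing in for ggml's real quantizers (which would be called through ctypes/cffi and release the GIL, so a thread pool genuinely parallelizes); all names here are illustrative, not the actual ggml API:

```python
from concurrent.futures import ThreadPoolExecutor
import struct

def quantize_q8_toy(values):
    """Toy absmax 8-bit quantizer: one float32 scale + one int8 code per value.
    Stand-in for a real ggml quantization kernel bound via ctypes/cffi."""
    scale = max((abs(v) for v in values), default=0.0) / 127.0 or 1.0
    codes = bytes((round(v / scale) & 0xFF) for v in values)
    return struct.pack("<f", scale) + codes

def quantize_all(tensors, workers=4):
    """Quantize each tensor concurrently; returns {tensor name: quantized bytes}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(quantize_q8_toy, vals)
                   for name, vals in tensors.items()}
        return {name: fut.result() for name, fut in futures.items()}

# Hypothetical tensors keyed by GGUF-style names.
tensors = {"blk.0.ffn_up.weight": [0.5, -1.0, 0.25],
           "blk.0.ffn_down.weight": [2.0, -2.0]}
out = quantize_all(tensors)
```

Each worker would write its result straight into the mmapped output at that tensor's offset, so no per-tensor temp files are needed at all.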
```shell
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/quantization-for-the-disk-poor
make clean && make quantize
```
```shell
python convert.py \
  --outfile Nous-Hermes-2-Mixtral-8x7B-DPO-Q4_K_M \
  --quant Q4_K_M \
  `huggingface-cli download NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO`
```

One more thing: if you're wary of wearing out your SSD by repeatedly writing 2GB GGUF files w/ just a single tensor, you might want to create them... in RAM. Also, it's probably faster.
On Mac, the following creates a RAM-backed 4GB volume at /Volumes/RAM Disk (see this gist):
```shell
diskutil erasevolume HFS+ 'RAM Disk' `hdiutil attach -nobrowse -nomount ram://8388608`
```

So just replace `temp-single-f32.gguf` with `/Volumes/RAM Disk/temp-single-f32.gguf` in convert.py and you're good to go.
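In case the magic number looks arbitrary: `hdiutil`'s `ram://` argument is a count of 512-byte sectors, so 8388608 sectors works out to exactly 4 GiB:

```python
sectors = 8388608            # the argument passed to ram://
bytes_total = sectors * 512  # hdiutil ram:// sizes are in 512-byte sectors
gib = bytes_total / 1024**3
print(gib)  # 4.0
```

To size the volume differently, just scale the sector count (e.g. 2097152 sectors for 1 GiB).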