See our collection for versions of DeepSeek-R1 including GGUF & 4-bit formats. Unsloth's DeepSeek-R1 1.58-bit + 2-bit Dynamic Quants selectively avoid quantizing certain parameters, greatly increasing accuracy over standard 1-bit/2-bit quantization.
Instructions to run this model in llama.cpp are below, or you can view more detailed instructions here: unsloth.ai/blog/deepseekr1-dynamic
1. Do not forget about the <|User|> and <|Assistant|> tokens - or use a chat template formatter.
2. Obtain the latest llama.cpp at https://github.com/ggerganov/llama.cpp
3. It's best to use --min-p 0.05 or 0.1 to counteract very rare token predictions - I found this to work well especially for the 1.58-bit model.
4. Example with a Q4_0 quantized K cache (see the sketch after this list). Notice -no-cnv disables auto conversation mode.
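As a minimal sketch of steps 2-4, the commands below build llama.cpp and then run the model with a Q4_0 quantized K cache and -no-cnv. The model path, thread count, context size, and prompt are illustrative placeholders - adjust them to your download location and hardware.

```
# Step 2: obtain and build the latest llama.cpp
# (drop -DGGML_CUDA=ON for a CPU-only build)
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j

# Steps 3-4: run with --min-p 0.05, a Q4_0 quantized K cache,
# and -no-cnv to disable auto conversation mode.
# The prompt includes the <|User|> and <|Assistant|> tokens from step 1.
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --ctx-size 8192 \
    --min-p 0.05 \
    -no-cnv \
    --prompt "<|User|>Why is the sky blue?<|Assistant|>"
```

If you have a GPU, you can additionally pass --n-gpu-layers with however many layers fit in VRAM to offload part of the model.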