Below are the generic details and resources of the machine I used:
- Operating System: Linux Mint 22.3 - MATE 64-bit (safe-graphics / boot with 'nomodeset') - Linux Kernel: 6.14.0-37-generic
- GPU: not relevant / not used
- CPU: 2 cores – Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2
- RAM: 8 GB (DDR2)
Given my recent good experience running CPU-only inference (no GPU) on local models of up to 2B-4B parameters, with more than acceptable speed and performance on really limited hardware and very little RAM, I decided to share the instructions and commands I used to compile llama.cpp on my machine, along with the arguments I currently use to launch it.
In my specific case, the configuration I share below allowed me to get the maximum out of my hardware: fast responses, no crashes at all, and enough headroom to do other things (for example, browsing the web) while keeping everything fluid. My stubborn determination to run offline inference on a home computer, and my initial frustration at thinking it was impossible, kept me awake for many nights until, after numerous attempts, I finally found a way to do it!
At the moment I am using Linux Mint 64-bit as my operating system. I usually use Ubuntu, which I love and find very convenient, and it works fine too, but I switched because Mint uses much less RAM and, in my specific case, inference runs much better. In any case, the commands and configurations that follow work on any Debian-based operating system, including Ubuntu.
To get maximum performance I recommend using a relatively lightweight OS like Linux Mint (easier to install) or, even better, MX Linux or antiX, although they can be a bit more complex to install and configure. This is my personal opinion and advice, but you are free to do whatever you want :)
Let’s get started, step by step, with the compilation and installation of llama.cpp on your Ubuntu/Debian system, CPU-only, with an old 2-core CPU and 8 GB of DDR2 RAM. These are the same steps I used, and I get responses at an average speed of 2 tokens/s on a 1B-2B model and 1 token/s on a 4B model.
Open a terminal.
Step 1: Install the necessary dependencies:
sudo apt update
sudo apt install -y git build-essential cmake libcurl4-openssl-dev
Step 2: Clone the repository (latest official version):
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
Step 3: Compile, optimized for CPU (no GPU):
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-O3 -march=native -mtune=native -flto" \
-DCMAKE_CXX_FLAGS="-O3 -march=native -mtune=native -flto" \
-DBUILD_SHARED_LIBS=OFF \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_CURL=ON
Then build (using all available cores):
cmake --build build -j $(nproc)
Estimated time: 5–15 minutes depending on your CPU.
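If you are curious what -march=native actually targets on your machine, you can ask the compiler directly. A small check, assuming gcc is your default compiler (the fallback message is just for systems without it):

```shell
# Print the architecture that -march=native resolves to on this CPU
# (falls back to a message if gcc is not installed)
gcc -march=native -Q --help=target 2>/dev/null | grep -m1 -- '-march=' \
  || echo "gcc not found"
```

On a Core 2 Duo this should resolve to a Core 2-era target, which is what the optimized build will be tuned for.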
This is the configuration I used to get the most out of my CPU.
Step 4: Make sure everything went well:
./build/bin/llama-cli --version
You should see the version string.
Step 5: (Optional but recommended) Add the binaries to your PATH so you can use llama-cli, llama-server, etc. from any folder:
echo 'export PATH="$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
OK. Now, before using a model, let’s apply a few tricks to improve memory management during inference (they worked really well for me):
Step 1: Set vm.swappiness=10
What it does: sets how “aggressive” the Linux kernel is in moving memory pages to swap (disk) when RAM fills up.
Default value: 60 (quite aggressive).
Why lower it to 10:
With only 8 GB and models that occupy 2.5–4 GB, the kernel tends to swap the model pages (which are mmap’ed) very easily. With low swappiness, the kernel prefers to drop filesystem cache (which is less critical) instead of swapping the model.
Result: fewer micro-freezes during token generation.
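You can check the current value before changing anything; on most distributions it will print the default of 60:

```shell
# Read the current swappiness value from the running kernel
cat /proc/sys/vm/swappiness
```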
How to set it:
Temporary (until reboot):
sudo sysctl vm.swappiness=10
Permanent (recommended):
echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl -p /etc/sysctl.d/99-swappiness.conf
A value of 10 is great for low-RAM desktops. Some people set 5 or even 1, but 10 is a good compromise.
Step 2: Set memlock to unlimited.
Open the file:
sudo nano /etc/security/limits.conf
Add these lines:
* soft memlock unlimited
* hard memlock unlimited
Save and exit (CTRL+X → Y → Enter).
* means it will be applied to all users; replace * with a specific username if you want to apply it only to one user.
Restart your session (log out and log back in) to apply the changes.
Check that it was applied:
ulimit -l
It should return unlimited (IMPORTANT).
Step 3: Set the CPU governor to performance.
sudo nano /etc/systemd/system/cpu-governor.service
Add this content:
[Unit]
Description=Set CPU Governor to Performance
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
Save and exit (CTRL+X → Y → Enter).
Enable with:
sudo systemctl daemon-reload
sudo systemctl enable --now cpu-governor.service
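Once the service is running, you can verify that every core picked up the new governor. Note that on virtual machines or containers the cpufreq interface may not exist at all, which is what the fallback message covers:

```shell
# Print the active governor for each core, or a note if cpufreq is unavailable
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null \
  || echo "cpufreq interface not available"
```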
Step 4: Tune a few kernel memory parameters.
sudo nano /etc/sysctl.d/99-llama.conf
Add this content:
vm.swappiness=10
vm.vfs_cache_pressure=50
vm.dirty_ratio=10
vm.dirty_background_ratio=5
Save and exit (CTRL+X → Y → Enter).
Then:
sudo sysctl -p /etc/sysctl.d/99-llama.conf
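To confirm the values were actually applied, you can read them back directly from /proc, which does not require root:

```shell
# Read the tunables back from the running kernel; prints "name = value" lines
for f in swappiness vfs_cache_pressure dirty_ratio dirty_background_ratio; do
  printf '%s = %s\n' "$f" "$(cat /proc/sys/vm/$f)"
done
```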
Step 1: Download a model in GGUF format from Hugging Face.
Create a dedicated folder for models:
mkdir -p ~/GGUF_models
Download the model. In this case we will download an abliterated version of Gemma3:1B by HuiHui with Q4_K_M quantization (a good compromise between size and quality), but you can download any model you want; I recommend not going beyond 4B, or at most 6B-7B, which is still acceptable:
cd ~/GGUF_models
wget https://huggingface.co/bartowski/huihui-ai_gemma-3-1b-it-abliterated-GGUF/resolve/main/huihui-ai_gemma-3-1b-it-abliterated-Q4_K_M.gguf -O Gemma3-1B-abliterated-Q4_K_M.gguf
Step 2: Launch the model with --mlock enabled (the most important point).
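Before launching anything, it is worth sanity-checking the download: every GGUF file starts with the 4-byte ASCII magic "GGUF", so a quick header check catches truncated downloads or HTML error pages saved by wget. A small sketch, using the same path as the wget command above:

```shell
# A valid GGUF file begins with the ASCII magic "GGUF";
# anything else usually means a truncated or failed download
f=~/GGUF_models/Gemma3-1B-abliterated-Q4_K_M.gguf
if [ -f "$f" ] && [ "$(head -c 4 "$f")" = "GGUF" ]; then
  echo "looks like a valid GGUF file"
else
  echo "missing or corrupt download"
fi
```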
Create a bash script for convenience:
nano ~/chat.sh
Insert this content:
#!/bin/bash
nice -n -10 ionice -c 1 -n 0 \
llama-cli \
-m ~/GGUF_models/Gemma3-1B-abliterated-Q4_K_M.gguf \
--mlock \
--ctx-size 2048 \
-t 2 \
-ngl 0 "$@"
Save and exit (CTRL+X → Y → Enter).
Make it executable:
chmod +x ~/chat.sh
Notes:
nice -n -10 → sets the scheduling priority of the llama-cli process to -10, which is a higher priority than standard processes. Note that negative nice values (and the ionice realtime class, -c 1) require root privileges, so either run the script with sudo or remove the nice/ionice prefix.
--mlock → locks the model in RAM (big performance boost in my case)
--ctx-size 2048 (or at most 4096) → do not exaggerate with the context size
-t 2 → number of threads (adjust -t to the number of cores of your CPU; 2 in my case)
-ngl 0 → no GPU layers
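The same flags also work with llama-server, if you prefer the built-in web UI or a local HTTP API over the terminal chat. A minimal sketch, assuming the same model path and PATH setup as above (the port number is an arbitrary choice):

```shell
#!/bin/bash
# Serve the model over HTTP on localhost; the built-in web UI is then
# reachable at http://127.0.0.1:8080 in your browser.
llama-server \
  -m ~/GGUF_models/Gemma3-1B-abliterated-Q4_K_M.gguf \
  --mlock \
  --ctx-size 2048 \
  -t 2 \
  -ngl 0 \
  --host 127.0.0.1 --port 8080
```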
Now you are ready to chat offline from your terminal :)
Start the chat:
~/chat.sh
I hope this has been useful to you :)