Container Definitions to Run OpenWebUI and llama.cpp (Ubuntu 24.04 / Dockerfile & Singularity)

  • Unified base OS: Both Docker and Singularity use Ubuntu 24.04 to keep the runtime environment consistent.
  • Minimal dependency set: Installs python3.12 and python3.12-venv, curl for downloading files, libgomp1 for OpenMP support, and ffmpeg for audio/video handling, using --no-install-recommends.
  • Reproducible, non-interactive builds: Fixes the timezone to UTC and avoids interactive prompts via DEBIAN_FRONTEND=noninteractive (an ARG in the Dockerfile; applied during install in Singularity). Also cleans the apt caches (apt-get clean and removing /var/lib/apt/lists/*) to reduce image size, as sketched below.
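
The shared install step looks roughly like the following shell sequence (a sketch of what goes into the Dockerfile RUN layer and the Singularity %post section; only the packages listed above are assumed):

# Non-interactive, UTC-pinned package installation with cache cleanup
export DEBIAN_FRONTEND=noninteractive TZ=UTC
apt-get update
apt-get install -y --no-install-recommends \
    python3.12 python3.12-venv curl libgomp1 ffmpeg
# Trim the image: drop the apt caches and package lists
apt-get clean
rm -rf /var/lib/apt/lists/*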

Setting Reasoning Strength in OpenWebUI with chat_template_kwargs

When you run a model through llama.cpp and access it from OpenWebUI using an OpenAI-compatible API, you may want to control how “strongly” the model reasons. A reliable way to do this is to send a custom parameter called chat_template_kwargs from OpenWebUI. This parameter can include a reasoning_effort setting such as low, medium, or high.

Why use chat_template_kwargs?

In many llama.cpp-based deployments, the model’s reasoning behavior is influenced by values passed into the chat template. Rather than trying to force reasoning strength through prompts, passing reasoning_effort via chat_template_kwargs provides a more direct and predictable control mechanism. OpenWebUI supports sending such custom parameters in its model configuration, and this approach is also demonstrated in official integration guidance for a different backend (the OpenVINO documentation).

How to set it in OpenWebUI
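
However the parameter is entered in OpenWebUI's model settings, what it ultimately has to produce is an ordinary OpenAI-style chat completion request with chat_template_kwargs added to the body. A minimal sketch of an equivalent request sent directly with curl, assuming llama-server is listening on its default port 8080 (the model name and prompt are placeholders):

# Ask for a high reasoning effort via the chat template arguments
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Explain quicksort briefly."}],
    "chat_template_kwargs": {"reasoning_effort": "high"}
  }'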

Workaround for WSL1 DNS issue

  • Prevent WSL from overwriting DNS settings by setting generateResolvConf = false in /etc/wsl.conf.
  • Pull active DNS servers from Windows (based on default IPv4/IPv6 routes and interface metrics) using PowerShell, ensuring the most relevant adapters are used.
  • Write a stable Linux resolver configuration by converting the Windows DNS list into nameserver entries and saving it to /etc/resolv.conf (with CRLF normalization via tr -d '\r').

echo -e '[network]\ngenerateResolvConf = false' >> /etc/wsl.conf

/mnt/c/Windows/System32/WindowsPowerShell/v1.0/powershell.exe -NoProfile -Command '$ifs=@(Get-NetRoute -DestinationPrefix "0.0.0.0/0","::/0" -ErrorAction SilentlyContinue | Sort-Object RouteMetric,InterfaceMetric | Select-Object -ExpandProperty InterfaceIndex -Unique); $dns=foreach($i in $ifs){ (Get-DnsClientServerAddress -InterfaceIndex $i).ServerAddresses }; $dns | Where-Object { $_ } | Select-Object -Unique | ForEach-Object { "nameserver $_" }' 
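
Combining the steps above, the PowerShell output can be normalized and written to the resolver file in one pipeline. A sketch, assuming sudo is available inside the distribution and that generateResolvConf = false is already in effect:

# Rebuild /etc/resolv.conf from the DNS servers Windows is actually using
/mnt/c/Windows/System32/WindowsPowerShell/v1.0/powershell.exe -NoProfile -Command '$ifs=@(Get-NetRoute -DestinationPrefix "0.0.0.0/0","::/0" -ErrorAction SilentlyContinue | Sort-Object RouteMetric,InterfaceMetric | Select-Object -ExpandProperty InterfaceIndex -Unique); $dns=foreach($i in $ifs){ (Get-DnsClientServerAddress -InterfaceIndex $i).ServerAddresses }; $dns | Where-Object { $_ } | Select-Object -Unique | ForEach-Object { "nameserver $_" }' \
  | tr -d '\r' | sudo tee /etc/resolv.conf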

Notes on Tabby: Llama.cpp, Model Caching, and Access Tokens

Tabby is a self-hosted, developer-focused AI coding assistant that can run and manage local models, and it comes with a few practical configuration and account details that are useful to keep in mind.

Tabby uses llama.cpp internally

One notable point is that Tabby uses llama.cpp under the hood. In practice, this means Tabby benefits from the lightweight local-inference approach llama.cpp is known for, which lets it run LLMs efficiently on local machines.

Model cache location: TABBY_MODEL_CACHE_ROOT
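
Tabby reads this environment variable to decide where downloaded model weights are stored, so pointing it at a large, persistent disk avoids re-downloading models. A sketch (the path is an example, and the serve flags are illustrative rather than this gist's exact invocation):

# Keep Tabby's model cache on a dedicated data disk
export TABBY_MODEL_CACHE_ROOT=/data/tabby/models
# Models pulled on first use now land under the directory above
tabby serve --model StarCoder-1B --device cuda   # flags are illustrative; check `tabby serve --help`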

Getting Fill-In-the-Middle Autocomplete Working in VS Code Continue with llama.cpp

Overview

Continue is a popular AI coding extension for Visual Studio Code. One of its most useful capabilities is Tab autocomplete, which is typically implemented as Fill-In-the-Middle (FIM) completion: the model predicts code that fits between what you already have before and after the cursor.

Community reports indicate that llama.cpp (via llama-server) can be a practical backend for Continue’s FIM-style autocomplete, often with better results than other local backends in some setups. In Continue’s configuration, the key idea is to define a model that is explicitly assigned the autocomplete role, and point it to your running llama-server.
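
A common shape for this setup (a sketch rather than this gist's exact commands; the model file, port, and context size are assumptions) is to serve a FIM-capable GGUF with llama-server and point Continue's autocomplete-role model at that endpoint:

# Serve a code model that supports fill-in-the-middle tokens
./llama-server \
  -m ./models/qwen2.5-coder-1.5b-q4_k_m.gguf \
  --port 8012 \
  --ctx-size 4096
# Continue's autocomplete model entry then uses http://localhost:8012 as its API base.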

How Continue Chooses an Autocomplete Model

In current Continue configurations, tab autocomplete uses the model entry whose roles include autocomplete (older config.json setups used a dedicated tabAutocompleteModel entry), so assigning that role is what routes FIM requests to the llama-server endpoint.

Building llama.cpp in an Environment Without curl Headers

When you try to build llama.cpp on a system where the curl development headers are not installed, the build may fail because the compiler cannot find curl’s header files (such as curl/curl.h). One straightforward workaround is to download the matching curl source package (so you have the headers locally) and then point CMake to the existing curl library on your system plus the downloaded include directory.

Below is a simple step-by-step example using curl 7.76.1.

1) Check the Installed curl Version

First, confirm which curl version is available in your environment:
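
curl --version

With the version known (7.76.1 here), the rest of the workaround is to fetch the matching source tarball so the headers are available locally and to tell CMake where both the headers and the system library live. A sketch; the download URL, library path, and build flags are assumptions to adjust for your distribution and llama.cpp version:

# Grab the curl source that matches the installed binary (only the headers are needed)
wget https://curl.se/download/curl-7.76.1.tar.gz
tar xf curl-7.76.1.tar.gz

# Point CMake's FindCURL at the system libcurl plus the downloaded include directory
cmake -B build \
  -DLLAMA_CURL=ON \
  -DCURL_INCLUDE_DIR="$PWD/curl-7.76.1/include" \
  -DCURL_LIBRARY=/usr/lib64/libcurl.so.4
cmake --build build --config Release -j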

Building llama.cpp with CUDA in an NVIDIA HPC SDK Environment (Simple Guide)

If you are working in an NVIDIA HPC SDK environment and want to build llama.cpp with CUDA support, one reliable approach is to use GCC/G++ for the C/C++ parts and NVCC for the CUDA parts.

This setup is practical because some compiler warning flags used by projects like ggml/llama.cpp are commonly supported by GCC/Clang but may not be accepted by the HPC SDK's own nvc/nvc++ compilers. By explicitly selecting gcc and g++, you reduce the risk of compiler-flag incompatibilities while still enabling CUDA through nvcc.

Recommended CMake Commands

Run the following commands from the project root directory:
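
A sketch of the configure-and-build sequence under those assumptions (GGML_CUDA is the option used by current llama.cpp trees, older ones use LLAMA_CUDA or LLAMA_CUBLAS; the build directory and job count are arbitrary):

# Use GCC/G++ for the C/C++ sources and NVCC for the CUDA sources
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_C_COMPILER=gcc \
  -DCMAKE_CXX_COMPILER=g++ \
  -DCMAKE_CUDA_COMPILER=nvcc \
  -DCMAKE_CUDA_HOST_COMPILER=g++
cmake --build build --config Release -j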

Rootless Tailscale Setup and Serving

1) Assumption: rootless mode requires userspace networking

Without root, you generally cannot use a TUN device, so you run tailscaled in userspace networking mode.


2) Start tailscaled as a normal user (no sudo)
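
A sketch of a rootless start, assuming the tailscaled and tailscale binaries are on PATH; the state and socket locations are arbitrary user-writable paths:

# Run the daemon in userspace-networking mode, entirely as the current user
mkdir -p "$HOME/.local/state/tailscale" "$HOME/.local/run"
tailscaled \
  --tun=userspace-networking \
  --state="$HOME/.local/state/tailscale/tailscaled.state" \
  --socket="$HOME/.local/run/tailscaled.sock" &

# Every tailscale CLI call must target the same user-owned socket
tailscale --socket="$HOME/.local/run/tailscaled.sock" up
# Example: expose a local service over the tailnet (serve syntax varies by Tailscale version)
tailscale --socket="$HOME/.local/run/tailscaled.sock" serve 3000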

Programmatically Removing the “Always on Top” Window State from Chrome on Windows

  • Enumerates all visible top-level windows using the Windows API (via ctypes) and collects window metadata such as title, process ID, executable path, and extended window styles.
  • Identifies windows belonging to a specific target executable (chrome.exe) and checks whether they are marked with the WS_EX_TOPMOST (always-on-top) extended style.
  • Safely removes the topmost attribute from matching windows using SetWindowPos, logging successful modifications and failures for traceability.

Enumerating Visible Windows on Windows with ctypes

  • Enumerates all top-level visible windows using EnumWindows, then filters to windows that are visible and have a non-empty title (IsWindowVisible, GetWindowText*).
  • Enriches each window with process and metadata by collecting the window class (GetClassNameW), PID (GetWindowThreadProcessId), and executable path (OpenProcess + QueryFullProcessImageNameW).
  • Extracts and interprets window style flags via GetWindowLongPtrW/GetWindowLongW to report key attributes such as DISABLED, TOPMOST, TOOLWIN, and APPWIN alongside raw STYLE/EXSTYLE hex values.