Setting Reasoning Strength in OpenWebUI with chat_template_kwargs

When you run a model through llama.cpp and access it from OpenWebUI using an OpenAI-compatible API, you may want to control how “strongly” the model reasons. A reliable way to do this is to send a custom parameter called chat_template_kwargs from OpenWebUI. This parameter can include a reasoning_effort setting such as low, medium, or high.

Why use chat_template_kwargs?

In many llama.cpp-based deployments, the model’s reasoning behavior is influenced by values passed into the chat template. Rather than trying to force reasoning strength through prompts, you can pass reasoning_effort via chat_template_kwargs, which gives a more direct and predictable control mechanism. OpenWebUI supports sending such custom parameters in its model configuration, and the same approach appears in official integration guidance for other backends (for example, the OpenVINO documentation).

How to set it in OpenWebUI
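
The exact place to add a custom parameter varies by OpenWebUI version, but the end result should be that the request OpenWebUI sends to llama-server carries the extra field. A minimal sketch of such a request (the port, model name, and prompt are placeholders) is:

```bash
# Illustrative request against llama-server's OpenAI-compatible endpoint;
# chat_template_kwargs is forwarded into the model's chat template.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local-model",
        "messages": [{"role": "user", "content": "Summarize RAID levels."}],
        "chat_template_kwargs": {"reasoning_effort": "high"}
      }'
```

As noted above, reasoning_effort typically takes one of low, medium, or high.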

Notes on Tabby: Llama.cpp, Model Caching, and Access Tokens

Tabby is a developer-focused tool that can run and manage local AI models. A few practical configuration and account details are worth keeping in mind.

Tabby uses llama.cpp internally

One notable point is that Tabby uses llama.cpp under the hood, so it benefits from the lightweight local-inference approach llama.cpp is known for when running LLMs on local machines.

Model cache location: TABBY_MODEL_CACHE_ROOT
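
As a hedged example (the directory and the serve flags below are placeholders; check your Tabby version’s CLI help for the exact options), setting this environment variable before launching Tabby redirects where downloaded model weights are cached:

```bash
# Store downloaded models under a custom directory instead of the default cache
export TABBY_MODEL_CACHE_ROOT=/data/tabby-models

# Example launch; model id and device flag are illustrative
tabby serve --model StarCoder-1B --device cuda
```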

Getting Fill-In-the-Middle Autocomplete Working in VS Code Continue with llama.cpp

Overview

Continue is a popular AI coding extension for Visual Studio Code. One of its most useful capabilities is Tab autocomplete, which is typically implemented as Fill-In-the-Middle (FIM) completion: the model predicts code that fits between what you already have before and after the cursor.

Community reports indicate that llama.cpp (via llama-server) can be a practical backend for Continue’s FIM-style autocomplete, sometimes with better results than other local backends. In Continue’s configuration, the key idea is to define a model that is explicitly assigned the autocomplete role and point it at your running llama-server.

How Continue Chooses an Autocomplete Model
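
For tab autocomplete, Continue uses whichever configured model carries the autocomplete role. A hedged config.yaml sketch (the display name, model id, and port are placeholders, and field names can vary between Continue versions) might look like:

```yaml
models:
  - name: llama.cpp FIM          # display name; arbitrary
    provider: llama.cpp          # talk to a local llama-server instance
    model: qwen2.5-coder-1.5b    # placeholder model id
    apiBase: http://localhost:8080
    roles:
      - autocomplete             # assign this model to tab autocomplete
```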

Building llama.cpp in an Environment Without curl Headers

When you try to build llama.cpp on a system where the curl development headers are not installed, the build may fail because the compiler cannot find curl’s header files (such as curl/curl.h). One straightforward workaround is to download the matching curl source package (so you have the headers locally) and then point CMake to the existing curl library on your system plus the downloaded include directory.

Below is a simple step-by-step example using curl 7.76.1.

1) Check the Installed curl Version

First, confirm which curl version is available in your environment:
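
For example (the reported version, download URL, and library path below are illustrative; the later commands sketch the rest of the workflow described above):

```bash
# Check the installed curl version (example output: curl 7.76.1 ...)
curl --version

# Download the matching curl source so the headers are available locally
wget https://github.com/curl/curl/releases/download/curl-7_76_1/curl-7.76.1.tar.gz
tar xf curl-7.76.1.tar.gz

# Configure llama.cpp, pointing CMake at the system libcurl plus the
# downloaded include directory (the library path is distribution-specific)
cmake -B build \
  -DLLAMA_CURL=ON \
  -DCURL_LIBRARY=/usr/lib64/libcurl.so.4 \
  -DCURL_INCLUDE_DIR="$PWD/curl-7.76.1/include"
cmake --build build -j
```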

Building llama.cpp with CUDA in an NVIDIA HPC SDK Environment (Simple Guide)

If you are working in an NVIDIA HPC SDK environment and want to build llama.cpp with CUDA support, one reliable approach is to use GCC/G++ for the C/C++ parts and NVCC for the CUDA parts.

This setup is practical because some compiler warning flags used by projects like ggml/llama.cpp are commonly supported by GCC/Clang, but may not be accepted by other C++ compilers. By explicitly selecting gcc and g++, you reduce the risk of compiler-flag incompatibilities, while still enabling CUDA with nvcc.

Recommended CMake Commands

Run the following commands from the project root directory:
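
A minimal sketch of what those commands might look like, assuming a recent llama.cpp tree where the CUDA switch is named GGML_CUDA (older revisions used LLAMA_CUDA or LLAMA_CUBLAS) and that nvcc from the HPC SDK is on PATH:

```bash
# Configure with GCC/G++ for the C/C++ parts and nvcc for the CUDA sources
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_C_COMPILER=gcc \
  -DCMAKE_CXX_COMPILER=g++ \
  -DCMAKE_CUDA_HOST_COMPILER=g++

# Build
cmake --build build --config Release -j
```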

Programmatically Removing the “Always on Top” Window State from Chrome on Windows

  • Enumerates all visible top-level windows using the Windows API (via ctypes) and collects window metadata such as title, process ID, executable path, and extended window styles.
  • Identifies windows belonging to a specific target executable (chrome.exe) and checks whether they are marked with the WS_EX_TOPMOST (always-on-top) extended style.
  • Safely removes the topmost attribute from matching windows using SetWindowPos, logging successful modifications and failures for traceability.
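
A condensed, hedged Python/ctypes sketch of that approach (not the gist’s exact script; constants follow the Win32 headers, and the target executable name is hard-coded for illustration):

```python
# Windows-only: clear the WS_EX_TOPMOST style from visible chrome.exe windows.
import ctypes
from ctypes import wintypes

user32 = ctypes.WinDLL("user32", use_last_error=True)
kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

GWL_EXSTYLE = -20
WS_EX_TOPMOST = 0x00000008
HWND_NOTOPMOST = -2
SWP_NOSIZE, SWP_NOMOVE, SWP_NOACTIVATE = 0x0001, 0x0002, 0x0010
PROCESS_QUERY_LIMITED_INFORMATION = 0x1000

# Declare pointer-sized handle arguments explicitly (matters on 64-bit Python).
user32.IsWindowVisible.argtypes = [wintypes.HWND]
user32.GetWindowThreadProcessId.argtypes = [wintypes.HWND, ctypes.POINTER(wintypes.DWORD)]
user32.GetWindowLongW.argtypes = [wintypes.HWND, ctypes.c_int]
user32.SetWindowPos.argtypes = [wintypes.HWND, wintypes.HWND, ctypes.c_int,
                                ctypes.c_int, ctypes.c_int, ctypes.c_int, wintypes.UINT]
kernel32.OpenProcess.restype = wintypes.HANDLE
kernel32.QueryFullProcessImageNameW.argtypes = [wintypes.HANDLE, wintypes.DWORD,
                                                wintypes.LPWSTR, ctypes.POINTER(wintypes.DWORD)]
kernel32.CloseHandle.argtypes = [wintypes.HANDLE]

def exe_path(pid):
    """Best-effort full executable path for a process id ('' on failure)."""
    handle = kernel32.OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
    if not handle:
        return ""
    try:
        buf = ctypes.create_unicode_buffer(1024)
        size = wintypes.DWORD(len(buf))
        ok = kernel32.QueryFullProcessImageNameW(handle, 0, buf, ctypes.byref(size))
        return buf.value if ok else ""
    finally:
        kernel32.CloseHandle(handle)

@ctypes.WINFUNCTYPE(wintypes.BOOL, wintypes.HWND, wintypes.LPARAM)
def clear_topmost(hwnd, _lparam):
    if user32.IsWindowVisible(hwnd):
        pid = wintypes.DWORD()
        user32.GetWindowThreadProcessId(hwnd, ctypes.byref(pid))
        if exe_path(pid.value).lower().endswith("chrome.exe"):
            exstyle = user32.GetWindowLongW(hwnd, GWL_EXSTYLE)
            if exstyle & WS_EX_TOPMOST:
                ok = user32.SetWindowPos(hwnd, HWND_NOTOPMOST, 0, 0, 0, 0,
                                         SWP_NOSIZE | SWP_NOMOVE | SWP_NOACTIVATE)
                print(f"hwnd={hwnd:#x} pid={pid.value} topmost removed: {bool(ok)}")
    return True   # keep enumerating

user32.EnumWindows(clear_topmost, 0)
```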

Enumerating Visible Windows on Windows with ctypes

  • Enumerates all top-level visible windows using EnumWindows, then filters to windows that are visible and have a non-empty title (IsWindowVisible, GetWindowText*).
  • Enriches each window with process and metadata by collecting the window class (GetClassNameW), PID (GetWindowThreadProcessId), and executable path (OpenProcess + QueryFullProcessImageNameW).
  • Extracts and interprets window style flags via GetWindowLongPtrW/GetWindowLongW to report key attributes such as DISABLED, TOPMOST, TOOLWIN, and APPWIN alongside raw STYLE/EXSTYLE hex values.
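
A compact, hedged sketch of that enumeration loop (not the original script; it prints one line per visible, titled window, and the executable-path lookup would reuse the OpenProcess + QueryFullProcessImageNameW helper shown in the previous sketch):

```python
# Windows-only: list visible, titled top-level windows with class, PID,
# raw STYLE/EXSTYLE values, and a few decoded flags.
import ctypes
from ctypes import wintypes

user32 = ctypes.WinDLL("user32", use_last_error=True)

GWL_STYLE, GWL_EXSTYLE = -16, -20
WS_DISABLED = 0x08000000
WS_EX_TOPMOST, WS_EX_TOOLWINDOW, WS_EX_APPWINDOW = 0x0008, 0x0080, 0x40000

# GetWindowLongPtrW is only exported by 64-bit user32; fall back to GetWindowLongW.
get_long = getattr(user32, "GetWindowLongPtrW", user32.GetWindowLongW)

@ctypes.WINFUNCTYPE(wintypes.BOOL, wintypes.HWND, wintypes.LPARAM)
def report(hwnd, _lparam):
    if not user32.IsWindowVisible(hwnd):
        return True
    title = ctypes.create_unicode_buffer(512)
    if user32.GetWindowTextW(hwnd, title, len(title)) == 0:
        return True                      # skip windows with empty titles
    cls = ctypes.create_unicode_buffer(256)
    user32.GetClassNameW(hwnd, cls, len(cls))
    pid = wintypes.DWORD()
    user32.GetWindowThreadProcessId(hwnd, ctypes.byref(pid))
    style = get_long(hwnd, GWL_STYLE) & 0xFFFFFFFF
    exstyle = get_long(hwnd, GWL_EXSTYLE) & 0xFFFFFFFF
    flags = [name for name, bit, value in (
        ("DISABLED", WS_DISABLED, style),
        ("TOPMOST", WS_EX_TOPMOST, exstyle),
        ("TOOLWIN", WS_EX_TOOLWINDOW, exstyle),
        ("APPWIN", WS_EX_APPWINDOW, exstyle)) if value & bit]
    print(f"pid={pid.value:<6} style={style:#010x} exstyle={exstyle:#010x} "
          f"[{','.join(flags)}] {cls.value}: {title.value}")
    return True

user32.EnumWindows(report, 0)
```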

Linux Distribution Container Image Sizes

A quick comparison of container image sizes for three Linux distributions, showing the baseline (“init”) and the size after adding Python 3 (“python3”), in bytes.

distribution              init (bytes)    python3 (bytes)
alpine 3.23.0                9,256,960        51,548,160
ubuntu 24.04 20251013       84,162,560       122,214,400
rockylinux 9.3.20231119    126,074,880       185,815,040

Setting up RHEL to boot with / (root) on an mdadm RAID device

When you place the RHEL root filesystem on a software RAID array built with mdadm, the initramfs (dracut) must be able to find and assemble that RAID array very early in the boot process. A reliable way to do this is to pass the RAID array UUID to the kernel command line via GRUB, so dracut knows exactly which array to assemble.

Below is a minimal and practical configuration workflow.


1) Confirm the RAID array UUID
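
For example (the device name, UUID, and grub.cfg path are illustrative, and some UEFI installs keep grub.cfg elsewhere; the later commands sketch the GRUB and dracut steps described above):

```bash
# Read the array UUID (colon-separated, as reported by mdadm)
mdadm --detail /dev/md0 | grep -i uuid
# or: mdadm --detail --scan

# Tell dracut which array to assemble early by appending to
# GRUB_CMDLINE_LINUX in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX="... rd.md.uuid=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx"

# Regenerate the GRUB configuration and the initramfs
grub2-mkconfig -o /boot/grub2/grub.cfg
dracut -f
```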

Additional packages needed when setting up Waydroid in an Ubuntu Docker image

apt-get install dbus-x11 kmod weston