Continue is a popular AI coding extension for Visual Studio Code. One of its most useful capabilities is Tab autocomplete, which is typically implemented as Fill-In-the-Middle (FIM) completion: the model predicts code that fits between what you already have before and after the cursor.
Community reports indicate that llama.cpp (via llama-server) can be a practical backend for Continue's FIM-style autocomplete, and in some setups it gives better results than other local backends. In Continue's configuration, the key idea is to define a model that is explicitly assigned the autocomplete role and point it at your running llama-server.
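To see what FIM looks like at the HTTP level, here is a minimal sketch that sends a fill-in-the-middle request straight to llama-server, outside of Continue. It assumes a recent llama.cpp build (which exposes an /infill endpoint that accepts input_prefix and input_suffix), a FIM-capable model, and the port used in the configuration below; the sample prefix/suffix and parameter values are illustrative only.

```python
# A rough sketch of a raw FIM request to llama-server, independent of Continue.
# Assumes llama-server is listening on 127.0.0.1:8081 and exposes the /infill
# endpoint (present in recent llama.cpp builds, for FIM-capable models).
import json
import urllib.request

API_BASE = "http://127.0.0.1:8081"  # should match apiBase in Continue's config

payload = {
    # Code before and after the "cursor"; the model predicts what goes between.
    "input_prefix": "def add(a, b):\n    ",
    "input_suffix": "\n\nprint(add(2, 3))\n",
    "n_predict": 64,      # cap completion length to keep latency low
    "temperature": 0.2,   # low temperature tends to suit code completion
}

req = urllib.request.Request(
    f"{API_BASE}/infill",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    result = json.loads(resp.read())

# Recent llama-server builds return the generated text under "content".
print(result.get("content"))
```

If this returns a sensible completion, the same server should work as Continue's autocomplete backend.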
Continue’s YAML configuration supports multiple models. Each model can be assigned one or more roles (for example: chat, edit, apply, and importantly, autocomplete). When you include a model with the autocomplete role, Continue will use it for Tab completion and similar inline suggestions.
This is the modern equivalent of the older “tab autocomplete model” concept discussed in issues and community threads: you dedicate a model specifically for completion, rather than using your chat model for everything.
Below is a minimal, workable pattern: one model for FIM autocomplete via llama.cpp, and optionally another for chat/edit.
```yaml
name: Local Continue (llama.cpp FIM)
version: 1.0.0
schema: v1
models:
  # Autocomplete (FIM) model powered by llama.cpp / llama-server
  - name: local-fim
    provider: llama.cpp
    model: your-model.gguf
    apiBase: http://127.0.0.1:8081
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 250
      maxPromptTokens: 1024
      modelTimeout: 2000

  # Optional: separate model for chat/edit/apply (OpenAI-compatible example)
  - name: chat-main
    provider: openai
    model: your-chat-model
    apiBase: http://127.0.0.1:9000/v1
    roles:
      - chat
      - edit
      - apply
```

Key points:
- `provider: llama.cpp` tells Continue to treat this model as a llama.cpp backend.
- `apiBase` should match where your `llama-server` is listening (example: `http://127.0.0.1:8081`).
- `roles: [autocomplete]` is what makes this model the one Continue uses for FIM-style Tab completion.
A recurring theme in user reports is that llama.cpp is often chosen specifically for autocomplete, while a different backend may be used for chat, depending on performance and quality preferences. Users also discuss model sizing and latency tradeoffs (for example, using smaller “coder” models to keep suggestions responsive).
Separately, Continue’s own issue discussions confirm that the extension includes user-facing options related to “FIM” and “Next Edit autocomplete over FIM,” reinforcing that FIM is a core part of its autocomplete pathway.
If autocomplete does not appear to work:
- Confirm `llama-server` is running and reachable at the configured `apiBase` (a quick reachability check is sketched after this list).
- Ensure your FIM model entry includes `roles: [autocomplete]`.
- Use a model known to behave well for code completions (and small enough to keep latency low).
- If you are also running an OpenAI-compatible server for chat, keep it separate from llama.cpp autocomplete to avoid endpoint mismatches and confusion in routing.
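For the first point, a quick check against the configured apiBase can save a lot of guesswork. The sketch below assumes a recent llama-server build, which exposes a /health endpoint; if your build differs, any HTTP response from the server at that address still tells you the host and port are right.

```python
# Minimal reachability check for the autocomplete backend.
# Assumes llama-server listens at the apiBase from the Continue config and
# exposes /health (available in recent llama.cpp builds).
import urllib.error
import urllib.request

API_BASE = "http://127.0.0.1:8081"  # same value as apiBase in config

try:
    with urllib.request.urlopen(f"{API_BASE}/health", timeout=5) as resp:
        print(f"llama-server reachable: HTTP {resp.status}")
except urllib.error.HTTPError as exc:
    # The server answered, but not with 200 (e.g. it is still loading the model).
    print(f"llama-server answered with HTTP {exc.code} at {API_BASE}")
except OSError as exc:
    print(f"llama-server NOT reachable at {API_BASE}: {exc}")
```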
To enable FIM-style code completion in Continue using llama.cpp, you do not need an elaborate setup: define a dedicated model entry with `provider: llama.cpp` and `roles: [autocomplete]`, and point it to your running llama-server. From there, you can optionally layer on a separate chat/edit model depending on your workflow.