Large language model (LLM) coding assistants like Sourcegraph Cody, Aider, and Tabby help developers generate and apply code changes. This report examines how these open-source tools prompt LLMs to produce patches, how they integrate the changes into code and handle common issues, how they verify results, and which challenges remain.
Structured Prompts for Code Edits – These assistants carefully craft prompts so the LLM knows exactly how to output changes. For example, Aider uses specialized edit formats: it can ask the LLM for a full file rewrite or a diff. Aider often defaults to a diff format, where the LLM is told to return only the changed parts of files using a syntax similar to a unified diff or marked “search/replace” blocks. This reduces token usage and focuses the LLM on the edits. The prompt includes instructions like “produce changes in this format” with file paths and code fences, so the model returns patches instead of plain text. Aider’s documentation shows an example where the model is prompted to output a diff with `<<<<<<< SEARCH` and `>>>>>>> REPLACE` markers around the modified lines. By giving a clear template, the LLM is guided to generate an actual patch.
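An illustrative, simplified example of what such a search/replace block might look like (the exact markers, fences, and surrounding wording vary by Aider version and edit format):

```
greetings.py
<<<<<<< SEARCH
def greet(name):
    print("Hello " + name)
=======
def greet(name):
    print(f"Hello, {name}!")
>>>>>>> REPLACE
```

The assistant looks up the named file, finds the text between the SEARCH markers, and swaps in the REPLACE text, so the model only has to reproduce the lines it actually wants to change.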
Including Relevant Context – All these tools feed the LLM as much context as needed to make correct changes. Cody uses Sourcegraph’s code search to pull in relevant code snippets, definitions, or usages from the codebase. For instance, if you ask Cody to change a function, it will include that function’s file (and maybe its references) in the prompt. Developers can also explicitly mention files (e.g. `@filename`) to ensure Cody includes them in context. Aider analyzes the entire git repo to build a “repository map”, essentially a compact summary of the codebase, which it automatically includes in the prompt for context. This map (built via ctags or similar) gives the LLM an overview of project structure and important symbols without sending every file. By doing this, Aider can provide needed context about other parts of the code (classes, functions, etc.) that aren’t directly being edited, reducing the chance of hallucinated references. Tabby, being primarily a code completion engine, provides context from the current file and open files in the IDE. Its inline chat feature works with the open editor content, so the prompt usually contains the code surrounding the area of interest and possibly some project-wide info if available. In short, all tools try to give the model the right slice of the code – not too little (which can cause incoherent changes) and not too much (which can overwhelm the model).
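As a toy illustration of the repo-map idea (not Aider’s actual implementation, which relies on ctags/tree-sitter plus additional ranking logic), one could collect just the top-level symbols of each Python file into a compact, prompt-friendly outline:

```python
import ast
from pathlib import Path

def repo_map(root: str) -> str:
    """Build a compact outline of a Python project: one line per
    top-level class/function instead of full file contents."""
    kinds = {ast.ClassDef: "class", ast.FunctionDef: "def", ast.AsyncFunctionDef: "async def"}
    outline = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse
        symbols = [f"  {kinds[type(node)]} {node.name}"
                   for node in tree.body if type(node) in kinds]
        if symbols:
            outline.append(f"{path.relative_to(root)}:")
            outline.extend(symbols)
    return "\n".join(outline)

print(repo_map("."))
```

Sending this outline instead of whole files keeps the prompt small while still telling the model which classes and functions exist elsewhere in the project.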
Iterative and Refined Instructions – To improve patch quality, these assistants sometimes iterate or use multi-step prompting. Rather than one big prompt, they may break the task into stages. For example, Aider has an “architect mode” that uses a two-step approach: one model generates a high-level plan or diff, and a second model (or a second round) focuses just on applying the edits in the proper format. This separation of what to do from how to edit helps when a single LLM tends to produce incorrect patch syntax – the first response can be a plain explanation or pseudo-diff, and the second ensures it’s converted to a proper patch. Another strategy is seen in how some tools (and community tips) suggest giving insertion context to the LLM. For instance, developers found that instructing the model to include a few lines of code before and after the change can help it locate the correct insertion point. An example shared by a user of Cursor (a similar tool) shows a prompt format: “First, I will show you some context... Then I will show you the insertion point in fileX and give instructions... Generate the code to insert, and include 3 lines before/after in the code block.” By explicitly framing where the edit goes, the LLM is less likely to misplace it. In summary, the prompting strategies involve structured output formats, rich but focused context, and sometimes iterative hints to guide the LLM toward correct and minimal edits.
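A hypothetical helper in that spirit — the function name and prompt wording are illustrative, not taken from any of these tools — might assemble such an edit prompt like this:

```python
def build_edit_prompt(file_path: str, source: str, target_line: int,
                      instruction: str, context_lines: int = 3) -> str:
    """Frame an edit request with a few lines before/after the insertion
    point (target_line is a 0-based line index), so the model can anchor
    its change and echo that context back in its answer."""
    lines = source.splitlines()
    start = max(0, target_line - context_lines)
    end = min(len(lines), target_line + context_lines + 1)
    context = "\n".join(lines[start:end])
    return (
        f"File: {file_path}\n"
        f"Code around the insertion point (lines {start + 1}-{end}):\n"
        f"{context}\n\n"
        f"Instruction: {instruction}\n"
        f"Generate the code to insert, and include the {context_lines} lines "
        "before and after it in your code block so the edit can be located precisely."
    )
```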
Translating LLM Output into Code Changes – Once the LLM produces a response, these assistants parse and apply the changes to the user’s code. Aider expects the LLM’s answer to already be a well-formed patch (per the prompt format). If using the diff format, the answer contains the filename and the diff snippet; Aider then locates the “search” text in the file and replaces it with the provided “replace” text. In effect, the LLM is doing the heavy lifting of generating a unified diff, and Aider just applies it. In cases where Aider uses the “whole file” format, it receives an entire file from the LLM and writes it out. Aider’s tight integration with git means each accepted patch can be immediately staged or committed. In fact, by default it auto-commits every change after applying it, labeling the commit as an AI-generated change. This ensures each LLM edit is tracked – providing a safety net to undo or review later. Aider also prints the diff of changes for the user to see after applying, and offers commands like `/diff` (to show all changes since last prompt) and `/undo` (to rollback the last applied edit). This workflow treats the LLM’s output as a patch file that needs to cleanly apply to the current code.
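A minimal sketch of that apply step, assuming the model returned well-formed blocks shaped like the earlier illustration (Aider’s real parser is considerably more tolerant of near-miss output):

```python
import re
from pathlib import Path

# Matches: a file path line, then SEARCH text, a divider, and REPLACE text.
BLOCK_RE = re.compile(
    r"^(?P<path>\S+)\n<<<<<<< SEARCH\n(?P<search>.*?)\n=======\n"
    r"(?P<replace>.*?)\n>>>>>>> REPLACE",
    re.DOTALL | re.MULTILINE,
)

def apply_edit_blocks(reply: str) -> None:
    """Apply each search/replace block in the LLM reply via exact text match."""
    for m in BLOCK_RE.finditer(reply):
        path = Path(m.group("path"))
        original = path.read_text(encoding="utf-8")
        if m.group("search") not in original:
            raise ValueError(f"search text not found in {path}; edit not applied")
        patched = original.replace(m.group("search"), m.group("replace"), 1)
        path.write_text(patched, encoding="utf-8")
```

If the search text does not match the current file contents exactly, the edit is rejected rather than applied in the wrong place — which is why the prompt pushes the model to reproduce the existing lines verbatim.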
Cody, which operates inside editors like VS Code or JetBrains, handles patch integration somewhat differently. It doesn’t usually ask the LLM to produce raw unified diff text; instead, Cody might get the LLM to output a code snippet or a replacement block, and the editor plugin determines how to apply it. For example, if you ask “Rename function X to Y”, Cody might reply with a snippet of the refactored code. The user is then shown an “Apply change” button in the chat UI. When clicked, Cody inserts or replaces the code in the appropriate file and location. Under the hood, Cody uses heuristics and the provided context to find where that snippet belongs (often using markers or the surrounding lines as anchors). However, this can be tricky – if the snippet lacks enough context, the tool might insert it in the wrong place or even in a new file. Indeed, users have reported that sometimes “Apply” would create a duplicate file or garble the insertion. Sourcegraph has been improving the “Smart Apply” feature to reduce such errors. Typically, the integration relies on the file being open or referenced (via an `@file` mention in the prompt) so Cody knows which file to modify. If the file is identified, Cody will attempt to merge the LLM’s suggestion into that file at the correct spot. The integration is therefore a combination of the LLM providing the new code and the assistant tooling mapping that into an edit operation (like insert at line N, or replace lines X-Y). It’s less of an explicit diff and more akin to a guided search-and-replace.
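One plausible anchoring heuristic — a guess at the general approach, not Sourcegraph’s actual algorithm — is to use the first and last lines of the suggested snippet as anchors and replace the span between their unique matches:

```python
def smart_apply(file_text: str, snippet: str) -> str:
    """Replace the region bounded by the snippet's first and last lines,
    provided each anchor occurs exactly once in the file; otherwise bail
    out so a human (or a retry with more context) can decide."""
    file_lines = file_text.splitlines()
    snip_lines = snippet.splitlines()
    first, last = snip_lines[0].strip(), snip_lines[-1].strip()

    starts = [i for i, line in enumerate(file_lines) if line.strip() == first]
    ends = [i for i, line in enumerate(file_lines) if line.strip() == last]
    if len(starts) != 1 or len(ends) != 1 or ends[0] < starts[0]:
        raise ValueError("anchors are ambiguous or missing; refusing to apply")

    new_lines = file_lines[:starts[0]] + snip_lines + file_lines[ends[0] + 1:]
    return "\n".join(new_lines) + "\n"
```

When the anchors match more than once, or not at all, this kind of heuristic has to fall back to asking the user — which is exactly where the misplaced-insertion reports come from.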
Tabby, as a self-hosted alternative to GitHub Copilot, primarily provides inline code completions and some chat Q&A capabilities. Its patch generation is more implicit – Tabby might not output a formal patch, but rather suggest the next few lines or a refactored function body as you type. The developer then accepts the suggestion (usually with an editor keystroke) which effectively integrates the change. In chat mode, Tabby can answer questions or suggest code, but it’s up to the user to copy those suggestions into their code. In other words, Tabby’s integration is largely manual or editor-driven: it augments the coding process with suggestions, and the developer decides to apply them. This design means Tabby doesn’t need a complex diff parsing mechanism – integration is “apply as you go.” The advantage is simplicity, but the flip side is that coordinating multi-file or sweeping changes is not automatic; it’s done piece by piece with user oversight.
Ensuring Compatibility – After an LLM generates a patch, the assistants strive to make sure it applies cleanly and doesn’t break the codebase. They often verify that the target code (context) hasn’t changed since the LLM saw it. For example, Aider uses git to warn if a file was modified outside of its knowledge (a “dirty” file) and even auto-commits unsaved changes before applying new edits. This prevents patch offsets from being wrong. The patches themselves are applied using standard diff algorithms or simple text replacement, so if the LLM followed the prompt correctly, the change slots right in. When things don’t line up (e.g., the LLM’s diff doesn’t apply), Aider will report a failure. Cody, working live in an IDE, relies on the real-time state of the file – if you’ve made other edits, Cody’s suggestions might no longer match, and in such cases the “apply” might fail or produce a conflict. There isn’t a fully automated merge conflict resolution in these tools (yet); usually the onus is on keeping context in sync or asking the LLM to regenerate a patch for the latest code if a conflict occurs. In practice, integration works best when the scope of changes is small and localized, which the prompting strategies try to ensure.
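A sketch of that kind of pre-apply safety check using plain git commands (Aider does something similar through its git integration; the commit message below is just a placeholder):

```python
import subprocess

def ensure_clean(path: str, auto_commit: bool = True) -> None:
    """Before applying an LLM edit, detect uncommitted changes to the file
    and optionally commit them so the edit lands on known content."""
    status = subprocess.run(
        ["git", "status", "--porcelain", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if status:
        if not auto_commit:
            raise RuntimeError(f"{path} has uncommitted changes; refusing to edit")
        subprocess.run(["git", "add", "--", path], check=True)
        subprocess.run(["git", "commit", "-m", f"Snapshot {path} before AI edit"],
                       check=True)
```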
Even with careful design, LLM-generated patches can run into issues. Some common problems and edge cases observed include:
- Syntax Errors and Broken Code: The LLM may produce code that doesn’t compile or has minor syntax mistakes (missing commas, wrong indentation, etc.). For instance, an edit might introduce an off-by-one error in indentation in Python, or a missing semicolon in Java. These errors slip in because the model’s output isn’t guaranteed to be syntactically perfect. Tools like Aider catch this during the lint/test phase (discussed later) and will attempt to fix it, but it’s a frequent hiccup. In some cases, the model might also misunderstand the code context and use an undefined variable or incorrect type, leading to runtime errors if not caught.
- Patch Format Mismatch: Sometimes the LLM doesn’t follow the expected format for output, which means the assistant can’t apply the change. Aider’s docs note that “the LLM will reply with some code changes that don’t get applied... usually because the LLM is disobeying the system prompts and trying to make edits in a format that Aider doesn’t expect.” In other words, the assistant asked for a diff, but the model might return a whole file, or intermix explanation text with code, etc. Aider works hard to handle “almost correct” outputs, but if the format is too off (say, missing the markers), it will error out. This is especially a problem with less capable models – as the Aider maintainer notes, weaker local LLMs are more prone to ignoring the required patch syntax, making them “just barely capable of working with Aider”. Ensuring the LLM stays on-format is a constant battle, and when it fails, the user might have to prompt again or switch to a simpler edit mode.
- Misplaced or Overlapping Edits: When applying a change, there’s a risk the new code ends up in the wrong spot or overwrites something it shouldn’t. This edge case has been observed particularly with Cody’s apply feature. Users reported that Smart Apply would “miss where it should go or completely replace/delete code blocks” unrelated to the intended change. In one example, instead of modifying an existing function, the tool inserted a second copy of the function at the end of the file (creating duplicate definitions). Such misplacement can happen if the assistant mis-identifies the insertion point – perhaps the snippet it got from the LLM didn’t uniquely match anywhere, or matched too broadly. Another scenario is when multiple changes are needed: if the LLM outputs two separate modifications in one response, applying the first might shift the code and make the second one incorrect (line numbers change, etc.). Aider’s experiments with line-number-based edits ran into this problem: “as soon as you apply the first, the line numbers for the next change will change,” making it challenging to apply a batch of edits. One suggested solution was applying from bottom to top (so earlier edits don’t disturb the line positions of later ones). This is an active edge case to manage when multiple edits are involved; a sketch of the bottom-up approach appears after this list.
- Incorrect Logic or Contextual Errors: Just because a patch applies doesn’t mean it does the right thing. LLMs can produce logically incorrect code – e.g., off-by-one loops, wrong conditions, inefficient algorithms, or just an approach that doesn’t actually solve the user’s request. The assistant might not catch this immediately if the code runs and passes tests syntactically. For example, the LLM might “fix” a bug in a way that introduces another subtle bug. These tools rely on either tests or the developer’s review to catch logical issues. A specific manifestation of this is hallucinated code: the LLM might call a function that doesn’t exist or use an API incorrectly. If the function is completely fictional, the code won’t compile (which is easier to catch), but if it’s a real function used wrongly, it could slip by. An anecdote from Aider’s experience with certain models was that they sometimes elided large sections of code and replaced them with comments like “# … original code here …” – essentially lazily saying “and so on” instead of solving the problem. This kind of output is obviously incorrect logically and incomplete, but an over-trusting user might not notice missing functionality. It underscores that LLMs might not always do what you expect, even if the patch format is correct.
- Session Consistency and Drift: In a multi-turn session, the conversation history grows, and maintaining consistency can be difficult. If you ask for several changes in a row, the context of earlier edits has to carry over. There are edge cases where the model might “forget” details from earlier in the chat or get confused by them. Also, large context windows can ironically cause confusion – as one developer noted, feeding too much (like an entire big codebase) can make models start “losing track and stop obeying instructions”. When the assistant loads, say, 50k tokens of code into context to be thorough, the LLM might become unable to locate the proper edit region or start giving irrelevant responses. This is why the strategy is usually to slim down context to just what’s needed. But it’s a tricky balance – not enough context, and the model might make an edit that conflicts with code in another file it wasn’t shown; too much context, and it might ignore the system prompt to stick to the diff format, for example. Keeping the model focused through many edits (especially if the user’s requests shift scope) is an open challenge. Some tools mitigate this by occasionally clearing history or re-establishing context (e.g., re-summarizing the code after several changes).
- Multiple File Edits and Coordination: A particularly challenging edge case is when a change spans multiple files. For example, renaming a function requires updating its definition and all call sites across the project. LLMs working on one file at a time might not catch everything. Aider will automatically pull in related files if they appear in the repo map context, but there’s no guarantee the model will edit all of them unless explicitly instructed. Cody can handle multi-file if you select them or mention them, but ensuring consistency (no missed references) is hard. If the LLM misses an occurrence, the code may compile but fail at runtime. Conversely, if the user asks for a broad change (“make all API endpoints use authentication”), the model might try to output a very large diff touching many files, which could exhaust token limits or simply be too much to apply safely. In practice, these tools often have to break such tasks into smaller chunks (or guide the user to do so). Handling transactions of many coordinated edits is still an area with edge cases (like partial application or one file succeeding and another failing).
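As noted under “Misplaced or Overlapping Edits”, one way to keep a batch of line-number-based edits consistent is to apply them from the bottom of the file upward. A minimal sketch, assuming edits arrive as 1-based (start, end, replacement) line ranges:

```python
def apply_line_edits(text: str, edits: list[tuple[int, int, str]]) -> str:
    """Apply line-number-based edits from the bottom of the file upward,
    so earlier replacements don't shift the positions of later ones.
    Each edit is (start_line, end_line, replacement_text), 1-based inclusive."""
    lines = text.splitlines()
    for start, end, replacement in sorted(edits, key=lambda e: e[0], reverse=True):
        lines[start - 1:end] = replacement.splitlines()
    return "\n".join(lines) + "\n"

# Both edits refer to original line numbers and still land correctly.
patched = apply_line_edits("a\nb\nc\nd\n", [(1, 1, "A"), (3, 4, "C\nD")])
assert patched == "A\nb\nC\nD\n"
```

Search/replace-style edits (as Aider uses) sidestep the line-number problem entirely, as long as each search block matches the file uniquely.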
In summary, while LLM coding assistants greatly speed up writing and modifying code, they do encounter issues like format compliance, correct placement of changes, logical correctness, and maintaining coherence over multiple edits. Each tool has some mechanisms to reduce these, but they remain important considerations when using LLM-generated patches.
To increase confidence in the AI-generated patches, coding assistants employ a mix of automated checks and encouraging human oversight:
- Automated Linting: Many tools integrate linters or compilers to catch errors immediately after applying a patch. Aider, for example, “can automatically lint ... code every time it makes changes”, using built-in linters for popular languages. If the LLM’s edit introduced a syntax error or style violation, the linter output is captured. Aider will present those lint errors and often feed them back into the chat, asking the LLM to fix the issues it introduced. This creates a feedback loop where the assistant says, in effect, “I applied your changes, but the linter found these problems, please address them.” By doing this before the user even runs the code, obvious mistakes can be corrected in seconds. For compiled languages, a “lint” step can simply be a compilation attempt – any compiler errors are treated just like lint findings. Linters also enforce coding conventions, so if a patch violated formatting rules, it can be caught and fixed as part of this process.
- Running Tests (Auto-Testing): Beyond linting, running the project’s test suite is a powerful way to verify correctness. Aider allows users to specify a test command (e.g. `pytest` or `npm test`) and, with the `--auto-test` option, it will run the tests after each AI edit. If any tests fail post-patch, Aider captures the failing test output (stack traces, assertions, etc.) and shares it with the LLM. The assistant will say something like: “Tests X and Y failed with this error… please fix the code accordingly.” The LLM then has concrete feedback about what went wrong, which greatly helps in steering it to a correct solution. This approach essentially uses the test suite as the judge of success: a patch isn’t “done” until tests pass (or the user is satisfied). It’s a form of verification that goes beyond superficial correctness. Not all tools have this built-in, though. Cody doesn’t automatically run tests in the IDE, but a developer can of course run them manually after applying changes. Some workflows combine Cody with a continuous testing setup (like auto-running tests on file save) to mimic this. The key benefit is catching runtime or logical errors that linting won’t (e.g., the code runs but produces wrong results). A minimal sketch of this lint-and-test feedback loop appears after this list.
- User Review and Confirmation: These assistants generally keep the developer in the loop for final approval. In Cody and similar IDE-based tools, the changes are shown to the user (either as a diff or as a preview in the chat) and only applied when the user confirms. This is a form of verification by inspection – the developer can eyeball the patch and decide if it looks reasonable. Even after application, the changes are in the editor where the developer can further tweak or revert if needed. In fact, one of the simplest “verification” habits encouraged is to always review the AI’s diff. Cody makes this easy by highlighting the changes it will make; its apply UI effectively forces a mini code review by the user before the code goes into their file. In contrast, Aider auto-applies by design, but the user still sees the diff printed and can undo if it looks wrong. Tabby’s suggestions are inline, so the user inherently reviews them as they accept them (since you see the suggestion ghosted in your editor before committing it). This human-in-the-loop checkpoint is crucial, because it can catch issues that automated checks might not – e.g., if the code change doesn’t align with the intended design or could have side effects the model wouldn’t know. Many teams treat AI patches like any other code change: they might even put them in a pull request for peer review. Some open-source projects using these tools require that an AI-generated diff be reviewed and edited by a developer before it’s committed to the main branch.
- Version Control Safety Nets: Integration with version control (typically Git) is another verification and safety strategy. Aider’s choice to commit each change means there’s a record of what the AI did. If something goes wrong, you can revert that commit. You can also diff the commit to see exactly what was altered, which is useful if the change was large or spread out. Aider even tags commit messages to indicate AI involvement, so later one can identify AI-authored changes. For verification purposes, one could run `git diff` or use `git bisect` to isolate an AI-introduced bug. While Cody and Tabby don’t automatically commit changes, they operate on the user’s working copy which the user can manually commit or revert. Many developers will commit a baseline before using the AI, so they have a snapshot to go back to if needed. This practice isn’t new (developers do it before large manual refactors too), but it’s especially handy with AI patches – if the tool goes on a wrong tangent, you can drop the changes easily. In essence, version control provides an undo/redo on a higher level, complementing the tool’s own undo commands.
- Dry Runs and Confirmation Modes: Some assistants have a “preview” mode or allow the LLM to suggest changes without immediately applying them. For example, instead of actually editing files, the assistant could output a diff in the chat for the user to confirm. The user could then say “Looks good, apply it” or copy it manually. This isn’t the default for most (because it’s less efficient), but it’s a strategy some users adopt for critical sections – treat the AI like it’s proposing a pull request, and you as the maintainer approve and apply it. Verification here is purely manual, but it can be useful for double-checking complex patches.
- Secondary Model or Double-Checking: An emerging strategy (not widely implemented yet) is using a second LLM to review the first LLM’s output. Aider’s architect mode is one example: one model proposes, another applies/fixes. In theory, one could also have a model critique the patch (“Does this change meet the request and not break anything?”). This is still experimental, but conceptually it’s like a code review by an AI. The second model might catch obvious mistakes or policy violations. However, it’s not foolproof, and it doubles the cost, so it’s not common in current tools beyond the specific use-case of formatting patches. Still, it points toward future verification where the AI could self-validate to some extent before asking the human to sign off.
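As referenced in the auto-testing item above, here is a minimal sketch of the lint-and-test feedback loop. It assumes a Python project checked with ruff and pytest, and a hypothetical `chat` callable that sends a follow-up message to the model and applies whatever fix it returns — the real tools wire this up internally:

```python
import subprocess

def verify_and_feedback(chat, max_rounds: int = 2) -> bool:
    """After applying an edit, run linter and tests; on failure, feed the
    output back to the model and ask it to fix its own change."""
    for _ in range(max_rounds):
        for cmd in (["ruff", "check", "."], ["pytest", "-x", "-q"]):
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                chat(
                    "The last change was applied, but this check failed:\n"
                    f"$ {' '.join(cmd)}\n{result.stdout}\n{result.stderr}\n"
                    "Please fix the code you just changed."
                )
                break  # let the model attempt a fix before re-checking
        else:
            return True  # all checks passed
    return False
```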
In practice, a combination of these strategies is used. For instance, Aider applies the patch, lints, runs tests, and still lets the user inspect the diff and undo if needed. Cody relies more on the user’s judgment and existing development processes (like tests in the repo, code reviews), given it’s an IDE assistant. The goal is always to ensure the final code is correct and maintainable, whether that assurance comes from an automated signal (tests passing) or the developer’s own confidence after reviewing.
Despite the progress in applying LLM-generated code patches, several challenges remain difficult:
- Handling Complex or Large-Scale Refactorings: While LLMs are good at small, targeted fixes, they struggle with sweeping changes that require understanding the architecture or making coordinated edits across many parts of a codebase. For example, redesigning an entire module or changing the inheritance structure of several classes is hard to do with one-shot prompts. The model might miss some connections or produce a massive diff that doesn’t fit in context. Current tools often break such tasks into smaller steps or leave the higher-level decisions to the developer. Fully automating a large refactor (that a senior engineer would plan carefully) is still out of reach. The assistants also lack global reasoning – they don’t perform whole-program analysis like a human would when ensuring a big change doesn’t introduce subtle bugs. So complex refactoring remains a semi-manual process with AI providing help on the smaller pieces.
- Eliminating Hallucinations and Ensuring Accuracy: LLMs sometimes output code that “looks” valid but is wrong. For instance, an AI might confidently use a function from a library that actually doesn’t exist in that version, or assume a different data type. Giving more context reduces these hallucinations but does not eliminate them. There’s no guarantee that just because the patch applies and tests pass (if tests aren’t thorough), the logic is 100% correct. Reducing hallucinations may require model improvements or additional validation steps. Some research proposes runtime verification or executing the code on examples to see if it behaves as expected, but that’s not common in current tooling. Thus, developers still need to keep a critical eye on AI-suggested code for plausibility. As one commenter put it, “Tabby (or Copilot, etc.) can produce sneaky problems that appear to run but contain issues” – catching those is an unsolved part that likely needs stronger semantic analysis from the AI or integration with formal verification tools.
- Maintaining Project Coding Conventions: Each project has its own style and conventions (naming, formatting, architectural patterns). LLMs have a generic understanding of common styles (and tools like Aider let you specify “use black formatting” or “follow PEP8”), but nuances often slip. For example, an organization might have a specific prefix for private variables or a certain pattern for commit messages. Today’s models might not pick up on those without explicit instructions – and even then, they might apply them inconsistently. Aider’s ability to take a config for coding conventions is a helpful feature, but it’s not foolproof if the model isn’t 100% obedient or if the convention is complex. Enforcing things like project-specific lint rules or type-checker constraints is still something that might require a post-processing step (or again, human review). In summary, getting AI to write code in your team’s style consistently is still a challenge; it often requires iterative prompts (like “Fix the naming to match our styleguide”) or manual tweaks after the fact.
- Context Window Limits and Performance: Even though models with huge context windows (50k, 100k tokens) are emerging, using them effectively is non-trivial. As noted, feeding too much code can confuse the model or cause it to ignore the system instructions. There’s an unresolved question of how to give an AI “knowledge” of an entire large codebase without drowning it in details or running into token limits. Approaches like embeddings + retrieval (which Cody uses) help, but if the relevant piece of code isn’t retrieved, the model might not know about it and make a mistake. Extremely large contexts also raise performance issues (slower responses, higher cost). So tools must do a lot of smart filtering – this is still an active development area and not entirely solved. “Infinite context” where the AI truly knows everything about the code at once is a bit of a holy grail; for now, there’s a trade-off between breadth of context and reliability of the model’s focus.
- Reliability of Patch Formatting: Despite improved prompting, getting an LLM to consistently produce a perfect patch that applies cleanly is not 100% solved. Minor format deviations or creative interpretations of the instructions still happen. This is especially true with open-source or smaller models. Aider’s creator mentioned that precision breaks down when models are overloaded or not robust enough, leading to format errors. They introduced the “editor/architect mode” to mitigate this, essentially fixing patches with another step. That indicates the problem isn’t fully solved by prompt engineering alone – it needed a secondary process. In general, having to parse the AI’s output and sometimes deal with unexpected output will likely persist. The tools will get better at recovering (e.g., if the model outputs code without the proper fence, perhaps auto-inferring where it goes), but the ideal where the AI always follows instructions to the letter is not here yet. Until models are more deterministic or controllable, patch integration will occasionally require clean-up (either by the tool or the user). It remains an area for improvement (perhaps future LLM systems will allow constrained output modes natively).
- Trust and Verification at Scale: Another unsolved aspect is: how do we truly verify that an AI change hasn’t broken anything in a large system? Running a full test suite is a good start, but what if coverage is incomplete? Subtle bugs might lurk. In critical code (say medical or financial software), one might want formal guarantees. Right now, these AI assistants don’t integrate with formal methods or deep static analysis beyond linters. In the future, there may be hybrid tools that can reason about the code’s correctness (or integrate with model checkers), but that’s an open challenge. For now, the “correctness” of an AI patch is only as good as the tests and reviews that follow it. This is more a software engineering challenge compounded by AI: if you wouldn’t be confident pushing a human patch without thorough testing, the same goes for AI patches. Ensuring complete correctness (semantics, security, performance) remains largely unsolved in an automated way.
In conclusion, open-source LLM coding assistants have made it feasible to generate and apply code changes automatically, using clever prompt designs and tool integrations. They excel at small tasks and save developers time, but they are not infallible. Prompting strategies and patch application techniques continue to improve, tackling many failure modes. Verification mechanisms like linting and testing provide feedback loops to catch mistakes. However, challenges like large-scale refactoring, hallucination elimination, and absolute reliability of changes still require further innovation. Developers using these tools today should leverage their strengths for productivity, while remaining aware of their limits and ready to intervene when the AI goes off track. The field is rapidly evolving, and each iteration (both in model capability and tool design) brings us closer to more trustworthy AI pair programmers, but for now, human oversight and collaboration with the AI remain key.
Sources:
- Aider Documentation and Issues (prompt formats, error handling, verification)
- Sourcegraph Cody Forum and Docs (context injection, apply behavior)
- Tabby Project Info and Community Commentary (capabilities and limitations)
- Developer Discussions on LLM Code Edits (Hacker News threads on Aider and Cody)