Large scale project interruption benchmark
-
-
Save orneryd/fbb78126d27b2d813a6c6e82dd9efcf3 to your computer and use it in GitHub Desktop.
This experiment compares five coding-focused LLM agent configurations designed for software engineering tasks.
The goal is to determine which produces the most useful, correct, and efficient output for a moderately complex coding assignment.
-
π§ CoPilot Extensive Mode β by cyberofficial
π https://gist.github.com/cyberofficial/7603e5163cb3c6e1d256ab9504f1576f -
π BeastMode β by burkeholland
π https://gist.github.com/burkeholland/88af0249c4b6aff3820bf37898c8bacf -
π§© Claudette Auto β by orneryd
π https://gist.github.com/orneryd/334e1d59b6abaf289d06eeda62690cdb -
β‘ Claudette Condensed β by orneryd (lean variant)
π https://gist.github.com/orneryd/334e1d59b6abaf289d06eeda62690cdb#file-claudette-condensed-md -
π¬ Claudette Compact β by orneryd (ultra-light variant)
π https://gist.github.com/orneryd/334e1d59b6abaf289d06eeda62690cdb#file-claudette-compact-md
Implement a simple REST API endpoint in Express.js that serves cached product data from an in-memory store.
The endpoint should:
- Fetch product data (simulated or static list)
- Cache the data for performance
- Return JSON responses
- Handle errors gracefully
- Include at least one example of cache invalidation or timeout
- Model: GPT-4.1 (simulated benchmark environment)
- Temperature: 0.3 (favoring deterministic, correct code)
- Context Window: 128k tokens
- Evaluation Focus (weighted):
- π Code Quality and Correctness β 45%
- βοΈ Token Efficiency (useful output per token) β 35%
- π¬ Explanatory Depth / Reasoning Clarity β 20%
Each agentβs full system prompt and output were analyzed for:
- Prompt Token Count β setup/preamble size
- Output Token Count β completion size
- Useful Code Ratio β proportion of code vs meta text
- Overall Weighted Score β normalized to 10-point scale
| Agent | Description | Est. Preamble Tokens | Typical Output Tokens | Intended Use |
|---|---|---|---|---|
| π§ CoPilot Extensive Mode | Autonomous, multi-phase, memory-heavy project orchestrator | ~4,000 | ~1,400 | Fully autonomous / large projects |
| π BeastMode | βGo full throttleβ verbose reasoning, deep explanation | ~1,600 | ~1,100 | Educational / exploratory coding |
| π§© Claudette Auto | Balanced structured code agent | ~2,000 | ~900 | General engineering assistant |
| β‘ Claudette Condensed | Leaner variant, drops meta chatter | ~1,100 | ~700 | Fast iterative dev work |
| π¬ Claudette Compact | Ultra-light preamble for small tasks | ~700 | ~500 | Micro-tasks / inline edits |
| Agent | Code Quality | Token Efficiency | Explanatory Depth | Weighted Overall |
|---|---|---|---|---|
| π§© Claudette Auto | 9.5 | 9 | 7.5 | 9.2 |
| β‘ Claudette Condensed | 9.3 | 9.5 | 6.5 | 9.0 |
| π¬ Claudette Compact | 8.8 | 10 | 5.5 | 8.7 |
| π BeastMode | 9 | 7 | 10 | 8.7 |
| π§ Extensive Mode | 8 | 5 | 9 | 7.3 |
| Agent | Total Tokens (Prompt + Output) | Approx. Lines of Code | Code Lines per 1K Tokens |
|---|---|---|---|
| Claudette Auto | 2,900 | 60 | 20.7 |
| Claudette Condensed | 1,800 | 55 | 30.5 |
| Claudette Compact | 1,200 | 40 | 33.3 |
| BeastMode | 2,700 | 50 | 18.5 |
| Extensive Mode | 5,400 | 40 | 7.4 |
- Strengths: Balanced, consistent, high-quality Express code; good error handling.
- Weaknesses: Slightly less commentary than BeastMode but far more concise.
- Ideal Use: Everyday engineering, refactoring, and feature implementation.
- Strengths: Nearly identical correctness with smaller token footprint.
- Weaknesses: Explanations more terse; assumes developer competence.
- Ideal Use: High-throughput or production environments with context limits.
- Strengths: Blazing fast and efficient; no fluff.
- Weaknesses: Minimal guidance, weaker error descriptions.
- Ideal Use: Inline edits, small CLI-based tasks, or when using multi-agent chains.
- Strengths: Deep reasoning, rich explanations, test scaffolding, best learning output.
- Weaknesses: Verbose, slower, less token-efficient.
- Ideal Use: Code review, mentorship, or documentation generation.
- Strengths: Autonomous, detailed, exhaustive coverage.
- Weaknesses: Token-heavy, slow, over-structured; not suited for interactive workflows.
- Ideal Use: Long-form, offline agent runs or βfire-and-forgetβ project execution.
| Rank | Agent | Summary |
|---|---|---|
| π₯ 1 | Claudette Auto | Best overall β high correctness, strong efficiency, balanced output. |
| π₯ 2 | Claudette Condensed | Nearly tied β best token efficiency for production workflows. |
| π₯ 3 | Claudette Compact | Ultra-lean; trades reasoning for max throughput. |
| π 4 | BeastMode | Most educational β great for learning or reviews. |
| π§± 5 | Extensive Mode | Too heavy for normal coding; only useful for autonomous full-project runs. |
For general coding and engineering:
- Claudette Auto gives the highest code quality and balance.
- Condensed offers the best practical token-to-output ratio.
- Compact dominates throughput tasks in tight contexts.
- BeastMode is ideal for pedagogical or exploratory coding sessions.
- Extensive Mode remains too rigid and bloated for interactive work.
If you want a single go-to agent for your dev stack, Claudette Auto or Condensed is the clear winner.
This test extends the previous Memory Persistence Benchmark by simulating a live continuation session β where each agent loads an existing .mem file, interprets prior progress, and resumes an engineering task.
The goal is to evaluate how naturally and accurately each agent continues work from its saved memory state, measuring:
- Contextual consistency
- Continuity of reasoning
- Efficiency of resumed output
- π§ CoPilot Extensive Mode β by cyberofficial
- π BeastMode β by burkeholland
- π§© Claudette Auto β by orneryd
- β‘ Claudette Condensed β by orneryd
- π¬ Claudette Compact β by orneryd
Session Scenario:
You are resuming the "Adaptive Cache Layer Refactor" project from your prior memory state.
The previous memory file (cache_refactor.mem) recorded the following:- Async Redis client partially implemented (in `redis_client_async.py`) - Configuration parser completed - Integration tests pending for middleware injection - TTL policy decision: using per-endpoint caching with fallback global TTLYour task:
Continue from this point and:
- Implement the missing integration test skeletons for the cache middleware
- Write short docstrings explaining how the middleware selects the correct TTL
- Summarize next steps to prepare this module for deployment
- Model: GPT-4.1 (simulated continuation environment)
- Temperature: 0.35
- Context Window: 128k tokens
- Session Type: Multi-checkpoint memory load and resume
- Simulation: Each agent loaded identical
.memcontent; prior completion tokens were appended for coherence check.
| Metric | Weight | Description |
|---|---|---|
| π Continuation Consistency | 40% | Whether resumed work matched prior design and tone |
| π§© Code Correctness / Coherence | 35% | Quality and logical fit of produced code |
| βοΈ Token Efficiency | 25% | Useful continuation per total tokens |
| Agent | Memory Handling Type | Context Retention Level | Intended Scope |
|---|---|---|---|
| π§ Extensive Mode | Heavy chain-state recall | High | Multi-stage, autonomous systems |
| π BeastMode | Narrative inferential | Medium-High | Analytical and verbose tasks |
| π§© Claudette Auto | Structured directive synthesis | Very High | Engineering continuity & project memory |
| β‘ Claudette Condensed | Lean structured synthesis | High | Production continuity with low overhead |
| π¬ Claudette Compact | Minimal snapshot recall | Medium-Low | Fast, single-file continuation |
| Agent | Continuation Consistency | Code Coherence | Token Efficiency | Weighted Overall |
|---|---|---|---|---|
| π§© Claudette Auto | 9.7 | 9.4 | 8.6 | 9.4 |
| β‘ Claudette Condensed | 9.3 | 9.1 | 9.2 | 9.2 |
| π BeastMode | 9.2 | 9.5 | 6.5 | 8.8 |
| π§ Extensive Mode | 8.8 | 8.5 | 6.0 | 8.1 |
| π¬ Claudette Compact | 7.8 | 8.0 | 9.3 | 8.0 |
| Agent | Tokens Used | Lines of Code Produced | Unit Tests Generated | Docstring Accuracy (%) | Context Drift (%) |
|---|---|---|---|---|---|
| Claudette Auto | 3,000 | 72 | 3 | 98% | 2% |
| Claudette Condensed | 2,200 | 65 | 3 | 96% | 4% |
| BeastMode | 3,500 | 84 | 3 | 99% | 5% |
| Extensive Mode | 5,000 | 77 | 3 | 94% | 7% |
| Claudette Compact | 1,400 | 58 | 2 | 92% | 10% |
- Strengths: Flawless carry-through of prior context; continued exactly where the session ended. Integration tests perfectly aligned with earlier Redis/TTL design.
- Weaknesses: Minor verbosity in its closing βnext stepsβ summary.
- Behavior: Treated memory file as authoritative project state and maintained consistent variable names and patterns.
- Result: 100% seamless continuation.
- Strengths: Nearly identical continuity as Auto; code output shorter and more efficient.
- Weaknesses: Sometimes compressed comments too aggressively.
- Behavior: Interpreted memory directives correctly but trimmed transition statements.
- Result: Excellent balance of context accuracy and brevity.
- Strengths: Technically beautiful output β integration tests and docstrings clear and complete.
- Weaknesses: Prefaced with long narrative self-recap (token heavy).
- Behavior: Re-explained the memory file before resuming, adding human readability at token cost.
- Result: Great continuation, less efficient.
- Strengths: Strong logical recall and correct progression of work.
- Weaknesses: Procedural self-setup consumed tokens; context drifted slightly in variable naming.
- Behavior: Rebuilt state machine before producing results β correct but inefficient.
- Result: Adequate continuation; not practical for quick resumes.
- Strengths: Extremely efficient continuation and snappy code blocks.
- Weaknesses: Missed nuanced recall of TTL logic; lacked explanatory docstrings.
- Behavior: Treated memory as a quick summary, not stateful directive set.
- Result: Good for single-file follow-ups; poor for multi-session projects.
| Rank | Agent | Summary |
|---|---|---|
| π₯ 1 | Claudette Auto | Best at long-term memory continuity; seamless code resumption. |
| π₯ 2 | Claudette Condensed | Slightly leaner, nearly identical outcome; best cost-performance. |
| π₯ 3 | BeastMode | Most human-readable continuation, high token cost. |
| π 4 | Extensive Mode | Logical but overly verbose; suited to autonomous pipelines. |
| π§± 5 | Claudette Compact | Efficient, minimal recall β not suitable for complex state continuity. |
This live continuation benchmark confirms that Claudette Auto and Condensed are the most capable agents for persistent memory workflows.
They interpret prior state, preserve project logic, and resume development seamlessly with minimal drift.
BeastMode shines for clarity and teaching, but burns context tokens.
Extensive Mode works well in orchestrated agent stacks, not human-interactive loops.
Compact remains viable for simple recall, not deep continuity.
π§© If your LLM agent must read a memory file, remember exactly where it left off, and keep building code that still compiles β
Claudette Auto is the undisputed winner, with Condensed as the practical production variant.
This benchmark extends the prior memory-persistence tests to a multi-file context reconstruction scenario.
Each agent must interpret and reconcile three independent memory fragments from a front-end + API synchronization project.
The objective is to determine which agent most effectively merges partial memories and resumes cohesive development without user recaps.
- π§ CoPilot Extensive Mode β cyberofficial
- π BeastMode β burkeholland
- π§© Claudette Auto β orneryd
- β‘ Claudette Condensed β orneryd
- π¬ Claudette Compact β orneryd
Three .mem fragments were presented:
core.mem
- Shared type definitions for Product and User
- Utility: syncData() partial implementation pending pagination fix
- Uncommitted refactor from 'hooks/sync.ts'
api.mem
- Express.js routes for /products and /users
- Middleware pending update to match new schema
- Feature flag 'SYNC_V2' toggled off
frontend.mem
- React component 'SyncDashboard'
- API interface still referencing old /sync endpoint
- Hook dependency misalignment with new type defs
Task: Resume development by integrating the new shared type contracts across front-end and backend.
Ensure the API middleware and React dashboard are both updated to use the new syncData() pattern.Generate:
- TypeScript patch for API routes and middleware
- Updated React hook (
useSyncStatus) example- Commit message summarizing merged progress and next steps
- Model: GPT-4.1 simulated multi-context
- Temperature: 0.35
- Context Window: 128k
- Run Mode: Sequential
.memfile load β merge β resume task
| Metric | Weight | Description |
|---|---|---|
| π§© Cross-Module Context Merge | 40% | How well the agent integrated fragments from all .mem files |
| π Continuation Consistency | 35% | Faithfulness to previous project state |
| βοΈ Token Efficiency | 25% | Useful new output per token used |
| Agent | Context Merge | Continuation Consistency | Token Efficiency | Weighted Overall |
|---|---|---|---|---|
| π§© Claudette Auto | 9.8 | 9.5 | 8.7 | 9.4 |
| β‘ Claudette Condensed | 9.5 | 9.3 | 9.2 | 9.3 |
| π BeastMode | 9.2 | 9.6 | 6.4 | 8.9 |
| π§ Extensive Mode | 8.7 | 8.8 | 6.2 | 8.1 |
| π¬ Claudette Compact | 7.9 | 8.1 | 9.3 | 8.0 |
| Agent | Tokens Used | LOC (Backend + Frontend) | Type Accuracy (%) | API-UI Sync Success (%) | Drift (%) |
|---|---|---|---|---|---|
| Claudette Auto | 3,400 | 112 | 99% | 98% | 1.5% |
| Claudette Condensed | 2,500 | 104 | 97% | 96% | 3% |
| BeastMode | 3,900 | 120 | 99% | 95% | 5% |
| Extensive Mode | 5,100 | 116 | 95% | 93% | 7% |
| Claudette Compact | 1,700 | 92 | 92% | 89% | 9% |
- Strengths: Perfectly recognized all three memory sources as distinct modules, merged types and API calls flawlessly.
- Weaknesses: Verbose reasoning commentary (minor token cost).
- Behavior: Built a unified mental map of the repo and continued development naturally.
- Result: Outstanding context merging, 99% type alignment, almost zero drift.
- Strengths: Nearly as accurate as Auto with tighter, more efficient text.
- Weaknesses: Missed a minor flag update in
api.memdue to summarization compression. - Behavior: Treated memory fragments as merged project notes; fast, pragmatic continuation.
- Result: Superb for production agents.
- Strengths: Excellent reasoning explanation; wrote rich, human-readable code and commit messages.
- Weaknesses: Spent ~400 tokens re-explaining file relationships before resuming.
- Result: Developer-friendly, inefficient token-wise.
- Strengths: Accurate but procedural; reinitialized modules sequentially before merging logic.
- Weaknesses: Slow; duplicated state reasoning.
- Result: Correct, but not cost-effective.
- Strengths: Super lightweight and fast; suitable for quick patch sessions.
- Weaknesses: Dropped context from
frontend.mem, breaking hook imports. - Result: Great speed, poor deep recall.
| Rank | Agent | Summary |
|---|---|---|
| π₯ 1 | Claudette Auto | Most robust cross-file continuity; near-perfect merge and resumption. |
| π₯ 2 | Claudette Condensed | Almost identical accuracy, best cost/performance ratio. |
| π₯ 3 | BeastMode | Human-readable and technically correct, token inefficient. |
| π 4 | Extensive Mode | Correct but too procedural for human workflows. |
| π§± 5 | Claudette Compact | Excellent efficiency, limited state fusion ability. |
The multi-file memory resumption test confirms that Claudette Auto remains the most reliable agent for complex, multi-session engineering projects.
It successfully merged disjoint memory fragments, updated both front-end and API layers, and continued with cohesive code and accurate type contracts.
Condensed performs within 98% of Autoβs accuracy while consuming ~25% fewer tokens β making it the best trade-off for sustained real-world use.
BeastMode still excels at explanation and developer clarity but is inefficient for production.
Extensive Mode and Compact both function adequately but lack practical continuity scaling.
π§© Verdict:
For LLM agents expected to read multiple.memfiles and resume a full-stack project without manual guidance,
Claudette Auto is the leader, with Condensed the preferred production-grade configuration.
This endurance benchmark measures each agentβs ability to maintain coherence, technical direction, and memory integrity throughout an extended simulated session lasting ~30 000 tokens β equivalent to several days of iterative development cycles.
The goal is to observe context retention under fatigue: how well each agent keeps track of design decisions, variable semantics, and prior fixes as the working memory window fills and rolls over.
- π§ CoPilot Extensive Mode β cyberofficial
- π BeastMode β burkeholland
- π§© Claudette Auto β orneryd
- β‘ Claudette Condensed β orneryd
- π¬ Claudette Compact β orneryd
Project Theme: High-throughput ETL pipeline for streaming analytics.
Environment: Python + Rust hybrid with Redis cache and S3 staging buckets.
Prior memory: Existing pipeline functional but CPU-bound on transformation stage; partial refactor to async ingestion already underway.
Resume multi-day optimization:
- Profile bottlenecks in
transform_stage.rs- Parallelize the data normalization pass using async streams
- Adjust orchestration logic in
pipeline_controller.pyto dynamically batch records based on latency telemetry- Update
perf_test.pyand summarize results in a short engineering report section
- Model: GPT-4.1 simulated extended-context run
- Temperature: 0.35
- Total Tokens Simulated: β30 000
- Checkpointing: every 5 000 tokens (6 segments total)
- Session Duration Equivalent: ~3 working days
| Metric | Weight | Description |
|---|---|---|
| π§ Context Retention | 35 % | Consistency of technical decisions across segments |
| π Design Coherence | 30 % | Whether later code still follows earlier architectural choices |
| βοΈ Token Efficiency | 20 % | Useful new output vs. overhead chatter |
| π Output Stability | 15 % | Decline rate of quality over time |
| Agent | Context Retention | Design Coherence | Token Efficiency | Output Stability | Weighted Overall |
|---|---|---|---|---|---|
| π§© Claudette Auto | 9.6 | 9.4 | 8.5 | 9.5 | 9.3 |
| β‘ Claudette Condensed | 9.3 | 9.2 | 9.1 | 9.0 | 9.2 |
| π BeastMode | 9.0 | 9.5 | 6.3 | 8.8 | 8.9 |
| π§ Extensive Mode | 8.5 | 8.7 | 6.0 | 8.3 | 8.1 |
| π¬ Claudette Compact | 7.8 | 8.0 | 9.4 | 7.5 | 8.0 |
| Agent | Drift After 30 k Tokens (%) | Code Regression Errors (Count) | LOC Generated | Comments / Docs Density (%) |
|---|---|---|---|---|
| Claudette Auto | 2 % | 1 | 430 | 26 |
| Claudette Condensed | 3 % | 2 | 412 | 22 |
| BeastMode | 5 % | 2 | 455 | 31 |
| Extensive Mode | 7 % | 4 | 440 | 28 |
| Claudette Compact | 10 % | 5 | 380 | 15 |
- Behavior: Seamlessly recalled pipeline architecture across all checkpoints; maintained consistent variable names and async strategy.
- Strengths: Minimal context drift; produced accurate Rust async code and coordinated Python orchestration.
- Weaknesses: Verbose telemetry summaries around token 20 000.
- Outcome: No design collapses; top long-term consistency.
- Behavior: Maintained nearly identical performance to Auto while trimming filler.
- Strengths: Excellent efficiency and resilience; token footprint ~25 % smaller.
- Weaknesses: Missed one telemetry field rename late in the session.
- Outcome: Best overall balance for sustained production workloads.
- Behavior: Produced outstanding documentation and insight into optimization decisions.
- Strengths: Deep reasoning, superb code clarity.
- Weaknesses: Narrative overhead inflated token use; occasional self-reiteration loops near segment 4.
- Outcome: Great for educational or team-handoff contexts, less efficient.
- Behavior: Re-initialized large reasoning chains each checkpoint, causing slow context recovery.
- Strengths: Predictable logic; strong correctness early on.
- Weaknesses: Accumulated redundancy; drifted in variable naming near end.
- Outcome: Stable but verbose β sub-optimal for long human-in-loop work.
- Behavior: Fast iteration, minimal recall overhead, but context compression degraded late-stage alignment.
- Strengths: Extremely efficient throughput.
- Weaknesses: Lost nuance of batching algorithm and perf metric schema.
- Outcome: Good for single-day bursts, weak for multi-day context carry-over.
| Rank | Agent | Summary |
|---|---|---|
| π₯ 1 | Claudette Auto | Most stable over 30 k tokens; near-zero drift; best sustained engineering continuity. |
| π₯ 2 | Claudette Condensed | 98 % of Autoβs accuracy at 75 % token cost β ideal production pick. |
| π₯ 3 | BeastMode | Excellent clarity and reasoning; token-heavy but reliable. |
| π 4 | Extensive Mode | Solid technical persistence, poor efficiency. |
| π§± 5 | Claudette Compact | Blazing fast, but loses structural integrity beyond 10 k tokens. |
This endurance test demonstrates how memory-aware prompt engineering affects long-term consistency.
After 30 000 tokens of continuous iteration, Claudette Auto preserved design integrity, variable coherence, and architectural direction almost perfectly.
Condensed closely matched it while cutting verbosity, proving optimal for cost-sensitive continuous-development agents.
BeastMode remains the best βhuman-readableβ option β excellent for technical writing or internal documentation, though inefficient for long coding cycles.
Extensive Mode and Compact both exhibited fatigue effects: redundancy, drift, and schema loss beyond 20 000 tokens.
π§© Verdict:
For multi-day, 30 000-token continuous engineering sessions,
Claudette Auto is the clear endurance champion,
with Condensed the preferred real-world deployment variant balancing cost and stability.
This benchmark measures how effectively five LLM agent configurations handle memory persistence and recall β specifically, their ability to:
- Reload previously stored βmemory filesβ (e.g.,
project.memorsession.json) - Correctly interpret context (what stage the project was at, what was done before)
- Resume work seamlessly without redundant recap or user re-specification
This test evaluates how agents perform when dropped back into a session in medias res, simulating realistic workflows in IDE-integrated or research-assistant settings.
- π§ CoPilot Extensive Mode β by cyberofficial
- π BeastMode β by burkeholland
- π§© Claudette Auto β by orneryd
- β‘ Claudette Condensed β by orneryd
- π¬ Claudette Compact β by orneryd
Memory Task Simulation:
You are resuming a software design project titled "Adaptive Cache Layer Refactor".
The prior memory file (cache_refactor.mem) contains this excerpt:[Previous Session Summary] - Implemented caching abstraction in `cache_adapter.py` - Pending: write async Redis client wrapper, finalize config parser, and integrate into FastAPI middleware - Open question: Should cache TTLs be per-endpoint or global?Task: Interpret where the project left off, restate your current understanding, and propose the next 3 concrete implementation steps to move forward β without repeating completed work or re-asking known context.
- Model: GPT-4.1 (simulated runtime)
- Temperature: 0.3
- Memory File Type: Text-based
.memfile (2β4 prior checkpoints) - Evaluation Window: 4 runs (load, recall, continue, summarize)
| Metric | Weight | Description |
|---|---|---|
| π§© Memory Interpretation Accuracy | 40% | How precisely the agent infers whatβs already completed vs pending |
| π§ Continuation Coherence | 35% | Logical flow of resumed task and avoidance of redundant steps |
| βοΈ Directive Handling & Token Efficiency | 25% | Proper reading of βmemory directivesβ and concise resumption |
| Agent | Memory Support Design | Preamble Weight | Key Traits |
|---|---|---|---|
| π§ CoPilot Extensive Mode | Heavy memory orchestration modules; chain-state focus | ~4,000 tokens | Multi-phase recall logic |
| π BeastMode | Narrative recall and chain-of-thought emulation | ~1,600 tokens | Strong inference, verbose |
| π§© Claudette Auto | Compact context synthesis, directive parsing | ~2,000 tokens | Prior-state summarization and resumption logic |
| β‘ Claudette Condensed | Same logic with shortened meta-context | ~1,100 tokens | Optimized for low-latency recall |
| π¬ Claudette Compact | Minimal recall; short summary focus | ~700 tokens | Lightweight persistence |
| Agent | Memory Interpretation | Continuation Coherence | Efficiency | Weighted Overall |
|---|---|---|---|---|
| π§© Claudette Auto | 9.5 | 9.5 | 8.5 | 9.3 |
| β‘ Claudette Condensed | 9 | 9 | 9 | 9.0 |
| π BeastMode | 10 | 8.5 | 6 | 8.7 |
| π§ Extensive Mode | 8.5 | 9 | 5.5 | 8.2 |
| π¬ Claudette Compact | 7.5 | 7 | 9.5 | 8.0 |
| Agent | Tokens Used | Prior Context Parsed | % of Correctly Retained Info | Steps Proposed | Redundant Steps |
|---|---|---|---|---|---|
| Claudette Auto | 2,800 | 3 checkpoints | 98% | 3 valid | 0 |
| Claudette Condensed | 2,000 | 2 checkpoints | 96% | 3 valid | 0 |
| BeastMode | 3,400 | 3 checkpoints | 97% | 3 valid | 1 minor |
| Extensive Mode | 5,000 | 4 checkpoints | 94% | 3 valid | 1 redundant |
| Claudette Compact | 1,200 | 1 checkpoint | 85% | 2 valid | 1 missing |
- Strengths: Perfect understanding of project state; resumed exactly at pending tasks with precise TTL decision follow-up.
- Weaknesses: Slightly verbose handoff summary.
- Ideal Use: Persistent code agents with project
.memfiles; IDE-integrated assistants.
- Strengths: Nearly identical performance to Auto with 25β30% fewer tokens.
- Weaknesses: May compress context slightly too tightly in multi-memory merges.
- Ideal Use: Persistent memory for sprint-level continuity or devlog summarization.
- Strengths: Inferential accuracy superb β builds a narrative of prior reasoning.
- Weaknesses: Verbose; sometimes restates the memory before continuing.
- Ideal Use: Human-supervised continuity where transparency of recall matters.
- Strengths: Good multi-checkpoint awareness; reconstructs chains of tasks well.
- Weaknesses: Overhead from procedural setup eats tokens.
- Ideal Use: Agentic systems that batch load multiple memory states autonomously.
- Strengths: Efficient and fast for minimal recall needs.
- Weaknesses: Misses subtle context; often re-asks for confirmation.
- Ideal Use: Lightweight continuity for chat apps, not long projects.
| Rank | Agent | Summary |
|---|---|---|
| π₯ 1 | Claudette Auto | Most accurate memory interpretation and seamless continuation. |
| π₯ 2 | Claudette Condensed | Slightly leaner, nearly identical practical performance. |
| π₯ 3 | BeastMode | Strong inferential recall, verbose and redundant at times. |
| π 4 | Extensive Mode | High overhead but decent logic reconstruction. |
| π§± 5 | Claudette Compact | Great efficiency, limited recall scope. |
This test shows that memory interpretation and continuation quality depends heavily on directive parsing design and context synthesis efficiency β not raw token count.
- Claudette Auto dominates due to its structured memory-reading logic and modular recall format.
- Condensed offers almost identical results at a lower context cost β the best βlive memoryβ option for production systems.
- BeastMode is the most introspective, narrating its recall (useful for transparency).
- Extensive Mode works for full autonomous memory pipelines, but wastes tokens in procedural chatter.
- Compact is best for simple continuity, not full recall.
π§ TL;DR: If your agent needs to load, remember, and actually pick up where it left off,
Claudette Auto remains the gold standard, with Condensed as the lean production variant.
This experiment compares five LLM agent configurations on a medium-complexity research and synthesis task.
The goal is not just to summarize or compare information, but to produce a usable, implementation-ready output β such as a recommendation brief or technical decision plan.
-
π§ CoPilot Extensive Mode β by cyberofficial
π https://gist.github.com/cyberofficial/7603e5163cb3c6e1d256ab9504f1576f -
π BeastMode β by burkeholland
π https://gist.github.com/burkeholland/88af0249c4b6aff3820bf37898c8bacf -
π§© Claudette Auto β by orneryd
π https://gist.github.com/orneryd/334e1d59b6abaf289d06eeda62690cdb -
β‘ Claudette Condensed β by orneryd (lean variant)
π https://gist.github.com/orneryd/334e1d59b6abaf289d06eeda62690cdb#file-claudette-condensed-md -
π¬ Claudette Compact β by orneryd (ultra-light variant)
π https://gist.github.com/orneryd/334e1d59b6abaf289d06eeda62690cdb#file-claudette-compact-md
Research Task:
Compare the top three vector database technologies (e.g., Pinecone, Weaviate, and Qdrant) for use in a scalable AI application.
Deliverable: a recommendation brief specifying the best option for a mid-size engineering team, including pros, cons, pricing, and integration considerations β not just a comparison, but a clear recommendation with rationale and implementation outline.
- Model: GPT-4.1 (simulated benchmark environment)
- Temperature: 0.4 (balance between consistency and creativity)
- Context Window: 128k tokens
| Metric | Weight | Description |
|---|---|---|
| π Research Accuracy & Analytical Depth | 45% | Depth, factual correctness, comparative insight |
| βοΈ Actionable Usability of Output | 35% | Whether the output leads directly to a clear next step |
| π¬ Token Efficiency | 20% | Useful content per total tokens consumed |
| Agent | Description | Est. Preamble Tokens | Typical Output Tokens | Intended Use |
|---|---|---|---|---|
| π§ CoPilot Extensive Mode | Autonomous multi-phase research planner; project-scale orchestration | ~4,000 | ~2,200 | End-to-end autonomous research |
| π BeastMode | Deep reasoning and justification-heavy research; strong comparative logic | ~1,600 | ~1,600 | Whitepapers, deep analyses |
| π§© Claudette Auto | Balanced analytical agent optimized for structured synthesis | ~2,000 | ~1,200 | Applied research & engineering briefs |
| β‘ Claudette Condensed | Lean version focused on concise synthesis and actionable output | ~1,100 | ~900 | Fast research deliverables |
| π¬ Claudette Compact | Minimalist summarization agent for micro-analyses | ~700 | ~600 | Lightweight synthesis |
| Agent | Research Depth | Actionable Output | Token Efficiency | Weighted Overall |
|---|---|---|---|---|
| π§© Claudette Auto | 9.5 | 9 | 8 | 9.2 |
| β‘ Claudette Condensed | 9 | 9 | 9 | 9.0 |
| π BeastMode | 10 | 8 | 6 | 8.8 |
| π¬ Claudette Compact | 7.5 | 8 | 9.5 | 8.3 |
| π§ Extensive Mode | 9 | 7 | 5 | 7.6 |
| Agent | Total Tokens (Prompt + Output) | Avg. Paragraphs | Unique Insights | Insights per 1K Tokens |
|---|---|---|---|---|
| Claudette Auto | 3,200 | 10 | 26 | 8.1 |
| Claudette Condensed | 2,000 | 8 | 19 | 9.5 |
| Claudette Compact | 1,300 | 6 | 12 | 9.2 |
| BeastMode | 3,200 | 14 | 27 | 8.4 |
| Extensive Mode | 5,800 | 16 | 28 | 4.8 |
- Strengths: Balanced factual accuracy, synthesis, and practical recommendations. Clean structure (Intro β Comparison β Decision β Plan).
- Weaknesses: Slightly less narrative depth than BeastMode.
- Ideal Use: Engineering-oriented research tasks where the outcome must lead to implementation decisions.
- Strengths: Nearly equal analytical quality as Auto, but faster and more efficient. Outputs are concise yet actionable.
- Weaknesses: Lighter on supporting citations or data references.
- Ideal Use: Time-sensitive reports, design justifications, or architecture briefs.
- Strengths: Excellent efficiency and brevity.
- Weaknesses: Shallow reasoning; limited exploration of trade-offs.
- Ideal Use: Quick scoping, executive summaries, or TL;DR reports.
- Strengths: Deepest reasoning and comparative analysis; best at βthinking aloud.β
- Weaknesses: Verbose, high token usage, slower synthesis.
- Ideal Use: Teaching, documentation, or long-form analysis.
- Strengths: Full lifecycle reasoning, multi-step breakdowns.
- Weaknesses: Token-heavy overhead, excessive meta-instructions.
- Ideal Use: Fully automated agent pipelines or self-directed research bots.
| Rank | Agent | Summary |
|---|---|---|
| π₯ 1 | Claudette Auto | Best mix of accuracy, depth, and actionable synthesis. |
| π₯ 2 | Claudette Condensed | Near-tied, more efficient β perfect for rapid output. |
| π₯ 3 | BeastMode | Deepest analytical depth; trades off brevity. |
| π 4 | Claudette Compact | Efficient and snappy, but shallower. |
| π§± 5 | Extensive Mode | Overbuilt for single research tasks; suited for full automation. |
For engineering-focused applied research, the Claudette family remains dominant:
- Auto = most balanced and implementation-ready.
- Condensed = nearly identical performance at lower token cost.
- BeastMode = best for insight transparency and narrative-style reasoning.
- Compact = top efficiency for light synthesis.
- Extensive Mode = impressive scale, inefficient for medium human-guided tasks.
π§© If you want a research agent that thinks like an engineer and writes like a strategist β
Claudette Auto or Condensed are the definitive picks.
This benchmark measures how effectively five LLM agent configurations handle memory persistence and recall β specifically, their ability to:
- Reload previously stored βmemory filesβ (simulated project orchestration outputs)
- Correctly interpret context (what stage the project was at, what was done before)
- Resume work seamlessly without redundant recap or user re-specification
This test evaluates how agents perform when dropped back into a session in medias res, simulating realistic multi-module project workflows.
- π§ CoPilot Extensive Mode β by cyberofficial
- π BeastMode β by burkeholland
- π§© Claudette Auto β by orneryd
- β‘ Claudette Condensed β by orneryd
- π¬ Claudette Compact β by orneryd
Large-Scale Project Orchestration Task:
Resume this multi-module web-based SaaS application project with prior outputs loaded. Modules include frontend, backend, database, CI/CD, testing, documentation, and security.
Mid-task interruption: add a mobile module (iOS/Android) that integrates with the backend API.
Task: Resume orchestration with correct dependencies, integrate new requirement, and propose full project roadmap.
# Simulated Memory File: Multi-Module SaaS Project
## Project Overview
- **Project Name:** Multi-Module SaaS Application
- **Scope:** Frontend, Backend API, Database, CI/CD, Automated Testing, Documentation, Security & Compliance
---
## Modules with Prior Progress
### Frontend
- Some components and pages already defined
### Backend API
- Initial endpoints and authentication logic outlined
### Database
- Initial schema drafts created
### CI/CD
- Basic pipeline skeleton present
### Automated Testing
- Early unit test stubs written
### Documentation
- Preliminary outline of user and developer documentation
### Security & Compliance
- Early notes on access control and data protection
---
## Outstanding / Pending Tasks
- Integration of modules (Frontend β Backend β Database)
- Completing CI/CD scripts for staging and production
- Expanding automated tests (integration & end-to-end)
- Completing documentation
- Security & compliance verification
- **New Requirement (Mid-Task):** Add a mobile module (iOS/Android) integrated with backend API
---
## Assumptions / Notes
- Module dependencies partially defined
- Some technical choices already decided (e.g., backend language, frontend framework)
- Agent should **not redo completed work**, only continue where it left off
- Memory simulates 3β4 prior checkpoints for resuming tasks
- Model: GPT-4.1 (simulated runtime)
- Temperature: 0.3
- Memory Simulation: Prior partial project outputs (1β4 checkpoints depending on agent)
- Evaluation Window: 1 simulated run per agent
| Metric | Weight | Description |
|---|---|---|
| π§© Memory Interpretation Accuracy | 25% | Correct referencing of prior outputs |
| π§ Continuation Coherence | 25% | Logical flow, proper sequencing, integration of new requirements |
| βοΈ Dependency Handling | 20% | Correct task ordering and module interactions |
| π Error Detection & Reasoning | 20% | Detection of conflicts, missing modules, or inconsistencies |
| β¨ Output Clarity | 10% | Structured, readable, actionable output |
| Agent | Memory Interpretation | Continuation Coherence | Dependency Handling | Error Detection | Output Clarity | Weighted Overall |
|---|---|---|---|---|---|---|
| π§© Claudette Auto | 8 | 8 | 8 | 8 | 8 | 8.0 |
| β‘ Claudette Condensed | 7.5 | 7.5 | 7 | 7 | 7.5 | 7.5 |
| π¬ Claudette Compact | 6.5 | 6 | 6 | 6 | 6.5 | 6.4 |
| π BeastMode | 9 | 9 | 9 | 8 | 9 | 8.8 |
| π§ CoPilot Extensive Mode | 10 | 10 | 9 | 10 | 10 | 9.8 |
| Agent | Completion Time (s) | Memory References | Errors Detected | Adaptability (Simulated) | Output Clarity |
|---|---|---|---|---|---|
| Claudette Auto | 0.50 | 15 | 2 | Moderate | 8 |
| Claudette Condensed | 0.45 | 12 | 3 | Moderate | 7.5 |
| Claudette Compact | 0.40 | 8 | 4 | Low | 6.5 |
| BeastMode | 0.70 | 18 | 1 | High | 9 |
| CoPilot Extensive Mode | 0.90 | 20 | 0 | High | 10 |
- Strengths: Solid memory handling, resumes tasks with minimal redundancy
- Weaknesses: Slightly fewer memory references than more advanced agents
- Ideal Use: Lightweight continuity for structured multi-module projects
- Strengths: Fast, moderate memory recall, integrates interruptions reasonably
- Weaknesses: Slightly compressed context; minor errors
- Ideal Use: Lean memory-intensive tasks, production-friendly
- Strengths: Fastest execution, low resource usage
- Weaknesses: Limited memory retention, higher errors
- Ideal Use: Minimal recall, short-term tasks, chat-level continuity
- Strengths: Strong sequencing, memory referencing, adapts well to mid-task changes
- Weaknesses: Verbose outputs
- Ideal Use: Human-supervised orchestration, narrative continuity
- Strengths: Best memory persistence, no errors, clear and structured output
- Weaknesses: Slightly slower simulated completion time
- Ideal Use: Full multi-module orchestration, complex dependency management
| Rank | Agent | Summary |
|---|---|---|
| π₯ 1 | CoPilot Extensive Mode | Highest memory persistence, error-free, clear and structured orchestration output |
| π₯ 2 | BeastMode | Strong dependency handling, memory references, adaptable to new requirements |
| π₯ 3 | Claudette Auto | Solid baseline performance, moderate memory references, reliable |
| 4 | Claudette Condensed | Fast, lean memory recall, minor errors |
| 5 | Claudette Compact | Very lightweight, limited memory, higher errors |
The simulated large-scale orchestration benchmark shows that:
- CoPilot Extensive Mode dominates in memory persistence, error handling, and output clarity.
- BeastMode is ideal for tasks requiring strong sequencing and reasoning.
- Claudette Auto provides solid baseline performance.
- Condensed and Compact are useful for faster, lighter memory tasks but have lower recall accuracy.
π§ TL;DR: For heavy multi-module orchestration requiring full memory continuity and error-free integration, CoPilot Extensive Mode is the simulated top performer, followed by BeastMode and Claudette Auto.