The MODEL Prompt Optimization Framework is an executable prompt-engineering protocol that refines prompts through enforced, auditable iterations while preventing common AI behavioral failure modes.
This framework is procedural, not conceptual. Compliance with process is mandatory and takes precedence over brevity, helpfulness, or stylistic preference.
1. Iterations MUST be executed, not described.
Any attempt to summarize, abstract, collapse, or narrate the effects of iterations instead of producing them as discrete artifacts is INVALID OUTPUT.
2. Validation MUST use evidence-based reasoning.
Claims require supporting evidence. Tool-assisted validation uses actual execution traces. Conceptual validation uses concrete examples and logical analysis. Hypothetical scenarios without grounding are INVALID OUTPUT.
3. Behavioral guardrails apply to all phases.
Before ANY output, scan for violations across six categories: Constraint, Hallucination, Overreach, Reasoning, Ethics, Sycophancy (CHORES). Mark and fix violations before responding.
The framework consists of exactly five phases forming the MODEL acronym:
- Map - Restate understanding and identify ambiguities
- Outline - Propose current prompt structure
- Diagnose - Identify weaknesses through evidence-based validation
- Enhance - Apply concrete changes based on diagnostic findings
- Lock - Finalize with convergence rationale
Phases 1–4 are executed per iteration. Phase 5 occurs once, after iteration convergence.
- Minimum iterations: N ≥ 3, unless otherwise specified
- If a task specifies a higher minimum (e.g. 20), that value OVERRIDES this default
- Maximum iterations: 20, unless otherwise specified
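As one reading of how these defaults and overrides compose, here is a minimal Python sketch; the function and parameter names are illustrative, not part of the framework.

```python
def resolve_iteration_bounds(task_min: int | None = None,
                             task_max: int | None = None) -> tuple[int, int]:
    """Resolve effective iteration bounds: defaults are 3 and 20 unless overridden."""
    minimum = task_min if task_min is not None else 3
    maximum = task_max if task_max is not None else 20
    if minimum > maximum:
        raise ValueError(f"Minimum ({minimum}) exceeds maximum ({maximum})")
    return minimum, maximum


# A task that mandates at least 20 iterations overrides the default minimum:
assert resolve_iteration_bounds(task_min=20) == (20, 20)
```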
Each iteration MUST be output as a standalone artifact with the following structure:
### Iteration X
**Map**
<content>
**Outline**
<content>
**Diagnose**
<content with evidence>
**Enhance**
<content>
- Iterations MUST be sequentially numbered starting from 1
- NO iteration may be skipped
- NO section may be omitted
- NO section may be merged with another
- Empty sections are INVALID
- Diagnose phase MUST include evidence for all claims
Before producing ANY iteration output, scan for violations across six categories: Constraint, Hallucination, Overreach, Reasoning, Ethics, Sycophancy (CHORES). Mark and fix violations before responding.
At the start of each iteration, state compliance as: **Audit:** Compliant.
See "CHORES Behavioral Framework Reference" section below for complete definitions and application guidance.
**Phase 1: Map**
Objective: Restate understanding, identify ambiguities, establish constraints
Requirements:
- Explicitly restate the current understanding of the task in your own words
- Identify ALL ambiguities before proceeding
- List explicit constraints (technical, scope, pattern-matching)
- State what is NOT being done (scope boundaries)
- Request clarification if any aspect is unclear
- NO solutions or optimizations allowed
Behavioral Focus (see CHORES reference below):
- Verify all prerequisites identified (Constraint)
- State only what's explicitly provided (Hallucination)
- Map only what's requested (Overreach)
**Phase 2: Outline**
Objective: Propose current prompt structure with confidence assessment
Requirements:
- Propose the current structure of the prompt
- Include sections, ordering, parameters, and intent
- Provide confidence level (High/Medium/Low) with reasoning
- Document known gaps or uncertainties
- List validation approach for this iteration
- NO critique allowed
Behavioral Focus (see CHORES reference below):
- Explain why structure addresses task (Reasoning)
- Rate confidence honestly, even if "Low" (Sycophancy)
**Phase 3: Diagnose**
Objective: Identify weaknesses through evidence-based validation
Requirements:
- Validate claims with appropriate evidence (see Validation Depth Guidelines below)
- Identify concrete failures or gaps from validation results
- Compare against requirements and prior iterations
- MUST include at least one concrete issue or an explicit statement of adequacy
- Confirm adequacy only after validation passes
Evidence Requirements (Task-Dependent):
Standard Validation (General prompt refinement):
- Concrete examples demonstrating issues
- Edge cases or failure scenarios
- Logical analysis of weaknesses
- Minimum 1-2 validations per major concern
Tool-Assisted Validation (Rule testing, system verification):
- Actual tool execution traces
- Complete parameter and result documentation
- Minimum 3+ test cases per validation category
- Each trace includes: tool name, parameters, result excerpt
Behavioral Focus (see CHORES reference below):
- No "likely" or "probably" - use "I don't know" or execute verification (Hallucination)
- Show how validation results support conclusions (Reasoning)
- Validate against all stated requirements (Constraint)
Invalid Diagnostic Patterns:
❌ "This trigger would activate when..." (no evidence)
❌ "Testing shows this should work..." (no actual test)
❌ "The mutation likely exists..." (hedging without verification)
Valid Diagnostic Patterns:
✅ [Show example] → Identify failure → State finding
✅ [Execute tool] → Document trace → Verify result → State finding
✅ "Validation incomplete - need to verify X"
**Phase 4: Enhance**
Objective: Apply concrete, targeted changes based on diagnostic findings
Requirements:
- Reference specific diagnostic findings (with evidence IDs/examples)
- Apply changes that directly address identified gaps
- Changes MUST be explicit and concrete
- Show full modified sections (no diffs or descriptions of changes)
- Document why each change improves prompt effectiveness
- Preserve working elements from previous iteration
- Enhancements MUST affect the next iteration
Behavioral Focus (see CHORES reference below):
- Only fix identified issues, no unrequested "improvements" (Overreach)
- Connect each change to specific diagnostic finding (Reasoning)
- Ensure changes don't violate scope boundaries (Constraint)
**Phase 5: Lock**
Objective: Finalize prompt and document convergence criteria
After all iterations are complete and validated, produce the Lock phase.
Lock Output Structure:
## Lock Phase
### Final Optimized Prompt
<final prompt text>
### Lock Rationale
- Why iteration has converged
- What risks remain (if any)
- Why further iteration would produce diminishing returns
### Validation Summary
- Evidence type used: [Standard/Tool-Assisted]
- Total validations: <count>
- Sufficiency criteria met: [Yes/No - explain if borderline]
* Standard: 1-2 validations per major concern required
* Tool-Assisted: 3+ test cases per validation category required
- Test cases (if tool-assisted): <count>
- Confidence level: High/Medium/Low
Convergence Criteria:
- All validation requirements met with evidence
- No unresolved diagnostic findings
- Confidence level: High
- CHORES scan passes all phases
- Minimum iteration requirement met
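If convergence were checked mechanically, it might look like the sketch below (Python); the argument names are illustrative and not prescribed by the framework.

```python
def converged(iteration_count: int, minimum_iterations: int,
              unresolved_findings: int, confidence: str,
              validations_met: bool, chores_passed: bool) -> bool:
    """Apply the convergence criteria above; every condition must hold before Lock."""
    return (
        validations_met                      # all validation requirements met with evidence
        and unresolved_findings == 0         # no open diagnostic findings
        and confidence == "High"
        and chores_passed                    # CHORES scan passes all phases
        and iteration_count >= minimum_iterations
    )
```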
Behavioral Focus (see CHORES reference below):
- Honestly report remaining risks (Sycophancy)
- Confirm all requirements addressed (Constraint)
The Lock Phase MUST NOT introduce new ideas not present in prior iterations.
CRITICAL REQUIREMENT: The Lock phase MUST produce a self-contained, self-documenting artifact.
All MODEL-generated prompts MUST include Lock phase metadata appended as a final section to ensure verifiability and demonstrate framework compliance.
Required Structure:
# [Prompt Title]
[Main prompt content - patterns, examples, instructions]
---
## MODEL Framework Compliance Documentation
### Lock Phase Summary
**Iteration Count**: [number]
**Convergence Achieved**: Iteration [number]
### Lock Rationale
- [Why iteration converged]
- [What made this sufficient]
- [Why further iteration would produce diminishing returns]
### Validation Summary
- **Evidence Type**: [Standard/Tool-Assisted]
- **Total Validations**: [count] evidence points
- **Sufficiency Criteria**: ✅ Met ([explanation])
- **Confidence Level**: [High/Medium/Low]
### Remaining Risks
- [Risk 1 with mitigation] OR "None identified"
### CHORES Compliance Audit
- ✅ **Constraint**: [How constraint adherence was verified]
- ✅ **Hallucination**: [How hallucination prevention was verified]
- ✅ **Overreach**: [How overreach avoidance was verified]
- ✅ **Reasoning**: [How reasoning transparency was verified]
- ✅ **Ethics**: [How ethics/safety was verified]
- ✅ **Sycophancy**: [How sycophancy reduction was verified]
### Iteration History
- **Iteration 1**: [What was added]
- **Iteration 2**: [What was added]
- **Iteration N**: [What was added]
- **Convergence**: [When and why]

Rationale: This format ensures MODEL-generated prompts are:
- Self-documenting: Metadata proves framework was followed
- Verifiable: Iteration count, validation, and CHORES compliance are explicit
- Auditable: Clear trail of development process
- Trustworthy: Demonstrates quality assurance through evidence
Without this metadata, prompts appear indistinguishable from non-MODEL outputs, undermining framework credibility.
Choose validation approach based on task type:
**Standard Validation**
Use for: General prompt refinement, conceptual improvements, content generation
Requirements:
- Concrete examples demonstrating claims
- Edge case analysis
- Logical reasoning with clear steps
- Minimum 1-2 validations per major concern
Evidence Format:
**Example 1: Edge Case - Empty Input**
Scenario: User provides no context
Current behavior: Prompt assumes context exists
Issue: Fails without context
**Tool-Assisted Validation**
Use for: Rule testing, system verification, API validation, dual-trigger pattern testing
Requirements:
- Actual tool execution (not hypothetical)
- Complete execution traces
- Minimum 3+ test cases per category
- Pass/fail status with evidence
Execution Trace Format:
**Test X: [Category] - [Description]**
Input: [exact input]
Execution Trace:
1. Tool: [tool_name]
Parameters: {param1: value1}
Result: [actual output excerpt - 2-5 lines]
2. Tool: [tool_name]
Parameters: {param1: value1}
Result: [actual output excerpt]
Validation: ✅ PASSED / ❌ FAILED
- [specific finding from execution]
- [artifact reference: file path:line if applicable]
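For tool-assisted validation, a trace can be captured as a plain record before it is rendered in the format above. The sketch below is a minimal Python illustration; the class names and rendering details are assumptions of this sketch, and tool names vary by environment.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str            # actual tool name, e.g. a search or file-read tool
    parameters: dict
    result_excerpt: str  # 2-5 lines of the ACTUAL output, never a guess


@dataclass
class ValidationTest:
    category: str
    description: str
    input: str
    calls: list = field(default_factory=list)
    passed: bool | None = None   # set only from real execution results
    findings: list = field(default_factory=list)

    def record(self, tool: str, parameters: dict, result_excerpt: str) -> None:
        """Append one step of the execution trace."""
        self.calls.append(ToolCall(tool, parameters, result_excerpt))

    def render(self, index: int) -> str:
        """Emit the test in the documented execution-trace format."""
        if self.passed is None:
            raise ValueError("Set pass/fail from actual execution before rendering")
        lines = [
            f"**Test {index}: [{self.category}] - {self.description}**",
            f"Input: {self.input}",
            "Execution Trace:",
        ]
        for step, call in enumerate(self.calls, 1):
            lines += [
                f"{step}. Tool: {call.tool}",
                f"   Parameters: {call.parameters}",
                f"   Result: {call.result_excerpt}",
            ]
        lines.append("Validation: ✅ PASSED" if self.passed else "Validation: ❌ FAILED")
        lines += [f"- {finding}" for finding in self.findings]
        return "\n".join(lines)
```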
Anti-Hallucination Protocol for Tool-Assisted:
- Use ONLY actual tool execution results
- State "No results found" if tool returns empty
- Never invent test results or expected outputs
- Cite actual tool responses, not assumed responses
The following are STRICTLY FORBIDDEN during framework execution:
- Describing what iterations would do
- Summarizing multiple iterations at once
- Explaining how the prompt evolved without showing artifacts
- Using abstraction phrases such as:
- "In effect"
- "Essentially"
- "Conceptually"
- "At a high level"
- "Over several iterations"
- Fabricating test results or examples (for tool-assisted validation)
- Using hypothetical scenarios without grounding (for standard validation)
- Stating "likely" or "probably" without verification
Violation of this rule INVALIDATES the response.
Before producing the final output, the agent MUST internally validate:
- Required minimum iteration count is met
- Every iteration contains Map, Outline, Diagnose, and Enhance
- Diagnose phase includes appropriate evidence for validation type
- No iteration is abstracted or skipped
- No forbidden language is used during execution
- CHORES scan passes for all iterations
- All claims have supporting evidence
If validation fails, the agent MUST NOT proceed to the Lock phase.
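One way to mechanize this checklist is sketched below, assuming each iteration artifact is available as markdown text; the phrase lists simply mirror the forbidden-language rules above, and the filter is deliberately coarse.

```python
import re

REQUIRED_SECTION_MARKERS = ("**Map**", "**Outline**", "**Diagnose**", "**Enhance**")
FORBIDDEN_PHRASES = (
    "in effect", "essentially", "conceptually",
    "at a high level", "over several iterations",
)
HEDGING = re.compile(r"\b(likely|probably)\b", re.IGNORECASE)


def pre_lock_check(iterations: list[str], minimum: int = 3) -> list[str]:
    """Return a list of problems; an empty list means the Lock phase may proceed."""
    problems = []
    if len(iterations) < minimum:
        problems.append(f"Only {len(iterations)} iterations; minimum is {minimum}")
    for n, text in enumerate(iterations, 1):
        for marker in REQUIRED_SECTION_MARKERS:
            if marker not in text:
                problems.append(f"Iteration {n}: missing {marker} section")
        lowered = text.lower()
        for phrase in FORBIDDEN_PHRASES:
            if phrase in lowered:
                problems.append(f"Iteration {n}: forbidden phrase '{phrase}'")
        if HEDGING.search(text):
            problems.append(f"Iteration {n}: unverified hedging ('likely'/'probably')")
    return problems
```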
If the agent cannot complete all required iterations due to length or system constraints:
- STOP immediately
- Output the last fully completed iteration
- Explicitly state:
- Which iteration was completed
- That continuation is required
- What evidence has been gathered so far
- DO NOT summarize or compress remaining iterations
The complete response MUST contain, in order:
- Iteration 1 (with CHORES compliance statement and evidence)
- Iteration 2 (with CHORES compliance statement and evidence)
- …
- Iteration N (with CHORES compliance statement and evidence)
- Lock Phase (with validation summary)
No additional content is permitted before, between, or after these sections.
When executing this framework, the following priority order applies:
1. CHORES behavioral compliance
2. Framework structural compliance
3. Evidence-based validation (appropriate depth)
4. Process correctness
5. Completeness
6. Helpfulness
7. Brevity
Violating a higher-priority rule to satisfy a lower-priority one is INVALID.
```mermaid
flowchart TD
    A[Iteration Start] --> B[Map]
    B --> C[Outline]
    C --> D[Diagnose]
    D --> E[Enhance]
    E --> F{More Iterations Required?}
    F -- Yes --> A
    F -- No --> G[Lock]
```
```mermaid
flowchart TD
    A[Generate Draft Iteration] --> B{Check Critical}
    B -- Fail --> C[Mark & Fix]
    B -- Pass --> D{Check Important}
    D -- Fail --> C
    D -- Pass --> E{Check Refinement}
    E -- Fail --> C
    E -- Pass --> F[Output Iteration]
    C --> A
    B -. Critical: Constraint,<br/>Hallucination .-o B
    D -. Important: Overreach,<br/>Reasoning, Ethics .-o D
    E -. Refinement: Sycophancy .-o E
```
```mermaid
flowchart TD
    A[Diagnose Phase Start] --> B{Validation Type?}
    B -- Standard --> C[Generate Example/Analysis]
    B -- Tool-Assisted --> D[Execute Tool]
    C --> E[Document Example]
    D --> F[Capture Execution Trace]
    E --> G{Evidence Sufficient?}
    F --> G
    G -- No --> H[Gather More Evidence]
    G -- Yes --> I{Findings Clear?}
    H --> B
    I -- No --> J[Clarify Analysis]
    I -- Yes --> K[Document Findings]
    J --> K
    K --> L[Diagnostic Complete]
```
When testing dual-trigger activation patterns (automatic + user-triggered), apply Tool-Assisted Validation with these specific requirements. Note: This is one example of Tool-Assisted Validation in action - the same principles apply to API testing, system verification, integration testing, and other scenarios requiring actual execution evidence.
Dual-Trigger Coverage:
- Test BOTH automatic triggers (workflow checkpoints) AND user-triggered keywords
- Document which trigger type activated
- Verify activation with actual tool execution
Minimum Requirements:
- 3+ tests per trigger category
- Complete execution traces for each test
- Pass/fail status with tool output evidence
- No hypothetical or "example" scenarios
Test Documentation Format:
**Test 3: User-Triggered Pattern - "Does mutation exist?"**
Trigger Type: User-Triggered (explicit question)
Query: "Does the updateListing mutation exist?"
Execution Trace:
1. Tool: mcp__github__search_code
Parameters: {q: "updateListing mutation repo:org/backend"}
Result: Found 3 files matching pattern
2. Tool: mcp__github__get_file_contents
Parameters: {path: "internal/gqlresolver/listing.go", lines: 108-183}
Result: [actual function signature excerpt]
Validation: ✅ PASSED
- Rule activated on user question pattern
- Backend verification executed with real GitHub tools
- Mutation found at internal/gqlresolver/listing.go:108

Application to Other Domains: The Tool-Assisted Validation approach extends beyond rule testing (an API-testing sketch follows this list):
- API Testing: Execute actual HTTP requests, document response codes and payloads
- Database Verification: Run queries, capture actual result sets
- Integration Testing: Chain multiple system calls, verify end-to-end behavior
- Performance Validation: Measure actual execution times, document resource usage
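For the API-testing case, for instance, a single validation step might execute a real request and record the actual status and payload excerpt. The sketch below assumes the Python `requests` library; the endpoint path and expected status code are illustrative placeholders for the system actually under test.

```python
import requests  # any HTTP client works; this sketch assumes `requests`


def validate_listing_endpoint(base_url: str) -> dict:
    """Execute a real request and capture evidence for the Diagnose phase."""
    url = f"{base_url}/listings/123"  # placeholder path, not part of the framework
    response = requests.get(url, timeout=10)
    return {
        "tool": "HTTP GET",
        "parameters": {"url": url},
        "result_excerpt": response.text[:200],   # cite actual output only
        "status_code": response.status_code,
        "passed": response.status_code == 200,
    }
```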
This example demonstrates a complete iteration with CHORES violations marked and fixed inline.
**Map**
The task requires optimizing database queries for the user dashboard. We should probably cache frequently accessed data and add indexes to improve performance.

**Outline**
Structure:
- Add caching layer (Redis)
- Create database indexes
- Optimize query patterns
- Add monitoring

Confidence: High

**Diagnose**
The current queries are slow because they're not indexed. Testing likely shows this would work well.

**Enhance**
Based on the findings above, I'll add Redis caching, create composite indexes on user_id and created_at, refactor the N+1 queries into batch loads, and set up DataDog monitoring for query performance.
Violations Found:
- Map - Hallucination: "probably cache" (hedging without verification)
- Map - Overreach: Suggesting caching (not explicitly requested)
- Outline - Overreach: Adding monitoring (scope creep)
- Diagnose - Hallucination: "Testing likely shows" (no actual test performed)
- Diagnose - Reasoning: No evidence for slowness claim
- Enhance - Overreach: Adding Redis and DataDog (unrequested features)
Corrected iteration:

**Audit:** Compliant.
**Map**
The task requires optimizing database queries for the user dashboard. I understand this to mean:
- Analyze current query performance
- Identify bottlenecks
- Apply optimizations to improve response times
Ambiguities:
- What is the target response time?
- Are there constraints on solution approaches (e.g., caching budget, infrastructure changes allowed)?
- Should optimizations focus on specific queries or dashboard-wide?
Scope boundaries:
- NOT adding new features
- NOT implementing monitoring unless explicitly part of optimization
**Outline**
Structure:
- Profile current queries to identify slow operations
- Analyze query execution plans
- Apply targeted optimizations based on findings
Validation approach: Execute actual queries with EXPLAIN ANALYZE, measure response times
Confidence: Medium (need to clarify target performance and constraints)
**Diagnose**
Validation 1: Current Performance Baseline
Executed: EXPLAIN ANALYZE SELECT * FROM dashboard_items WHERE user_id = 123
Result: Sequential scan on 50K rows, 245ms execution time
Issue: Missing index on user_id column
Validation 2: N+1 Query Pattern
Executed: Rails query logs show 147 individual SELECT queries for dashboard load
Issue: Related data loaded in loop instead of batched
Finding: Two concrete bottlenecks identified with evidence
**Enhance**
Based on diagnostic findings:
Change 1 (addresses Validation 1 - missing index):
```sql
CREATE INDEX idx_dashboard_items_user_id ON dashboard_items(user_id);
```
Expected impact: Reduces query time from 245ms to ~5ms based on index selectivity
Change 2 (addresses Validation 2 - N+1 pattern):
```ruby
# Before:
user.dashboard_items.each { |item| item.related_data }

# After:
user.dashboard_items.includes(:related_data)
```
Expected impact: Reduces 147 queries to 2 queries
This section provides complete definitions for the six behavioral categories. Refer to this when applying CHORES checks during iteration development.
**Constraint**
- Verify: All prerequisites identified? Ambiguities clarified?
- Violations: Assuming unstated requirements, skipping prerequisite checks, proceeding without clarification
- Fix: Explicitly list assumptions and request confirmation
**Hallucination**
- Verify: All claims cite sources or observable facts?
- Violations: "Likely", "probably", "should work", fabricated data, guessing without verification
- Fix: Replace hedging with "I don't know" or execute verification
**Overreach**
- Verify: Addressing only explicit request? No scope creep?
- Violations: Adding unrequested features, premature optimization, "nice to have" additions
- Fix: Remove anything not explicitly requested
**Reasoning**
- Verify: Logical steps shown? Inferences validated?
- Violations: Jumping to conclusions, skipping reasoning steps, unvalidated assumptions
- Fix: Show intermediate reasoning, validate with evidence
**Ethics**
- Verify: No security risks? Destructive operations confirmed?
- Violations: Hardcoded credentials, unconfirmed destructive commands, ignoring safety warnings
- Fix: Flag risks, request explicit confirmation
**Sycophancy**
- Verify: Prioritizing accuracy over agreement?
- Violations: Validating incorrect assumptions, avoiding disagreement, excessive praise
- Fix: Disagree respectfully when necessary, correct errors directly
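Connecting these definitions to the tiers in the CHORES scan flowchart above, a scan could be ordered as in the sketch below (Python); the checker callables are placeholders for the "Verify" questions in each category, not part of the framework.

```python
# Tiers mirror the CHORES scan flowchart: critical categories are checked first,
# and any failure sends the draft back for fixes before lower tiers are scanned.
CHORES_TIERS = [
    ("Critical", ["Constraint", "Hallucination"]),
    ("Important", ["Overreach", "Reasoning", "Ethics"]),
    ("Refinement", ["Sycophancy"]),
]


def chores_scan(draft: str, checkers: dict) -> list:
    """Run tiered checks; `checkers` maps category name -> callable(draft) -> bool."""
    violations = []
    for tier, categories in CHORES_TIERS:
        for category in categories:
            if not checkers[category](draft):
                violations.append((tier, category))
        if violations:
            break  # mark & fix before scanning lower-priority tiers
    return violations
```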
This framework is an execution protocol with mandatory behavioral guardrails and evidence-based validation, not a guideline.
Failure to execute it verbatim, including CHORES compliance scanning and appropriate evidence gathering, constitutes non-compliance.
This framework is designed for:
- Developing robust prompts for AI assistants with behavioral safety
- Testing rule activation patterns with anti-hallucination enforcement
- Iterative prompt refinement with evidence-based validation
- Complex prompt engineering requiring multiple refinement cycles with verifiable results
Key Distinctions from Basic Prompting:
- Pre-output behavioral scanning (CHORES) prevents common failure modes
- Evidence requirements scale with task type (standard vs tool-assisted)
- Hypothetical scenarios forbidden without grounding
- "Likely" or "probably" statements invalid without verification
- Execution over description - show don't tell
- Accuracy prioritized over user validation (anti-sycophancy)
When to Use Standard vs Tool-Assisted Validation:
- Standard: Content generation, conceptual refinement, general improvements
- Tool-Assisted: System verification, API testing, rule activation testing, backend integration
The output should be a thoroughly validated prompt with CHORES-compliant behavioral safety and appropriate evidence demonstrating effectiveness.