TPRM Pipeline Prompt Engineering Improvements

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Apply Anthropic's official Claude 4.6 prompt engineering best practices to the assessment pipeline — enable adaptive thinking, restructure system/user prompt split, optimize long-context ordering, and clean up the prompt.

Architecture: The methodology prompt moves from the user message into the system prompt (enabling prompt caching). The user message is reordered to data-first/instructions-last. Adaptive thinking is enabled for the assessment stage via api.py. A new v4 prompt file replaces v3 with XML tags, trimmed content, and positive framing.

Tech Stack: Python 3.9+, Anthropic SDK 0.64.0, Claude Opus 4.6


Context

A prompt engineering review (saved as a gist) identified 12 improvements to the assessment prompt and pipeline. The critical issues: no adaptive thinking, inverted long-context ordering, thin system prompt, wasted tokens on pipeline descriptions, and no prompt caching. This plan implements all improvements in 5 tasks.

File Structure

| File | Action | Responsibility |
| --- | --- | --- |
| assessor/lib/api.py | Modify | Add thinking/output_config support, conditional temperature, system as str \| list |
| assessor/lib/config.py | Modify | Add MAX_ASSESS_OUTPUT_TOKENS constant |
| assessor/prompts/vendor-risk-assessment-prompt-v4.md | Create | Rewritten methodology with XML tags, trimmed content, positive framing |
| assessor/lib/pipeline/assessment.py | Modify | Rebuild system prompt, reorder user message, enable adaptive thinking |
| assessor/prompts/vendor-risk-assessment-prompt-v3.md | Unchanged | Preserved as reference |

Task 1: Add adaptive thinking support to api.py

Files:

  • Modify: assessor/lib/api.py:24-41 (send_message signature and kwargs)

  • Modify: assessor/lib/api.py:53-58 (token usage logging)

  • Step 1: Update send_message signature

Add thinking, output_config as optional params. Make system accept str | list[dict] for cache-controlled content blocks:

def send_message(
    client,
    system: str | list[dict],
    messages: list[dict],
    model: str,
    max_tokens: int = MAX_OUTPUT_TOKENS,
    tools: list[dict] | None = None,
    thinking: dict | None = None,
    output_config: dict | None = None,
):
  • Step 2: Make temperature conditional, add thinking/output_config to kwargs

When thinking is enabled, omit temperature (adaptive thinking manages its own reasoning). Otherwise keep temperature=0 for deterministic non-thinking stages:

kwargs = {
    "model": model,
    "max_tokens": max_tokens,
    "system": system,
    "messages": messages,
}
if thinking:
    kwargs["thinking"] = thinking
else:
    kwargs["temperature"] = 0
if output_config:
    kwargs["output_config"] = output_config
if tools:
    kwargs["tools"] = tools
  • Step 3: Update token usage logging

With thinking and prompt caching enabled, response.usage includes additional thinking and cache fields. Update the log line to show cache activity when present:

if hasattr(response, "usage") and response.usage:
    usage = response.usage
    parts = [f"in={usage.input_tokens:,}", f"out={usage.output_tokens:,}", f"model={model}"]
    if getattr(usage, "cache_read_input_tokens", 0):
        parts.append(f"cache_read={usage.cache_read_input_tokens:,}")
    if getattr(usage, "cache_creation_input_tokens", 0):
        parts.append(f"cache_create={usage.cache_creation_input_tokens:,}")
    print(f"  [tokens] {' '.join(parts)}")
  • Step 4: Verify other callers are unaffected

Confirm that send_with_documents and send_with_web_search (which call send_message without thinking) still behave as before: they don't pass thinking, so they fall through to temperature=0. No changes are needed to these functions.
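
A minimal sketch of that unchanged path, assuming these helpers simply forward to send_message (their real signatures live in assessor/lib/api.py and may differ):

# Hypothetical shape of an existing caller in api.py; argument names are assumed.
# Because it never passes `thinking`, the updated send_message still sets
# temperature=0 for it, so behavior is identical to today.
def send_with_web_search(client, system, messages, model, tools=None):
    return send_message(client, system, messages, model, tools=tools)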

  • Step 5: Commit
git add assessor/lib/api.py
git commit -m "feat: add adaptive thinking and prompt caching support to api.py"

Task 2: Add assessment-specific output token limit to config.py

Files:

  • Modify: assessor/lib/config.py:39 (add constant)

  • Step 1: Add MAX_ASSESS_OUTPUT_TOKENS

With adaptive thinking, max_tokens covers thinking + text combined. Assessment JSON is ~20-30K tokens; thinking may use 30-50K. The current 64K is too tight. Opus 4.6 supports up to 128K output tokens.

Add after line 39:

MAX_ASSESS_OUTPUT_TOKENS = 128_000
  • Step 2: Commit
git add assessor/lib/config.py
git commit -m "feat: add 128K output token limit for assessment stage with thinking"

Task 3: Create v4 prompt file

Files:

  • Create: assessor/prompts/vendor-risk-assessment-prompt-v4.md

This is the biggest task. The v4 prompt applies all prompt engineering improvements to the methodology content. This file will be loaded as the system prompt by assessment.py.

  • Step 1: Write the v4 prompt

Key changes from v3:

  1. Remove pipeline description (v3 lines 12-83) — replace with 5-line artifact description
  2. Remove execution summary diagram (v3 lines 493-514) — pure waste
  3. Add XML tags for major sections: <role>, <source_artifacts>, <inputs>, <four_decision_process>, <citation_standard>, <report_structure>, <style_rules>, <quality_check>
  4. Rephrase negatives as positives:
    • "Do not perform independent document reads" → "Work exclusively from the evidence bank and web research"
    • "Do not stop the assessment if documents are insufficient" → "Continue with available evidence, noting gaps as findings"
    • "Do not include: Background context..." → "Include only information that directly informs a risk decision"
    • "Never pad" → "For control areas with no findings, state 'Satisfactory. No issues identified.' and move on"
  5. Dial back aggressive emphasis:
    • "STRICT WHITELIST" → "Only SOC-xx documents may be cited in SOC 2 sections"
    • Remove caps from "IMPORTANT ENUM VALUES", "Do NOT"
    • Use normal firm language throughout
  6. Condense 3-pass QA (v3 lines 453-488) into a single self-check block:
    Before finalizing, verify:
    1. Every citation references a document in the register and the page/section supports the claim
    2. The recommendation aligns with residual risk findings — multiple High residual risks cannot produce "approve without conditions"
    3. Every control area in the framework was evaluated — add "Insufficient Evidence" for any gaps
    4. No vendor credit was given on vague evidence — downgrade to "Satisfactory with Observations" where evidence is thin
    
  7. Add "why" adjacent to key constraints — e.g., "Maximum 10 sentences — the risk manager needs the verdict in under 60 seconds"
  8. Deduplicate style rules — remove rules 2 and 8 which restate the citation standard

The full v4 content should be written in one step. The file is the complete methodology from <role> through <quality_check>, approximately 350-400 lines (down from v3's 514).

  • Step 2: Verify v4 file is parseable and complete

Quick check: ensure all v3 report structure sections (1-5) are present in v4, all citation prefixes are listed, all risk levels are defined, and the Four-Decision Process includes all four decisions.
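
Part of this check can be scripted. A rough sketch, using the tag names from the section list above (adjust the list to whatever the final v4 file actually contains):

# Completeness spot-check for the v4 prompt. The required_tags list mirrors the
# XML sections named in this plan; extend it with citation prefixes, risk levels,
# and the four decision names once they are finalized.
from pathlib import Path

text = Path("assessor/prompts/vendor-risk-assessment-prompt-v4.md").read_text(encoding="utf-8")

required_tags = [
    "role", "source_artifacts", "inputs", "four_decision_process",
    "citation_standard", "report_structure", "style_rules", "quality_check",
]
missing = [tag for tag in required_tags if f"<{tag}>" not in text or f"</{tag}>" not in text]
print("missing tags:", missing or "none")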

  • Step 3: Commit
git add assessor/prompts/vendor-risk-assessment-prompt-v4.md
git commit -m "feat: create v4 assessment prompt with prompt engineering best practices"

Task 4: Restructure assessment.py — system prompt, user message, and thinking

Files:

  • Modify: assessor/lib/pipeline/assessment.py:193-300 (build_assessment_prompt)

  • Modify: assessor/lib/pipeline/assessment.py:602-635 (run function)

  • Modify: assessor/lib/config.py:41 (update PROMPT_TEMPLATE_PATH)

  • Step 1: Update config.py to point to v4

PROMPT_TEMPLATE_PATH = PROMPTS_DIR / "vendor-risk-assessment-prompt-v4.md"
  • Step 2: Rebuild run() system prompt

The system prompt becomes: methodology (from v4 file) + JSON_SCHEMA_INSTRUCTIONS, wrapped in a list with cache_control for prompt caching:

def run(client, evidence_path, research_path, methodology_path, ...):
    methodology = methodology_path.read_text(encoding="utf-8")
    system_prompt = [
        {
            "type": "text",
            "text": methodology + "\n\n" + JSON_SCHEMA_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        }
    ]

Remove the old 4-line system prompt string. The role is now in the v4 methodology file.

  • Step 3: Rebuild build_assessment_prompt() — data first, instruction last

Remove methodology_path parameter. Reorder to: evidence bank → web research → document register → scope → task instruction.

def build_assessment_prompt(
    evidence_path: Path,
    research_path: Path,
    customer_name: str,
    vendor_name: str,
    vendor_product: str,
    register_path: Path | None = None,
    assessment_context: dict | None = None,
) -> str:
    # ... load evidence, research, register as before ...

    # Data first (longform content at top per Anthropic best practice)
    # Then scope constraints
    # Then task instruction at the very end
    return f"""\
<evidence_bank>
{evidence}
</evidence_bank>

<web_research>
{research}
</web_research>
{register_section}
{scope_section}
---

Produce a vendor risk assessment for:
- **Customer:** {customer_name}
- **Vendor:** {vendor_name}
- **Product:** {vendor_product}

Apply the Four-Decision Process. Every factual claim must cite a source from the evidence bank or \
web research using document register codes. Execute the quality check before finalizing. \
Include the document_register array in the output JSON.
"""
  • Step 4: Update run() to pass thinking config and increased max_tokens
from lib.config import MODEL_ASSESS, MAX_ASSESS_OUTPUT_TOKENS

# ... in run() ...
response = api.send_message(
    client, system_prompt,
    [{"role": "user", "content": prompt}],
    MODEL_ASSESS, MAX_ASSESS_OUTPUT_TOKENS,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
)

Update the build_assessment_prompt call to remove methodology_path arg.
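
Sketch of the updated call site, assuming run() already holds these values in local variables (names follow the new signature; the real locals may differ):

# methodology_path is gone: the methodology now ships in the system prompt.
prompt = build_assessment_prompt(
    evidence_path,
    research_path,
    customer_name,
    vendor_name,
    vendor_product,
    register_path=register_path,
    assessment_context=assessment_context,
)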

  • Step 5: Update the truncation error message

The error at line 640 references MAX_OUTPUT_TOKENS. Update to reference MAX_ASSESS_OUTPUT_TOKENS:

if response.stop_reason == "max_tokens":
    ...
    raise RuntimeError(
        f"Assessment output was truncated (hit {MAX_ASSESS_OUTPUT_TOKENS} token limit). "
        ...
    )
  • Step 6: Commit
git add assessor/lib/pipeline/assessment.py assessor/lib/config.py
git commit -m "feat: restructure assessment to use system prompt + adaptive thinking"

Task 5: End-to-end verification

  • Step 1: Dry-run syntax check
cd /Users/chas/code/tprm-pipeline
python3 -c "from assessor.lib.pipeline import assessment; print('import OK')"
python3 -c "from assessor.lib import api; print('import OK')"
  • Step 2: Run a real assessment

Pick an existing customer/vendor pair from database.json and run the pipeline. Check:

  • Assessment completes without errors
  • [tokens] log line shows cache_create on first run
  • Output JSON passes validation
  • No stop_reason: max_tokens truncation
  • Assessment quality is at least as good as v3 output
cd /Users/chas/code/tprm-pipeline
python3 assessor/assess.py --customer <id> --vendor <id> --stage assessment
  • Step 3: Verify prompt caching works

Run the same assessment a second time within 5 minutes. Check that the [tokens] log shows cache_read instead of cache_create. This confirms the system prompt is being cached.

  • Step 4: Compare output quality

Diff the v4 assessment output against the existing v3 output for the same vendor (a small spot-check sketch follows this list). Check that:

  • All report sections are present

  • Citations use register codes

  • SOC 2 sections only cite SOC-xx documents

  • Risk ratings are consistent

  • No obvious hallucinations or gaps
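
Some of this can be spot-checked with a small script. A sketch only: the output paths and key names (document_register, code) are assumptions about the JSON schema, not confirmed field names:

# Side-by-side spot-check of v3 vs v4 output. Paths and key names are illustrative;
# substitute the real output locations and schema fields.
import json
from pathlib import Path

v3 = json.loads(Path("output/assessment-v3.json").read_text(encoding="utf-8"))
v4 = json.loads(Path("output/assessment-v4.json").read_text(encoding="utf-8"))

print("v3 top-level keys:", sorted(v3))
print("v4 top-level keys:", sorted(v4))

# Assumed shape: document_register is a list of dicts, each with a "code" entry.
v4_codes = {doc.get("code") for doc in v4.get("document_register", [])}
print("v4 register codes:", sorted(c for c in v4_codes if c))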

  • Step 5: Final commit

git add -A
git commit -m "docs: verify v4 prompt engineering improvements"