For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Apply Anthropic's official Claude 4.6 prompt engineering best practices to the assessment pipeline — enable adaptive thinking, restructure system/user prompt split, optimize long-context ordering, and clean up the prompt.
Architecture: The methodology prompt moves from the user message into the system prompt (enabling prompt caching). The user message is reordered to data-first/instructions-last. Adaptive thinking is enabled for the assessment stage via api.py. A new v4 prompt file replaces v3 with XML tags, trimmed content, and positive framing.
Tech Stack: Python 3.9+, Anthropic SDK 0.64.0, Claude Opus 4.6
A prompt engineering review (saved as a gist) identified 12 improvements to the assessment prompt and pipeline. The critical issues: no adaptive thinking, inverted long-context ordering, thin system prompt, wasted tokens on pipeline descriptions, and no prompt caching. This plan implements all improvements in 5 tasks.
| File | Action | Responsibility |
|---|---|---|
| assessor/lib/api.py | Modify | Add thinking/output_config support, conditional temperature, system as str or list |
| assessor/lib/config.py | Modify | Add MAX_ASSESS_OUTPUT_TOKENS constant |
| assessor/prompts/vendor-risk-assessment-prompt-v4.md | Create | Rewritten methodology with XML tags, trimmed content, positive framing |
| assessor/lib/pipeline/assessment.py | Modify | Rebuild system prompt, reorder user message, enable adaptive thinking |
| assessor/prompts/vendor-risk-assessment-prompt-v3.md | Unchanged | Preserved as reference |
Files:

- Modify: assessor/lib/api.py:24-41 (send_message signature and kwargs)
- Modify: assessor/lib/api.py:53-58 (token usage logging)
- Step 1: Update `send_message` signature

Add thinking and output_config as optional params. Make system accept str | list[dict] for cache-controlled content blocks:

```python
# Note: on Python 3.9, `str | list[dict]` annotations require
# `from __future__ import annotations` at the top of the module.
def send_message(
    client,
    system: str | list[dict],
    messages: list[dict],
    model: str,
    max_tokens: int = MAX_OUTPUT_TOKENS,
    tools: list[dict] | None = None,
    thinking: dict | None = None,
    output_config: dict | None = None,
):
```

- Step 2: Make temperature conditional, add thinking/output_config to kwargs
When thinking is enabled, omit temperature (adaptive thinking manages its own reasoning). Otherwise keep temperature=0 for deterministic non-thinking stages:
```python
kwargs = {
    "model": model,
    "max_tokens": max_tokens,
    "system": system,
    "messages": messages,
}
if thinking:
    kwargs["thinking"] = thinking
else:
    kwargs["temperature"] = 0
if output_config:
    kwargs["output_config"] = output_config
if tools:
    kwargs["tools"] = tools
```

- Step 3: Update token usage logging
The response.usage object with thinking includes additional cache/thinking fields. Update the log line to show cache hits when present:
```python
if hasattr(response, "usage") and response.usage:
    usage = response.usage
    parts = [f"in={usage.input_tokens:,}", f"out={usage.output_tokens:,}", f"model={model}"]
    if getattr(usage, "cache_read_input_tokens", 0):
        parts.append(f"cache_read={usage.cache_read_input_tokens:,}")
    if getattr(usage, "cache_creation_input_tokens", 0):
        parts.append(f"cache_create={usage.cache_creation_input_tokens:,}")
    print(f" [tokens] {' '.join(parts)}")
```

- Step 4: Verify other callers are unaffected
Confirm that send_with_documents and send_with_web_search (which call send_message without thinking) still work — they don't pass thinking so they'll get temperature=0 as before. No changes needed to these functions.
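As a sketch only (the real helper bodies live in api.py and their exact arguments may differ), a call that omits `thinking` keeps the old deterministic behavior:

```python
# Hypothetical caller shape: no `thinking` argument is passed, so the
# else-branch in send_message sets temperature=0, same as before this change.
response = send_message(
    client,
    system=system_prompt,  # plain string, as these stages use today
    messages=[{"role": "user", "content": prompt}],
    model=model,
    tools=tools,  # e.g. document or web-search tool definitions passed through unchanged
)
```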
- Step 5: Commit

```bash
git add assessor/lib/api.py
git commit -m "feat: add adaptive thinking and prompt caching support to api.py"
```

Files:
- Modify: assessor/lib/config.py:39 (add constant)

- Step 1: Add MAX_ASSESS_OUTPUT_TOKENS
With adaptive thinking, max_tokens covers thinking and text combined. The assessment JSON runs ~20-30K tokens and thinking may use another 30-50K, so the combined budget can reach ~80K, well past the current 64K limit. Opus 4.6 supports up to 128K output tokens.
Add after line 39:

```python
MAX_ASSESS_OUTPUT_TOKENS = 128_000
```

- Step 2: Commit

```bash
git add assessor/lib/config.py
git commit -m "feat: add 128K output token limit for assessment stage with thinking"
```

Files:
- Create: assessor/prompts/vendor-risk-assessment-prompt-v4.md
This is the biggest task. The v4 prompt applies all prompt engineering improvements to the methodology content. This file will be loaded as the system prompt by assessment.py.
- Step 1: Write the v4 prompt
Key changes from v3:
- Remove pipeline description (v3 lines 12-83) — replace with 5-line artifact description
- Remove execution summary diagram (v3 lines 493-514) — pure waste
- Add XML tags for major sections: `<role>`, `<source_artifacts>`, `<inputs>`, `<four_decision_process>`, `<citation_standard>`, `<report_structure>`, `<style_rules>`, `<quality_check>`
- Rephrase negatives as positives:
  - "Do not perform independent document reads" → "Work exclusively from the evidence bank and web research"
  - "Do not stop the assessment if documents are insufficient" → "Continue with available evidence, noting gaps as findings"
  - "Do not include: Background context..." → "Include only information that directly informs a risk decision"
  - "Never pad" → "For control areas with no findings, state 'Satisfactory. No issues identified.' and move on"
- Dial back aggressive emphasis:
  - "STRICT WHITELIST" → "Only SOC-xx documents may be cited in SOC 2 sections"
  - Remove caps from "IMPORTANT ENUM VALUES", "Do NOT"
  - Use normal firm language throughout
- Condense 3-pass QA (v3 lines 453-488) into a single self-check block:

  Before finalizing, verify:
  1. Every citation references a document in the register and the page/section supports the claim
  2. The recommendation aligns with residual risk findings — multiple High residual risks cannot produce "approve without conditions"
  3. Every control area in the framework was evaluated — add "Insufficient Evidence" for any gaps
  4. No vendor credit was given on vague evidence — downgrade to "Satisfactory with Observations" where evidence is thin

- Add "why" adjacent to key constraints — e.g., "Maximum 10 sentences — the risk manager needs the verdict in under 60 seconds"
- Deduplicate style rules — remove rules 2 and 8 which restate the citation standard
The full v4 content should be written in one step. The file is the complete methodology from `<role>` through `<quality_check>`, approximately 350-400 lines (down from v3's 514); a skeleton sketch follows.
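A minimal skeleton, as a sketch only (section bodies are the rewritten v3 content described above; placeholder comments mark content not specified here):

```markdown
<role>
<!-- rewritten role statement (replaces the old 4-line system prompt) -->
</role>

<source_artifacts>
<!-- ~5-line description of the evidence bank, web research, and document register -->
</source_artifacts>

<inputs>...</inputs>

<four_decision_process>...</four_decision_process>

<citation_standard>...</citation_standard>

<report_structure>
<!-- report sections 1-5, carried over from v3 -->
</report_structure>

<style_rules>...</style_rules>

<quality_check>
Before finalizing, verify:
1. Every citation references a document in the register and the page/section supports the claim
2. The recommendation aligns with residual risk findings
3. Every control area in the framework was evaluated
4. No vendor credit was given on vague evidence
</quality_check>
```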
- Step 2: Verify v4 file is parseable and complete
Quick check: ensure all v3 report structure sections (1-5) are present in v4, all citation prefixes are listed, all risk levels are defined, Four-Decision Process has all 4 decisions.
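A first-pass check of the XML structure can be scripted; the rest (citation prefixes, risk levels, the four decisions) stays a manual read. A sketch, assuming the tags listed in Step 1 appear literally in the file and the script runs from the repo root:

```python
from pathlib import Path

# Path taken from this task's file list.
v4 = Path("assessor/prompts/vendor-risk-assessment-prompt-v4.md").read_text(encoding="utf-8")

# Tag list mirrors the "Add XML tags for major sections" bullet in Step 1.
required_tags = [
    "role", "source_artifacts", "inputs", "four_decision_process",
    "citation_standard", "report_structure", "style_rules", "quality_check",
]
missing = [t for t in required_tags if f"<{t}>" not in v4 or f"</{t}>" not in v4]
print("missing tags:", missing or "none")
```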
- Step 3: Commit

```bash
git add assessor/prompts/vendor-risk-assessment-prompt-v4.md
git commit -m "feat: create v4 assessment prompt with prompt engineering best practices"
```

Files:
- Modify: assessor/lib/pipeline/assessment.py:193-300 (build_assessment_prompt)
- Modify: assessor/lib/pipeline/assessment.py:602-635 (run function)
- Modify: assessor/lib/config.py:41 (update PROMPT_TEMPLATE_PATH)

- Step 1: Update config.py to point to v4
```python
PROMPT_TEMPLATE_PATH = PROMPTS_DIR / "vendor-risk-assessment-prompt-v4.md"
```

- Step 2: Rebuild `run()` system prompt
The system prompt becomes: methodology (from v4 file) + JSON_SCHEMA_INSTRUCTIONS, wrapped in a list with cache_control for prompt caching:
```python
def run(client, evidence_path, research_path, methodology_path, ...):
    methodology = methodology_path.read_text(encoding="utf-8")
    system_prompt = [
        {
            "type": "text",
            "text": methodology + "\n\n" + JSON_SCHEMA_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        }
    ]
```

Remove the old 4-line system prompt string. The role is now in the v4 methodology file.
- Step 3: Rebuild `build_assessment_prompt()` — data first, instruction last
Remove methodology_path parameter. Reorder to: evidence bank → web research → document register → scope → task instruction.
```python
def build_assessment_prompt(
    evidence_path: Path,
    research_path: Path,
    customer_name: str,
    vendor_name: str,
    vendor_product: str,
    register_path: Path | None = None,
    assessment_context: dict | None = None,
) -> str:
    # ... load evidence, research, register as before ...
    # Data first (longform content at top per Anthropic best practice)
    # Then scope constraints
    # Then task instruction at the very end
    return f"""\
<evidence_bank>
{evidence}
</evidence_bank>

<web_research>
{research}
</web_research>

{register_section}

{scope_section}

---

Produce a vendor risk assessment for:

- **Customer:** {customer_name}
- **Vendor:** {vendor_name}
- **Product:** {vendor_product}

Apply the Four-Decision Process. Every factual claim must cite a source from the evidence bank or \
web research using document register codes. Execute the quality check before finalizing. \
Include the document_register array in the output JSON.
"""
```

- Step 4: Update `run()` to pass thinking config and increased max_tokens
```python
from lib.config import MODEL_ASSESS, MAX_ASSESS_OUTPUT_TOKENS

# ... in run() ...
response = api.send_message(
    client, system_prompt,
    [{"role": "user", "content": prompt}],
    MODEL_ASSESS, MAX_ASSESS_OUTPUT_TOKENS,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
)
```

Update the `build_assessment_prompt` call to remove the methodology_path arg.
- Step 5: Update the truncation error message
The error at line 640 references MAX_OUTPUT_TOKENS. Update to reference MAX_ASSESS_OUTPUT_TOKENS:
```python
if response.stop_reason == "max_tokens":
    ...
    raise RuntimeError(
        f"Assessment output was truncated (hit {MAX_ASSESS_OUTPUT_TOKENS} token limit). "
        ...
    )
```

- Step 6: Commit

```bash
git add assessor/lib/pipeline/assessment.py assessor/lib/config.py
git commit -m "feat: restructure assessment to use system prompt + adaptive thinking"
```

- Step 1: Dry-run syntax check
```bash
cd /Users/chas/code/tprm-pipeline
python3 -c "from assessor.lib.pipeline import assessment; print('import OK')"
python3 -c "from assessor.lib import api; print('import OK')"
```

- Step 2: Run a real assessment
Pick an existing customer/vendor pair from database.json and run the pipeline. Check:
- Assessment completes without errors
- `[tokens]` log line shows `cache_create` on first run
- Output JSON passes validation
- No `stop_reason: max_tokens` truncation
- Assessment quality is at least as good as v3 output

```bash
cd /Users/chas/code/tprm-pipeline
python3 assessor/assess.py --customer <id> --vendor <id> --stage assessment
```

- Step 3: Verify prompt caching works
Run the same assessment a second time within 5 minutes. Check that the [tokens] log shows cache_read instead of cache_create. This confirms the system prompt is being cached.
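For example (field names follow the Step 3 logging code; the numbers and model string are placeholders):

```bash
# Re-run the same assessment within the 5-minute cache window:
python3 assessor/assess.py --customer <id> --vendor <id> --stage assessment
# Expected log shape: the cache_create field from the first run is replaced
# by cache_read on the second run, e.g.
#   [tokens] in=1,234 out=27,890 model=<MODEL_ASSESS> cache_read=45,678
```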
- Step 4: Compare output quality
Diff the v4 assessment output against the existing v3 output for the same vendor. Check that:
- All report sections are present
- Citations use register codes
- SOC 2 sections only cite SOC-xx documents
- Risk ratings are consistent
- No obvious hallucinations or gaps

- Step 5: Final commit

```bash
git add -A
git commit -m "docs: verify v4 prompt engineering improvements"
```