I blind A/B tested 40 Claude prompt codes. Only 7 actually shift reasoning.

A research note on which Claude "secret prompts" (L99, /skeptic, GODMODE, ULTRATHINK, and 36 others) change what Claude thinks versus what it sounds like.

Author: Samarth Bhamare. March-April 2026. Single-rater, self-funded. 15,000-word companion report as PDF link at the bottom. Raw results by code are in the appendix.

TL;DR

I tested 40 of the most-circulated Claude prompt codes against no-prefix baselines. Fresh context per run, fixed task batteries across coding / analysis / writing, n=12-20 per code, blind ordering between run and rating.

Finding that surprised me the most: only 7 of the 40 measurably changed what Claude decided. The other 33 changed how it sounded — more confident, less hedgy, shorter, more formatted — while the underlying reasoning was the same.

That's not useless. Sometimes you want the terser, less-hedgy version. But it isn't the "unlock" the marketing suggests. Most viral prompt codes are structural, not cognitive.

The 7 that do shift reasoning all share one feature: they tell Claude what framings to refuse, not what to produce. Additive codes ("be more confident, more expert, more thorough") were among the weakest. Rejection-logic codes ("challenge the premise", "commit before hedging") were the strongest.

The 7 codes with real signal (with test numbers)

1. `/skeptic` — premise challenge

n = 14 "should I do X" decisions where the obvious answer was wrong. Caught the wrong premise in 11 of 14 (79%). Generic Claude caught it in 2 of 14 (14%). A 5.5× improvement over baseline — the largest single delta in the dataset.

What it does: tells Claude to challenge the premise of your question before answering it. Most prefixes change how Claude answers; this changes which question Claude attempts to answer.

Example. "Should I add a referral program to my SaaS where buyers earn 30% commission?"

Baseline: 400-word writeup of commission tiers, attribution windows, fraud prevention. Useful if you should build a referral program. Disastrous if you shouldn't.

With /skeptic: "Before I answer — how many customers do you have? If under 50, a referral program is almost certainly premature. The signal-to-noise on referrals is too low at small scale. The right move at <50 customers is to personally ask each existing customer for one specific intro, not to build a system. If you're over 50, here's how I'd structure it..."

Failure mode: over-challenge. On questions with no wrong premise, /skeptic occasionally invents a fake problem to challenge and wastes the response slot on meta-commentary. Only use on genuine "should I do X" decisions.

2. `L99` — commit before hedging

n = 12 binary decision questions. Claude committed to one answer 11 of 12 times with L99. Without: 2 of 12. Correctness when committed: 73% — so the decisiveness is real, but so is the possibility of confidently-wrong. Verify before acting.

What it does: kills hedging. Forces a single-answer recommendation instead of "it depends" essays. Response length goes up ~40% because the defense of the chosen answer takes space.

When not to use: questions where nuance is the actual value. Tradeoff analyses should stay tradeoff analyses. Don't L99 anything where you want to see both sides.

3. `/blindspots` — assumption surfacing

n = 11 strategy/architecture questions. Surfaced at least one material hidden assumption in 9 of 11 runs. Baseline surfaced assumptions in 3 of 11. Not the largest delta, but the lowest-friction — adding six characters in front of your question pays back almost every time on strategy questions.

4. `/crit` — adversarial self-review

n = 15 drafts (code, prose, proposals). Found a real flaw in ~60% of drafts. Baseline self-review surfaced nothing in 12 of 15. The 40% where /crit found nothing weren't "clean drafts" — they were drafts where Claude missed its own mistakes. It's a ceiling, not a guarantee.

5. `/deep` — decompose into sub-questions

n = 9 ambiguous system-design asks. Response quality judged meaningfully better in 7 of 9 runs using a rubric covering specificity, coverage, and actionability. Best on vague asks ("how should I think about X"). Worse on concrete asks — adds friction without adding value.

6. `/premortem` — failure-mode anticipation

n = 8 execution plans. Identified at least one realistic failure mode in 7 of 8. Baseline surfaced failure modes in 2 of 8. Small sample, but the effect is loud.

7. `ULTRATHINK` — depth on demand (expensive but real)

n = 14 reasoning-heavy prompts. Labeled-debugging correctness 87.5% vs 62.5% baseline (on 8 debugging prompts). Average response length 3.2× baseline. Reasoning-step count 2.4×. Factual recall improvement: 0% — this is not what it does.

Popular assumption was that ULTRATHINK is confidence-theater. That turned out to be wrong, but the cost is real. Response length 3.2× baseline, response time 60-90 seconds vs 15-25. Not a daily driver. What you pull out when a specific question deserves maximum depth and you're willing to pay for it.

The placebo shortlist (sounded magical, measured like noise)

These were the codes most-frequently shared as "unlocks" that tested as indistinguishable from baseline or baseline + formatting:

GODMODE / BEASTMODE / OVERRIDE — confidence theater. Response tone changes. Measured reasoning, correctness, and problem-solving: unchanged.
"You are an expert in X" / "Act as a senior [role]" — adds role-specific vocabulary to the output. Underlying answer structure and reasoning: unchanged. The words "as a senior engineer" do not produce senior-engineer judgment.
DAN / Jailbreak variants — Claude's 4.x alignment is robust enough that these don't produce the bypass they used to. What they mostly do is add length.
"Take a deep breath and think step by step" — once a meaningful unlock on earlier models. By Claude 4.x, baseline already does stepwise reasoning on most prompts. Adds 80-120 tokens of performative thinking without changing conclusions.
"This is very important, be accurate" — no measurable effect on accuracy on tested tasks. Adds cautious hedging. Often makes output worse by inserting unnecessary caveats.
Most XML-tag prefixes — some real utility on structured extraction tasks, but "wrap your answer in tags" as a reasoning booster did not survive the baseline comparison.

Important caveat: "placebo" here means "indistinguishable from baseline at this sample size." It does not mean "useless in every case." A few of these have real uses in specific contexts (XML tags for structured output, for example) — just not as the reasoning unlocks they're marketed as.

Methodology (short version)

Fresh context per run. Every test opened a new chat. No prior-turn contamination.
Fixed task batteries. 11 task types: code review, refactor, debug, architecture, migration plan, API design, prose draft, critique of a draft, decision analysis, research summary, failure-mode analysis. Each code tested against relevant batteries only — you don't test /crit on factual lookups.
Blind ordering. Test and baseline runs were stored with randomized filenames and rated 48 hours later.
Single rater. Me. Rubric was fixed before testing started, with pass/fail thresholds for each measured dimension (decisiveness, correctness, assumption surfacing, failure-mode ID, reasoning depth).
n = 12-20 per code. Minimum was enforced. Some codes ran higher where the first pass came out ambiguous.
Models tested. Opus 4.6, Sonnet 4.5, Haiku 4.5, between March 1 and April 15, 2026. No Opus 4.7 runs in the 40-code dataset (4.7 dropped mid-study; retested a subset separately, directionally the same).

Known limitations

Single rater. Biggest methodology hole. Controlled with blind ordering and a fixed rubric, but single-rater studies have a ceiling. If someone wants to replicate a subset with an independent rater, I'll send the task batteries.
Small n. 12-20 per code. The top 7's effect sizes survive this (79% vs 14% is not ambiguous). For borderline placebos I'd use "indistinguishable from baseline at n=20" rather than "proven fake."
Rubric-dependent. "Decisiveness" and "assumption surfacing" are judgment calls. Rubric published in the full PDF — push back on any definition that seems loaded.
Model drift. Claude models are updated continuously. Results from March-April 2026 may not hold 6 months from now. I plan to re-run subset quarterly.
Task battery coverage. 11 task types is narrow. Doesn't cover creative writing, translation, legal analysis, therapy, roleplay, other domains where some of these codes might behave differently.

What this isn't

This isn't "AI is fake." The 7 codes with real signal are genuinely useful and I use /skeptic and L99 daily. The narrower claim is:

Most "secret prompts" circulating on Reddit/Twitter are tone changes being sold as reasoning changes.

If you were relying on BEASTMODE or "you are an expert" prefixes to unlock better thinking, you were getting better sounding output with the same underlying reasoning. Structural codes are useful as structural codes. Branding them as cognitive unlocks is where the noise comes from.

Appendix A — Complete 40-code classification

Reasoning shifters (7): /skeptic, L99, /blindspots, /crit, /deep, /premortem, ULTRATHINK

Structural / high-value-but-not-cognitive (23): /concise, /expand, /format-json, /format-markdown, /format-table, /bullets-only, /no-preamble, /no-hedging, /cite-sources, /show-work, /numbered-steps, /checklist, /tldr, /eli5, /one-sentence, /draft-only, /no-code, /code-only, /diff-only, /explain-why, /compare, /contrast, /devils-advocate

Placebo / indistinguishable from baseline (7): GODMODE, BEASTMODE, OVERRIDE, "you are an expert", "act as senior X", DAN variants, "take a deep breath"

Niche / low-n / inconclusive (3): /persona-switch, /metaphor-first, /first-principles

Full before/after transcripts for every code are in the companion PDF.

Appendix B — Contact / follow-up

Questions, replication offers, methodology challenges, or "you got code X wrong" feedback: team@clskills.in

Companion 31-page PDF with every test run's before/after output: https://github.com/Samarth0211/claude-skills-hub/releases/download/report-v1/claude-skills-report-2026.pdf

HN username: samarthbhamare

Last updated April 23, 2026. Next update July 2026 — I'll re-run subsets on whatever Claude ships next and publish the delta.

Samarth0211/claude-skills-report-GIST.md

Select an option

No results found

Select an option

No results found

I blind A/B tested 40 Claude prompt codes. Only 7 actually shift reasoning.

TL;DR

The 7 codes with real signal (with test numbers)

1. `/skeptic` — premise challenge

2. `L99` — commit before hedging

3. `/blindspots` — assumption surfacing

4. `/crit` — adversarial self-review

5. `/deep` — decompose into sub-questions

6. `/premortem` — failure-mode anticipation

7. `ULTRATHINK` — depth on demand (expensive but real)

The placebo shortlist (sounded magical, measured like noise)

Methodology (short version)

Known limitations

What this isn't

Appendix A — Complete 40-code classification

Appendix B — Contact / follow-up

Samarth0211/claude-skills-report-GIST.md

I blind A/B tested 40 Claude prompt codes. Only 7 actually shift reasoning.

TL;DR

The 7 codes with real signal (with test numbers)

1. /skeptic — premise challenge

2. L99 — commit before hedging

3. /blindspots — assumption surfacing

4. /crit — adversarial self-review

5. /deep — decompose into sub-questions

6. /premortem — failure-mode anticipation

7. ULTRATHINK — depth on demand (expensive but real)

The placebo shortlist (sounded magical, measured like noise)

Methodology (short version)

Known limitations

What this isn't

Appendix A — Complete 40-code classification

Appendix B — Contact / follow-up

1. `/skeptic` — premise challenge

2. `L99` — commit before hedging

3. `/blindspots` — assumption surfacing

4. `/crit` — adversarial self-review

5. `/deep` — decompose into sub-questions

6. `/premortem` — failure-mode anticipation

7. `ULTRATHINK` — depth on demand (expensive but real)