@danmoseley
Last active March 11, 2026 17:54
NRT Migration A/B Experiment: Copilot with vs without migrate-nullable-references skill

NRT Migration Skill Experiment

How much does a Copilot skill help with C# nullable reference type migration — and which parts of the skill actually matter?

Setup

Codebase: System.Diagnostics.EventLog from dotnet/runtime, extracted as a standalone buildable project targeting net10.0-windows.

| Metric | Value |
| --- | --- |
| Source files | 65 (`.cs`) |
| Lines of code | ~9,000 |
| Target | net10.0-windows (.NET 11 preview SDK) |
| Initial state | `<Nullable>disable</Nullable>`, zero nullable annotations |
| Repo | danmoseley/eventlog-nrt |
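
For concreteness, the migration's first move is flipping one MSBuild property in the project file. A minimal sketch (the `TargetFramework` and `Nullable` properties are real MSBuild settings; the surrounding project structure is illustrative, not the actual eventlog-nrt csproj):

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net10.0-windows</TargetFramework>
    <!-- Initial state: nullable analysis off, zero annotations in source -->
    <!-- <Nullable>disable</Nullable> -->

    <!-- The migration task: enable analysis, then resolve every warning -->
    <Nullable>enable</Nullable>
  </PropertyGroup>
</Project>
```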

Branches (all diffable against each other):

Cross-comparisons: no-skill vs full-skill · full-skill vs human · wisdom vs full-skill · docs-recap vs no-skill · wisdom vs human

Protocol: Each agent was a separate Copilot CLI session (Claude Opus 4.6) in its own git worktree, given the same base prompt: enable NRTs and resolve all warnings using `dotnet build`. The skill content (if any) was pasted into the first message. All agents had iterative compiler feedback.

Comparison baseline: Human PR dotnet/runtime#119891 by RenderMichael (same codebase, same task).

The Four Skill Variants

| Variant | Size | Content |
| --- | --- | --- |
| No skill | 0 | Base model only |
| Docs recap | ~3KB / ~800 tokens | Condensed NRT docs: warning table, attributes list, workflow steps |
| Wisdom only | ~3.5KB / ~870 tokens | Anti-patterns, decision flowchart, `!`-elimination patterns, commit strategy |
| Full skill | ~35KB / ~9,000 tokens | Complete migrate-nullable-references skill from dotnet/skills |

The docs-recap and wisdom-only skills are available in the repo as `skill-docs-recap.md` and `skill-wisdom-only.md`.

Results

| Metric | No-Skill | Docs-Recap | Wisdom | Full-Skill | Human |
| --- | --- | --- | --- | --- | --- |
| Skill size (tokens) | 0 | ~800 | ~870 | ~9,000 | n/a |
| Files changed | 28 | 30 | 28 | 29 | 34 |
| `?` annotations | 319 | 317 | 349 | 347 | 404 |
| `!` operators | 119 | 112 | 100 | 100 | 55 |
| `null!` specifically | 21 | 22 | 5 | 5 | 5 |
| Nullable attributes | 1 | 1 | 1 | 1 | 20 |
| `Debug.Assert` | 6 | 5 | 6 | 6 | 16 |
| `?.` behavioral changes | 4 | 2 | 2 | 2 | 4 |
| `<WarningsAsErrors>` | no | yes | yes | yes | no |
| `parent` nullable? | no | no | yes | yes | no |
| Duration (min) | ~29 | ~31 | ~31 | ~29 | n/a |
| Build warnings | 0 | 0 | 0 | 0 | 0 |

All measured on the same standalone file set (no tests, no ref assemblies).

Key Findings

1. The "wisdom" content is the active ingredient

The wisdom-only skill (~870 tokens) produced results nearly identical to the full skill (~9,000 tokens) on every metric:

  • ! operators: 100 vs 100
  • null!: 5 vs 5
  • ? annotations: 349 vs 347
  • ?. changes: 2 vs 2
  • parent nullable: yes vs yes

At 1/10th the token cost, the wisdom-only skill matched the full skill.

2. Restating the docs adds almost nothing

The docs-recap skill (~800 tokens) performed nearly identically to no-skill:

  • ! operators: 112 vs 119
  • null!: 22 vs 21
  • ? annotations: 317 vs 319

The model already knows the NRT documentation from its training data. Reminding it of warning codes and attribute names had negligible effect.

3. What specifically helped

The wisdom skill's value came from three explicit rules:

  1. "Don't spray `!` / never return `null!`" → `null!` count: 5 (wisdom, full) vs 21-22 (no-skill, docs-recap). A 4x improvement.
  2. "Don't use `?.` as a quick fix" → `?.` behavioral changes: 2 (docs-recap, wisdom, full) vs 4 (no-skill). The docs-recap also helped here, likely because it flagged `?.` as a "typical fix" anti-pattern.
  3. Commit strategy + WarningsAsErrors → the docs-recap, wisdom, and full-skill agents all added `<WarningsAsErrors>nullable</WarningsAsErrors>`; only the no-skill agent did not.
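
As a hedged illustration of rules 1 and 2, here is the kind of fix each rule steers away from. The `Logger` class below is hypothetical, not taken from the EventLog source:

```csharp
#nullable enable
using System;

class Logger
{
    // Anti-pattern (rule 1): null! silences warning CS8618 but lies
    // about the invariant; the property can still be null at runtime.
    // public string Name { get; set; } = null!;

    // Preferred: the type admits the value can be null, so callers must check.
    public string? Name { get; set; }

    public void Log(string message)
    {
        // Anti-pattern (rule 2): ?. silently skips work when Name is null,
        // changing behavior instead of surfacing the design question.
        // Console.WriteLine(Name?.ToUpper() + ": " + message);

        // Preferred: decide explicitly what a null Name means here.
        string prefix = Name ?? "(unnamed)";
        Console.WriteLine(prefix + ": " + message);
    }
}
```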

4. Agents barely used nullable attributes, regardless of skill

All four agents used exactly 1 nullable attribute ([AllowNull] on EventLogTraceListener.Name). The human used 20 across 11 files.

This held true even for the docs-recap skill, which explicitly listed every nullable attribute with usage guidance. The knowledge was provided; the model didn't apply it.

Why? Adding ! is a local point fix — the compiler warns on line 450, you add ! on line 450. Adding [MemberNotNull(nameof(field))] is a cross-method optimization — you need to notice that a field is dereferenced at 8 call sites, trace back to a common initialization method, add the attribute there, and verify it resolves all 8 warnings. The model's iterative build-fix loop is optimized for point fixes. The human's architectural understanding of the code lets them see the pattern and apply the attribute once instead of ! eight times.
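
A sketch of that cross-method pattern. The `Session` class and its members are hypothetical stand-ins, not EventLog code; `[MemberNotNull]` is the real attribute from `System.Diagnostics.CodeAnalysis`:

```csharp
#nullable enable
using System.Diagnostics.CodeAnalysis;

class Session
{
    private string? _buffer;

    // Tells the compiler: after EnsureOpen returns, _buffer is non-null.
    [MemberNotNull(nameof(_buffer))]
    private void EnsureOpen() => _buffer ??= "ready";

    public int Use()
    {
        EnsureOpen();
        return _buffer.Length; // no warning, and no ! needed at any call site
    }
}
```

One attribute on the initialization method replaces a `!` at every dereference site, which is exactly the trade the human made 20 times and the agents made once.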

This is the largest quality gap between AI and human NRT migration today, and it appears to be a model capability limitation (cross-method reasoning under iterative build loops), not a knowledge gap.

5. One decision better than the human

Both the wisdom and full-skill agents made the parent field EventLog? (nullable), recognizing that static helper methods create EventLogInternal instances without a parent. The human PR kept parent non-nullable and used parent: null! at 3 construction sites — a reasonable pragmatic choice, but the nullable annotation more precisely reflects the runtime reality.
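
A minimal sketch of the two choices, with simplified signatures rather than the actual runtime code:

```csharp
#nullable enable

class EventLog { }

class EventLogInternal
{
    // Agent choice: the type records that static helpers pass no parent.
    public EventLog? Parent { get; }
    public EventLogInternal(EventLog? parent) => Parent = parent;

    static void StaticHelper()
    {
        // Agent style: honest null, checked at use sites.
        var a = new EventLogInternal(parent: null);

        // Human PR style: a non-nullable Parent field plus null! at the
        // three construction sites, e.g.:
        // var b = new EventLogInternal(parent: null!);
    }
}
```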

Implications for Skill Design

What works

  • Anti-pattern rules ("don't do X") — directly measurable impact
  • Decision flowcharts — helped agents make better type decisions
  • Process discipline (commit strategy, WarningsAsErrors) — followed when stated

What doesn't help

  • Restating documentation the model already knows — pure context tax
  • Attribute reference tables — knowledge provided but not applied
  • Lengthy explanations of concepts — the model understands NRTs from training

Optimal skill design for batch transformations

For tasks like NRT migration where the agent gets a full context window and iterates with compiler feedback, the ideal skill is:

  1. Short (~1-3KB) — minimize context tax
  2. Opinionated — "don't do X, prefer Y" rules the model wouldn't discover alone
  3. Process-oriented — workflow steps, commit boundaries, final validation
  4. No knowledge content — trust the model's training data

The full 35KB skill's value lived almost entirely in ~3KB of practical wisdom.

Broader taxonomy of skill value

| Category | Example | Durability |
| --- | --- | --- |
| Knowledge gap | "Here's what NRT attributes do" | Temporary: decays as docs improve and models retrain |
| Authority/ideology | "In this project, prefer `?` over `!`" | Durable, like a style guide |
| Workflow orchestration | "Build → fix → commit → repeat → add WarningsAsErrors" | Durable: process discipline |
| Private knowledge | "Our internal framework requires X pattern" | Permanent: never in training data |

The NRT skill is mostly category 1. The experiment shows that categories 2 and 3 are where the value lives. Category 4 (internal/proprietary knowledge) was not tested, but it is likely the strongest and most durable use case for skills.
