How much does a Copilot skill help with C# nullable reference type migration — and which parts of the skill actually matter?
Codebase: System.Diagnostics.EventLog from dotnet/runtime, extracted as a
standalone buildable project targeting net10.0-windows.
| Metric | Value |
|---|---|
| Source files | 65 (.cs) |
| Lines of code | ~9,000 |
| Target | net10.0-windows (.NET 11 preview SDK) |
| Initial state | <Nullable>disable</Nullable>, zero nullable annotations |
| Repo | danmoseley/eventlog-nrt |
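In MSBuild terms, the starting point and the goal of the task look like this (a minimal sketch using standard csproj properties; the actual project file has more content):

```xml
<!-- Before: nullable analysis off for the entire project -->
<PropertyGroup>
  <Nullable>disable</Nullable>
</PropertyGroup>

<!-- After: the migration flips the switch and resolves every resulting warning -->
<PropertyGroup>
  <Nullable>enable</Nullable>
</PropertyGroup>
```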
Branches (all diffable against each other):
- `main` — baseline (pre-NRT)
- `nrt_no_skill` — agent, no skill (round 1)
- `nrt_with_skill` — agent, full 35KB skill (round 1)
- `nrt_docs_recap` — agent, docs-recap skill, ~800 tokens (round 2)
- `nrt_wisdom_only` — agent, wisdom-only skill, ~870 tokens (round 2)
- `nrt_human` — human PR mapped to same repo
Cross-comparisons: no-skill vs full-skill · full-skill vs human · wisdom vs full-skill · docs-recap vs no-skill · wisdom vs human
Protocol: Each agent was a separate Copilot CLI session (Claude Opus 4.6) in
its own git worktree, given the same base prompt: enable NRTs and resolve all
warnings using `dotnet build`. The skill content (if any) was pasted into the
first message. All agents had iterative compiler feedback.
Comparison baseline: Human PR dotnet/runtime#119891 by RenderMichael (same codebase, same task).
| Variant | Size | Content |
|---|---|---|
| No skill | 0 | Base model only |
| Docs recap | ~3KB / ~800 tokens | Condensed NRT docs: warning table, attributes list, workflow steps |
| Wisdom only | ~3.5KB / ~870 tokens | Anti-patterns, decision flowchart, !-elimination patterns, commit strategy |
| Full skill | ~35KB / ~9,000 tokens | Complete migrate-nullable-references skill from dotnet/skills |
The docs-recap and wisdom-only skills are available in the repo as
`skill-docs-recap.md` and `skill-wisdom-only.md`.
| Metric | No-Skill | Docs-Recap | Wisdom | Full-Skill | Human |
|---|---|---|---|---|---|
| Skill size (tokens) | 0 | ~800 | ~870 | ~9,000 | — |
| Files changed | 28 | 30 | 28 | 29 | 34 |
| `?` annotations | 319 | 317 | 349 | 347 | 404 |
| `!` operators | 119 | 112 | 100 | 100 | 55 |
| `null!` specifically | 21 | 22 | 5 | 5 | 5 |
| Nullable attributes | 1 | 1 | 1 | 1 | 20 |
| `Debug.Assert` | 6 | 5 | 6 | 6 | 16 |
| `?.` behavioral changes | 4 | 2 | 2 | 2 | 4 |
| `<WarningsAsErrors>` | no | yes | yes | yes | no |
| `parent` nullable? | no | no | yes | yes | no |
| Duration (min) | ~29 | ~31 | ~31 | ~29 | — |
| Build warnings | 0 | 0 | 0 | 0 | 0 |
All measured on the same standalone file set (no tests, no ref assemblies).
The wisdom-only skill (~870 tokens) produced results nearly identical to the full skill (~9,000 tokens) on every metric:
- `!` operators: 100 vs 100
- `null!`: 5 vs 5
- `?` annotations: 349 vs 347
- `?.` changes: 2 vs 2
- `parent` nullable: yes vs yes
At 1/10th the token cost, the wisdom-only skill matched the full skill.
The docs-recap skill (~800 tokens) performed nearly identically to no-skill:
- `!` operators: 112 vs 119
- `null!`: 22 vs 21
- `?` annotations: 317 vs 319
The model already knows the NRT documentation from its training data. Reminding it of warning codes and attribute names had negligible effect.
The wisdom skill's value came from three explicit rules:
- "Don't spray `!` / never return `null!`" → `null!` count: 5 (wisdom, full) vs 21–22 (no-skill, docs). A 4x improvement.
- "Don't use `?.` as a quick fix" → `?.` changes: 2 (wisdom, full, docs) vs 4 (no-skill). The docs-recap also helped here, likely because it called this out as a typical-fix anti-pattern.
- Commit strategy + `WarningsAsErrors` → The wisdom, full-skill, and docs-recap agents all added `<WarningsAsErrors>nullable</WarningsAsErrors>`; only the no-skill agent did not. This rule was picked up from both the docs-recap and wisdom variants.
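The property the skilled agents added is a standard .NET SDK feature: the special value `nullable` promotes only the nullable-analysis warnings to errors, so later commits can't silently reintroduce them. A minimal csproj sketch:

```xml
<PropertyGroup>
  <Nullable>enable</Nullable>
  <!-- Promote only nullable warnings (the CS86xx family) to errors,
       so regressions fail the build instead of accumulating quietly -->
  <WarningsAsErrors>nullable</WarningsAsErrors>
</PropertyGroup>
```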
All four agents used exactly 1 nullable attribute (`[AllowNull]` on
`EventLogTraceListener.Name`). The human used 20 across 11 files.
This held true even for the docs-recap skill, which explicitly listed every nullable attribute with usage guidance. The knowledge was provided; the model didn't apply it.
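The one attribute all four agents did apply follows a well-known pattern. A minimal reconstruction (the real `TraceListener.Name` member is more involved; the class and field names here are illustrative):

```csharp
using System.Diagnostics.CodeAnalysis;

public class TraceListenerSketch
{
    private string _name = string.Empty;

    // [AllowNull] lets callers assign null to a non-nullable property;
    // the setter normalizes null away, so the getter never returns null.
    [AllowNull]
    public string Name
    {
        get => _name;
        set => _name = value ?? string.Empty;
    }
}
```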
Why? Adding `!` is a local point fix — the compiler warns on line 450, you
add `!` on line 450. Adding `[MemberNotNull(nameof(field))]` is a cross-method
optimization — you need to notice that a field is dereferenced at 8 call sites,
trace back to a common initialization method, add the attribute there, and verify
it resolves all 8 warnings. The model's iterative build-fix loop is optimized for
point fixes. The human's architectural understanding of the code lets them see the
pattern and apply the attribute once instead of `!` eight times.
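A hedged sketch of the contrast (hypothetical names; not code from the EventLog sources): with `[MemberNotNull]` on the shared initializer, every caller that invokes it before dereferencing the field is warning-free, with no `!` needed.

```csharp
using System.Diagnostics.CodeAnalysis;

public class Session
{
    private string? _buffer;

    // The attribute tells the compiler: after this method returns,
    // _buffer is non-null. Without it, every caller that dereferences
    // _buffer after EnsureOpen() gets CS8602, typically "fixed" as _buffer!.
    [MemberNotNull(nameof(_buffer))]
    private void EnsureOpen() => _buffer ??= string.Empty;

    public int ReadLength()
    {
        EnsureOpen();
        return _buffer.Length; // no warning: the attribute proves _buffer is set
    }
}
```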
This is the largest quality gap between AI and human NRT migration today, and it appears to be a model capability limitation (cross-method reasoning under iterative build loops), not a knowledge gap.
Both the wisdom and full-skill agents made the `parent` field `EventLog?`
(nullable), recognizing that static helper methods create `EventLogInternal`
instances without a parent. The human PR kept `parent` non-nullable and used
`parent: null!` at 3 construction sites — a reasonable pragmatic choice, but
the nullable annotation more precisely reflects the runtime reality.
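The two designs can be sketched as follows (an illustrative reconstruction, not the actual EventLog source):

```csharp
// Agent choice: the type admits null, matching how static helpers
// actually construct instances without a parent.
internal sealed class EventLogInternalSketch
{
    public EventLog? Parent { get; }

    public EventLogInternalSketch(EventLog? parent) => Parent = parent;
}

// Human-PR style: keep the field non-nullable and suppress at each
// parentless construction site instead:
//     new EventLogInternalSketch(parent: null!);
```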
- Anti-pattern rules ("don't do X") — directly measurable impact
- Decision flowcharts — helped agents make better type decisions
- Process discipline (commit strategy, `WarningsAsErrors`) — followed when stated
- Restating documentation the model already knows — pure context tax
- Attribute reference tables — knowledge provided but not applied
- Lengthy explanations of concepts — the model understands NRTs from training
For tasks like NRT migration where the agent gets a full context window and iterates with compiler feedback, the ideal skill is:
- Short (~1-3KB) — minimize context tax
- Opinionated — "don't do X, prefer Y" rules the model wouldn't discover alone
- Process-oriented — workflow steps, commit boundaries, final validation
- No knowledge content — trust the model's training data
The full 35KB skill's value lived almost entirely in ~3KB of practical wisdom.
| Category | Example | Durability |
|---|---|---|
| Knowledge gap | "Here's what NRT attributes do" | Temporary — decays as docs improve and models retrain |
| Authority/ideology | "In this project, prefer `?` over `!`" | Durable — like a style guide |
| Workflow orchestration | "Build → fix → commit → repeat → add WarningsAsErrors" | Durable — process discipline |
| Private knowledge | "Our internal framework requires X pattern" | Permanent — never in training data |
The NRT skill is mostly category 1. The experiment shows the value lives in categories 2 and 3. Category 4 (internal/proprietary knowledge) was not tested but is likely the strongest and most durable use case for skills.