How much does a Copilot skill help with C# nullable reference type migration — and which parts of the skill actually matter?
Codebase: System.Diagnostics.EventLog from dotnet/runtime, extracted as a
standalone buildable project targeting net10.0-windows.
| Metric | Value |
|---|---|
| Source files | 65 (.cs) |
| Lines of code | ~9,000 |
| Target | net10.0-windows (.NET 11 preview SDK) |
| Initial state | <Nullable>disable</Nullable>, zero nullable annotations |
| Repo | danmoseley/eventlog-nrt |
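In MSBuild terms, the starting point and the goal of the task look like this (a minimal sketch using standard csproj properties; the actual project file has more content):

```xml
<!-- Before: nullable analysis off for the entire project -->
<PropertyGroup>
  <Nullable>disable</Nullable>
</PropertyGroup>

<!-- After: the migration flips the switch and resolves every resulting warning -->
<PropertyGroup>
  <Nullable>enable</Nullable>
</PropertyGroup>
```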
Branches (all diffable against each other):
- `main` — baseline (pre-NRT)
- `nrt_no_skill` — agent, no skill (round 1)
- `nrt_with_skill` — agent, full 35KB skill (round 1)
- `nrt_docs_recap` — agent, docs-recap skill, ~800 tokens (round 2)
- `nrt_wisdom_only` — agent, wisdom-only skill, ~870 tokens (round 2)
- `nrt_human` — human PR mapped to same repo
Cross-comparisons: no-skill vs full-skill · full-skill vs human · wisdom vs full-skill · docs-recap vs no-skill · wisdom vs human
Protocol: Each agent was a separate Copilot CLI session (Claude Opus 4.6) in
its own git worktree, given the same base prompt: enable NRTs and resolve all
warnings using `dotnet build`. The skill content (if any) was pasted into the
first message. All agents had iterative compiler feedback.
Comparison baseline: Human PR dotnet/runtime#119891 by RenderMichael (same codebase, same task).
| Variant | Size | Content |
|---|---|---|
| No skill | 0 | Base model only |
| Docs recap | ~3KB / ~800 tokens | Condensed NRT docs: warning table, attributes list, workflow steps |
| Wisdom only | ~3.5KB / ~870 tokens | Anti-patterns, decision flowchart, !-elimination patterns, commit strategy |
| Full skill | ~35KB / ~9,000 tokens | Complete migrate-nullable-references skill from dotnet/skills |
The docs-recap and wisdom-only skills are available in the repo as
`skill-docs-recap.md` and `skill-wisdom-only.md`.
| Metric | No-Skill | Docs-Recap | Wisdom | Full-Skill | Human |
|---|---|---|---|---|---|
| Skill size (tokens) | 0 | ~800 | ~870 | ~9,000 | — |
| Files changed | 28 | 30 | 28 | 29 | 34 |
| `?` annotations | 319 | 317 | 349 | 347 | 404 |
| `!` operators | 119 | 112 | 100 | 100 | 55 |
| `null!` specifically | 21 | 22 | 5 | 5 | 5 |
| Nullable attributes | 1 | 1 | 1 | 1 | 20 |
| `Debug.Assert` | 6 | 5 | 6 | 6 | 16 |
| `?.` behavioral changes | 4 | 2 | 2 | 2 | 4 |
| `<WarningsAsErrors>` | no | yes | yes | yes | no |
| `parent` nullable? | no | no | yes | yes | no |
| Duration (min) | ~29 | ~31 | ~31 | ~29 | — |
| Build warnings | 0 | 0 | 0 | 0 | 0 |
All measured on the same standalone file set (no tests, no ref assemblies).
The wisdom-only skill (~870 tokens) produced results nearly identical to the full skill (~9,000 tokens) on every metric:
- `!` operators: 100 vs 100
- `null!`: 5 vs 5
- `?` annotations: 349 vs 347
- `?.` changes: 2 vs 2
- `parent` nullable: yes vs yes
At 1/10th the token cost, the wisdom-only skill matched the full skill.
The docs-recap skill (~800 tokens) performed nearly identically to no-skill:
- `!` operators: 112 vs 119
- `null!`: 22 vs 21
- `?` annotations: 317 vs 319
The model already knows the NRT documentation from its training data. Reminding it of warning codes and attribute names had negligible effect.
The wisdom skill's value came from three explicit rules:
- "Don't spray `!` / never return `null!`" → `null!` count: 5 (wisdom, full) vs 21–22 (no-skill, docs). A 4x improvement.
- "Don't use `?.` as a quick fix" → `?.` changes: 2 (wisdom, full, docs) vs 4 (no-skill). The docs-recap also helped here, likely because it called this out as a typical-fix anti-pattern.
- Commit strategy + `WarningsAsErrors` → The wisdom, full-skill, and docs-recap agents all added `<WarningsAsErrors>nullable</WarningsAsErrors>`; only the no-skill agent did not. This rule was picked up from both the docs-recap and wisdom variants.
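The property the skilled agents added is a standard .NET SDK feature: the special value `nullable` promotes only the nullable-analysis warnings to errors, so later commits can't silently reintroduce them. A minimal csproj sketch:

```xml
<PropertyGroup>
  <Nullable>enable</Nullable>
  <!-- Promote only nullable warnings (the CS86xx family) to errors,
       so regressions fail the build instead of accumulating quietly -->
  <WarningsAsErrors>nullable</WarningsAsErrors>
</PropertyGroup>
```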
All four agents used exactly 1 nullable attribute (`[AllowNull]` on
`EventLogTraceListener.Name`). The human used 20 across 11 files.
This held true even for the docs-recap skill, which explicitly listed every nullable attribute with usage guidance. The knowledge was provided; the model didn't apply it.
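The one attribute all four agents did apply follows a well-known pattern. A minimal reconstruction (the real `TraceListener.Name` member is more involved; the class and field names here are illustrative):

```csharp
using System.Diagnostics.CodeAnalysis;

public class TraceListenerSketch
{
    private string _name = string.Empty;

    // [AllowNull] lets callers assign null to a non-nullable property;
    // the setter normalizes null away, so the getter never returns null.
    [AllowNull]
    public string Name
    {
        get => _name;
        set => _name = value ?? string.Empty;
    }
}
```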
Why? Adding `!` is a local point fix — the compiler warns on line 450, you
add `!` on line 450. Adding `[MemberNotNull(nameof(field))]` is a cross-method
optimization — you need to notice that a field is dereferenced at 8 call sites,
trace back to a common initialization method, add the attribute there, and verify
it resolves all 8 warnings. The model's iterative build-fix loop is optimized for
point fixes. The human's architectural understanding of the code lets them see the
pattern and apply the attribute once instead of `!` eight times.
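A hedged sketch of the contrast (hypothetical names; not code from the EventLog sources): with `[MemberNotNull]` on the shared initializer, every caller that invokes it before dereferencing the field is warning-free, with no `!` needed.

```csharp
using System.Diagnostics.CodeAnalysis;

public class Session
{
    private string? _buffer;

    // The attribute tells the compiler: after this method returns,
    // _buffer is non-null. Without it, every caller that dereferences
    // _buffer after EnsureOpen() gets CS8602, typically "fixed" as _buffer!.
    [MemberNotNull(nameof(_buffer))]
    private void EnsureOpen() => _buffer ??= string.Empty;

    public int ReadLength()
    {
        EnsureOpen();
        return _buffer.Length; // no warning: the attribute proves _buffer is set
    }
}
```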
This is the largest quality gap between AI and human NRT migration today, and it appears to be a model capability limitation (cross-method reasoning under iterative build loops), not a knowledge gap.
Both the wisdom and full-skill agents made the `parent` field `EventLog?`
(nullable), recognizing that static helper methods create `EventLogInternal`
instances without a parent. The human PR kept `parent` non-nullable and used
`parent: null!` at 3 construction sites — a reasonable pragmatic choice, but
the nullable annotation more precisely reflects the runtime reality.
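The two designs can be sketched as follows (an illustrative reconstruction, not the actual EventLog source):

```csharp
// Agent choice: the type admits null, matching how static helpers
// actually construct instances without a parent.
internal sealed class EventLogInternalSketch
{
    public EventLog? Parent { get; }

    public EventLogInternalSketch(EventLog? parent) => Parent = parent;
}

// Human-PR style: keep the field non-nullable and suppress at each
// parentless construction site instead:
//     new EventLogInternalSketch(parent: null!);
```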
- Anti-pattern rules ("don't do X") — directly measurable impact
- Decision flowcharts — helped agents make better type decisions
- Process discipline (commit strategy, `WarningsAsErrors`) — followed when stated
- Restating documentation the model already knows — pure context tax
- Attribute reference tables — knowledge provided but not applied
- Lengthy explanations of concepts — the model understands NRTs from training
For tasks like NRT migration where the agent gets a full context window and iterates with compiler feedback, the ideal skill is:
- Short (~1-3KB) — minimize context tax
- Opinionated — "don't do X, prefer Y" rules the model wouldn't discover alone
- Process-oriented — workflow steps, commit boundaries, final validation
- No knowledge content — trust the model's training data
The full 35KB skill's value lived almost entirely in ~3KB of practical wisdom.
| Category | Example | Durability |
|---|---|---|
| Knowledge gap | "Here's what NRT attributes do" | Temporary — decays as docs improve and models retrain |
| Authority/ideology | "In this project, prefer `?` over `!`" | Durable — like a style guide |
| Workflow orchestration | "Build → fix → commit → repeat → add WarningsAsErrors" | Durable — process discipline |
| Private knowledge | "Our internal framework requires X pattern" | Permanent — never in training data |
The NRT skill is mostly category 1. The experiment shows the value lives in categories 2 and 3. Category 4 (internal/proprietary knowledge) was not tested but is likely the strongest and most durable use case for skills.