- Series: Rust (mir) compiler bites
- Channel: Regional Tantrums
- Demo repo: zst-in-mir-demo (this repo; dump the MIR with cargo rustc -- -Zunpretty=mir)
PhantomData has no runtime representation. It occupies zero bytes of memory. The compiler erases it completely before generating machine code.
That's the story, right? And it's true — at runtime.
But between the Rust you write and the machine code you get, there's MIR — Mid-level Intermediate Representation — and in MIR, PhantomData gets real operands, real types, and if you're building a backend that consumes MIR... real bugs.
Let's talk about exactly where Zero-Sized Types show up in MIR, why the compiler keeps them around, and the surprisingly subtle type-precision problem that bit me when I was building a MIR backend.
Let's start with the basics. A Zero-Sized Type — a ZST — is a Rust type that carries meaning at the type level but occupies exactly zero bytes at runtime.
| Pattern | Example | Purpose |
|---|---|---|
| PhantomData<T> | PhantomData<&'a T> | Lifetime / variance tracking |
| Unit structs | struct Locked; | Type-level tags |
| Empty structs | struct Marker {} | Typestate markers |
| Unit type | () | "No value" / void return |
| Never type | ! | Impossible values |
The most important one for today is PhantomData. It's everywhere in the standard library.
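If you want to check the "zero bytes" claim for each row of the table, a minimal sketch like this will do. The names Locked and Marker mirror the table's examples, and zst_sizes is a helper I made up for illustration. One assumption to flag: the never type can't be named as a type argument on stable Rust, so the sketch substitutes std::convert::Infallible, an empty enum that is likewise uninhabited.

```rust
use std::convert::Infallible;
use std::marker::PhantomData;
use std::mem::size_of;

struct Locked; // unit struct
struct Marker {} // empty struct

/// Sizes, in bytes, of one example of each pattern in the table above.
fn zst_sizes() -> [usize; 5] {
    [
        size_of::<PhantomData<&'static u32>>(),
        size_of::<Locked>(),
        size_of::<Marker>(),
        size_of::<()>(),
        // Stand-in for `!`, which is not usable here on stable Rust.
        size_of::<Infallible>(),
    ]
}
```

Every entry in the returned array is zero, which is exactly the runtime half of the story.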
Here's a simplified version of what the standard library's slice iterator looks like:
pub struct Iter<'a, T: 'a> {
ptr: *const T,
end: *const T,
_marker: PhantomData<&'a T>, // ← zero bytes
}
Three fields: two pointers that the iterator actually uses, and one PhantomData that exists purely so the compiler knows this iterator borrows something with lifetime 'a.
At runtime, Iter is just 16 bytes on a 64-bit target: two pointers. PhantomData contributes nothing.
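Here's a small sketch that checks the layout claim without hard-coding 16, so it also holds on 32-bit targets. The struct is the simplified Iter from above, not the real standard library definition, and iter_layout is a hypothetical helper.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

#[allow(dead_code)]
struct Iter<'a, T: 'a> {
    ptr: *const T,
    end: *const T,
    _marker: PhantomData<&'a T>,
}

/// Returns (size of Iter<u32>, size of two raw pointers, size of the marker).
fn iter_layout() -> (usize, usize, usize) {
    (
        size_of::<Iter<'static, u32>>(),
        2 * size_of::<*const u32>(),
        size_of::<PhantomData<&'static u32>>(),
    )
}
```

The first two numbers come out equal and the third is zero: the marker field adds nothing to the struct's size.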
So naturally, when you lower Rust to any IR, you'd expect PhantomData to just... not be there. Right?
Let's look.
I have a small example here. We've got our own Iter struct, same shape as the standard library's, and a function that constructs one:
use std::marker::PhantomData;
struct Iter<'a, T: 'a> {
ptr: *const T,
end: *const T,
_marker: PhantomData<&'a T>,
}
fn create_iter<'a>(slice: &'a [u32]) -> Iter<'a, u32> {
let ptr = slice.as_ptr();
let end = unsafe { ptr.add(slice.len()) };
Iter { ptr, end, _marker: PhantomData }
}
Now let's dump the MIR. You can do this yourself on a nightly toolchain (the -Z flags are nightly-only) with:
cargo rustc -- -Zunpretty=mir
Here's what the compiler produces:
fn create_iter(_1: &[u32]) -> Iter<'_, u32> {
let mut _0: Iter<'_, u32>;
let _2: *const u32;
let mut _4: usize;
bb0: {
_2 = core::slice::<impl [u32]>::as_ptr(copy _1) -> [return: bb1, unwind continue];
}
bb1: {
_4 = PtrMetadata(copy _1);
_3 = std::ptr::const_ptr::<impl *const u32>::add(copy _2, move _4) -> [return: bb2, unwind continue];
}
bb2: {
_0 = Iter::<'_, u32> { ptr: copy _2, end: copy _3, _marker: const PhantomData::<&u32> };
return;
}
}
Look at bb2, the aggregate construction:
_0 = Iter::<'_, u32> { ptr: copy _2, end: copy _3, _marker: const PhantomData::<&u32> };
Three fields, three operands. PhantomData is right there as a real Constant operand in the MIR. It's not erased. It's not optimized away. It's a concrete value that the compiler hands to any backend consuming this MIR.
And this isn't a one-off. Let me show you a struct with two PhantomData fields:
struct Multi<A, B> {
data: u64,
_a: PhantomData<A>,
_b: PhantomData<B>,
}
And the MIR for a function that constructs one:
fn create_multi() -> Multi<Locked, Unlocked> {
bb0: {
_0 = Multi::<Locked, Unlocked> {
data: const 42_u64,
_a: const PhantomData::<Locked>,
_b: const PhantomData::<Unlocked>
};
return;
}
}
Three fields, three operands. Every PhantomData field gets its own constant.
Now look at what happens when PhantomData is a function parameter and a return value:
fn accept_phantom(_1: PhantomData<u32>) -> PhantomData<u32> {
debug _marker => const PhantomData::<u32>;
let mut _0: std::marker::PhantomData<u32>;
bb0: {
return;
}
}
The function signature says it takes a PhantomData<u32> and returns a PhantomData<u32>. But the body? Just return. No loads, no stores, no operations. The value is a known constant, so there is nothing to compute, but the type is still there in the signature.
Same thing for a unit struct:
fn pass_unit_struct(_1: Locked) -> Locked {
debug _tag => const Locked;
let mut _0: Locked;
bb0: {
return;
}
}
Locked is a ZST. The parameter _1 exists in the MIR, it has a type, but notice the debug line: debug _tag => const Locked. The compiler knows the value is just... the constant Locked. There's nothing to load.
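The Rust source behind those two dumps isn't shown here, so this is my best reconstruction rather than the demo repo's actual code. The parameter names come from the debug lines in the MIR; the bodies are an assumption (simply returning the argument).

```rust
use std::marker::PhantomData;

#[derive(Debug, PartialEq)]
struct Locked;

// Assumed source: the name `_marker` is taken from the MIR debug line,
// the body is a guess.
fn accept_phantom(_marker: PhantomData<u32>) -> PhantomData<u32> {
    _marker
}

// Assumed source: the name `_tag` is taken from the MIR debug line.
fn pass_unit_struct(_tag: Locked) -> Locked {
    _tag
}
```

Both functions type-check and can be called like any other, even though there is no data to pass in either direction.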
If these were just toy examples, you might not care. But PhantomData shows up in every for loop over a slice. Here's a simple sum:
fn sum_slice(data: &[u32]) -> u32 {
let mut total = 0u32;
for &val in data.iter() {
total += val;
}
total
}
Three lines of logic. Let's look at the MIR:
fn sum_slice(_1: &[u32]) -> u32 {
let mut _0: u32;
let mut _2: u32;
let mut _3: std::slice::Iter<'_, u32>; // ← contains PhantomData internally
let mut _4: std::slice::Iter<'_, u32>;
let mut _6: std::option::Option<&u32>;
...
bb0: {
_2 = const 0_u32;
_4 = core::slice::<impl [u32]>::iter(copy _1) -> [return: bb1, unwind continue];
}
bb1: {
_3 = <std::slice::Iter<'_, u32> as IntoIterator>::into_iter(move _4) -> [return: bb2, ...];
}
bb3: {
_7 = &mut _5;
_6 = <std::slice::Iter<'_, u32> as Iterator>::next(copy _7) -> [return: bb4, ...];
}
bb4: {
_8 = discriminant(_6);
switchInt(move _8) -> [0: bb7, 1: bb6, otherwise: bb5];
}
...
}
Locals _3, _4, and _5 (the last one declared in the elided part of the dump) have type std::slice::Iter<'_, u32>. That type, from the standard library, contains PhantomData<&'a u32> as a real field.
When the iterator machinery calls .iter(), constructs an Iter, calls .next(), and pattern-matches on Option<&u32>, all of that code is operating on a struct that carries a PhantomData field. Any backend consuming this MIR has to handle that field during construction, destruction, and field access.
This isn't obscure. This is what happens behind every for val in slice loop. PhantomData doesn't disappear at the MIR level; it's threaded through the entire iterator pipeline.
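To make the mapping from the loop to those basic blocks easier to follow, here's a hand-desugared version of sum_slice. The function name sum_slice_desugared is mine, and the comments tie each step to the blocks in the dump above as I read them.

```rust
/// Hand-desugared equivalent of `for &val in data.iter() { total += val; }`.
fn sum_slice_desugared(data: &[u32]) -> u32 {
    let mut total = 0u32;
    // The `iter` call in bb0, then the `into_iter` call in bb1.
    let mut iter: std::slice::Iter<'_, u32> = IntoIterator::into_iter(data.iter());
    loop {
        // `_7 = &mut _5` followed by the `next` call in bb3.
        match Iterator::next(&mut iter) {
            // The `discriminant` read and `switchInt` in bb4.
            Some(&val) => total += val,
            None => break,
        }
    }
    total
}
```

The local named iter is the one that carries the PhantomData field through every iteration.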
Here's where it gets subtle, and where I actually hit a bug.
When you're building a backend, you need to translate MIR types into your target IR. For ZSTs, the natural representation is "empty" — no fields, no bytes. But there are two kinds of empty:
ZST Types
/ \
Empty Tuple Empty Struct
mir.tuple <> mir.struct <PhantomData, [], []>
(aka ()) (aka PhantomData<T>)
An empty tuple and an empty struct both have zero fields and zero bytes. But they are different types. And this matters.
Here's why. When MIR constructs an Iter, it expects three operands:
_0 = Iter::<'_, u32> {
ptr: copy _2, // field 0: pointer type
end: copy _3, // field 1: pointer type
_marker: const PhantomData // field 2: struct type (PhantomData)
};
Field 2 has type PhantomData<&u32>. That's a struct type. If your backend translates PhantomData as an empty tuple (because hey, it's zero-sized, who cares), you get a type mismatch.
Field 2 expects a struct. You gave it a tuple. If your IR has any kind of type verification, and it should, this will fail.
In my case, the error looked roughly like this (paraphrased):
Verification error: field 2 of struct Iter expects type mir.struct <PhantomData, [], []> but got mir.tuple <>
The fix is straightforward: when you encounter PhantomData (a struct ZST), construct an empty struct with the correct type name. When you encounter () (a tuple ZST), construct an empty tuple. Don't conflate them.
if is_struct_zst {
// PhantomData<T> → construct empty struct with correct type
emit_construct_struct(phantom_data_type, fields: [])
} else {
// () → construct empty tuple
emit_construct_tuple(fields: [])
}
This is a general lesson about compiler IRs: two types can look identical at runtime (zero size, zero fields) but be structurally different. And if your IR cares about structural types, which most do, you have to preserve that distinction.
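Here's a self-contained sketch of that distinction. Everything in it (IrType, MirZst, lower_zst, verify_aggregate) is a hypothetical stand-in I wrote for illustration, not the API of my backend or of any real IR framework. It models just enough to show a verifier accepting the empty struct and rejecting the empty tuple.

```rust
/// Hypothetical IR types for this sketch; not any real backend's API.
#[derive(Debug, Clone, PartialEq)]
enum IrType {
    Ptr,
    Tuple(Vec<IrType>),
    Struct { name: String, fields: Vec<IrType> },
}

/// The two kinds of zero-sized MIR types discussed above.
enum MirZst<'a> {
    Unit,
    EmptyStruct(&'a str),
}

/// Lower a ZST while preserving whether it is a tuple or a named struct.
fn lower_zst(zst: &MirZst) -> IrType {
    match zst {
        MirZst::Unit => IrType::Tuple(vec![]),
        MirZst::EmptyStruct(name) => IrType::Struct {
            name: name.to_string(),
            fields: vec![],
        },
    }
}

/// Minimal verifier: each operand type must equal the declared field type.
fn verify_aggregate(expected: &[IrType], operands: &[IrType]) -> Result<(), String> {
    if expected.len() != operands.len() {
        return Err(format!(
            "expected {} operands, got {}",
            expected.len(),
            operands.len()
        ));
    }
    for (i, (want, got)) in expected.iter().zip(operands).enumerate() {
        if want != got {
            return Err(format!("field {i} expects {want:?} but got {got:?}"));
        }
    }
    Ok(())
}
```

Lowering PhantomData through the EmptyStruct arm passes verification for Iter's third field; lowering it through the Unit arm fails on field 2, which is the same shape of error I hit.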
Now here's the second problem. Even if you handle ZSTs perfectly in your own IR, you eventually have to lower to LLVM. And LLVM doesn't like empty structs in certain positions.
Let me show you what rustc's own LLVM output looks like for our examples:
| Function | MIR params → return | LLVM params → return | What changed |
|---|---|---|---|
| create_iter | (&[u32]) → Iter { ptr, end, PhantomData } | (ptr, i64) → { ptr, ptr } | PhantomData field gone |
| create_multi | () → Multi { u64, PhantomData, PhantomData } | () → i64 | only the u64 remains |
| accept_phantom | (PhantomData<u32>) → PhantomData<u32> | () → void | entire function erased |
| pass_unit_struct | (Locked) → Locked | () → void | same erasure |
| Resource::lock | (Resource<Unlocked>) → Resource<Locked> | (i64) → i64 | just the u64 data |
| create_row_matrix | (ptr, usize, usize) → Matrix { ptr, rows, cols, PhantomData } | (sret([24 x i8]), ptr, i64, i64) → void | 24 bytes, not 32 |
| read_marker | (&Iter) → PhantomData<&T> | (ptr) → void | return erased |
Look at create_iter. MIR says the return type is Iter with three fields: ptr, end, PhantomData. LLVM IR says the return type is { ptr, ptr }: two fields. PhantomData is stripped.
create_multi is even more dramatic. MIR says Multi has three fields. LLVM says... i64. Just the data. Both PhantomData fields gone.
And accept_phantom, a function that takes and returns PhantomData, becomes void () in LLVM. The entire function compiles to: do nothing, return nothing.
But here's the thing: this stripping happens inside rustc's own codegen. When you're building a custom MIR backend, you're bypassing rustc's codegen. You get the MIR, with all the PhantomData intact, and you have to do the stripping yourself.
And some LLVM backends are particularly strict about this. If you emit an empty struct {} as a function parameter, certain backends will reject it outright:
LLVM ERROR: Empty parameter types are not supported
So you must strip ZSTs before emitting LLVM IR. You can't just pass them through.
So here's the design that falls out of all this. You want two layers:
┌───────────────────────────────────────────────────────┐
│ Your MIR Dialect │
│ • PhantomData is a real type with a real name │
│ • Struct fields include ZST fields │
│ • Aggregate construction has ZST operands │
│ • Type info available for analysis passes │
├──────────────────────────────────────────────────────-┤
│ Analysis Passes │
│ • Can query: "is field 2 a PhantomData?" │
│ • Can query: "what's the marker type of this struct?"│
│ • Full type information for optimization decisions │
├──────────────────────────────────────────────────────-┤
│ LLVM Dialect │
│ • ZSTs stripped during type conversion │
│ • Struct { ptr, PhantomData } → { ptr } │
│ • fn(PhantomData) → fn() │
│ • Only runtime-relevant types survive │
└──────────────────────────────────────────────────────-┘
In the high-level MIR layer, keep everything. PhantomData exists. Struct fields are complete. Type information is preserved. This is where you'd run any analysis that cares about what type something is — not just how many bytes it occupies.
In the low-level LLVM layer, strip ZSTs. Filter empty struct fields out of struct types. Drop ZST parameters from function signatures. Turn ZST-only return types into void.
The key insight is that stripping should happen at the type conversion boundary — when you're translating your MIR types to LLVM types. That's one location, it's clean, and it means your MIR-level passes always see the full picture.
This is actually what rustc does internally. Rustc's own codegen strips ZSTs when lowering MIR to LLVM IR. If you're building a custom backend, you're just recreating that same boundary.
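As a rough sketch of that boundary, here's a tiny type converter. MirTy, LlvmTy, convert, and convert_sig are all names I invented for this example; a real backend would have far more type variants and would also have to remap field indices after dropping fields, which this sketch ignores.

```rust
/// Hypothetical high-level types; ZSTs are still present at this level.
#[derive(Debug, Clone, PartialEq)]
enum MirTy {
    Ptr,
    U64,
    Unit,
    Struct { name: String, fields: Vec<MirTy> },
}

/// Hypothetical low-level types; nothing zero-sized survives here.
#[derive(Debug, Clone, PartialEq)]
enum LlvmTy {
    Ptr,
    I64,
    Struct(Vec<LlvmTy>),
}

/// Convert one type. `None` means "zero-sized, emit nothing".
fn convert(ty: &MirTy) -> Option<LlvmTy> {
    match ty {
        MirTy::Ptr => Some(LlvmTy::Ptr),
        MirTy::U64 => Some(LlvmTy::I64),
        MirTy::Unit => None,
        MirTy::Struct { fields, .. } => {
            // Recursively convert fields, dropping the zero-sized ones.
            let kept: Vec<LlvmTy> = fields.iter().filter_map(convert).collect();
            if kept.is_empty() {
                None
            } else {
                Some(LlvmTy::Struct(kept))
            }
        }
    }
}

/// Convert a signature: ZST parameters are dropped, and a ZST return
/// type becomes `None`, which stands for `void`.
fn convert_sig(params: &[MirTy], ret: &MirTy) -> (Vec<LlvmTy>, Option<LlvmTy>) {
    (params.iter().filter_map(convert).collect(), convert(ret))
}
```

Fed the Iter type, this yields a two-pointer struct, and the accept_phantom signature collapses to no parameters and a void return. For Multi it yields a one-field struct rather than the bare i64 that rustc emits, because the sketch makes no attempt to model rustc's ABI decisions.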
I want to close with why this isn't just a PhantomData curiosity.
The typestate pattern, where you use ZSTs to encode state at the type level, is common in Rust. Here's a real pattern:
struct Locked;
struct Unlocked;
struct Resource<State> {
value: u64,
_state: PhantomData<State>,
}
impl Resource<Unlocked> {
fn lock(self) -> Resource<Locked> { ... }
}
impl Resource<Locked> {
fn read(&self) -> u64 { ... }
fn unlock(self) -> Resource<Unlocked> { ... }
}
The type system prevents you from calling read() on an unlocked resource. That's compile-time safety with zero runtime cost.
But in MIR, every lock() and unlock() constructs a new struct with a PhantomData operand. Every transition between states is visible as a PhantomData constant in the IR.
And if you look at the MIR for lock:
fn lock(_1: Resource<Unlocked>) -> Resource<Locked> {
bb0: {
_2 = copy (_1.0: u64);
_0 = Resource::<Locked> { value: move _2, _state: const PhantomData::<Locked> };
return;
}
}
It copies the u64 out and constructs a new Resource<Locked> with a PhantomData operand. In LLVM IR, this becomes i64 (i64): just copy the integer. But in MIR, the state transition is explicitly represented.
For anyone building analysis passes over MIR (say, checking that resources are always locked before use, or that state transitions follow a valid sequence), that PhantomData information is exactly what you'd want to inspect.
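If you want to play with the pattern yourself, here's a complete version with the elided bodies filled in. The bodies are my assumption, chosen to match the MIR for lock shown above, and the new constructor is a hypothetical addition so there's a way to create a Resource.

```rust
use std::marker::PhantomData;

struct Locked;
struct Unlocked;

struct Resource<State> {
    value: u64,
    _state: PhantomData<State>,
}

impl Resource<Unlocked> {
    // Hypothetical constructor, added only so the example can be used.
    fn new(value: u64) -> Self {
        Resource { value, _state: PhantomData }
    }

    // Moves the u64 into a new struct; only the marker type changes.
    fn lock(self) -> Resource<Locked> {
        Resource { value: self.value, _state: PhantomData }
    }
}

impl Resource<Locked> {
    fn read(&self) -> u64 {
        self.value
    }

    fn unlock(self) -> Resource<Unlocked> {
        Resource { value: self.value, _state: PhantomData }
    }
}
```

Calling read() on the result of new() without going through lock() is a compile error, since read only exists on Resource<Locked>. And a Resource in either state is the same size as a bare u64.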
So: preserve ZST type info in your high-level IR. Strip it when lowering to LLVM. Two layers, clean boundary.
To summarize:
One — ZSTs are real operands in MIR. PhantomData appears in aggregate construction, field access, function signatures, and constant operands. It is not erased.
Two — Type precision matters. An empty struct and an empty tuple are both zero-sized, but they are different types. Conflating them breaks verification.
Three — LLVM backends can reject empty struct types in function signatures. You must strip ZSTs during LLVM lowering.
Four — The right architecture is two layers: MIR dialect preserves full type info for analysis; LLVM dialect strips ZSTs at the type conversion boundary.