In .NET the JIT performs more-and-more optimization on the code using dynamic PGO, constant folding and inlining. While writing this post, I am using .NET 9 Runtime on x64 architecture, although many of the discussed optimizations are available in one or another form since .NET 7. The presented assembly code snippets generated by the runtime are all optimized, Tier1-OSR code using Dynamic PGO.
Constant-Folding is applied when a compiler 'sees' a constant expression and replaces it with the result of the expression. The C# compiler does it at compile time (ie. var i = 3 + 4;
can be substituted with var i = 7
). Some expressions involving readonly fields, environment specific parameters, architecture, CPU, etc. can only be folded at runtime. In this latter case the JIT compiler may perform the optimization.
Inlining is a technique where a method's invocation is replaced with the actual body of the method. This reduces the overhead of the method invocation, but it increases the overall code size (and memory consumption) at runtime. This optimization also allows further optimizations across the overall method at the callsite.
These two optimizations can work together to achieve an overall optimization that is only possible at runtime. In this blog post I will present one such case.
To explore the JITted code, I set the dotnet JitDisasm $env:DOTNET_JitDisasm="Foo:Method"
environment variable. This instructs dotnet to capture the given the method's assembly code and to print it on the console. Then run the C# console application in Release mode: dotnet run -c Release
. To make sure the observed method is optimized, the application invokes it in a loop. The runtime prints the assembly code of the Method
on the console every time the tiered compiler produces a new version of code for the observed method.
public static void Run()
{
var sum = 0;
for (int i = 0; i < 100; i++)
sum += Foo.Method();
Console.WriteLine(sum);
}
I investigate 4 different versions of the same method in this blog post. They are named MethodA, MethodB, MethodC and MethodD. They all invoke the same shared helper method:
private static int Calculation(int index)
{
ReadOnlySpan<int> source = [20, 36, 52, 68];
return source[index] - 4;
}
Calculation
returns the value minus 4 of an array at a given index. Please note, it intentionally avoids any input argument validation. In case of indexing beyond the length of the array, the samples rely on the safety provided by the runtime bound checking. Notice, that when using the ReadOnlySpan<int>
as the type in Calculation
method, the runtime will not allocate a new integer array
every time the method is invoked. Instead, the span will point to the region of the assembly that contains the values.
MethodA
to observe invokes Calculation
in a loop, passing the value 2 as the index argument.
[MethodImpl(MethodImplOptions.NoInlining)]
public static int MethodA()
{
var localsum = 0;
for (int i = 3000; i > 0; i--)
localsum += Calculation(2);
return localsum;
}
The JIT generates the following code:
G_M000_IG01: ;; offset=0x0000
mov ecx, dword ptr [rsp+0x3C]
mov eax, dword ptr [rsp+0x38]
G_M000_IG02: ;; offset=0x0008
test eax, eax
jle SHORT G_M000_IG04
align [0 bytes for IG03]
G_M000_IG03: ;; offset=0x000C
add ecx, 48
dec eax
test eax, eax
jg SHORT G_M000_IG03
G_M000_IG04: ;; offset=0x0015
mov eax, ecx
G_M000_IG05: ;; offset=0x0017
add rsp, 120
pop rbp
ret
; Total bytes of code 29
The line add ecx, 48
adds the value 48 to the ecx register. The Calculation
method at index 2 contains the value 52-4 results the value of 48. The optimized code of MethodA
has constant folded and inlined the result of Calculation
method. Because the index
argument is a constant value at the callsite, the JIT can safely optimize this at this callsite.
MethodB
's body is the same as MethodA
's with one difference, it invokes Calculation
method with the argument of 1:
[MethodImpl(MethodImplOptions.NoInlining)]
public static int MethodB()
{
var localsum = 0;
for (int i = 3000; i > 0; i--)
localsum += Calculation(1);
return localsum;
}
This results the following JITted code:
G_M000_IG01: ;; offset=0x0000
mov ecx, dword ptr [rsp+0x3C]
mov eax, dword ptr [rsp+0x38]
G_M000_IG02: ;; offset=0x0008
test eax, eax
jle SHORT G_M000_IG04
align [0 bytes for IG03]
G_M000_IG03: ;; offset=0x000C
add ecx, 32
dec eax
test eax, eax
jg SHORT G_M000_IG03
G_M000_IG04: ;; offset=0x0015
mov eax, ecx
G_M000_IG05: ;; offset=0x0017
add rsp, 120
pop rbp
ret
; Total bytes of code 29
The only real difference compared to MethodA
is the line add ecx, 32
which accumulates 32 in each iteration. The interesting part is that the JIT optimized the same Calculation
method down to a single number (which is different for MethodA
and MethodB
) and inlined them both at the same time in the same application. Thisgives the impression that the same Calculation
method (replaced with a single number) is actually inlined multiple times in a completely different shape at the callsites (beside the other forms of tier-0, tier-1, etc. version).
MethodC
is different compared to the previous methods in a way that it does not hardcode the index argument, instead it receives it as input parameter. MethodC
is invoked as MethodC(Random.Shared.Next(0, 3))
, so a random value for the index parameter is generated on each call.
[MethodImpl(MethodImplOptions.NoInlining)]
public static int MethodC(int index)
{
var localsum = 0;
for (int i = 3000; i > 0; i--)
localsum += Calculation(index);
return localsum;
}
For brevity, I only show the interesting part of the machine code:
G_M000_IG03: ;; offset=0x0017
mov r8d, ecx
shl r8, 2
mov r10, 0x20614A42F30
align [0 bytes for IG04]
G_M000_IG04: ;; offset=0x0028
cmp ecx, 4
jae SHORT G_M000_IG07
mov r9d, dword ptr [r8+r10]
lea edx, [r9+rdx-0x04]
dec eax
test eax, eax
jg SHORT G_M000_IG04
This looks different to the previous assembly snippets. It has a bound check cmp ecx, 4
on the index value to avoid indexing beyond the length of the span. It also looks up the value (mov r9d, dword ptr [r8+r10]
) and performs the accumulation for localsum
variable and the subtraction of 4 (lea edx, [r9+rdx-0x04]
). That means Calculation
is inlined, but not folded, and the bound checks also not optimized away. That is Calculation
method manifests yet another very different shape of machine code at runtime. The same method is inlined in different ways at different sections of application code.
MethodD
slightly changes the implementation: index is no longer hardcoded, nor a parameter, instead it is a value from a static readonly
field. Notice, the parameter is readonly, meaning it only evaluates to a value at runtime, in this case when Random
creates a new integer.
public static readonly int R = Random.Shared.Next(0, 3);
[MethodImpl(MethodImplOptions.NoInlining)]
public static int MethodD()
{
var localsum = 0;
for (int i = 3000; i > 0; i--)
localsum += Calculation(R);
return localsum;
}
Yet, the JITed code of this method looks exactly as the MethodA
and MethodB
: Calculation
is fully inlined and folded to a single value. The snippet below highlights the relevant part of the generated machine code when a random value of 3 was set for R
.
G_M000_IG03: ;; offset=0x000C
add ecx, 64
dec eax
test eax, eax
jg SHORT G_M000_IG03
This last bit is interesting, as it gives a form of (a relatively restrictive) optimization opportunity that is not readily available for native compiled applications. Notice, that the value of a static readonly
field may come from a user input, datetime, http response etc., and still, it allows a way of optimizing the code for this value. This optimization only works today on static readonly
fields, as both instance fields and non-readonly values will result a less optimized code, because it is difficult for the runtime to reason (and generate different code for) instance or non-readonly fields.
When I execute benchmarks using BenchmarkDotNet, I executed the methods compiled against NativeAot and the JITing runtime:
Method | Job | Runtime | Mean | Error | StdDev |
---|---|---|---|---|---|
MethodD | .NET 9.0 | .NET 9.0 | 83.37 us | 0.793 us | 0.662 us |
MethodA | .NET 9.0 | .NET 9.0 | 81.40 us | 0.703 us | 0.657 us |
MethodC | .NET 9.0 | .NET 9.0 | 78.22 us | 1.310 us | 1.286 us |
MethodD | NativeAOT 9.0 | NativeAOT 9.0 | 111.43 us | 0.430 us | 0.381 us |
MethodA | NativeAOT 9.0 | NativeAOT 9.0 | 81.54 us | 0.366 us | 0.324 us |
MethodC | NativeAOT 9.0 | NativeAOT 9.0 | 95.52 us | 0.922 us | 0.817 us |
The .NET 9.0
jobs all take roughly about the same amount of mean execution time. However, the NativeAOT 9.0
job for MethodD
and MethodC
takes significantly longer, the current NativeAOT compiler cannot apply the same optimizations at runtime as the JIT can.