Constant Folding and Inlining

In .NET the JIT performs more-and-more optimization on the code using dynamic PGO, constant folding and inlining. While writing this post, I am using .NET 9 Runtime on x64 architecture, although many of the discussed optimizations are available in one or another form since .NET 7. The presented assembly code snippets generated by the runtime are all optimized, Tier1-OSR code using Dynamic PGO.

Constant-Folding is applied when a compiler 'sees' a constant expression and replaces it with the result of the expression. The C# compiler does it at compile time (ie. var i = 3 + 4; can be substituted with var i = 7). Some expressions involving readonly fields, environment specific parameters, architecture, CPU, etc. can only be folded at runtime. In this latter case the JIT compiler may perform the optimization.

Inlining is a technique where a method's invocation is replaced with the actual body of the method. This reduces the overhead of the method invocation, but it increases the overall code size (and memory consumption) at runtime. This optimization also allows further optimizations across the overall method at the callsite.

These two optimizations can work together to achieve an overall optimization that is only possible at runtime. In this blog post I will present one such case.

Setup

To explore the JITted code, I set the dotnet JitDisasm $env:DOTNET_JitDisasm="Foo:Method" environment variable. This instructs dotnet to capture the given the method's assembly code and to print it on the console. Then run the C# console application in Release mode: dotnet run -c Release. To make sure the observed method is optimized, the application invokes it in a loop. The runtime prints the assembly code of the Method on the console every time the tiered compiler produces a new version of code for the observed method.

 public static void Run()
 {
     var sum = 0;
     for (int i = 0; i < 100; i++)
         sum += Foo.Method();
     Console.WriteLine(sum);
 }

I investigate 4 different versions of the same method in this blog post. They are named MethodA, MethodB, MethodC and MethodD. They all invoke the same shared helper method:

private static int Calculation(int index)
{
    ReadOnlySpan<int> source = [20, 36, 52, 68];
    return source[index] - 4;
}

Calculation returns the value minus 4 of an array at a given index. Please note, it intentionally avoids any input argument validation. In case of indexing beyond the length of the array, the samples rely on the safety provided by the runtime bound checking. Notice, that when using the ReadOnlySpan<int> as the type in Calculation method, the runtime will not allocate a new integer array every time the method is invoked. Instead, the span will point to the region of the assembly that contains the values.

MethodA

MethodA to observe invokes Calculation in a loop, passing the value 2 as the index argument.

[MethodImpl(MethodImplOptions.NoInlining)]
public static int MethodA()
{
    var localsum = 0;
    for (int i = 3000; i > 0; i--)
        localsum += Calculation(2);
    return localsum;
}

The JIT generates the following code:

G_M000_IG01:                ;; offset=0x0000
       mov      ecx, dword ptr [rsp+0x3C]
       mov      eax, dword ptr [rsp+0x38]

G_M000_IG02:                ;; offset=0x0008
       test     eax, eax
       jle      SHORT G_M000_IG04
       align    [0 bytes for IG03]

G_M000_IG03:                ;; offset=0x000C
       add      ecx, 48
       dec      eax
       test     eax, eax
       jg       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=0x0015
       mov      eax, ecx

G_M000_IG05:                ;; offset=0x0017
       add      rsp, 120
       pop      rbp
       ret

; Total bytes of code 29

The line add ecx, 48 adds the value 48 to the ecx register. The Calculation method at index 2 contains the value 52-4 results the value of 48. The optimized code of MethodA has constant folded and inlined the result of Calculation method. Because the index argument is a constant value at the callsite, the JIT can safely optimize this at this callsite.

MethodB

MethodB's body is the same as MethodA's with one difference, it invokes Calculation method with the argument of 1:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int MethodB()
{
    var localsum = 0;
    for (int i = 3000; i > 0; i--)
        localsum += Calculation(1);
    return localsum;
}

This results the following JITted code:

G_M000_IG01:                ;; offset=0x0000
       mov      ecx, dword ptr [rsp+0x3C]
       mov      eax, dword ptr [rsp+0x38]

G_M000_IG02:                ;; offset=0x0008
       test     eax, eax
       jle      SHORT G_M000_IG04
       align    [0 bytes for IG03]

G_M000_IG03:                ;; offset=0x000C
       add      ecx, 32
       dec      eax
       test     eax, eax
       jg       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=0x0015
       mov      eax, ecx

G_M000_IG05:                ;; offset=0x0017
       add      rsp, 120
       pop      rbp
       ret

; Total bytes of code 29

The only real difference compared to MethodA is the line add ecx, 32 which accumulates 32 in each iteration. The interesting part is that the JIT optimized the same Calculation method down to a single number (which is different for MethodA and MethodB) and inlined them both at the same time in the same application. Thisgives the impression that the same Calculation method (replaced with a single number) is actually inlined multiple times in a completely different shape at the callsites (beside the other forms of tier-0, tier-1, etc. version).

MethodC

MethodC is different compared to the previous methods in a way that it does not hardcode the index argument, instead it receives it as input parameter. MethodC is invoked as MethodC(Random.Shared.Next(0, 3)), so a random value for the index parameter is generated on each call.

[MethodImpl(MethodImplOptions.NoInlining)]
public static int MethodC(int index)
{
    var localsum = 0;
    for (int i = 3000; i > 0; i--)
        localsum += Calculation(index);
    return localsum;
}

For brevity, I only show the interesting part of the machine code:

G_M000_IG03:                ;; offset=0x0017
       mov      r8d, ecx
       shl      r8, 2
       mov      r10, 0x20614A42F30
       align    [0 bytes for IG04]

G_M000_IG04:                ;; offset=0x0028
       cmp      ecx, 4
       jae      SHORT G_M000_IG07
       mov      r9d, dword ptr [r8+r10]
       lea      edx, [r9+rdx-0x04]
       dec      eax
       test     eax, eax
       jg       SHORT G_M000_IG04

This looks different to the previous assembly snippets. It has a bound check cmp ecx, 4 on the index value to avoid indexing beyond the length of the span. It also looks up the value (mov r9d, dword ptr [r8+r10]) and performs the accumulation for localsum variable and the subtraction of 4 (lea edx, [r9+rdx-0x04]). That means Calculation is inlined, but not folded, and the bound checks also not optimized away. That is Calculation method manifests yet another very different shape of machine code at runtime. The same method is inlined in different ways at different sections of application code.

MethodD

MethodD slightly changes the implementation: index is no longer hardcoded, nor a parameter, instead it is a value from a static readonly field. Notice, the parameter is readonly, meaning it only evaluates to a value at runtime, in this case when Random creates a new integer.

public static readonly int R = Random.Shared.Next(0, 3);

[MethodImpl(MethodImplOptions.NoInlining)]
public static int MethodD()
{
    var localsum = 0;
    for (int i = 3000; i > 0; i--)
        localsum += Calculation(R);
    return localsum;
}

Yet, the JITed code of this method looks exactly as the MethodA and MethodB: Calculation is fully inlined and folded to a single value. The snippet below highlights the relevant part of the generated machine code when a random value of 3 was set for R.

G_M000_IG03:                ;; offset=0x000C
       add      ecx, 64
       dec      eax
       test     eax, eax
       jg       SHORT G_M000_IG03

This last bit is interesting, as it gives a form of (a relatively restrictive) optimization opportunity that is not readily available for native compiled applications. Notice, that the value of a static readonly field may come from a user input, datetime, http response etc., and still, it allows a way of optimizing the code for this value. This optimization only works today on static readonly fields, as both instance fields and non-readonly values will result a less optimized code, because it is difficult for the runtime to reason (and generate different code for) instance or non-readonly fields.

Benchmarks

When I execute benchmarks using BenchmarkDotNet, I executed the methods compiled against NativeAot and the JITing runtime:

Method	Job	Runtime	Mean	Error	StdDev
MethodD	.NET 9.0	.NET 9.0	83.37 us	0.793 us	0.662 us
MethodA	.NET 9.0	.NET 9.0	81.40 us	0.703 us	0.657 us
MethodC	.NET 9.0	.NET 9.0	78.22 us	1.310 us	1.286 us
MethodD	NativeAOT 9.0	NativeAOT 9.0	111.43 us	0.430 us	0.381 us
MethodA	NativeAOT 9.0	NativeAOT 9.0	81.54 us	0.366 us	0.324 us
MethodC	NativeAOT 9.0	NativeAOT 9.0	95.52 us	0.922 us	0.817 us

The .NET 9.0 jobs all take roughly about the same amount of mean execution time. However, the NativeAOT 9.0 job for MethodD and MethodC takes significantly longer, the current NativeAOT compiler cannot apply the same optimizations at runtime as the JIT can.

ladeak/Readme.md