Summarize this article in 10 headings
Performance Improvements in .NET 10
Stephen Toub - MSFT
Partner Software Engineer
My kids love “Frozen”. They can sing every word, re-enact every scene, and provide detailed notes on the proper sparkle of Elsa’s ice dress. I’ve seen the movie more times than I can recount, to the point where, if you’ve seen me do any live coding, you’ve probably seen my subconscious incorporate an Arendelle reference or two. After so many viewings, I began paying closer attention to the details, like how at the very beginning of the film the ice harvesters are singing a song that subtly foreshadows the story’s central conflicts, the characters’ journeys, and even the key to resolving the climax. I’m slightly ashamed to admit I didn’t comprehend this connection until viewing number ten or so, at which point I also realized I had no idea if this ice harvesting was actually “a thing” or if it was just a clever vehicle for Disney to spin a yarn. Turns out, as I subsequently researched, it’s quite real.
In the 19th century, before refrigeration, ice was an incredibly valuable commodity. Winters in the northern United States turned ponds and lakes into seasonal gold mines. The most successful operations ran with precision: workers cleared snow from the surface so the ice would grow thicker and stronger, and they scored the surface into perfect rectangles using horse-drawn plows, turning the lake into a frozen checkerboard. Once the grid was cut, teams with long saws worked to free uniform blocks weighing several hundred pounds each. These blocks were floated along channels of open water toward the shore, at which point men with poles levered the blocks up ramps and hauled them into storage. Basically, what the movie shows.
The storage itself was an art. Massive wooden ice houses, sometimes holding tens of thousands of tons, were lined with insulation, typically straw. Done well, this insulation could keep the ice solid for months, even through summer heat. Done poorly, you would open the doors to slush. And for those moving ice over long distances, typically by ship, every degree, every crack in the insulation, every extra day in transit meant more melting and more loss.
Enter Frederic Tudor, the “Ice King” of Boston. He was obsessed with systemic efficiency. Where competitors saw unavoidable loss, Tudor saw a solvable problem. After experimenting with different insulators, he leaned on cheap sawdust, a lumber mill byproduct that outperformed straw, packing it densely around the ice to cut melt losses significantly. For harvesting efficiency, his operations adopted Nathaniel Jarvis Wyeth’s grid-scoring system, which produced uniform blocks that could be packed tightly, minimizing air gaps that would otherwise increase exposure in a ship’s hold. And to shorten the critical time between shore and ship, Tudor built out port infrastructure and depots near docks, allowing ships to load and unload much faster. Each change, from tools to ice house design to logistics, amplified the last, turning a risky local harvest into a reliable global trade. With Tudor’s enhancements, he had solid ice arriving in places like Havana, Rio de Janeiro, and even Calcutta (a voyage of four months in the 1830s). His performance gains allowed the product to survive journeys that were previously unthinkable.
What made Tudor’s ice last halfway around the world wasn’t one big idea. It was a plethora of small improvements, each multiplying the effect of the last. In software development, the same principle holds: big leaps forward in performance rarely come from a single sweeping change, rather from hundreds or thousands of targeted optimizations that compound into something transformative. .NET 10’s performance story isn’t about one Disney-esque magical idea; it’s about carefully shaving off nanoseconds here and tens of bytes there, streamlining operations that are executed trillions of times.
In the rest of this post, just as we did in Performance Improvements in .NET 9, .NET 8, .NET 7, .NET 6, .NET 5, .NET Core 3.0, .NET Core 2.1, and .NET Core 2.0, we’ll dig into hundreds of the small but meaningful and compounding performance improvements since .NET 9 that make up .NET 10’s story (if you instead stay on LTS releases and thus are upgrading from .NET 8 rather than from .NET 9, you’ll see even more improvements, the aggregation of all the improvements in .NET 9 as well). So, without further ado, go grab a cup of your favorite hot beverage (or, given my intro, maybe something a bit more frosty), sit back, relax, and “Let It Go”!
Or, hmm, maybe, let’s push performance “Into the Unknown”?
Let .NET 10 performance “Show Yourself”?
“Do You Want To Build a Snowman Fast Service?”
I’ll see myself out.
Benchmarking Setup
As in previous posts, this tour is chock full of micro-benchmarks intended to showcase various performance improvements. Most of these benchmarks are implemented using BenchmarkDotNet 0.15.2, with a simple setup for each.
To follow along, make sure you have .NET 9 and .NET 10 installed, as most of the benchmarks compare the same test running on each. Then, create a new C# project in a new benchmarks directory:
dotnet new console -o benchmarks
cd benchmarks
That will produce two files in the benchmarks directory: benchmarks.csproj, which is the project file with information about how the application should be compiled, and Program.cs, which contains the code for the application. Finally, replace everything in benchmarks.csproj with this:
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net10.0;net9.0</TargetFrameworks>
    <LangVersion>Preview</LangVersion>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <ServerGarbageCollection>true</ServerGarbageCollection>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.15.2" />
  </ItemGroup>
</Project>
With that, we’re good to go. Unless otherwise noted, I’ve tried to make each benchmark standalone; just copy/paste its whole contents into the Program.cs file, overwriting everything that’s there, and then run the benchmarks. Each test includes at its top a comment for the dotnet command to use to run the benchmark. It’s typically something like this:
dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
which will run the benchmark in release on both .NET 9 and .NET 10 and show the compared results. The other common variation, used when the benchmark should only be run on .NET 10 (typically because it’s comparing two approaches rather than comparing one thing on two versions), is the following:
dotnet run -c Release -f net10.0 --filter "*"
Throughout the post, I’ve shown many benchmarks and the results I received from running them. Unless otherwise stated (e.g. because I’m demonstrating an OS-specific improvement), the results shown are from running them on Linux (Ubuntu 24.04.1) on an x64 processor.
BenchmarkDotNet v0.15.2, Linux Ubuntu 24.04.1 LTS (Noble Numbat)
11th Gen Intel Core i9-11950H 2.60GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.100-rc.1.25451.107
[Host] : .NET 9.0.9 (9.0.925.41916), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
As always, a quick disclaimer: these are micro-benchmarks, timing operations so short you’d miss them by blinking (but when such operations run millions of times, the savings really add up). The exact numbers you get will depend on your hardware, your operating system, what else your machine is juggling at the moment, how much coffee you’ve had since breakfast, and perhaps whether Mercury is in retrograde. In other words, don’t expect your results to match mine exactly, but I’ve picked tests that should still be reasonably reproducible in the real world.
Now, let’s start at the bottom of the stack. Code generation.
JIT
Among all areas of .NET, the Just-In-Time (JIT) compiler stands out as one of the most impactful. Every .NET application, whether a small console tool or a large-scale enterprise service, ultimately relies on the JIT to turn intermediate language (IL) code into optimized machine code. Any enhancement to the JIT’s generated code quality has a ripple effect, improving performance across the entire ecosystem without requiring developers to change any of their own code or even recompile their C#. And with .NET 10, there’s no shortage of these improvements.
Deabstraction
As with many languages, .NET historically has had an “abstraction penalty,” those extra allocations and indirections that can occur when using high-level language features like interfaces, iterators, and delegates. Each year, the JIT gets better and better at optimizing away layers of abstraction, so that developers get to write simple code and still get great performance. .NET 10 continues this tradition. The result is that idiomatic C# (using interfaces, foreach loops, lambdas, etc.) runs even closer to the raw speed of meticulously crafted and hand-tuned code.
Object Stack Allocation
One of the most exciting areas of deabstraction progress in .NET 10 is the expanded use of escape analysis to enable stack allocation of objects. Escape analysis is a compiler technique to determine whether an object allocated in a method escapes that method, meaning determining whether that object is reachable after the method returns (for example, by being stored in a field or returned to the caller) or used in some way that the runtime can’t track within the method (like passed to an unknown callee). If the compiler can prove an object doesn’t escape, then that object’s lifetime is bounded by the method, and it can be allocated on the stack instead of on the heap. Stack allocation is much cheaper (just pointer bumping for allocation and automatic freeing when the method exits) and reduces GC pressure because, well, the object doesn’t need to be tracked by the GC. .NET 9 had already introduced some limited escape analysis and stack allocation support; .NET 10 takes this significantly further.
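To make the distinction concrete, here’s a minimal illustration (mine, not from the post) of the difference escape analysis cares about:

// This Point never leaves the method, so once escape analysis proves that,
// the JIT may place it on the stack: allocation is just a pointer bump, and
// freeing is automatic when the method returns.
static int DoesNotEscape()
{
    Point p = new() { X = 1, Y = 2 };
    return p.X + p.Y;
}

// This Point is returned to the caller, so it escapes the method and must
// be allocated on the heap, where the GC tracks it.
static Point Escapes() => new Point { X = 1, Y = 2 };

class Point { public int X, Y; }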
dotnet/runtime#115172 teaches the JIT how to perform escape analysis related to delegates, and in particular that a delegate’s Invoke method (which is implemented by the runtime) does not stash away the this reference. Then if escape analysis can prove that the delegate’s object reference is something that otherwise hasn’t escaped, the delegate can effectively evaporate. Consider this benchmark:
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[MemoryDiagnoser(displayGenColumns: false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "y")]
public partial class Tests
{
    [Benchmark]
    [Arguments(42)]
    public int Sum(int y)
    {
        Func<int, int> addY = x => x + y;
        return DoubleResult(addY, y);
    }

    private int DoubleResult(Func<int, int> func, int arg)
    {
        int result = func(arg);
        return result + result;
    }
}
If we just run this benchmark and compare .NET 9 and .NET 10, we can immediately tell something interesting is happening.
| Method Runtime Mean Ratio Code Size Allocated Alloc Ratio | |
| Sum .NET 9.0 19.530 ns 1.00 118 B 88 B 1.00 | |
| Sum .NET 10.0 6.685 ns 0.34 32 B 24 B 0.27 | |
The simple C# code for Sum belies the complicated code generation it requires of the C# compiler. It needs to create a Func<int, int>, which is “closing over” the y “local”. That means the compiler needs to “lift” y so it’s no longer an actual local and instead lives as a field on an object; the delegate can then point to a method on that object, giving it access to y. This is approximately what the IL generated by the C# compiler looks like when decompiled back to C#:
public int Sum(int y)
{
    <>c__DisplayClass0_0 c = new();
    c.y = y;
    Func<int, int> func = new(c.<Sum>b__0);
    return DoubleResult(func, c.y);
}

private sealed class <>c__DisplayClass0_0
{
    public int y;

    internal int <Sum>b__0(int x) => x + y;
}
From that, we can see the closure is resulting in two allocations: an allocation for the “display class” (what the C# compiler calls these closure types) and an allocation for the delegate that points to the <Sum>b__0 method on that display class instance. That’s what’s accounting for the 88 bytes of allocation in the .NET 9 results: the display class is 24 bytes, and the delegate is 64 bytes. In the .NET 10 version, though, we only see a 24 byte allocation; that’s because the JIT has successfully elided the delegate allocation. Here is the resulting assembly code:
; .NET 9
; Tests.Sum(Int32)
push rbp
push r15
push rbx
lea rbp,[rsp+10]
mov ebx,esi
mov rdi,offset MT_Tests+<>c__DisplayClass0_0
call CORINFO_HELP_NEWSFAST
mov r15,rax
mov [r15+8],ebx
mov rdi,offset MT_System.Func<System.Int32, System.Int32>
call CORINFO_HELP_NEWSFAST
mov rbx,rax
lea rdi,[rbx+8]
mov rsi,r15
call CORINFO_HELP_ASSIGN_REF
mov rax,offset Tests+<>c__DisplayClass0_0.<Sum>b__0(Int32)
mov [rbx+18],rax
mov esi,[r15+8]
cmp [rbx+18],rax
jne short M00_L01
mov rax,[rbx+8]
add esi,[rax+8]
mov eax,esi
M00_L00:
add eax,eax
pop rbx
pop r15
pop rbp
ret
M00_L01:
mov rdi,[rbx+8]
call qword ptr [rbx+18]
jmp short M00_L00
; Total bytes of code 112

; .NET 10
; Tests.Sum(Int32)
push rbx
mov ebx,esi
mov rdi,offset MT_Tests+<>c__DisplayClass0_0
call CORINFO_HELP_NEWSFAST
mov [rax+8],ebx
mov eax,[rax+8]
mov ecx,eax
add eax,ecx
add eax,eax
pop rbx
ret
; Total bytes of code 32
In both .NET 9 and .NET 10, the JIT successfully inlined DoubleResult, such that the delegate doesn’t escape, but then in .NET 10, it’s able to stack allocate it. Woo hoo! There’s obviously still future opportunity here, as the JIT doesn’t elide the allocation of the closure object, but that should be addressable with some more effort, hopefully in the near future.
dotnet/runtime#104906 from @hez2010 and dotnet/runtime#112250 extend this kind of analysis and stack allocation to arrays. How many times have you written code like this?
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[MemoryDiagnoser(displayGenColumns: false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    [Benchmark]
    public void Test()
    {
        Process(new string[] { "a", "b", "c" });

        static void Process(string[] inputs)
        {
            foreach (string input in inputs)
            {
                Use(input);
            }

            [MethodImpl(MethodImplOptions.NoInlining)]
            static void Use(string input) { }
        }
    }
}
Some method I want to call accepts an array of inputs and does something for each input. I need to allocate an array to pass my inputs in, either explicitly, or maybe implicitly due to using params or a collection expression. Ideally, moving forward there would be an overload of such a Process method that accepted a ReadOnlySpan<string> instead of a string[], and I could then avoid the allocation by construction. But for all of these cases where I’m forced to create an array, .NET 10 comes to the rescue.
| Method Runtime Mean Ratio Allocated Alloc Ratio | |
| Test .NET 9.0 11.580 ns 1.00 48 B 1.00 | |
| Test .NET 10.0 3.960 ns 0.34 – 0.00 | |
The JIT was able to inline Process, see that the array never leaves the frame, and stack allocate it.
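As an aside, the span-accepting overload alluded to above might look like the following sketch (hypothetical; it’s not part of the benchmark):

// A hypothetical overload that sidesteps the array allocation by construction:
// callers can pass a collection expression or a stackalloc-backed buffer, and
// no heap array is ever required.
static void Process(ReadOnlySpan<string> inputs)
{
    foreach (string input in inputs)
    {
        Use(input);
    }
}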
Of course, now that we’re able to stack allocate arrays, we also want to be able to deal with a common way those arrays are used: via spans. dotnet/runtime#113977 and dotnet/runtime#116124 teach escape analysis to be able to reason about the fields in structs, which includes Span<T>, as it’s “just” a struct that stores a ref T field and an int length field.
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[MemoryDiagnoser(displayGenColumns: false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private byte[] _buffer = new byte[3];

    [Benchmark]
    public void Test() => Copy3Bytes(0x12345678, _buffer);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Copy3Bytes(int value, Span<byte> dest) =>
        BitConverter.GetBytes(value).AsSpan(0, 3).CopyTo(dest);
}
Here, we’re using BitConverter.GetBytes, which allocates a byte[] containing the bytes from the input (in this case, it’ll be a four-byte array for the int), then we slice off three of the four bytes, and we copy them to the destination span.
| Method Runtime Mean Ratio Allocated Alloc Ratio | |
| Test .NET 9.0 9.7717 ns 1.04 32 B 1.00 | |
| Test .NET 10.0 0.8718 ns 0.09 – 0.00 | |
In .NET 9, we get the 32-byte allocation we’d expect for the byte[] in GetBytes (every object on 64-bit is at least 24 bytes, which will include the four bytes for the array’s length, and then the four bytes for the data will be in slots 24-27, and the size will be padded up to the next word boundary, for 32). In .NET 10, with GetBytes and AsSpan inlined, the JIT can see that the array doesn’t escape, and a stack allocated version of it can be used to seed the span, just as if it were created from any other stack allocation (like stackalloc). (This case also needed a little help from dotnet/runtime#113093, which taught the JIT that certain span operations, like the Memmove used internally by CopyTo, are non-escaping.)
Devirtualization
Interfaces and virtual methods are a critical aspect of .NET and the abstractions it enables. Being able to unwind these abstractions and “devirtualize” is then an important job for the JIT, which has taken notable leaps in capabilities here in .NET 10.
While arrays are one of the most central features provided by C# and .NET, and while the JIT exerts a lot of energy and does a great job optimizing many aspects of arrays, one area in particular has caused it pain: an array’s interface implementations. The runtime manufactures a bunch of interface implementations for T[], and because they’re implemented differently from literally every other interface implementation in .NET, the JIT hasn’t been able to apply the same devirtualization capabilities it’s applied elsewhere. And, for anyone who’s dived deep into micro-benchmarks, this can lead to some odd observations. Here’s a performance comparison between iterating over a ReadOnlyCollection<T> using a foreach loop (going through its enumerator) and using a for loop (indexing on each element).
// dotnet run -c Release -f net9.0 --filter "*"
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.ObjectModel;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private ReadOnlyCollection<int> _list = new(Enumerable.Range(1, 1000).ToArray());

    [Benchmark]
    public int SumEnumerable()
    {
        int sum = 0;
        foreach (var item in _list)
        {
            sum += item;
        }
        return sum;
    }

    [Benchmark]
    public int SumForLoop()
    {
        ReadOnlyCollection<int> list = _list;
        int sum = 0;
        int count = list.Count;
        for (int i = 0; i < count; i++)
        {
            sum += list[i];
        }
        return sum;
    }
}
When asked “which of these will be faster”, the obvious answer is “SumForLoop”. After all, SumEnumerable is going to allocate an enumerator and has to make twice the number of interface calls (MoveNext+Current per iteration vs this[int] per iteration). As it turns out, the obvious answer is also wrong. Here are the timings on my machine for .NET 9:
| Method Mean | |
| SumEnumerable 949.5 ns | |
| SumForLoop 1,932.7 ns | |
What the what?? If I change the ToArray to instead be ToList, however, the numbers are much more in line with our expectations.
| Method Mean | |
| SumEnumerable 1,542.0 ns | |
| SumForLoop 894.1 ns | |
So what’s going on here? It’s super subtle. First, it’s necessary to know that ReadOnlyCollection<T> just wraps an arbitrary IList<T>, that ReadOnlyCollection<T>‘s GetEnumerator() returns _list.GetEnumerator() (I’m ignoring for this discussion the special-case where the list is empty), and that ReadOnlyCollection<T>‘s indexer just indexes into the IList<T>‘s indexer. So far, presumably, this all sounds like what you’d expect. But where things get interesting is around what the JIT is able to devirtualize. In .NET 9, it struggles to devirtualize calls to the interface implementations specifically on T[], so it won’t devirtualize either the _list.GetEnumerator() call or the _list[index] call. However, the enumerator that’s returned is just a normal type that implements IEnumerator<T>, and the JIT has no problem devirtualizing its MoveNext and Current members. Which means we’re actually paying a lot more going through the indexer: for N elements, we have to make N interface calls, whereas with the enumerator, we only pay for the one GetEnumerator interface call and then no more after that.
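In simplified form (my approximation, not the actual BCL source), the relevant parts of ReadOnlyCollection<T> look something like this:

using System.Collections;

// Simplified sketch of ReadOnlyCollection<T>'s delegation to its wrapped list.
public class SimplifiedReadOnlyCollection<T>(IList<T> list) : IEnumerable<T>
{
    public int Count => list.Count;

    // The indexer forwards to the wrapped IList<T>'s indexer: one interface
    // call per element, which .NET 9 couldn't devirtualize when the wrapped
    // list was a T[].
    public T this[int index] => list[index];

    // GetEnumerator forwards too, but the enumerator it returns is an ordinary
    // object whose MoveNext/Current the JIT can readily devirtualize.
    public IEnumerator<T> GetEnumerator() => list.GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}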
Thankfully, this is now addressed in .NET 10. dotnet/runtime#108153, dotnet/runtime#109209, dotnet/runtime#109237, and dotnet/runtime#116771 all make it possible for the JIT to devirtualize array’s interface method implementations. Now when we run the same benchmark (reverted back to using ToArray), we get results much more in line with our expectations, with both benchmarks improving from .NET 9 to .NET 10, and with SumForLoop on .NET 10 being the fastest.
| Method Runtime Mean Ratio | |
| SumEnumerable .NET 9.0 968.5 ns 1.00 | |
| SumEnumerable .NET 10.0 775.5 ns 0.80 | |
| SumForLoop .NET 9.0 1,960.5 ns 1.00 | |
| SumForLoop .NET 10.0 624.6 ns 0.32 | |
One of the really interesting things about this is how many libraries are built on the premise that iterating an IList<T> via its indexer is faster than iterating it via its IEnumerable<T>, and that includes System.Linq. All these years, LINQ has had specialized code paths for working with IList<T> where possible; in many cases that’s been a welcome optimization, but in some cases (such as when the concrete type is a ReadOnlyCollection<T>), it’s actually been a deoptimization.
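The shape of such a specialization is roughly this (a simplified sketch, not the actual System.Linq source):

// Sketch of the common IList<T> fast path: indexing avoids the enumerator
// allocation, but prior to .NET 10 each list[i] on an array-backed wrapper
// was an interface call the JIT couldn't devirtualize.
static int SumSketch(IEnumerable<int> source)
{
    if (source is IList<int> list)
    {
        int sum = 0;
        for (int i = 0; i < list.Count; i++)
        {
            sum += list[i];
        }
        return sum;
    }

    int total = 0;
    foreach (int item in source)
    {
        total += item;
    }
    return total;
}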
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.ObjectModel;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private ReadOnlyCollection<int> _list = new(Enumerable.Range(1, 1000).ToArray());

    [Benchmark]
    public int SkipTakeSum() => _list.Skip(100).Take(800).Sum();
}
| Method Runtime Mean Ratio | |
| SkipTakeSum .NET 9.0 3.525 us 1.00 | |
| SkipTakeSum .NET 10.0 1.773 us 0.50 | |
Fixing devirtualization for arrays’ interface implementations then also has this transitive effect on LINQ.
Guarded Devirtualization (GDV) is also improved in .NET 10, such as from dotnet/runtime#116453 and dotnet/runtime#109256. With dynamic PGO, the JIT is able to instrument a method’s compilation and then use the resulting profiling data as part of emitting an optimized version of the method. One of the things it can profile is which types are used in a virtual dispatch. If one type dominates, it can special-case that type in the code gen and emit a customized implementation specific to that type. That then enables devirtualization in that dedicated path, which is “guarded” by the relevant type check, hence “GDV”. In some cases, however, such as if a virtual call was being made in a shared generic context, GDV would not kick in. Now it will.
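Conceptually, the guarded code the JIT emits is similar to this hand-written expansion (illustrative only; the JIT does this at the machine-code level):

class Animal { public virtual string Speak() => ""; }
class Dog : Animal { public override string Speak() => "woof"; }

static string SpeakGuarded(Animal a)
{
    // Profiling showed `a` is almost always a Dog, so guard on the exact
    // type: in the common path the call can be devirtualized (and inlined);
    // otherwise we fall back to a normal virtual dispatch.
    if (a.GetType() == typeof(Dog))
    {
        return ((Dog)a).Speak();
    }
    return a.Speak();
}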
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    [Benchmark]
    public bool Test() => GenericEquals("abc", "abc");

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool GenericEquals<T>(T a, T b) => EqualityComparer<T>.Default.Equals(a, b);
}
| Method Runtime Mean Ratio | |
| Test .NET 9.0 2.816 ns 1.00 | |
| Test .NET 10.0 1.511 ns 0.54 | |
dotnet/runtime#110827 from @hez2010 also helps more methods to be inlined by doing another pass looking for opportunities after later phases of devirtualization. The JIT’s optimizations are split up into multiple phases; each phase can make improvements, and those improvements can expose additional opportunities. If those opportunities would only be capitalized on by a phase that already ran, they can be missed. But for phases that are relatively cheap to perform, such as doing a pass looking for additional inlining opportunities, those phases can be repeated once enough other optimization has happened that it’s likely productive to do so again.
Bounds Checking
C# is a memory-safe language, an important aspect of modern programming languages. A key component of this is the inability to walk off the beginning or end of an array, string, or span. The runtime ensures that any such invalid attempt produces an exception, rather than being allowed to perform the invalid memory access. We can see what this looks like with a small benchmark:
// dotnet run -c Release -f net10.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private int[] _array = new int[3];

    [Benchmark]
    public int Read() => _array[2];
}
This is a valid access: the _array contains three elements, and the Read method is reading its last element. However, the JIT can’t be 100% certain that this access is in bounds (something could have changed what’s in the _array field to be a shorter array), and thus it needs to emit a check to ensure we’re not walking off the end of the array. Here’s what the generated assembly code for Read looks like:
; .NET 10
; Tests.Read()
push rax
mov rax,[rdi+8]
cmp dword ptr [rax+8],2
jbe short M00_L00
mov eax,[rax+18]
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 25
The this reference is passed into the Read instance method in the rdi register, and the _array field is at offset 8, so the mov rax,[rdi+8] instruction loads the address of the array into the rax register. The cmp then reads the value at offset 8 from that address, which happens to be where the array object stores its length; this cmp instruction is the bounds check, comparing 2 against that length to ensure the access is in bounds. If the array were too short for this access, the next jbe instruction would branch to the M00_L00 label, which calls the CORINFO_HELP_RNGCHKFAIL helper function that throws an IndexOutOfRangeException. Any time you see this pair of call CORINFO_HELP_RNGCHKFAIL/int 3 at the end of a method, there was at least one bounds check emitted by the JIT in that method.
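In C# terms, the check the JIT emits is morally equivalent to the following (hand-written illustration):

// What a JIT-inserted bounds check amounts to: a single unsigned compare
// covers both negative and too-large indices, because a negative index
// reinterpreted as uint becomes a huge value.
static int ReadChecked(int[] array, int index)
{
    if ((uint)index >= (uint)array.Length)
    {
        throw new IndexOutOfRangeException();
    }

    return array[index]; // the actual element load
}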
Of course, we not only want safety, we also want great performance, and it’d be terrible for performance if every single read from an array (or string or span) incurred such an additional check. As such, the JIT strives to avoid emitting these checks when they’d be redundant, when it can prove by construction that the accesses are safe. For example, let me tweak my benchmark slightly, moving the array from an instance field into a static readonly field:
// dotnet run -c Release -f net10.0 --filter "*"
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private static readonly int[] s_array = new int[3];

    [Benchmark]
    public int Read() => s_array[2];
}
We now get this assembly:
; .NET 10
; Tests.Read()
mov rax,705D5419FA20
mov eax,[rax+18]
ret
; Total bytes of code 14
The static readonly field is immutable, arrays can’t be resized, and the JIT can guarantee that the field is initialized prior to generating the code for Read. Therefore, when generating the code for Read, it can know with certainty that the array is of length three, and we’re accessing the element at index two. Therefore, the specified array index is guaranteed to be within bounds, and there’s no need for a bounds check. We simply get two movs, the first mov to load the address of the array (which, thanks to improvements in previous releases, is allocated on a heap that doesn’t need to be compacted such that the array lives at a fixed address), and the second mov to read the int value at the location of index two (these are ints, so index two lives 2 * sizeof(int) = 8 bytes from the start of the array’s data, which itself on 64-bit is offset 16 bytes from the start of the array reference, for a total offset of 24 bytes, or in hex 0x18, hence the rax+18 in the disassembly).
Every release of .NET, more and more opportunities are found and implemented to eschew bounds checks that were previously being generated. .NET 10 continues this trend.
Our first example comes from dotnet/runtime#109900, which was inspired by the implementation of BitOperations.Log2. The operation has intrinsic hardware support on many architectures, and generally BitOperations.Log2 will use one of the hardware intrinsics available to it for a very efficient implementation (e.g. Lzcnt.LeadingZeroCount, ArmBase.LeadingZeroCount, or X86Base.BitScanReverse); however, as a fallback implementation, it uses a lookup table. The lookup table has 32 elements, and the operation involves computing a uint value and then shifting it down by 27 in order to get the top 5 bits. Any possible result is guaranteed to be a non-negative number less than 32, but indexing into the span with that result still produced a bounds check, and, as this is a critical path, “unsafe” code (meaning code that eschews the guardrails the runtime supplies by default) was then used to avoid the bounds check.
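The kind of unsafe workaround being referred to looks roughly like this (an illustrative sketch, not the exact library code):

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static byte TableLookup(ReadOnlySpan<byte> table, uint index)
{
    // Safe version (incurred a bounds check prior to .NET 10, even though
    // the computation guarantees index < 32):
    //     return table[(int)index];

    // Unsafe version: reads at the computed offset with no check at all,
    // trading the runtime's guardrails for speed; a bug here corrupts
    // memory instead of throwing.
    return Unsafe.Add(ref MemoryMarshal.GetReference(table), (nint)index);
}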
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "value")]
public partial class Tests
{
    [Benchmark]
    [Arguments(42)]
    public int Log2SoftwareFallback2(uint value)
    {
        ReadOnlySpan<byte> Log2DeBruijn =
        [
            00, 09, 01, 10, 13, 21, 02, 29,
            11, 14, 16, 18, 22, 25, 03, 30,
            08, 12, 20, 28, 15, 17, 24, 07,
            19, 27, 23, 06, 26, 05, 04, 31
        ];

        value |= value >> 01;
        value |= value >> 02;
        value |= value >> 04;
        value |= value >> 08;
        value |= value >> 16;

        return Log2DeBruijn[(int)((value * 0x07C4ACDDu) >> 27)];
    }
}
Now in .NET 10, the bounds check is gone (note the presence of the call CORINFO_HELP_RNGCHKFAIL in the .NET 9 assembly and the lack of it in the .NET 10 assembly).
; .NET 9
; Tests.Log2SoftwareFallback2(UInt32)
push rax
mov eax,esi
shr eax,1
or esi,eax
mov eax,esi
shr eax,2
or esi,eax
mov eax,esi
shr eax,4
or esi,eax
mov eax,esi
shr eax,8
or esi,eax
mov eax,esi
shr eax,10
or eax,esi
imul eax,7C4ACDD
shr eax,1B
cmp eax,20
jae short M00_L00
mov rcx,7913CA812E10
movzx eax,byte ptr [rax+rcx]
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 74

; .NET 10
; Tests.Log2SoftwareFallback2(UInt32)
mov eax,esi
shr eax,1
or esi,eax
mov eax,esi
shr eax,2
or esi,eax
mov eax,esi
shr eax,4
or esi,eax
mov eax,esi
shr eax,8
or esi,eax
mov eax,esi
shr eax,10
or eax,esi
imul eax,7C4ACDD
shr eax,1B
mov rcx,7CA298325E10
movzx eax,byte ptr [rcx+rax]
ret
; Total bytes of code 58
This improvement then enabled dotnet/runtime#118560 to simplify the code in the real Log2SoftwareFallback, avoiding manual use of unsafe constructs.
dotnet/runtime#113790 implements a similar case, where the result of a mathematical operation is guaranteed to be in bounds. In this case, it’s the result of Log2. The change teaches the JIT to understand the maximum possible value that Log2 could produce, and if that maximum is in bounds, then any result is guaranteed to be in bounds as well.
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "value")]
public partial class Tests
{
    [Benchmark]
    [Arguments(12345)]
    public nint CountDigits(ulong value)
    {
        ReadOnlySpan<byte> log2ToPow10 =
        [
            1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5,
            6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10,
            10, 11, 11, 11, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 15, 15,
            15, 16, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 19, 20
        ];

        return log2ToPow10[(int)ulong.Log2(value)];
    }
}
We can see the bounds check present in the .NET 9 output and absent in the .NET 10 output:
; .NET 9
; Tests.CountDigits(UInt64)
push rax
or rsi,1
xor eax,eax
lzcnt rax,rsi
xor eax,3F
cmp eax,40
jae short M00_L00
mov rcx,7C2D0A213DF8
movzx eax,byte ptr [rax+rcx]
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 45

; .NET 10
; Tests.CountDigits(UInt64)
or rsi,1
xor eax,eax
lzcnt rax,rsi
xor eax,3F
mov rcx,71EFA9400DF8
movzx eax,byte ptr [rcx+rax]
ret
; Total bytes of code 29
My choice of benchmark in this case was not coincidental. This pattern shows up in the FormattingHelpers.CountDigits internal method that’s used by the core primitive types in their ToString and TryFormat implementations, in order to determine how much space will be needed to store the rendered digits for a number. As with the previous example, this routine is considered core enough that it was using unsafe code to avoid the bounds check. With this fix, the code was able to be changed back to using a simple span access, and even with the simpler code, it’s now also faster.
Now, consider this code:
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "ids")]
public partial class Tests
{
    public IEnumerable<int[]> Ids { get; } = [[1, 2, 3, 4, 5, 1]];

    [Benchmark]
    [ArgumentsSource(nameof(Ids))]
    public bool StartAndEndAreSame(int[] ids) => ids[0] == ids[^1];
}
I have a method that’s accepting an int[] and checking to see whether it starts and ends with the same value. The JIT has no way of knowing whether the int[] is empty or not, so it does need a bounds check; otherwise, accessing ids[0] could walk off the end of the array. However, this is what we see on .NET 9:
; .NET 9
; Tests.StartAndEndAreSame(Int32[])
push rax
mov eax,[rsi+8]
test eax,eax
je short M00_L00
mov ecx,[rsi+10]
lea edx,[rax-1]
cmp edx,eax
jae short M00_L00
mov eax,edx
cmp ecx,[rsi+rax*4+10]
sete al
movzx eax,al
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 41
Note there are two jumps to the M00_L00 label that handles failed bounds checks… that’s because there are two bounds checks here, one for the start access and one for the end access. But that shouldn’t be necessary. ids[^1] is the same as ids[ids.Length - 1]. If the code has successfully accessed ids[0], that means the array is at least one element in length, and if it’s at least one element in length, ids[ids.Length - 1] will always be in bounds. Thus, the second bounds check shouldn’t be needed. Indeed, thanks to dotnet/runtime#116105, this is what we now get on .NET 10 (one branch to M00_L00 instead of two):
; .NET 10
; Tests.StartAndEndAreSame(Int32[])
push rax
mov eax,[rsi+8]
test eax,eax
je short M00_L00
mov ecx,[rsi+10]
dec eax
cmp ecx,[rsi+rax*4+10]
sete al
movzx eax,al
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 34
What’s really interesting to me here is the knock-on effect of having removed the bounds check. It didn’t just eliminate the cmp/jae pair of instructions that’s typical of a bounds check. The .NET 9 version of the code had this:
lea edx,[rax-1]
cmp edx,eax
jae short M00_L00
mov eax,edx
At this point in the assembly, the rax register is storing the length of the array. It’s calculating ids.Length - 1 and storing the result into edx, and then checking to see whether ids.Length-1 is in bounds of ids.Length (the only way it wouldn’t be is if the array were empty such that ids.Length-1 wrapped around to uint.MaxValue); if it’s not, it jumps to the fail handler, and if it is, it stores the already computed ids.Length - 1 into eax. By removing the bounds check, we get rid of those two intervening instructions, leaving these:
lea edx,[rax-1]
mov eax,edx
which is a little silly, as this sequence is just computing a decrement, and as long as it’s ok that flags get modified, it could instead just be:
dec eax
which, as you can see in the .NET 10 output, is exactly what .NET 10 now does.
dotnet/runtime#115980 addresses another case. Let’s say I have this method:
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "start", "text")]
public partial class Tests
{
    [Benchmark]
    [Arguments("abc", "abc.")]
    public bool IsFollowedByPeriod(string start, string text) =>
        start.Length < text.Length && text[start.Length] == '.';
}
We’re validating that one input’s length is less than the other’s, and then checking to see what comes immediately after it in the other. We know that string.Length is immutable, so a bounds check here is redundant, but until .NET 10, the JIT couldn’t see that.
; .NET 9
; Tests.IsFollowedByPeriod(System.String, System.String)
push rbp
mov rbp,rsp
mov eax,[rsi+8]
mov ecx,[rdx+8]
cmp eax,ecx
jge short M00_L00
cmp eax,ecx
jae short M00_L01
cmp word ptr [rdx+rax*2+0C],2E
sete al
movzx eax,al
pop rbp
ret
M00_L00:
xor eax,eax
pop rbp
ret
M00_L01:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 42

; .NET 10
; Tests.IsFollowedByPeriod(System.String, System.String)
mov eax,[rsi+8]
mov ecx,[rdx+8]
cmp eax,ecx
jge short M00_L00
cmp word ptr [rdx+rax*2+0C],2E
sete al
movzx eax,al
ret
M00_L00:
xor eax,eax
ret
; Total bytes of code 26
The removal of the bounds check almost halves the size of the function. If we don’t need to do a bounds check, we get to elide the cmp/jae. Without that branch, nothing is targeting M00_L01, and we can remove the call/int pair that were only necessary to support a bounds check. Then without the call in M00_L01, which was the only call in the whole method, the prologue and epilogue can be elided, meaning we also don’t need the opening and closing push and pop instructions.
dotnet/runtime#113233 improved the handling of “assertions” (facts the JIT establishes and then uses to drive optimizations) to be less order-dependent. In .NET 9, this code:
static bool Test(ReadOnlySpan<char> span, int pos) =>
    pos > 0 &&
    pos <= span.Length - 42 &&
    span[pos - 1] != '\n';
was successfully removing the bounds check on the span access, but the following variant, which just switches the order of the first two conditions, was still incurring the bounds check.
static bool Test(ReadOnlySpan<char> span, int pos) =>
    pos <= span.Length - 42 &&
    pos > 0 &&
    span[pos - 1] != '\n';
Note that both conditions contribute assertions (facts) that need to be merged in order to know the bounds check can be avoided. Now in .NET 10, the bounds check is elided, regardless of the order.
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private string _s = new string('s', 100);
    private int _pos = 10;

    [Benchmark]
    public bool Test()
    {
        string s = _s;
        int pos = _pos;
        return
            pos <= s.Length - 42 &&
            pos > 0 &&
            s[pos - 1] != '\n';
    }
}
; .NET 9
; Tests.Test()
push rbp
mov rbp,rsp
mov rax,[rdi+8]
mov ecx,[rdi+10]
mov edx,[rax+8]
lea edi,[rdx-2A]
cmp edi,ecx
jl short M00_L00
test ecx,ecx
jle short M00_L00
dec ecx
cmp ecx,edx
jae short M00_L01
cmp word ptr [rax+rcx*2+0C],0A
setne al
movzx eax,al
pop rbp
ret
M00_L00:
xor eax,eax
pop rbp
ret
M00_L01:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 55

; .NET 10
; Tests.Test()
push rbp
mov rbp,rsp
mov rax,[rdi+8]
mov ecx,[rdi+10]
mov edx,[rax+8]
add edx,0FFFFFFD6
cmp edx,ecx
jl short M00_L00
test ecx,ecx
jle short M00_L00
dec ecx
cmp word ptr [rax+rcx*2+0C],0A
setne al
movzx eax,al
pop rbp
ret
M00_L00:
xor eax,eax
pop rbp
ret
; Total bytes of code 45
dotnet/runtime#113862 addresses a similar case where assertions weren’t being handled as precisely as they could have been. Consider this code:
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[MemoryDiagnoser(displayGenColumns: false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private int[] _arr = Enumerable.Range(0, 10).ToArray();

    [Benchmark]
    public int Sum()
    {
        int[] arr = _arr;
        int sum = 0;
        int i;

        for (i = 0; i < arr.Length - 3; i += 4)
        {
            sum += arr[i + 0];
            sum += arr[i + 1];
            sum += arr[i + 2];
            sum += arr[i + 3];
        }

        for (; i < arr.Length; i++)
        {
            sum += arr[i];
        }

        return sum;
    }
}
The Sum method is trying to do manual loop unrolling. Rather than incurring a branch on each element, it’s handling four elements per iteration. Then, for the case where the length of the input isn’t evenly divisible by four, it’s handling the remaining elements in a separate loop. In .NET 9, the JIT successfully elides the bounds checks in the main unrolled loop:
; .NET 9
; Tests.Sum()
push rbp
mov rbp,rsp
mov rax,[rdi+8]
xor ecx,ecx
xor edx,edx
mov edi,[rax+8]
lea esi,[rdi-3]
test esi,esi
jle short M00_L02
M00_L00:
mov r8d,edx
add ecx,[rax+r8*4+10]
lea r8d,[rdx+1]
add ecx,[rax+r8*4+10]
lea r8d,[rdx+2]
add ecx,[rax+r8*4+10]
lea r8d,[rdx+3]
add ecx,[rax+r8*4+10]
add edx,4
cmp esi,edx
jg short M00_L00
jmp short M00_L02
M00_L01:
cmp edx,edi
jae short M00_L03
mov esi,edx
add ecx,[rax+rsi*4+10]
inc edx
M00_L02:
cmp edi,edx
jg short M00_L01
mov eax,ecx
pop rbp
ret
M00_L03:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 92
You can see this in the M00_L00 section, which has the five add instructions (four for the summed elements, and one for the index). However, we still see the CORINFO_HELP_RNGCHKFAIL at the end, indicating this method has a bounds check. That’s coming from the final loop, due to the JIT losing track of the fact that i is guaranteed to be non-negative. Now in .NET 10, that bounds check is removed as well (again, just look for the lack of the CORINFO_HELP_RNGCHKFAIL call).
; .NET 10
; Tests.Sum()
push rbp
mov rbp,rsp
mov rax,[rdi+8]
xor ecx,ecx
xor edx,edx
mov edi,[rax+8]
lea esi,[rdi-3]
test esi,esi
jle short M00_L01
M00_L00:
mov r8d,edx
add ecx,[rax+r8*4+10]
lea r8d,[rdx+1]
add ecx,[rax+r8*4+10]
lea r8d,[rdx+2]
add ecx,[rax+r8*4+10]
lea r8d,[rdx+3]
add ecx,[rax+r8*4+10]
add edx,4
cmp esi,edx
jg short M00_L00
M00_L01:
cmp edi,edx
jle short M00_L03
test edx,edx
jl short M00_L04
M00_L02:
mov esi,edx
add ecx,[rax+rsi*4+10]
inc edx
cmp edi,edx
jg short M00_L02
M00_L03:
mov eax,ecx
pop rbp
ret
M00_L04:
mov esi,edx
add ecx,[rax+rsi*4+10]
inc edx
cmp edi,edx
jg short M00_L04
jmp short M00_L03
; Total bytes of code 102
Another nice improvement comes from dotnet/runtime#112824, which teaches the JIT to turn facts it already learned from earlier checks into concrete numeric ranges, and then use those ranges to fold away later relational tests and bounds checks. Consider this example:
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private int[] _array = new int[10];

    [Benchmark]
    public void Test() => SetAndSlice(_array);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Span<int> SetAndSlice(Span<int> src)
    {
        src[5] = 42;
        return src.Slice(4);
    }
}
We have to incur a bounds check for the src[5], as the JIT has no evidence that src is at least six elements long. However, by the time we get to the Slice call, we know the span has a length of at least six, or else writing into src[5] would have failed. We can use that knowledge to remove the length check from within the Slice call (note the removal of the call qword ptr [7F8DDB3A7810]/int 3 sequence, which is the manual length check and call to a throw helper method in Slice).
; .NET 9
; Tests.SetAndSlice(System.Span`1<Int32>)
push rbp
mov rbp,rsp
cmp esi,5
jbe short M01_L01
mov dword ptr [rdi+14],2A
cmp esi,4
jb short M01_L00
add rdi,10
mov rax,rdi
add esi,0FFFFFFFC
mov edx,esi
pop rbp
ret
M01_L00:
call qword ptr [7F8DDB3A7810]
int 3
M01_L01:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 48

; .NET 10
; Tests.SetAndSlice(System.Span`1<Int32>)
push rax
cmp esi,5
jbe short M01_L00
mov dword ptr [rdi+14],2A
lea rax,[rdi+10]
lea edx,[rsi-4]
add rsp,8
ret
M01_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 31
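For reference, the length validation inside Slice that’s being elided is along these lines (a simplified sketch, not the actual library source):

// Sketch of the guard in Span<T>.Slice(int start): a single unsigned compare
// validates 0 <= start <= Length. Once src[5] has succeeded, the JIT knows
// Length >= 6, so for Slice(4) this can never fail and the check can be
// dropped entirely.
static Span<int> SliceChecked(Span<int> span, int start)
{
    if ((uint)start > (uint)span.Length)
    {
        throw new ArgumentOutOfRangeException(nameof(start));
    }

    return span.Slice(start);
}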
Let’s look at one more, which has a very nice impact on bounds checking, even though technically the optimization is broader than just that. dotnet/runtime#113998 creates assertions from switch targets. This means that the body of a switch case statement inherits facts about what was switched over based on what the case was, e.g. in a case 3 for switch (x), the body of that case will now “know” that x is three. This is great for very popular patterns with arrays, strings, and spans, where developers switch over the length and then index into available indices in the appropriate branches. Consider this:
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private int[] _array = [1, 2];

    [Benchmark]
    public int SumArray() => Sum(_array);

    [MethodImpl(MethodImplOptions.NoInlining)]
    public int Sum(ReadOnlySpan<int> span)
    {
        switch (span.Length)
        {
            case 0: return 0;
            case 1: return span[0];
            case 2: return span[0] + span[1];
            case 3: return span[0] + span[1] + span[2];
            default: return -1;
        }
    }
}
On .NET 9, each of those six span dereferences ends up with a bounds check:
; .NET 9
; Tests.Sum(System.ReadOnlySpan`1<Int32>)
push rbp
mov rbp,rsp
M01_L00:
cmp edx,2
jne short M01_L02
test edx,edx
je short M01_L04
mov eax,[rsi]
cmp edx,1
jbe short M01_L04
add eax,[rsi+4]
M01_L01:
pop rbp
ret
M01_L02:
cmp edx,3
ja short M01_L03
mov eax,edx
lea rcx,[783DA42091B8]
mov ecx,[rcx+rax*4]
lea rdi,[M01_L00]
add rcx,rdi
jmp rcx
M01_L03:
mov eax,0FFFFFFFF
pop rbp
ret
test edx,edx
je short M01_L04
mov eax,[rsi]
cmp edx,1
jbe short M01_L04
add eax,[rsi+4]
cmp edx,2
jbe short M01_L04
add eax,[rsi+8]
jmp short M01_L01
test edx,edx
je short M01_L04
mov eax,[rsi]
jmp short M01_L01
xor eax,eax
pop rbp
ret
M01_L04:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 103
You can see the tell-tale bounds check sign (CORINFO_HELP_RNGCHKFAIL) under M01_L04, and no fewer than six jumps targeting that label, one for each span[...] access. But on .NET 10, we get this:
; .NET 10
; Tests.Sum(System.ReadOnlySpan`1<Int32>)
push rbp
mov rbp,rsp
M01_L00:
cmp edx,2
jne short M01_L02
mov eax,[rsi]
add eax,[rsi+4]
M01_L01:
pop rbp
ret
M01_L02:
cmp edx,3
ja short M01_L03
mov eax,edx
lea rcx,[72C15C0F8FD8]
mov ecx,[rcx+rax*4]
lea rdx,[M01_L00]
add rcx,rdx
jmp rcx
M01_L03:
mov eax,0FFFFFFFF
pop rbp
ret
xor eax,eax
pop rbp
ret
mov eax,[rsi]
jmp short M01_L01
mov eax,[rsi]
add eax,[rsi+4]
add eax,[rsi+8]
jmp short M01_L01
; Total bytes of code 70
The CORINFO_HELP_RNGCHKFAIL and all the jumps to it have evaporated.
Cloning
There are other ways the JIT can remove bounds checking even when it can’t prove statically that every individual access is safe. Consider this method:
// dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[DisassemblyDiagnoser]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public partial class Tests
{
    private int[] _arr = new int[16];

    [Benchmark]
    public void Test()
    {
        int[] arr = _arr;
        arr[0] = 2;
        arr[1] = 3;
        arr[2] = 5;
        arr[3] = 8;
        arr[4] = 13;
        arr[5] = 21;
        arr[6] = 34;
        arr[7] = 55;
    }
}
Here’s the assembly code generated on .NET 9:
; .NET 9
; Tests.Test()
push rax
mov rax,[rdi+8]
mov ecx,[rax+8]
test ecx,ecx
je short M00_L00
mov dword ptr [rax+10],2
cmp ecx,1
jbe short M00_L00
mov dword ptr [rax+14],3
cmp ecx,2
jbe short M00_L00
mov dword ptr [rax+18],5
cmp ecx,3
jbe short M00_L00
mov dword ptr [rax+1C],8
cmp ecx,4
jbe short M00_L00
mov dword ptr [rax+20],0D
cmp ecx,5
jbe short M00_L00
mov dword ptr [rax+24],15
cmp ecx,6
jbe short M00_L00
mov dword ptr [rax+28],22
cmp ecx,7
jbe short M00_L00
mov dword ptr [rax+2C],37
add rsp,8
ret
M00_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 114
Even if you’re not proficient at reading assembly, the pattern should still be obvious. In the C# code, we have eight writes into the array, and in the assembly code, we have eight repetitions of the same pattern: cmp ecx,LENGTH to compare the length of the array against the required LENGTH, jbe short M00_L00 to jump to the CORINFO_HELP_RNGCHKFAIL helper if the bounds check fails, and mov dword ptr [rax+OFFSET],VALUE to store VALUE into the array at byte offset OFFSET. Inside the Test method, the JIT can’t know how long _arr is, so it must include bounds checks. Moreover, it must include all of the bounds checks, rather than coalescing them, because it is forbidden from introducing behavioral changes as part of optimizations. Imagine instead if it chose to coalesce all of the bounds checks into a single check, and emitted this method as if it were the equivalent of the following:
| if (arr.Length >= 8) | |
| { | |
| arr[0] = 2; | |
| arr[1] = 3; | |
| arr[2] = 5; | |
| arr[3] = 8; | |
| arr[4] = 13; | |
| arr[5] = 21; | |
| arr[6] = 34; | |
| arr[7] = 55; | |
| } | |
| else | |
| { | |
| throw new IndexOutOfRangeException(); | |
| } | |
| Now, let’s say the array was actually of length four. The original program would have filled the array with values [2, 3, 5, 8] before throwing an exception, but this transformed code wouldn’t (there wouldn’t be any writes to the array). That’s an observable behavioral change. An enterprising developer could of course choose to rewrite their code to avoid some of these checks, e.g. | |
| arr[7] = 55; | |
| arr[0] = 2; | |
| arr[1] = 3; | |
| arr[2] = 5; | |
| arr[3] = 8; | |
| arr[4] = 13; | |
| arr[5] = 21; | |
| arr[6] = 34; | |
| By moving the last store to the beginning, the developer has given the JIT extra knowledge. The JIT can now see that if the first store succeeds, the rest are guaranteed to succeed as well, and the JIT will emit a single bounds check. But, again, that’s the developer choosing to change their program in a way the JIT must not. However, there are other things the JIT can do. Imagine the JIT chose to rewrite the method like this instead: | |
| if (arr.Length >= 8) | |
| { | |
| arr[0] = 2; | |
| arr[1] = 3; | |
| arr[2] = 5; | |
| arr[3] = 8; | |
| arr[4] = 13; | |
| arr[5] = 21; | |
| arr[6] = 34; | |
| arr[7] = 55; | |
| } | |
| else | |
| { | |
| arr[0] = 2; | |
| arr[1] = 3; | |
| arr[2] = 5; | |
| arr[3] = 8; | |
| arr[4] = 13; | |
| arr[5] = 21; | |
| arr[6] = 34; | |
| arr[7] = 55; | |
| } | |
| To our C# sensibilities, that looks unnecessarily complicated; the if and the else block contain exactly the same C# code. But, knowing what we now know about how the JIT can use known length information to elide bounds checks, it starts to make a bit more sense. Here’s what the JIT emits for this variant on .NET 9: | |
| ; .NET 9 | |
| ; Tests.Test() | |
| push rbp | |
| mov rbp,rsp | |
| mov rax,[rdi+8] | |
| mov ecx,[rax+8] | |
| cmp ecx,8 | |
| jl short M00_L00 | |
| mov rcx,300000002 | |
| mov [rax+10],rcx | |
| mov rcx,800000005 | |
| mov [rax+18],rcx | |
| mov rcx,150000000D | |
| mov [rax+20],rcx | |
| mov rcx,3700000022 | |
| mov [rax+28],rcx | |
| pop rbp | |
| ret | |
| M00_L00: | |
| test ecx,ecx | |
| je short M00_L01 | |
| mov dword ptr [rax+10],2 | |
| cmp ecx,1 | |
| jbe short M00_L01 | |
| mov dword ptr [rax+14],3 | |
| cmp ecx,2 | |
| jbe short M00_L01 | |
| mov dword ptr [rax+18],5 | |
| cmp ecx,3 | |
| jbe short M00_L01 | |
| mov dword ptr [rax+1C],8 | |
| cmp ecx,4 | |
| jbe short M00_L01 | |
| mov dword ptr [rax+20],0D | |
| cmp ecx,5 | |
| jbe short M00_L01 | |
| mov dword ptr [rax+24],15 | |
| cmp ecx,6 | |
| jbe short M00_L01 | |
| mov dword ptr [rax+28],22 | |
| cmp ecx,7 | |
| jbe short M00_L01 | |
| mov dword ptr [rax+2C],37 | |
| pop rbp | |
| ret | |
| M00_L01: | |
| call CORINFO_HELP_RNGCHKFAIL | |
| int 3 | |
| ; Total bytes of code 177 | |
| The else block is compiled to the M00_L00 label, which contains those same eight repeated blocks we saw earlier. But the if block (above the M00_L00 label) is interesting. The only branch there is the initial array.Length >= 8 check I wrote in the C# code, emitted as the cmp ecx,8/jl short M00_L00 pair of instructions. The rest of the block is just mov instructions (and you can see there are only four writes into the array rather than eight… the JIT has optimized the eight four-byte writes into four eight-byte writes). In our rewrite, we’ve manually cloned the code, so that in what we expect to be the vast, vast, vast majority case (presumably we wouldn’t have written the array writes in the first place if we thought they’d fail), we only incur the single length check, and then we have our “hopefully this is never needed” fallback case for the rare situation where it is. Of course, you shouldn’t (and shouldn’t need to) do such manual cloning. But, the JIT can do such cloning for you, and does. | |
| “Cloning” is an optimization long employed by the JIT, where it will do this kind of code duplication, typically of loops, when it believes that in doing so, it can heavily optimize a common case. Now in .NET 10, thanks to dotnet/runtime#112595, it can employ this same technique for these kinds of sequences of writes. Going back to our original benchmark, here’s what we now get on .NET 10: | |
| ; .NET 10 | |
| ; Tests.Test() | |
| push rbp | |
| mov rbp,rsp | |
| mov rax,[rdi+8] | |
| mov ecx,[rax+8] | |
| mov edx,ecx | |
| cmp edx,7 | |
| jle short M00_L01 | |
| mov rdx,300000002 | |
| mov [rax+10],rdx | |
| mov rcx,800000005 | |
| mov [rax+18],rcx | |
| mov rcx,150000000D | |
| mov [rax+20],rcx | |
| mov rcx,3700000022 | |
| mov [rax+28],rcx | |
| M00_L00: | |
| pop rbp | |
| ret | |
| M00_L01: | |
| test edx,edx | |
| je short M00_L02 | |
| mov dword ptr [rax+10],2 | |
| cmp ecx,1 | |
| jbe short M00_L02 | |
| mov dword ptr [rax+14],3 | |
| cmp ecx,2 | |
| jbe short M00_L02 | |
| mov dword ptr [rax+18],5 | |
| cmp ecx,3 | |
| jbe short M00_L02 | |
| mov dword ptr [rax+1C],8 | |
| cmp ecx,4 | |
| jbe short M00_L02 | |
| mov dword ptr [rax+20],0D | |
| cmp ecx,5 | |
| jbe short M00_L02 | |
| mov dword ptr [rax+24],15 | |
| cmp ecx,6 | |
| jbe short M00_L02 | |
| mov dword ptr [rax+28],22 | |
| cmp ecx,7 | |
| jbe short M00_L02 | |
| mov dword ptr [rax+2C],37 | |
| jmp short M00_L00 | |
| M00_L02: | |
| call CORINFO_HELP_RNGCHKFAIL | |
| int 3 | |
| ; Total bytes of code 179 | |
| This structure looks almost identical to what we got when we manually cloned: the JIT has emitted the same code twice, except in one case, there are no bounds checks, and in the other case, there are all the bounds checks, and a single length check determines which path to follow. Pretty neat. | |
| As noted, the JIT has been doing cloning for years, in particular for loops over arrays. However, more and more code is being written against spans instead of arrays, and unfortunately this valuable optimization didn’t apply to spans. Now with dotnet/runtime#113575, it does! We can see this with a basic looping example: | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private int[] _arr = new int[16]; | |
| private int _count = 8; | |
| [Benchmark] | |
| public void WithSpan() | |
| { | |
| Span<int> span = _arr; | |
| int count = _count; | |
| for (int i = 0; i < count; i++) | |
| { | |
| span[i] = i; | |
| } | |
| } | |
| [Benchmark] | |
| public void WithArray() | |
| { | |
| int[] arr = _arr; | |
| int count = _count; | |
| for (int i = 0; i < count; i++) | |
| { | |
| arr[i] = i; | |
| } | |
| } | |
| } | |
| In both WithArray and WithSpan, we have the same loop, iterating from 0 to a _count that has an unknown relationship to the length of _arr, so some kind of bounds checking has to be emitted. Here’s what we get on .NET 9 for WithSpan: | |
| ; .NET 9 | |
| ; Tests.WithSpan() | |
| push rbp | |
| mov rbp,rsp | |
| mov rax,[rdi+8] | |
| test rax,rax | |
| je short M00_L03 | |
| lea rcx,[rax+10] | |
| mov eax,[rax+8] | |
| M00_L00: | |
| mov edi,[rdi+10] | |
| xor edx,edx | |
| test edi,edi | |
| jle short M00_L02 | |
| nop dword ptr [rax] | |
| M00_L01: | |
| cmp edx,eax | |
| jae short M00_L04 | |
| mov [rcx+rdx*4],edx | |
| inc edx | |
| cmp edx,edi | |
| jl short M00_L01 | |
| M00_L02: | |
| pop rbp | |
| ret | |
| M00_L03: | |
| xor ecx,ecx | |
| xor eax,eax | |
| jmp short M00_L00 | |
| M00_L04: | |
| call CORINFO_HELP_RNGCHKFAIL | |
| int 3 | |
| ; Total bytes of code 59 | |
| There’s some upfront assembly here associated with loading _arr into a span, loading _count, and checking to see whether the count is 0 (in which case the whole loop can be skipped). Then the core of the loop is at M00_L01, which is repeatedly checking edx (which contains i) against the length of the span (in eax), jumping to CORINFO_HELP_RNGCHKFAIL if it’s an out-of-bounds access, writing edx (i) into the span at the next position, bumping up i, and then jumping back to M00_L01 to keep iterating if i is still less than count (stored in edi). In other words, we have two checks per iteration: is i still within the bounds of the span, and is i still less than count. Now here’s what we get on .NET 9 for WithArray: | |
| ; .NET 9 | |
| ; Tests.WithArray() | |
| push rbp | |
| mov rbp,rsp | |
| mov rax,[rdi+8] | |
| mov ecx,[rdi+10] | |
| xor edx,edx | |
| test ecx,ecx | |
| jle short M00_L01 | |
| test rax,rax | |
| je short M00_L02 | |
| cmp [rax+8],ecx | |
| jl short M00_L02 | |
| nop dword ptr [rax+rax] | |
| M00_L00: | |
| mov edi,edx | |
| mov [rax+rdi*4+10],edx | |
| inc edx | |
| cmp edx,ecx | |
| jl short M00_L00 | |
| M00_L01: | |
| pop rbp | |
| ret | |
| M00_L02: | |
| cmp edx,[rax+8] | |
| jae short M00_L03 | |
| mov edi,edx | |
| mov [rax+rdi*4+10],edx | |
| inc edx | |
| cmp edx,ecx | |
| jl short M00_L02 | |
| jmp short M00_L01 | |
| M00_L03: | |
| call CORINFO_HELP_RNGCHKFAIL | |
| int 3 | |
| ; Total bytes of code 71 | |
| Here, label M00_L02 looks very similar to the loop we just saw in WithSpan, incurring both the check against count and the bounds check on every iteration. But note section M00_L00: it’s a clone of the same loop, still with the cmp edx,ecx that checks i against count on each iteration, but no additional bounds checking in sight. The JIT has cloned the loop, specializing one to not have bounds checks, and then in the upfront section, it determines which path to follow based on a single check against the array’s length (cmp [rax+8],ecx/jl short M00_L02). Now in .NET 10, here’s what we get for WithSpan: | |
| ; .NET 10 | |
| ; Tests.WithSpan() | |
| push rbp | |
| mov rbp,rsp | |
| mov rax,[rdi+8] | |
| test rax,rax | |
| je short M00_L04 | |
| lea rcx,[rax+10] | |
| mov eax,[rax+8] | |
| M00_L00: | |
| mov edx,[rdi+10] | |
| xor edi,edi | |
| test edx,edx | |
| jle short M00_L02 | |
| cmp edx,eax | |
| jg short M00_L03 | |
| M00_L01: | |
| mov eax,edi | |
| mov [rcx+rax*4],edi | |
| inc edi | |
| cmp edi,edx | |
| jl short M00_L01 | |
| M00_L02: | |
| pop rbp | |
| ret | |
| M00_L03: | |
| cmp edi,eax | |
| jae short M00_L05 | |
| mov esi,edi | |
| mov [rcx+rsi*4],edi | |
| inc edi | |
| cmp edi,edx | |
| jl short M00_L03 | |
| jmp short M00_L02 | |
| M00_L04: | |
| xor ecx,ecx | |
| xor eax,eax | |
| jmp short M00_L00 | |
| M00_L05: | |
| call CORINFO_HELP_RNGCHKFAIL | |
| int 3 | |
| ; Total bytes of code 75 | |
| As with WithArray in .NET 9, WithSpan for .NET 10 has the loop cloned, with the M00_L03 block containing the bounds check on each iteration, and the M00_L01 block eliding the bounds check on each iteration. | |
| The JIT gains more cloning abilities in .NET 10, as well. dotnet/runtime#110020, dotnet/runtime#108604, and dotnet/runtime#110483 make it possible for the JIT to clone try/finally blocks, whereas previously it would immediately bail out of cloning any regions containing such constructs. This might seem niche, but it’s actually quite valuable when you consider that foreach’ing over an enumerable typically involves a hidden try/finally, with the finally calling the enumerator’s Dispose. | |
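| To make that concrete, here’s roughly what the C# compiler lowers such a foreach into; this is a simplified sketch of my own (the real lowering has additional casting and null-handling details): | |
| using System; | |
| using System.Collections.Generic; | |
| static class ForeachLoweringSketch | |
| { | |
|     // Roughly what the compiler generates for: | |
|     //     foreach (int value in values) { sum += value; } | |
|     static int Sum(IEnumerable<int> values) | |
|     { | |
|         int sum = 0; | |
|         IEnumerator<int> e = values.GetEnumerator(); | |
|         try | |
|         { | |
|             while (e.MoveNext()) | |
|             { | |
|                 sum += e.Current; | |
|             } | |
|         } | |
|         finally | |
|         { | |
|             e.Dispose(); // the hidden finally that previously blocked cloning | |
|         } | |
|         return sum; | |
|     } | |
|     static void Main() => Console.WriteLine(Sum(new[] { 1, 2, 3 })); | |
| } | |
| Every method that foreach’es over an IEnumerable<T> carries that hidden finally, which is why teaching the JIT to clone (and, as we’ll see shortly, inline) try/finally regions pays off so broadly. | |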
| Many of these different optimizations interact with each other. Dynamic PGO triggers a form of cloning, as part of the guarded devirtualization (GDV) mentioned earlier: if the instrumentation data reveals that a particular virtual call is generally performed on an instance of a specific type, the JIT can clone the resulting code into one path specific to that type and another path that handles any type. That then enables the specific-type code path to devirtualize the call and possibly inline it. And if it inlines it, that then provides more opportunities for the JIT to see that an object doesn’t escape, and potentially stack allocate it. dotnet/runtime#111473, dotnet/runtime#116978, dotnet/runtime#116992, dotnet/runtime#117222, and dotnet/runtime#117295 enable that, enhancing escape analysis to determine if an object only escapes when such a generated type test fails (when the target object isn’t of the expected common type). | |
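| As a mental model for that interplay, here’s a hand-written C# analogue of what GDV does to a virtual call site (a sketch of my own; the JIT performs this transformation on its internal representation, not on source code): | |
| using System; | |
| using System.Collections.Generic; | |
| static class GdvSketch | |
| { | |
|     // Guard on the type the profile data says is common, specialize that | |
|     // path (devirtualized, inlineable, no enumerator allocation), and keep | |
|     // a general fallback for everything else. | |
|     static int Sum(IEnumerable<int> values) | |
|     { | |
|         if (values is int[] array) // the "guard" derived from profiling | |
|         { | |
|             int sum = 0; | |
|             for (int i = 0; i < array.Length; i++) | |
|             { | |
|                 sum += array[i]; // no interface dispatch, no allocation | |
|             } | |
|             return sum; | |
|         } | |
|         int fallbackSum = 0; | |
|         foreach (int value in values) // general path: interface dispatch | |
|         { | |
|             fallbackSum += value; | |
|         } | |
|         return fallbackSum; | |
|     } | |
|     static void Main() => Console.WriteLine(Sum(new[] { 1, 2, 3, 4 })); | |
| } | |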
| I want to pause for a moment, because my words thus far aren’t nearly enthusiastic enough to highlight the magnitude of what this enables. The dotnet/runtime repo uses an automated performance analysis system which flags when benchmarks significantly improve or regress and ties those changes back to the responsible PR. This is what it looked like for this PR: [image: conditional escape analysis triggering many benchmark improvements]. We can see why this is so good from a simple example: | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| using System.Runtime.CompilerServices; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [MemoryDiagnoser(displayGenColumns: false)] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private int[] _values = Enumerable.Range(1, 100).ToArray(); | |
| [Benchmark] | |
| public int Sum() => Sum(_values); | |
| [MethodImpl(MethodImplOptions.NoInlining)] | |
| private static int Sum(IEnumerable<int> values) | |
| { | |
| int sum = 0; | |
| foreach (int value in values) | |
| { | |
| sum += value; | |
| } | |
| return sum; | |
| } | |
| } | |
| With dynamic PGO, the instrumented code for Sum will see that values is generally an int[], and it’ll be able to emit a specialized code path in the optimized Sum implementation for when it is. And then with this ability to do conditional escape analysis, for the common path the JIT can see that the resulting GetEnumerator produces an IEnumerator<int> that never escapes, such that along with all of the relevant methods being devirtualized and inlined, the enumerator can be stack allocated. | |
| Method Runtime Mean Ratio Allocated Alloc Ratio | |
| Sum .NET 9.0 109.86 ns 1.00 32 B 1.00 | |
| Sum .NET 10.0 35.45 ns 0.32 – 0.00 | |
| Just think about how many places in your apps and services you enumerate collections like this, and you can see why it’s such an exciting improvement. Note that these cases don’t always even require PGO. Consider a case like this: | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [MemoryDiagnoser(displayGenColumns: false)] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private static readonly IEnumerable<int> s_values = new int[] { 1, 2, 3, 4, 5 }; | |
| [Benchmark] | |
| public int Sum() | |
| { | |
| int sum = 0; | |
| foreach (int value in s_values) | |
| { | |
| sum += value; | |
| } | |
| return sum; | |
| } | |
| } | |
| Here, the JIT can see that even though s_values is typed as IEnumerable<int>, it’s always actually an int[]. In that case, dotnet/runtime#111948 enables the field to be retyped in the JIT as int[], and the enumerator can be stack allocated. | |
| Method Runtime Mean Ratio Allocated Alloc Ratio | |
| Sum .NET 9.0 16.341 ns 1.00 32 B 1.00 | |
| Sum .NET 10.0 2.059 ns 0.13 – 0.00 | |
| Of course, too much cloning can be a bad thing, in particular as it increases code size. dotnet/runtime#108771 employs a heuristic to determine whether loops that can be cloned should be cloned; the larger the loop, the less likely it is to be cloned. | |
| Inlining | |
| “Inlining”, which replaces a call to a function with a copy of that function’s implementation, has always been a critically important optimization. It’s easy to think about the benefits of inlining as just being about avoiding the overhead of a call, and while that can be meaningful (especially when considering security mechanisms like Intel’s Control-Flow Enforcement Technology, which slightly increases the cost of calls), generally the most benefit from inlining comes from knock-on benefits. Just as a simple example, if you have code like: | |
| int i = Divide(10, 5); | |
| static int Divide(int n, int d) => n / d; | |
| if Divide doesn’t get inlined, then when Divide is called, it’ll need to perform the actual idiv, which is a relatively expensive operation. In contrast, if Divide is inlined, then the call site becomes: | |
| int i = 10 / 5; | |
| which can be evaluated at compile time and becomes just: | |
| int i = 2; | |
| More compelling examples were already seen throughout the discussion of escape analysis and stack allocation, which depend heavily on the ability to inline methods. Given the increased importance of inlining, it’s gotten even more focus in .NET 10. | |
| Some of the .NET work related to inlining is about enabling more kinds of things to be inlined. Historically, a variety of constructs present in a method would prevent that method from even being considered for inlining. Arguably the most well known of these is exception handling: methods with exception handling clauses, e.g. try/catch or try/finally, would not be inlined. Even a simple method like M in this example: | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [MemoryDiagnoser(displayGenColumns: false)] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private readonly object _o = new(); | |
| [Benchmark] | |
| public int Test() | |
| { | |
| M(_o); | |
| return 42; | |
| } | |
| private static void M(object o) | |
| { | |
| Monitor.Enter(o); | |
| try | |
| { | |
| } | |
| finally | |
| { | |
| Monitor.Exit(o); | |
| } | |
| } | |
| } | |
| does not get inlined on .NET 9: | |
| ; .NET 9 | |
| ; Tests.Test() | |
| push rax | |
| mov rdi,[rdi+8] | |
| call qword ptr [78F199864EE8]; Tests.M(System.Object) | |
| mov eax,2A | |
| add rsp,8 | |
| ret | |
| ; Total bytes of code 21 | |
| But with a plethora of PRs, in particular dotnet/runtime#112968, dotnet/runtime#113023, dotnet/runtime#113497, and dotnet/runtime#112998, methods containing try/finally are no longer blocked from inlining (try/catch regions are still a challenge). For the same benchmark on .NET 10, we now get this assembly: | |
| ; .NET 10 | |
| ; Tests.Test() | |
| push rbp | |
| push rbx | |
| push rax | |
| lea rbp,[rsp+10] | |
| mov rbx,[rdi+8] | |
| test rbx,rbx | |
| je short M00_L03 | |
| mov rdi,rbx | |
| call 00007920A0EE65E0 | |
| test eax,eax | |
| je short M00_L02 | |
| M00_L00: | |
| mov rdi,rbx | |
| call 00007920A0EE6D50 | |
| test eax,eax | |
| jne short M00_L04 | |
| M00_L01: | |
| mov eax,2A | |
| add rsp,8 | |
| pop rbx | |
| pop rbp | |
| ret | |
| M00_L02: | |
| mov rdi,rbx | |
| call qword ptr [79202393C1F8] | |
| jmp short M00_L00 | |
| M00_L03: | |
| xor edi,edi | |
| call qword ptr [79202393C1C8] | |
| int 3 | |
| M00_L04: | |
| mov edi,eax | |
| mov rsi,rbx | |
| call qword ptr [79202393C1E0] | |
| jmp short M00_L01 | |
| ; Total bytes of code 86 | |
| The details of the assembly don’t matter, other than it’s a whole lot more than was there before, because we’re now looking in large part at the implementation of M. In addition to methods with try/finally now being inlineable, other improvements have also been made around exception handling. For example, dotnet/runtime#110273 and dotnet/runtime#110464 enable the removal of try/catch and try/fault blocks when the JIT can prove the try block can’t possibly throw. Consider this: | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "i")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(42)] | |
| public int Test(int i) | |
| { | |
| try | |
| { | |
| i++; | |
| } | |
| catch | |
| { | |
| Console.WriteLine("Exception caught"); | |
| } | |
| return i; | |
| } | |
| } | |
| There’s nothing the try block here can do that will result in an exception being thrown (assuming the developer hasn’t enabled checked arithmetic, which would allow the increment to throw an OverflowException), yet on .NET 9 we get this assembly: | |
| ; .NET 9 | |
| ; Tests.Test(Int32) | |
| push rbp | |
| sub rsp,10 | |
| lea rbp,[rsp+10] | |
| mov [rbp-10],rsp | |
| mov [rbp-4],esi | |
| mov eax,[rbp-4] | |
| inc eax | |
| mov [rbp-4],eax | |
| M00_L00: | |
| mov eax,[rbp-4] | |
| add rsp,10 | |
| pop rbp | |
| ret | |
| push rbp | |
| sub rsp,10 | |
| mov rbp,[rdi] | |
| mov [rsp],rbp | |
| lea rbp,[rbp+10] | |
| mov rdi,784B08950018 | |
| call qword ptr [784B0DE44EE8] | |
| lea rax,[M00_L00] | |
| add rsp,10 | |
| pop rbp | |
| ret | |
| ; Total bytes of code 79 | |
| Now on .NET 10, the JIT is able to elide the catch and remove all ceremony related to the try because it can see that ceremony is pointless overhead. | |
| ; .NET 10 | |
| ; Tests.Test(Int32) | |
| lea eax,[rsi+1] | |
| ret | |
| ; Total bytes of code 4 | |
| That’s true even when the contents of the try calls into other methods that are then inlined, exposing their contents to the JIT’s analysis. | |
| (As an aside, the JIT was already able to remove try/finally when the finally was empty, but dotnet/runtime#108003 catches even more cases by checking for empty finallys again after most other optimizations have run, in case those optimizations revealed additional empty blocks.) | |
| Another example is “GVM”. Previously, any method that called a GVM, or generic virtual method (a virtual method with a generic type parameter), would be blocked from being inlined. | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [MemoryDiagnoser(displayGenColumns: false)] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private Base _base = new(); | |
| [Benchmark] | |
| public int Test() | |
| { | |
| M(); | |
| return 42; | |
| } | |
| private void M() => _base.M<object>(); | |
| } | |
| class Base | |
| { | |
| public virtual void M<T>() { } | |
| } | |
| On .NET 9, the above results in this assembly: | |
| ; .NET 9 | |
| ; Tests.Test() | |
| push rax | |
| call qword ptr [728ED5664FD8]; Tests.M() | |
| mov eax,2A | |
| add rsp,8 | |
| ret | |
| ; Total bytes of code 17 | |
| On .NET 10, with dotnet/runtime#116773, M can now be inlined. | |
| ; .NET 10 | |
| ; Tests.Test() | |
| push rbp | |
| push rbx | |
| push rax | |
| lea rbp,[rsp+10] | |
| mov rbx,[rdi+8] | |
| mov rdi,rbx | |
| mov rsi,offset MT_Base | |
| mov rdx,78034C95D2A0 | |
| call System.Runtime.CompilerServices.VirtualDispatchHelpers.VirtualFunctionPointer(System.Object, IntPtr, IntPtr) | |
| mov rdi,rbx | |
| call rax | |
| mov eax,2A | |
| add rsp,8 | |
| pop rbx | |
| pop rbp | |
| ret | |
| ; Total bytes of code 57 | |
| Another area of investment with inlining has to do with the heuristics around when methods should be inlined. Just inlining everything would be bad; inlining copies code, which results in more code, which can have significant negative repercussions. For example, inlining’s increased code size puts more pressure on caches. Processors have an instruction cache, a small amount of super fast memory in a CPU that stores recently used instructions, making them really fast to access again the next time they’re needed (such as the next iteration through a loop, or the next time that same function is called). Consider a method M, and 100 call sites to M that are all being accessed. If all of those share the same instructions for M, because the 100 call sites are all actually calling M, the instruction cache will only need to load M’s instructions once. If all of those 100 call sites each have their own copy of M’s instructions, then all 100 copies will separately be loaded through the cache, fighting with each other and other instructions for residence. The less likely it is that instructions are in the cache, the more likely it is that the CPU will stall waiting for the instructions to be loaded from memory. | |
| For this reason, the JIT needs to be careful what it inlines. It tries hard to avoid inlining anything that won’t benefit (e.g. a larger method whose instructions won’t be materially influenced by the caller’s context) while also trying hard to inline anything that will materially benefit (e.g. small functions where the code required to call the function is similar in size to the contents of the function, functions with instructions that could be materially impacted by information from the call site, etc.). As part of these heuristics, the JIT has the notion of “boosts,” where observations it makes about what a method does increase the chances of that method being inlined. dotnet/runtime#114806 gives a boost to methods that appear to be returning new arrays of a small, fixed length; if those arrays can instead be allocated in the caller’s frame, the JIT might then be able to discover they don’t escape and enable them to be stack allocated. dotnet/runtime#110596 similarly looks for boxing, as the caller could possibly instead avoid the box entirely. | |
| For the same purpose (and also just to minimize time spent performing compilation), the JIT also maintains a budget for how much it allows to be inlined into a method compilation… once it hits that budget, it might stop inlining anything. The budgeting scheme works well overall; however, in certain circumstances it can run out of budget at very inopportune times, for example doing a lot of inlining at top-level call sites but then running out of budget by the time it gets to small methods that are critically important to inline for good performance. To help mitigate these scenarios, dotnet/runtime#114191 and dotnet/runtime#118641 more than double the JIT’s default inlining budget. | |
| The JIT also pays a lot of attention to the number of local variables (e.g. parameters/locals explicitly in the IL, JIT-created temporary locals, promoted struct fields, etc.) it tracks. To avoid creating too many, the JIT would stop inlining once it was already tracking 512. But as other changes have made inlining more aggressive, this (strangely hardcoded) limit gets hit more often, leaving very valuable inlinees out in the cold. dotnet/runtime#118515 removed this fixed limit, instead tying it to a large percentage of the number of locals the JIT is allowed to track (by default, this ends up almost doubling the limit used by the inliner). | |
| Constant Folding | |
| Constant folding is a compiler’s ability to perform operations, typically math, at compile-time rather than at run-time: given multiple constants and an expressed relationship between them, the compiler can “fold” those constants together into a new constant. So, if you have the C# code int M(int i) => i + 2 * 3;, the C# compiler does constant folding and emits that into your compilation as if you’d written int M(int i) => i + 6;. The JIT can and does also do constant folding, which is valuable especially when it’s based on information not available to the C# compiler. For example, the JIT can treat static readonly fields or IntPtr.Size or Vector128<T>.Count as constants. And the JIT can do folding across inlines. For example, if you have: | |
| int M1(int i) => i + M2(2 * 3); | |
| int M2(int j) => j * Environment.ProcessorCount; | |
| the C# compiler will only be able to fold the 2 * 3, and will emit the equivalent of: | |
| int M1(int i) => i + M2(6); | |
| int M2(int j) => j * Environment.ProcessorCount; | |
| but when compiling M1, the JIT can inline M2 and treat ProcessorCount as a constant (on my machine it’s 16), and produce the following assembly code for M1: | |
| // dotnet run -c Release -f net9.0 --filter "*" | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "i")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(42)] | |
| public int M1(int i) => i + M2(6); | |
| private int M2(int j) => j * Environment.ProcessorCount; | |
| } | |
| ; .NET 9 | |
| ; Tests.M1(Int32) | |
| lea eax,[rsi+60] | |
| ret | |
| ; Total bytes of code 4 | |
| That’s as if the code for M1 had been public int M1(int i) => i + 96; (the displayed assembly renders hexadecimal, so the 60 is hexadecimal 0x60 and thus decimal 96). | |
| Or consider: | |
| string M() => GetString() ?? throw new Exception(); | |
| static string GetString() => "test"; | |
| The JIT will be able to inline GetString, at which point it can see that the result is non-null and can fold away the null check, which in turn lets it dead-code eliminate the throw. Constant folding is useful on its own in avoiding unnecessary work, but it also often unlocks other optimizations, like dead-code elimination and bounds-check elimination. The JIT is already quite good at finding constant folding opportunities, and gets better in .NET 10. Consider this benchmark: | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "s")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments("test")] | |
| public ReadOnlySpan<char> Test(string s) | |
| { | |
| s ??= ""; | |
| return s.AsSpan(); | |
| } | |
| } | |
| Here’s the assembly that gets produced for .NET 9: | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Test(System.String) | |
| push rbp | |
| mov rbp,rsp | |
| mov rax,75B5D6200008 | |
| test rsi,rsi | |
| cmove rsi,rax | |
| test rsi,rsi | |
| jne short M00_L01 | |
| xor eax,eax | |
| xor edx,edx | |
| M00_L00: | |
| pop rbp | |
| ret | |
| M00_L01: | |
| lea rax,[rsi+0C] | |
| mov edx,[rsi+8] | |
| jmp short M00_L00 | |
| ; Total bytes of code 41 | |
| Of particular note are those two test rsi,rsi instructions, which are null checks. The assembly starts by loading a value into rax; that value is the address of the "" string literal. It then uses test rsi,rsi to check whether the s parameter, which was passed into this instance method in the rsi register, is null. If it is null, the cmove rsi,rax instruction sets it to the address of the "" literal. And then… it does test rsi,rsi again? That second test is the null check at the beginning of AsSpan, which looks like this: | |
| public static ReadOnlySpan<char> AsSpan(this string? text) | |
| { | |
| if (text is null) return default; | |
| return new ReadOnlySpan<char>(ref text.GetRawStringData(), text.Length); | |
| } | |
| Now with dotnet/runtime#111985, that second null check, along with others, can be folded, resulting in this: | |
| ; .NET 10 | |
| ; Tests.Test(System.String) | |
| mov rax,7C01C4600008 | |
| test rsi,rsi | |
| cmove rsi,rax | |
| lea rax,[rsi+0C] | |
| mov edx,[rsi+8] | |
| ret | |
| ; Total bytes of code 25 | |
| Similar impact comes from dotnet/runtime#108420, which is also able to fold a different class of null checks. | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "condition")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(true)] | |
| public bool Test(bool condition) | |
| { | |
| string tmp = condition ? GetString1() : GetString2(); | |
| return tmp is not null; | |
| } | |
| private static string GetString1() => "Hello"; | |
| private static string GetString2() => "World"; | |
| } | |
| In this benchmark, we can see that neither GetString1 nor GetString2 return null, and thus the is not null check shouldn’t be necessary. The JIT in .NET 9 couldn’t see that, but its improved .NET 10 self can. | |
| ; .NET 9 | |
| ; Tests.Test(Boolean) | |
| mov rax,7407F000A018 | |
| mov rcx,7407F000A050 | |
| test sil,sil | |
| cmove rax,rcx | |
| test rax,rax | |
| setne al | |
| movzx eax,al | |
| ret | |
| ; Total bytes of code 37 | |
| ; .NET 10 | |
| ; Tests.Test(Boolean) | |
| mov eax,1 | |
| ret | |
| ; Total bytes of code 6 | |
| Constant folding also applies to SIMD (Single Instruction Multiple Data), instructions that enable processing multiple pieces of data at once rather than only one element at a time. dotnet/runtime#117099 and dotnet/runtime#117572 both enable more SIMD comparison operations to participate in folding. | |
| Code Layout | |
| When the JIT compiler generates assembly from the IL emitted by the C# compiler, it organizes that code into “basic blocks”: sequences of instructions, each with one entry point and one exit point, no jumps inside, and no branches out except at the end. These blocks can then be moved around as a unit, and the order in which these blocks are placed in memory is referred to as “code layout” or “basic block layout.” This ordering can have a significant performance impact because modern CPUs rely heavily on an instruction cache and on branch prediction to keep things moving fast. If frequently executed (“hot”) blocks are close together and follow a common execution path, the CPU can execute them with fewer cache misses and fewer mispredicted jumps. If the layout is poor, where the hot code is split into pieces far apart from each other, or where rarely executed (“cold”) code sits in between, the CPU can spend more time jumping around and refilling caches than doing actual work. Consider a tight loop executed millions of times. A good layout keeps the loop entry, body, and backward edge (the jump back to the beginning of the body to do the next iteration) right next to each other, letting the CPU fetch them straight from the cache. In a bad layout, that loop might be interwoven with unrelated cold blocks (say, a catch block for a try in the loop), forcing the CPU to load instructions from different places and disrupting the flow. Similarly, for an if block, the likely path should generally be the next block so no jump is required, with the unlikely branch behind a short jump away, as that better aligns with the sensibilities of branch predictors. Code layout heuristics control how that happens, and as a result, how efficiently the resulting code is able to execute. | |
| When determining the starting layout of the blocks (before additional optimizations are done for the layout), dotnet/runtime#108903 employs a “loop-aware reverse post-order” traversal. A reverse post-order traversal is an algorithm for visiting the nodes in a control flow graph such that each block appears after its predecessors. The “loop aware” part means the traversal recognizes loops as units, effectively creating a block around the whole loop, and tries to keep the whole loop together as the layout algorithm moves things around. The intent here is to start the larger layout optimizations from a more sensible place, reducing the amount of later reshuffling and situations where loop bodies get broken up. | |
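| For the curious, here’s a minimal sketch of the underlying algorithm, plain reverse post-order without the loop-aware grouping (the block numbers and graph shape are made up for illustration): | |
| using System; | |
| using System.Collections.Generic; | |
| static class RpoSketch | |
| { | |
|     // A minimal reverse post-order traversal over a toy control-flow graph. | |
|     // (The JIT's version adds loop awareness; this shows just the base walk.) | |
|     static List<int> ReversePostOrder(Dictionary<int, int[]> successors, int entry) | |
|     { | |
|         var visited = new HashSet<int>(); | |
|         var postOrder = new List<int>(); | |
|         void Visit(int block) | |
|         { | |
|             if (!visited.Add(block)) return; | |
|             if (successors.TryGetValue(block, out var succs)) | |
|             { | |
|                 foreach (int succ in succs) Visit(succ); | |
|             } | |
|             postOrder.Add(block); // a block is recorded after its successors | |
|         } | |
|         Visit(entry); | |
|         postOrder.Reverse(); // reversed: each block now precedes its successors | |
|         return postOrder; | |
|     } | |
|     static void Main() | |
|     { | |
|         // 0 -> {1, 2}, 1 -> {3}, 2 -> {3}: a simple diamond | |
|         var cfg = new Dictionary<int, int[]> { [0] = new[] { 1, 2 }, [1] = new[] { 3 }, [2] = new[] { 3 } }; | |
|         Console.WriteLine(string.Join(" ", ReversePostOrder(cfg, 0))); // prints: 0 2 1 3 | |
|     } | |
| } | |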
| In the extreme, layout is essentially the traveling salesman problem. The JIT must decide the order of basic blocks so that control transfers follow short, predictable paths and make efficient use of instruction cache and branch prediction. Just like the salesman visiting cities with minimal total travel distance, the compiler is trying to arrange blocks so that the “distance” between blocks, which might be measured in bytes or instruction fetch cost or something similar, is minimized. For any meaningfully-sized set of blocks, this is prohibitively expensive to compute optimally, as the number of possible orderings grows factorially with the number of blocks. Thus, the JIT has to rely on approximations rather than attempting an exact solution. One such approximation it employs now as of dotnet/runtime#103450 (and then tweaked further in dotnet/runtime#109741 and dotnet/runtime#109835) is a “3-opt,” which really just means that rather than considering all blocks together, it looks at only three and tries to produce an optimal ordering amongst those (there are only eight possible orderings to be checked). The JIT can choose to iterate through sets of three blocks until either it doesn’t see any more improvements or hits a self-imposed limit. Specifically when handling backward jumps, with dotnet/runtime#110277, it expands this “3-opt” to “4-opt” (four blocks). | |
| .NET 10 also does a better job of factoring PGO data into layout. With dynamic PGO, the JIT is able to gather instrumentation data from an initial compilation and then use the results of that profiling to impact an optimized re-compilation. That data can lead to conclusions about what blocks are hot or cold, and which direction branches take, all information that’s valuable for layout optimization. However, data can sometimes be missing from these profiles, so the JIT has a “profile synthesis” algorithm that makes realistic guesses for these gaps in order to fill them in (if you’ve read or seen “Jurassic Park,” this is the JIT equivalent of filling in gaps in the dinosaur DNA sequences with DNA from present-day frogs.) With dotnet/runtime#111915, that repairing of the profile data is now performed just before layout, so that layout has a more complete picture. | |
| Let’s take a concrete example of all this. Here I’ve extracted the core function from MemoryExtensions.BinarySearch: | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| using System.Runtime.CompilerServices; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private int[] _values = Enumerable.Range(0, 512).ToArray(); | |
| [Benchmark] | |
| public int BinarySearch() | |
| { | |
| int[] values = _values; | |
| return BinarySearch(ref values[0], values.Length, 256); | |
| } | |
| [MethodImpl(MethodImplOptions.NoInlining)] | |
| private static int BinarySearch<T, TComparable>( | |
| ref T spanStart, int length, TComparable comparable) | |
| where TComparable : IComparable<T>, allows ref struct | |
| { | |
| int lo = 0; | |
| int hi = length - 1; | |
| while (lo <= hi) | |
| { | |
| int i = (int)(((uint)hi + (uint)lo) >> 1); | |
| int c = comparable.CompareTo(Unsafe.Add(ref spanStart, i)); | |
| if (c == 0) | |
| { | |
| return i; | |
| } | |
| else if (c > 0) | |
| { | |
| lo = i + 1; | |
| } | |
| else | |
| { | |
| hi = i - 1; | |
| } | |
| } | |
| return ~lo; | |
| } | |
| } | |
| And here’s the assembly we get for .NET 9 and .NET 10, diff’d from the former to the latter: | |
| ; Tests.BinarySearch[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32, Int32) | |
| push rbp | |
| mov rbp,rsp | |
| xor ecx,ecx | |
| dec esi | |
| js short M01_L07 | |
| + jmp short M01_L03 | |
| M01_L00: | |
| - lea eax,[rsi+rcx] | |
| - shr eax,1 | |
| - movsxd r8,eax | |
| - mov r8d,[rdi+r8*4] | |
| - cmp edx,r8d | |
| - jge short M01_L03 | |
| mov r9d,0FFFFFFFF | |
| M01_L01: | |
| test r9d,r9d | |
| je short M01_L06 | |
| test r9d,r9d | |
| jg short M01_L05 | |
| lea esi,[rax-1] | |
| M01_L02: | |
| cmp ecx,esi | |
| - jle short M01_L00 | |
| - jmp short M01_L07 | |
| + jg short M01_L07 | |
| M01_L03: | |
| + lea eax,[rsi+rcx] | |
| + shr eax,1 | |
| + movsxd r8,eax | |
| + mov r8d,[rdi+r8*4] | |
| cmp edx,r8d | |
| - jg short M01_L04 | |
| - xor r9d,r9d | |
| + jl short M01_L00 | |
| + cmp edx,r8d | |
| + jle short M01_L04 | |
| + mov r9d,1 | |
| jmp short M01_L01 | |
| M01_L04: | |
| - mov r9d,1 | |
| + xor r9d,r9d | |
| jmp short M01_L01 | |
| M01_L05: | |
| lea ecx,[rax+1] | |
| jmp short M01_L02 | |
| M01_L06: | |
| pop rbp | |
| ret | |
| M01_L07: | |
| mov eax,ecx | |
| not eax | |
| pop rbp | |
| ret | |
| ; Total bytes of code 83 | |
| We can see that the main change here is a block that’s moved (the bulk of M01_L00 moving down to M01_L03). In .NET 9, the lo <= hi “stay in the loop” check (cmp ecx,esi) is a backward conditional branch (jle short M01_L00), where every iteration of the loop except for the last jumps back to the top (M01_L00). In .NET 10, it instead does a forward branch to exit the loop only in the rarer case, otherwise falling through to the body of the loop in the common case, and then unconditionally branching back. | |
| GC Write Barriers | |
| The .NET garbage collector (GC) works on a generational model, organizing the managed heap according to how long objects have been alive. The newest allocations land in “generation 0” (gen0), objects that have survived at least one collection are promoted to “generation 1” (gen1), and those that have been around for longer end up in “generation 2” (gen2). This is based on the premise that most objects are temporary, and that once an object has been around for a while, it’s likely to stick around for a while longer. Splitting up the heap into generations enables gen0 to be collected quickly, by only scanning the gen0 heap for remaining references to those objects. The expectation is that all, or at least the vast majority, of references to a gen0 object are also in gen0. Of course, if a reference to a gen0 object snuck into gen1 or gen2, not scanning gen1 or gen2 during a gen0 collection could be, well, bad. To avoid that case, the JIT collaborates with the GC to track references from older to younger generations. Whenever there’s a reference write that could cross a generation, the JIT emits a call to a helper that tracks the information in a “card table,” and when the GC runs, it consults this table to see if it needs to scan a portion of the higher generations. That helper is referred to as a “GC write barrier.” Since a write barrier is potentially employed on every reference write, it must be super fast, and in fact the runtime has several different variations of write barriers so that the JIT can pick one optimized for the given situation. Of course, the fastest write barrier is one that doesn’t need to exist at all, so as with bounds checks, the JIT also exerts energy to try to prove when write barriers aren’t needed, eliding them when it can. And it can elide even more of them in .NET 10. | |
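| As a mental model of that collaboration, here’s a toy sketch of a card-marking write barrier; the names and sizes are made up for illustration, and the real barriers are hand-tuned assembly in the runtime: | |
| using System; | |
| static class CardTableSketch | |
| { | |
|     // Toy model: the "heap" is an array of reference slots, and one card | |
|     // covers CardSize consecutive slots (illustrative granularity). | |
|     const int CardSize = 256; | |
|     static readonly object[] s_heap = new object[64 * 1024]; | |
|     static readonly bool[] s_cards = new bool[s_heap.Length / CardSize]; | |
|     // Conceptual write barrier: do the reference store, then dirty the card | |
|     // covering the destination so a gen0 collection knows to scan that region. | |
|     static void BarrieredWrite(int slot, object value) | |
|     { | |
|         s_heap[slot] = value; | |
|         s_cards[slot / CardSize] = true; | |
|     } | |
|     static void Main() | |
|     { | |
|         BarrieredWrite(5000, new object()); | |
|         Console.WriteLine($"card {5000 / CardSize} is dirty: {s_cards[5000 / CardSize]}"); | |
|     } | |
| } | |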
| ref structs, referred to in runtime vernacular as “byref-like types,” can never live on the heap, which means any reference fields in them will similarly never live on the heap. As such, if the JIT can prove that a reference write is targeting a field of a ref struct, it can elide the write barrier. Consider this example: | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private object _object = new(); | |
| [Benchmark] | |
| public MyRefStruct Test() => new MyRefStruct() { Obj1 = _object, Obj2 = _object, Obj3 = _object }; | |
| public ref struct MyRefStruct | |
| { | |
| public object Obj1; | |
| public object Obj2; | |
| public object Obj3; | |
| } | |
| } | |
| In the .NET 9 assembly, we can see three write barriers (CORINFO_HELP_CHECKED_ASSIGN_REF) corresponding to the three fields in MyRefStruct in the benchmark: | |
| ; .NET 9 | |
| ; Tests.Test() | |
| push r15 | |
| push r14 | |
| push rbx | |
| mov rbx,rsi | |
| mov r15,[rdi+8] | |
| mov rsi,r15 | |
| mov r14,r15 | |
| mov rdi,rbx | |
| call CORINFO_HELP_CHECKED_ASSIGN_REF | |
| lea rdi,[rbx+8] | |
| mov rsi,r14 | |
| call CORINFO_HELP_CHECKED_ASSIGN_REF | |
| lea rdi,[rbx+10] | |
| mov rsi,r15 | |
| call CORINFO_HELP_CHECKED_ASSIGN_REF | |
| mov rax,rbx | |
| pop rbx | |
| pop r14 | |
| pop r15 | |
| ret | |
| ; Total bytes of code 59 | |
| With dotnet/runtime#111576 and dotnet/runtime#111733 in .NET 10, all of those write barriers are elided: | |
| ; .NET 10 | |
| ; Tests.Test() | |
| mov rax,[rdi+8] | |
| mov rcx,rax | |
| mov rdx,rax | |
| mov [rsi],rcx | |
| mov [rsi+8],rdx | |
| mov [rsi+10],rax | |
| mov rax,rsi | |
| ret | |
| ; Total bytes of code 25 | |
| Much more impactful, however, are dotnet/runtime#112060 and dotnet/runtime#112227, which have to do with “return buffers.” When a .NET method is typed to return a value, the runtime has to decide how that value gets from the callee back to the caller. For small types, like integers, floating-point numbers, pointers, or object references, the answer is simple: the value can be passed back via one or more CPU registers reserved for return values, making the operation essentially free. But not all values fit neatly into registers. Larger value types, such as structs with multiple fields, require a different strategy. In these cases, the caller allocates a “return buffer,” a block of memory, typically in the caller’s stack frame, and the caller passes a pointer to that buffer as a hidden argument to the method. The method then writes the return value directly into that buffer in order to provide the caller with the data. When it comes to write barriers, the challenge here is that there historically hasn’t been a requirement that the return buffer be on the stack; it’s technically feasible it could have been allocated on the heap, even if it rarely or never is. And since the callee doesn’t know where the buffer lives, any object reference writes needed to be tracked with GC write barriers. | |
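| To make the hidden argument concrete, here’s a rough C# model of the transformation (a sketch of my own; the actual convention deals in registers and stack addresses rather than out parameters): | |
| using System; | |
| static class ReturnBufferSketch | |
| { | |
|     public readonly record struct Person(string FirstName, string LastName, string Address, string City); | |
|     // What the source says: return the struct by value. | |
|     static Person GetPerson() => new("Jane", "Smith", "123 Main St", "Anytown"); | |
|     // Roughly what happens under the covers: the caller passes a hidden | |
|     // reference to memory it owns, and the callee writes the result there. | |
|     static void GetPerson(out Person returnBuffer) => | |
|         returnBuffer = new("Jane", "Smith", "123 Main St", "Anytown"); | |
|     static void Main() | |
|     { | |
|         Person p1 = GetPerson();   // hidden buffer supplied automatically | |
|         GetPerson(out Person p2);  // the same idea, written out by hand | |
|         Console.WriteLine($"{p1} / {p2}"); | |
|     } | |
| } | |
| We can see that with a simple benchmark: | |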
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private string _firstName = "Jane", _lastName = "Smith", _address = "123 Main St", _city = "Anytown"; | |
| [Benchmark] | |
| public Person GetPerson() => new(_firstName, _lastName, _address, _city); | |
| public record struct Person(string FirstName, string LastName, string Address, string City); | |
| } | |
| On .NET 9, each field of the returned value type is incurring a CORINFO_HELP_CHECKED_ASSIGN_REF write barrier: | |
| ; .NET 9 | |
| ; Tests.GetPerson() | |
| push r15 | |
| push r14 | |
| push r13 | |
| push rbx | |
| mov rbx,rsi | |
| mov rsi,[rdi+8] | |
| mov r15,[rdi+10] | |
| mov r14,[rdi+18] | |
| mov r13,[rdi+20] | |
| mov rdi,rbx | |
| call CORINFO_HELP_CHECKED_ASSIGN_REF | |
| lea rdi,[rbx+8] | |
| mov rsi,r15 | |
| call CORINFO_HELP_CHECKED_ASSIGN_REF | |
| lea rdi,[rbx+10] | |
| mov rsi,r14 | |
| call CORINFO_HELP_CHECKED_ASSIGN_REF | |
| lea rdi,[rbx+18] | |
| mov rsi,r13 | |
| call CORINFO_HELP_CHECKED_ASSIGN_REF | |
| mov rax,rbx | |
| pop rbx | |
| pop r13 | |
| pop r14 | |
| pop r15 | |
| ret | |
| ; Total bytes of code 81 | |
| Now in .NET 10, the calling convention has been updated to require that the return buffer live on the stack (if the caller wants the data somewhere else, it’s responsible for subsequently doing that copy). And because the return buffer is now guaranteed to be on the stack, the JIT can elide all GC write barriers as part of returning values. | |
| ; .NET 10 | |
| ; Tests.GetPerson() | |
| mov rax,[rdi+8] | |
| mov rcx,[rdi+10] | |
| mov rdx,[rdi+18] | |
| mov rdi,[rdi+20] | |
| mov [rsi],rax | |
| mov [rsi+8],rcx | |
| mov [rsi+10],rdx | |
| mov [rsi+18],rdi | |
| mov rax,rsi | |
| ret | |
| ; Total bytes of code 35 | |
| dotnet/runtime#111636 from @a74nh is also interesting from a performance perspective because, as is common in optimization, it trades off one thing for another. Prior to this change, Arm64 had one universal write barrier helper for all GC modes. This change brings Arm64 in line with x64 by routing through a WriteBarrierManager that selects among multiple JIT_WriteBarrier variants based on runtime configuration. In doing so, it makes each Arm64 write barrier a bit more expensive, by adding region checks and moving to a region-aware card marking scheme, but in exchange it lets the GC do less work: fewer cards in the card table get marked, and the GC can scan more precisely. dotnet/runtime#106191 also helps reduce the cost of write barriers on Arm64 by tightening the hot-path comparisons and eliminating some avoidable saves and restores. | |
| Instruction Sets | |
| .NET continues to see meaningful optimizations and improvements across all supported architectures, along with various architecture-specific improvements. Here are a handful of examples. | |
| Arm SVE | |
| APIs for Arm SVE were introduced in .NET 9. As noted in the Arm SVE section of last year’s post, enabling SVE is a multi-year effort, and in .NET 10, support is still considered experimental. However, the support has continued to be improved and extended, with PRs like dotnet/runtime#115775 from @snickolls-arm adding BitwiseSelect methods, dotnet/runtime#117711 from @jacob-crawley adding MaxPairwise and MinPairwise methods, and dotnet/runtime#117051 from @jonathandavies-arm adding VectorTableLookup methods. | |
| Arm64 | |
| dotnet/runtime#111893 from @jonathandavies-arm, dotnet/runtime#111904 from @jonathandavies-arm, dotnet/runtime#111452 from @jonathandavies-arm, dotnet/runtime#112235 from @jonathandavies-arm, and dotnet/runtime#111797 from @snickolls-arm all improved .NET’s support for utilizing Arm64’s multi-operation compound instructions. For example, when implementing a compare and branch, rather than emitting a cmp against 0 followed by a beq instruction, the JIT may now emit a cbz (“Compare and Branch on Zero”) instruction. | |
| APX | |
| Intel’s Advanced Performance Extensions (APX) was announced in 2023 as an extension of the x86/x64 instruction set. It expands the number of general-purpose registers from 16 to 32 and adds new instructions such as conditional operations designed to reduce memory traffic, improve performance, and lower power consumption. dotnet/runtime#106557 from @Ruihan-Yin, dotnet/runtime#108796 from @Ruihan-Yin, and dotnet/runtime#113237 from @Ruihan-Yin essentially teach the JIT how to speak the new dialect of assembly code (the new REX2 and extended EVEX encodings), and dotnet/runtime#108799 from @Ruihan-Yin updates the JIT to be able to use the expanded set of registers. The most impactful new instructions in APX are around conditional compares (ccmp), a concept the JIT already supports from targeting other instruction sets, and dotnet/runtime#111072 from @anthonycanino, dotnet/runtime#112153 from @anthonycanino, and dotnet/runtime#116445 from @khushal1996 all teach the JIT how to make good use of these new instructions with APX. | |
| AVX512 | |
| .NET 8 added broad support for AVX512, and .NET 9 significantly improved its handling and adoption throughout the core libraries. .NET 10 includes a plethora of additional related optimizations: | |
| dotnet/runtime#109258 from @saucecontrol and dotnet/runtime#109267 from @saucecontrol expand the number of places the JIT is able to use EVEX embedded broadcasts, a feature that lets vector instructions read a single scalar element from memory and implicitly replicate it to all the lanes of the vector, without needing a separate broadcast or shuffle operation (see the sketch after this list). | |
| dotnet/runtime#108824 removes a redundant sign extension from broadcasts. | |
| dotnet/runtime#116117 from @alexcovington improves the code generated for Vector.Max and Vector.Min when AVX512 is supported. | |
| dotnet/runtime#109474 from @saucecontrol improves “containment” (where an instruction can be eliminated by having its behaviors fully encapsulated by another instruction) for AVX512 widening intrinsics (similar containment-related improvements were made in dotnet/runtime#110736 from @saucecontrol and dotnet/runtime#111778 from @saucecontrol). | |
| dotnet/runtime#111853 from @saucecontrol improves Vector128/256/512.Dot to be better accelerated with AVX512. | |
| dotnet/runtime#110195, dotnet/runtime#110307, and dotnet/runtime#117118 all improve how vector masks are handled. In AVX512, masks are special registers that can be included as part of various instructions to control which subset of vector elements should be utilized (each bit in a mask corresponds to one element in the vector). This enables operating on only part of a vector without needing extra branching or shuffling. | |
| dotnet/runtime#115981 improves zeroing (where the JIT emits instructions to zero out memory, often as part of initializing a stack frame) on AVX512. After zeroing as much as it can with 64-byte instructions, it was falling back to using 16-byte instructions, when it could have used 32-byte instructions. | |
| dotnet/runtime#110662 improves the code generated for ExtractMostSignificantBits (which is used by many of the searching algorithms in the core libraries) when working with short and ushort (and char, as most of those core library implementations reinterpret cast char as one of the others) by using EVEX mask support. | |
| dotnet/runtime#113864 from @saucecontrol improves the code generated for ConditionalSelect when not used with mask registers. | |
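| As an example of the embedded-broadcast shape described in the first item above, consider this sketch (whether the JIT actually uses the embedded encoding here depends on the target hardware and the surrounding code): | |
| using System; | |
| using System.Runtime.Intrinsics; | |
| static class BroadcastSketch | |
| { | |
|     // The scalar 'scale' is replicated across all lanes and multiplied in. | |
|     // With EVEX embedded broadcasts, that replication can be folded into | |
|     // the multiply itself rather than needing a separate broadcast. | |
|     static Vector512<float> Scale(Vector512<float> values, float scale) => | |
|         values * Vector512.Create(scale); | |
|     static void Main() | |
|     { | |
|         Vector512<float> v = Vector512.Create(2f); | |
|         Console.WriteLine(Scale(v, 3f)); // every lane: 6 | |
|     } | |
| } | |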
| AVX10.2 | |
| .NET 9 added support and intrinsics for the AVX10.1 instruction set. With dotnet/runtime#111209 from @khushal1996, .NET 10 adds support and intrinsics for the AVX10.2 instruction set. dotnet/runtime#112535 from @khushal1996 optimizes floating-point min/max operations with AVX10.2 instructions, while dotnet/runtime#111775 from @khushal1996 enables floating-point conversions to utilize AVX10.2. | |
| GFNI | |
| dotnet/runtime#109537 from @saucecontrol adds intrinsics for the GFNI (Galois Field New Instructions) instruction set, which can be used for accelerating operations over the Galois field GF(2^8). These are common in cryptography, error correction, and data encoding. | |
| VPCLMULQDQ | |
| VPCLMULQDQ is an x86 instruction set extension that adds vector support to the older PCLMULQDQ instruction, which performs carry-less multiplication over 64-bit integers. dotnet/runtime#109137 from @saucecontrol adds new intrinsic APIs for VPCLMULQDQ. | |
| Miscellaneous | |
| Many more PRs than the ones I’ve already called out have gone into the JIT this release. Here are a few more: | |
| Eliminating some covariance checks. Writing into arrays of reference types can require “covariance checks.” Imagine you have a class Base and two derived types Derived1 : Base and Derived2 : Base. Since arrays in .NET are covariant, I can have a Derived1[] and cast it successfully to a Base[], but under the covers that’s still a Derived1[]. That means, for example, that any attempt to store a Derived2 into that array should fail at runtime, even if it compiles. To achieve that, the JIT needs to insert such covariance checks when writing into arrays, but just like with bounds checking and write barriers, the JIT can elide those checks when it can prove statically that they’re not necessary. Such an example is with sealed types. If the JIT sees an array of type T[] and T is known to be sealed, T[] must exactly be a T[] and not some DerivedT[], because there can’t be a DerivedT. | |
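| As a refresher, here’s that runtime covariance check in action (a standalone snippet of my own, separate from the benchmark below): | |
| using System; | |
| class Base { } | |
| class Derived1 : Base { } | |
| class Derived2 : Base { } | |
| static class CovarianceDemo | |
| { | |
|     static void Main() | |
|     { | |
|         Base[] array = new Derived1[1]; // legal: arrays are covariant | |
|         try | |
|         { | |
|             array[0] = new Derived2();  // compiles, but must fail at runtime | |
|         } | |
|         catch (ArrayTypeMismatchException) | |
|         { | |
|             Console.WriteLine("covariance check caught the bad store"); | |
|         } | |
|     } | |
| } | |
| So with a benchmark like this: | |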
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private List<string> _list = new() { "hello" }; | |
| [Benchmark] | |
| public void Set() => _list[0] = "world"; | |
| } | |
| as long as the JIT can see that the array underlying the List<string> is a string[] (string is sealed), it shouldn’t need a covariance check. In .NET 9, we get this: | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Set() | |
| push rbx | |
| mov rbx,[rdi+8] | |
| cmp dword ptr [rbx+10],0 | |
| je short M00_L00 | |
| mov rdi,[rbx+8] | |
| xor esi,esi | |
| mov rdx,78914920A038 | |
| call System.Runtime.CompilerServices.CastHelpers.StelemRef(System.Object[], IntPtr, System.Object) | |
| inc dword ptr [rbx+14] | |
| pop rbx | |
| ret | |
| M00_L00: | |
| call qword ptr [78D1F80558A8] | |
| int 3 | |
| ; Total bytes of code 44 | |
| Note that CastHelpers.StelemRef call… that’s the helper that performs the write with the covariance check. But now in .NET 10, thanks to dotnet/runtime#107116 (which teaches the JIT how to resolve the exact type for the field of the closed generic), we get this: | |
| Copy | |
| ; .NET 10 | |
| ; Tests.Set() | |
| push rbp | |
| mov rbp,rsp | |
| mov rax,[rdi+8] | |
| cmp dword ptr [rax+10],0 | |
| je short M00_L00 | |
| mov rcx,[rax+8] | |
| mov edx,[rcx+8] | |
| test rdx,rdx | |
| je short M00_L01 | |
| mov rdx,75E2B9009A40 | |
| mov [rcx+10],rdx | |
| inc dword ptr [rax+14] | |
| pop rbp | |
| ret | |
| M00_L00: | |
| call qword ptr [762368116760] | |
| int 3 | |
| M00_L01: | |
| call CORINFO_HELP_RNGCHKFAIL | |
| int 3 | |
| ; Total bytes of code 58 | |
| No covariance check, thank you very much. | |
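| As a refresher on why the check exists at all, here’s a standalone example (type names are mine, mirroring the earlier description) showing a covariant store failing at run-time: | |
| Copy | |
| using System; | |
| class Base { } | |
| class Derived1 : Base { } | |
| class Derived2 : Base { } | |
| class Program | |
| { | |
| static void Main() | |
| { | |
| Base[] array = new Derived1[1]; // legal: arrays are covariant | |
| try | |
| { | |
| array[0] = new Derived2(); // compiles, but the array is really a Derived1[] | |
| } | |
| catch (ArrayTypeMismatchException) | |
| { | |
| Console.WriteLine("covariance check rejected the store"); | |
| } | |
| } | |
| } | |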
| More strength reduction. “Strength reduction” is a classic compiler optimization that replaces more expensive operations, like multiplications, with cheaper ones, like additions. In .NET 9, this was used to transform indexed loops that used multiplied offsets (e.g. index * elementSize) into loops that simply incremented a pointer-like offset (e.g. offset += elementSize), cutting down on arithmetic overhead and improving performance. In .NET 10, strength reduction has been extended, in particular with dotnet/runtime#110222. This enables the JIT to detect multiple loop induction variables with different step sizes and strength-reduce them by leveraging their greatest common divisor (GCD). Essentially, it creates a single primary induction variable based on the GCD of the varying step sizes, and then recovers each original induction variable by appropriately scaling. Consider this example: | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "numbers")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments("128514801826028643102849196099776734920914944609068831724328541639470403818631040")] | |
| public int[] Parse(string numbers) | |
| { | |
| int[] results = new int[numbers.Length]; | |
| for (int i = 0; i < numbers.Length; i++) | |
| { | |
| results[i] = numbers[i] - '0'; | |
| } | |
| return results; | |
| } | |
| } | |
| In this benchmark, we’re iterating through an input string, which is a collection of 2-byte char elements, and we’re storing the results into an array of 4-byte int elements. The core loop in the .NET 9 assembly looks like this: | |
| Copy | |
| ; .NET 9 | |
| M00_L00: | |
| mov edx,ecx | |
| movzx edi,word ptr [rbx+rdx*2+0C] | |
| add edi,0FFFFFFD0 | |
| mov [rax+rdx*4+10],edi | |
| inc ecx | |
| cmp r15d,ecx | |
| jg short M00_L00 | |
| The movzx edi,word ptr [rbx+rdx*2+0C] is the read of numbers[i], and the mov [rax+rdx*4+10],edi is the assignment to results[i]. rdx here is i, so each iteration is effectively computing i*2 for the byte offset of the char at index i, and similarly i*4 for the byte offset of the int at index i. Now here’s the .NET 10 assembly: | |
| Copy | |
| ; .NET 10 | |
| M00_L00: | |
| movzx edx,word ptr [rbx+rcx+0C] | |
| add edx,0FFFFFFD0 | |
| mov [rax+rcx*2+10],edx | |
| add rcx,2 | |
| dec r15d | |
| jne short M00_L00 | |
| The multiplication in the numbers[i] read is gone. Instead, the loop just increments rcx by 2 on each iteration, treating it as the byte offset of the ith char, and then instead of multiplying by 4 to compute the int offset, it multiplies rcx by 2. | |
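| Expressed back in C# (my illustration of the transformation; the JIT actually does this on byte offsets in its own IR, not on source-level indices), the rewrite is roughly: | |
| Copy | |
| using System; | |
| class GcdStrengthReductionSketch | |
| { | |
| static void Main() | |
| { | |
| string numbers = "12345"; | |
| int[] results = new int[numbers.Length]; | |
| // One primary induction variable steps by the GCD of the strides | |
| // (gcd(2, 4) == 2); the original offsets are recovered as | |
| // byteOffset and byteOffset * 2 respectively. | |
| for (int byteOffset = 0; byteOffset < numbers.Length * 2; byteOffset += 2) | |
| { | |
| int charIndex = byteOffset / 2; // stride-2 offset recovered | |
| int intByteOffset = byteOffset * 2; // stride-4 offset recovered | |
| results[intByteOffset / 4] = numbers[charIndex] - '0'; | |
| } | |
| Console.WriteLine(string.Join(",", results)); // 1,2,3,4,5 | |
| } | |
| } | |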
| CSE integration with SSA. As with most compilers, the JIT employs common subexpression elimination (CSE) to find identical computations and avoid doing them repeatedly. dotnet/runtime#106637 teaches the JIT how to do so in a more consistent manner by more fully integrating CSE with its Static Single Assignment (SSA) representation. This in turn allows for more optimizations to kick in, e.g. some of the strength reduction done around loop induction variables in .NET 9 wasn’t applying as much as it should have, and now it will. | |
| return someCondition ? true : false. There are often multiple ways to represent the same computation, and it frequently happens in compilers that certain patterns are recognized during optimization while other equivalent ones aren’t; it can therefore behoove the compiler to first normalize all representations to the best-recognized one. There’s a really common and interesting case of this with return someCondition, where, for reasons relating to the JIT’s internal representation, the JIT is better able to optimize the equivalent return someCondition ? true : false. dotnet/runtime#107499 normalizes to the latter. As an example, consider this benchmark: | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "i")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(42)] | |
| public bool Test1(int i) | |
| { | |
| if (i > 10 && i < 20) return true; | |
| return false; | |
| } | |
| [Benchmark] | |
| [Arguments(42)] | |
| public bool Test2(int i) => i > 10 && i < 20; | |
| } | |
| On .NET 9, that results in this assembly code for Test1: | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Test1(Int32) | |
| sub esi,0B | |
| cmp esi,8 | |
| setbe al | |
| movzx eax,al | |
| ret | |
| ; Total bytes of code 13 | |
| The JIT has successfully recognized that it can change the two comparisons to instead be a subtraction and a single comparison, as if the i > 10 && i < 20 were instead written as (uint)(i - 11) <= 8. But for Test2, .NET 9 produces this: | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Test2(Int32) | |
| xor eax,eax | |
| cmp esi,14 | |
| setl cl | |
| movzx ecx,cl | |
| cmp esi,0A | |
| cmovg eax,ecx | |
| ret | |
| ; Total bytes of code 18 | |
| Because of how the return condition is represented internally by the JIT, it misses this particular optimization, and the assembly code more directly reflects what was written in the C#. But in .NET 10, because of this normalization, we get this for Test2, exactly what we got for Test1: | |
| Copy | |
| ; .NET 10 | |
| ; Tests.Test2(Int32) | |
| sub esi,0B | |
| cmp esi,8 | |
| setbe al | |
| movzx eax,al | |
| ret | |
| ; Total bytes of code 13 | |
| Bit tests. The C# compiler has a lot of flexibility in how it emits switch and is expressions. Consider a case like this: c is ' ' or '\t' or '\r' or '\n'. It could emit that as the equivalent of a series of cascading if/else branches, as an IL switch instruction, as a bit test, or as combinations of those. The C# compiler, though, doesn’t have all of the information the JIT has, such as whether the process is 32-bit or 64-bit, or knowledge of what instructions cost on given hardware. With dotnet/runtime#107831, the JIT will now recognize more such expressions that can be implemented as a bit test and generate the code accordingly. | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| using System.Runtime.CompilerServices; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "c")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments('s')] | |
| public void Test(char c) | |
| { | |
| if (c is ' ' or '\t' or '\r' or '\n' or '.') | |
| { | |
| Handle(c); | |
| } | |
| [MethodImpl(MethodImplOptions.NoInlining)] | |
| static void Handle(char c) { } | |
| } | |
| } | |
| Method Runtime Mean Ratio Code Size | |
| Test .NET 9.0 0.4537 ns 1.02 58 B | |
| Test .NET 10.0 0.1304 ns 0.29 44 B | |
| It’s also common to see bit tests implemented in C# against shifted values; a constant mask is created with bits set at various indices, and then an incoming value to check is tested by shifting a bit to the corresponding index and seeing whether it aligns with one in the mask. For example, here is how Regex tests to see whether a provided UnicodeCategory is one of those that composes the “word” class (`\w`): | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| using System.Globalization; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "uc")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(UnicodeCategory.DashPunctuation)] | |
| public bool Test(UnicodeCategory uc) => (WordCategoriesMask & (1 << (int)uc)) != 0; | |
| private const int WordCategoriesMask = | |
| 1 << (int)UnicodeCategory.UppercaseLetter | | |
| 1 << (int)UnicodeCategory.LowercaseLetter | | |
| 1 << (int)UnicodeCategory.TitlecaseLetter | | |
| 1 << (int)UnicodeCategory.ModifierLetter | | |
| 1 << (int)UnicodeCategory.OtherLetter | | |
| 1 << (int)UnicodeCategory.NonSpacingMark | | |
| 1 << (int)UnicodeCategory.DecimalDigitNumber | | |
| 1 << (int)UnicodeCategory.ConnectorPunctuation; | |
| } | |
| Previously, the JIT would end up emitting that much as it’s written: a shift followed by a test. Now, with dotnet/runtime#111979 from @varelen, it can emit it as a bit test. | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Test(System.Globalization.UnicodeCategory) | |
| mov eax,1 | |
| shlx eax,eax,esi | |
| test eax,4013F | |
| setne al | |
| movzx eax,al | |
| ret | |
| ; Total bytes of code 22 | |
| ; .NET 10 | |
| ; Tests.Test(System.Globalization.UnicodeCategory) | |
| mov eax,4013F | |
| bt eax,esi | |
| setb al | |
| movzx eax,al | |
| ret | |
| ; Total bytes of code 15 | |
| Redundant sign extensions. With dotnet/runtime#111305, the JIT can now remove more redundant sign extensions (when you take a smaller size type, e.g. int, and convert it to a larger size type, e.g. long, while preserving the value’s sign). For example, with a test like this public ulong Test(int x) => (uint)x < 10 ? (ulong)x << 60 : 0, the JIT can now emit a mov (just copy the bits) instead of movsxd (move with sign extension), since it knows from the first comparison that the shift will only ever be performed with a non-negative x. | |
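| Wrapped in the same benchmark harness used throughout this post (the argument value is my choice), that example looks like this: | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "x")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(5)] | |
| public ulong Test(int x) => (uint)x < 10 ? (ulong)x << 60 : 0; | |
| } | |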
| Better division with BMI2. If the BMI2 instruction set is available, with dotnet/runtime#116198 from @Daniel-Svensson the JIT can now use the mulx instruction (“Unsigned Multiply Without Affecting Flags”) to implement integer division, e.g. | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "value")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(12345)] | |
| public ulong Div10(ulong value) => value / 10; | |
| } | |
| results in: | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Div10(UInt64) | |
| mov rdx,0CCCCCCCCCCCCCCCD | |
| mov rax,rsi | |
| mul rdx | |
| mov rax,rdx | |
| shr rax,3 | |
| ret | |
| ; Total bytes of code 24 | |
| ; .NET 10 | |
| ; Tests.Div10(UInt64) | |
| mov rdx,0CCCCCCCCCCCCCCCD | |
| mulx rax,rax,rsi | |
| shr rax,3 | |
| ret | |
| ; Total bytes of code 20 | |
| Better range comparison. When comparing a ulong expression against uint.MaxValue, rather than being emitted as a comparison, with dotnet/runtime#113037 from @shunkino it can be handled more efficiently as a shift: a ulong is at most uint.MaxValue exactly when its upper 32 bits are all zero, so the JIT can shift right by 32 and test the result for zero. | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "x")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(12345)] | |
| public bool Test(ulong x) => x <= uint.MaxValue; | |
| } | |
| resulting in: | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Test(UInt64) | |
| mov eax,0FFFFFFFF | |
| cmp rsi,rax | |
| setbe al | |
| movzx eax,al | |
| ret | |
| ; Total bytes of code 15 | |
| ; .NET 10 | |
| ; Tests.Test(UInt64) | |
| shr rsi,20 | |
| sete al | |
| movzx eax,al | |
| ret | |
| ; Total bytes of code 11 | |
| Better dead branch elimination. The JIT’s branch optimizer is already able to use implications from comparisons to statically determine the outcome of other branches. For example, if I have this: | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "x")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(42)] | |
| public void Test(int x) | |
| { | |
| if (x > 100) | |
| { | |
| if (x > 10) | |
| { | |
| Console.WriteLine(); | |
| } | |
| } | |
| } | |
| } | |
| the JIT generates this on .NET 9: | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Test(Int32) | |
| cmp esi,64 | |
| jg short M00_L00 | |
| ret | |
| M00_L00: | |
| jmp qword ptr [7766D3E64FA8] | |
| ; Total bytes of code 12 | |
| Note there’s only a single comparison against 100 (0x64), with the comparison against 10 elided (as it’s implied by the previous comparison). However, there are many variations to this, and not all of them were handled equally well. Consider this: | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "x")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(42)] | |
| public void Test(int x) | |
| { | |
| if (x < 16) | |
| return; | |
| if (x < 8) | |
| Console.WriteLine(); | |
| } | |
| } | |
| Here, the Console.WriteLine ideally wouldn’t appear in the emitted assembly at all, as it’s never reachable. Alas, on .NET 9, we get this (the jmp instruction here is a tail call to WriteLine): | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Test(Int32) | |
| push rbp | |
| mov rbp,rsp | |
| cmp esi,10 | |
| jl short M00_L00 | |
| cmp esi,8 | |
| jge short M00_L00 | |
| pop rbp | |
| jmp qword ptr [731ED8054FA8] | |
| M00_L00: | |
| pop rbp | |
| ret | |
| ; Total bytes of code 23 | |
| With dotnet/runtime#111766 on .NET 10, it successfully recognizes that by the time it gets to the x < 8, that condition will always be false, and it can be eliminated. And once it’s eliminated, the initial branch is also unnecessary. So the whole method reduces to this: | |
| Copy | |
| ; .NET 10 | |
| ; Tests.Test(Int32) | |
| ret | |
| ; Total bytes of code 1 | |
| Better floating-point conversion. dotnet/runtime#114410 from @saucecontrol, dotnet/runtime#114597 from @saucecontrol, and dotnet/runtime#111595 from @saucecontrol all speed up conversions between floating-point and integer values, such as by using vcvtusi2ss when AVX512 is available or, when it isn’t, by avoiding the intermediate double conversion. | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD", "i")] | |
| public partial class Tests | |
| { | |
| [Benchmark] | |
| [Arguments(42)] | |
| public float Compute(uint i) => i; | |
| } | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Compute(UInt32) | |
| mov eax,esi | |
| vxorps xmm0,xmm0,xmm0 | |
| vcvtsi2sd xmm0,xmm0,rax | |
| vcvtsd2ss xmm0,xmm0,xmm0 | |
| ret | |
| ; Total bytes of code 16 | |
| ; .NET 10 | |
| ; Tests.Compute(UInt32) | |
| vxorps xmm0,xmm0,xmm0 | |
| vcvtusi2ss xmm0,xmm0,esi | |
| ret | |
| ; Total bytes of code 11 | |
| Unrolling. When using CopyTo (or other “memmove”-based operations) with a constant source, dotnet/runtime#108576 reduces costs by avoiding a redundant memory load. dotnet/runtime#109036 unblocks more unrolling on Arm64 for Equals/StartsWith/EndsWith. And dotnet/runtime#110893 enables unrolling non-zero fills (unrolling already happened for zero fills). | |
| Copy | |
| // dotnet run -c Release -f net9.0 --filter "*" --runtimes net9.0 net10.0 | |
| using BenchmarkDotNet.Attributes; | |
| using BenchmarkDotNet.Running; | |
| BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args); | |
| [DisassemblyDiagnoser] | |
| [HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")] | |
| public partial class Tests | |
| { | |
| private char[] _chars = new char[100]; | |
| [Benchmark] | |
| public void Fill() => _chars.AsSpan(0, 16).Fill('x'); | |
| } | |
| Copy | |
| ; .NET 9 | |
| ; Tests.Fill() | |
| push rbp | |
| mov rbp,rsp | |
| mov rdi,[rdi+8] | |
| test rdi,rdi | |
| je short M00_L00 | |
| cmp dword ptr [rdi+8],10 | |
| jb short M00_L00 | |
| add rdi,10 | |
| mov esi,10 | |
| mov edx,78 | |
| call qword ptr [7F3093FBF1F8]; System.SpanHelpers.Fill[[System.Char, System.Private.CoreLib]](Char ByRef, UIntPtr, Char) | |
| nop | |
| pop rbp | |
| ret | |
| M00_L00: | |
| call qword ptr [7F3093787810] | |
| int 3 | |
| ; Total bytes of code 49 | |
| ; .NET 10 | |
| ; Tests.Fill() | |
| push rbp | |
| mov rbp,rsp | |
| mov rax,[rdi+8] | |
| test rax,rax | |
| je short M00_L00 | |
| cmp dword ptr [rax+8],10 | |
| jl short M00_L00 | |
| add rax,10 | |
| vbroadcastss ymm0,dword ptr [78EFC70C9340] | |
| vmovups [rax],ymm0 | |
| vzeroupper | |
| pop rbp | |
| ret | |
| M00_L00: | |
| call qword ptr [78EFC7447B88] | |
| int 3 | |
| ; Total bytes of code 48 | |
| Note the call to SpanHelpers.Fill in the .NET 9 assembly and the absence of it in the .NET 10 assembly. | |
| Native AOT | |
| Native AOT is the ability for a .NET application to be compiled directly to assembly code at build time. The JIT is still used for code generation, but only at build time; the JIT isn’t part of the shipping app at all, and no code generation is performed at run-time. As such, most of the optimizations to the JIT already discussed, as well as optimizations throughout the rest of this post, apply equally to Native AOT. Native AOT presents some unique opportunities and challenges, however. | |
| One super power of the Native AOT tool chain is the ability to interpret (some) code at build-time and use the results of that execution rather than performing the operation at run-time. This is particularly relevant for static constructors, where the constructor’s code can be interpreted to initialize various static readonly fields, and then the contents of those fields can be persisted into the generated assembly; at run-time, the contents need only be rehydrated from the assembly rather than being recomputed. This also potentially helps to make more code redundant and removable, if for example the static constructor and anything it (and only it) referenced were no longer needed. Of course, it would be dangerous and problematic if any arbitrary code could be run during build, so instead there’s a very filtered allow list and specialized support for the most common and appropriate constructs. dotnet/runtime#107575 augments this “preinitialization” capability to support spans sourced from arrays, such that using methods like .AsSpan() doesn’t cause preinitialization to bail out. dotnet/runtime#114374 also improved preinitialization, removing restrictions around accessing static fields of other types, calling methods on other types that have their own static constructors, and dereferencing pointers. | |
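| For a sense of what that enables, here’s a minimal sketch (my example; whether a given constructor is actually preinitialized depends on the toolchain’s allow list) of the shape of code this targets: | |
| Copy | |
| using System; | |
| public static class Tables | |
| { | |
| // With preinitialization, this computation can run inside the compiler, | |
| // the resulting 1KB of bytes baked into the binary, and CreateCrcTable | |
| // itself potentially removed from the shipped app. | |
| private static readonly uint[] s_crcTable = CreateCrcTable(); | |
| // Spans sourced from arrays no longer cause preinitialization to bail out. | |
| public static ReadOnlySpan<uint> CrcTable => s_crcTable.AsSpan(); | |
| private static uint[] CreateCrcTable() | |
| { | |
| var table = new uint[256]; | |
| for (uint i = 0; i < 256; i++) | |
| { | |
| uint value = i; | |
| for (int bit = 0; bit < 8; bit++) | |
| { | |
| value = (value & 1) != 0 ? (value >> 1) ^ 0xEDB88320 : value >> 1; | |
| } | |
| table[i] = value; | |
| } | |
| return table; | |
| } | |
| } | |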
| Conversely, Native AOT has its own challenges, specifically that size really matters and is harder to control. With a JIT available at run-time, code generation for only exactly what’s needed can be deferred until run-time. With Native AOT, all assembly code generation needs to be done at build-time, which means the Native AOT tool chain needs to work hard to determine the least amount of code it needs to emit to support everything the app might need to do at run-time. Most of the effort on Native AOT in any given release ends up being about helping it to further decrease the size of generated code. For example: | |
| dotnet/runtime#117411 enables folding bodies of generic instantiations of the same method, essentially avoiding duplication by using the same code for copies of the same method where possible. | |
| dotnet/runtime#117080 similarly helps improve the existing method body deduplication logic. | |
| dotnet/runtime#117345 from @huoyaoyuan tweaks a bit of code in reflection that would previously artificially force the code to be preserved for all enumerators of all generic instantiations of every collection type. | |
| dotnet/runtime#112782 adds the same distinction that already existed for MethodTables for non-generic methods (“is this method table visible to user code or not”) to generic methods, allowing more metadata for the non-user visible ones to be optimized away. | |
| dotnet/runtime#118718 and dotnet/runtime#118832 enable size reductions related to boxed enums. The former tweaks a few methods in Thread, GC, and CultureInfo to avoid boxing some enums, which means the boxing code for those enums needn’t be generated. The latter tweaks the implementation of RuntimeHelpers.CreateSpan, which is used by the C# compiler as part of creating spans with constructs like collection expressions. CreateSpan is a generic method, and the Native AOT toolchain’s whole-program analysis would end up treating the generic type parameter as being “reflected on,” meaning the compiler had to assume any type parameter might be used with reflection and thus had to preserve the relevant metadata. When used with enums, it needed to keep support for boxed enums around, and System.Console has such a use with an enum. That in turn meant that a simple “hello, world” console app couldn’t trim away that boxed-enum reflection support; now it can. | |
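| For context on where CreateSpan shows up, here’s a hedged example of my own: for multi-byte element types, the C# compiler typically lowers a constant collection expression like the one below into a RuntimeHelpers.CreateSpan call over a data blob embedded in the binary (byte-sized elements can skip CreateSpan entirely). | |
| Copy | |
| using System; | |
| class Program | |
| { | |
| static void Main() | |
| { | |
| // Typically lowered to RuntimeHelpers.CreateSpan<int>(...) over an RVA | |
| // field, avoiding an array allocation; the same lowering with enum | |
| // element types is what dragged in the boxed-enum metadata. | |
| ReadOnlySpan<int> primes = [2, 3, 5, 7, 11]; | |
| Console.WriteLine(primes.Length); | |
| } | |
| } | |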
| ``` |