Dynamic PGO (profile-guided optimization) is a JIT-compiler optimization technique: the JIT collects additional information about the running code (the profile) in tier0 codegen and relies on it later, when hot methods are promoted from tier0 to tier1, to make them even more efficient.
- Profile-driven inlining - the inliner relies on PGO data, so it can be very aggressive on hot paths and care less about cold ones, see dotnet/runtime#52708 and dotnet/runtime#55478. A good example where it has visible effects is the StringBuilder benchmark (ProfileDrivingInlining) further down.
- Guarded devirtualization - most monomorphic virtual/interface calls can be devirtualized using PGO data, e.g.:
void DisposeMe(IDisposable d)
{
    d.Dispose();
}
It looks like nothing can be optimized here, right? It is just an ordinary virtual (interface) call on an unknown object: it goes through several indirections to reach the actual Dispose() implementation, and that implementation's body will never be inlined here. Now let's see what PGO can do.
With Dynamic PGO on, this method will be compiled to something like this in tier0 (in machine code):
void DisposeMe(IDisposable d)
{
+   call CORINFO_HELP_CLASSPROFILE32(d, offset);
    d.Dispose();
}
We now poll d for its underlying type on every call of that method. Yes, that makes it slightly slower, but eventually the method will be recompiled to tier1 as something like this:
void DisposeMe(IDisposable d)
{
+   if (d is MyType)           // E.g. Profile states that Dispose here is 'mostly' called on MyType.
+       ((MyType)d).Dispose(); // Direct call - can be inlined now!
+   else
        d.Dispose();           // a cold fallback, just in case
}
[Image: codegen diff for a case where MyType::Dispose is empty]
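To make the two stages easier to picture, here is a minimal hand-written C# sketch of the same idea. It is only an illustration, not what the JIT actually emits: the type histogram, `OtherType`, and the `ClassProfileSketch` helper are hypothetical names invented for this example, and `MyType` stands in for the type of the same name in the pseudocode above.

```csharp
using System;
using System.Collections.Generic;

var mine = new MyType();
for (int i = 0; i < 9; i++)
    ClassProfileSketch.DisposeMeInstrumented(mine);          // the common case
ClassProfileSketch.DisposeMeInstrumented(new OtherType());  // a rare case

Console.WriteLine($"Dominant type: {ClassProfileSketch.DominantType().Name}");
ClassProfileSketch.DisposeMeGuarded(mine);                   // takes the guarded fast path

// Hypothetical types used only for this illustration.
public sealed class MyType : IDisposable
{
    public int DisposeCount;
    public void Dispose() => DisposeCount++;
}

public sealed class OtherType : IDisposable
{
    public void Dispose() { }
}

public static class ClassProfileSketch
{
    // Stand-in for the class profile: how often each concrete type was observed.
    private static readonly Dictionary<Type, int> s_histogram = new();

    // "Tier0" flavour: record the concrete type, then make the interface call.
    public static void DisposeMeInstrumented(IDisposable d)
    {
        Type type = d.GetType();
        s_histogram.TryGetValue(type, out int count);
        s_histogram[type] = count + 1;
        d.Dispose();
    }

    // Returns the most frequently observed type; throws if nothing was recorded.
    public static Type DominantType()
    {
        if (s_histogram.Count == 0)
            throw new InvalidOperationException("No profile data collected yet.");

        KeyValuePair<Type, int> best = default;
        foreach (KeyValuePair<Type, int> entry in s_histogram)
        {
            if (entry.Value > best.Value)
                best = entry;
        }
        return best.Key;
    }

    // "Tier1" flavour: guard on the dominant type, keep the interface call as a fallback.
    public static void DisposeMeGuarded(IDisposable d)
    {
        if (d is MyType known)
            known.Dispose(); // direct call on a sealed type, a candidate for inlining
        else
            d.Dispose();     // cold fallback through the interface
    }
}
```

The guard keeps the method correct for every input: an object of any other type still reaches its own Dispose() through the fallback branch.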
- Hot-cold block reordering - JIT re-orders blocks to keep hot ones closer to each other and pushes cold ones to the end of the method. The following code:
void DoWork(int a)
{
    if (a > 0)
        DoWork1();
    else
        DoWork2();
}
is compiled like this in tier0:
void DoWork(int a)
{
    if (a > 0)
    {
+       __block_0_counter++;
        DoWork1();
    }
    else
    {
+       __block_1_counter++;
        DoWork2();
    }
}
And again: once it's recompiled to tier1 it is optimized into:
void DoWork(int a)
{
    // E.g. __block_0_counter is smaller or even zero => DoWork1 is rarely (never) taken
    // and JIT re-orders DoWork2 with DoWork1:
-   if (a > 0)
+   if (a <= 0)
-       DoWork1();
+       DoWork2();
    else
-       DoWork2();
+       DoWork1();
}
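The counter-based decision can be mimicked in plain C#. This is a rough sketch under simplifying assumptions: the real JIT works with block weights and machine-code layout, whereas here the counters are ordinary static fields and `BlockCounters` is a hypothetical helper invented for the example.

```csharp
using System;

// Simulated workload: the 'a > 0' path is rarely taken.
int[] inputs = { -3, 0, -7, 5, -1 };
foreach (int a in inputs)
    BlockCounters.DoWorkInstrumented(a);

Console.WriteLine(BlockCounters.ElsePathIsHotter()
    ? "Lay out DoWork2 first and push DoWork1 to the cold end."
    : "Keep DoWork1 first.");

// Hypothetical stand-ins for the per-block counters in the tier0 pseudocode above.
public static class BlockCounters
{
    public static long Block0Counter; // incremented on the 'a > 0' path (DoWork1)
    public static long Block1Counter; // incremented on the else path (DoWork2)

    // Mirrors the instrumented tier0 shape: bump the counter of the block that runs.
    public static void DoWorkInstrumented(int a)
    {
        if (a > 0)
            Block0Counter++;
        else
            Block1Counter++;
    }

    // True when the else block was executed more often than the 'a > 0' block.
    public static bool ElsePathIsHotter() => Block1Counter > Block0Counter;
}
```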
- Register allocation - realistic block weights allow JIT to pick a better strategy on what to keep in registers and what to spill
- Misc - some optimizations such as Loop Cloning, Inlined Casts, etc. aren't applied in cold blocks.
using System.Collections.Generic;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

// Run the benchmarks
BenchmarkRunner.Run<PgoBenchmarks>();

[Config(typeof(MyEnvVars))]
public class PgoBenchmarks
{
    // Custom config to define "Default vs PGO"
    class MyEnvVars : ManualConfig
    {
        public MyEnvVars()
        {
            // Use .NET 6.0 default mode:
            AddJob(Job.Default.WithId("Default mode"));

            // Use Dynamic PGO mode:
            AddJob(Job.Default.WithId("Dynamic PGO")
                .WithEnvironmentVariables(
                    new EnvironmentVariable("DOTNET_TieredPGO", "1"),
                    new EnvironmentVariable("DOTNET_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("DOTNET_ReadyToRun", "0")));
        }
    }

    //
    // Benchmark 1: Devirtualize unknown virtual calls:
    //
    public IEnumerable<object> TestData()
    {
        // Test data for 'GuardedDevirtualization(ICollection<int>)'
        yield return new List<int>();
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public void GuardedDevirtualization(ICollection<int> collection)
    {
        // a chain of unknown virtual calls...
        collection.Clear();
        collection.Add(1);
        collection.Add(2);
        collection.Add(3);
    }

    //
    // Benchmark 2: Allow inliner to be way more aggressive than usual
    //              for profiled call-sites:
    //
    [Benchmark]
    public StringBuilder ProfileDrivingInlining()
    {
        StringBuilder sb = new();
        for (int i = 0; i < 1000; i++)
            sb.Append("hi"); // see https://twitter.com/EgorBo/status/1451149444183990273
        return sb;
    }

    //
    // Benchmark 3: Reorder hot-cold blocks for better performance
    //
    [Benchmark]
    [Arguments(42)]
    public string HotColdBlockReordering(int a)
    {
        if (a == 1)
            return "a is 1";
        if (a == 2)
            return "a is 2";
        if (a == 3)
            return "a is 3";
        if (a == 4)
            return "a is 4";
        if (a == 5)
            return "a is 5";
        return "a is too big"; // this branch is always taken in this benchmark (a is 42)
    }
}
Method | Job | Mean | Error | StdDev |
---|---|---|---|---|
GuardedDevirtualization | Default mode | 5.7448 ns | 0.0020 ns | 0.0017 ns |
GuardedDevirtualization | Dynamic PGO | 3.2651 ns | 0.0233 ns | 0.0182 ns |
ProfileDrivingInlining | Default mode | 3,538.2980 ns | 26.7256 ns | 23.6915 ns |
ProfileDrivingInlining | Dynamic PGO | 2,167.8397 ns | 5.0619 ns | 4.2269 ns |
HotColdBlockReordering | Default mode | 1.5244 ns | 0.0029 ns | 0.0025 ns |
HotColdBlockReordering | Dynamic PGO | 0.0181 ns | 0.0051 ns | 0.0040 ns |
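To read the table as speedups, divide each Default-mode mean by the corresponding Dynamic PGO mean. The small sketch below does that arithmetic with the values copied from the table; `Speedups.Ratio` is a hypothetical helper written for this example. The first two ratios come out at roughly 1.76x and 1.63x. The third is around 84x, but it should be read with care, since a mean of 0.0181 ns with an error of 0.0051 ns is close to the noise floor of the measurement.

```csharp
using System;

// Mean values (ns) copied from the table above: (method, Default mode, Dynamic PGO).
var rows = new (string Method, double DefaultNs, double PgoNs)[]
{
    ("GuardedDevirtualization", 5.7448, 3.2651),
    ("ProfileDrivingInlining", 3538.2980, 2167.8397),
    ("HotColdBlockReordering", 1.5244, 0.0181),
};

foreach (var row in rows)
    Console.WriteLine($"{row.Method}: {Speedups.Ratio(row.DefaultNs, row.PgoNs):F2}x");

public static class Speedups
{
    // Ratio of the Default-mode mean to the Dynamic PGO mean (both in the same unit).
    public static double Ratio(double defaultMeanNs, double pgoMeanNs) =>
        defaultMeanNs / pgoMeanNs;
}
```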
You only need to make sure the following environment variables are set in the environment of the process that runs your program:
# Enable Dynamic PGO
export DOTNET_TieredPGO=1
# AOT images aren't instrumented so we need to disable them and collect
# relevant PGO data for literally everything. It affects startup time badly,
# but leads to higher performance after warm up.
export DOTNET_ReadyToRun=0
# For .NET 7.0 we will hopefully enable full-fledged OSR. For now, methods with loops
# always bypass tier0, but we need them in tier0 so that they are instrumented for PGO.
export DOTNET_TC_QuickJitForLoops=1
^ Linux/macOS; for Windows PowerShell:
$env:DOTNET_TieredPGO=1
$env:DOTNET_ReadyToRun=0
$env:DOTNET_TC_QuickJitForLoops=1
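If you are unsure whether the variables actually reached your process, a quick sanity check at startup can print what the process sees. This is only a sketch: `PgoEnvCheck` is a hypothetical helper, and it reports the environment as seen from managed code, not whether the runtime acted on the values.

```csharp
using System;

string[] names = { "DOTNET_TieredPGO", "DOTNET_ReadyToRun", "DOTNET_TC_QuickJitForLoops" };
foreach (string name in names)
    Console.WriteLine(PgoEnvCheck.Describe(name));

public static class PgoEnvCheck
{
    // Reports the value of an environment variable as seen by the current process.
    public static string Describe(string name)
    {
        var value = Environment.GetEnvironmentVariable(name);
        return value is null ? $"{name} is not set" : $"{name}={value}";
    }
}
```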
Please tag me (EgorBo) on Twitter and I'll forward it to the team.
- @WhiteBlackGoose: asc-community/AngouriMath ("New open-source cross-platform symbolic algebra library for C# · F# · Jupyter · C++ (WIP)") - Dynamic PGO benefits: https://gist.github.com/WhiteBlackGoose/dd2fcd088b3d45e117d1a47ded02f686
- aspnet/TechEmpower benchmarks on our hardware: https://aka.ms/aspnet/benchmarks (navigate to the 17th page, which is "PGO")
Hi. Great post and very interesting. Dynamic PGO is a feature I have been looking forward to.
Are there cases where we can expect performance regressions? I have a small loop that does seem to do worse with Dynamic PGO.