Oberon investigation on Mono 6.12.0 vs CoreCLR 9.0.100-preview.7.24375.5

dotnet.md

Starting DeltaBlue benchmark ... DeltaBlue: iterations=640000 average: 3us total: 2403826us

Starting Richards benchmark ... Richards: iterations=2000 average: 1388us total: 2777930us

Starting Json benchmark ... Json: iterations=8000 average: 435us total: 3480222us

Starting Havlak benchmark ... Havlak: iterations=50 average: 101792us total: 5089620us

Starting CD2 benchmark ... CD2: iterations=2500 average: 244us total: 611784us

Starting Bounce benchmark ... Bounce: iterations=30000 average: 18us total: 541817us

Starting List benchmark ... List: iterations=30000 average: 19us total: 576746us

Starting Mandelbrot benchmark ... Mandelbrot: iterations=50000 average: 0us total: 8219us

Starting NBody benchmark ... NBody: iterations=4000000 average: 0us total: 951317us

Starting Permute benchmark ... Permute: iterations=40000 average: 25us total: 1011096us

Starting Queens benchmark ... Queens: iterations=40000 average: 17us total: 694672us

Starting Sieve benchmark ... Sieve: iterations=60000 average: 10us total: 622302us

Starting Storage benchmark ... Storage: iterations=20000 average: 59us total: 1182548us

Starting Towers benchmark ... Towers: iterations=10000 average: 77us total: 778772us

End Tester

mono.md

Starting DeltaBlue benchmark ... DeltaBlue: iterations=640000 average: 6us total: 4081605us

Starting Richards benchmark ... Richards: iterations=2000 average: 1440us total: 2880531us

Starting Json benchmark ... Json: iterations=8000 average: 1677us total: 13421115us

Starting Havlak benchmark ... Havlak: iterations=50 average: 203535us total: 10176782us

Starting CD2 benchmark ... ERROR Starting Bounce benchmark ... Bounce: iterations=30000 average: 25us total: 758186us

Starting List benchmark ... List: iterations=30000 average: 47us total: 1432274us

Starting Mandelbrot benchmark ... Mandelbrot: iterations=50000 average: 0us total: 19819us

Starting NBody benchmark ... NBody: iterations=4000000 average: 0us total: 3076974us

Starting Permute benchmark ... Permute: iterations=40000 average: 41us total: 1650252us

Starting Queens benchmark ... Queens: iterations=40000 average: 33us total: 1342039us

Starting Sieve benchmark ... Sieve: iterations=60000 average: 21us total: 1281283us

Starting Storage benchmark ... Storage: iterations=20000 average: 93us total: 1866154us

Starting Towers benchmark ... Towers: iterations=10000 average: 88us total: 881495us

End Tester

oberon.md

Configuration:

Mono JIT compiler version 6.12.0 ((no/0cbf0e290c3 Wed Apr 17 03:40:45 UTC 2024)

.NET SDK 9.0.100-preview.7.24375.5

Benchmark

 H.run("DeltaBlue", 640000, 1)
 H.run("Richards", 2000, 1)
 H.run("Json", 8000, 1)
 H.run("Havlak", 50, 1)
 H.run("CD2", 2500, 2)
 H.run("Bounce", 30000, 1)
 H.run("List", 30000, 1)
 H.run("Mandelbrot", 50000, 1)
 H.run("NBody", 4000000, 1)
 H.run("Permute", 40000, 1)
 H.run("Queens", 40000, 1)
 H.run("Sieve", 60000, 1)
 H.run("Storage", 20000, 1)
 H.run("Towers", 10000, 1)

Remarks

The code at https://github.com/rochus-keller/Are-we-fast-yet/tree/main/Oberon, as compiled by https://github.com/rochus-keller/Oberon, results in compiler-unfriendly IL that disregards .NET capabilities. The Oberon runtime also provides its own minimal set of common functions such as strcpy. The issues include, but are not limited to:

- A common object is introduced for all Oberon objects, creating an unnecessary inheritance chain. By joining unrelated objects, it is particularly unfriendly to ILC's whole-program-view analysis (this does not impact the benchmarks here, since we use the JIT).
- Runtime-provided functions are of poor quality and do not use .NET's CoreLib, which is available under both Mono and .NET. Such disregard for the standard type system and API seems like a typical C mindset: respecting the C stdlib but not the standard library of an alternate runtime. I am not surprised.
- Where the C variant emits structs, performs manual memory management, and has no bounds checks, the CIL back-end emits classes and objects and does plain array access, which is bounds-checked. The correct action is to match the emitted C code; being able to do so is one of the strengths of .NET as a platform. Why even use it if you are not taking advantage of being able to do truly portable C?
- The emitted IL is compiler-unfriendly and of poor quality. It defeats bounds-check elision in a few places, occasionally misses attributes necessary for guarded devirtualization to kick in, etc.
- Originally, the benchmarks ran for so little time that even a sufficiently fast pure interpreter runtime would have had an advantage. Increasing the execution time not only stabilizes the result numbers, it also lets CoreCLR's JIT compiler reach Tier 0 PGO-Instrumented and then Tier 1 PGO-Optimized method compilations, which are the main performance workhorses that let it exchange blows with the likes of LLVM.

In general, this is more or less what I expected to see: a lack of respect for the target platform and no desire to extract optimal performance from it. I suppose this works for debugging purposes. But putting it in any sort of performance measurement suite is like intentionally starting with square wheels and then complaining about "technical issues". In any case, here are the numbers:

Results

Environment:

OS: macOS 15.0 24A5298h arm64, CPU: Apple M1 Pro

Geometric Mean (Mono): 1612026.37 microseconds
Geometric Mean (.NET): 847482.65 microseconds
Relative Performance (Mono/.NET): 190.21%
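
For anyone who wants to re-derive the summary figures, here is a minimal Python sketch. It assumes the geometric mean is taken over the per-benchmark `total` times from the raw output, and that CD2 is excluded on both sides because the Mono run reports ERROR for it; neither assumption is stated explicitly in the write-up. The names `geometric_mean` and `summarize` are illustrative and not part of the benchmark harness.

```python
import math

# Per-benchmark "total" times in microseconds, copied from the raw output.
DOTNET_TOTALS_US = {
    "DeltaBlue": 2403826, "Richards": 2777930, "Json": 3480222,
    "Havlak": 5089620, "CD2": 611784, "Bounce": 541817, "List": 576746,
    "Mandelbrot": 8219, "NBody": 951317, "Permute": 1011096,
    "Queens": 694672, "Sieve": 622302, "Storage": 1182548, "Towers": 778772,
}

# CD2 has no entry here: the Mono run printed ERROR instead of a timing.
MONO_TOTALS_US = {
    "DeltaBlue": 4081605, "Richards": 2880531, "Json": 13421115,
    "Havlak": 10176782, "Bounce": 758186, "List": 1432274,
    "Mandelbrot": 19819, "NBody": 3076974, "Permute": 1650252,
    "Queens": 1342039, "Sieve": 1281283, "Storage": 1866154, "Towers": 881495,
}


def geometric_mean(values):
    """Geometric mean via logarithms, so the raw product never overflows."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(mono, dotnet):
    """Return (Mono geomean, .NET geomean, Mono/.NET in percent).

    Assumption: only benchmarks that completed on both runtimes are compared.
    """
    common = sorted(set(mono) & set(dotnet))
    gm_mono = geometric_mean(mono[name] for name in common)
    gm_dotnet = geometric_mean(dotnet[name] for name in common)
    return gm_mono, gm_dotnet, 100.0 * gm_mono / gm_dotnet


if __name__ == "__main__":
    gm_mono, gm_dotnet, relative = summarize(MONO_TOTALS_US, DOTNET_TOTALS_US)
    print(f"Geometric Mean (Mono): {gm_mono:.2f} microseconds")
    print(f"Geometric Mean (.NET): {gm_dotnet:.2f} microseconds")
    print(f"Relative Performance (Mono/.NET): {relative:.2f}%")
```

Under those assumptions the output should land close to the figures above. Restricting the comparison to benchmarks present in both runs keeps the CD2 failure from skewing either mean.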

neon-sunset/dotnet.md
