Let's say you have implemented a small data pipeline library to replace LINQ.
module TrivialStream =
  type Receiver<'T> = 'T -> unit
  type Stream<'T>   = Receiver<'T> -> unit

  module Details =
    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> =
    fun r -> Loop.range s e r b

  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun r -> s (fun v -> if f v then (r v))

  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun r -> s (fun v -> (r (m v)))

  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
Then you compare the overhead of your pipeline against LINQ:
let trivialTest n =
  TrivialStream.range 0 1 n
  |> TrivialStream.map    (fun v -> int64 v)
  |> TrivialStream.filter (fun v -> v &&& 1L = 0L)
  |> TrivialStream.map    (fun v -> v + 1L)
  |> TrivialStream.sum
open System.Linq

let linqTest n =
  Enumerable
    .Range(0, n + 1)
    .Select(fun v -> int64 v)
    .Where(fun v -> v &&& 1L = 0L)
    .Select(fun v -> v + 1L)
    .Sum()
You are very satisfied to see that your trivial push data pipeline has significantly lower overhead than LINQ.
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 375 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Almost 5x lower overhead!
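The cc0/cc1/cc2 columns above are GC collection counts for generations 0-2. A hedged sketch of the kind of measurement loop that produces this output (my reconstruction, not the author's actual harness; the name time is my own):

```fsharp
open System
open System.Diagnostics

// Runs `action` `outer` times, returning elapsed milliseconds,
// GC collection count deltas for generations 0..2, and the last result.
let time outer (action : unit -> 'T) =
  GC.Collect (2, GCCollectionMode.Forced, true)
  let b0, b1, b2 = GC.CollectionCount 0, GC.CollectionCount 1, GC.CollectionCount 2
  let sw = Stopwatch.StartNew ()
  let mutable result = Unchecked.defaultof<'T>
  for _ in 1..outer do
    result <- action ()
  sw.Stop ()
  let cc0, cc1, cc2 = GC.CollectionCount 0 - b0, GC.CollectionCount 1 - b1, GC.CollectionCount 2 - b2
  sw.ElapsedMilliseconds, cc0, cc1, cc2, result
```

Something like time 10 (fun () -> trivialTest 10000000) would then yield the ms, cc and result numbers shown above.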
Unfortunately you run into Paul Westcott, who asks what happens if you run the benchmark in x86 mode rather than x64 mode.
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1773 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 3116 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Now the tables are turned; your beautiful little library is actually slower than LINQ! Paul then says: "Just apply some magic oil!" and suggests you change your code to:
module TrivialStream =
  type Receiver<'T> = 'T -> unit
  type Stream<'T>   = Receiver<'T> -> unit

  module Details =
    let inline paulWestcottsMagicOil u = match u with () -> ()

    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> =
    fun r -> Loop.range s e r b

  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun r -> s (fun v -> if f v then paulWestcottsMagicOil (r v))

  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun r -> s (fun v -> paulWestcottsMagicOil (r (m v)))

  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss
The function paulWestcottsMagicOil does nothing, but to your disbelief it actually helps!
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1770 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 561 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Performance in x64 mode isn't as good as before but still respectable:
Running LINQ with total=100000000, outer=10, inner=10000000 ...
... 1803 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
Running trivialpush with total=100000000, outer=10, inner=10000000 ...
... 483 ms, cc0=0, cc1=0, cc2=0, result=25000010000001L
It's great that we have a potential solution, but some question marks remain:
- What does this have to do with tail calls in .NET (as implied by the title of the blog)?
- Why the significant drop in performance between x64 and x86 modes?
- paulWestcottsMagicOil does nothing yet it helps, why?
- What is this tail call optimization (TCO) I sometimes hear about?
A tail call is a function call that is the final operation in a function. When the F# compiler detects that a call is the final operation in a function, it decorates the call site with a .tail attribute.
In fact, in our data pipeline the calls to the next step in the pipeline are tail calls.

let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
  fun r -> s (fun v -> if f v then (r v))  // <-- r v is a tail call

let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
  fun r -> s (fun v -> (r (m v)))          // <-- r (m v) is a tail call
If we looked at the result of a function call, that call wouldn't be a tail call, because it would no longer be the final operation.
let rec x n = if n > 0 then n + y (n - 1) else 0 // y (n - 1) is not a tail call
and y n = if n > 0 then -n + x (n - 1) else 0 // x (n - 1) is not a tail call
let rec v s n = if n > 0 then w (s + n) (n - 1) else s // w (s + n) (n - 1) is a tail call
and w s n = if n > 0 then v (s - n) (n - 1) else s // v (s - n) (n - 1) is a tail call
These functions don't do anything really useful except demonstrate two different ways to implement the same functionality: one using tail calls (v & w) and one without tail calls (x & y). The F# compiler should then emit a .tail attribute for v & w but not for x & y.
To see this we need to peek at the generated IL assembler.
Here is an excerpt of the IL assembler for x:
; Calls y
IL_0008: call int32 Program/Samples::y(int32)
; Increments the result
IL_000d: add
; Returns from the function
IL_000e: ret
x calls y. When y returns we accumulate the result and return. Because the call isn't the final operation it is not a tail call, as expected.
It looks different for v:
; The following call is a tail call
IL_000a: tail.
; Calls w
IL_000c: call int32 Program/Samples::w(int32, int32)
; Returns from the function
IL_0011: ret
v calls w, but as this is the final operation the call site is decorated with .tail.
I am sure you have run into your fair share of StackOverflowException. For developers that aren't aware of the runtime ABI the name StackOverflowException must seem cryptic, but we quickly learn that it typically means we have a recursive function that keeps calling itself over and over until the stack overflows.
The F# and C# compilers translate our source into a stream of op codes that are executed by the .NET virtual machine.
When the virtual machine hits the op code ret, it means execution should continue at the code site that called this function. The virtual machine needs information about where to continue executing, and in .NET that information is stored on the stack.

What happens is that when the virtual machine executes a call, it stores the position of the op code following the call on the stack. Later, when ret is executed, it reads that value from the stack and the virtual machine continues executing at the instruction following the call.

StackOverflowException happens when we run out of space to store the return positions.

In reality more information is stored on the stack than just the return position, but let's ignore that for now.
Tail calls allow us to not run out of stack space when implementing a recursive algorithm.
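To make this concrete, here is a small sketch of my own (not from the original text) contrasting a non-tail-recursive sum with an accumulator version whose recursive call is in tail position:

```fsharp
// sumNaive must keep a stack frame per call because the addition
// happens after the recursive call returns - not a tail call.
let rec sumNaive n =
  if n = 0 then 0L else int64 n + sumNaive (n - 1)

// sumAcc carries the partial result in an accumulator, so the
// recursive call is the final operation - a tail call.
let sumAcc n =
  let rec go acc n =
    if n = 0 then acc else go (acc + int64 n) (n - 1)
  go 0L n
```

With a .tail attribute (or after TCO) sumAcc runs in constant stack space, while sumNaive keeps one frame per call and eventually throws StackOverflowException for a large enough n.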
If we have a call that looks like this:
IL_000c: call int32 Program/Samples::w(int32, int32)
IL_0011: ret
Since the call is the final operation it is a tail call. It seems unnecessary to return to IL_0011 just to return to the calling function. What if we did this?
IL_000c: jmp int32 Program/Samples::w(int32, int32)
Then we jump to w but we don't store the return address to us. When w returns, it returns directly to the code that called us. We save space and we save time.
This is roughly what the jitter tries to do when it sees a tail call.
Let's test our function above with this theory:
// Remember x is not tail recursive
printfn "%A" <| x 10000 // Prints 5000
printfn "%A" <| x 100000 // Crashes on my machine with StackOverflowException
// Remember v is tail recursive
printfn "%A" <| v 0 10000 // Prints 5000
printfn "%A" <| v 0 100000 // Prints 50000
printfn "%A" <| v 0 1000000000 // Prints 500000000 (takes awhile)
Tail calls are brilliant! So why the dismal x86 performance? That is because tail calls suck!
If we profile the x64 code, the top 3 functions are:
[email protected] 32 %
[email protected] 20 %
[email protected] 18 %
In x86 mode the top 3 functions are:
[email protected] 27 %
[clr.dll] 24 %
[email protected] 17 %
What are the Invoke methods? What is the CLR doing? Even if we eliminated the CLR part, that wouldn't explain the complete drop in performance.

One way to interpret the name [email protected] is that it is the lambda function generated for line 71. In my code, lines 71, 72 and 73 look like this:

  |> TrivialStream.map    (fun v -> int64 v)
  |> TrivialStream.filter (fun v -> v &&& 1L = 0L)
  |> TrivialStream.map    (fun v -> v + 1L)

Because F# inlines small lambdas, the lambda functions that show up in the profile are the data pipeline continuation lambdas for map and filter.
You are going to love this! To answer these questions we have to dig into the jitted x64 and x86 code.

Let's start with the x64 code:
; This is - TrivialStream.filter (fun v -> v &&& 1L = 0L)
00007ff8`5af01160 488bc2 mov rax,rdx
; Note that the test (fun v -> v &&& 1L = 0L) is inlined
; Test if the number is even
00007ff8`5af01163 a801 test al,1
00007ff8`5af01165 7512 jne 00007ff8`5af01179
; It is even, then we should call the next step in the data pipeline.
; That is a virtual call.
; The jitter has devirtualization tricks but they failed to apply here, so
; we have to do the full virtual call...
; First load the pointer to the class that implements the next step in the data pipeline
00007ff8`5af01167 488b4908 mov rcx,qword ptr [rcx+8]
; Load the object Method Table Pointer
00007ff8`5af0116b 488b01 mov rax,qword ptr [rcx]
; Load the class Pointer to Methods
00007ff8`5af0116e 488b4040 mov rax,qword ptr [rax+40h]
; Load the address to virtual method Invoke
00007ff8`5af01172 488b4020 mov rax,qword ptr [rax+20h]
; Jump to the next Invoke (note: this is a tail call so no need to return here)
00007ff8`5af01176 48ffe0 jmp rax
; The number was odd meaning we return to the caller
00007ff8`5af01179 33c0 xor eax,eax
00007ff8`5af0117b c3 ret
Apart from the unfortunate fact that devirtualization failed to apply, which prevents further inlining of the next step, the code above is not that bad.
Sure, there is a lot of overhead compared to the simple odd/even test, but it could be worse.
Speaking of much worse, let's look at the x86 code for the same function:
; This is - TrivialStream.filter (fun v -> v &&& 1L = 0L)
; Method prelude storing various registers... they weren't here in x64 mode :(
014e1090 55 push ebp
014e1091 8bec mov ebp,esp
014e1093 57 push edi
014e1094 56 push esi
014e1095 53 push ebx
014e1096 8bf1 mov esi,ecx
; Loading the value to test from the stack rather than picking it up from a register :(
014e1098 8b4508 mov eax,dword ptr [ebp+8]
; Well at least the test is simple enough and inlined
014e109b a901000000 test eax,1
014e10a0 7526 jne 014e10c8
; The number is even
; But why this test? It wasn't here in x64 mode
014e10a2 833d4010005d00 cmp dword ptr [clr!g_TrapReturningThreads (5d001040)],0
014e10a9 7526 jne 014e10d1
; Ok, here we go loading the pointer to next step in the pipeline
014e10ab 8b4e04 mov ecx,dword ptr [esi+4]
; Pushing input arguments on the stack as required by the x86 ABI
; x64 ABI supports passing values in registers
014e10ae ff750c push dword ptr [ebp+0Ch]
014e10b1 ff7508 push dword ptr [ebp+8]
; Load the object Method Table Pointer
014e10b4 8b01 mov eax,dword ptr [ecx]
; Load the class Pointer to Methods
014e10b6 8b4028 mov eax,dword ptr [eax+28h]
; Load the address to virtual method Invoke
014e10b9 8b4010 mov eax,dword ptr [eax+10h]
; More arguments pushed to stack?
014e10bc 6a02 push 2
014e10be 6a02 push 2
014e10c0 6a01 push 1
014e10c2 50 push eax
; And now we call clr!JIT_TailCall? Not Invoke?
014e10c3 e877cf4a5b call clr!JIT_TailCall (5c98e03f)
; The number is odd so let's return
014e10c8 33c0 xor eax,eax
; Method epilogue :(
014e10ca 5b pop ebx
014e10cb 5e pop esi
014e10cc 5f pop edi
014e10cd 5d pop ebp
014e10ce c20800 ret 8
Some differences can be attributed to the different ABIs of the x64 and x86 modes, but what is clr!JIT_TailCall doing?
Let's look at it:
; Why are we looking up the current thread? :(
5c98e03f e844060000 call clr!GetThread (5c98e688)
5c98e044 50 push eax
5c98e045 51 push ecx
5c98e046 52 push edx
; Something about the current thread is interesting?
5c98e047 f7400480000000 test dword ptr [eax+4],80h
5c98e04e 7406 je clr!JIT_TailCall+0x17 (5c98e056)
5c98e050 50 push eax
5c98e051 e870213400 call clr!JIT_TailCallHelper (5ccd01c6)
; Seems we usually end up here!
; Ok, let's load the arguments we pushed when calling clr!JIT_TailCall
5c98e056 8b542414 mov edx,dword ptr [esp+14h]
5c98e05a 8b44241c mov eax,dword ptr [esp+1Ch]
5c98e05e 8b4c2418 mov ecx,dword ptr [esp+18h]
; A lot of stuff happening, but it seems to be to eliminate the stackframe
; of the current function. Which makes sense because we don't want the stack to grow
; on a tail call
5c98e062 f7c201000000 test edx,1
5c98e068 7409 je clr!JIT_TailCall+0x34 (5c98e073)
5c98e06a 8b7dfc mov edi,dword ptr [ebp-4]
5c98e06d 8b75f8 mov esi,dword ptr [ebp-8]
5c98e070 8b5df4 mov ebx,dword ptr [ebp-0Ch]
5c98e073 ff7504 push dword ptr [ebp+4]
5c98e076 57 push edi
5c98e077 56 push esi
5c98e078 8d7c8508 lea edi,[ebp+eax*4+8]
5c98e07c 8d748c2c lea esi,[esp+ecx*4+2Ch]
5c98e080 8b6d00 mov ebp,dword ptr [ebp]
5c98e083 f7c202000000 test edx,2
5c98e089 752b jne clr!JIT_TailCallLeave+0x1 (5c98e0b6)
5c98e08b 85c9 test ecx,ecx
5c98e08d 740e je clr!JIT_TailCall+0x5e (5c98e09d)
5c98e08f 8b46fc mov eax,dword ptr [esi-4]
5c98e092 83ef04 sub edi,4
5c98e095 83ee04 sub esi,4
5c98e098 8907 mov dword ptr [edi],eax
; Hmm, a loop isn't what we want to see in a time-critical part
5c98e09a 49 dec ecx
5c98e09b 75f2 jne clr!JIT_TailCall+0x50 (5c98e08f)
5c98e09d 8b442408 mov eax,dword ptr [esp+8]
5c98e0a1 8b4c241c mov ecx,dword ptr [esp+1Ch]
5c98e0a5 8947fc mov dword ptr [edi-4],eax
5c98e0a8 894ff8 mov dword ptr [edi-8],ecx
5c98e0ab 8d47f8 lea eax,[edi-8]
; Ok, getting near the end, restoring the registers
5c98e0ae 5e pop esi
5c98e0af 5f pop edi
5c98e0b0 59 pop ecx
5c98e0b1 5a pop edx
5c98e0b2 59 pop ecx
5c98e0b3 8be0 mov esp,eax
clr!JIT_TailCallLeave:
; This actually "returns" to the next step in the pipeline rather than the current step
5c98e0b5 c3 ret
If you want to look at the actual source code for clr!JIT_TailCall you can have a peek at dotnet.core: https://github.com/dotnet/coreclr/blob/master/src/vm/i386/jithelp.asm#L927
So it seems that clr!JIT_TailCall eliminates the current stackframe in order to make sure the stack doesn't grow. In x64 mode the jitter was able to just do a jmp, but for some reason that doesn't work in x86 mode. Unfortunately, eliminating the stackframe seems to be quite a lot of work.

clr!JIT_TailCall is most likely what's tagged as [clr.dll] when we measured the most costly functions.
Let's see what the magic oil does to the functions x, y, v and w we looked at before:
let inline paulWestcottsMagicOil u = match u with i -> i
let rec x n = if n > 0 then n + paulWestcottsMagicOil (y (n - 1)) else 0 // y (n - 1) is not a tail call
and y n = if n > 0 then -n + paulWestcottsMagicOil (x (n - 1)) else 0 // x (n - 1) is not a tail call
let rec v s n = if n > 0 then paulWestcottsMagicOil (w (s + n) (n - 1)) else s // w (s + n) (n - 1) is a tail call
and w s n = if n > 0 then paulWestcottsMagicOil (v (s - n) (n - 1)) else s // v (s - n) (n - 1) is a tail call
paulWestcottsMagicOil runs a match on the input argument and returns whatever matched. It does nothing!
Let's see how the magic function changes the IL Assembler.
Let's start with x:
IL_0008: call int32 Program/Sample2::y(int32)
IL_000d: add
IL_000e: ret
Same as before. The magic oil function does nothing!
Let's look at v:
IL_000a: call int32 Program/Sample2::w(int32, int32)
IL_000f: ret
The .tail attribute is gone?! It seems that because match semantically looks at the value returned by w, the F# compiler doesn't realize this is in essence a tail call, and it omits the .tail attribute.
This is the purpose of paulWestcottsMagicOil: to eliminate the .tail attribute. This means that we will get a StackOverflowException when calling v 0 20000. However, it seems to help the performance of the x86 code.

Let's see how this affects the x64 assembler code.
; Reserve some stack space
00007ff8`5aee1160 4883ec28 sub rsp,28h
00007ff8`5aee1164 488bc2 mov rax,rdx
; The test
00007ff8`5aee1167 a801 test al,1
00007ff8`5aee1169 7515 jne 00007ff8`5aee1180
; Number is even, let's call the next step in the data pipeline
; Load the pointer to the next step in the pipeline
00007ff8`5aee116b 488b4908 mov rcx,qword ptr [rcx+8]
; Load the object Method Table Pointer
00007ff8`5aee116f 488b01 mov rax,qword ptr [rcx]
; Load the class Pointer to Methods
00007ff8`5aee1172 488b4040 mov rax,qword ptr [rax+40h]
; Load and call the address to virtual method Invoke
00007ff8`5aee1176 ff5020 call qword ptr [rax+20h]
00007ff8`5aee1179 33c0 xor eax,eax
; Unreserve stack space
00007ff8`5aee117b 4883c428 add rsp,28h
; We are done!
00007ff8`5aee117f c3 ret
; Number is odd, just return
00007ff8`5aee1180 33c0 xor eax,eax
; Unreserve stack space
00007ff8`5aee1182 4883c428 add rsp,28h
; We are done!
00007ff8`5aee1186 c3 ret
In x64 mode the code when .tail is present is better, as the jitter could jmp to the next step in the pipeline; now it chooses call instead. We lose some performance (about 25%) and we risk StackOverflowException.
Let's look at the x86 code:
; Saving the framepointer. This is a good thing!
02b61078 55              push ebp
02b61079 8bec            mov  ebp,esp
; Trying to walk a stack manually without framepointers is not fun.
; However, the ARM ABI on Linux seems to place the framepointer in a "random" position,
; so it doesn't help there, but that is a story for another blog.
; Load number from stack
02b6107b 8b4508 mov eax,dword ptr [ebp+8]
; Test number if even
02b6107e a901000000 test eax,1
02b61083 7517 jne 02b6109c
; Ok the number is even. Let's call the next step in the pipeline
; Load the pointer to the next step in the pipeline
02b61085 8b4904 mov ecx,dword ptr [ecx+4]
; Push the input arguments on the stack
02b61088 ff750c push dword ptr [ebp+0Ch]
02b6108b ff7508 push dword ptr [ebp+8]
; Load the object Method Table Pointer
02b6108e 8b01 mov eax,dword ptr [ecx]
; Load the class Pointer to Methods
02b61090 8b4028 mov eax,dword ptr [eax+28h]
; Load and call the address to virtual method Invoke
02b61093 ff5010 call dword ptr [eax+10h]
02b61096 33c0 xor eax,eax
; Restore framepointer
02b61098 5d pop ebp
; We are done!
02b61099 c20800 ret 8
; Ok, the number was odd, just return
02b6109c 33c0 xor eax,eax
; Restore framepointer
02b6109e 5d pop ebp
; We are done!
02b6109f c20800 ret 8
This looks much cleaner and it performs much better. The gain seems to primarily come from calling the next step in the pipeline directly, without having to go through clr!JIT_TailCall. We risk StackOverflowException, but perhaps it's worth it.
F# doesn't have break and continue, which are common in other languages. In F#, if you want to implement a loop that stops at a certain condition, the idiom is to use tail recursion:
let rec find f = function
  | []      -> None
  | v :: vs -> if f v then Some v else find f vs
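A quick usage check (find is repeated here so the snippet stands alone; note that the f argument must be threaded through the recursive call):

```fsharp
// Returns the first element satisfying the predicate, if any
let rec find f = function
  | []      -> None
  | v :: vs -> if f v then Some v else find f vs

printfn "%A" (find (fun v -> v % 2 = 0) [1; 3; 4; 5])  // Some 4
printfn "%A" (find (fun v -> v > 9) [1; 3; 4; 5])      // None
```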
But what about efficiency? Doesn't the use of tail calls, especially in x86 mode, kill performance?

This is where the F# compiler applies TCO and unrolls the code above into a loop. If we reverse engineer the IL code into C#, it looks something like this:
public static FSharpOption<a> find<a>(FSharpFunc<a, bool> f, FSharpList<a> _arg1)
{
  while (_arg1.get_TailOrNull() != null)
  {
    FSharpList<a> fsharpList = _arg1;
    FSharpList<a> tailOrNull = fsharpList.get_TailOrNull();
    a headOrDefault = fsharpList.get_HeadOrDefault();
    if (f.Invoke(headOrDefault))
      return FSharpOption<a>.Some(headOrDefault);
    FSharpFunc<a, bool> fsharpFunc = f;
    _arg1 = tailOrNull;
    f = fsharpFunc;
  }
  return (FSharpOption<a>) null;
}
So our tail recursive function becomes an efficient loop instead. TCO cannot be applied in every situation; for example, v and w above are mutually recursive, and TCO doesn't apply there.
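When mutual recursion blocks TCO, one workaround is to merge the mutually recursive functions into a single self-recursive one that tracks which "function" it is currently in. A sketch of my own (vw and its flag are my names, not from the text) computing the same result as v:

```fsharp
// add tracks whether we are in the "v" state (add n) or the "w" state (subtract n)
let vw s n =
  let rec go add s n =
    if n > 0 then
      go (not add) (if add then s + n else s - n) (n - 1)
    else s
  go true s n
```

vw 0 10000 returns 5000, matching v 0 10000, but since go is self-recursive with its call in tail position, the F# compiler can unroll it into a loop.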
Tail recursion to implement loops is more useful than you might think, because in F# seemingly identical loops vary a lot in terms of performance.
// These loops are fast
for i in 0..10 do
  compute i
for i in 0..1..10 do
  compute i
for i in 10..-1..0 do
  compute i
// These loops are slow
for i in 0L..10L do
  compute i
for i in 0..2..10 do
  compute i
// This causes significant GC overhead in addition to being slow
for i in 0..10000000 do
  for j in 0L..10L do
    compute j
If you need a fast loop in F# that increments a long by 2, tail recursion is the way to do it.
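For example, a tail-recursive replacement for the slow stepped int64 loop might look like this (a sketch with names of my own choosing):

```fsharp
// Sums b, b + 2L, ..., up to and including e, without allocating an enumerator.
// The recursive call is in tail position, so the compiler turns it into a tight loop.
let sumByTwo (b : int64) (e : int64) =
  let rec go acc i =
    if i <= e then go (acc + i) (i + 2L) else acc
  go 0L b
```

sumByTwo 0L 10L sums 0, 2, 4, 6, 8, 10 and returns 30L.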
Because we have seen that tail calls cause the jitter to avoid creating a stackframe (x64 mode) or to eliminate the stackframe (x86 mode), tail recursion allows us to implement deep recursion without StackOverflowException.
In x86 mode we saw that eliminating the stackframe using clr!JIT_TailCall caused significant overhead. In x64 mode we instead observed a performance improvement, as the jitter was able to avoid the stackframe altogether.

However, it is possible that for more complex x64 code the jitter ends up needing to call clr!JIT_TailCall.
As far as I know, the C# compiler doesn't generate the .tail attribute, which means the JIT performance team probably hasn't optimized tail calls as aggressively as possible, since F# is a minor .NET language (I wish it were otherwise).
There are things the jitter could do to improve the stackframe elimination, like inlining clr!JIT_TailCall and unrolling its copy loops.
Implementing efficient non-trivial loops in F# is best done with tail recursion. In addition, tail recursion allows us to avoid mutable state. F# has damaged me, and now I tend to write tail recursion even in languages that support break, continue and TCO, like Kotlin. C# doesn't currently support TCO, so tail recursion is a bad idea in C#.
I hope Paul forgives me for naming a function after him. I first saw the tail call suppression trick in an ambitious PR from Paul that aimed to improve the performance of the Seq module.

I see PRs from Paul that try to improve the performance of various core libraries like Seq, generic compare/equality and Lazy. All of these libraries need to be improved, but I realize the difficulty in getting such PRs approved, as they typically involve massive and risky changes. I have a beer instead and try to forget it.
Paul does try and that's why I think he should win the "Tenacious F# developer of the year" award.