In this post we'll tell a story of how we got a 12% performance improvement in our production evaluator.
I looked into GHC Core of the production evaluator and spotted this there:
case safeIndexOne env1_XW (W64# bx18_scPRb) of {
Nothing -> jump exit_X16 <...>
Just val_acLQA -> jump returnCek_scPkh ctx1_X0 val_acLQA <...>
}
This performs variable lookup using safeIndexOne
. If the lookup is not successful, then evaluator terminates via jump exit_X16
, otherwise recursion continues on the looked up value via jump returnCek_scPkh
. The jump
part of each of these indicates that recursion compiles to an actual jump to a label (i.e. is very efficient), see the "Compiling without Continuations" paper for details: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/compiling-without-continuations.pdf
We can see some clear optimization opportunities:
safeIndexOne
appearing in the Core means that it's not getting inlined- the in-scope variable index
bx18_scPRb
is getting wrapped withW64#
suggesting thatsafeIndexOne
is lazier than it needs to be, which is what causes the unnecessary wrapping (and unwrapping inside ofsafeIndexOne
) - returning a
Maybe
fromsafeIndexOne
just to immediately dispatch on it isn't optimal: we should be able to avoid this intermediate step and not construct anyMaybe
s
So we go and look at how safeIndexOne
is defined:
safeIndexOne :: RAList a -> Word64 -> Maybe a
safeIndexOne Nil _ = Nothing
safeIndexOne (BHead w t ts) !i =
if i <= w
then indexTree w i t
else safeIndexOne ts (i-w)
where
indexTree :: Word64 -> Word64 -> Tree a -> Maybe a
indexTree = <...>
The first clause completely ignores the second argument, meaning GHC strictness analysis is going to conclude that it may be worth not evaluating it and so safeIndexOne
appears lazy in its second argument to GHC. which is how we end up with that W64#
wrapping above. We don't want that, so we tell GHC to evaluate the second argument no matter what (and not to force it in the second clause, because it's forced in the i <= w
comparison anyway):
safeIndexOne Nil !_ = Nothing
safeIndexOne (BHead w t ts) i =
We recompile the evaluator module and observe that the strictness issue is gone:
case $wsafeIndexOne env1_XW bx18_swNG of {
Nothing -> jump exit_X16 <...>
Just val_apny -> jump returnCek_sw4D ctx1_X0 val_apny <...>
}
$wsafeIndexOne
is the "worker" (see the "The Worker/Wrapper Transformation" paper: https://people.cs.nott.ac.uk/pszgmh/wrapper-extended.pdf) operating on a machine word value, so there's no pointless wrapping-unwrapping.
But the body of safeIndexOne
still isn't inlined. Looking at the definition of safeIndexOne
we observe that it's a recursive function and GHC doesn't inline recursive functions. We can try to do the worker-wrapper transformation manually to get a top-level function that isn't recursive, but calls a recursive worker:
safeIndexOne :: forall a. RAList a -> Word64 -> Maybe a
safeIndexOne = go
where
go :: RAList a -> Word64 -> Maybe a
go Nil !_ = Nothing
go (BHead w t ts) !i =
if i <= w
then indexTree w i t
else go ts (i-w)
indexTree :: Word64 -> Word64 -> Tree a -> Maybe a
indexTree = <...>
Unfortunately this doesn't change anything, because GHC floats go
to the outside of safeIndexOne
into its own top-level definition and we're back to square one.
We need to somehow force GHC to retain go
and indexTree
as local definitions, so that when safeIndexOne
is inlined, both of them appear in the generated Core at the call site where GHC can optimize them. We could probably achieve that by using something from GHC.Magic
, but the easiest way of doing it is to make safeIndexOne
accept a term-level argument and use it within go
and indexTree
. For that we can simply Church-encode Maybe
from the result type of safeIndexOne
:
safeIndexOne :: forall a b. b -> (a -> b) -> RAList a -> Word64 -> b
safeIndexOne z f = go
where
go :: RAList a -> Word64 -> b
go Nil !_ = z
go (BHead w t ts) !i =
if i <= w
then indexTree w i t
else go ts (i-w)
indexTree :: Word64 -> Word64 -> Tree a -> Maybe a
indexTree = <...>
safeIndexOne
now binds z
, which is used instead of Nothing
, and f
, which is used instead of Just
.
After adapting the call site from
case Env.safeIndexOne varEnv varIx of
Nothing -> throw <...>
Just val -> pure val
to
Env.safeIndexOne
(throw <...>)
pure
varEnv
varIx
safeIndexOne
finally inlines together with its internals, so in the generated Core we find this beatiful recursive join point (the thing enabling jumps to labels) full of operations on machine words and jumps to labels:
joinrec {
$windexTree_swMk
:: Word64#
-> Word64#
-> Tree (CekValue uni_swN8 fun_swN9 ann_swNa)
-> (# State# RealWorld,
Either
(CekEvaluationException
NamedDeBruijn uni_swN8 fun_swN9)
(NTerm uni_swN8 fun_swN9 ()) #)
$windexTree_swMk (ww2_swMc :: Word64#)
(ww3_swMg :: Word64#)
(ds28_swMi
:: Tree
(CekValue
uni_swN8 fun_swN9 ann_swNa))
= case ww3_swMg of wild14_X15 {
__DEFAULT ->
case ww2_swMc of wild15_X1m {
__DEFAULT ->
case ds28_swMi of {
Leaf ds29_avEa -> jump exit14_X16;
Node ipv13_avEc ipv14_avEd ipv15_avEe ->
case wild14_X15 of wild17_X1n {
__DEFAULT ->
let {
offset'_swtg :: Word64#
offset'_swtg
= subWord64#
wild17_X1n 1#Word64 } in
let {
halfSize_sw7I :: Word64#
halfSize_sw7I
= uncheckedShiftRL64#
wild15_X1m 1# } in
case leWord64#
offset'_swtg halfSize_sw7I
of {
__DEFAULT ->
jump $windexTree_swMk
halfSize_sw7I
(subWord64#
offset'_swtg halfSize_sw7I)
ipv15_avEe;
1# ->
jump $windexTree_swMk
halfSize_sw7I
offset'_swtg
ipv14_avEd
};
1#Word64 -> jump exit15_X17 ipv13_avEc
}
};
1#Word64 ->
case wild14_X15 of wild16_X1n {
__DEFAULT ->
case ds28_swMi of {
Leaf ds29_avEa -> jump exit14_X16;
Node ipv13_avEc ipv14_avEd ipv15_avEe ->
let! { __DEFAULT ~ wild18_X1o
<- wild16_X1n } in
let {
offset'_swtg :: Word64#
offset'_swtg
= subWord64#
wild18_X1o 1#Word64 } in
case leWord64# offset'_swtg 0#Word64
of {
__DEFAULT ->
jump $windexTree_swMk
0#Word64 offset'_swtg ipv15_avEe;
1# ->
jump $windexTree_swMk
0#Word64 offset'_swtg ipv14_avEd
}
};
1#Word64 ->
case ds28_swMi of {
Leaf x_avEC ->
jump returnCek_sw7q
ctx1_X0 x_avEC ipv12_swMu;
Node ipv13_avEE ipv14_avEF ipv15_avEG ->
jump exit15_X17 ipv13_avEE
}
}
};
0#Word64 -> jump exit16_X18
}; } in
joinrec {
$wgo_swMs
:: RAList (CekValue uni_swN8 fun_swN9 ann_swNa)
-> Word64#
-> (# State# RealWorld,
Either
(CekEvaluationException
NamedDeBruijn uni_swN8 fun_swN9)
(NTerm uni_swN8 fun_swN9 ()) #)
$wgo_swMs (ds28_swMm
:: RAList
(CekValue uni_swN8 fun_swN9 ann_swNa))
(ww2_swMp :: Word64#)
= case ds28_swMm of {
BHead bx19_avEO t_avEP ts_avEQ ->
case leWord64# ww2_swMp bx19_avEO of {
__DEFAULT ->
jump $wgo_swMs
ts_avEQ (subWord64# ww2_swMp bx19_avEO);
1# ->
jump $windexTree_swMk bx19_avEO ww2_swMp t_avEP
};
Nil -> jump exit13_X1l
}; } in
jump $wgo_swMs env1_XW bx18_swRA }
And that's all that was required to speed up evaluation by 12%.