Skip to content

Instantly share code, notes, and snippets.

@effectfully
Created February 4, 2025 23:54
Show Gist options
  • Save effectfully/d2f9673deb5dfe6ff63a09931ed737ca to your computer and use it in GitHub Desktop.
Save effectfully/d2f9673deb5dfe6ff63a09931ed737ca to your computer and use it in GitHub Desktop.

In this post we'll tell a story of how we got a 12% performance improvement in our production evaluator.

I looked into GHC Core of the production evaluator and spotted this there:

case safeIndexOne env1_XW (W64# bx18_scPRb) of {
  Nothing -> jump exit_X16 <...>
  Just val_acLQA -> jump returnCek_scPkh ctx1_X0 val_acLQA <...>
}

This performs variable lookup using safeIndexOne. If the lookup is not successful, then evaluator terminates via jump exit_X16, otherwise recursion continues on the looked up value via jump returnCek_scPkh. The jump part of each of these indicates that recursion compiles to an actual jump to a label (i.e. is very efficient), see the "Compiling without Continuations" paper for details: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/compiling-without-continuations.pdf

We can see some clear optimization opportunities:

  1. safeIndexOne appearing in the Core means that it's not getting inlined
  2. the in-scope variable index bx18_scPRb is getting wrapped with W64# suggesting that safeIndexOne is lazier than it needs to be, which is what causes the unnecessary wrapping (and unwrapping inside of safeIndexOne)
  3. returning a Maybe from safeIndexOne just to immediately dispatch on it isn't optimal: we should be able to avoid this intermediate step and not construct any Maybes

So we go and look at how safeIndexOne is defined:

safeIndexOne :: RAList a -> Word64 -> Maybe a
safeIndexOne Nil _ = Nothing
safeIndexOne (BHead w t ts) !i =
    if i <= w
    then indexTree w i t
    else safeIndexOne ts (i-w)
  where
    indexTree :: Word64 -> Word64 -> Tree a -> Maybe a
    indexTree = <...>

The first clause completely ignores the second argument, meaning GHC strictness analysis is going to conclude that it may be worth not evaluating it and so safeIndexOne appears lazy in its second argument to GHC. which is how we end up with that W64# wrapping above. We don't want that, so we tell GHC to evaluate the second argument no matter what (and not to force it in the second clause, because it's forced in the i <= w comparison anyway):

safeIndexOne Nil !_ = Nothing
safeIndexOne (BHead w t ts) i =

We recompile the evaluator module and observe that the strictness issue is gone:

case $wsafeIndexOne env1_XW bx18_swNG of {
  Nothing -> jump exit_X16 <...>
  Just val_apny -> jump returnCek_sw4D ctx1_X0 val_apny <...>
}

$wsafeIndexOne is the "worker" (see the "The Worker/Wrapper Transformation" paper: https://people.cs.nott.ac.uk/pszgmh/wrapper-extended.pdf) operating on a machine word value, so there's no pointless wrapping-unwrapping.

But the body of safeIndexOne still isn't inlined. Looking at the definition of safeIndexOne we observe that it's a recursive function and GHC doesn't inline recursive functions. We can try to do the worker-wrapper transformation manually to get a top-level function that isn't recursive, but calls a recursive worker:

safeIndexOne :: forall a. RAList a -> Word64 -> Maybe a
safeIndexOne = go
  where
    go :: RAList a -> Word64 -> Maybe a
    go Nil !_ = Nothing
    go (BHead w t ts) !i =
        if i <= w
        then indexTree w i t
        else go ts (i-w)

    indexTree :: Word64 -> Word64 -> Tree a -> Maybe a
    indexTree = <...>

Unfortunately this doesn't change anything, because GHC floats go to the outside of safeIndexOne into its own top-level definition and we're back to square one.

We need to somehow force GHC to retain go and indexTree as local definitions, so that when safeIndexOne is inlined, both of them appear in the generated Core at the call site where GHC can optimize them. We could probably achieve that by using something from GHC.Magic, but the easiest way of doing it is to make safeIndexOne accept a term-level argument and use it within go and indexTree. For that we can simply Church-encode Maybe from the result type of safeIndexOne:

safeIndexOne :: forall a b. b -> (a -> b) -> RAList a -> Word64 -> b
safeIndexOne z f = go
  where
    go :: RAList a -> Word64 -> b
    go Nil !_ = z
    go (BHead w t ts) !i =
        if i <= w
        then indexTree w i t
        else go ts (i-w)

    indexTree :: Word64 -> Word64 -> Tree a -> Maybe a
    indexTree = <...>

safeIndexOne now binds z, which is used instead of Nothing, and f, which is used instead of Just.

After adapting the call site from

case Env.safeIndexOne varEnv varIx of
    Nothing  -> throw <...>
    Just val -> pure val

to

Env.safeIndexOne
    (throw <...>)
    pure
    varEnv
    varIx

safeIndexOne finally inlines together with its internals, so in the generated Core we find this beatiful recursive join point (the thing enabling jumps to labels) full of operations on machine words and jumps to labels:

joinrec {
  $windexTree_swMk
    :: Word64#
       -> Word64#
       -> Tree (CekValue uni_swN8 fun_swN9 ann_swNa)
       -> (# State# RealWorld,
             Either
               (CekEvaluationException
                  NamedDeBruijn uni_swN8 fun_swN9)
               (NTerm uni_swN8 fun_swN9 ()) #)
  $windexTree_swMk (ww2_swMc :: Word64#)
                   (ww3_swMg :: Word64#)
                   (ds28_swMi
                      :: Tree
                           (CekValue
                              uni_swN8 fun_swN9 ann_swNa))
    = case ww3_swMg of wild14_X15 {
        __DEFAULT ->
          case ww2_swMc of wild15_X1m {
            __DEFAULT ->
              case ds28_swMi of {
                Leaf ds29_avEa -> jump exit14_X16;
                Node ipv13_avEc ipv14_avEd ipv15_avEe ->
                  case wild14_X15 of wild17_X1n {
                    __DEFAULT ->
                      let {
                        offset'_swtg :: Word64#
                        offset'_swtg
                          = subWord64#
                              wild17_X1n 1#Word64 } in
                      let {
                        halfSize_sw7I :: Word64#
                        halfSize_sw7I
                          = uncheckedShiftRL64#
                              wild15_X1m 1# } in
                      case leWord64#
                             offset'_swtg halfSize_sw7I
                      of {
                        __DEFAULT ->
                          jump $windexTree_swMk
                            halfSize_sw7I
                            (subWord64#
                               offset'_swtg halfSize_sw7I)
                            ipv15_avEe;
                        1# ->
                          jump $windexTree_swMk
                            halfSize_sw7I
                            offset'_swtg
                            ipv14_avEd
                      };
                    1#Word64 -> jump exit15_X17 ipv13_avEc
                  }
              };
            1#Word64 ->
              case wild14_X15 of wild16_X1n {
                __DEFAULT ->
                  case ds28_swMi of {
                    Leaf ds29_avEa -> jump exit14_X16;
                    Node ipv13_avEc ipv14_avEd ipv15_avEe ->
                      let! { __DEFAULT ~ wild18_X1o
                      <- wild16_X1n } in
                      let {
                        offset'_swtg :: Word64#
                        offset'_swtg
                          = subWord64#
                              wild18_X1o 1#Word64 } in
                      case leWord64# offset'_swtg 0#Word64
                      of {
                        __DEFAULT ->
                          jump $windexTree_swMk
                            0#Word64 offset'_swtg ipv15_avEe;
                        1# ->
                          jump $windexTree_swMk
                            0#Word64 offset'_swtg ipv14_avEd
                      }
                  };
                1#Word64 ->
                  case ds28_swMi of {
                    Leaf x_avEC ->
                      jump returnCek_sw7q
                        ctx1_X0 x_avEC ipv12_swMu;
                    Node ipv13_avEE ipv14_avEF ipv15_avEG ->
                      jump exit15_X17 ipv13_avEE
                  }
              }
          };
        0#Word64 -> jump exit16_X18
      }; } in
joinrec {
  $wgo_swMs
    :: RAList (CekValue uni_swN8 fun_swN9 ann_swNa)
       -> Word64#
       -> (# State# RealWorld,
             Either
               (CekEvaluationException
                  NamedDeBruijn uni_swN8 fun_swN9)
               (NTerm uni_swN8 fun_swN9 ()) #)
  $wgo_swMs (ds28_swMm
               :: RAList
                    (CekValue uni_swN8 fun_swN9 ann_swNa))
            (ww2_swMp :: Word64#)
    = case ds28_swMm of {
        BHead bx19_avEO t_avEP ts_avEQ ->
          case leWord64# ww2_swMp bx19_avEO of {
            __DEFAULT ->
              jump $wgo_swMs
                ts_avEQ (subWord64# ww2_swMp bx19_avEO);
            1# ->
              jump $windexTree_swMk bx19_avEO ww2_swMp t_avEP
          };
        Nil -> jump exit13_X1l
      }; } in
jump $wgo_swMs env1_XW bx18_swRA }

And that's all that was required to speed up evaluation by 12%.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment