Rule 110 problem Post-Mortem + Tuning

Here's a small, unordered bullet list of things I'd immediately change:

In Main.hs:

  • Unused MultiWayIf pragma
  • I forgot to enable -Wall, which would have shown spurious and unused imports. My bad.
  • Probably could have done better error handling for the parser
  • Given the size of the project, the lookup table needn't live in the reader env, but I would recommend that approach for larger projects for module-health reasons.
  • I did not document a few invariants:
    • In the case where the user passes 0, the exit condition is trivially satisfied
    • In the case where the user passes 1, the exit condition is immediately satisfied.
    • For cases n >= 2, wraparound is well-defined using the algorithm I gave, and guaranteed by the Traversable laws
  • List index lookup is O(n), which is terrible for large n; it makes the traversal O(n^2) in the size of the row. I would rather use a data structure like Vector, which has constant-time index lookups, making the traversal O(n) (see the sketch below)
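
To illustrate the indexing point, a minimal sketch (the names here are hypothetical, and Digit stands in for the two-constructor type from Types.hs):

import qualified Data.Vector as V

data Digit = Zero | One deriving (Eq, Show)

-- (!!) walks the list from the head, so each lookup is O(i) and a full
-- row traversal costs O(n^2).
cellAtList :: [Digit] -> Int -> Digit
cellAtList row i = row !! i

-- (V.!) indexes the backing array directly, so each lookup is O(1) and a
-- full row traversal costs O(n).
cellAtVector :: V.Vector Digit -> Int -> Digit
cellAtVector row i = row V.! i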

In Types.hs:

  • The Digit prisms were unused

Overall, I was happy with the domain modeling I did. My overall strategy was not to overcomplicate it. There is a clear comonadic flavor to cellular automata that would make this a candidate for structural corecursion (this could be done with an anamorphism over the fixpoint Store comonad), but I figured you would appreciate a simple and tasteful stateful approach. This is how I'd write it in production, modulo the things I stated above. If one were to rewrite this to use a comonadic approach without an explicit least fixed point and a generalized anamorphism, one could write the following:

go = flip V.unfoldrM seedVector $ \prev -> do
  -- print the current row
  liftIO $ putStrLn . T.unpack $ ppDigits prev

  -- stop once the row is uniform
  return $ if all (== Zero) prev || all (== One) prev
    then Nothing
    -- otherwise seed the next step with the looked-up row
    else Just ((), computeLookup prev)

computeLookup :: Vector Digit -> Vector Digit
computeLookup = fmap go' ...
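
To make the sketch above concrete, here is one hedged way the elided lookup step could be written. The name computeLookupWith and the rule parameter are hypothetical (in the submission the rule comes from the lookup table), and the modular indexing implements the n >= 2 wraparound invariant noted earlier:

import qualified Data.Vector as V

data Digit = Zero | One deriving (Eq, Show)

-- Hypothetical variant, parameterized over the rule so it works equally
-- well with a table lookup or a hardcoded pattern match: each cell of the
-- next row is determined by its (left, self, right) neighbourhood, with
-- indices wrapping around at the edges.
computeLookupWith
  :: ((Digit, Digit, Digit) -> Digit)
  -> V.Vector Digit
  -> V.Vector Digit
computeLookupWith rule prev = V.imap (\i _ -> rule (neighbourhood i)) prev
  where
    n = V.length prev
    neighbourhood i =
      ( prev V.! ((i - 1) `mod` n)
      , prev V.! i
      , prev V.! ((i + 1) `mod` n)
      )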

Performance Analysis

I've applied all of the above changes (save the unfoldrM); now let's start tuning. The program takes only a small, constant amount of heap space, as expected. Here is a sample for an infinitely looping run, cut off after ~10s, invoked with cabal run simspace -- +RTS -h -RTS 4 and cabal run simspace -O2 -fprof-cafs -- +RTS -hy -l-au -RTS 4:

[heap profile 1] [heap profile 2]

The main takeaway here is that the conversion to text and the construction/traversal/storage of vectors on the heap are the main sources of memory usage (as evidenced by ARR_WORDS, i.e. the raw byte arrays underlying both Text and Vector). Memory usage is minimal, and small enough that we even see the unboxed state token increments show up in the profile. For larger row sizes, I imagine ARR_WORDS would dominate everything else, but not much else will change.

Profiling at row size = 10k

Interestingly, for larger row sizes (in this case, 10,000) most of the time spent in the program is via allocation, as one would expect. But the allocation in this case points to several improvements one can make in order to harden the program:

[heap profile: big boi rows]

In this profile image, we see that the dominators are shades of blue: an unknown type * (probably a C allocation off-heap for arrays), and, interestingly, the triple (,,) and Digits. This implies that we're probably thunking in the constructor for our triples, allocating at least 4 thunks (one for the triple constructor, one for each digit), which eventually get evaluated with each pass of go. These are unnecessary thunks, since they are stored and forced with no meaningful computation in between! This could be locked down by forcing the thunks in the set stage using a strict triple, like so:

data T3 a b c = T3 !a !b !c

-- Or, better, specialize the tuple and unpack!

data Triple = Triple !Digit !Digit !Digit

Which, indeed, immediately cuts down on our Digit allocations:

[heap profile: big boi rows 2]

Another potential optimization, now that we are strict in mostly everything that needs to be strict: the use of Map is perhaps unnecessarily general for our use case. We're not really using the Ord constraint on the Map so much as doing a static lookup keyed on equality, i.e. we should use a HashMap. This will cause a spike in ARR_WORDS, but should be a smaller heap footprint overall at larger sizes. Changing the Map to HashMap yields the following:

[heap profile: hashed and mapped]

Which seems to have roughly the same heap footprint in general! This means Map vs. HashMap is not a significant source of our troubles. The other option is to forgo the lookup table (and that allocation altogether) and simply hardcode the lookup as a pattern match in a dedicated lookup function, which accomplishes the same thing without the overhead of a Map:

[heap profile: ayy lmao]

Now we're really cooking with butter. I see only what's needed to run the program, modulo GC events, and I am happy. The change from a map to a function, combined with the traversal, has allowed us to force the vector of digits for the next row, minimizing the heap footprint for the vector.
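
For concreteness, such a dedicated lookup function might look like the following (a sketch, not the submission's actual code; Digit stands in for the type from Types.hs). Rule 110 sends only the 111, 100, and 000 neighbourhoods to zero:

data Digit = Zero | One deriving (Eq, Show)

-- Rule 110 as a closed pattern match: no Map to allocate, nothing to
-- hash, and nothing left to force at lookup time.
lookupRule :: (Digit, Digit, Digit) -> Digit
lookupRule (One,  One,  One ) = Zero
lookupRule (One,  Zero, Zero) = Zero
lookupRule (Zero, Zero, Zero) = Zero
lookupRule _                  = One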

Further Improvements

There are two potential routes I might go down for further improvements:

  1. One could be to stick with the existing optimal solution and address the lookups of triples in parallel, since the lookups can be factored out in an embarrassingly parallel way. Paying the cost of allocating n-many fibers past a certain row-size threshold may be the way to scale this (a rough sketch follows at the end of this section).

  2. Strictifying everything in this way may not actually be in our best interest! This is clearly an unfold, so it is subject to destroy/unfoldr stream fusion, which can eliminate some of our intermediate structures as they are built. Strictness gives uniformly worst-case asymptotic complexity for program runtime, modulo constant factors: you can never do worse than a strict program, since it can't amortize anything. It may be more efficient in terms of space usage for individual operations, but modulo allocation times and the time it takes to force a thunk, a lazy program has the potential to be the better way forward here, since we can do whole-program optimization. This may not be a very team-savvy approach, since thunk-counting techniques are a bit advanced (for more information, read Okasaki's thesis, a.k.a. Purely Functional Data Structures). If I were writing a library, I would certainly attempt to optimize by making use of wholesale laziness as a first principle!

For more on #2, see this post by Ed Kmett
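
For point 1, here is a rough, hedged sketch of chunked parallel evaluation of the per-cell lookups using Control.Parallel.Strategies from the parallel package; the helper names are hypothetical, and the chunk size would need tuning against the fiber-allocation cost mentioned above:

import qualified Data.Vector as V
import Control.Parallel.Strategies (parListChunk, rseq, withStrategy)

data Digit = Zero | One deriving (Eq, Show)

-- Hypothetical: evaluate the per-cell rule applications in parallel
-- chunks, then rebuild the row vector. Only worth it past some
-- row-size threshold.
computeLookupPar
  :: Int                              -- chunk size
  -> ((Digit, Digit, Digit) -> Digit) -- the rule (table or pattern match)
  -> V.Vector Digit
  -> V.Vector Digit
computeLookupPar chunk rule prev =
  V.fromList
    . withStrategy (parListChunk chunk rseq)
    . map (rule . neighbourhood)
    $ [0 .. n - 1]
  where
    n = V.length prev
    neighbourhood i =
      ( prev V.! ((i - 1) `mod` n)
      , prev V.! i
      , prev V.! ((i + 1) `mod` n)
      )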
