Do not get bogged down in micro-optimizations before you have assessed the macro-optimizations available. I/O and the choice of algorithm dominate any low-level changes you may make. In the end, you have to think hard about your code!
Before starting to optimize:
- Is the -O2 flag on?
- Profile: find out which part of the code is slow.
- Use the best algorithm in that part.
- Optimize: implement it in the most efficient way.
Manual cost centers are usually better and avoid profiling library dependencies. Don't add cost centers to functions that should be inlined, because the SCC pragma forces no-inline.
See the GHC manual for details on cost centers.
# This will add an SCC everywhere.
# You will probably want to switch to manual cost centers and use {-# SCC "name" #-} <expression> instead.
ghc -O2 -prof -fprof-auto -rtsopts Example.hs
./Example +RTS -p -RTS
cat Example.prof
Don't forget -O2
Manual cost centers:
# Add {-# SCC <name> #-} manually to the functions you want to profile
cabal build --enable-profiling --ghc-options="-fno-prof-auto"
time cabal exec example -- +RTS -p -s -RTS # Produce project.prof and output rts statistics
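For reference, a minimal sketch of what a manual annotation looks like (the function and label names here are illustrative):

-- Example.hs: annotate a single expression with a named cost center.
module Main where

expensive :: Int -> Int
expensive n = sum [1 .. n]

main :: IO ()
main = print ({-# SCC "expensive_sum" #-} expensive 10000000)

The "expensive_sum" cost center then shows up as its own row in the .prof report.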
Automatic cost centers (use with care):
cabal build --enable-profiling --ghc-options="-fprof-auto"
time cabal exec example -- +RTS -p -s -RTS
Recall that for multi-threading you will need:
cabal build --enable-profiling --ghc-options="-threaded -fprof-auto"
time cabal exec example -- +RTS -N -p -s -RTS
Manual cost centers:
mkdir -p .stack-bin
stack clean
stack install --local-bin-path .stack-bin --profile --ghc-options="-fno-prof-auto"
time .stack-bin/example +RTS -p
Automatic cost centers:
mkdir -p .stack-bin
stack clean
stack install --local-bin-path .stack-bin --profile --ghc-options="-fprof-auto"
time .stack-bin/example +RTS -p
See an example here.
- Always dump to a file:
-ddump-to-file
- Dump Core after optimizations:
-ddump-simpl
- You can also dump STG:
-ddump-stg
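From the command line (reusing the Example.hs stand-in from above), -ddump-to-file writes each dump next to the module:

$ ghc -O2 -ddump-simpl -ddump-to-file Example.hs
$ less Example.dump-simpl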
In *.cabal:
flag dump
  manual: True
  default: True

library
  build-depends:
  ghc-options: -O2

  if flag(dump)
    ghc-options: -ddump-simpl -ddump-stg -ddump-to-file
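With the flag defaulting to True, a regular build writes the dump files under dist-newstyle; assuming the default project layout, something like:

$ cabal build
$ find dist-newstyle -name '*.dump-simpl'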
Read:
For example, if I see that a particular pure function is taking a long time relative to the rest of the code, that it works on Text, and that ARR_WORDS is rising linearly in the heap, I probably have a thunk-based memory leak. This is knowledge you build up over time.
When you need to profile CPU usage: profiteur, ghc-prof-flamegraph
For thread profiling: threadscope, ghc-events-analyze
When you need to profile memory usage: eventlog2html
When you need to benchmark your application: criterion
To get an environment with all profiling tools:
$ nix-shell --packages 'haskellPackages.ghcWithHoogle (pkgs: with pkgs; [ criterion deepseq parallel ])' haskellPackages.profiteur haskellPackages.threadscope haskellPackages.eventlog2html haskellPackages.ghc-prof-flamegraph
All examples are based on this program:
hellofib.hs
import Control.Parallel.Strategies
import System.Environment

fib 0 = 1
fib 1 = 1
fib n = runEval $ do
  x <- rpar (fib (n-1))
  y <- rseq (fib (n-2))
  return (x + y + 1)

main = do
  args <- getArgs
  n <- case args of
    []  -> return 20
    [x] -> return (read x)
    _   -> fail "Usage: hellofib [n]"
  print (fib n)
$ ghc -O2 -prof -fprof-auto -rtsopts -threaded hellofib
$ ./hellofib +RTS -N -pa
$ profiteur hellofib.prof
$ firefox hellofib.prof.html
$ ghc -O2 -prof -fprof-auto -rtsopts -threaded hellofib
$ ./hellofib +RTS -N -pa
$ ghc-prof-flamegraph hellofib.prof > output.svg
$ firefox output.svg
$ ghc -O2 -rtsopts -threaded -prof -fprof-auto -eventlog hellofib
# Use -hc to know where the thunk is being created.
# Use -hd or -hy to know which data constructor/type is creating the thunk.
# Use -hr to know why your data is not being garbage collected (retained).
$ ./hellofib +RTS -N -hy -l # -l-agu to not include thread events
$ eventlog2html hellofib.eventlog
$ firefox hellofib.eventlog.html
cabal build --enable-profiling --ghc-options="-fprof-auto"
cabal exec example -- +RTS -hc -l -RTS
For some reason, if you manually add the cost centers and use -fno-prof-auto, the graph is empty.
There is a new profiling flag, -hi, which gives you detailed information about where thunks (unevaluated closures) are accumulating:
$ ghc -eventlog -rtsopts -O2 -finfo-table-map -fdistinct-constructor-tables LargeThunk
$ ./LargeThunk 100000 100000 30000000 +RTS -l -hi -i0.5 -RTS
$ eventlog2html LargeThunk.eventlog
More on the blog post: https://well-typed.com/blog/2021/01/first-look-at-hi-profiling-mode/
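LargeThunk.hs is the example from the blog post; as a minimal stand-in, a program that builds a long thunk chain via a lazy left fold shows the same effect (the name and sizes here are illustrative):

-- ThunkDemo.hs: a lazy foldl builds a chain of (+) thunks before forcing it.
import System.Environment (getArgs)

main :: IO ()
main = do
  [n] <- map read <$> getArgs
  print (foldl (+) 0 [1 .. n :: Integer])

Compile and run it the same way as above, substituting ThunkDemo for LargeThunk.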
Thread profiling and GC insight.
$ ghc -O2 -rtsopts -threaded -prof -fprof-auto -eventlog hellofib
$ ./hellofib +RTS -N -l -s
$ threadscope hellofib.eventlog
Threadscope shows CPU core activity, while ghc-events-analyze shows Haskell thread activity. ghc-events-analyze works even for concurrent programs on a single core, and it lets you instrument regions of your code with named events, as sketched below.
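ghc-events-analyze picks up user events of the form "START <label>" / "STOP <label>"; a minimal sketch of instrumenting a region this way (the "compute" label and the workload are illustrative):

-- Instrument.hs: emit named START/STOP user events for ghc-events-analyze.
import Debug.Trace (traceEventIO)

main :: IO ()
main = do
  traceEventIO "START compute"
  print (sum [1 .. 10000000 :: Int])  -- stand-in for the real work
  traceEventIO "STOP compute"

Compile with -eventlog, run with +RTS -l, then point ghc-events-analyze at the resulting .eventlog.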
- A First Look at Info Table Profiling
- Detecting Space Leaks
- Flame graphs for GHC time profiles with ghc-prof-flamegraph
- FPComplete: Profiling and Performance
- Haskell wiki: performance
- Locating Performance Bottlenecks
- Memory Fragmentation
- Micro-optimizations
- Performance profiling with ghc-events-analyze
- Profiteur: a visualiser for Haskell GHC .prof files
- Spaceleak Stack-limiting Technique: lots of interesting links about spaceleaks inside.
- Top tips and tools for optimising Haskell
- Stackoverflow: GHC's RTS options for garbage collection - Simon Marlow
Tip
If you ever hit an exception that requires a stack trace to debug, use the -xc RTS option (it requires a profiling build).
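A sketch of how that looks, reusing the Example.hs stand-in from above:

$ ghc -O2 -prof -fprof-auto -rtsopts Example.hs
$ ./Example +RTS -xc -RTS

With -xc, the runtime prints the current cost-center stack every time an exception is raised.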