@larskuhtz
Last active December 4, 2018 23:05
Haskell Benchmarks for Concurrent Transactions with Many Shared Variables

Setup

This gist contains benchmarks for read-only concurrent transactions that involve many shared variables. The test data is a mutable singly-linked list whose individual "cells" are linked via mutable shared variables. The goal is to measure the overhead of different variable and transaction types in read-only transactions when there are no concurrent writes.

The input to a benchmark run is a list of successive Int values. The benchmark code runs a number of concurrent threads that each fold over the list and compute the sum of all entries. The list isn't mutated during the benchmarks.

Benchmark parameters are

  • the length of the list,
  • the number of concurrent threads, and
  • the type of the variables and transactions.

Variable and transaction types are

  • MVar,
  • Compose IORef Maybe (which is fine, since the data structure isn't mutated during the benchmark),
  • TMVar (in STM) evaluated in a single STM transaction,
  • TMVar (hoisted) with each STM action hoisted into a separate singleton STM transaction,
  • TMVar (in IO) with each STM action called in a separate singleton STM transaction,
  • Compose TVar Maybe (in STM) evaluated in a single STM transaction,
  • Compose TVar Maybe (hoisted) with each STM action hoisted into a separate singleton STM transaction, and
  • Compose TVar Maybe (in IO), accessed in IO (via readTVarIO) without using STM.

"TMVar hoisted" and "TMVar in IO" result in equivalent code. However, the overhead of the hoist implementation for the Stream type that is used in the fold and different inlining by GHC may result in different performance.

Build:

ghc -O2 -threaded -rtsopts -main-is VarList -o main VarList.hs

Run:

./main +RTS -N

Results


As expected, Compose IORef Maybe performs best in all settings. Compose TVar Maybe with readTVarIO performs on par with a plain IORef. MVar adds only a little overhead.

TMVar in IO and TMVar hoisted both have acceptable performance with some overhead due to the STM transactions. The hoisted version is somewhat slower, most likely due to overhead in hoist and different inlining by GHC.

In the context of STM transactions, TMVar and Compose TVar Maybe perform similarly, which can be explained by the fact that TMVar a is implemented as TVar (Maybe a) in the stm package.
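For reference, TMVar is defined in the stm package (Control.Concurrent.STM.TMVar) essentially as a newtype around a TVar holding a Maybe:

-- Definition from the stm package (modulo naming details):
newtype TMVar a = TMVar (TVar (Maybe a))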

Benchmarks that run the complete fold in a single STM transaction didn't complete in acceptable time when there was more than a single thread and the list was moderately long (on the order of thousands of elements).

Results for large STM transactions with many variables

Rendered Notebook

Focusing on the scenarios where the complete fold is done in a single STM transaction, it seems that the run time scales linearly in the number of threads and, up to some point, quadratically in the length of the list.

However, if there is more than a single thread and the length of the list grows beyond a certain point (~15,000 - 20,000 in my tests), the running-time distribution exhibits a growing tail of very long runs, and eventually most runs become long-running. The behavior seems to indicate a race between concurrent transactions, which is surprising, since there are no updates to the variables during the test run.

Conclusion

It seems that read-only STM transactions without concurrent writes scale quadratically in the number of involved variables and linearly in the number of concurrent transactions.

Small (singleton) read transactions come with little overhead, but are still slower than MVar with tryReadMVar.

TVars with readTVarIO have the same performance as plain IORefs. Thus TVars have the benefit that they can be used either like plain IORefs, without any transaction overhead, or inside transactions.
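For example (hypothetical helpers, not part of the benchmark code), the same TVar can be read cheaply outside of any transaction and still participate in full transactions:

-- Cheap, non-transactional read; comparable to readIORef.
peekVar :: TVar (Maybe a) -> IO (Maybe a)
peekVar = readTVarIO

-- The same variable used transactionally, blocking until it is filled.
awaitFilled :: TVar (Maybe a) -> IO a
awaitFilled v = atomically (readTVar v >>= maybe retry return)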

Using STM transactions that involve a large number of variables should be avoided, even when there are no concurrent writes. In addition to the quadratic complexity, it seems that there can be races between read-only transactions even in the absence of concurrent updates.

Thus,

  • IORefs are best when there are no concurrent updates beyond what can be done with atomicModifyIORef.
  • MVars are best when concurrent writes can't be ruled out.
  • TVars are best when variables are used in contexts that don't have concurrent writes (where readTVarIO can be used) as well as in contexts with concurrent writes.

It would be nice if the stm package included a function tryReadTMVarIO that avoids the overhead of an STM transaction.
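A sketch of what such a function could look like: given the TVar (Maybe a) representation mentioned above it is a one-liner, but since the TMVar constructor is not exported it would have to be defined inside the stm package itself.

-- Hypothetical addition to Control.Concurrent.STM.TMVar:
-- read a TMVar without entering an STM transaction.
tryReadTMVarIO :: TMVar a -> IO (Maybe a)
tryReadTMVarIO (TMVar var) = readTVarIO var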

{-# LANGUAGE AllowAmbiguousTypes #-}
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeSynonymInstances #-}
-- |
-- Module: VarList
-- Copyright: Copyright © 2018 Lars Kuhtz <[email protected]>
-- License: MIT
-- Maintainer: Lars Kuhtz <[email protected]>
-- Stability: experimental
--
module VarList
(
-- * Mutable Variables
Var(..)
, MaybeIORef
, MaybeTVar
-- * Cell
, Cell(..)
, foldCell
, cellToStream
, cellFromStream
-- * Test and Benchmark Data
, testCell
-- * Main functions
, main
, main1
, main2
, main3
) where
import Control.Concurrent
import Control.Concurrent.Async
import Control.Concurrent.STM
import Control.DeepSeq
import Control.Monad
import Control.Monad.Morph
import Control.StopWatch
import Criterion
import Criterion.Main
import Data.Foldable
import Data.Functor.Compose
import Data.Functor.Of
import Data.IORef
import qualified Data.List.NonEmpty as L
import qualified Data.Sequence as Seq
import qualified Streaming.Prelude as S
import System.Clock
import System.Environment
import System.IO
import System.Mem
-- -------------------------------------------------------------------------- --
-- Variables
--
-- Note that 'Compose' is used here for type constructors that aren't functors.
class Monad m => Var m v where
    readVar :: v x -> m (Maybe x)
    newEmptyVar :: m (v x)
    newVar :: x -> m (v x)

    -- | Depending on the type of the variable this function may block or spin.
    awaitVar :: v x -> m x

type MaybeIORef = Compose IORef Maybe
type MaybeTVar = Compose TVar Maybe

instance Var STM TMVar where
    readVar = tryReadTMVar
    newEmptyVar = newEmptyTMVar
    newVar = newTMVar
    awaitVar = takeTMVar
    {-# INLINE readVar #-}
    {-# INLINE newEmptyVar #-}
    {-# INLINE newVar #-}
    {-# INLINE awaitVar #-}

instance Var IO TMVar where
    readVar = atomically . tryReadTMVar
    newEmptyVar = newEmptyTMVarIO
    newVar = newTMVarIO
    awaitVar = atomically . awaitVar
    {-# INLINE readVar #-}
    {-# INLINE newEmptyVar #-}
    {-# INLINE newVar #-}
    {-# INLINE awaitVar #-}

instance Var IO MVar where
    readVar = tryReadMVar
    newEmptyVar = newEmptyMVar
    newVar = newMVar
    awaitVar = takeMVar
    {-# INLINE readVar #-}
    {-# INLINE newEmptyVar #-}
    {-# INLINE newVar #-}
    {-# INLINE awaitVar #-}

instance Var STM (Compose TVar Maybe) where
    readVar = readTVar . getCompose
    newEmptyVar = Compose <$> newTVar Nothing
    newVar = fmap Compose . newTVar . Just
    awaitVar = readVar >=> maybe retry return
    {-# INLINE readVar #-}
    {-# INLINE newEmptyVar #-}
    {-# INLINE newVar #-}
    {-# INLINE awaitVar #-}

instance Var IO (Compose TVar Maybe) where
    readVar = readTVarIO . getCompose
    newEmptyVar = Compose <$> newTVarIO Nothing
    newVar = fmap Compose . newTVarIO . Just
    awaitVar = atomically . awaitVar
    {-# INLINE readVar #-}
    {-# INLINE newEmptyVar #-}
    {-# INLINE newVar #-}
    {-# INLINE awaitVar #-}

instance Var IO (Compose IORef Maybe) where
    readVar = readIORef . getCompose
    newEmptyVar = Compose <$> newIORef Nothing
    newVar = fmap Compose . newIORef . Just

    -- | This spins
    awaitVar v = let go = readVar v >>= maybe (yield >> go) return in go
    {-# INLINE readVar #-}
    {-# INLINE newEmptyVar #-}
    {-# INLINE newVar #-}
    {-# INLINE awaitVar #-}
-- -------------------------------------------------------------------------- --
-- Cell

data Cell v a = Cell a !(v (Cell v a))

instance NFData a => NFData (Cell v a) where
    rnf (Cell a _) = rnf a

-- | Fold over the values in a cell
--
foldCell
    :: Monad m
    => Var m v
    => (b -> a -> b)
    -> b
    -> Cell v a
    -> m b
foldCell f b = go b
  where
    go !i (Cell !h !t) = readVar t >>= \case
        Nothing -> return (f i h)
        Just c -> go (f i h) c
cellToStream :: Var m v => Cell v a -> S.Stream (Of a) m ()
cellToStream = go
  where
    go (Cell h t) = do
        S.yield h
        lift (readVar t) >>= \case
            Nothing -> return ()
            Just c -> go c

-- | Transforms a stream into a cell and returns the first cell if the stream
-- isn't empty.
--
-- This could be made lazy using 'unsafeInterleaveIO', which would be fine,
-- since the 'Var's won't change once they are evaluated. However, for
-- benchmarking we want it strict.
--
cellFromStream :: Var m v => S.Stream (Of a) m () -> m (v (Cell v a))
cellFromStream s = S.foldrM f (s >> lift newEmptyVar)
  where
    f !a !x = newVar =<< (Cell a <$> x)

cellFromList :: forall v a m . Var m v => L.NonEmpty a -> m (Cell v a)
cellFromList (a L.:| l) = Cell a <$> go l
  where
    go [] = newEmptyVar
    go (!h:t) = newVar =<< (Cell h <$> go t)

maybeIORefCellFromList :: forall a . L.NonEmpty a -> IO (Cell MaybeIORef a)
maybeIORefCellFromList (a L.:| l) = do
    v <- newEmptyVar
    go v l
    return $ Cell a v
  where
    go :: MaybeIORef (Cell MaybeIORef a) -> [a] -> IO ()
    go _ [] = return ()
    go !v (!h:t) = do
        !v' <- newEmptyVar
        writeIORef (getCompose v) (Just (Cell h v'))
        go v' t
-- -------------------------------------------------------------------------- --
-- Main

testCell :: forall v m . Var m v => Int -> m (Cell v Int)
testCell i = awaitVar =<< cellFromStream (S.each [1..i])

main :: IO ()
main = main3

-- -------------------------------------------------------------------------- --
-- Test

main1 :: IO ()
main1 = do
    [minThreadCount, maxThreadCount, threadCountStep, minListLength, maxListLength, listLengthStep]
        <- fmap read . take 6 <$> getArgs
    hSetBuffering stdout LineBuffering
    putStrLn $ "threadCount, listLength, time"
    forM_ [minThreadCount, (minThreadCount + threadCountStep) .. maxThreadCount] $ \threadCount -> do
        forM_ [minListLength, (minListLength + listLengthStep) .. maxListLength] $ \listLength -> do

            -- Compute test cell
            cell <- testCell @MaybeTVar listLength

            -- prime test cell evaluation
            void $ atomically $ S.sum $ cellToStream $ cell

            -- measure concurrent test cell evaluation
            performMajorGC
            (_, t) <- stopWatch $ replicateConcurrently_ threadCount
                $ atomically $ S.sum_ $ cellToStream $ cell

            -- print result
            putStrLn
                $ show threadCount
                <> ", " <> show listLength
                <> ", " <> show (toNanoSecs t)
-- -------------------------------------------------------------------------- --
-- Comparison with Pure Sequence

main3 :: IO ()
main3 = defaultMain
    [ bs 1
    , bs 100
    , create
    ]
  where
    create = bgroup "create" $
        [ bench "seq"
            $ nf (\s -> Seq.fromList [0..s]) (1000000 :: Int)
        , bench "cell"
            $ nfAppIO (testCell @MaybeIORef) 1000000
        , bench "cell from list"
            $ nfAppIO (\s -> cellFromList @MaybeIORef @Int (0 L.:| [1..s])) 1000000
        , bench "cell from list 2"
            $ nfAppIO (\s -> maybeIORefCellFromList @Int (0 L.:| [1..s])) 1000000
        ]

    bs i = bgroup (show i <> " threads")
        [ bgroup "stream" $
            [ env (return $ Seq.fromList [0..1000000]) $ \(e :: Seq.Seq Int) -> bench "seq"
                $ nfAppIO (S.sum_ . S.each) e
            , env (testCell @MaybeIORef 1000000) $ \e -> bench "cell"
                $ nfAppIO (S.sum_ . cellToStream) e
            ]
        , bgroup "fold" $
            [ env (return $ Seq.fromList [0..1000000]) $ \(e :: Seq.Seq Int) -> bench "seq"
                $ nf (foldl' (+) 0) e
            , env (testCell @MaybeIORef 1000000) $ \e -> bench "cell"
                $ nfAppIO (foldCell (+) 0) e
            ]
        ]
-- -------------------------------------------------------------------------- --
-- Criterion Benchmarks

main2 :: IO ()
main2 = defaultMain
    [ benchmarks 1
    , benchmarks 10
    , benchmarks 1000
    ]

benchmarks :: Int -> Benchmark
benchmarks i = bgroup (show i <> " threads")
    [ env (testCell @MVar 10000) $ \e -> bench "MVar"
        $ nfAppIO (threaded $ S.sum_ . cellToStream) e
    , env (testCell @MaybeIORef 10000) $ \e -> bench "MaybeIORef"
        $ nfAppIO (threaded $ S.sum_ . cellToStream) e
    , bgroup "TMVar" $
        [ env (testCell @TMVar 10000) $ \e -> bench "TMVar hoisted"
            $ nfAppIO (threaded $ S.sum_ . hoist atomically . cellToStream) e
        , env (testCell @TMVar 10000) $ \e -> bench "TMVar in IO"
            $ nfAppIO (threaded $ S.sum_ . cellToStream) e
        ]
        <> if i > 1 then [] else
        [ env (testCell @TMVar 10000) $ \e -> bench "TMVar in STM"
            $ nfAppIO (threaded $ atomically . S.sum_ . cellToStream) e
        -- , env (testCell @TMVar 10000) $ \e -> bench "foldCell Cell TMVar"
        --     $ nfAppIO (threaded $ atomically . foldCell ((+) @Int) 0) e
        ]
    , bgroup "MaybeTVar" $
        [ env (testCell @MaybeTVar 10000) $ \e -> bench "MaybeTVar hoisted"
            $ nfAppIO (threaded $ S.sum_ . hoist atomically . cellToStream) e
        , env (testCell @MaybeTVar 10000) $ \e -> bench "MaybeTVar in IO"
            $ nfAppIO (threaded $ S.sum_ . cellToStream) e
        ]
        <> if i > 1 then [] else
        [ env (testCell @MaybeTVar 10000) $ \e -> bench "MaybeTVar in STM"
            $ nfAppIO (threaded $ atomically . S.sum_ . cellToStream) e
        ]
    ]
  where
    threaded f a = replicateConcurrently_ i (f a)