The Future
implementation provided by the Scala Standard Library introduces considerable performance overhead when compared to other Future
libraries. This cost is probably acceptable for low-scale systems but can be problematic at high scale.
Let’s consider this benchmark:
private static final Function1<Object, String> mapF = i -> "string";
private static final ExecutionContext ec = scala.concurrent.ExecutionContext.global();

@Benchmark
public Future<String> mapPromise() { return Promise.<Object>apply().future().map(mapF, ec); }
I’ve isolated the execution of this method to a single iteration after warmup and profiled it using YourKit’s allocation profiler. Creating a Promise and mapping it involves five object allocations and a 120-byte memory footprint:
The jmh output confirms the profiling results:
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit |
mapPromise | thrpt | 1 | 3 | 21879893.29 | 1539495.68 | ops/s |
mapPromise:·gc.alloc.rate | thrpt | 1 | 3 | 1668.95 | 83.34 | MB/sec |
mapPromise:·gc.alloc.rate.norm | thrpt | 1 | 3 | 120.00 | 0.00 | B/op |
mapPromise:·gc.churn.PS_Eden_Space | thrpt | 1 | 3 | 1582.11 | 76.99 | MB/sec |
mapPromise:·gc.churn.PS_Eden_Space.norm | thrpt | 1 | 3 | 113.76 | 1.88 | B/op |
mapPromise:·gc.churn.PS_Survivor_Space | thrpt | 1 | 3 | 0.05 | 0.58 | MB/sec |
mapPromise:·gc.churn.PS_Survivor_Space.norm | thrpt | 1 | 3 | 0.00 | 0.04 | B/op |
mapPromise:·gc.count | thrpt | 1 | 3 | 21.00 | NaN | counts |
mapPromise:·gc.time | thrpt | 1 | 3 | 12.00 | NaN | ms |
I’ve open-sourced a new project with a Future implementation in Java, inspired by Twitter’s Future and focused on performance. It supports most of the features provided by Twitter’s Future, including signals (cancellation) and locals.
The optimizations applied to this new project make the implementation diverge considerably from Twitter’s Future, but we were able to port some of the changes. The results show a significant decrease in both CPU usage and allocation rate.
For comparison, the same operation described above, when executed against the new project, requires only one object allocation per operation (promise creation, mapping) and 40 bytes of memory:
The jmh output confirms the profiling results:
Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit |
mapPromise | thrpt | 1 | 3 | 59993938.44 | 9740470.96 | ops/s |
mapPromise:·gc.alloc.rate | thrpt | 1 | 3 | 1524.91 | 261.10 | MB/sec |
mapPromise:·gc.alloc.rate.norm | thrpt | 1 | 3 | 40.00 | 0.00 | B/op |
mapPromise:·gc.churn.PS_Eden_Space | thrpt | 1 | 3 | 1358.31 | 20.65 | MB/sec |
mapPromise:·gc.churn.PS_Eden_Space.norm | thrpt | 1 | 3 | 35.63 | 5.83 | B/op |
mapPromise:·gc.churn.PS_Survivor_Space | thrpt | 1 | 3 | 0.04 | 0.38 | MB/sec |
mapPromise:·gc.churn.PS_Survivor_Space.norm | thrpt | 1 | 3 | 0.00 | 0.01 | B/op |
mapPromise:·gc.count | thrpt | 1 | 3 | 18.00 | NaN | counts |
mapPromise:·gc.time | thrpt | 1 | 3 | 9.00 | NaN | ms |
The throughput of the new implementation (59,993,938.44 ops/s) is 2.74 times that of Scala’s Future (21,879,893.29 ops/s) in this benchmark. It scores better in other scenarios as well.
The project has a set of benchmarks comparing Twitter’s, Scala’s, and Java’s Futures. I’ve compiled the jmh results into a spreadsheet with the throughput and gc scores.
From a general perspective, these are the techniques I’ve used to optimize the trane.io implementation:
Reduce closure allocations
Closures are so convenient and easy to create in Java 8 and Scala that it’s common to create one without thinking about the allocation of the closure object and its references to the outer scope. For instance, this function (_ += _) is instantiated for each list element. It’d be possible to cache a single instance of it and reuse it for all elements, as the sketch below shows.
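Note that newer Scala compilers may already cache non-capturing lambdas; the names in this sketch are illustrative:

import scala.collection.mutable.ArrayBuffer

object ClosureCaching {
  // Writing `_ += _` inline allocates the function object each time the
  // enclosing method runs:
  def appendAllInline(buf: ArrayBuffer[Int], xs: List[Int]): ArrayBuffer[Int] =
    xs.foldLeft(buf)(_ += _)

  // Caching a single instance turns it into one allocation overall:
  private val append: (ArrayBuffer[Int], Int) => ArrayBuffer[Int] = _ += _

  def appendAllCached(buf: ArrayBuffer[Int], xs: List[Int]): ArrayBuffer[Int] =
    xs.foldLeft(buf)(append)
}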
Merge interfaces
A typical pattern in the Future implementation is creating a Promise and other control structures to perform a transformation. It’s often possible to create a new class that satisfies multiple interfaces and instantiate a single object. As an example, these three objects (p, completeFirst, and the foreach closure) can be merged into a single one, as in the sketch below.
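This sketch is simplified (a real patch would extend the promise class directly instead of wrapping one), and the names are illustrative:

import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.Try

// One object plays both roles: it owns the resulting future and is itself
// the completion callback passed to onComplete.
final class MapCallback[T, U](f: T => U) extends (Try[T] => Unit) {
  private val promise = Promise[U]()
  def future: Future[U] = promise.future
  def apply(result: Try[T]): Unit = promise.complete(result.map(f))
}

def mapMerged[T, U](fut: Future[T])(f: T => U)(implicit ec: ExecutionContext): Future[U] = {
  val callback = new MapCallback(f)
  fut.onComplete(callback) // the callback object is reused as-is, no extra closure
  callback.future
}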
Specialize implementations
Sometimes an abstraction can simplify the code but hurt performance. For example, Scala’s Future trait is a nice abstraction; it requires its subclasses to implement only a few methods. This approach comes with a performance penalty: the other methods require an additional closure allocation that adapts the user function to the transform
function. Specializing methods often enables optimizations [1] [2] [3].
Keep methods small
JIT inlining is crucial to reducing CPU utilization. Some hot methods used by Scala’s Future can’t be inlined because they are too large or become too large after their nested methods are inlined. I’ve inspected the JIT log of the new project multiple times to make sure the hot methods were being inlined. This work can be tricky with a Scala codebase, since the bytecode is often larger than the code we write.
The next sections introduce a few optimization ideas based on revision 9ab72a2 (from Jul 11) of the Scala repository. Please note that they are only speculative at this point and require proper validation through microbenchmarks if we decide to implement them.
Future.failed
- Cache the transformation function as a companion object field, since it doesn’t capture context.
- It might be worth throwing an exception without a stack trace instead of a plain NoSuchElementException, since the stack trace doesn’t add much information for Future execution and stack traces are expensive. A sketch of both ideas follows this list.
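The names below are illustrative, not the actual internals:

import scala.util.control.NoStackTrace
import scala.util.{Failure, Success, Try}

object FailedSupport {
  // Mixing in NoStackTrace skips the expensive stack trace capture.
  final class StacklessNoSuchElement(msg: String)
      extends NoSuchElementException(msg) with NoStackTrace

  // Captures no context, so a single cached instance can serve every
  // Future.failed call, regardless of the future's type parameter.
  val toFailed: Try[Any] => Try[Throwable] = {
    case Failure(t) => Success(t)
    case Success(_) =>
      Failure(new StacklessNoSuchElement("Future.failed not completed with a throwable."))
  }
}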
Future.transform
- The by-name parameter of Try.apply produces a closure with at least two fields (s and r). The closure could be avoided by using a regular try/catch block that produces a Try, as in the sketch after this list.
- For failures, throw is probably more expensive than checking whether the exception is fatal using NonFatal.apply.
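Here, liftToTry is an illustrative name for what the internal helper could look like:

import scala.util.control.NonFatal
import scala.util.{Failure, Success, Try}

// Applies the user function under an explicit try/catch instead of the
// by-name Try.apply, so no closure over the inputs is allocated.
def liftToTry[T, S](f: T => S, value: T): Try[S] =
  try Success(f(value))
  catch { case NonFatal(e) => Failure(e) } // fatal errors still propagate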
Future.filter
- Maybe it’s worth throwing a stackless exception if the predicate doesn’t hold
Future.collect
- Cache the fallback function as a companion object field.
- Maybe use a stackless exception.
Future.recoverWith
- The applyOrElse function could be avoided using isDefinedAt + apply, as in the sketch below. It’s not possible to cache the function, since it captures context.
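In this sketch, `self` stands for the future being recovered, and the names are illustrative; the tradeoff is that the partial function’s patterns are evaluated twice on the recovered path:

import scala.concurrent.{ExecutionContext, Future}
import scala.util.Failure

def recoverWithSketch[T, U >: T](self: Future[T])(pf: PartialFunction[Throwable, Future[U]])(
    implicit ec: ExecutionContext): Future[U] =
  self.transformWith {
    case Failure(t) if pf.isDefinedAt(t) => pf(t) // isDefinedAt + apply, no applyOrElse default
    case _                               => self  // success, or a failure pf doesn't handle
  }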
Future.fallbackTo
- Instead of calling recoverWith twice, use transformWith to avoid allocating two closures and paying recoverWith’s cost twice. See the sketch below.
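This sketch uses illustrative names and an explicit ExecutionContext where the real implementation would use the internal executor:

import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

def fallbackToSketch[T, U >: T](self: Future[T], that: Future[U])(
    implicit ec: ExecutionContext): Future[U] =
  self.transformWith {
    case Success(_) => self // already successful, reuse it
    case Failure(_) =>
      that.transformWith {
        case Success(_) => that
        case Failure(_) => self // both failed: keep the original failure
      }
  }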
Future$.sequence
- Cache the foldLeft, zipWith, and map functions as companion object vals.
Future$.firstCompletedOf
- Return the element if the collection has only one element.
- Create a Promise anonymous class that mixes in the function trait satisfying the foreach signature, avoiding the closure allocation.
Future$.find
- Create a val for the transformWith function and reuse it.
Future$.traverse
- Reuse the zipWith and map cached functions created for Future$.sequence.
impl.Promise.resolver
- This method could take a Failure to avoid allocating a new instance if the exception doesn’t require special handling (the last case).
impl.Promise.transform
- Merge the function trait required by onComplete into a new anon class that also extends DefaultPromise.
impl.Promise.transformWith
- Merge the function trait required by onComplete into a new anon class that also extends DefaultPromise.
impl.DefaultPromise.result
- Instead of calling value, use get() directly and cast the result to avoid the Option allocation.
Future$.sequence
- zipWith is called for each element of the collection and is too expensive. This method could collect results into an array and transform it into the result collection using the cbf. As a reference, see the collect method of the trane.io implementation, and the sketch after this item.
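The sketch is simplified in two ways: it builds a Vector directly instead of going through the cbf, and it still allocates one callback closure per element; the names are illustrative:

import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.{Failure, Success}

def sequenceArray[A](futures: IndexedSeq[Future[A]])(
    implicit ec: ExecutionContext): Future[IndexedSeq[A]] =
  if (futures.isEmpty) Future.successful(Vector.empty[A])
  else {
    val results = new Array[Any](futures.size)      // pre-sized, filled in place
    val remaining = new AtomicInteger(futures.size) // completion counter
    val p = Promise[IndexedSeq[A]]()
    var i = 0
    while (i < futures.size) {
      val idx = i
      futures(idx).onComplete {
        case Success(a) =>
          results(idx) = a
          if (remaining.decrementAndGet() == 0)
            p.trySuccess(results.toVector.asInstanceOf[Vector[A]])
        case Failure(e) => p.tryFailure(e) // first failure wins
      }
      i += 1
    }
    p.future
  }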
Future$.traverse
- This method could share some code with an optimized version of Future$.sequence.
Future.*
- The kept promise classes already override methods that can be optimized based on the outcome of the promise, but other methods could be optimized as well. For instance, Future.map could be implemented by KeptPromise.Successful to submit a lightweight Runnable to the ExecutionContext instead of going through the expensive path of calling transform, as in the sketch below. I suggest making the Future trait abstract, without method bodies, and reimplementing all methods in an optimized manner in the concrete promise classes.
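In this sketch, `value` stands for the known result held by the kept promise, and the names are illustrative:

import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.control.NonFatal

def mapKeptSuccessful[T, U](value: T)(f: T => U)(
    implicit ec: ExecutionContext): Future[U] = {
  val p = Promise[U]()
  ec.execute(new Runnable {
    def run(): Unit =
      try p.success(f(value))                    // no transform machinery involved
      catch { case NonFatal(e) => p.failure(e) } // keep failures inside the future
  })
  p.future
}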
Introduce WaitQueue
- Attaching new CallbackRunnables to Promises requires allocating a new List. A WaitQueue implementation that acts both as the callback and as the queue itself could avoid the List allocation. I’m not sure which approach is best: a pointer to the previous callback (immutable, but requires traversing to the root when executing) or to the next one (mutable, harder to manage). The sketch after this item shows the first variant.
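The names in this sketch are illustrative:

import scala.util.Try

// Each node is both a callback and a link in the queue, so attaching a
// listener allocates only the node itself, with no extra List cell.
abstract class WaitQueue[T](val prev: WaitQueue[T]) extends (Try[T] => Unit) {
  // Runs this callback and every earlier one, walking back to the root.
  final def flush(result: Try[T]): Unit = {
    var node: WaitQueue[T] = this
    while (node ne null) {
      node(result)
      node = node.prev
    }
  }
}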
Introduce ExecutionContext.immediate: Boolean
- The Scala Future is an outlier when it comes to scheduling behavior. Most other implementations have a way to reuse the thread that is creating the composition to execute a transformation when possible, avoiding the high scheduling cost. Once the Future.* methods are implemented by the concrete classes, it’s possible to add an if condition to the appropriate methods that checks whether immediate returns true and, if so, executes the transformation immediately.
- This new method opens the path to an immediate-only ExecutionContext that could replace the internal executor, avoiding the Runnable allocation overhead.
- For recursive computations, an immediate and batching execution context similar to trane.io’s Tailrec (a class inspired by Monix) would be a good addition. A sketch of the proposed flag follows this list.
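In this sketch, `immediate` and the context below are illustrative, not existing Scala APIs:

import scala.concurrent.ExecutionContext

trait ImmediateAwareExecutionContext extends ExecutionContext {
  // True when the context may run a task on the calling thread.
  def immediate: Boolean
}

// An immediate-only context that could replace the internal executor,
// avoiding the Runnable scheduling round trip. It is not stack-safe for deep
// recursion; a batching variant like trane.io's Tailrec would address that.
object CallingThreadContext extends ImmediateAwareExecutionContext {
  val immediate: Boolean = true
  def execute(runnable: Runnable): Unit = runnable.run()
  def reportFailure(cause: Throwable): Unit = cause.printStackTrace()
}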
Future$.firstCompletedOf
- Shouldn’t this method fail if the collection is empty?
Future$.find
- This method doesn’t seem to have the same behavior as the deprecated version. If the last element of the collection finishes first, it can never be the result, since the implementation waits for all the futures that come before it in the collection.
This document only gives an overview of the potential optimizations. If we decide to proceed, it will be much easier to discuss the details with code reviews and jmh results for each change.
Note that, even though the Scala Future currently has lower scores in most benchmarks, it has the potential to score better than the trane.io and Twitter implementations: it doesn’t implement cancellations and locals, so it doesn’t incur the performance overhead of these features.
This document is licensed under the Apache License 2.0.