@viktorklang
Forked from fwbrasil/doc.md
Created December 19, 2018 16:03
Scala `Future` optimizations (8/8/2018 snapshot)

Scala Future optimizations


Flavio W. Brasil, July 2017

Problem


The Future implementation provided by the Scala Standard Library introduces considerable performance overhead when compared to other Future libraries. This cost is probably acceptable for low-scale systems but can be problematic at high scale.

Let’s consider this benchmark:

```java
private static final Function1 mapF = i -> "string";
private static final ExecutionContext ec = scala.concurrent.ExecutionContext.global();

@Benchmark
public Future mapPromise() {
    return Promise.apply().future().map(mapF, ec);
}
```

I’ve isolated the execution of this method in a single iteration after warmup and profiled it using YourKit’s allocation profiler. Creating a Promise and mapping it involves five object allocations and 120 bytes of memory footprint:

[allocation profiler screenshot]

The JMH output confirms the profiling results:

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit |
|---|---|---|---|---|---|---|
| mapPromise | thrpt | 1 | 3 | 21879893.29 | 1539495.68 | ops/s |
| mapPromise:·gc.alloc.rate | thrpt | 1 | 3 | 1668.95 | 83.34 | MB/sec |
| mapPromise:·gc.alloc.rate.norm | thrpt | 1 | 3 | 120.00 | 0.00 | B/op |
| mapPromise:·gc.churn.PS_Eden_Space | thrpt | 1 | 3 | 1582.11 | 76.99 | MB/sec |
| mapPromise:·gc.churn.PS_Eden_Space.norm | thrpt | 1 | 3 | 113.76 | 1.88 | B/op |
| mapPromise:·gc.churn.PS_Survivor_Space | thrpt | 1 | 3 | 0.05 | 0.58 | MB/sec |
| mapPromise:·gc.churn.PS_Survivor_Space.norm | thrpt | 1 | 3 | 0.00 | 0.04 | B/op |
| mapPromise:·gc.count | thrpt | 1 | 3 | 21.00 | NaN | counts |
| mapPromise:·gc.time | thrpt | 1 | 3 | 12.00 | NaN | ms |

Solution


I’ve open-sourced a new project, a Future implementation written in Java, inspired by Twitter’s Future and focused on performance. It supports most of the features provided by Twitter’s Future, including signals (cancellation) and locals.

The optimizations applied to this new project make the implementation diverge considerably from Twitter’s Future, but we were able to port some of the changes. The results showed a significant decrease in both CPU usage and allocation rate.

For comparison, the same operation described above, when executed against the new project, requires only one object allocation for each operation (promise creation, mapping) and 40 bytes of memory:

[allocation profiler screenshot]

The JMH output confirms the profiling results:

| Benchmark | Mode | Threads | Samples | Score | Score Error (99.9%) | Unit |
|---|---|---|---|---|---|---|
| mapPromise | thrpt | 1 | 3 | 59993938.44 | 9740470.96 | ops/s |
| mapPromise:·gc.alloc.rate | thrpt | 1 | 3 | 1524.91 | 261.10 | MB/sec |
| mapPromise:·gc.alloc.rate.norm | thrpt | 1 | 3 | 40.00 | 0.00 | B/op |
| mapPromise:·gc.churn.PS_Eden_Space | thrpt | 1 | 3 | 1358.31 | 20.65 | MB/sec |
| mapPromise:·gc.churn.PS_Eden_Space.norm | thrpt | 1 | 3 | 35.63 | 5.83 | B/op |
| mapPromise:·gc.churn.PS_Survivor_Space | thrpt | 1 | 3 | 0.04 | 0.38 | MB/sec |
| mapPromise:·gc.churn.PS_Survivor_Space.norm | thrpt | 1 | 3 | 0.00 | 0.01 | B/op |
| mapPromise:·gc.count | thrpt | 1 | 3 | 18.00 | NaN | counts |
| mapPromise:·gc.time | thrpt | 1 | 3 | 9.00 | NaN | ms |

The throughput of the new implementation (59,993,938.44) is 2.74 times higher than Scala’s Future (21,879,893.29) in this benchmark. It scores better in other scenarios as well.

The project has a set of benchmarks comparing Twitter’s, Scala’s, and Java’s Futures. I’ve compiled the jmh results into a spreadsheet with the throughput and gc scores.

Overview


From a general perspective, these are the techniques I’ve used to optimize the trane.io implementation:

Reduce closure allocations

Closures are so convenient and easy to create in Java 8 and Scala that it’s common to create one without thinking about the allocation of the closure object and its references to the outer scope. For instance, this function (_ += _) is instantiated for each list element. It’d be possible to cache a single instance of it and reuse it for all elements.
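
To make the idea concrete, here is a minimal Java sketch. The names (CachedClosure, SUM, total) are illustrative, not from any of the codebases discussed: a non-capturing function is allocated once as a static field and reused, instead of being created at every use site.

```java
import java.util.function.BinaryOperator;

class CachedClosure {
    // A non-capturing function cached once as a static field and reused,
    // instead of allocating a new closure object at every use site.
    static final BinaryOperator<Integer> SUM = (a, b) -> a + b;

    static int total(int[] xs) {
        int acc = 0;
        for (int x : xs) {
            acc = SUM.apply(acc, x); // reuses the single cached instance
        }
        return acc;
    }
}
```

Every call to total goes through the one cached SUM instance; the same shape applies to Scala closures that would otherwise be allocated per call.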

Merge interfaces

A typical pattern in the future implementation is creating a promise and other control structures to perform a transformation. It’s often possible to create a new Promise subclass that satisfies multiple interfaces and instantiate a single object. As an example, these three objects (p, completeFirst, foreach closure) can be merged into a single one.
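
A rough sketch of the technique, using CompletableFuture in place of the internal promise (the class MergedMapPromise is hypothetical, not Scala’s or trane.io’s actual code): one object serves as both the downstream promise and the completion callback, so a map-like step costs a single allocation instead of promise + callback + closure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical sketch: one object is both the downstream promise and the
// completion callback registered on the source future.
class MergedMapPromise<T, R> extends CompletableFuture<R>
        implements BiConsumer<T, Throwable> {

    private final Function<T, R> f;

    MergedMapPromise(Function<T, R> f) { this.f = f; }

    // Invoked when the source completes; the promise is its own callback.
    @Override
    public void accept(T value, Throwable error) {
        if (error != null) completeExceptionally(error);
        else complete(f.apply(value));
    }

    static <T, R> MergedMapPromise<T, R> map(CompletableFuture<T> source,
                                             Function<T, R> f) {
        MergedMapPromise<T, R> promise = new MergedMapPromise<>(f);
        source.whenComplete(promise); // register the merged object directly
        return promise;
    }
}
```

The same trick applies to Scala’s Promise: an anonymous subclass that also extends the callback’s function trait collapses several allocations into one.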

Specialize implementations

Sometimes an abstraction can simplify the code but hurt performance. For example, Scala’s Future trait is a nice abstraction; it requires its subclasses to implement only a few methods. This approach comes with a performance penalty: the other methods require an additional closure allocation that adapts the user function to the transform function. Specializing methods often enables optimizations [1] [2] [3].
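
A minimal sketch of that cost, using a toy MiniFuture interface (not Scala’s actual code): the default map must wrap the user function in an adapter closure before delegating to the general transform primitive, while a specialized override applies it directly.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Toy model: a trait-style default `map` routed through the general
// `transform` primitive allocates an adapter closure on every call.
interface MiniFuture<T> {
    <R> MiniFuture<R> transform(BiFunction<T, Throwable, R> f);

    default <R> MiniFuture<R> map(Function<T, R> f) {
        return transform((value, error) -> f.apply(value)); // adapter allocation
    }
}

// An already-completed future, analogous to a kept promise.
final class Done<T> implements MiniFuture<T> {
    final T value;

    Done(T value) { this.value = value; }

    public <R> MiniFuture<R> transform(BiFunction<T, Throwable, R> f) {
        return new Done<>(f.apply(value, null));
    }

    // Specialized override: applies the user function directly, no adapter.
    @Override
    public <R> Done<R> map(Function<T, R> f) {
        return new Done<>(f.apply(value));
    }
}
```

The behavior is identical; only the adapter allocation (and the extra megamorphic call through it) disappears in the specialized path.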

Keep methods small

JIT inlining is crucial for reducing CPU utilization. Some hot methods used by Scala’s Future can’t be inlined because they are too large, or become too large after their nested methods are inlined. I’ve inspected the JIT log of the new project multiple times to make sure that the hot methods were being inlined. This work can be tricky with a Scala codebase, since the bytecode is often larger than the code we write.

The next sections introduce a few optimization ideas based on the revision 9ab72a2 from Jul 11 of the Scala repository. Please note that they are only speculative at this point and require proper validation through microbenchmarks if we decide to implement them.

Minor optimizations


Future.failed

  • Cache the transformation function as a companion object field since it doesn’t capture context

  • It might be worth throwing an exception without a stack trace instead of a plain NoSuchElementException, since the trace adds little information for Future execution and capturing stack traces is expensive
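
On the JVM, a stackless exception is typically obtained by overriding fillInStackTrace. A hedged sketch (the class name and message are illustrative):

```java
import java.util.NoSuchElementException;

// Sketch of a stackless exception: overriding fillInStackTrace() to return
// `this` skips the expensive native stack walk that normally happens at
// construction time.
class StacklessNoSuchElement extends NoSuchElementException {
    StacklessNoSuchElement(String message) { super(message); }

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this; // never capture the stack
    }
}
```

Constructing such an exception is cheap enough to use on hot paths where the trace would be discarded anyway.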

Future.transform

  • The by-name parameter of Try.apply produces a closure with at least two fields (s and r). The closure could be avoided using a regular try/catch block that produces a Try

  • For failures, throwing is probably more expensive than checking whether the exception is fatal using NonFatal.apply
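
A Java sketch of the two shapes, with tryApply standing in for Try.apply (both helper names are illustrative): the by-name argument arrives as a closure capturing f and value, while the inlined try/catch yields the same success-or-failure result with no closure allocation.

```java
import java.util.function.Function;
import java.util.function.Supplier;

class TransformSketch {
    // Stand-in for Try.apply: the by-name body is a Supplier closure.
    static Object tryApply(Supplier<Object> body) {
        try { return body.get(); } catch (RuntimeException e) { return e; }
    }

    // Closure-based: one extra allocation per call for the capturing lambda.
    static <T> Object viaClosure(Function<T, Object> f, T value) {
        return tryApply(() -> f.apply(value));
    }

    // Inlined try/catch: same success-or-failure result, no closure.
    static <T> Object viaTryCatch(Function<T, Object> f, T value) {
        try {
            return f.apply(value);
        } catch (RuntimeException e) {
            return e;
        }
    }
}
```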

Future.filter

  • Maybe it’s worth throwing a stackless exception if the predicate doesn’t hold

Future.collect

  • Cache the fallback function as a companion object field

  • Maybe use a stackless exception

Future.recoverWith

  • The applyOrElse function could be avoided using isDefined + apply. It’s not possible to cache the function since it captures context.

Future.fallbackTo

  • Instead of calling recoverWith twice, use transformWith, avoiding the allocation of two closures and the double cost of the recoverWith calls.

Future$.sequence

  • Cache the foldLeft, zipWith, and map functions as companion object vals

Future$.firstCompletedOf

  • Return the element if the collection has only one element

  • Create a Promise anonymous class that mixes in the function trait satisfying the foreach signature, avoiding the closure allocation.

Future$.find

  • Create a val for the transformWith function and reuse it

Future$.traverse

  • Reuse the zipWith and map cached functions created for Future$.sequence

impl.Promise.resolver

  • This method could take a Failure to avoid allocating a new instance if the exception doesn’t require special handling (last case)

impl.Promise.transform

  • Merge the function trait required by onComplete into a new anon class that also extends DefaultPromise

impl.Promise.transformWith

  • Merge the function trait required by onComplete into a new anon class that also extends DefaultPromise

impl.DefaultPromise.result

  • Instead of calling value, use get() directly and cast the result, avoiding the Option allocation

Medium optimizations


Future$.sequence

  • zipWith is called for each element of the collection and is too expensive. This method could collect results into an array and transform it into the result collection using the cbf. As a reference, see the collect method of the trane.io implementation.

Future$.traverse

  • This method could share some code with an optimized version of Future$.sequence.

Future.*

  • The kept promise classes already override methods that can be optimized based on the outcome of the promise, but other methods could also be optimized. For instance, Future.map could be implemented by KeptPromise.Successful and submit a lightweight Runnable to the ExecutionContext instead of going through the expensive path of calling transform. I suggest making the Future trait abstract, without method bodies, and reimplement all methods in an optimized manner in the concrete Promise classes.

Major optimizations


Introduce WaitQueue

  • Attaching new CallbackRunnables to Promises requires the allocation of a new List. A new WaitQueue implementation that acts both as the callback and as the queue itself could avoid the List allocation. I’m not sure which approach is best: a pointer to the previous callback (immutable, but requires traversing to the root when executing) or to the next one (mutable, harder to manage).
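
A sketch of the previous-pointer variant (the WaitNode class is illustrative, not Scala’s actual internals): each node holds a callback and a link to the previously registered node, so the callbacks themselves form the queue and registering one costs exactly one allocation, with no List cells.

```java
import java.util.function.Consumer;

// Illustrative sketch: the callback node is also the queue link.
final class WaitNode<T> {
    final Consumer<T> callback;
    final WaitNode<T> prev; // previously registered node, or null at the root

    WaitNode(Consumer<T> callback, WaitNode<T> prev) {
        this.callback = callback;
        this.prev = prev;
    }

    // Completion traverses from the newest node back to the root,
    // invoking every registered callback along the way.
    void completeAll(T value) {
        for (WaitNode<T> node = this; node != null; node = node.prev) {
            node.callback.accept(value);
        }
    }
}
```

Note the trade-off mentioned above: this immutable form must traverse back to the root on completion, and callbacks run in reverse registration order.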

Introduce ExecutionContext.immediate: Boolean

  • The Scala Future is an outlier when it comes to scheduling behavior. Most other implementations have a way to reuse the thread that is creating the composition to execute a transformation when possible, avoiding the high scheduling cost. Once the Future.* methods are implemented by the concrete classes, it’s possible to add an if condition to the appropriate methods that checks if immediate returns true and executes the transformation immediately.

  • This new method opens the path to implement an immediate-only ExecutionContext that could replace the internal executor, avoiding the Runnable allocation overhead

  • For recursive computations, an immediate and batching execution context similar to trane.io’s Tailrec (this class was inspired by Monix) would be a good addition
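
The immediate check can be sketched with plain java.util.concurrent.Executor (the MaybeImmediateExecutor interface and the transform helper are hypothetical, not an existing API): when the executor advertises itself as immediate, the transformation runs inline on the calling thread, skipping both scheduling and the Runnable allocation.

```java
import java.util.concurrent.Executor;
import java.util.function.IntUnaryOperator;

// Hypothetical interface: an executor that says whether it runs tasks
// synchronously on the calling thread.
interface MaybeImmediateExecutor extends Executor {
    boolean immediate();
}

class ImmediateSketch {
    static final MaybeImmediateExecutor IMMEDIATE = new MaybeImmediateExecutor() {
        public boolean immediate() { return true; }
        public void execute(Runnable task) { task.run(); }
    };

    static int transform(int value, IntUnaryOperator f, MaybeImmediateExecutor ec) {
        if (ec.immediate()) {
            return f.applyAsInt(value); // inline: no Runnable, no scheduling
        }
        // Normal path: allocate a task and hand it to the scheduler
        // (a truly asynchronous executor would complete a promise instead).
        int[] out = new int[1];
        ec.execute(() -> out[0] = f.applyAsInt(value));
        return out[0];
    }
}
```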

Non-optimization notes


Future$.firstCompletedOf

  • Shouldn’t this method fail if the collection is empty?

Future$.find

  • This method doesn’t seem to have the same behavior as the deprecated version. It seems that, if the last element of the collection finishes first, it’ll never be the result, since the implementation waits for all the futures that come before it in the collection.

Conclusion


This document only gives an overview of the potential optimizations. If we decide to proceed, it will be much easier to discuss the details with code reviews and jmh results for each change.

Note that, even though Scala’s Future currently has lower scores in most benchmarks, it has the potential to score better than trane.io and Twitter’s Future: it doesn’t implement cancellation or locals, so it doesn’t incur the performance overhead of those features.

License


This document is licensed under the Apache License 2.0.
