RFC: Partial Escape Analysis in HotSpot C2

Summary

Reduce object-oriented overhead of a method compiled by C2 compiler. The C2 parser delays heap allocation and initialization to the place from which the object is about to escape. The optimization reduces allocation and initialization cost of an object on the paths of flowgraph where it is not escaped.

Goals

Perform flow-sensitive escape analysis
Delay heap allocation and initialization of an object until it is about to escape.
Avoid allocation and initialization of an object on the paths of flowgraph where it not escaped.

Non-Goals

Replace the existing escape analysis and its optimizations

Motivation

Currently, C2 escape analysis (EA) is flow-insensitive. It determines object escapability by considering all paths.

In Figure-1, C2 EA marks the new object at line 3 as escaped because the assignment at line 6 stores the reference which points to the object into an instance field. C2 can not perform any optimization on the basis of its escapability. Allocation and initialization of the object is useless when cond is false.

  1 private Object _cache;
  2 public void foo(boolean cond) {
  3     Object x = new Object(); 
  4
  5     if (cond) {
  6         _cache = x;
  7     }
  8 }
  
  Figure-1: partially escaped object

Many Java programs incorporate high-level abstractions, such as logger, instrumentation etc. Pure object-oriented programming languages such as Scala endorse the style even more. Those abstractions involve extra objects but may be conditionally evaluated. This leads to partially escaped objects. Those objects are only escaped on certain paths of flowgraph but not all. Flow-sensitive analysis can discover this opportunity and optimize them.

Vladimir Dolzhenko studied Preconditions of Guava library¹. Java supports varargs using array. That is, all optional arguments are packed into a new Object[] as an actual argument by javac. Varargs errorMessageArgs at line 4 in Figure-2 implies creation of an array, which is the up-front cost of checkArgument but is rarely needed.

  1 public static void checkArgument(
  2     boolean expression,
  3     @Nullable String errorMessageTemplate,
  4     @Nullable Object... errorMessageArgs) {
  5   if (!expression) {
  6     throw new IllegalArgumentException(format(errorMessageTemplate, errorMessageArgs));
  7   }
  8 }
  
  Figure-2: checkArguments in Guava Library

A microbenchmark measures execution time and normalized allocation rate of checkArguments. The slow path is taken in 2 out of 20 iterations. Varargs use probability is only 10%. The allocation rate of openjdk19_ga is 71% more than GraalVM 22.3* and the execution time is 1.7x. Since allocation rate of GraalVM without PEA is close to openjdk19_ga, it shows the PEA phase contributes the allocation reduction.

cond_alloc	jdk19_ga	graal22_jdk17	graal22_wo_PEA
runtime(ns/op)	945.800615	552.64111	858.128181
allocate_norm(B/op)	2320.04308	1344.02671	2544.05462

_{* GraalVM we tested is running on top of OpenJDK 17. The different factor in the benchmark is that GraalVM replaces C2 with Graal JIT.}

Description

We propose to implement Stadler’s Partial Escape Analysis (PEA) algorithm ²³ in C2. Strictly speaking, Stadler’s PEA can subsume C2 EA. Instead of such a radical change, we propose to focus on escaped objects and delegate non-escaped objects to the existing C2 phases. There are 3 elements to Stadler’s PEA: 1) flow-sensitive escape analysis. 2) lazy code motion 3) scalar replacement. Since C2 already performs scalar replacement for the eligible objects in MacroExpand phase, we would like to supplement C2 with the missing parts. Our adaption will let C2 take the same benefits but make more sense to the established architecture.

We plan to perform PEA in C2’s parser. The driving consideration is compilation time. The algorithm requires a Control-Flowgraph (CFG) and a reverse-post-order (RPO) basic block walk. Neither CFG nor RPO is available in C2 optimizer, so C2 would have to build it and schedule nodes if we performed the algorithm in optimizer. The author reports that the PEA phase takes 3.5~4.5% of the total compilation time in Graal and half of that is for scheduling (³, 8.4). It is not economical to do so in C2 plus our design is not full-blown Stadler’s PEA. On the other hand, C2 parser leverages prebuilt flowgraph and the parsing order of basic blocks is RPO. Another property C2 possesses is that it inlines methods along with parsing. The success of inlining helps C2 to collect object states. In the future, we would like to utilize escape analysis to make eager inlining decision on constructors ⁴.

Stadler’s PEA requires to move allocation and initialization. In C2, allocation node is a macro node with JVMState. InitializeNode is a memory barrier and represents a bunch of memory stores. None of them are easy to move around. Given that, we propose to clone the object where PEA needs to materialize a VirtualObjectNode. Because the original AllocationNode is either dead or scalar replaceable after PEA, we expect MacroExpand phase to get rid of the original one in C2 optimizer. We plan to assert that the original node is either eliminated or scalarized. The assertion guarantees that no redundancy is introduced by cloning.

We plan to keep track of allocation states using VirtualState* in Figure-3 as Stadler described. The only difference is that we do not replace AllocationNode with VirtualObjectNode. When C2 needs to materialize a virtual object, parser clones the AllocationNode and replaces its JVMState with the JVMState right before materialization. We need to tweak the JVMState to produce the same stacktrace of OutOfMemory exception as if the allocation node has never moved. An InitializeNode is placed to set the object according to its VirtualState as if it has been updated since beginning. An object translates to a SafePointScalarObjectNode at a safepoint in scalar replacement like before, either it is original or cloned. The existing deoptimization support is expected to work with few modifications. It is worth noting that we plan not to perform on-the-fly scalar replacement. C2 optimizer can perform memory redundancy elimination because the ideal graph of C2 encodes pretty accurate memory information.

class ObjectState {};

class VirtualState: public ObjectState {
  int lockCount;
  Node** entries;
};

class EscapedState: public ObjectState {
  Node* materializedValue ;
};

class State {
  Map<AllocationNode* , ObjectState*> state;
  Map<Node*, AllocationNode*> alias;
};

Figure-3: Tracking Allocation State

Here is the major differences from Graal’s implementation.

The algorithm runs in parser instead of optimizer.
Prefer clone-and-eliminate strategy rather than virtualize-and-materialize.
Refrain from scalar replacement on-the-fly.

Overall, the decisions above are for simple implementation and fail-fast model. A downside we are aware of is that it can not leverage loop unrolling to discover more arrays. C2 EA can cover non-escaped arrays. We think it is rare that an array escapes conditionally in a loop. We would reconsider or add workaround if we found this limitation matters in real applications.

_{* VirtualState overlaps a bit with C2 InitializeNode. InitializeNode emphasizes memory effects rather than Object State. We also feel that it’s easier to implement the algorithm “by the book”.}

Risks

Compilation time increases, in particular for loops. PEA keeps iterating a loop until it finds a fixed point. Graal shows that the process converges within 10 iterations most of the time. We will abandon this optimization for the compilation unit if it turns out too slow.
Code bloat in certain code shape*. Original allocation node may appear multiple times at different escaping locations. Even though the runtime cost is the same, C2 has to expand the allocation nodes multiple times. It is even worse in the presence of fast-slow optimization of TLAB.

_{* I reckon Graal also has this issue and it seems not to be a concern}

Alternatives

C2 could make the existing EA phase flow-sensitive ⁵, but it can only compute escapability. C2 would require another phase to perform lazy code motion. This is complicated because (1) C2 lacks a reliable way to recalculate JVMState, and (2) delaying object initialization means moving stores. It is error-prone to rewire memory nodes.

It is also possible to do flow-sensitive escape analysis in bytecode and transform code using bytecode rewriter. In this way, all JIT compilers and even interpreter would benefit from it. There are two obstacles to overcome: (1) it has to be inter-procedural because there is no inliner in bytecode analyzer. (2) code motion requires rewriter to inject massive bytecode sequences to set an object to a certain state. It would significantly increase code size in bytecode* and may jeopardize other bytecode-based tools.

_{* JIT compilers usually have a constraint on code size. A large method may prevent a JIT compiler from compiling.}

Vladimir Dolzhenko, Guava, Graal and Partial Escape Analysis. https://gist.github.com/vladimirdolzhenko/d85789232d289f72607bebac88d43ab2 ↩
Stadler, Lukas, Thomas Würthinger, and Hanspeter Mössenböck. "Partial escape analysis and scalar replacement for Java." Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization. 2014. ↩
Stadler, Lukas, Partial and Speculative Escape Analysis, “Partial Escape Analysis and Scalar Replacement for Java”, Phd Dissertation ↩ ↩²
Kotzmann, Thomas, and Hanspeter Mossenbock. "Run-time support for optimizations based on escape analysis." International Symposium on Code Generation and Optimization (CGO'07). IEEE, 2007. ↩
Choi, Jong-Deok, et al. "Escape analysis for Java." Acm Sigplan Notices 34.10 (1999): 1-19. ↩

navyxliu/PEA_C2.md