In JDK 22 we resolved a performance issue that had been plaguing the FFM API for a while: excess allocations when using try-with-resources with freshly created Arenas. These allocations were caused by dead code in exception handlers (catch blocks) that the C2 JIT compiler failed to remove.

We solved this for JDK 22: 8267532: C2: Profile and prune untaken exception handlers

The effect of this optimization is more broadly applicable to Java code using try-with-resources, or just any untaken catch block. So I thought it would be interesting to discuss the issue, and how it was solved, in this post.

By the way, ‘pruning’ in the title refers to the pruning you would do while gardening, where part of a plant, such as a branch, is cut off. This is similar to how we ‘cut off’ dead branches in code while JIT compiling.

Why are my allocations escaping?

The use pattern where we ran into this issue is one that is fairly common when using the FFM API:

try (Arena arena = Arena.ofConfined()) {
    MemorySegment segment = arena.allocateFrom("Hello!");
    func(segment);
}

This code creates a new arena, allocates some data in that arena, and then passes the pointer to that data to a native function (func). Assume that the implementation of func just forwards the call using a native method handle produced by java.lang.foreign.Linker::downcallHandle.
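To make this concrete, here is a minimal sketch of what such a func could look like. The native symbol name ("func") and its void(pointer) signature are assumptions made purely for illustration:

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

static final Linker LINKER = Linker.nativeLinker();
// hypothetical native function: void func(const char*)
static final MethodHandle FUNC_HANDLE = LINKER.downcallHandle(
        LINKER.defaultLookup().find("func").orElseThrow(),
        FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

static void func(MemorySegment segment) {
    try {
        FUNC_HANDLE.invokeExact(segment); // only the primitive address crosses into native code
    } catch (Throwable t) {
        throw new AssertionError(t);
    }
}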

There are several allocations in this code: the Arena and the MemorySegment are the most prominent ones. However, since the implementation of func will only pass a primitive address to the native function, none of the allocated objects escape to wider Java code, and in theory C2 should be able to scalar replace these objects, avoiding their allocation altogether.

But, when verifying this using the techniques described in my other post, ‘Tracking down escaping objects’, it turned out that there were several escaping allocations. In JDK 21, this code has the following escaping allocations:

JavaObject(38) allocation in: MemorySessionImpl::createConfined @ bci:0 (line 145)
  -> Field(63)
  -> JavaObject(40)
  -> LocalVar(117)
  -> LocalVar(157)
  Reason: Escapes as argument to call to: jdk.internal.foreign.MemorySessionImpl$1::close void ( jdk/internal/foreign/MemorySessionImpl$1 (java/lang/AutoCloseable,java/lang/foreign/Arena,java/lang/foreign/SegmentAllocator):NotNull * ) TestArena::payload @ bci:36 (line 25)


JavaObject(39) allocation in: ConfinedSession::<init> @ bci:2 (line 55)
  -> Field(45)
  -> JavaObject(38)
  -> Field(63)
  -> JavaObject(40)
  -> LocalVar(117)
  -> LocalVar(157)
  Reason: Escapes as argument to call to: jdk.internal.foreign.MemorySessionImpl$1::close void ( jdk/internal/foreign/MemorySessionImpl$1 (java/lang/AutoCloseable,java/lang/foreign/Arena,java/lang/foreign/SegmentAllocator):NotNull * ) TestArena::payload @ bci:36 (line 25)


JavaObject(40) allocation in: MemorySessionImpl::asArena @ bci:0 (line 80)
  -> LocalVar(117)
  -> LocalVar(157)
  Reason: Escapes as argument to call to: jdk.internal.foreign.MemorySessionImpl$1::close void ( jdk/internal/foreign/MemorySessionImpl$1 (java/lang/AutoCloseable,java/lang/foreign/Arena,java/lang/foreign/SegmentAllocator):NotNull * ) TestArena::payload @ bci:36 (line 25)


JavaObject(41) allocation in: NativeMemorySegmentImpl::makeNativeSegment @ bci:112 (line 136)
  -> Field(67)
  -> JavaObject(39)
  -> Field(45)
  -> JavaObject(38)
  -> Field(63)
  -> JavaObject(40)
  -> LocalVar(117)
  -> LocalVar(157)
  Reason: Escapes as argument to call to: jdk.internal.foreign.MemorySessionImpl$1::close void ( jdk/internal/foreign/MemorySessionImpl$1 (java/lang/AutoCloseable,java/lang/foreign/Arena,java/lang/foreign/SegmentAllocator):NotNull * ) TestArena::payload @ bci:36 (line 25)

If you look at the escape routes of these objects, you’ll notice that JavaObject(38) escapes because JavaObject(40) escapes, JavaObject(39) escapes because JavaObject(38) escapes, and JavaObject(41) escapes because JavaObject(39) escapes. Or, in other words: the entire graph of objects escapes together with JavaObject(40), which escapes as an argument to an out-of-line call to jdk.internal.foreign.MemorySessionImpl$1::close.

Why is this call not being inlined?

So, why is there an out-of-line call here that is making our object graph escape? Looking at the inlining trace, we find that… the call is being inlined?

TestArena::payload (53 bytes)
@ 22   jdk.internal.foreign.MemorySessionImpl$1::close (8 bytes)   inline (hot)
  @ 4   jdk.internal.foreign.MemorySessionImpl::close (12 bytes)   inline (hot)
    @ 1   jdk.internal.foreign.ConfinedSession::justClose (52 bytes)   inline (hot)
     ...

No wait… it’s not?

@ 36   java.lang.foreign.Arena::close (0 bytes)   virtual call
@ 36   jdk.internal.foreign.MemorySessionImpl$1::close (8 bytes)   low call site frequency

Ah right! There are 2 calls to close: one for the path without an exception, and one for the path with an exception (in the ‘catch block’). This is the bytecode javac generates for the Java code:

Code:
 0: invokestatic  #12                 // InterfaceMethod java/lang/foreign/Arena.ofConfined:()Ljava/lang/foreign/Arena;
 3: astore_0
 4: aload_0
 5: ldc           #18                 // String Hello!
 7: invokeinterface #20,  2           // InterfaceMethod java/lang/foreign/Arena.allocateUtf8String:(Ljava/lang/String;)Ljava/lang/foreign/MemorySegment;
12: astore_1
13: aload_1
14: invokestatic  #24                 // Method func:(Ljava/lang/foreign/MemorySegment;)V
17: aload_0
18: ifnull        52
21: aload_0
22: invokeinterface #28,  1           // InterfaceMethod java/lang/foreign/Arena.close:()V
27: goto          52
30: astore_1
31: aload_0
32: ifnull        50
35: aload_0
36: invokeinterface #28,  1           // InterfaceMethod java/lang/foreign/Arena.close:()V
41: goto          50
44: astore_2
45: aload_1
46: aload_2
47: invokevirtual #33                 // Method java/lang/Throwable.addSuppressed:(Ljava/lang/Throwable;)V
50: aload_1
51: athrow
52: return
Exception table:
   from    to  target type
       4    17    30   Class java/lang/Throwable
      35    41    44   Class java/lang/Throwable

We see two calls to close: one at bytecode index (bci) 22 and one at 36. This is a consequence of how finally blocks are translated by javac. The code in a finally block is essentially copy-pasted along both the non-exception path and the exception path (in the exception handler). If we look at the exception table, we see that bci 36 is inside an exception handler. This all checks out so far.
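Written out as Java source, the duplication looks roughly like this. This is a simplified, hand-written approximation of the bytecode above (the generated code additionally null-checks the resource before each call to close):

Arena arena = Arena.ofConfined();
try {
    MemorySegment segment = arena.allocateFrom("Hello!");
    func(segment);
} catch (Throwable primary) {
    try {
        arena.close();                 // the close in the exception handler (bci 36)
    } catch (Throwable suppressed) {
        primary.addSuppressed(suppressed);
    }
    throw primary;
}
arena.close();                         // the close on the normal path (bci 22)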

So, if we look back at our inlining trace, we see that the call to close along the normal, exception-less path (@ 22) is being inlined as expected, but the call to close in the exception handler (@ 36) is not being inlined due to ‘low call site frequency’. Cross-referencing this with the escaping allocations, we see that it is this call in the exception handler at bci 36 into which the object graph escapes:

Reason: Escapes as argument to call to: jdk.internal.foreign.MemorySessionImpl$1::close ... TestArena::payload @ bci:36 (line 25)

C2 will not inline calls that are hardly ever reached (low frequency), since these calls are likely not on the hot path, and therefore there would be less benefit from inlining. In our specific case, however, the call site of close in the exception handler is never reached at all, because an exception is never thrown. But this ‘dead’ code is still interfering with other optimizations, such as scalar replacement. The dead code is not outright removed because there is still a chance that an exception might occur some time in the future, so the code C2 generates has to account for that possibility.

However, C2 also has a way to deal with code that is never reached in practice: uncommon traps. An uncommon trap replaces a piece of code that is very unlikely to be needed (based on profiling information) with a trap that deoptimizes the code and continues running in the interpreter. C2 essentially bets that this code is not needed, so it’s not worth optimizing for. The fact that this is possible is one of the strengths of a mixed-mode VM that can both interpret and JIT compile code: the JIT can speculate and go back to the interpreter in the worst case. Uncommon traps are for instance used to prune unlikely branches of if/else statements.
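As a purely illustrative sketch (this is Java source, not actual C2 output), consider a method where profiling shows that one branch is never taken:

static int scale(int x) {
    if (x >= 0) {
        return x * 2;   // always taken during profiling: compiled as the hot path
    } else {
        return -x;      // never taken: C2 can replace this branch with an uncommon trap
    }
}

If a negative x ever shows up, the trap fires, the frame is deoptimized, and execution continues in the interpreter, which still contains the full else branch.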

An uncommon trap can also re-inflate objects that have been scalar replaced, so uncommon traps do not interfere with the scalar replacement optimization, even though they may ‘use’ an allocated object. So, in theory, it should be possible for C2 to replace the exception handler of our try-with-resources block with an uncommon trap (since we never enter the exception handler). This should then allow our object graph to be scalar replaced.

Why is there no uncommon trap?

The VM profiles ‘regular’ branches, such as the branches of an if statement, by counting how many times each branch is taken. The JIT then uses this count, together with the invocation count of the enclosing method, to determine how frequently the branch is taken. If a branch is heuristically deemed to be ‘rarely’ taken, that branch is replaced with an uncommon trap.

We could say that exception handlers are a type of branch, where the branch can be entered from any point in the code that the exception handler covers. So, we should be able to profile those as well, right? Well, most profiling works based on the bytecode that’s being executed. For example, for if statements, goto bytecodes are used. When a goto bytecode is executed, we also do some branch profiling. This is both simple and fast. However, since exception handlers can start with any bytecode, we cannot apply the same strategy: we can’t tell when executing e.g. an iconst_0 bytecode whether it’s the first bytecode of an exception handler just by looking at the bytecode. Perhaps we could keep a table of all the first bcis of the exception handlers in the method we are executing, and if the bci we execute is one of them, do the profiling. But this would slow down the execution of all bytecodes, for something that is supposed to be ‘exceptional’, i.e. happen rarely. That doesn’t sound like a good deal, and I suspect it is one of the main reasons why profiling of exception handlers wasn’t done sooner.

But, when an exception is thrown, we go through a runtime call (an out-of-line call to some C++ code) to look up the exception handler. We can simply put the profiling code in that runtime call, under the assumption that we only need to look up an exception handler when we are actually going to execute it. That is also what we ended up doing: whenever we look up an exception handler, we mark that handler as ‘entered’, and that is our profiling. When C2 parses an exception handler, it can now check whether that exception handler was ever entered, and if not, insert an uncommon trap instead of the exception handler. Besides that, we also need to mark an exception handler as entered when we deoptimize through this uncommon trap, so that if an exception is thrown after all, we don’t try to insert another uncommon trap the next time the code is compiled (since exceptions are evidently a possibility there).
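As a purely conceptual model, written in Java for illustration (the real implementation lives in HotSpot’s C++ runtime and in C2; all names below are hypothetical):

// One 'entered' flag per exception handler in a method, keyed by handler bci.
class ExceptionHandlerProfile {
    private final java.util.Set<Integer> enteredHandlers = new java.util.HashSet<>();

    // Called from the runtime's exception-handler lookup, and also when we
    // deoptimize through the uncommon trap that replaced a handler.
    void markEntered(int handlerBci) {
        enteredHandlers.add(handlerBci);
    }

    // Consulted by C2 while parsing a handler: if the handler was never
    // entered, emit an uncommon trap instead of compiling the handler body.
    boolean shouldPrune(int handlerBci) {
        return !enteredHandlers.contains(handlerBci);
    }
}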

Success!

We sidestep the issue of the out-of-line call into which our objects escape by replacing the branch containing that call with an uncommon trap. As a result, most of our objects no longer escape:

JavaObject(31) allocation in: SegmentFactories::allocateSegment @ bci:100 (line 158)
  Reason: MergedWithObject[other=JavaObject(1) [ [ ]]    128  ConP  === 0  [[ 216 210 209 213 643 212 59 208 214 1178 167 211 151 503 165 508 166 138 ]]  #null]

Only the MemorySegment escapes, for another reason (which I have a potential fix in mind for as well).

For use cases like our little code sample, this cuts the number of bytes allocated in half, and potentially makes the execution twice as fast. See for instance the numbers from the benchmark included in the linked pull request:

Before:

Benchmark                                                      Mode  Cnt      Score     Error   Units
ResourceScopeCloseMin.confined_close                           avgt   30     10.458 ±   0.070   ns/op
ResourceScopeCloseMin.confined_close:gc.alloc.rate.norm        avgt   30    104.000 ±   0.001    B/op

After:

Benchmark                                                      Mode  Cnt      Score     Error   Units
ResourceScopeCloseMin.confined_close                           avgt   30      4.563 ±   0.043   ns/op
ResourceScopeCloseMin.confined_close:gc.alloc.rate.norm        avgt   30     56.000 ±   0.001    B/op

The great thing is that this doesn’t just help the FFM API, but potentially helps any code that uses exception handlers, whether through try-with-resources or a plain catch block. So, if you have any use cases on the hot paths of your code with untaken exception handlers, keep an eye out for performance improvements coming in JDK 22!