All versions of Ehcache 3 suffer from entry clustering, which can lead to severely degraded performance and is a potential denial-of-service attack vector.
The eviction policy's performance degrades as the cache size increases, due to how eviction is implemented in their customized hash table. On an M3 Max laptop, Ehcache takes over 18 minutes to replay an IBM SQL database workload trace, whereas Guava and Caffeine take only 15 seconds and 7 seconds, respectively. The runtime will be significantly worse on typical, virtualized cloud compute than on a high-performance laptop.
$ git clone https://github.com/ben-manes/caffeine.git
$ pushd caffeine ; git checkout 2fee4bdaff40a76c44cccbdfe954059ed6c3f2bc ; popd
$ wget https://github.com/moka-rs/cache-trace/raw/ef0a9de8cf0202a1f2fee186a0af497774b0f0a9/arc/DS1.lis.zst
$ ./caffeine/gradlew --project-dir caffeine simulator:run -q \
-Dcaffeine.simulator.files.format="arc" \
-Dcaffeine.simulator.maximum-size=4_000_000 \
-Dcaffeine.simulator.policies.0="product.Ehcache3" \
-Dcaffeine.simulator.files.paths.0="$PWD/DS1.lis.zst"
╔══════════════════╤══════════╤═══════════╤════════════╤════════════╤════════════╤════════════╤═══════════╗
║ Policy │ Hit Rate │ Miss Rate │ Hits │ Misses │ Requests │ Evictions │ Time ║
╠══════════════════╪══════════╪═══════════╪════════════╪════════════╪════════════╪════════════╪═══════════╣
║ product.Ehcache3 │ 28.01 % │ 71.99 % │ 12,240,263 │ 31,464,716 │ 43,704,979 │ 27,464,716 │ 18.24 min ║
╚══════════════════╧══════════╧═══════════╧════════════╧════════════╧════════════╧════════════╧═══════════╝
While the simulation is running, we can observe the problem using jstack and Java Mission Control.
$ jps | grep Simulator
29929 Simulator
$ jstack 29929 | grep \"Ehcache3Policy\" -A10
"Ehcache3Policy" #32 [43267] daemon prio=5 os_prio=31 cpu=188377.19ms elapsed=193.78s tid=0x0000000126018400 nid=43267 runnable [0x0000000172bae000]
java.lang.Thread.State: RUNNABLE
at org.ehcache.impl.internal.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3410)
at org.ehcache.impl.internal.concurrent.ConcurrentHashMap.getEvictionCandidate(ConcurrentHashMap.java:6493)
at org.ehcache.impl.internal.store.heap.SimpleBackend.getEvictionCandidate(SimpleBackend.java:53)
at org.ehcache.impl.internal.store.heap.OnHeapStore.evict(OnHeapStore.java:1579)
at org.ehcache.impl.internal.store.heap.OnHeapStore.enforceCapacity(OnHeapStore.java:1558)
at org.ehcache.impl.internal.store.heap.OnHeapStore.put(OnHeapStore.java:365)
at org.ehcache.core.Ehcache.doPut(Ehcache.java:93)
at org.ehcache.core.EhcacheBase.put(EhcacheBase.java:187)
at com.github.benmanes.caffeine.cache.simulator.policy.product.Ehcache3Policy.record(Ehcache3Policy.java:80)
This workload is not a problem for other caches such as Guava's LRU or Caffeine's W-TinyLFU.
$ ./caffeine/gradlew --project-dir caffeine simulator:run -q \
-Dcaffeine.simulator.files.format="arc" \
-Dcaffeine.simulator.maximum-size=4_000_000 \
-Dcaffeine.simulator.policies.0="product.Guava" \
-Dcaffeine.simulator.files.paths.0="$PWD/DS1.lis.zst"
╔═══════════════╤══════════╤═══════════╤═══════════╤════════════╤════════════╤════════════╤═════════╗
║ Policy │ Hit Rate │ Miss Rate │ Hits │ Misses │ Requests │ Evictions │ Time ║
╠═══════════════╪══════════╪═══════════╪═══════════╪════════════╪════════════╪════════════╪═════════╣
║ product.Guava │ 20.24 % │ 79.76 % │ 8,847,984 │ 34,856,995 │ 43,704,979 │ 30,856,995 │ 14.44 s ║
╚═══════════════╧══════════╧═══════════╧═══════════╧════════════╧════════════╧════════════╧═════════╝
$ ./caffeine/gradlew --project-dir caffeine simulator:run -q \
-Dcaffeine.simulator.files.format="arc" \
-Dcaffeine.simulator.maximum-size=4_000_000 \
-Dcaffeine.simulator.policies.0="product.Caffeine" \
-Dcaffeine.simulator.files.paths.0="$PWD/DS1.lis.zst"
╔══════════════════╤══════════╤═══════════╤════════════╤════════════╤════════════╤════════════╤═════════╗
║ Policy │ Hit Rate │ Miss Rate │ Hits │ Misses │ Requests │ Evictions │ Time ║
╠══════════════════╪══════════╪═══════════╪════════════╪════════════╪════════════╪════════════╪═════════╣
║ product.Caffeine │ 45.83 % │ 54.17 % │ 20,028,995 │ 23,675,984 │ 43,704,979 │ 19,675,984 │ 6.937 s ║
╚══════════════════╧══════════╧═══════════╧════════════╧════════════╧════════════╧════════════╧═════════╝
If we apply a simple patch that premixes the key's hashCode using a Murmur3 hash function, the execution time drops to 40 seconds.
$ wget https://github.com/user-attachments/files/18551302/ehcache.patch
$ pushd caffeine ; git apply ../ehcache.patch ; popd
$ ./caffeine/gradlew --project-dir caffeine simulator:run -q \
-Dcaffeine.simulator.files.format="arc" \
-Dcaffeine.simulator.maximum-size=4_000_000 \
-Dcaffeine.simulator.policies.1="product.Ehcache3" \
-Dcaffeine.simulator.files.paths.0="$PWD/DS1.lis.zst"
╔══════════════════╤══════════╤═══════════╤═══════════╤════════════╤════════════╤════════════╤═════════╗
║ Policy │ Hit Rate │ Miss Rate │ Hits │ Misses │ Requests │ Evictions │ Time ║
╠══════════════════╪══════════╪═══════════╪═══════════╪════════════╪════════════╪════════════╪═════════╣
║ product.Ehcache3 │ 21.97 % │ 78.03 % │ 9,600,600 │ 34,104,379 │ 43,704,979 │ 30,104,379 │ 39.94 s ║
╚══════════════════╧══════════╧═══════════╧═══════════╧════════════╧════════════╧════════════╧═════════╝
Note that Ehcache's LRU hit rate decreased from 28% to more closely match Guava's LRU. This is because a database workload has a strong MRU / LFU bias, so the poor distribution in the unpatched version happened to randomly select better victims. In a recency-biased workload, such as a data stream, that same selection bias would instead lower the hit rate.
Ehcache uses sampled eviction, aiming to provide a simple, concurrency-friendly LRU in which reads, writes, and eviction do not require a cache-wide exclusive lock. The eviction process selects a random starting point within the hash table, traverses forward until it has captured 8 entries, and selects the entry with the oldest last-accessed timestamp as the victim to remove. The time-to-live policy piggybacks on eviction, where expired entries remain hidden in the cache until they are either replaced on a cache miss or chosen for eviction by the LRU policy.
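A minimal sketch of that sampling strategy (not Ehcache's actual code; the array layout, Entry fields, and wrap-around handling are simplified assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of sampled LRU eviction over an open hash table; the
// Entry layout and array field are simplifications, not Ehcache's internals.
final class SampledEvictor {
  static final class Entry<K, V> {
    final K key;
    volatile V value;
    volatile long lastAccessTime;  // updated on each read

    Entry(K key, V value, long lastAccessTime) {
      this.key = key;
      this.value = value;
      this.lastAccessTime = lastAccessTime;
    }
  }

  /** Scans forward from a random index, wrapping around, and returns the oldest of 8 samples. */
  static <K, V> Entry<K, V> selectVictim(Entry<K, V>[] table) {
    int start = ThreadLocalRandom.current().nextInt(table.length);
    Entry<K, V> victim = null;
    int sampled = 0;

    for (int i = 0; (i < table.length) && (sampled < 8); i++) {
      Entry<K, V> candidate = table[(start + i) % table.length];
      if (candidate == null) {
        continue;  // an empty slot; clustering makes long runs of these likely
      }
      sampled++;
      if ((victim == null) || (candidate.lastAccessTime < victim.lastAccessTime)) {
        victim = candidate;
      }
    }
    return victim;
  }
}
```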
The developers modified Java's ConcurrentHashMap to add the eviction functionality. Ehcache 2.x forked Java 5's implementation, which uses coarse per-segment locks, where each segment is a traditional hash table of linked-list bins and the bin is selected by the key's hashCode after spreading it with a strong secondary hash function. Ehcache 3.x uses a fork of Java 8's rewritten implementation, which uses fine-grained per-bin locks and red-black tree bins, and selects the bin by the key's hashCode spread with a weak secondary hash function.
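For reference, the two styles of secondary hash look roughly like this; the snippets are paraphrased from the OpenJDK sources rather than copied from Ehcache's forks, so treat the exact constants as illustrative.

```java
// Approximate shapes of the two JDK secondary hashes discussed above
// (paraphrased from the OpenJDK sources; not Ehcache's forked code).
final class JdkSpreaders {
  // Segmented (pre-Java 8) ConcurrentHashMap: a Wang/Jenkins-style mixer with
  // good avalanche, so similar hash codes land in very different bins.
  static int legacySpread(int h) {
    h += (h << 15) ^ 0xffffcd7d;
    h ^= (h >>> 10);
    h += (h << 3);
    h ^= (h >>> 6);
    h += (h << 2) + (h << 14);
    return h ^ (h >>> 16);
  }

  // Java 8+ ConcurrentHashMap: a single XOR-shift that only folds the high
  // bits into the low bits, then clears the sign bit; it does not avalanche.
  static int spread(int h) {
    return (h ^ (h >>> 16)) & 0x7fffffff;
  }
}
```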
The Ehcache developers have been aware of this problem since 2015 and continue to deprioritize a resolution. Their reasoning is that users will not observe the issue if they use caches that are effectively unbounded and never evict, thereby avoiding this slow code path.
"Ehcache, at least by default, will indeed by unhappy if the cache is full and you keep adding to it, thus causing evictions to occur... The cache works super well but should never be full."
-- Henri Tremblay
There appears to be a misunderstanding that triggering this behavior requires a malicious actor, whereas the example scenario uses a production trace from an IBM customer running an ERP application.
"I would also disagree with any suggestion that additional static mixing is a trivial fix for the problem. My opinion is that static hash mixing alone cannot prevent a hash flooding attack. It just amounts to security through obscurity."
-- Chris Dennis
The issue has been observed by others in their production workloads, such as in this 2018 conference talk.
Typically an object's hashCode() is a light accumulation of its attributes. For example, photos in an album might be hashed by the author id, album id, and photo id. Since each of those may be a sequence number, and two of the attributes are identical for all photos in an album, the fast but weak hash function used by Java, 31 * result + e.hashCode(), will scramble only a few bits in the bottom half of the integer. The hash table selects its array location by a modulus; for example, mod 16 keeps only the least significant four bits. This leads to clustering, where entries concentrate in a subset of bins, making some significantly more populated than others due to collisions. When the hash table reaches its load factor (75% full), it resizes to maintain a more uniform distribution.
A traditional hash table mitigates clustering by applying a stronger bit mixer, such as Murmur3, to make the bit patterns of similar keys more distinct. The goal is that with a single bit flip, like the photo id increment, the output should appear effectively random. This is known as the avalanche effect, and tools like hash-prospector try to find hash functions with very low bias. We can see this in the output of their 2-round function (adjusted for Java by using an unsigned right shift), using Objects.hash(...) for the key's hashCode(), Integer.toBinaryString(int), and jshell to generate a few examples for a hash table with 1024 bins. The hash code changes only in its 8 least significant bits, yet there is no discernible pattern once spread.
(user, album, photo) | hash | spread | hash index | spread index |
---|---|---|---|---|
(1000, 1, 1) | 11110001111001100111 | 11110010101000010100110011110100 | 615 | 244 |
(1000, 1, 2) | 11110001111001101000 | 11100001010000100110101101011100 | 616 | 860 |
(1000, 2, 1) | 11110001111010000110 | 00011110100011000110011100110110 | 646 | 822 |
(1000, 2, 2) | 11110001111010000111 | 10001100001001111101101111101000 | 647 | 1000 |
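The following jshell-style sketch shows the procedure: compute the key's hash with Objects.hash, apply a strong 2-round mixer, and keep the low 10 bits for a 1024-bin table. The multiply constants are from one published hash-prospector 2-round candidate (lowbias32) and are an assumption, so the spread bits it prints will not necessarily match the table above bit-for-bit.

```java
// jshell sketch: hash, strong 2-round spread, and bin index for 1024 bins.
// The constants come from a published hash-prospector 2-round candidate
// (lowbias32); the article's exact mixer may differ.
int mix(int x) {
  x ^= x >>> 16;
  x *= 0x7feb352d;
  x ^= x >>> 15;
  x *= 0x846ca68b;
  x ^= x >>> 16;
  return x;
}

void show(int user, int album, int photo) {
  int hash = java.util.Objects.hash(user, album, photo);
  int mixed = mix(hash);
  System.out.printf("(%d, %d, %d) | %s | %s | %d | %d%n",
      user, album, photo,
      Integer.toBinaryString(hash), Integer.toBinaryString(mixed),
      hash & 1023, mixed & 1023);  // 1024 bins => keep the 10 low bits
}

show(1000, 1, 1);
show(1000, 1, 2);
show(1000, 2, 1);
show(1000, 2, 2);
```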
In a traditional hash flooding attack, the concern is that a non-uniform distribution of entries across the hash table degrades a lookup to O(n) when searching a linked list. To protect the static hash function from being defeated by specially crafted keys, it is often enhanced by using the application's startup system time (not wall clock) as an XOR seed to force random bit flips, e.g. spread(e.hashCode() ^ SEED).
Java's approach was instead to convert overpopulated linked-list bins into red-black tree bins so that the worst case is O(lg n), e.g. 1M entries require up to 20 comparisons and 1B requires up to 30 comparisons. This avoids any direct attack on the hash function. The goal of its mixing function, (h ^ (h >>> 16)) & 0x7fffffff, is to spread entropy from the higher bits down into the lower bits because the bin is selected using the least significant bits. It does not randomize poor hash distributions (no avalanche), which allows clustering, so key patterns like sequential numbers or low-entropy strings will map to nearby bins. Since the map is designed for fast key lookups this is perfectly fine: the difference between a more complex mixing function and a few extra comparisons is noise, so the authors may have viewed stronger mixing as unnecessary, or as easily misunderstood as security sensitive like in traditional hash tables.
In either design, iterating over the hash table requires visiting every bin. The load factor means that the array doubles when the map reaches 75% of the table length, in order to reduce collisions within the same bins. Said another way, a load factor of 0.75 means that at least 25% of the table is always available as empty slots. While less critical for tree-based bins, maintaining that free space contributes to near O(1) performance and, due to per-bin locking, higher write concurrency. Iterating therefore requires traversing the entire array and its bins, an O(n + array length) operation. This is neither a common operation nor the primary use case for a Map, especially a large one, so it is expected to have acceptable but not optimal performance. Clustering causes most slots to be empty, which has little impact on the overall time complexity or performance of ordinary Map operations.
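To make the resize arithmetic concrete for the example below, a quick jshell-style check (assuming the standard power-of-two sizing and 0.75 load factor):

```java
// Why 4M entries end up in an 8M-slot array: a power-of-two table doubles
// once it exceeds 75% occupancy.
int entries = 4_000_000;
int smaller = 1 << 22;                          // 4,194,304 slots
System.out.println(entries > smaller * 3 / 4);  // true: 4,000,000 > 3,145,728, so it doubles
System.out.println(1 << 23);                    // 8,388,608 slots (2^23)
```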
The addition of eviction sampling invalidates the above design assumptions. In the provided example, at 4M entries the array length is 8M (2^23) and the non-uniform spreader causes only 26% of those slots to be occupied bins (2,211,582 bins and 6,177,026 null slots). To perform an eviction, a uniformly random start index is chosen in the array and the traverser then scans for bins, comparing the first 8 entries found. As key hash codes are allowed to cluster, the bins tend to form in groups within the table. In one case when attaching the debugger, the start index was 1,364,883 and the traverser had to scan 967,573 null array slots before it found the next bin. Since the hash table has many large gaps where a search might randomly begin, the better-distributed entries are likely to be found first, leading to their bins being emptied and the gaps widening. The eviction effectively defragments the array, making the clustering worse, and skews the policy towards MRU because older items are protected by their neighbors.
One could argue that scanning contiguous memory is well optimized by hardware, so eviction may still be fast despite not being O(1). The problem is an over-reliance on the hardware prefetcher and cache. An 8M-slot array requires either 32MB or 64MB, depending on whether the garbage collector uses compressed (32-bit) pointers, as it does for heaps smaller than 32GB. On a modern high-end machine this might fit into the L3 cache if lucky, but that is very unlikely given the other application data, such as the entries themselves. For smaller workloads the impact is reduced because the CPU cache can absorb it in a simulated environment; in a cloud environment this is less likely to help due to competing work. The result is many cache misses and slower execution times.
Ideally, the prefetcher would assist when it detects a contiguous scan. A sequential lookahead might range from 2-8 KB (32-128 cache lines), and for a scan distance of ~1M slots that means 4MB or 8MB of memory. Even if it fetches aggressively, since the work is a simple null check the traversal is limited by the DRAM latency of 50-100ns. That calculates as (scan size / prefetch size) * (DRAM latency), resulting in a range of roughly 25,000ns (25us) to 400,000ns (0.4ms) per eviction scan. The Linux thread scheduler uses an adaptive time slice of 1-6ms, allowing for approximately 2-240 evictions per thread before it is suspended. For most array traversals the compiler optimizes by loop unrolling, auto-vectorizing, etc., which can improve prefetching and enable larger lookahead sizes.
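A quick back-of-envelope check of those numbers, using the assumed figures above rather than measurements:

```java
// Back-of-envelope estimate of one long eviction scan, using the assumptions
// above: ~1M null slots scanned, 4 or 8 bytes per slot, 2-8 KB of prefetch
// lookahead, and 50-100ns DRAM latency. Not a measurement.
public class ScanEstimate {
  public static void main(String[] args) {
    long slots = 1 << 20;  // ~1M array slots in the scanned gap

    double bestNs = estimateNs(slots, 4, 8 * 1024, 50);    // compressed oops, deep lookahead
    double worstNs = estimateNs(slots, 8, 2 * 1024, 100);  // 64-bit references, shallow lookahead

    System.out.printf("best ~%.0f ns, worst ~%.0f ns%n", bestNs, worstNs);
    // prints: best ~25600 ns, worst ~409600 ns
  }

  static double estimateNs(long slots, int bytesPerSlot, int prefetchBytes, int dramLatencyNs) {
    double scanBytes = (double) slots * bytesPerSlot;
    return (scanBytes / prefetchBytes) * dramLatencyNs;
  }
}
```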
Unfortunately, optimization by the hardware and compiler is hindered by the concurrent nature of the data structure. This introduces memory barriers, such as volatile reads, which prevent reordering of memory accesses, restrict prefetching, and break the code shape the compiler needs to apply its optimizations. Since the array is scanned sequentially from random offsets, the hardware cache is unlikely to see much reuse across concurrent evictions, as they search different areas, and the CPU cache may aggressively evict those lines when it detects sequential streaming in an effort to be scan resistant.
Getting back to the original problem, we saw that clustering caused long traversals when trying to find 8 candidates to evaluate for eviction. There are two obvious solutions that would not require extra memory or a major change.
The sampled eviction assumes that entries are uniformly distributed, as in a traditional hash table, so that its random starting point can find non-empty bins quickly. A stronger hash function restores that characteristic and the runtime drops dramatically in real, non-malicious workloads. While this would ideally be an internal change, users can premix their hash codes as a workaround.
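For example, a caller could wrap its keys so the cache's weak spreader sees an already well-mixed hashCode. The PremixedKey wrapper below is hypothetical and uses Murmur3's 32-bit finalizer as the mixer.

```java
// Hypothetical user-side workaround: wrap the real key and premix its hashCode
// with a strong finalizer (Murmur3 fmix32 shown) so the cache's weak spreader
// receives a uniform distribution. Equality still delegates to the real key.
record PremixedKey<K>(K key) {
  @Override
  public int hashCode() {
    int h = key.hashCode();
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }
}

// Usage (cache, photoKey, and Photo are hypothetical):
//   cache.put(new PremixedKey<>(photoKey), photo);
//   Photo photo = cache.get(new PremixedKey<>(photoKey));
```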
Another approach would be to reuse the iterator across evictions, relying on its weakly consistent behavior. This would avoid random starts inside the large gaps by galloping through each gap once and spending more time within the clusters. It would require an internal change to serialize the eviction process and use a cyclic iterator instead of the current concurrent eviction approach. This approach still suffers the latency cost of crossing a large gap, but it reduces how often that penalty is incurred.
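A rough sketch of that idea, using a plain ConcurrentHashMap's weakly consistent iterator in place of Ehcache's internal traverser (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a serialized, cyclic eviction sampler that reuses a
// weakly consistent iterator across evictions instead of restarting at a
// random index each time.
final class CyclicSampler<K, V> {
  private final ConcurrentHashMap<K, V> map;
  private Iterator<Map.Entry<K, V>> cursor;

  CyclicSampler(ConcurrentHashMap<K, V> map) {
    this.map = map;
    this.cursor = map.entrySet().iterator();
  }

  /** Returns up to sampleSize entries, wrapping to a fresh iterator when exhausted. */
  synchronized List<Map.Entry<K, V>> sample(int sampleSize) {
    List<Map.Entry<K, V>> sample = new ArrayList<>(sampleSize);
    while (sample.size() < sampleSize) {
      if (!cursor.hasNext()) {
        cursor = map.entrySet().iterator();  // wrap around and keep galloping
        if (!cursor.hasNext()) {
          break;                             // the map is empty
        }
      }
      sample.add(cursor.next());
    }
    return sample;
  }
}
```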
Combining the hash mixer and the cyclic eviction iterator would give the best results. While the iterator reuse does not depend on the hash quality, it benefits from the uniform mixer removing the long spans it must traverse. And if a collision attack against the hash function were successful, its impact would be minimal, making the effort not worthwhile.