As part of my “real job”, I’m developing a Clojure-based app that does a significant amount of number crunching. The inner loop continuously ref-sets random portions of a large array of refs (when I say “large”, I mean that I can plausibly fire off a 50 GB heap). I had a tough time getting it performant, and it’s an interesting enough story that I thought I’d relate it here.
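For a sense of the allocation pattern, here's a rough Java analog of that inner loop - an `AtomicReferenceArray` standing in for the array of refs, immutable `Cell` values standing in for Clojure data structures, and toy sizes rather than the real ones:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReferenceArray;

public class ChurnSketch {
    // Each slot holds an immutable value; an update replaces the whole
    // object, leaving the previous one behind as garbage.
    record Cell(double[] payload) {}

    static AtomicReferenceArray<Cell> churn(int slots, int iters) {
        AtomicReferenceArray<Cell> array = new AtomicReferenceArray<>(slots);
        for (int i = 0; i < slots; i++) {
            array.set(i, new Cell(new double[32]));
        }
        // Inner loop: pick random slots and swap in brand-new objects.
        for (int n = 0; n < iters; n++) {
            int i = ThreadLocalRandom.current().nextInt(slots);
            array.set(i, new Cell(new double[32])); // old Cell is now garbage
        }
        return array;
    }

    public static void main(String[] args) {
        // Toy sizes; the real array was big enough to need a ~50 GB heap.
        churn(10_000, 100_000);
    }
}
```

The key property is that every update allocates a fresh object and orphans an old one, so the garbage rate tracks the mutation rate even though the live set stays roughly constant.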
After the standard futzing with algorithms and data structures, the thing that ended up holding me back was excessive time in GC. I started with the “throughput” collector (hey, I’m doing number crunching, I don’t have real-time requirements, throughput is awesome!). Somewhat surprisingly, I saw worse and worse performance as my app ran, ending in a kind of sawtoothed purgatory of GC. What little information I found about Clojure-specific GC tuning uniformly showed using the CMS / low-latency / concurrent collector as a good choice. Curious, I switched - but that just resulted in slow-rising, slow-falling memory consumption patterns, with the pauses coinciding with pure serial collections (and still, not enough work getting done).
This caused me to go a bit down the rabbit hole of garbage collection lore - for simplicity I’ve condensed this somewhat. The JVM divides its heap into a permanent generation (ignorable here - mostly class definitions and such), a few pools making up the young generation, and a tenured generation of long-lived data. The young generation is collected via a parallel algorithm that isn’t really that tunable, and the split between young and tenured (and the various young sub-pools) will be auto-tuned if you let it. The real mojo happens in the tenured pool, where you have the choice of a CMS / low-latency / concurrent collector or a parallel throughput collector, with various options for each. The trigger that sets off a CMS collection is theoretically auto-tuned as well, and there are “ergonomics” for the parallel collector, but performance was crappy enough as things adjusted themselves that I never got the chance to examine the long-run, automated performance. I had to figure out most of this via experimentation and reading lots of GC logs.
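If you want to watch these pools from inside a running JVM rather than from GC logs, the standard java.lang.management API exposes them. A minimal sketch (pool names vary by collector; with ParNew + CMS you'd see names like "Par Eden Space" and "CMS Old Gen"):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class ListPools {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // getMax() may be -1 if the pool's maximum is undefined.
            System.out.printf("%-25s %-9s used=%d max=%d%n",
                    pool.getName(),
                    pool.getType(),          // HEAP or NON_HEAP
                    pool.getUsage().getUsed(),
                    pool.getUsage().getMax());
        }
    }
}
```

Polling the tenured pool's usage this way is a quick sanity check on whether objects are being promoted faster than the concurrent collector can reclaim them.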
What I discovered was that because Clojure data structures are immutable, and I had enough of them with a low enough mutation rate, the majority were advancing to the JVM’s tenured pool. When a ref was set to a new object, the old one lingered in the tenured pool until it was collected, and the new one eventually migrated there too. This is what caused the slow rise in memory usage under the stock CMS collector - objects weren’t being collected until a threshold of the tenured generation was filled, and by the time that threshold was hit, it was close enough to filling the tenured pool entirely that it always blew through and caused a full, serial, stop-the-world collection. Keeping these objects in the “young” pool would have been inefficient too, since the majority at any given time were still live and being copied back and forth (and the ones that happened to survive were just thrown away later). I needed to set the collector to kick in earlier and hoover up the garbage accreting in the tenured pool.
However, even with the CMS collector set to run more or less continuously, I was still allocating objects faster than they could be collected, because the young-generation collector was promoting almost all of my objects. The solution ended up being to keep object generation slow enough that it never filled the tenured generation and caused a slow serial collection. There were a few interlocking pieces to this. First, I set the number of CMS-collector threads higher than usual (4 in my case) to mark the garbage faster during a collection. Second, I hard-coded the threshold at which the CMS collector kicks in, and actually set it relatively high, in order to have control over when GC triggered. Third, I worked out the maximum amount of time the app could mutate before causing a serial GC or triggering an uncontrolled CMS cycle, and paused at the end of that period to manually trigger a CMS run and block until it was complete, by calling System.gc() and telling the JVM to associate that call with the CMS collector. For long-term operation, when I didn’t care about maximum throughput, just stability, I simply ran the object-generating portions of my app on fewer threads (I could just as easily have called Thread.sleep() intermittently), and set the CMS threshold low enough that it was running continuously.
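That pause-and-collect rhythm can be sketched in plain Java. The budget and workload below are made-up placeholders (the real numbers came from the GC logs), and the System.gc() behavior described in the comment depends on -XX:+ExplicitGCInvokesConcurrent being set:

```java
public class PacedWorker {
    // Hypothetical budget: how long the app can mutate before garbage in the
    // tenured pool risks forcing a serial collection. Toy value for testing.
    static final long MUTATE_BUDGET_MS = 100;
    static final int CYCLES = 3;

    static long run() {
        long crunched = 0;
        for (int cycle = 0; cycle < CYCLES; cycle++) {
            long deadline = System.currentTimeMillis() + MUTATE_BUDGET_MS;
            while (System.currentTimeMillis() < deadline) {
                crunched += doNumberCrunching();
            }
            // With -XX:+ExplicitGCInvokesConcurrent this kicks off a CMS
            // cycle and blocks the calling thread until it finishes, while
            // other threads keep running; without the flag it would be a
            // full stop-the-world collection.
            System.gc();
        }
        return crunched;
    }

    // Stand-in for allocate-heavy number crunching.
    static long doNumberCrunching() {
        double[] scratch = new double[64];
        for (int i = 0; i < scratch.length; i++) scratch[i] = i * 0.5;
        return scratch.length;
    }

    public static void main(String[] args) {
        run();
    }
}
```

In the real app the loop would run indefinitely, with the budget tuned so the explicit collection always lands before the tenured pool's occupancy trigger would fire on its own.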
Calling System.gc() is nearly always a smell, but in my case I found it was the only way to ensure I was operating at max throughput without excessive pausing. My allocation pattern simply isn’t something the JVM deals well with - had I been mutating objects directly rather than generating new ones, altering the objects in a way that leveraged Clojure’s structural sharing, or operating at a higher CPU-to-heap ratio, I could likely have avoided calling it.
My JVM options ended up being as follows (some may be superfluous; I wasn’t able to establish whether CMSIncrementalDutyCycle or CMSIncrementalSafetyFactor were necessary without the incremental-CMS option turned on):
-d64 -da -dsa \
-XX:+PrintGCDetails \
-Xloggc:gc.log \
-Xms50g -Xmx50g \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ParallelCMSThreads=4 \
-XX:+ExplicitGCInvokesConcurrent \
-XX:+CMSParallelRemarkEnabled \
-XX:-CMSIncrementalPacing \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSIncrementalDutyCycle=100 \
-XX:CMSInitiatingOccupancyFraction=90 \
-XX:CMSIncrementalSafetyFactor=10 \
-XX:+CMSClassUnloadingEnabled -XX:+DoEscapeAnalysis