
ObjectMapper default heap consumption increased significantly from 2.13.x to 2.14.0 #3665

Closed
mhalbritter opened this issue Nov 15, 2022 · 46 comments

@mhalbritter

mhalbritter commented Nov 15, 2022

Hello,

I'm currently looking at heap consumption when using native image with a Spring Boot 3 application. Spring Boot 3.0 SNAPSHOT uses Jackson 2.14.0-rc3. I noticed that with the upgrade from 2.13.4.2 to 2.14.0 the application uses more heap memory, which I tracked down to this PR. It changed the cache implementation from ConcurrentHashMap to PrivateMaxEntriesMap, which in turn preallocates a lot of AtomicReference instances (they all appear to be empty).

I created a reproducer here -- it creates an empty ObjectMapper and then dumps the heap. There is more information, with screenshots, in the README.

With Jackson 2.13.4.2 it uses 567 B for the LRUMap; with Jackson 2.14.0 it uses 507,320 B. The bad part is that most of these caches are per ObjectMapper -- create more of them and the heap usage grows accordingly.

As we're trying to optimize the memory consumption of native image applications, this is quite a bummer and we hope you can help us out here.

Thanks!

@mhalbritter mhalbritter added the to-evaluate (Issue that has been received but not yet evaluated) label Nov 15, 2022
@mhalbritter
Author

mhalbritter commented Nov 15, 2022

You can see the problem (though not as exaggerated) on a regular JVM. I created 100 ObjectMappers and took a heap dump.

This is 2.13.4.2:

[screenshot: heap dominators with 100 ObjectMappers on 2.13.4.2]

The usage is negligible.

This is 2.14.0:

[screenshot: heap dominators with 100 ObjectMappers on 2.14.0]

The LRUMap instances use 82.1% of the heap.

@GedMarc

GedMarc commented Nov 15, 2022

From my understanding, ObjectMappers should be singletons as much as possible,
and the ObjectReaders and ObjectWriters are the thread-safe objects that can be built from them --

so this would be a bad way of using the ObjectMapper.

We use native images throughout and don't experience anything like this when it's used properly.

@mhalbritter
Author

mhalbritter commented Nov 15, 2022

I created 100 ObjectMappers to show the extent of this problem. As you'll see in https://github.com/mhalbritter/graalvm-jackson, it is even observable with only 1 ObjectMapper. 2.14.0 uses ~880x more heap for an empty ObjectMapper compared to 2.13.4.2.

Spring Boot in my WebMVC sample app uses 4 of them. Not sure if we can get this down to 1.

@GedMarc

GedMarc commented Nov 15, 2022

Hmmm - @cowtowncoder this one needs your eyes

@sdeleuze

I agree with @mhalbritter; it looks like the heap allocation regression in 2.14 is pretty huge, and I hope that something can be refined on the Jackson side.

@cowtowncoder Looking forward to your feedback, we still have the opportunity to update Jackson before Spring Boot 3 GA is released (expected next week).

@cowtowncoder
Member

cowtowncoder commented Nov 16, 2022

Uh. I did have some concerns about the apparent complexity of the improved cache choice, relative to its benefits. The cache baseline should be solid from what I recall, but @pjfanning worked on it most closely.

This is unfortunate of course. But just to make sure: that's about 0.5 MB per ObjectMapper, without any use? (basic data structures).

Given the timing, I suspect the only potentially easyish win would be changing the initial size? And even that would require a 2.14.1 release.

@cowtowncoder cowtowncoder changed the title from "Heap consumption regression from 2.13.4.2 to 2.14.0" to "ObjectMapper default heap consumption increased significantly from 2.13.x to 2.14.0" Nov 16, 2022
@cowtowncoder
Member

OK, it looks like there are just 4 cache instances by default (plus one optional):

  1. JacksonAnnotationIntrospector uses it to keep track of annotations that do (or do not) have JacksonAnnotationsInside. This definitely does not need a super-nice cache.
  2. TypeFactory caches types; this is used relatively heavily during construction of (de)serializers.
  3. SerializerCache is probably the most important cache, since many serializers are accessed dynamically.
  4. DeserializerCache is somewhat important, but deserializers are mostly accessed statically beyond root instances.
  5. (optional) RootNameLookup -- mostly used with XML, sometimes JSON.

Initial sizes of all these seem to be between 16 and 65, nothing drastic unless I am missing something.

It would be good to get more numbers, if possible, on sizes of individual LRUMap instances.
Unfortunately it doesn't look like the initial sizes can be reduced a lot (or, put another way, reducing them would not have a humongous effect).
It'd also be good to have an estimate of a single LRUMap with a specific initial size as a baseline, to consider what exactly to do. The whole PrivateMaxEntriesMap implementation seems very much overpowered for Jackson's use case... and it'd be good to see if some parts could be removed.

@ben-manes

Sorry, I haven't looked at that caching logic for about 7 years (or 12 since it was written). At the time the emphasis was concurrency for server-class usages, when a concurrent LRU algorithm was a novel, hard problem. I'm sure those AtomicReferences could be replaced by an AtomicReferenceArray to slim it down. It uses striped lanes to reduce thread contention, but you could greatly simplify that by knowing what is actually needed here, e.g. a single array is probably fine.

I later rewrote this to dynamically stripe lightweight, MPSC ring buffers. That is more memory-conscious by being slimmer and demand-based.

@ben-manes

@bclozel may be interested, as he similarly rewrote that original code for use by Spring. There are versions floating around in Groovy, the MSSQL JDBC driver, etc. It should be simple to streamline it to be more tuned to this project's needs.

@cowtowncoder
Member

That sounds useful, @ben-manes, thank you for sharing. I wonder how easy it'd be to measure -- of course it's not rocket science to just count fields and do it manually, but there are probably tools that automate this via introspection, to at least give static minimums or such.

@ben-manes

I used OpenJDK's Java Object Layout (JOL) on the current code to verify the padding rules that protect against false sharing. For a rough footprint estimate, jamm is good enough to see whether things line up with expectations. All of these are just best guesses due to JVM internal complexity (JEP-8249196).
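
For reference, a minimal sketch of that kind of JOL-based measurement (assumes the org.openjdk.jol:jol-core dependency is on the classpath; the class name here is made up, not part of Jackson):

import com.fasterxml.jackson.databind.ObjectMapper;
import org.openjdk.jol.info.GraphLayout;

public class FootprintEstimate {
    public static void main(String[] args) {
        ObjectMapper mapper = new ObjectMapper();
        // Walks the reachable object graph and sums shallow instance sizes --
        // a best-effort estimate, subject to the JVM-internals caveats above.
        GraphLayout layout = GraphLayout.parseInstance(mapper);
        System.out.println(layout.toFootprint());                 // per-class counts and sizes
        System.out.println("total bytes: " + layout.totalSize()); // rough total footprint
    }
}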

@mhalbritter
Author

mhalbritter commented Nov 16, 2022

This is unfortunate of course. But just to make sure: that's about 0.5 MB per ObjectMapper, without any use? (basic data structures).

VisualVM tells me that 1 ObjectMapper on the JVM uses 132,352 B of heap. This is a new ObjectMapper() without doing anything more with it. However, all LRUMap instances together use 213,640 B of heap.

VisualVM tells me that 1 ObjectMapper in a native image uses 311,944 B of heap. However, all LRUMap instances together use 507,320 B of heap.

Here are all the LRUMap instances on the JVM, with the fields that hold a reference to them:

[screenshot: LRUMap instances and the fields referencing them]

@mhalbritter
Author

They are all the same size, because the AtomicReference[][] readBuffers field, which consumes most of the heap, is sized this way:

static final int NCPU = Runtime.getRuntime().availableProcessors(); // 9 on my machine
static final int NUMBER_OF_READ_BUFFERS = ceilingNextPowerOfTwo(NCPU); // 16 on my machine
static final int READ_BUFFER_THRESHOLD = 32;
static final int READ_BUFFER_DRAIN_THRESHOLD = 2 * READ_BUFFER_THRESHOLD; // 64
static final int READ_BUFFER_SIZE = 2 * READ_BUFFER_DRAIN_THRESHOLD; // 128

readBuffers = new AtomicReference[NUMBER_OF_READ_BUFFERS][READ_BUFFER_SIZE];

which makes this an AtomicReference[16][128] array on my machine (it depends on the number of CPU cores). This is not influenced by initialCapacity or maximumCapacity as far as I can see.

@ben-manes

Yes, concurrency is unrelated to the maximum cache size. At that time a synchronized LRU was standard and perfectly fine in most cases, since single-core machines were still common. Those who used this concurrent cache (ConcurrentLinkedHashMap) were servers that had clear needs (4-1024 cores). This led to porting it into Guava (CacheBuilder), where we defaulted to 4 for Google's needs, with a setting to adjust it manually (concurrencyLevel). For a more general, modern cache (Caffeine), it now dynamically adjusts the size by need, so that low concurrency has a low footprint while high concurrency benefits by trading some extra space. However, CLHM is simpler to fork and embed, so decisions that aged poorly get copied over.

For Jackson a modest size is enough, thanks to the lossy buffers and modest needs, so slimming this down should be simple and effective. Or, if you prefer the advanced solution, copy that code and make the minor replacement edits.

@pjfanning
Member

pjfanning commented Nov 16, 2022

The jackson-databind use case is to actually implement LRU behavior and not clear the whole cache when it fills (unlike the 2.13-and-before implementation).

The use of PrivateMaxEntriesMap (a renamed fork of https://github.com/ben-manes/concurrentlinkedhashmap) is to be able to cap the number of entries in the map and to evict the least-recently-used entries when that limit is reached.

We could certainly try to reduce the AtomicReference[][] readBuffers size, maybe capping it at NUMBER_OF_READ_BUFFERS=4 and READ_BUFFER_THRESHOLD=4. We should set up some load-test scenarios to see if those numbers are OK. Throwing unexpected exceptions is the main worry, but we would also like to avoid poor performance in any usage scenario. We could tweak these numbers to see if they have a major impact.
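
A rough sketch of what that cap might look like, mirroring the constants quoted earlier in this thread (the values are the untested suggestion above, not a committed change):

// Hypothetical adjustment: cap the lane count regardless of core count and
// shrink the per-lane buffer. Values are untested suggestions, not Jackson code.
static final int NCPU = Runtime.getRuntime().availableProcessors();
static final int NUMBER_OF_READ_BUFFERS = Math.min(4, ceilingNextPowerOfTwo(NCPU));
static final int READ_BUFFER_THRESHOLD = 4;
static final int READ_BUFFER_DRAIN_THRESHOLD = 2 * READ_BUFFER_THRESHOLD; // 8
static final int READ_BUFFER_SIZE = 2 * READ_BUFFER_DRAIN_THRESHOLD;      // 16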

@bclozel

bclozel commented Nov 16, 2022

I've locally switched AtomicReference[] to AtomicReferenceArray and I'm seeing a nice improvement in heap consumption. In theory this shouldn't change runtime performance much, but I'm seeing a slight degradation (still within the error margin).

Right now Spring's implementation is not the dominator here, but we might still revisit the read-buffer sizes or how the implementation works.
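
For illustration, the kind of swap being described -- a simplified sketch, not the actual Spring or Jackson code (READ_BUFFER_SIZE as in the constants quoted earlier):

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Before: one AtomicReference object allocated per slot (each with its own object header).
AtomicReference<Object>[] buffer = new AtomicReference[READ_BUFFER_SIZE];
for (int i = 0; i < buffer.length; i++) {
    buffer[i] = new AtomicReference<>();
}

// After: a single AtomicReferenceArray -- one object plus one backing array,
// with the same volatile get/set semantics per slot.
AtomicReferenceArray<Object> slimBuffer = new AtomicReferenceArray<>(READ_BUFFER_SIZE);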

@cowtowncoder cowtowncoder added the 2.14 label and removed the to-evaluate (Issue that has been received but not yet evaluated) label Nov 16, 2022
@cowtowncoder
Member

@pjfanning (I reworded your description a bit) -- +1 for capping the size of those arrays so that they would not grow linearly no matter how many cores there are.

@bclozel Use of AtomicReferenceArray sounds good as well.

Ultimately I think memory usage per cache should be significantly lower than the initial implementation. Load testing would be nice, but it is worth noting that we didn't perf-test the new implementation vs the old one either...

@ben-manes

In theory this shouldn't change things much for runtime performance but I'm seeing a degradation

This is due to CPU cache-line sharing, which causes coherency bottlenecks as distinct fields within a block are invalidated together. As the thread distribution is not perfect, some producers will select the same lane. In a perfect world each slot would be fully padded so that once an index is acquired the two producers do not impact each other. That of course is very wasteful (64-byte blocks). The use of object references was a middle ground for partial padding. In these Guava benchmarks from one of my attempts to revisit using ring buffers (it uses CLQ as a quick-and-dirty first pass), the read throughput ranged from 240M (64b) / 256M (256b) / 285M (1kb) ops/s. While the 45M spread sounds scary, in practice cache reads won't be a bottleneck if the cache can support ~60M ops/s, because real application work dominates, so further gains won't impact system performance.

We should set up some load test scenarios to see if those numbers are ok.

Please see this JMH benchmark as a reference.

@pjfanning
Member

16 * 128 * 4 (4 bytes per Java object pointer) means that the AtomicReference[16][128] readBuffers should use 8,192 bytes (8 KB). This seems a long way short of the totals being quoted for each ObjectMapper.

@ben-manes

ben-manes commented Nov 16, 2022

@pjfanning you forgot to account for the AtomicReference instances themselves. Those range from 16-32 bytes depending on JVM optimizations (compressed references, compressed class pointers, byte alignment). In that case you might expect to see it as 16 * 128 * (4 + 32) = 73,728 bytes per cache, or ~300 KB if each mapper uses 4 caches. Using Java Object Layout:

# Running 64-bit HotSpot VM.
# Using compressed oop with 3-bit shift.
# Using compressed klass with 3-bit shift.
# WARNING | Compressed references base/shifts are guessed by the experiment!
# WARNING | Therefore, computed addresses are just guesses, and ARE NOT RELIABLE.
# WARNING | Make sure to attach Serviceability Agent to get the reliable addresses.
# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

... (many variations) ...

***** Hotspot Layout Simulation (JDK 15, 64-bit model, compressed class pointers, 16-byte aligned)
java.util.concurrent.atomic.AtomicReference object internals:
OFF  SZ               TYPE DESCRIPTION               VALUE
  0   8                    (object header: mark)     N/A
  8   4                    (object header: class)    N/A
 12   4                    (alignment/padding gap)
 16   8   java.lang.Object AtomicReference.value     N/A
 24   8                    (object alignment gap)
Instance size: 32 bytes
Space losses: 4 bytes internal + 8 bytes external = 12 bytes total

@cowtowncoder
Member

300 KB sounds in the ballpark of what was observed.

AtomicReferenceArray could improve things, but it also seems like limiting the actual size of the 2-dimensional array (which @pjfanning suggested, I think) would make sense.

I wonder how much we could cut things here -- to me some added overhead is fine; ObjectMapper instances will accumulate other baggage when actually used (for cached Serializers and Deserializers in particular).
But getting from ~300 KB to less than 100 KB would seem like a worthwhile goal to me.

@yawkat
Member

yawkat commented Nov 17, 2022

I don't think it has to be a 2-dimensional array either; we could save some space by flattening it.

@cowtowncoder
Member

cowtowncoder commented Nov 17, 2022

@yawkat Going to AtomicReferenceArray would remove one dimension, although I guess it'd be possible to flatten both dimensions into a single instance. Not sure if that'd be bad with respect to false sharing or other caching aspects, but it'd probably eliminate quite a bit of usage. And in some ways locality could help (but this is not my area of expertise, TBH).

@ben-manes

To be fair, since Jackson's caches are likely tiny, a Clock (aka Second Chance) policy would be adequate if you wanted a much simpler rewrite. That was my original stop-gap when starting to explore a general approach. It is simply a FIFO with a mark bit that is set on a read and reset during an O(n) eviction scan. That has similar hit rates to an LRU with lock-free reads, but writes block on a shared lock and large caches have GC-like pauses due to scanning. For a small cache, like I imagine these to be, anything simple (even just FIFO or random) is likely good enough. I think it would be perfectly fine to write a very simple, fast, low-overhead cache that is specific to your needs.
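
For concreteness, a minimal single-threaded sketch of the Clock / second-chance idea described above (purely illustrative; names are hypothetical and this is not the Jackson implementation):

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

final class SecondChanceCache<K, V> {
    private static final class Node<K, V> {
        final K key;
        V value;
        boolean referenced;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private final int maxEntries;
    private final Map<K, Node<K, V>> index = new HashMap<>();
    private final ArrayDeque<Node<K, V>> fifo = new ArrayDeque<>();

    SecondChanceCache(int maxEntries) { this.maxEntries = maxEntries; }

    V get(K key) {
        Node<K, V> node = index.get(key);
        if (node == null) {
            return null;
        }
        node.referenced = true;               // mark bit set on read
        return node.value;
    }

    void put(K key, V value) {
        Node<K, V> existing = index.get(key);
        if (existing != null) {
            existing.value = value;
            return;
        }
        while (index.size() >= maxEntries) {  // O(n) eviction scan in the worst case
            Node<K, V> candidate = fifo.poll();
            if (candidate.referenced) {
                candidate.referenced = false; // clear mark, give it a second chance
                fifo.offer(candidate);
            } else {
                index.remove(candidate.key);  // evict the first unmarked entry
            }
        }
        Node<K, V> node = new Node<>(key, value);
        index.put(key, node);
        fifo.offer(node);
    }
}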

@cowtowncoder
Member

I have somewhat ambivalent feelings about this. On one hand, yes, something use-case specific would be good. On the other hand, using something as close to off-the-shelf as possible is great.
So in that sense, incremental changes, for 2.14.x at least, to address the unexpectedly sizable memory usage increase, would probably be best.
For 2.15 we could do something bigger, assuming we have someone with time and interest.

@cowtowncoder
Member

OK, at this point I am optimistic about being able to optimize the LRUMap implementation to have a much lower memory footprint, so there's no need to revert the change. Better cache pluggability/configuration would still be nice, but that would require an API change and could go in at the earliest in 2.15.

I wonder if using JOL on the main ObjectMapper would help us guard against significant unintended memory-usage increases from future changes -- just constructing an instance and testing its direct size. For bonus points, do the same after one simple serialization and deserialization.
The idea being that the limit for the maximum size (on CI?) would be kept relatively tight and increased along with additional fields, making it easier to notice increases that are higher than expected (including surprises where none were expected, etc.).
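
A rough sketch of the kind of guard test being described (hypothetical names and threshold; the test actually added later was MapperFootprintTest, see #3675):

import static org.junit.jupiter.api.Assertions.assertTrue;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.Test;
import org.openjdk.jol.info.GraphLayout;

public class MapperFootprintSketchTest {
    // Placeholder ceiling -- would be kept just above the expected footprint
    // and bumped deliberately when new fields are added.
    private static final long MAX_MAPPER_BYTES = 100_000L;

    @Test
    public void freshMapperStaysUnderBudget() {
        long size = GraphLayout.parseInstance(new ObjectMapper()).totalSize();
        assertTrue(size < MAX_MAPPER_BYTES,
                "ObjectMapper footprint grew unexpectedly: " + size + " bytes");
    }
}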

@yawkat
Member

yawkat commented Nov 20, 2022

@cowtowncoder I tried to add exactly such a test (though the limit is fairly lax), but ran into these issues with JOL on CI: #3675 (comment)

@cowtowncoder
Member

Fixed for 2.14.1.

@sdeleuze

Awesome, thanks a lot for this quick update! Looking forward to leveraging Jackson 2.14.1 as soon as it's available. If the release could happen by tomorrow (Nov 22), that would, I think, allow us to ship Spring Boot 3 GA with the memory fix.

@mhalbritter
Author

Nice, thanks a lot!

@cowtowncoder
Member

I'll see if I can find time today to release a patch version; I cannot promise it but will try my best.

@cowtowncoder
Member

Ta-dah! I did it. The 2.14.1 release is mostly done; the last couple of artifacts are on their way to Maven Central (the Scala module may take a bit longer). Please LMK if you find issues.

sdeleuze added a commit to sdeleuze/spring-framework that referenced this issue Nov 22, 2022
Fix an important memory consumption regression, see
FasterXML/jackson-databind#3665
for more details.

Closes spring-projectsgh-29539
srowen pushed a commit to apache/spark that referenced this issue Nov 23, 2022
### What changes were proposed in this pull request?
This PR aims to upgrade `Jackson`-related dependencies to 2.14.1

### Why are the changes needed?
This version includes an optimization of heap memory usage for Jackson 2.14.x:

- FasterXML/jackson-databind#3665

The full release notes are as follows:

- https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.14.1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #38771 from LuciferYang/SPARK-41239.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
@cowtowncoder
Member

@yawkat MapperFootprintTest seems to be failing arbitrarily on CI as well. I think I'll increase the limit a bit for now, to at least reduce the incidence rate.

@yawkat
Member

yawkat commented Nov 28, 2022

@cowtowncoder Ugh, or just disable it for now. I've given up on making the results resilient to GC; there is too much going on -- weird caches with soft references and such. Maybe it will be less flaky with Epsilon GC, but that's not available on Java 8, so I didn't try it for the CI.

bclozel added a commit to spring-projects/spring-framework that referenced this issue Dec 5, 2022
Prior to this commit, the `ConcurrentLruCache` implementation would use
arrays of `AtomicReference` as operation buffers, and the buffer count
would be calculated with the nearest power of two for the CPU count.

This can result in significant heap memory usage as each
`AtomicReference` buffer entry adds to the memory pressure. As seen in
FasterXML/jackson-databind#3665, this can add a significant overhead for
no real added benefit for the current use case.

This commit changes the current implementation to use
`AtomicReferenceArray` as buffers and reduce the number of buffers.
JMH benchmark results are within the error margin, so we can assume that
this does not change the performance characteristics for the typical use
case in Spring Framework.

Fixes gh-29520