8353115: GenShen: mixed evacuation candidate regions need accurate live_data #24319
Conversation
👋 Welcome back kdnilsen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.
❗ This change is not yet ready to be integrated.
Webrevs
@@ -155,6 +155,23 @@ inline size_t ShenandoahHeapRegion::get_live_data_bytes() const {
  return get_live_data_words() * HeapWordSize;
}

inline size_t ShenandoahHeapRegion::get_mixed_candidate_live_data_bytes() const {
  assert(SafepointSynchronize::is_at_safepoint(), "Should be at Shenandoah safepoint");
Could we use shenandoah_assert_safepoint here (and other places) instead?
Good call. I'll make this change.
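Concretely, the change would look something like this (a before/after sketch, assuming the existing shenandoah_assert_safepoint helper mentioned above):

// Before: raw safepoint check
assert(SafepointSynchronize::is_at_safepoint(), "Should be at Shenandoah safepoint");

// After: the Shenandoah-specific helper suggested above
shenandoah_assert_safepoint();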
@@ -75,6 +75,7 @@ ShenandoahHeapRegion::ShenandoahHeapRegion(HeapWord* start, size_t index, bool c
  _plab_allocs(0),
  _live_data(0),
  _critical_pins(0),
  _mixed_candidate_garbage_words(0),
Do we need a new field to track this? During final_mark, we call increase_live_data_alloc_words to add top - TAMS to _live_data to account for objects allocated during mark. Could we "fix" get_live_data so that it always returned marked objects (counted by increase_live_data_gc_words) plus top - TAMS? This way, the live data would not become stale after final_mark and we wouldn't have another field to manage. What do you think?
This is a good idea. Let me experiment with this.
My experiment with an initial attempt at this failed with over 60 test failures. The "problem" is that we often consult get_live_data() in contexts where it is not appropriate to add (top - TAMS) to the atomic volatile ShenandoahHeapRegion::_live_data. I think most of these are asserts. I have so far confirmed at least two different places that need to be fixed; I'm not sure how many scenarios there are in total.
I'm willing to change the failing asserts to make this work. I think the code would be cleaner with your suggested refactor; it just makes this PR a little more far-reaching than the original.
See the most recent commit on this PR for the direction this would move us. Let me know if you think I should move forward with more refactoring, or revert this most recent change.
Thanks.
It does look simpler. Do you have an example of one of the failing asserts?
One thing I hadn't considered is how "hot" ShenandoahHeapRegion::get_live_data_words is. Is there going to be a significant performance hit if we make this method do more work? It does look like this method is called frequently.
Examples:
FullGC worker:
void ShenandoahMCResetCompleteBitmapTask::work(uint worker_id) {
  ShenandoahParallelWorkerSession worker_session(worker_id);
  ShenandoahHeapRegion* region = _regions.next();
  ShenandoahHeap* heap = ShenandoahHeap::heap();
  ShenandoahMarkingContext* const ctx = heap->complete_marking_context();
  while (region != nullptr) {
    if (heap->is_bitmap_slice_committed(region) && !region->is_pinned() && region->has_marked()) {
      // kelvin replacing has_live() with new method has_marked() because has_live() calls
      // get_live_data_words() and pointer_delta() asserts out because TAMS is not less than
      // top(). has_marked() does what has_live() used to do...
      ctx->clear_bitmap(region);
    }
    region = _regions.next();
  }
}
In ShenandoahInitMarkUpdateRegionStateClosure::heap_region_do():
- assert(!r->has_live(), "Region %zu should have no live data", r->index());
+ assert(!r->has_marked(), "Region %zu should have no marked data", r->index());
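A sketch of the shape has_marked() might take (an assumption based on the comment above, not necessarily the PR's exact body):

// Reports only the count accumulated by increase_live_data_gc_words(), and
// deliberately avoids the top - TAMS term that made has_live() assert out
// after a bitmap reset, when TAMS is no longer below top().
inline bool ShenandoahHeapRegion::has_marked() const {
  return Atomic::load(&_live_data) != 0;
}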
Not sure about the performance impact, short of implementing and testing...
I suspect the performance impact is minimal.
I've committed changes that endeavor to implement the suggested refactor. Performance impact does appear to be minimal. This broader refactoring does change behavior slightly. In particular:
- We now have a better understanding of the live memory evacuated during mixed evacuations. This allows the selection of old candidates for mixed evacuations to be more conservative: we'll select fewer old regions in order to honor the intended budget.
- Potentially, this will result in more mixed evacuations, but each mixed evacuation should take less time.
- There should be no impact on behavior of traditional Shenandoah.
On one recently completed test run, we observed the following impacts compared to tip:
Shenandoah
+80.69% specjbb2015/trigger_failure p=0.00000
Control: 58.250 (+/- 13.48 ) 110
Test: 105.250 (+/- 33.13 ) 30
GenShen
-19.46% jme/context_switch_count p=0.00176
Control: 117.420 (+/- 28.01 ) 108
Test: 98.292 (+/- 32.76 ) 30
Perhaps we need more data to decide whether this is "significant".
This result seems to be consistent. The effect on traditional Shenandoah is apparently to reduce the size of its collection sets as well, because certain regions that would previously have been collected are now rejected due to better awareness of how much live data will need to be copied. The amount of garbage attributed to candidate regions for the young collection set is reduced by the amount of allocations above TAMS; previously, those allocations had been erroneously reported as garbage. This delays reclamation of some garbage, resulting in an increase in allocation failures on the specjbb2015 workload.
We might argue that the original behavior was incorrect, in that it allowed violation of the intended evacuation budget. We were apparently getting away with this violation because we were able to flip mutator regions to collector space, and/or because the evacuation waste budget was sufficient to accommodate the unbudgeted evacuations.
Now that we have more accurate accounting of live memory, we could perhaps slightly reduce the default evacuation waste budget (to enable larger collection sets) as part of this PR, if we want to claw back the losses in specjbb performance.
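Illustrative arithmetic for the accounting change described above (a sketch, not code from this PR):

// Once live data includes allocations above TAMS, the garbage reported for a
// candidate region shrinks by exactly the (top - TAMS) byte count.
size_t reported_garbage_bytes(ShenandoahHeapRegion* r) {
  size_t live = r->get_live_data_bytes();  // now marked bytes + bytes above TAMS
  assert(r->used() >= live, "live data cannot exceed used");
  return r->used() - live;
}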
Redefine the way ShenandoahHeapRegion::get_live_data_<type> works to simplify changes.
Haven't started looking at these changes, but I do wonder if it might be worthwhile to also consider (and implement under a tunable flag) the alternative policy of never adding to the collection set any regions that are still "active" at the point when the collection set for a marking cycle is first assembled at the end of final mark. That way we don't have to do any re-computing, and the criterion for evacuation is garbage-first (or liveness-least), both of which remain invariant (and complements of each other) throughout the duration of evacuation, obviating entirely the need to recompute the goodness/choice metric afresh. The downside is that we may leave some garbage on the table in the active regions, but this is probably a minor price for most workloads and heap configurations, and it doesn't unnecessarily complicate or overengineer the solution. One question to consider is how G1 does this. Maybe regions placed in the collection set are retired (i.e. made inactive)? I prefer not to forcibly retire active regions, as this wastes space that might have been usable. Thoughts? (Can add this comment and discuss on the ticket if that is logistically preferable.)
@ysramakrishna: Interesting idea, definitely worthy of an experiment. On the upside, this can make GC more "efficient" by procrastinating until the GC effort maximizes the returns of allocatable memory. On the downside, this can allow garbage to hide out for arbitrarily long times in regions that are not fully used. It might also end up making it more difficult for the GC to reclaim garbage, in the case that allocations added to a region after we declined to evacuate it the first time need to be evacuated when we come back to that region later. I'd be in favor of proposing these experiments and possible feature enhancements in a separate JBS ticket.
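For concreteness, the policy floated above might look something like this (the function name, the exclude_active flag, and the "active" test are hypothetical stand-ins, not code from this PR):

// Assemble old candidates at final mark, optionally skipping regions that can
// still receive allocations so each candidate's liveness ranking stays invariant.
void assemble_old_candidates(ShenandoahHeapRegion** candidates, size_t count,
                             ShenandoahCollectionSet* cset, bool exclude_active) {
  for (size_t i = 0; i < count; i++) {
    ShenandoahHeapRegion* r = candidates[i];
    if (exclude_active && r->free() > 0) {  // "active": region can still receive allocations
      continue;  // leave this region's garbage on the table for a later cycle
    }
    cset->add_region(r);
  }
}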
@kdnilsen This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /keepalive command.
/keepalive
@ysramakrishna The pull request is being re-evaluated and the inactivity timeout has been reset.
The existing implementations of get_live_data_bytes() and get_live_data_words() do not always behave as might be expected. In particular, the value returned ignores any allocations that occur after the most recent marking effort that identified live data within the region. This is typically acceptable for young regions, where the amount of live data determines whether a region should be added to the collection set during the final-mark safepoint.
However, old-gen regions that are placed into the set of candidates for mixed evacuation are more complicated. In particular, by the time an old-gen region is added to a mixed evacuation, its live data may be much larger than it was when concurrent old marking ended.
This PR adds comments to clarify the shortcomings of the existing functions, and adds new functions that provide a more accurate accounting of live data for mixed-evacuation candidate regions.
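For reference, the added accessor might look roughly like this, reconstructed from the diff fragments earlier in this conversation (only the name, the safepoint assert, and the _mixed_candidate_garbage_words field appear in the hunks above; the body is an assumption):

inline size_t ShenandoahHeapRegion::get_mixed_candidate_live_data_bytes() const {
  shenandoah_assert_safepoint();
  // For a mixed-evacuation candidate, everything used in the region minus the
  // garbage identified by the most recent old mark counts as live.
  size_t used_words = used() / HeapWordSize;
  assert(used_words >= _mixed_candidate_garbage_words, "garbage cannot exceed used");
  return (used_words - _mixed_candidate_garbage_words) * HeapWordSize;
}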
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24319/head:pull/24319
$ git checkout pull/24319
Update a local copy of the PR:
$ git checkout pull/24319
$ git pull https://git.openjdk.org/jdk.git pull/24319/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24319
View PR using the GUI difftool:
$ git pr show -t 24319
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24319.diff
Using Webrev
Link to Webrev Comment