Adjust initial tlab size #25423

Draft
wants to merge 48 commits into
base: master

kdnilsen
Contributor

@kdnilsen kdnilsen commented May 23, 2025

We have found that, with certain workloads, the initial and maximum TLAB sizes result in very high latencies for the first few invocations of particular methods on certain threads. The root cause is that TLABs are too large, which depletes allocatable memory too quickly. When large numbers of threads try to start up at the same time, some of them end up with no TLAB or a very small TLAB, and their work runs hundreds of times slower than that of threads that were able to grab very large TLABs.
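
The depletion effect can be illustrated with back-of-envelope arithmetic. The sketch below is not taken from this PR: the free-memory figure, the TLAB sizes, and the helper name `threads_with_full_tlab` are all hypothetical, chosen only to show how quickly large TLABs can exhaust allocatable memory when 500 threads start together.

```shell
# Hypothetical illustration: how many threads can obtain a full-size TLAB
# before allocatable memory runs out. All numbers are made up for the example.

# threads_with_full_tlab FREE_MB TLAB_KB THREADS
# Prints how many of THREADS threads can each be handed one TLAB of TLAB_KB
# kilobytes out of FREE_MB megabytes of free memory.
threads_with_full_tlab() {
    twft_free_kb=$(( $1 * 1024 ))
    twft_served=$(( twft_free_kb / $2 ))
    # Never report more threads than actually exist.
    if [ "$twft_served" -gt "$3" ]; then
        twft_served=$3
    fi
    echo "$twft_served"
}

# 500 threads competing for 512 MB of free memory (assumed figures):
threads_with_full_tlab 512 4096 500   # 4 MB TLABs: prints 128
threads_with_full_tlab 512 256 500    # 256 KB TLABs: prints 500
```

Under these assumed numbers, 4 MB TLABs leave 372 of the 500 threads without a full TLAB, while 256 KB TLABs serve every thread.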

This PR reduces the maximum TLAB size and adjusts the initial TLAB size in order to reduce the impact of this problem.
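
One rough way to experiment with similar settings on a build without this PR is through HotSpot's existing `-XX:TLABSize` (initial TLAB size) and `-XX:TLABAllocationWeight` flags. The helper below only assembles those two flags. Its name and the values shown are placeholders, not the sizes chosen by this PR, and the sketch does not attempt to change the maximum TLAB size.

```shell
# tlab_flags INITIAL_TLAB_KB ALLOCATION_WEIGHT
# Prints JVM flags that set the initial TLAB size and the allocation weight.
# Placeholder helper for experimentation; it does not reproduce the PR's logic.
tlab_flags() {
    echo "-XX:TLABSize=${1}k -XX:TLABAllocationWeight=${2}"
}

# Example use with placeholder values (left unquoted so the flags split
# into separate arguments):
#   java $(tlab_flags 256 35) -XX:+UseShenandoahGC -version
tlab_flags 256 35
```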

This PR also changes the value of TLABAllocationWeight from 90 to 35 when we are running in generational mode. 35 is the default value used for G1 GC, which is also generational. The default value of 90 was established years ago for non-generational Shenandoah because it tends to have less frequent GC cycles than generational collectors.
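
To see what the weight change means in practice, assume the weight feeds an exponentially weighted average of the form `avg = ((100 - weight) * avg + weight * sample) / 100`. That form is an assumption of this sketch, not something stated in the PR. The helper `ewma` and the sample values are hypothetical, the arithmetic is integer-only, and any warm-up handling in the JVM is ignored.

```shell
# ewma WEIGHT INITIAL_AVG SAMPLE...
# Folds each SAMPLE into the running average using the assumed formula
#   avg = ((100 - WEIGHT) * avg + WEIGHT * sample) / 100
# and prints the final average (integer arithmetic, arbitrary units).
ewma() {
    ewma_weight=$1
    ewma_avg=$2
    shift 2
    for ewma_sample in "$@"; do
        ewma_avg=$(( ((100 - ewma_weight) * ewma_avg + ewma_weight * ewma_sample) / 100 ))
    done
    echo "$ewma_avg"
}

# A running average of 100 followed by a single spike to 1000:
ewma 90 100 1000   # weight 90: prints 910 (the spike dominates)
ewma 35 100 1000   # weight 35: prints 415 (the spike is damped)
```

Under this assumption, a weight of 90 lets a single recent sample almost replace the history, whereas a weight of 35 reacts more gradually, which suits collectors that sample more often.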

With a "small" workload, the most significant benefit of this change is seen at p99.99 (66.1% latency improvement) and p99.999 (62.6% latency improvement). At other percentiles, latency increased slightly (0.6% at p50, 1.7% at p100).

image

image

The small workload is represented by the following execution script:

            ~/github/jdk.adjust-initial-tlab-size/build/linux-x86_64-server-release/images/jdk/bin/java \
                -XX:ActiveProcessorCount=2 \
                -XX:+UnlockExperimentalVMOptions \
                -XX:-ShenandoahPacing \
                -XX:+AlwaysPreTouch -XX:+DisableExplicitGC -Xms4g -Xmx4g \
                -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational \
                -XX:ShenandoahFullGCThreshold=1024 \
                -XX:ShenandoahMinRegionSize=4M \
                -Xlog:"gc*=info,ergo" \
                -Xlog:safepoint=trace -Xlog:safepoint=debug -Xlog:safepoint=info \
                -XX:+UnlockDiagnosticVMOptions \
                -jar ~/github/heapothesys.fix-two-bugs/Extremem/src/main/java/extremem.jar \
                -dDictionarySize=3000000 \
                -dNumCustomers=30000 \
                -dNumProducts=30000 \
                -dCustomerThreads=500 \
                -dCustomerPeriod=5s \
                -dCustomerThinkTime=1s \
                -dKeywordSearchCount=1 \
                -dSelectionCriteriaCount=3 \
                -dProductReviewLength=12 \
                -dServerThreads=5 \
                -dServerPeriod=10s \
                -dProductNameLength=10 \
                -dBrowsingHistoryQueueCount=5 \
                -dSalesTransactionQueueCount=5 \
                -dProductDescriptionLength=40 \
                -dProductReplacementPeriod=60s \
                -dProductReplacementCount=25 \
                -dCustomerReplacementPeriod=60s \
                -dCustomerReplacementCount=1500 \
                -dBrowsingExpiration=1m \
                -dPhasedUpdates=true \
                -dPhasedUpdateInterval=180s \
                -dSimulationDuration=25m \
                -dResponseTimeMeasurements=100000 \
                >$t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.out 2>$t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.err &
            job_pid=$!
            sleep 1500
            cpu_percent=$(ps -o cputime -o etime -p $job_pid)
            rss_kb=$(ps -o rss= -p $job_pid)
            rss_mb=$((rss_kb / 1024))
            wait $job_pid
            echo "RSS: $rss_mb MB" >>$t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.out 2>>$t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.err
            echo "$cpu_percent" >>$t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.out
            gzip $t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.out $t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.err

With a "medium" workload, the impact is mixed, ranging from a 9% improvement at p100 to a 22.4% degradation at p99.999.

image

image

The medium workload is represented by this execution script:

            ~/github/jdk.adjust-initial-tlab-size/build/linux-x86_64-server-release/images/jdk/bin/java \
                -XX:+UnlockExperimentalVMOptions \
                -XX:-ShenandoahPacing \
                -XX:+AlwaysPreTouch -XX:+DisableExplicitGC -Xms31g -Xmx31g \
                -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational \
                -XX:ShenandoahFullGCThreshold=1024 \
                -Xlog:"gc*=info,ergo" \
                -Xlog:safepoint=trace -Xlog:safepoint=debug -Xlog:safepoint=info \
                -XX:+UnlockDiagnosticVMOptions \
                -jar ~/github/heapothesys/Extremem/src/main/java/extremem.jar \
                -dDictionarySize=3000000 \
                -dNumCustomers=8000000 \
                -dNumProducts=1800000 \
                -dCustomerThreads=500 \
                -dCustomerPeriod=5s \
                -dCustomerThinkTime=1s \
                -dKeywordSearchCount=1 \
                -dSelectionCriteriaCount=2 \
                -dProductReviewLength=32 \
                -dServerThreads=5 \
                -dServerPeriod=10s \
                -dProductNameLength=10 \
                -dBrowsingHistoryQueueCount=5 \
                -dSalesTransactionQueueCount=5 \
                -dProductDescriptionLength=34 \
                -dProductReplacementPeriod=60s \
                -dProductReplacementCount=25 \
                -dCustomerReplacementPeriod=60s \
                -dCustomerReplacementCount=1500 \
                -dBrowsingExpiration=1m \
                -dPhasedUpdates=true \
                -dPhasedUpdateInterval=180s \
                -dSimulationDuration=25m \
                -dResponseTimeMeasurements=100000 \
                >$t.genshen.medium.MaxTLABisRSWby8-TLABisRSBby128.out 2>$t.genshen.medium.MaxTLABisRSWby8-TLABisRSBby128.err &
            job_pid=$!
            sleep 1500
            cpu_percent=$(ps -o cputime -o etime -p $job_pid)
            rss_kb=$(ps -o rss= -p $job_pid)
            rss_mb=$((rss_kb / 1024))
            wait $job_pid
            echo "RSS: $rss_mb MB" >>$t.genshen.medium.MaxTLABisRSWby8-TLABisRSBby128.out
            echo "$cpu_percent" >>$t.genshen.medium.MaxTLABisRSWby8-TLABisRSBby128.out
            gzip $t.genshen.medium.MaxTLABisRSWby8-TLABisRSBby128.out $t.genshen.medium.MaxTLABisRSWby8-TLABisRSBby128.err

The huge workload comparisons are still being tested...

The huge workload is represented by this execution script:

            ~/github/jdk.adjust-initial-tlab-size/build/linux-x86_64-server-release/images/jdk/bin/java \
                -XX:ActiveProcessorCount=16 \
                -XX:+UnlockExperimentalVMOptions \
                -XX:-ShenandoahPacing \
                -XX:+AlwaysPreTouch -XX:+DisableExplicitGC -Xms512g -Xmx512g \
                -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational \
                -XX:ShenandoahFullGCThreshold=1024 \
                -XX:ShenandoahGuaranteedGCInterval=0 \
                -XX:ShenandoahGuaranteedOldGCInterval=0 \
                -XX:ShenandoahGuaranteedYoungGCInterval=0 \
                -Xlog:"gc*=info,ergo" \
                -Xlog:safepoint=trace -Xlog:safepoint=debug -Xlog:safepoint=info \
                -XX:+UnlockDiagnosticVMOptions \
                -jar ~/github/heapothesys/Extremem/src/main/java/extremem.jar \
                -dDictionarySize=3000000 \
                -dNumCustomers=210000000 \
                -dNumProducts=18000000 \
                -dCustomerThreads=2000 \
                -dCustomerPeriod=2000ms \
                -dCustomerThinkTime=300ms \
                -dKeywordSearchCount=2 \
                -dAllowAnyMatch=false \
                -dSelectionCriteriaCount=3 \
                -dProductReviewLength=96 \
                -dBuyThreshold=0.5 \
                -dSaveForLaterThreshold=0.15 \
                -dBrowsingExpiration=5m \
                -dServerThreads=20 \
                -dServerPeriod=10s \
                -dProductNameLength=6 \
                -dProductDescriptionLength=70 \
                -dBrowsingHistoryQueueCount=1 \
                -dSalesTransactionQueueCount=1 \
                -dProductReplacementPeriod=60s \
                -dProductReplacementCount=25 \
                -dCustomerReplacementPeriod=60s \
                -dCustomerReplacementCount=150 \
                -dBrowsingExpiration=1m \
                -dSimulationDuration=25m \
                -dResponseTimeMeasurements=100000 \
                -dPhasedUpdates=true \
                -dPhasedUpdateInterval=180s \
                >$t.genshen.huge.MaxTLABisRSWby8-TLABisRSBisRSBby128.out 2>$t.genshen.huge.MaxTLABisRSWby8-TLABisRSBisRSBby128.err &
            job_pid=$!
            sleep 3000
            cpu_percent=$(ps -o cputime -o etime -p $job_pid)
            rss_kb=$(ps -o rss= -p $job_pid)
            rss_mb=$((rss_kb / 1024))
            wait $job_pid
            echo "RSS: $rss_kb KB" >>$t.genshen.huge.MaxTLABisRSWby8-TLABisRSBisRSBby128.out
            echo "RSS: $rss_mb MB" >>$t.genshen.huge.MaxTLABisRSWby8-TLABisRSBisRSBby128.out
            echo "$cpu_percent" >>$t.genshen.huge.MaxTLABisRSWby8-TLABisRSBisRSBby128.out
            gzip $t.genshen.huge.MaxTLABisRSWby8-TLABisRSBisRSBby128.out $t.genshen.huge.MaxTLABisRSWby8-TLABisRSBisRSBby128.err

We also tested the impact of this change on one of our current development branches, identified as adaptive-evac-with-surge. The performance of this development branch, which we are in the process of merging upstream, is what motivated the original effort to explore improved TLAB sizes.

For the same small workload described above running on a c6a.2xlarge host, the most significant benefits are seen at p99.99, p99.999, and p100 percentiles, with 50.1%, 17.6%, and 98.2% improvement respectively:

image

When this small workload is run on an m5.4xlarge host, we still see very significant benefits at p100, but degradation at p99.999.

image

The medium workload performed especially poorly without the improvements provided by this PR. All percentiles except p50 show very large improvement:

image

The huge workload is roughly neutral with this PR:

image


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25423/head:pull/25423
$ git checkout pull/25423

Update a local copy of the PR:
$ git checkout pull/25423
$ git pull https://git.openjdk.org/jdk.git pull/25423/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25423

View PR using the GUI difftool:
$ git pr show -t 25423

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25423.diff

@bridgekeeper

bridgekeeper bot commented May 23, 2025

👋 Welcome back kdnilsen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented May 23, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk

openjdk bot commented May 23, 2025

⚠️ @kdnilsen This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@openjdk

openjdk bot commented May 23, 2025

@kdnilsen The following labels will be automatically applied to this pull request:

  • hotspot-gc
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot-gc hotspot-gc-dev@openjdk.org shenandoah shenandoah-dev@openjdk.org labels May 23, 2025
@kdnilsen kdnilsen marked this pull request as draft May 23, 2025 20:27
@kdnilsen
Contributor Author

Leaving this in draft while I prepare details for review.
