[release-7.4] Revert condition for releasing memory from tlog's versionLocation data structure #12298
+1
−1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
We introduced this change in #11815. The idea was that for tlog spill by reference, we should be using popped version, not persistent version, for the purposes of releasing memory from the logData->versionLocation data structure. This causes OOMs in remote tlogs under certain high traffic workloads after a multi-hour primary-remote disconnect i.e. after the disconnect is healed, remote tlogs' RSS footprint grows in an unbounded way, presumably because some storage server wasn't able to pop fast enough, causing memory build up in tlog.
This got us thinking whether we really need this change in the first place and whether it's safe to revert. After discussion with a few people, the understanding is that this should not cause a correctness issue since tlog restart can rebuild the state from disk. Joshua/simulation also have not shown this issue, we made the fix pre-emptively. Worst case potential cost is disk build up and that's because the versionLocation in-memory state is used for disk queue popping, and lack of the versions in memory could mean we don't release disk space fast enough. Having said that, we think even if it were to happen, this is transient and disk space would eventually be released because of how disk queue popping works.
Overall, we think this change is safe to revert.
Testing
20250813-225250-praza-743-tag-v216-01060c4e-43f3c5618750bb80 compressed=True data_size=40160171 duration=5570120 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=1:19:41 sanity=False started=100000 stopped=20250814-001231 submitted=20250813-225250 timeout=5400 username=praza-743-tag-v216-01060c4ee1fb5386e045bb9c0b086f07315ac3ef
For the 1 Joshua failure, it's noSim which does not setup a simulated cluster, and I can not reproduce:
I also looked at the stack trace mentioned in Joshua logs, does not seem related to this change:
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch
ormain
if this is the youngest branch)