Skip to content

Comments

Fix bio jobs leak on server crash#3011

Closed
dvkashapov wants to merge 1 commit intovalkey-io:unstablefrom
dvkashapov:fix-bio-leak-on-crash
Closed

Fix bio jobs leak on server crash#3011
dvkashapov wants to merge 1 commit intovalkey-io:unstablefrom
dvkashapov:fix-bio-leak-on-crash

Conversation

@dvkashapov
Copy link
Member

@dvkashapov dvkashapov commented Jan 6, 2026

If there are jobs still in the queue that haven't been processed yet, those jobs will never be freed because the threads are killed before they can process them.
Introduced: #2895
Daily run: https://github.com/valkey-io/valkey/actions/runs/20733311391/job/59525607904

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
@codecov
Copy link

codecov bot commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.31%. Comparing base (0ee4234) to head (1eb6d0f).
⚠️ Report is 105 commits behind head on unstable.

Files with missing lines Patch % Lines
src/bio.c 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3011      +/-   ##
============================================
+ Coverage     74.29%   74.31%   +0.02%     
============================================
  Files           129      129              
  Lines         70974    70976       +2     
============================================
+ Hits          52731    52749      +18     
+ Misses        18243    18227      -16     
Files with missing lines Coverage Δ
src/bio.c 80.86% <0.00%> (-1.44%) ⬇️

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dvkashapov
Copy link
Member Author

@enjoy-binbin Thanks for quick approval, appreciate that!

}

bio_job *job;
while ((job = mutexQueuePop(bwd->bio_jobs, false)) != NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this a memory leak? We still have a reference to the bio_workers object, so we still have a reference to the mutexQueue, so we still have a reference to the underlying jobs?

I wonder if it's related to the pointer tagging done within the fifo?

Copy link
Member Author

@dvkashapov dvkashapov Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My logic was that when we exit, the jobs allocated at line 215 are still reachable via the FIFO's pointers, but there's no code path that will ever free them - the FIFO only frees its own blocks, leaving the jobs orphaned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jobs should only be freed on line 297, when they are processed.

We don't ever free the FIFOs or the mutexQueues as far as I can see, so the pointers should still be valid for the entire lifetime of the process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the concern here is not that we actually have a leak, but that the tool is reporting a leak, right?

This function is used at shutdown/crash, and is documented to be "unclean". I don't think it's intended to spend time actually cleaning up the memory.

So why is the tool reporting the leak? I think @madolson may be correct that the tagged pointers are confusing the tool. This situation is a lot like sds where the pointer points into the block. In this case, the lower 3 bits of the pointer are tagged, so the pointer may point up to 7 bytes into the block (it's impossible to point beyond the block).

We have:

  • a fixed pointer in code, pointing to block A
  • Inside block A, we have a tagged pointer pointing to block B - and the tagging results in a small offset.
  • Inside block B, we have pointers to blocks allocated elsewhere.

It's plausible that because of the offset in the pointer to B, we're not recognizing that a "referenced" and therefore all of the pointers within B are not referenced either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a little more research, it does appear that the tagged/interior pointers are confusing the tool.

The proposed code change looks like the most straightforward approach to addressing the immediate reported issue. However it's likely that this will result in additional reported leaks. The "job" freed below very likely contains additional pointers - and these will be reported as leaks.

We could possibly try suppressing the warnings, but this may be challenging as the intermediate blocks with tagged pointers contain references to possibly almost anything, allocated almost anywhere.

A more robust solution for this would be a new Fifo API which untags the pointers. This would make further use of the queue undefined, but should satisfy the memory check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See: #3061

Do we want to use this approach as a workaround for memcheck? It's a bit ugly, but at this point, the scope is small. It's not clear if this will become a more significant issue with greater adoption of fifo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the concern here is not that we actually have a leak, but that the tool is reporting a leak, right?

Exactly, this looks to be false positive. Not sure whats the best way to handle that currently though.. #3061 may be overkill for current scope, should we just add Valgrind suppression rules?

@dvkashapov
Copy link
Member Author

Fixed by #3178

@dvkashapov dvkashapov closed this Feb 23, 2026
@dvkashapov dvkashapov deleted the fix-bio-leak-on-crash branch February 23, 2026 06:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants