Fix bio jobs leak on server crash#3011
Conversation
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #3011 +/- ##
============================================
+ Coverage 74.29% 74.31% +0.02%
============================================
Files 129 129
Lines 70974 70976 +2
============================================
+ Hits 52731 52749 +18
+ Misses 18243 18227 -16
🚀 New features to boost your workflow:
|
|
@enjoy-binbin Thanks for quick approval, appreciate that! |
| } | ||
|
|
||
| bio_job *job; | ||
| while ((job = mutexQueuePop(bwd->bio_jobs, false)) != NULL) { |
There was a problem hiding this comment.
How is this a memory leak? We still have a reference to the bio_workers object, so we still have a reference to the mutexQueue, so we still have a reference to the underlying jobs?
I wonder if it's related to the pointer tagging done within the fifo?
There was a problem hiding this comment.
My logic was that when we exit, the jobs allocated at line 215 are still reachable via the FIFO's pointers, but there's no code path that will ever free them - the FIFO only frees its own blocks, leaving the jobs orphaned.
There was a problem hiding this comment.
The jobs should only be freed on line 297, when they are processed.
We don't ever free the FIFOs or the mutexQueues as far as I can see, so the pointers should still be valid for the entire lifetime of the process.
There was a problem hiding this comment.
I think the concern here is not that we actually have a leak, but that the tool is reporting a leak, right?
This function is used at shutdown/crash, and is documented to be "unclean". I don't think it's intended to spend time actually cleaning up the memory.
So why is the tool reporting the leak? I think @madolson may be correct that the tagged pointers are confusing the tool. This situation is a lot like sds where the pointer points into the block. In this case, the lower 3 bits of the pointer are tagged, so the pointer may point up to 7 bytes into the block (it's impossible to point beyond the block).
We have:
- a fixed pointer in code, pointing to block A
- Inside block A, we have a tagged pointer pointing to block B - and the tagging results in a small offset.
- Inside block B, we have pointers to blocks allocated elsewhere.
It's plausible that because of the offset in the pointer to B, we're not recognizing that a "referenced" and therefore all of the pointers within B are not referenced either.
There was a problem hiding this comment.
After a little more research, it does appear that the tagged/interior pointers are confusing the tool.
The proposed code change looks like the most straightforward approach to addressing the immediate reported issue. However it's likely that this will result in additional reported leaks. The "job" freed below very likely contains additional pointers - and these will be reported as leaks.
We could possibly try suppressing the warnings, but this may be challenging as the intermediate blocks with tagged pointers contain references to possibly almost anything, allocated almost anywhere.
A more robust solution for this would be a new Fifo API which untags the pointers. This would make further use of the queue undefined, but should satisfy the memory check.
There was a problem hiding this comment.
See: #3061
Do we want to use this approach as a workaround for memcheck? It's a bit ugly, but at this point, the scope is small. It's not clear if this will become a more significant issue with greater adoption of fifo.
There was a problem hiding this comment.
I think the concern here is not that we actually have a leak, but that the tool is reporting a leak, right?
Exactly, this looks to be false positive. Not sure whats the best way to handle that currently though.. #3061 may be overkill for current scope, should we just add Valgrind suppression rules?
|
Fixed by #3178 |
If there are jobs still in the queue that haven't been processed yet, those jobs will never be freed because the threads are killed before they can process them.
Introduced: #2895
Daily run: https://github.com/valkey-io/valkey/actions/runs/20733311391/job/59525607904