HPCC-34915 Add memory monitor to dump core when very high #20412
Conversation
Pull Request Overview
This PR adds a configurable memory monitoring system to the HPCC Platform that can automatically generate core dumps when memory usage exceeds specified thresholds. The primary purpose is to help diagnose memory issues in production environments by capturing the state of the process when memory usage becomes problematic.
Key changes:
- Implements a new IMemoryMonitor interface and CMemoryMonitor class with automatic core dump generation
- Integrates memory monitoring into Thor workers with configurable thresholds and intervals
- Adds comprehensive configuration options through the expert settings with auto-configuration mode
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| system/jlib/jdebug.hpp | Declares new memory monitoring interface and core dump functionality |
| system/jlib/jdebug.cpp | Implements memory monitor class with threaded monitoring and core dump generation |
| thorlcr/slave/slavmain.cpp | Integrates memory monitor into Thor worker lifecycle and adds worker memory tracking |
| helm/hpcc/docs/expert.md | Documents new memory core dump configuration options |
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-34915
Code looks reasonable. Main concern is over the potential size of the files being generated. @mckellyln what are your thoughts?
system/jlib/jdebug.cpp
Outdated
    }
}
else
    memoryIncrementMB = memoryThresholdMB / 100; // default to 1% of threshold
1% seems quite high - that means you may get 400 * 5 = 2000 core files if all the workers hit high memory at the same time.
Have revised (in new commit), so that by default (in "auto" mode) the threshold will be 95% and the increment 4% of total, meaning it will produce no more than 2 cores.
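The arithmetic behind the "no more than 2 cores" claim can be sketched as follows. This is an illustrative helper, not the PR's actual code; the constant names mirror those quoted later in the thread, but `maxAutoCoreDumps` is hypothetical:

```cpp
#include <cassert>

constexpr unsigned defaultMemoryThresholdPct = 95; // auto-mode first threshold
constexpr unsigned defaultNextIncrementPct = 4;    // re-arm step after each dump

// Hypothetical helper: how many dumps auto mode can produce, assuming one
// dump per threshold crossed below the 99% ceiling, plus one final dump
// at/above 99% before monitoring stops.
unsigned maxAutoCoreDumps(unsigned totalMB)
{
    unsigned thresholdMB = totalMB * defaultMemoryThresholdPct / 100;
    unsigned incrementMB = totalMB * defaultNextIncrementPct / 100;
    unsigned ceilingMB = totalMB * 99 / 100;
    unsigned dumps = 0;
    while (thresholdMB < ceilingMB)
    {
        ++dumps;
        thresholdMB += incrementMB;
    }
    return dumps + 1; // one last dump at/above the 99% ceiling
}
```

With a 95% start and 4% step, the second threshold already lands at 99%, so the loop caps out at two dumps regardless of total memory size.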
system/jlib/jdebug.cpp
Outdated
stopped = true;
stopSemaphore.signal();
threaded.join();
stopSemaphore.wait(0); // swallow signal if thread didn't consume it
stopSemaphore.reset(0) may be slightly cleaner.
stopSemaphore.reinit(0) ? I'll change it.
system/jlib/jdebug.cpp
Outdated
{
    // Generate a core dump in a child process without terminating the parent
#ifdef __linux__
    int childpid = fork();
I don't know if it's an issue or not, because child memory may not be paged in until modified (COW), but if we are already very tight on memory and we fork, and the child has a copy of the parent's memory, then perhaps we are in the mud even more at this point? Maybe try vfork() instead?
I don't think that will work: this mechanism relies on the child process having a copy of the parent's memory space, so the core dump reflects it. With vfork(), from what I understand, that would defeat the goal.
I think COW is a concern if the parent is freeing pages and creating new ones (as it's core dumping), which would mean the child holds the old ones (consuming memory) while the parent holds the new ones - increasing total usage.
I'm not sure how much of an issue it would be, but I'm changing the implementation so that it suspends the parent process before it triggers the core dump. There'll still be a tiny window, between the child process becoming active and the parent being suspended, in which some COW copying could occur, but it should be minimal.
After the child has finished core dumping, it will resume the parent process.
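A minimal sketch of the fork-and-dump idea being discussed (Linux only; `dumpCoreViaFork` is an illustrative name, not the PR's API). The forked child shares a copy-on-write image of the parent's address space and aborts, so its core reflects the parent's memory while the parent keeps running:

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <csignal>
#include <cstdlib>
#include <unistd.h>

bool dumpCoreViaFork()
{
#ifdef __linux__
    pid_t child = fork();
    if (child == -1)
        return false;
    if (child == 0)
    {
        // Child: restore the default SIGABRT disposition and abort; whether
        // a core file is actually written depends on ulimit -c / core_pattern
        signal(SIGABRT, SIG_DFL);
        abort();
    }
    int status = 0;
    waitpid(child, &status, 0);
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
#else
    return false;
#endif
}
```

Note this sketch omits the suspend/resume of the parent described above; it only shows why the child must be a full fork() (a vfork() child shares, rather than copies, the parent's address space until it execs).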
ok, good idea. If it doesn't work well, another crazy idea ...
run gcore <pid of thor worker>
Maybe from itself or maybe from a separate monitor script/process (check_executes ??) ?
this approach seems to work ok with a large-memory program ...
char pidStr[21];
snprintf(pidStr, sizeof(pidStr), "%d", (int)getpid());
char *nArgv[] = { (char *)"/usr/bin/gcore", pidStr, NULL };
int pid = vfork();
if (pid == 0)
{
    // Child: silence gcore's output, then exec it against the parent pid
    close(1);
    close(2);
    execve(nArgv[0], nArgv, NULL);
    _exit(errno); // only reached if execve failed
}
else if (pid > 0)
{
    int status;
    waitpid(pid, &status, 0); // parent blocks while gcore writes the core
    printf("after trying to write core file, status = %d ...\n", WEXITSTATUS(status));
}
else
    printf("unable to create core file\n");
Looks good.
Just the comment about memory and fork().
And perhaps with auto, maybe check/enforce a minimum size for incrementMB (if it's not 0)?
system/jlib/jdebug.cpp
Outdated
{
    // Set threshold based on defaultMemoryThresholdPct. Bear in mind roxiemem is normally ~80% of workerTotalMem
    // we want this at a fairly high level to avoid false positives, but not so high that we run out of memory
    memoryThresholdMB = totalMB / 100 * defaultMemoryThresholdPct;
Integer division occurs before multiplication, causing precision loss. For totalMB values not evenly divisible by 100, this will truncate the result. Use (totalMB * percentage) / 100 instead to maintain precision.
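A quick illustration of the truncation Copilot is flagging (hypothetical helper names, not from the PR):

```cpp
// Divide-then-multiply truncates the remainder of totalMB / 100 before
// it is scaled up; multiply-then-divide keeps full precision.
unsigned lossyThresholdMB(unsigned totalMB, unsigned pct) { return totalMB / 100 * pct; }
unsigned exactThresholdMB(unsigned totalMB, unsigned pct) { return totalMB * pct / 100; }
```

For totalMB = 7850 and pct = 95, the lossy form yields 78 * 95 = 7410 while the exact form yields 745750 / 100 = 7457, i.e. 47 MB of threshold silently lost.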
system/jlib/jdebug.cpp
Outdated
if (0 == coreDumpPid)
{
    // Core dump process: suspend parent before core dumping, to minimize chance parent is adding to memory pressure while dumping
    if (kill(parentPid, SIGSTOP) == -1)
Does sending parent SIGSTOP cause its parent's (shell ?) waitpid() to return prematurely ?
this call wasn't supposed to be left in there. Removed in 0079e8e5aae487a98afb44b5e5702b8066977243
The original parent process SIGSTOPs itself, and the reaper process resumes it once the child that has cored has exited/finished core dumping.
ok, but I think same issue here - once original pid is SIGSTOPed, its parent (shell, init_thorslave, etc) might think it ended.
Is that ok ?
I tested this code with large memory and the 2 forks show 2x the rss.
Here is a top trace while it was running -
virt rss
464975 kellma02 20 0 19.5g 19.5g 1492 S 0.0 90.8 0:05.39 x <-- orig pid
464975 kellma02 20 0 19.5g 19.5g 1492 S 0.0 90.8 0:05.39 x
465158 kellma02 20 0 19.5g 19.5g 0 R 65.9 90.8 0:00.27 x <-- first new fork, same large rss
464975 kellma02 20 0 19.5g 19.5g 1492 S 4.9 90.8 0:05.41 x
465157 kellma02 20 0 19.5g 19.5g 0 S 0.0 90.8 0:00.00 x
465158 kellma02 20 0 19.5g 19.5g 0 R 99.9 90.8 0:00.67 x <-- second new fork, same large rss
464975 kellma02 20 0 19.5g 19.5g 1488 S 0.0 90.8 0:05.41 x
465157 kellma02 20 0 19.5g 19.5g 0 S 0.0 90.8 0:00.00 x
465158 kellma02 20 0 19.5g 19.5g 0 R 99.9 90.8 0:01.07 x
464975 kellma02 20 0 19.5g 19.5g 1488 S 0.0 90.8 0:05.41 x
If I run it with just a little more memory allocated then I get OOMs.
If I run the vfork() + exec(gcore) with same mem size, it writes out the large core file and continues on without an OOM.
I am a little concerned the forks will end up causing a thor worker to get OOM killed before a core can get written, unless we trigger the core when memory is not that tight or we have a lot of swap space.
From what I can understand, top will show the same RSS for a fork()'d child process, but it is not contributing to the total consumed memory in the cgroup.
But maybe fork() + abort() is causing some committing..
I'll experiment with vfork() - it's still needed (or desirable) to suspend the parent process and resume it.
ok. I think vfork() only makes sense if you go the exec("gcore pid") route in the child.
yes.
I've now added it as an option. One observation is that gcore takes a lot longer to produce a core file than the kernel does. Reading up, it sounds like that's because gcore attaches and has to walk pages, and writes unreserved pages (0's) too, whereas the kernel creates a sparse file (which consumes less space also).
On my local machine with a large core (12G) it was taking ~2 secs with fork()+abort() vs ~60 secs with gcore.
In hpcc the memory [should] never reach very high amounts, because roxiemem limits it and there is automatic spilling inside that, but clearly it sometimes does. I think with suitable thresholds in a suitable 1st test environment, we should start getting some cores from jobs that did not OOM, but hopefully shed light on what is consuming a lot of heap.
We can play with the vfork route by switching configuration with expert/memoryCoreDump/useVforkAndCore: true
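One way to check the sparse-file observation above (a sketch, not part of the PR): compare a core file's allocated blocks against its apparent size via stat(2).

```cpp
#include <sys/stat.h>

// Returns true if the file occupies fewer disk blocks than its apparent
// size implies, i.e. it is sparse. st_blocks is in 512-byte units
// regardless of the filesystem's preferred block size.
bool isSparse(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return false; // treat missing/unreadable files as not sparse
    return (long long)st.st_blocks * 512 < (long long)st.st_size;
}
```

Running this against a kernel-written core vs a gcore-written one would confirm whether the ~30x time difference correlates with the kernel skipping unmapped/zero pages.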
Not sure why the parent (in the k8s case, check_executes.sh) would think it had ended when suspended?
ok, I agree, parent suspending itself should be ok - check_executes.sh's wait should not return early when its child pid is stopped.
I'm good with this PR, it's just that I couldn't get a good core with the fork()+suspend method when I tried to use almost all the mem. So I suggested the vfork()+exec("gcore pid") method as a possible alternative, since it worked ok for me when I tried to use almost all the mem.
@mckellyln - please review
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
constexpr unsigned defaultMemoryThresholdPct = 95; // auto set threshold to 95% of totalMB.
constexpr unsigned defaultNextIncrementPct = 4;    // i.e. next threshold will be at 99% (95% + 4%) of totalMB.
memoryThresholdMB = mcdSettings->getPropInt("@thresholdMB", NotFound);
if (NotFound != memoryThresholdMB)
[nitpick] The comparison NotFound != memoryThresholdMB reads unnaturally. Consider using memoryThresholdMB != NotFound for better readability and consistency with typical comparison patterns.
- if (NotFound != memoryThresholdMB)
+ if (memoryThresholdMB != NotFound)
    memoryThresholdMB = totalMB * defaultMemoryThresholdPct / 100;
}
memoryIncrementMB = mcdSettings->getPropInt("@incrementMB", NotFound);
if (NotFound != memoryIncrementMB)
[nitpick] The comparison NotFound != memoryIncrementMB reads unnaturally. Consider using memoryIncrementMB != NotFound for better readability and consistency with typical comparison patterns.
- if (NotFound != memoryIncrementMB)
+ if (memoryIncrementMB != NotFound)
system/jlib/jdebug.cpp
Outdated
    snprintf(gCorePPid, sizeof(gCorePPid), "%d", (int)parentPid);
}
// Reaper process creation
pid_t reaperPid = fork();
The vfork() method should allow for skipping this fork() and the setpgid(), signal() resets and parent suspend and continue steps.
Still need to:
- suspend the parent process (the original process suspends itself)
- wait for the vfork()/exec'd child process.
The original process can't wait because it's suspended.
The reaper process waits for the 'core' child (whether vfork()+exec'd or fork()'d), and then wakes up the original process.
vfork() doesn't suspend its parent process for the duration of gcore (only until the child execs).
The reaper process, in both cases, is there to facilitate the suspend/resume semantics.
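A hedged sketch of this suspend/resume choreography (illustrative names, not the PR's actual code): the original process forks a reaper and stops itself; the reaper waits until the original is actually stopped, forks the core-dumping child, reaps it, then resumes the original. The `/proc` polling is a simplification that assumes Linux and a process name without spaces.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

// Linux-specific: check /proc/<pid>/stat for state 'T' (stopped).
static bool isStopped(pid_t pid)
{
    char path[64];
    char state = 0;
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return false;
    fscanf(f, "%*d %*s %c", &state); // pid, (comm), state
    fclose(f);
    return state == 'T';
}

bool coreDumpWithSuspend()
{
    pid_t original = getpid();
    pid_t reaper = fork();
    if (reaper == -1)
        return false;
    if (reaper == 0)
    {
        // Reaper: wait until the original has suspended itself, so the
        // later SIGCONT cannot race ahead of the SIGSTOP
        while (!isStopped(original))
            usleep(1000);
        pid_t corePid = fork();
        if (corePid == 0)
        {
            signal(SIGABRT, SIG_DFL);
            abort(); // dump a COW image of the original's memory
        }
        int status;
        waitpid(corePid, &status, 0); // wait for core dumping to finish
        kill(original, SIGCONT);      // resume the original process
        _exit(0);
    }
    raise(SIGSTOP); // original suspends itself; the reaper resumes it
    int status;
    waitpid(reaper, &status, 0); // reap the reaper on resumption
    return true;
}
```

This also illustrates the point made above: the reaper itself must be fork()'d (it outlives the exec-less core child), which is why going vfork()-only really means dropping the suspend/resume step and relying on gcore pausing its target.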
Will discuss more offline.
a vfork() shouldn't be called from within a vfork()'d process, so the reaper process can't be replaced with vfork().
But I agree, if fork() is a concern, keeping the reaper as a fork()'d process in the vfork() case may defeat the point.
However, not suspending the parent process is also problematic: it could continue ramping up memory while gcore is running and be OOM killed as a result.
Could add it as an option.
ok, last comment :-)
I wanted to avoid ever calling fork(), as I was concerned about the additional memory it might require.
So the vfork() method really only makes sense to me without any fork()s.
The gcore program should pause the process it is creating a core for while it's working.
But all good, merge it
The explicit suspension of the parent process, with the additional reaper process that comes with it, is probably overkill in both cases. I've added an option and defaulted it to false.
We can play with these varieties during testing if we need to.
Approved. Look forward to getting some core files to analyze.
I do suggest the vfork() code path should avoid the fork() and could be simpler.
My thinking was that the vfork() method was there to avoid a fork() - only to keep memory usage down.
@jakesmith looks sensible.
system/jlib/jdebug.cpp
Outdated
unsigned currentMemMB = checkAndCoreDump();
if (currentMemMB) // if 0, no core dump was generated
{
    if (!incrementMB)
Would benefit from a comment indicating that the monitoring is terminating. (Wasn't clear on first reading)
// If current usage is <99% set one last threshold
unsigned nineNinePct = totalMB * 99 / 100;
if (currentMemMB >= nineNinePct)
    break;
Also worth a comment
Approved.
Looks great.
@jakesmith need to fix windows/mac build problems. Otherwise looks good.
Added commit to fix windows/mac compile issues.
@jakesmith please squash
Add a configurable memory monitor and use in thor workers, to spot when memory is very high, and when it is, create a core dump in the debug plane for future analysis. Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
Force-pushed from 27b4ec4 to 7cfd4bd
@ghalliday - squashed.
Add a configurable memory monitor and use in thor workers, to spot when memory is very high, and when it is, create a core dump in the debug plane for future analysis.