Conversation

Member

@jakesmith jakesmith commented Sep 23, 2025

Add a configurable memory monitor and use it in Thor workers to spot when memory is very high and, when it is, create a core dump in the debug plane for future analysis.

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality).
  • This change fixes warnings (the fix does not alter the functionality or the generated code).
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@Copilot Copilot AI review requested due to automatic review settings September 23, 2025 13:21
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds a configurable memory monitoring system to the HPCC Platform that can automatically generate core dumps when memory usage exceeds specified thresholds. The primary purpose is to help diagnose memory issues in production environments by capturing the state of the process when memory usage becomes problematic.

Key changes:

  • Implements a new IMemoryMonitor interface and CMemoryMonitor class with automatic core dump generation
  • Integrates memory monitoring into Thor workers with configurable thresholds and intervals
  • Adds comprehensive configuration options through the expert settings with auto-configuration mode

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
system/jlib/jdebug.hpp Declares new memory monitoring interface and core dump functionality
system/jlib/jdebug.cpp Implements memory monitor class with threaded monitoring and core dump generation
thorlcr/slave/slavmain.cpp Integrates memory monitor into Thor worker lifecycle and adds worker memory tracking
helm/hpcc/docs/expert.md Documents new memory core dump configuration options

@jakesmith jakesmith requested a review from ghalliday September 23, 2025 13:26

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-34915

Jirabot Action Result:
Workflow Transition To: Merge Pending
Additional PR: #20412

Member

@ghalliday ghalliday left a comment

Code looks reasonable. Main concern is over the potential size of the files being generated. @mckellyln what are your thoughts?

}
}
else
memoryIncrementMB = memoryThresholdMB / 100; // default to 1% of threshold
Member

1% seems quite high - that means you may get 400 * 5 = 2000 core files if all the workers hit high memory at the same time.

Member Author

Have revised (in a new commit), so that by default (in "auto" mode) the threshold will be 95% of total and the increment 4% of total, meaning it will produce no more than 2 cores.

stopped = true;
stopSemaphore.signal();
threaded.join();
stopSemaphore.wait(0); // swallow signal if thread didn't consume it
Member

stopSemaphore.reset(0) may be slightly cleaner.

Member Author

stopSemaphore.reinit(0) ? I'll change it.

@ghalliday ghalliday requested a review from mckellyln September 24, 2025 13:07
{
// Generate a core dump in a child process without terminating the parent
#ifdef __linux__
int childpid = fork();
Contributor

@mckellyln mckellyln Sep 24, 2025

I don't know if it's an issue or not, because child memory may not be paged in until modified (COW), but if we are very tight on memory already and we fork and the child has a copy of the parent's memory, then perhaps we are in the mud even more at this point? Maybe try vfork() instead??

Member Author

I don't think that will work: this mechanism relies on the child process having a copy of the parent's memory space, so that the core dump reflects it. From what I understand, vfork() would defeat that goal.

I think COW is a concern: if the parent is freeing pages and creating new ones (as the child is core dumping), the child would hold the old pages (consuming memory) and the parent the new ones, increasing total usage.

I'm not sure how much of an issue it would be, but I'm changing the implementation so that it suspends the parent process before it triggers the core dump. There'll still be a tiny window, between the child process becoming active and the parent being suspended, in which some COW could occur, but it should be minimal.
After the child has finished core dumping, it will resume the parent process.

Contributor

@mckellyln mckellyln Sep 25, 2025

ok, good idea. If it doesn't work well, another crazy idea:
run gcore <pid of thor worker>
Maybe from the process itself, or maybe from a separate monitor script/process (check_executes ??)?

Contributor

this approach seems to work ok with a large-memory program ...

#include <errno.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

char pidStr[21];
snprintf(pidStr, sizeof(pidStr), "%d", (int)getpid());
char * const nArgv[] = { (char *)"/usr/bin/gcore", pidStr, NULL };

pid_t pid = vfork();
if (pid == 0)
{
    close(1); // silence gcore's stdout/stderr
    close(2);
    execve(nArgv[0], nArgv, NULL);
    _exit(errno); // only reached if execve() fails
}
else if (pid > 0)
{
    int status;
    waitpid(pid, &status, 0);
    printf("after trying to write core file, status = %d ...\n", WEXITSTATUS(status));
}
else
    printf("unable to create core file\n");

Contributor

@mckellyln mckellyln left a comment

Looks good.
Just the comment about memory and fork().
And perhaps with auto, maybe check/enforce a minimum size for incrementMB (if it's not 0)?

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

{
// Set threshold based on defaultMemoryThresholdPct. Bear in mind roxiemem is normally ~80% of workerTotalMem
// we want this at a fairly high level to avoid false positives, but not so high that we run out of memory
memoryThresholdMB = totalMB / 100 * defaultMemoryThresholdPct;
Copilot AI Sep 26, 2025

Integer division occurs before multiplication, causing precision loss. For totalMB values not evenly divisible by 100, this will truncate the result. Use (totalMB * percentage) / 100 instead to maintain precision.


if (0 == coreDumpPid)
{
// Core dump process: suspend parent before core dumping, to minimize chance parent is adding to memory pressure while dumping
if (kill(parentPid, SIGSTOP) == -1)
Contributor

Does sending parent SIGSTOP cause its parent's (shell ?) waitpid() to return prematurely ?

Member Author

this call wasn't supposed to be left in there. Removed in 0079e8e5aae487a98afb44b5e5702b8066977243

The original parent process SIGSTOP's itself, and the reaper process resumes it, once the child that has cored has exited/finished core dumping.

Contributor

ok, but I think same issue here - once original pid is SIGSTOPed, its parent (shell, init_thorslave, etc) might think it ended.
Is that ok ?

Contributor

@mckellyln mckellyln Sep 26, 2025

I tested this code with large memory and the 2 forks show 2x the rss.
Here is a top trace while it was running (note the VIRT/RES columns):
464975 kellma02 20 0 19.5g 19.5g 1492 S 0.0 90.8 0:05.39 x <-- orig pid
464975 kellma02 20 0 19.5g 19.5g 1492 S 0.0 90.8 0:05.39 x
465158 kellma02 20 0 19.5g 19.5g 0 R 65.9 90.8 0:00.27 x <-- first new fork, same large rss
464975 kellma02 20 0 19.5g 19.5g 1492 S 4.9 90.8 0:05.41 x
465157 kellma02 20 0 19.5g 19.5g 0 S 0.0 90.8 0:00.00 x
465158 kellma02 20 0 19.5g 19.5g 0 R 99.9 90.8 0:00.67 x <-- second new fork, same large rss
464975 kellma02 20 0 19.5g 19.5g 1488 S 0.0 90.8 0:05.41 x
465157 kellma02 20 0 19.5g 19.5g 0 S 0.0 90.8 0:00.00 x
465158 kellma02 20 0 19.5g 19.5g 0 R 99.9 90.8 0:01.07 x
464975 kellma02 20 0 19.5g 19.5g 1488 S 0.0 90.8 0:05.41 x

If I run it with just a little more memory allocated then I get OOMs.
If I run the vfork() + exec(gcore) with same mem size, it writes out the large core file and continues on without an OOM.

I am a little concerned the forks will end up causing a thor worker to get OOM killed before a core can get written, unless we trigger the core when memory is not that tight or we have a lot of swap space.

Member Author

From what I can understand, top will show the same RSS for a fork()'d child process, but it is not contributing to the total consumed memory in the cgroup.
But maybe fork() + abort() is causing some committing..

I'll experiment with vfork() - it's still needed (or desirable) to suspend the parent process and resume it.

Contributor

ok. I think vfork() only makes sense if you go the exec("gcore pid") route in the child.

Member Author

yes.
I've now added it as an option. One observation is that gcore takes a lot longer to produce a core file than the kernel does. Reading up, it sounds like that's because gcore attaches and has to walk pages, and writes unreserved pages (0's) too, whereas the kernel creates a sparse file (which also consumes less space).
On my local machine with a large core (12G) it was taking ~2 secs with fork()+abort() vs ~60 secs with gcore.

In HPCC the memory [should] never reach very high amounts, because roxiemem caps it and automatic spilling limits usage within that, but clearly it sometimes does. I think with suitable thresholds in a suitable 1st test environment, we should start getting some cores from jobs that did not OOM, which will hopefully shed light on what is consuming a lot of heap.

We can play with the vfork route by switching configuration with expert/memoryCoreDump/useVforkAndCore: true

@jakesmith
Member Author

ok, but I think same issue here - once original pid is SIGSTOPed, its parent (shell, init_thorslave, etc) might think it ended.
Is that ok ?

Not sure why the parent (in the k8s case, check_executes.sh) would think it had ended when suspended?
In testing this is not the case: it is suspended (SIGSTOP), the core is created, the reaper thread unsuspends the suspended pid (SIGCONT), and the process continues.

@mckellyln
Contributor

ok, I agree, the parent suspending itself should be ok - check_executes.sh's wait should not return early when its pid is stopped.

@mckellyln
Contributor

I'm good with this PR, it's just that I couldn't get a good core with the fork()+suspend method when I tried to use almost all the mem. So I suggested the vfork()+exec("gcore pid") method as a possible alternative, since it worked ok for me when I tried to use almost all the mem.

@jakesmith
Member Author

@mckellyln - please review

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

constexpr unsigned defaultMemoryThresholdPct = 95; // auto set threshold to 95% of totalMB.
constexpr unsigned defaultNextIncrementPct = 4; // i.e next threshold will be at 99% (95% + 4%) of totalMB.
memoryThresholdMB = mcdSettings->getPropInt("@thresholdMB", NotFound);
if (NotFound != memoryThresholdMB)
Copilot AI Sep 30, 2025

[nitpick] The comparison NotFound != memoryThresholdMB reads unnaturally. Consider using memoryThresholdMB != NotFound for better readability and consistency with typical comparison patterns.

Suggested change
if (NotFound != memoryThresholdMB)
if (memoryThresholdMB != NotFound)


memoryThresholdMB = totalMB * defaultMemoryThresholdPct / 100;
}
memoryIncrementMB = mcdSettings->getPropInt("@incrementMB", NotFound);
if (NotFound != memoryIncrementMB)
Copilot AI Sep 30, 2025

[nitpick] The comparison NotFound != memoryIncrementMB reads unnaturally. Consider using memoryIncrementMB != NotFound for better readability and consistency with typical comparison patterns.

Suggested change
if (NotFound != memoryIncrementMB)
if (memoryIncrementMB != NotFound)


snprintf(gCorePPid, sizeof(gCorePPid), "%d", (int)parentPid);
}
// Reaper process creation
pid_t reaperPid = fork();
Contributor

@mckellyln mckellyln Sep 30, 2025

The vfork() method should allow for skipping this fork() and the setpgid(), signal() resets and parent suspend and continue steps.

Member Author

Still need to:

  1. suspend parent process (original process suspends itself)
  2. wait for vfork()/exec'd child process.

The original process can't wait because it's suspended.
The reaper process waits for the 'core' child (whether vfork()+exec'd or fork()'d), and then wakes up the original process.

vfork() doesn't keep its parent process suspended (only until the child execs).
The reaper process, in both cases, is there to facilitate the suspend/resume semantics.

Will discuss more offline.

Member Author

a vfork() shouldn't be called from within a vfork()'d process, so the reaper process can't be replaced with vfork().

But I agree that if fork() is a concern, keeping the fork()'d reaper process in the vfork() case may defeat the point.
However, not suspending the parent process is also problematic: it could continue ramping up memory as gcore is running, and be OOM killed as a result.
Could add it as an option.

Contributor

ok, last comment :-)
I wanted to avoid ever calling fork() as I was concerned about the additional memory it might require.
So the vfork() method really only makes sense to me without any fork()s.
The gcore program should pause the process it is creating a core for while it's working.

Contributor

But all good, merge it

Member Author

The explicit suspension of the parent process, with the additional reaper process that comes with it, is probably overkill in both cases. I've added an option and defaulted it to false.
We can play with these varieties during testing if we need to.

Contributor

@mckellyln mckellyln left a comment

Approved. Looking forward to getting some core files to analyze.
I do suggest the vfork() code path should avoid the fork() and could be simpler.
My thinking was that the vfork() method was there only to keep memory usage down, by avoiding a fork().

Member

@ghalliday ghalliday left a comment

@jakesmith looks sensible.

unsigned currentMemMB = checkAndCoreDump();
if (currentMemMB) // if 0, no core dump was generated
{
if (!incrementMB)
Member

Would benefit from a comment indicating that the monitoring is terminating. (Wasn't clear on first reading)

// If current usage is <99% set one last threshold
unsigned nineNinePct = totalMB * 99 / 100;
if (currentMemMB >= nineNinePct)
break;
Member

Also worth a comment

@jakesmith jakesmith requested a review from mckellyln October 1, 2025 14:10
Contributor

@mckellyln mckellyln left a comment

Approved.
Looks great.

@ghalliday
Member

@jakesmith need to fix windows/mac build problems. Otherwise looks good.

@jakesmith
Member Author

@jakesmith need to fix windows/mac build problems. Otherwise looks good.

Added commit to fix windows/mac compile issues.

@ghalliday
Member

@jakesmith please squash

Add a configurable memory monitor and use in thor workers, to spot when
memory is very high, and when it is, create a core dump in the debug
plane for future analysis.

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
@jakesmith jakesmith force-pushed the HPCC-34915-memorymonitor branch from 27b4ec4 to 7cfd4bd Compare October 2, 2025 10:38
@jakesmith
Member Author

@ghalliday - squashed.

@ghalliday ghalliday merged commit 60d5af5 into hpcc-systems:candidate-9.12.x Oct 2, 2025
52 checks passed

github-actions bot commented Oct 2, 2025

Jirabot Action Result:
Added fix version: 9.12.44
Added fix version: 9.14.26
Workflow Transition: 'Resolve issue'
