HPCC-34915 Add memory monitor to dump core when very high #20412
Conversation
Pull Request Overview
This PR adds a configurable memory monitoring system to the HPCC Platform that can automatically generate core dumps when memory usage exceeds specified thresholds. The primary purpose is to help diagnose memory issues in production environments by capturing the state of the process when memory usage becomes problematic.
Key changes:
- Implements a new IMemoryMonitor interface and CMemoryMonitor class with automatic core dump generation
- Integrates memory monitoring into Thor workers with configurable thresholds and intervals
- Adds comprehensive configuration options through the expert settings with auto-configuration mode
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| system/jlib/jdebug.hpp | Declares new memory monitoring interface and core dump functionality |
| system/jlib/jdebug.cpp | Implements memory monitor class with threaded monitoring and core dump generation |
| thorlcr/slave/slavmain.cpp | Integrates memory monitor into Thor worker lifecycle and adds worker memory tracking |
| helm/hpcc/docs/expert.md | Documents new memory core dump configuration options |
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-34915
Code looks reasonable. Main concern is over the potential size of the files being generated. @mckellyln what are your thoughts?
system/jlib/jdebug.cpp
Outdated
    }
}
else
    memoryIncrementMB = memoryThresholdMB / 100; // default to 1% of threshold
1% seems quite high - that means you may get 400 * 5 = 2000 core files if all the workers hit high memory at the same time.
Have revised (in new commit), so that by default (in "auto" mode) the threshold will be 95% and the increment 4% of total, meaning it will produce no more than 2 cores.
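The arithmetic behind the "no more than 2 cores" claim can be sketched as follows. This is an illustrative helper, not the PR's actual code; the constant names mirror those quoted later in the thread, but `maxAutoCoreDumps` is hypothetical:

```cpp
#include <cassert>

constexpr unsigned defaultMemoryThresholdPct = 95; // auto-mode first threshold
constexpr unsigned defaultNextIncrementPct = 4;    // re-arm step after each dump

// Hypothetical helper: how many dumps auto mode can produce, assuming one
// dump per threshold crossed below the 99% ceiling, plus one final dump
// at/above 99% before monitoring stops.
unsigned maxAutoCoreDumps(unsigned totalMB)
{
    unsigned thresholdMB = totalMB * defaultMemoryThresholdPct / 100;
    unsigned incrementMB = totalMB * defaultNextIncrementPct / 100;
    unsigned ceilingMB = totalMB * 99 / 100;
    unsigned dumps = 0;
    while (thresholdMB < ceilingMB)
    {
        ++dumps;
        thresholdMB += incrementMB;
    }
    return dumps + 1; // one last dump at/above the 99% ceiling
}
```

With a 95% start and 4% step, the second threshold already lands at 99%, so the loop caps out at two dumps regardless of total memory size.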
system/jlib/jdebug.cpp
Outdated
stopped = true;
stopSemaphore.signal();
threaded.join();
stopSemaphore.wait(0); // swallow signal if thread didn't consume it
stopSemaphore.reset(0) may be slightly cleaner.
stopSemaphore.reinit(0) ? I'll change it.
system/jlib/jdebug.cpp
Outdated
{
    // Generate a core dump in a child process without terminating the parent
#ifdef __linux__
    int childpid = fork();
I don't know if it's an issue or not, because child memory may not be paged in until modified (COW), but if we are already very tight on memory and we fork, and the child has a copy of the parent's memory, then perhaps we are in the mud even more at this point? Maybe try vfork() instead?
I don't think that will work: this mechanism relies on the child process having a copy of the parent's memory space, so the core dump reflects it. With vfork(), from what I understand, that would defeat the goal.
I think COW is a concern if the parent is freeing pages and creating new ones (as it's core dumping), which would mean the child holds the old ones (consuming memory) while the parent holds the new ones - increasing total usage.
I'm not sure how much of an issue it would be, but I'm changing the implementation so that it suspends the parent process before it triggers the core dump. There'll still be a tiny window, between the child process becoming active and the parent being suspended, in which some COW copying could occur, but it should be minimal.
After the child has finished core dumping, it will resume the parent process.
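A minimal sketch of the fork-and-dump idea being discussed (Linux only; `dumpCoreViaFork` is an illustrative name, not the PR's API). The forked child shares a copy-on-write image of the parent's address space and aborts, so its core reflects the parent's memory while the parent keeps running:

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <csignal>
#include <cstdlib>
#include <unistd.h>

bool dumpCoreViaFork()
{
#ifdef __linux__
    pid_t child = fork();
    if (child == -1)
        return false;
    if (child == 0)
    {
        // Child: restore the default SIGABRT disposition and abort; whether
        // a core file is actually written depends on ulimit -c / core_pattern
        signal(SIGABRT, SIG_DFL);
        abort();
    }
    int status = 0;
    waitpid(child, &status, 0);
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
#else
    return false;
#endif
}
```

Note this sketch omits the suspend/resume of the parent described above; it only shows why the child must be a full fork() (a vfork() child shares, rather than copies, the parent's address space until it execs).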
ok, good idea. If it doesn't work well, another crazy idea ...
run gcore <pid of thor worker>
Maybe from itself or maybe from a separate monitor script/process (check_executes ??) ?
this approach seems to work ok with a large-memory program ...
char pidStr[21];
snprintf(pidStr, sizeof(pidStr), "%d", (int)getpid());
char *nArgv[] = { (char *)"/usr/bin/gcore", pidStr, NULL };
int pid = vfork();
if (pid == 0)
{
    // Child: silence gcore's output, then exec it against the parent pid
    close(1);
    close(2);
    execve(nArgv[0], nArgv, NULL);
    _exit(errno); // only reached if execve failed
}
else if (pid > 0)
{
    int status;
    waitpid(pid, &status, 0); // parent blocks while gcore writes the core
    printf("after trying to write core file, status = %d ...\n", WEXITSTATUS(status));
}
else
    printf("unable to create core file\n");
Looks good.
Just the comment about memory and fork().
And perhaps with auto, maybe check/enforce a minimum size for incrementMB (if it's not 0)?
system/jlib/jdebug.cpp
Outdated
{
    // Set threshold based on defaultMemoryThresholdPct. Bear in mind roxiemem is normally ~80% of workerTotalMem
    // we want this at a fairly high level to avoid false positives, but not so high that we run out of memory
    memoryThresholdMB = totalMB / 100 * defaultMemoryThresholdPct;
Integer division occurs before multiplication, causing precision loss. For totalMB values not evenly divisible by 100, this will truncate the result. Use (totalMB * percentage) / 100 instead to maintain precision.
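A quick illustration of the truncation Copilot is flagging (hypothetical helper names, not from the PR):

```cpp
// Divide-then-multiply truncates the remainder of totalMB / 100 before
// it is scaled up; multiply-then-divide keeps full precision.
unsigned lossyThresholdMB(unsigned totalMB, unsigned pct) { return totalMB / 100 * pct; }
unsigned exactThresholdMB(unsigned totalMB, unsigned pct) { return totalMB * pct / 100; }
```

For totalMB = 7850 and pct = 95, the lossy form yields 78 * 95 = 7410 while the exact form yields 745750 / 100 = 7457, i.e. 47 MB of threshold silently lost.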
system/jlib/jdebug.cpp
Outdated
if (0 == coreDumpPid)
{
    // Core dump process: suspend parent before core dumping, to minimize chance parent is adding to memory pressure while dumping
    if (kill(parentPid, SIGSTOP) == -1)
Does sending parent SIGSTOP cause its parent's (shell ?) waitpid() to return prematurely ?
this call wasn't supposed to be left in there. Removed in 0079e8e5aae487a98afb44b5e5702b8066977243
The original parent process SIGSTOPs itself, and the reaper process resumes it once the child that has cored has exited/finished core dumping.
ok, but I think same issue here - once original pid is SIGSTOPed, its parent (shell, init_thorslave, etc) might think it ended.
Is that ok ?
I tested this code with large memory and the 2 forks show 2x the rss.
Here is a top trace while it was running -
virt rss
464975 kellma02 20 0 19.5g 19.5g 1492 S 0.0 90.8 0:05.39 x <-- orig pid
464975 kellma02 20 0 19.5g 19.5g 1492 S 0.0 90.8 0:05.39 x
465158 kellma02 20 0 19.5g 19.5g 0 R 65.9 90.8 0:00.27 x <-- first new fork, same large rss
464975 kellma02 20 0 19.5g 19.5g 1492 S 4.9 90.8 0:05.41 x
465157 kellma02 20 0 19.5g 19.5g 0 S 0.0 90.8 0:00.00 x
465158 kellma02 20 0 19.5g 19.5g 0 R 99.9 90.8 0:00.67 x <-- second new fork, same large rss
464975 kellma02 20 0 19.5g 19.5g 1488 S 0.0 90.8 0:05.41 x
465157 kellma02 20 0 19.5g 19.5g 0 S 0.0 90.8 0:00.00 x
465158 kellma02 20 0 19.5g 19.5g 0 R 99.9 90.8 0:01.07 x
464975 kellma02 20 0 19.5g 19.5g 1488 S 0.0 90.8 0:05.41 x
If I run it with just a little more memory allocated then I get OOMs.
If I run the vfork() + exec(gcore) with same mem size, it writes out the large core file and continues on without an OOM.
I am a little concerned the forks will end up causing a thor worker to get OOM killed before a core can get written, unless we trigger the core when memory is not that tight or we have a lot of swap space.
From what I can understand, top will show the same RSS for a fork()'d child process, but it is not contributing to the total consumed memory in the cgroup.
But maybe fork() + abort() is causing some committing..
I'll experiment with vfork() - it's still needed (or desirable) to suspend the parent process and resume it.
ok. I think vfork() only makes sense if you go the exec("gcore pid") route in the child.
yes.
I've now added it as an option. One observation is that gcore takes a lot longer to produce a core file than the kernel does. Reading up, it sounds like that's because gcore attaches and has to walk pages, and writes unreserved pages (0's) too, whereas the kernel creates a sparse file (which consumes less space also).
On my local machine with a large core (12G) it was taking ~2 secs with fork()+abort() vs ~60 secs with gcore.
In hpcc the memory [should] never reach very high amounts, because roxiemem limits it and there is automatic spilling inside that, but clearly it sometimes does. I think with suitable thresholds in a suitable 1st test environment, we should start getting some cores from jobs that did not OOM, but hopefully shed light on what is consuming a lot of heap.
We can play with the vfork route by switching configuration with expert/memoryCoreDump/useVforkAndCore: true
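One way to check the sparse-file observation above (a sketch, not part of the PR): compare a core file's allocated blocks against its apparent size via stat(2).

```cpp
#include <sys/stat.h>

// Returns true if the file occupies fewer disk blocks than its apparent
// size implies, i.e. it is sparse. st_blocks is in 512-byte units
// regardless of the filesystem's preferred block size.
bool isSparse(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return false; // treat missing/unreadable files as not sparse
    return (long long)st.st_blocks * 512 < (long long)st.st_size;
}
```

Running this against a kernel-written core vs a gcore-written one would confirm whether the ~30x time difference correlates with the kernel skipping unmapped/zero pages.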
Not sure why the parent (in the k8s case, check_executes.sh) would think it had ended when suspended?
ok, I agree, parent suspending itself should be ok - check_executes.sh's wait should not return early when its child pid is stopped.
I'm good with this PR, it's just that I couldn't get a good core with the fork()+suspend method when I tried to use almost all the mem. So I suggested the vfork()+exec("gcore pid") method as a possible alternative, since it worked ok for me when I tried to use almost all the mem.
@mckellyln - please review
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
constexpr unsigned defaultMemoryThresholdPct = 95; // auto set threshold to 95% of totalMB.
constexpr unsigned defaultNextIncrementPct = 4;    // i.e. next threshold will be at 99% (95% + 4%) of totalMB.
memoryThresholdMB = mcdSettings->getPropInt("@thresholdMB", NotFound);
if (NotFound != memoryThresholdMB)
[nitpick] The comparison NotFound != memoryThresholdMB reads unnaturally. Consider using memoryThresholdMB != NotFound for better readability and consistency with typical comparison patterns.
- if (NotFound != memoryThresholdMB)
+ if (memoryThresholdMB != NotFound)
    memoryThresholdMB = totalMB * defaultMemoryThresholdPct / 100;
}
memoryIncrementMB = mcdSettings->getPropInt("@incrementMB", NotFound);
if (NotFound != memoryIncrementMB)
[nitpick] The comparison NotFound != memoryIncrementMB reads unnaturally. Consider using memoryIncrementMB != NotFound for better readability and consistency with typical comparison patterns.
- if (NotFound != memoryIncrementMB)
+ if (memoryIncrementMB != NotFound)
system/jlib/jdebug.cpp
Outdated
    snprintf(gCorePPid, sizeof(gCorePPid), "%d", (int)parentPid);
}
// Reaper process creation
pid_t reaperPid = fork();
The vfork() method should allow for skipping this fork() and the setpgid(), signal() resets and parent suspend and continue steps.
Still need to:
- suspend the parent process (the original process suspends itself)
- wait for the vfork()/exec'd child process.
The original process can't wait because it's suspended.
The reaper process waits for the 'core' child (whether vfork()+exec'd or fork()'d), and then wakes up the original process.
vfork() doesn't suspend its parent process for the duration of gcore (only until the child execs).
The reaper process, in both cases, is there to facilitate the suspend/resume semantics.
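A hedged sketch of this suspend/resume choreography (illustrative names, not the PR's actual code): the original process forks a reaper and stops itself; the reaper waits until the original is actually stopped, forks the core-dumping child, reaps it, then resumes the original. The `/proc` polling is a simplification that assumes Linux and a process name without spaces.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

// Linux-specific: check /proc/<pid>/stat for state 'T' (stopped).
static bool isStopped(pid_t pid)
{
    char path[64];
    char state = 0;
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return false;
    fscanf(f, "%*d %*s %c", &state); // pid, (comm), state
    fclose(f);
    return state == 'T';
}

bool coreDumpWithSuspend()
{
    pid_t original = getpid();
    pid_t reaper = fork();
    if (reaper == -1)
        return false;
    if (reaper == 0)
    {
        // Reaper: wait until the original has suspended itself, so the
        // later SIGCONT cannot race ahead of the SIGSTOP
        while (!isStopped(original))
            usleep(1000);
        pid_t corePid = fork();
        if (corePid == 0)
        {
            signal(SIGABRT, SIG_DFL);
            abort(); // dump a COW image of the original's memory
        }
        int status;
        waitpid(corePid, &status, 0); // wait for core dumping to finish
        kill(original, SIGCONT);      // resume the original process
        _exit(0);
    }
    raise(SIGSTOP); // original suspends itself; the reaper resumes it
    int status;
    waitpid(reaper, &status, 0); // reap the reaper on resumption
    return true;
}
```

This also illustrates the point made above: the reaper itself must be fork()'d (it outlives the exec-less core child), which is why going vfork()-only really means dropping the suspend/resume step and relying on gcore pausing its target.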
Will discuss more offline.
a vfork() shouldn't be called from within a vfork()'d process, so the reaper process can't be replaced with vfork().
But I agree, if fork() is a concern, keeping the reaper as a fork()'d process in the vfork() case may defeat the point.
However, not suspending the parent process is also problematic: it could continue ramping up memory while gcore is running and be OOM killed as a result.
Could add it as an option.
ok, last comment :-)
I wanted to avoid ever calling fork(), as I was concerned about the additional memory it might require.
So the vfork() method really only makes sense to me without any fork()s.
The gcore program should pause the process it is creating a core for while it's working.
But all good, merge it
The explicit suspension of the parent process, with the additional reaper process that comes with it, is probably overkill in both cases. I've added an option and defaulted it to false.
We can play with these varieties during testing if we need to.
Approved. Look forward to getting some core files to analyze.
I do suggest the vfork() code path should avoid the fork() and could be simpler.
My thinking was that the vfork() method was there to avoid a fork() - only to keep memory usage down.
@jakesmith looks sensible.
system/jlib/jdebug.cpp
Outdated
unsigned currentMemMB = checkAndCoreDump();
if (currentMemMB) // if 0, no core dump was generated
{
    if (!incrementMB)
Would benefit from a comment indicating that the monitoring is terminating. (Wasn't clear on first reading)
// If current usage is <99% set one last threshold
unsigned nineNinePct = totalMB * 99 / 100;
if (currentMemMB >= nineNinePct)
    break;
Also worth a comment
Approved.
Looks great.
@jakesmith need to fix windows/mac build problems. Otherwise looks good.
Added commit to fix windows/mac compile issues.
@jakesmith please squash
Add a configurable memory monitor and use in thor workers, to spot when memory is very high, and when it is, create a core dump in the debug plane for future analysis. Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
Force-pushed from 27b4ec4 to 7cfd4bd
@ghalliday - squashed.
Add a configurable memory monitor and use in thor workers, to spot when memory is very high, and when it is, create a core dump in the debug plane for future analysis.