Skip to content

Conversation

@allantargino
Copy link
Contributor

@allantargino allantargino commented Sep 23, 2025

ProcDump-for-Linux is a tool that supports .NET as one of its targets to take dumps. For .NET, it works pretty much as dotnet-dump tool, triggering createdump utility.

In ProcDump's integration pipelines, we've been seeing some intermittent errors taking .NET dumps. Sometimes we see:

[createdump] The process or container does not have permissions or access: open(/proc/10170/mem) FAILED Permission denied (13)
[createdump] Failure took 0ms

We even saw that for the same process, a first dump succeeded, a second one failed with the message above.

After studying the code from createdump's launcher in the CLR, we can see there is a fork: the child execs and becomes createdump, while the parent process uses prctl syscall to allow createdump to read its /proc/pid/mem:

// Fork the core dump child process.
pid_t childpid = fork();
// If error, write an error to trace log and abort
if (childpid == -1)
{
if (errorMessageBuffer != nullptr)
{
sprintf_s(errorMessageBuffer, cbErrorMessageBuffer, "Problem launching createdump: fork() FAILED %s (%d)\n", strerror(errno), errno);
}
close(pipe_descs[0]);
close(pipe_descs[1]);
return false;
}
else if (childpid == 0)
{
// Close the read end of the pipe, the child doesn't need it
int callbackResult = 0;
close(parent_pipe);
// Only dup the child's stderr if there is error buffer
if (errorMessageBuffer != nullptr)
{
dup2(child_pipe, STDERR_FILENO);
}
if (g_createdumpCallback != nullptr)
{
// Remove the signal handlers inherited from the runtime process
SEHCleanupSignals(true /* isChildProcess */);
// Call the statically linked createdump code
callbackResult = g_createdumpCallback(argv.size(), argv.data());
// Set the shutdown callback to nullptr and exit
// If we don't exit, the child's execution will continue into the diagnostic server behavior
// which causes all sorts of problems.
g_shutdownCallback = nullptr;
exit(callbackResult);
}
else
{
// Execute the createdump program
if (execve(argv[0], (char**)argv.data(), palEnvironment) == -1)
{
fprintf(stderr, "Problem launching createdump (may not have execute permissions): execve(%s) FAILED %s (%d)\n", argv[0], strerror(errno), errno);
exit(-1);
}
}
}
else
{
#if HAVE_PRCTL_H && HAVE_PR_SET_PTRACER
// Gives the child process permission to use /proc/<pid>/mem and ptrace
if (prctl(PR_SET_PTRACER, childpid, 0, 0, 0) == -1)
{
// Ignore any error because on some CentOS and OpenSUSE distros, it isn't
// supported but createdump works just fine.
ERROR("PROCCreateCrashDump: prctl() FAILED %s (%d)\n", strerror(errno), errno);
}

We noticed there can be a potential race condition on the code above when the child/createdump tries to read /proc/pid/mem before the parent grants permissions. I assume it is rare in multi-core systems, but it happens quite frequently in our pipelines that have a limited number of cores.

To force the race condition to happen locally, I added a sleep on the parent branch right before it calls prctl - it always fails with the error message above!

This PR proposes a fix to allow the parent to signal the child using pipes when prctl(PR_SET_PTRACER, childpid) is done so it can proceed.

Copilot AI review requested due to automatic review settings September 23, 2025 16:04
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Sep 23, 2025
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Sep 23, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a race condition in PROCCreateCrashDump where the child process (createdump) could attempt to read /proc/pid/mem before the parent process grants PTRACE permissions via prctl(PR_SET_PTRACER, childpid). The fix ensures proper synchronization between parent and child processes to prevent intermittent permission failures.

@jkotas jkotas added area-Diagnostics-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Sep 23, 2025
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @steveisok, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

@allantargino
Copy link
Contributor Author

I will address copilot's suggestions after a team member review if the fix makes sense 😅

@allantargino
Copy link
Contributor Author

BTW, the same pattern is also present on nativeaot:

pid_t childpid = fork();
// If error, write an error to trace log and abort
if (childpid == -1)
{
if (errorMessageBuffer != nullptr)
{
snprintf(errorMessageBuffer, cbErrorMessageBuffer, "Problem launching createdump: fork() FAILED %s (%d)\n", strerror(errno), errno);
}
close(pipe_descs[0]);
close(pipe_descs[1]);
return false;
}
else if (childpid == 0)
{
// Close the read end of the pipe, the child doesn't need it
close(parent_pipe);
// Only dup the child's stderr if there is error buffer
if (errorMessageBuffer != nullptr)
{
dup2(child_pipe, STDERR_FILENO);
}
// Execute the createdump program
if (execv(argv[0], (char* const *)argv) == -1)
{
fprintf(stderr, "Problem launching createdump (may not have execute permissions): execv(%s) FAILED %s (%d)\n", argv[0], strerror(errno), errno);
exit(-1);
}
}
else
{
#if HAVE_PRCTL_H && HAVE_PR_SET_PTRACER
// Gives the child process permission to use /proc/<pid>/mem and ptrace
if (prctl(PR_SET_PTRACER, childpid, 0, 0, 0) == -1)
{
// Ignore any error because on some CentOS and OpenSUSE distros, it isn't
// supported but createdump works just fine.
#ifdef _DEBUG
fprintf(stderr, "CreateCrashDump: prctl() FAILED %s (%d)\n", strerror(errno), errno);
#endif
}
#endif // HAVE_PRCTL_H && HAVE_PR_SET_PTRACER

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments inline and I agree with those AI review comments that we should error check the new read/write calls you added.

Thanks for finding this and making the PR!

@noahfalk
Copy link
Member

BTW, the same pattern is also present on nativeaot:

Interested in making the fix for NativeAOT in the same PR?

allantargino and others added 4 commits September 23, 2025 20:24
Co-authored-by: Noah Falk <noahfalk@users.noreply.github.com>
Co-authored-by: Noah Falk <noahfalk@users.noreply.github.com>
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@noahfalk
Copy link
Member

@hoyosjs - if this looks good to you as well go ahead and merge it

@allantargino allantargino changed the title Fix race condition in PROCCreateCrashDump when createdump tries to read /proc/pid/mem before prctl(PR_SET_PTRACER, childpid) Fix race condition when createdump tries to read /proc/pid/mem before prctl(PR_SET_PTRACER, childpid) is set Sep 24, 2025
@hoyosjs hoyosjs merged commit 8614658 into dotnet:main Sep 24, 2025
100 checks passed
@hoyosjs
Copy link
Member

hoyosjs commented Sep 24, 2025

/backport to release/10.0

@github-actions
Copy link
Contributor

Started backporting to release/10.0: https://github.com/dotnet/runtime/actions/runs/17990123963

@allantargino allantargino deleted the fix-prctl-race-condition-unix branch September 25, 2025 12:32
@github-actions github-actions bot locked and limited conversation to collaborators Oct 26, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-Diagnostics-coreclr community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants