Skip to content

Comments

Fix some Linux (and windows) Dac crashes when given reasonable inputs#124686

Closed
leculver wants to merge 5 commits intodotnet:mainfrom
leculver:issue_124640
Closed

Fix some Linux (and windows) Dac crashes when given reasonable inputs#124686
leculver wants to merge 5 commits intodotnet:mainfrom
leculver:issue_124640

Conversation

@leculver
Copy link
Contributor

@leculver leculver commented Feb 21, 2026

There's a lot of places where the dac will crash when we feed it bad input. This doesn't completely fix or exhaust all of those places, but it makes a significant dent in the number of places we trust input that shouldn't be strictly trusted.

This primarily affects Linux (and probably macos, though I don't have a machine to test there). There were a few places windows could throw an AV too.

The underlying fixes were pretty rote, so I let copilot take care of a lot of them. I have a dac test app that does the bare minimum to test every api with the most basic inputs and this was the result. I ran the resulting dac/runtime through ClrMD's test suite and all tests pass.

Dac tests live here temporarily: https://github.com/leculver/dac-tests/. Planning to move them to a unit test at some point (here or the diagnostics repo) to make sure we don't regress these. Even though we don't change the surface of the dac private apis much anymore, underlying runtime changes can cause some crashes, especially on linux/osx.

Fixes #124640.

leculver and others added 2 commits February 20, 2026 09:17
Crash fixes (all SIGSEGV on Linux):
- GetAppDomainName(0): add null address check
- TraverseModuleMap: add null callback check
- TraverseVirtCallStubHeap: add null callback check
- GetRegisterName: replace wcscpy_s with bounded memcpy to prevent buffer overflow
- GetPendingReJITID/GetReJITInformation/GetProfilerModifiedILInformation: add DacValidateMD
- GetMethodsWithProfilerModifiedIL: add module address -1 check
- GetFinalizationFillPointersSvr: validate heap address against known server GC heaps
- GetAssemblyLoadContext: add DacValidateMethodTable

Shadowed HRESULT fixes:
- GetGenerationTable, GetFinalizationFillPointers, GetGenerationTableSvr,
  GetFinalizationFillPointersSvr, GetObjectComWrappersData all declared a
  local HRESULT hr that shadowed the one from SOSDacEnter(), causing the
  return after SOSDacLeave() to always return S_OK regardless of errors.

Contributes to: dotnet#124640

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 21, 2026 02:17
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens SOS DAC implementations against invalid-but-plausible inputs to reduce crashes (especially on Unix) by adding validation and using safer access patterns when reading target data.

Changes:

  • Over-allocates the buffer used by GetGCGlobalMechanisms to avoid potential overflow when newer runtimes write more entries.
  • Replaces some direct GetClass() dereferences with GetClassWithPossibleAV() and adds additional argument validation/guard rails in several DAC APIs.
  • Adds additional validation in GC and ReJIT-related helpers to avoid AVs on bad addresses / null callbacks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
src/coreclr/inc/dacprivate.h Over-allocates GC mechanisms buffer and copies out only supported entries to prevent overflow.
src/coreclr/debug/daccess/request.cpp Adds argument validation and safer target reads (e.g., GetClassWithPossibleAV, DacValidateMD, callback checks) to reduce DAC crashes.

…OUNT growth

Over-allocate the buffer passed to GetGCGlobalMechanisms to 256 entries
so that a newer runtime writing more than 6 entries won't overflow the
caller's stack. Only copy back DAC_MAX_GLOBAL_GC_MECHANISMS_COUNT
elements into the struct.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

// Default to having found no information.
HRESULT hr = S_FALSE;
hr = S_FALSE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I just discovered the same issue in my own testing. If you are fixing these shadowing bugs mind getting all of them at once? I believe there are three more in GetHandleEnumForTypes, GetHandleEnumForGC, and GetObjectStringData.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I missed those. Did you see them actually happen in practice, or just code review? I will add those to the PR, assuming this is one we will merge. I'll hold off on work until I know the ground rules.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fixes that respect buffer bounds look good to me.

The validation fixes look fairly arbitrary. The legacy DAC is not robust against corrupted data on non-Windows by construction. I do not think scattershot AI-driven fixes are meaningfully improving the legacy DAC robustness.

retval = FALSE;
}
else if (pEEClass != pMethodTable->GetClass())
else if (pEEClass != pMethodTable->GetClassWithPossibleAV())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no difference between GetClassWithPossibleAV and GetClass in DAC builds. This change is no-op

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I misunderstood that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted this part back but will wait for your reply on the bigger question below before continuing on this PR.

data->wNumInstanceFields = pMT->GetNumInstanceFields();
data->wNumStaticFields = pMT->GetNumStaticFields();
data->wNumThreadStaticFields = pMT->GetNumThreadStaticFields();
PTR_EEClass pClass = pMT->GetClassWithPossibleAV();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary change

MTData->Module = HOST_CDADDR(pMT->GetModule());
// Note: DacpMethodTableData::Class is really a pointer to the canonical method table
MTData->Class = HOST_CDADDR(pMT->GetClass()->GetMethodTable());
PTR_EEClass pClass = pMT->GetClassWithPossibleAV();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary change


#ifdef FEATURE_CODE_VERSIONING
PTR_Module pModule = PTR_Module(TO_TADDR(mod));
if (dac_cast<TADDR>(pModule) == (TADDR)-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of special-casing -1 for this API only?

CodeVersionManager::LockHolder codeVersioningLockHolder;
ILCodeVersion ilVersion = pCodeVersionManager->GetActiveILCodeVersion(pMD);
if (ilVersion.IsNull())
if (!DacValidateMD(pMD))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that every legacy DAC API that takes a MethodDesc pointer should validate it using DacValidateMD? If it is the case, there is a bunch more to fix.

@jkotas
Copy link
Member

jkotas commented Feb 21, 2026

The validation fixes look fairly arbitrary. The legacy DAC is not robust against corrupted data on non-Windows by construction. I do not think scattershot AI-driven fixes are meaningfully improving the legacy DAC robustness.

Targeted fixes for issues we have observed in practice = like the one you have done in #124457 - look reasonable to me.

@leculver
Copy link
Contributor Author

leculver commented Feb 21, 2026

There is no difference between GetClassWithPossibleAV and GetClass in DAC builds. This change is no-op

Thanks! Yes, this was user error on my part (not ai generated). I was cribbing from DacValidateEEClass and should have read the difference between those two more carefully, yes. (This one I can't blame on AI.)

Targeted fixes for issues we have observed in practice = like the one you have done in #124457 - look reasonable to me.

This was driven by real issues observed in practice. We are building some tools based on ClrMD internally and I'm hitting a surprising variety of dac crashes, hence these pull requests for stackwalking and dac functions. The one that spurred this set of work was this:

EEClass::GetPackableField(this=NULL)           class.cpp:3343   ← SIGSEGV
EEClass::GetNumInstanceFields                  class.h:914
MethodTable::GetNumInstanceFields              methodtable.inl:128
ClrDataAccess::GetMethodTableFieldData         request.cpp:1906
...
SOSDac.GetFieldInfo(UInt64, MethodTableFieldInfo&)           SosDac.cs:298
DacTypeHelpers+<EnumerateFields>d__17.MoveNext()             DacTypeHelpers.cs:196
ClrType.CacheFields(...)                                     ClrType.cs:256
ClrType.get_Fields()                                         ClrType.cs:191

The validation fixes look fairly arbitrary. The legacy DAC is not robust against corrupted data on non-Windows by construction. I do not think scattershot AI-driven fixes are meaningfully improving the legacy DAC robustness.

Well, I definitely agree with you that the "legacy DAC is not robust against corrupted data on non-Windows by construction". 100% in agreement. In making these changes (and other changes in this vein) I'm not trying to change the guarantees here. Yes it's 100% true that before and after this change (and changes like this) this invariant hasn't changed. We are not robust to corrupted data, and changing that isn't my intent here. I'm only trying to increase the number of crashdumps and coredumps that we can successfully debug until we have a complete replacement for this style of debugging.

If you want me to only limit this and future changes to directly observable crashes, and you'll reject anything else, then I can do that, of course. (I'll actually be 100% happy to never submit any changes to this code again when the cDac has all the right functionality.)

However, I think rejecting bugfixes unless they are observed is the wrong approach for a couple reasons:

  1. The universe of dumps is large, and I am playing whack-a-mole with "the dac crashed on this input, go make this tiny fix in the dac". I can't run clrmd/sos over every dump in existence to figure out a-priori what needs to be fixed.
  2. This is an attempt to be a little more systematic in trying to not push through a constant stream of tiny fixes, and just get input validation out of the way in one big pass. It doesn't cover everything, of course, and I'm not trying to change the fundamental nature of sigsev on linux. A little bit of extra preemptive work could save us more work here in the future, and that's what I'm trying to do.
  3. These are real issues. The risk of this checkin is low, it's just doing some input validation, and doesn't affect the runtime itself. Adding a simple suite of (yes contrived, yes AI built) regression tests somewhere that run occasionally would avoid regressions like (Don't crash when walking callstack roots in the debugger #124402).

To be clear, I'm super happy to close up shop when the cDac is finished, and never again work on this terrible infrastructure from 20 years ago. But we have customers that use this, investigations that are painful because of debuggger crashes, and there's no alternative (yet). And I can't predict which of these easy to fix/avoidable bugs I'll have to open another PR for next. Yes of course, some of these we should fix on the caller side, and we are doing that.

In general though, this PR is an attempt to do less work in this space in the future, and not an attempt to make this side of the codebase the path forward in diagnostics.

Let me know how you want to proceed here. I'll hold off on further PR updates until we are on the same page with what's allowed in this space. Thanks for the feedback!

Copilot AI review requested due to automatic review settings February 21, 2026 19:10
@leculver leculver marked this pull request as draft February 21, 2026 19:12
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment on lines +2062 to +2065
else if (pMT->GetClass() == NULL)
{
hr = E_INVALIDARG;
}
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The null check pMT->GetClass() == NULL is redundant. GetClass() never returns NULL in DAC builds - it calls GetClassWithPossibleAV() which throws DacError(E_UNEXPECTED) if the class pointer is null. The exception is caught by the EX_CATCH block from SOSDacEnter, so the explicit check and hr = E_INVALIDARG assignment will never be reached. Consider removing this check or replacing it with a direct call to GetClassWithPossibleAV() wrapped in a nested EX_TRY if you want to distinguish this error case.

Suggested change
else if (pMT->GetClass() == NULL)
{
hr = E_INVALIDARG;
}

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 21, 2026 19:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


// Default to having found no information.
HRESULT hr = S_FALSE;
hr = S_FALSE;
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In GetObjectComWrappersData, now that the inner HRESULT hr shadowing was removed, the unconditional hr = S_FALSE; near the end of the method will force this API to always return S_FALSE even when rcw/mowList were successfully populated. Please remove that final overwrite (or gate it behind the “no data found” path) so callers can rely on S_OK when data is returned.

Suggested change
hr = S_FALSE;

Copilot uses AI. Check for mistakes.
Comment on lines +5722 to 5723
if (pClass->GetNumThreadStaticFields() == 0)
{
Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetThreadStaticBaseAddress dereferences pClass without checking for null. MethodTable::GetClassWithPossibleAV() can return NULL (e.g., if m_pEEClass is null), which would still crash here. Please add a null check and return an error HRESULT (or S_FALSE) instead of dereferencing.

Suggested change
if (pClass->GetNumThreadStaticFields() == 0)
{
if (pClass == NULL)
{
hr = E_FAIL;
}
else if (pClass->GetNumThreadStaticFields() == 0)
{

Copilot uses AI. Check for mistakes.
Comment on lines +4841 to 4844
{
CodeVersionManager* pCodeVersionManager = pModule->GetCodeVersionManager();
CodeVersionManager::LockHolder codeVersioningLockHolder;

Copy link

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The else block added in GetMethodsWithProfilerModifiedIL isn’t indented, which makes the control flow hard to follow and increases the risk of future edits accidentally changing which code is guarded by the pointer validation. Please reformat/indent the contents of the else block (and its closing brace) to match the file’s existing style.

Copilot uses AI. Check for mistakes.
@jkotas
Copy link
Member

jkotas commented Feb 21, 2026

This was driven by real issues observed in practice. We are building some tools based on ClrMD internally and I'm hitting a surprising variety of dac crashes, hence these pull requests for stackwalking and dac functions. The one that spurred this set of work was this:

EEClass::GetPackableField(this=NULL)           class.cpp:3343   ← SIGSEGV
EEClass::GetNumInstanceFields                  class.h:914
MethodTable::GetNumInstanceFields              methodtable.inl:128
ClrDataAccess::GetMethodTableFieldData         request.cpp:1906

This is a stacktrace from .NET Framework or very old .NET Core. EEClass::GetPackableField does not exist in the live main. Also, GetClass has NULL checks as copilot codereview pointed out in #124686 (comment) should be addressing the problem more holistically. For reference, these NULL checks got added by #104128 .

Let me know how you want to proceed here.

I would like to see the legacy DAC deleted as soon as possible. If you would like to spend time on polishing it in the meantime, I do not have a problem with it. I am not sure whether it is a time well spent.

@noahfalk
Copy link
Member

I'm fine taking fixes in the DAC that may not have been directly observed, but I would suggest some clean up criteria:

  • Run the SOS tests in the diagnostics repo in addition to CLRMD testing if you haven't done so already. Not all calls to these APIs go through CLRMD.
  • It seemed like the goal was to be methodical across the scope of ISOSDacInterface APIs but the actual validation added doesn't appear consistent. For example ClrDataAccess::GetTieredVersions on SOSDacInterface5 has a MethodDesc parameter that doesn't call DacValidateMD() whereas other methods now do.
  • I'd lean towards not testing with -1 in the set of 'reasonable' inputs for a module pointer, but I'm not that opposed to it if you really think its a valuable check. If we are going to do it then it would be nice to do it consistently across the ISOSDacInterface, for example ClrDataAccess::GetModule

if (SUCCEEDED(hr = sos->QueryInterface(__uuidof(ISOSDacInterface3), (void**) &psos3)))
{
hr = psos3->GetGCGlobalMechanisms(globalMechanisms);
// GetGCGlobalMechanisms writes MAX_GLOBAL_GC_MECHANISMS_COUNT entries (currently 6)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better fix would be to enforce in the DAC that it never writes more than 6 entries. If MAX_GLOBAL_GC_MECHANISMS_COUNT is greater than 6 the API should return a failing HRESULT. If we ever need to support more than 6 we can make an API that properly handles dynamic sizing.

Also heads up this is dead code in the runtime repo as far as I know. The header is used by some other libraries but I don't know of a code path that invokes this method. There is another copy of this header in the diagnostics repo where SOS is built and that one calls this code.

@leculver
Copy link
Contributor Author

I would like to see the legacy DAC deleted as soon as possible.

Agreed.

If you would like to spend time on polishing it in the meantime, I do not have a problem with it. I am not sure whether it is a time well spent.

Fair enough. I'll focus my time on caller-side fixes in ClrMD/SOS and cDac (generally) instead.

I'll keep #124402 open as Linux managed heap analysis is mostly broken in .NET 10/11 and that one needs a fix regardless. (Wasn't planning to back port unless someone asks for it, but I do need main to not crash as I'm testing tools.)

@leculver leculver closed this Feb 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DAC-Linux] Dac private interface crashes with some benign inputs

3 participants