Fix some Linux (and windows) Dac crashes when given reasonable inputs#124686
Fix some Linux (and windows) Dac crashes when given reasonable inputs#124686leculver wants to merge 5 commits intodotnet:mainfrom
Conversation
Crash fixes (all SIGSEGV on Linux): - GetAppDomainName(0): add null address check - TraverseModuleMap: add null callback check - TraverseVirtCallStubHeap: add null callback check - GetRegisterName: replace wcscpy_s with bounded memcpy to prevent buffer overflow - GetPendingReJITID/GetReJITInformation/GetProfilerModifiedILInformation: add DacValidateMD - GetMethodsWithProfilerModifiedIL: add module address -1 check - GetFinalizationFillPointersSvr: validate heap address against known server GC heaps - GetAssemblyLoadContext: add DacValidateMethodTable Shadowed HRESULT fixes: - GetGenerationTable, GetFinalizationFillPointers, GetGenerationTableSvr, GetFinalizationFillPointersSvr, GetObjectComWrappersData all declared a local HRESULT hr that shadowed the one from SOSDacEnter(), causing the return after SOSDacLeave() to always return S_OK regardless of errors. Contributes to: dotnet#124640 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag |
There was a problem hiding this comment.
Pull request overview
This PR hardens SOS DAC implementations against invalid-but-plausible inputs to reduce crashes (especially on Unix) by adding validation and using safer access patterns when reading target data.
Changes:
- Over-allocates the buffer used by
GetGCGlobalMechanismsto avoid potential overflow when newer runtimes write more entries. - Replaces some direct
GetClass()dereferences withGetClassWithPossibleAV()and adds additional argument validation/guard rails in several DAC APIs. - Adds additional validation in GC and ReJIT-related helpers to avoid AVs on bad addresses / null callbacks.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/coreclr/inc/dacprivate.h | Over-allocates GC mechanisms buffer and copies out only supported entries to prevent overflow. |
| src/coreclr/debug/daccess/request.cpp | Adds argument validation and safer target reads (e.g., GetClassWithPossibleAV, DacValidateMD, callback checks) to reduce DAC crashes. |
…OUNT growth Over-allocate the buffer passed to GetGCGlobalMechanisms to 256 entries so that a newer runtime writing more than 6 entries won't overflow the caller's stack. Only copy back DAC_MAX_GLOBAL_GC_MECHANISMS_COUNT elements into the struct. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
7453a8e to
52544a8
Compare
|
|
||
| // Default to having found no information. | ||
| HRESULT hr = S_FALSE; | ||
| hr = S_FALSE; |
There was a problem hiding this comment.
Haha, I just discovered the same issue in my own testing. If you are fixing these shadowing bugs mind getting all of them at once? I believe there are three more in GetHandleEnumForTypes, GetHandleEnumForGC, and GetObjectStringData.
There was a problem hiding this comment.
Sure, I missed those. Did you see them actually happen in practice, or just code review? I will add those to the PR, assuming this is one we will merge. I'll hold off on work until I know the ground rules.
jkotas
left a comment
There was a problem hiding this comment.
The fixes that respect buffer bounds look good to me.
The validation fixes look fairly arbitrary. The legacy DAC is not robust against corrupted data on non-Windows by construction. I do not think scattershot AI-driven fixes are meaningfully improving the legacy DAC robustness.
| retval = FALSE; | ||
| } | ||
| else if (pEEClass != pMethodTable->GetClass()) | ||
| else if (pEEClass != pMethodTable->GetClassWithPossibleAV()) |
There was a problem hiding this comment.
There is no difference between GetClassWithPossibleAV and GetClass in DAC builds. This change is no-op
There was a problem hiding this comment.
Ah I misunderstood that.
There was a problem hiding this comment.
I reverted this part back but will wait for your reply on the bigger question below before continuing on this PR.
| data->wNumInstanceFields = pMT->GetNumInstanceFields(); | ||
| data->wNumStaticFields = pMT->GetNumStaticFields(); | ||
| data->wNumThreadStaticFields = pMT->GetNumThreadStaticFields(); | ||
| PTR_EEClass pClass = pMT->GetClassWithPossibleAV(); |
| MTData->Module = HOST_CDADDR(pMT->GetModule()); | ||
| // Note: DacpMethodTableData::Class is really a pointer to the canonical method table | ||
| MTData->Class = HOST_CDADDR(pMT->GetClass()->GetMethodTable()); | ||
| PTR_EEClass pClass = pMT->GetClassWithPossibleAV(); |
|
|
||
| #ifdef FEATURE_CODE_VERSIONING | ||
| PTR_Module pModule = PTR_Module(TO_TADDR(mod)); | ||
| if (dac_cast<TADDR>(pModule) == (TADDR)-1) |
There was a problem hiding this comment.
What's the point of special-casing -1 for this API only?
| CodeVersionManager::LockHolder codeVersioningLockHolder; | ||
| ILCodeVersion ilVersion = pCodeVersionManager->GetActiveILCodeVersion(pMD); | ||
| if (ilVersion.IsNull()) | ||
| if (!DacValidateMD(pMD)) |
There was a problem hiding this comment.
Is the idea that every legacy DAC API that takes a MethodDesc pointer should validate it using DacValidateMD? If it is the case, there is a bunch more to fix.
Targeted fixes for issues we have observed in practice = like the one you have done in #124457 - look reasonable to me. |
Thanks! Yes, this was user error on my part (not ai generated). I was cribbing from
This was driven by real issues observed in practice. We are building some tools based on ClrMD internally and I'm hitting a surprising variety of dac crashes, hence these pull requests for stackwalking and dac functions. The one that spurred this set of work was this:
Well, I definitely agree with you that the "legacy DAC is not robust against corrupted data on non-Windows by construction". 100% in agreement. In making these changes (and other changes in this vein) I'm not trying to change the guarantees here. Yes it's 100% true that before and after this change (and changes like this) this invariant hasn't changed. We are not robust to corrupted data, and changing that isn't my intent here. I'm only trying to increase the number of crashdumps and coredumps that we can successfully debug until we have a complete replacement for this style of debugging. If you want me to only limit this and future changes to directly observable crashes, and you'll reject anything else, then I can do that, of course. (I'll actually be 100% happy to never submit any changes to this code again when the cDac has all the right functionality.) However, I think rejecting bugfixes unless they are observed is the wrong approach for a couple reasons:
To be clear, I'm super happy to close up shop when the cDac is finished, and never again work on this terrible infrastructure from 20 years ago. But we have customers that use this, investigations that are painful because of debuggger crashes, and there's no alternative (yet). And I can't predict which of these easy to fix/avoidable bugs I'll have to open another PR for next. Yes of course, some of these we should fix on the caller side, and we are doing that. In general though, this PR is an attempt to do less work in this space in the future, and not an attempt to make this side of the codebase the path forward in diagnostics. Let me know how you want to proceed here. I'll hold off on further PR updates until we are on the same page with what's allowed in this space. Thanks for the feedback! |
| else if (pMT->GetClass() == NULL) | ||
| { | ||
| hr = E_INVALIDARG; | ||
| } |
There was a problem hiding this comment.
The null check pMT->GetClass() == NULL is redundant. GetClass() never returns NULL in DAC builds - it calls GetClassWithPossibleAV() which throws DacError(E_UNEXPECTED) if the class pointer is null. The exception is caught by the EX_CATCH block from SOSDacEnter, so the explicit check and hr = E_INVALIDARG assignment will never be reached. Consider removing this check or replacing it with a direct call to GetClassWithPossibleAV() wrapped in a nested EX_TRY if you want to distinguish this error case.
| else if (pMT->GetClass() == NULL) | |
| { | |
| hr = E_INVALIDARG; | |
| } |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
|
||
| // Default to having found no information. | ||
| HRESULT hr = S_FALSE; | ||
| hr = S_FALSE; |
There was a problem hiding this comment.
In GetObjectComWrappersData, now that the inner HRESULT hr shadowing was removed, the unconditional hr = S_FALSE; near the end of the method will force this API to always return S_FALSE even when rcw/mowList were successfully populated. Please remove that final overwrite (or gate it behind the “no data found” path) so callers can rely on S_OK when data is returned.
| hr = S_FALSE; |
| if (pClass->GetNumThreadStaticFields() == 0) | ||
| { |
There was a problem hiding this comment.
GetThreadStaticBaseAddress dereferences pClass without checking for null. MethodTable::GetClassWithPossibleAV() can return NULL (e.g., if m_pEEClass is null), which would still crash here. Please add a null check and return an error HRESULT (or S_FALSE) instead of dereferencing.
| if (pClass->GetNumThreadStaticFields() == 0) | |
| { | |
| if (pClass == NULL) | |
| { | |
| hr = E_FAIL; | |
| } | |
| else if (pClass->GetNumThreadStaticFields() == 0) | |
| { |
| { | ||
| CodeVersionManager* pCodeVersionManager = pModule->GetCodeVersionManager(); | ||
| CodeVersionManager::LockHolder codeVersioningLockHolder; | ||
|
|
There was a problem hiding this comment.
The else block added in GetMethodsWithProfilerModifiedIL isn’t indented, which makes the control flow hard to follow and increases the risk of future edits accidentally changing which code is guarded by the pointer validation. Please reformat/indent the contents of the else block (and its closing brace) to match the file’s existing style.
This is a stacktrace from .NET Framework or very old .NET Core.
I would like to see the legacy DAC deleted as soon as possible. If you would like to spend time on polishing it in the meantime, I do not have a problem with it. I am not sure whether it is a time well spent. |
|
I'm fine taking fixes in the DAC that may not have been directly observed, but I would suggest some clean up criteria:
|
| if (SUCCEEDED(hr = sos->QueryInterface(__uuidof(ISOSDacInterface3), (void**) &psos3))) | ||
| { | ||
| hr = psos3->GetGCGlobalMechanisms(globalMechanisms); | ||
| // GetGCGlobalMechanisms writes MAX_GLOBAL_GC_MECHANISMS_COUNT entries (currently 6) |
There was a problem hiding this comment.
I think a better fix would be to enforce in the DAC that it never writes more than 6 entries. If MAX_GLOBAL_GC_MECHANISMS_COUNT is greater than 6 the API should return a failing HRESULT. If we ever need to support more than 6 we can make an API that properly handles dynamic sizing.
Also heads up this is dead code in the runtime repo as far as I know. The header is used by some other libraries but I don't know of a code path that invokes this method. There is another copy of this header in the diagnostics repo where SOS is built and that one calls this code.
Agreed.
Fair enough. I'll focus my time on caller-side fixes in ClrMD/SOS and cDac (generally) instead. I'll keep #124402 open as Linux managed heap analysis is mostly broken in .NET 10/11 and that one needs a fix regardless. (Wasn't planning to back port unless someone asks for it, but I do need main to not crash as I'm testing tools.) |
There's a lot of places where the dac will crash when we feed it bad input. This doesn't completely fix or exhaust all of those places, but it makes a significant dent in the number of places we trust input that shouldn't be strictly trusted.
This primarily affects Linux (and probably macos, though I don't have a machine to test there). There were a few places windows could throw an AV too.
The underlying fixes were pretty rote, so I let copilot take care of a lot of them. I have a dac test app that does the bare minimum to test every api with the most basic inputs and this was the result. I ran the resulting dac/runtime through ClrMD's test suite and all tests pass.
Dac tests live here temporarily: https://github.com/leculver/dac-tests/. Planning to move them to a unit test at some point (here or the diagnostics repo) to make sure we don't regress these. Even though we don't change the surface of the dac private apis much anymore, underlying runtime changes can cause some crashes, especially on linux/osx.
Fixes #124640.