Skip to content

[cDAC] Stack walk support more Frame types #112997

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 69 commits into from
Mar 20, 2025

Conversation

max-charlamb
Copy link
Contributor

@max-charlamb max-charlamb commented Feb 27, 2025

Builds on #111759
Contributes to #110758

Adds cDAC stack walking support for the following Frame types:

  • TransitionFrame (Base class)
    • FramedMethodFrame
    • CLRToCOMMethodFrame
    • PInvokeCalliFrame
    • StubDispatchFrame
    • CallCountingHelperFrame
    • ExternalMethodFrame
    • DynamicHelperFrame
  • FuncEvalFrame
  • ResumableFrame (Base class)
    • RedirectedThreadFrame
  • FaultingExceptionFrame
  • HijackFrame

Expands cDAC Frame handling to be platform specific

In line with current cDAC stack walking support, only AMD64 and ARM64 support was added. When bringing up support for other architectures, platform specific frame handlers will be required. This was done to exactly match the existing DAC stack walking behavior.

Callouts

  1. Data descriptor naming convention.
    • FaulingExceptionFrame's stores a T_CONTEXT directly. ResumableFrame's store a pointer to a T_CONTEXT (PT_CONTEXT). To make this difference clear I named the direct struct as TargetContext and the pointer as TargetContextPtr. This distinction is important when reading the data descriptors. The former is a direct offset from the Frame's address, the latter is a pointer read at an offset from the Frame's address.
    • So far, we have avoided adding logic to the IData classes. For that reason, the classes expose a TargetPointer instead of the struct directly. However, this naming convention is a bit confusing because both TargetPointers (TargetContext and TargetContextPtr) end up pointing to the beginning of a T_CONTEXT.
  2. Using reflection to read/write a registers to a context.
    • Some data structures (CalleeSavedRegisters and HijackArgs) store a bag of registers values. To prevent hard coding the number/names of these register value, I have opted to use reflection to write the corresponding struct values from their string names.
    • This isn't my favorite choice, but the alternatives are hardcoding these values on the cDAC side or creating a large switch statement in each IPlatformContext to avoid reflection.
  3. Some data descriptors have very different representation on different platforms. CalleeSavedRegisters/HijackArgs
    • If this is common, maybe it would be better to have a separate platform specific <arch>datadescriptor.h that is imported based on platform.

Testing

Tested using amd64 Windows. Verified byte for byte correctness with DAC stack walking.

How produce call stacks with uncommon Frames

FuncEvalFrame

FuncEvalFrame's are generated during debugger immediate execution.

Debug any program in VS. Break and call a blocking function (Console.ReadLine()) attach a debugger non-invasively to stack walk or create a dump.
For example, using cdb, cdb -pv -p <processId>

HijackFrame

class HijackTest()
{
    public volatile bool flag;
    public volatile int num;

    // Set breakpoint at ThreadSuspend::SuspendEE then step out and look at the main thread stack
    // (bu coreclr!ThreadSuspend::SuspendEE)
    // Note: HijackFrames are not used on Windows if CET is enabled. Either test on non-Windows
    // or disable CET by modifying Thread::AreShadowStacksEnabled to return false.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public void Test()
    {
        // start other thread that will force a GC collection.
        Task.Run(Work);

        // run loop checking volatile variable to generate non-interruptible code.
        while (!flag)
        {
            TestLoop();
        }
    }
    public void Work()
    {
        Thread.Sleep(500);
        GC.Collect();
    }

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
    public void TestLoop()
    {
        num++; num++; num++; num++; num++;
        num++; num++; num++; num++; num++;
        num++; num++; num++; num++; num++;
        num++; num++; num++; num++; num++;
        num++; num++; num++; num++; num++;
        num++; num++; num++; num++; num++;
    }
}

RedirectedThreadFrame (ResumableFrame)

class RedirectedThreadFrame()
{
    public volatile bool flag;
    public volatile int num;

    // Configure WinDBG to break on clr exceptions
    // (sxe clr)
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public void Test()
    {
        var cts = new CancellationTokenSource();
        cts.CancelAfter(500);
        
        // use ControlledExecution with a cancellation token to trigger a
        // thread abort with a try/catch
        ControlledExecution.Run(Work, cts.Token);
        while (!flag)
        {
            TestLoop();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
    public void TestLoop()
    {
        for (int i = 0; i < 20; i++)
        {
            num++;
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public void Work()
    {
        try
        {
            while (!flag)
            {
                TestLoop();
            }
        }
        catch (Exception e)
        {
            Console.WriteLine(e);
        }
    }
}

FaultingExceptionFrame

class FaultingExceptionTest()
{
    public volatile bool flag;
    public volatile int num;

    // Set breakpoint on ThrowControlForThread
    // (bu coreclr!ThrowControlForThread)
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public void Test()
    {
        Console.ReadLine();
        var cts = new CancellationTokenSource();
        cts.CancelAfter(500);
        ControlledExecution.Run(Work, cts.Token);
        
        while (!flag)
        {
            TestLoop();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
    public void TestLoop()
    {
        for (int i = 0; i < 20; i++)
        {
            if (num > 10000)
            {
                // important to call another function here
                Console.WriteLine("num is greater than 10000");
            }
            num++;
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public void Work()
    {
        try
        {
            while (!flag)
            {
                TestLoop();
            }
        }
        catch (Exception e)
        {
            Console.WriteLine(e);
        }
    }
}

Copy link
Member

@davidwrighton davidwrighton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good enough to check in. I'm excited to see us start to run tests on this.

@max-charlamb
Copy link
Contributor Author

I think this is good enough to check in. I'm excited to see us start to run tests on this.

I have tests running locally on Win-x64 using my changes: dotnet/diagnostics#5350.

The last two blockers on multiple platform runs in CI are linking and cross-platform T_CONTEXT. I have verified that the static linking we discussed yesterday works correctly for both debug/release and will raise a PR and I'm actively looking at how to resolve the T_CONTEXT issues.

The pipeline should be running soon.

{
value = default;
if (typeof(T).GetField(fieldName) is not FieldInfo field) return false;
if (field.GetValue(Context) is not object fieldValue) return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of a weird check. It seems like the actual function of it is to check whether the field value is null, in which case a clearer way to write it would be

object fieldValue = field.GetValue(Context);
if (fieldValue == null) return false;

I was actually surprised to learn that x is object y would evaluate to false for null.
It's also redundant with the ulong and uint checks below, so I think what you can actually just write is:

object fieldValue = field.GetValue(Context);

and the two ulong/uint checks below will do the rest of the work, falling through to the return false at the bottom.

value = default;
if (typeof(T).GetField(fieldName) is not FieldInfo field) return false;
if (field.GetValue(Context) is not object fieldValue) return false;
if (fieldValue is ulong ul && target.PointerSize == sizeof(ulong))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not strictly necessary to fix, just a thought)
do we want to communicate something if they are mismatched and fail in a loud way, i.e. 'expected uint but got ulong' or is it enough to just return false? This is 'correct' but I don't know if this is a good opportunity for us to "fail well" by properly detecting and messaging error scenarios at this layer instead of just returning a false that conveys no information about the nature of the failure. I'm assuming an end user is touching this code through 20 layers of abstraction, and if this method returns false because the field held a ulong instead of a uint they might really appreciate it if there were an appropriate exception or error handler for that scenario.
I would assume the only way we can reach the return false at the bottom is if the target and the context are somehow mismatched, like the wrong dll was loaded. That's a scenario where we would want to fail fast, I think, instead of just silently not working?
On the other hand, this is robust, so I have no problem with it being checked in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that we usually want to fail fast. Currently that happens in the caller, it could be moved in here but I'm not sure on other scenarios if that's always the behavior we want.

Comment on lines 88 to 92
if (other is null)
{
return false;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, you can eliminate this whole check by changing the if below to if (GetType() != other?.GetType()) - a null other will have a null other?.GetType() which can't ever match our own GetType().

Change not necessary though.

Is there a reason why this Equals implementation is written in this specific way (exact type match only) even though ContextHolder isn't sealed? It's a little weird to see it, and I'd prefer to block this class of failure by sealing the class if we can do that because then the attempt to do the wrong thing will fail robustly at compile time instead of silently failing Equals checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change this class to sealed

}
catch (InvalidOperationException)
{
// if the read fails, the unwinder behavior changes. Return failing HR instead of throwing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we specifically make sure we're only swallowing the type of IOEs we expect and not other IOEs from an actual bug in the implementation? Do we know what scenarios we expect a throw here in? It's fine if there's no way to sense what kind of IOE it is robustly (I don't expect you to parse .Message etc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This throw occurs when the global is not found.

It happens:

public TargetPointer ReadGlobalPointer(string name, out string? type)
{
if (!_globals.TryGetValue(name, out (ulong Value, string? Type) global))
throw new InvalidOperationException($"Failed to read global pointer '{name}'.");
type = global.Type;
return new TargetPointer(global.Value);
}

I agree that I don't like using exceptions to handle this scenario. It may be worth changing the Target API to include a TryReadGlobalPointer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, given the narrow scope of the code inside the try I'm fine with this for now, maybe just make a to-do note to clean it up later to not rely on exceptions.

@@ -109,7 +109,15 @@ private static unsafe int ReadFromTarget(ulong address, void* pBuffer, int buffe
return -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Existing code, so not strictly your problem)
This GCHandle.FromIntPtr on context makes me suspicious, doubly so that we're checking whether the target is some type other than what we expect. Are we trying to guard against garbage context pointers here? Or something else? A comment might be good saying what this if statement is solving for people who look at the code later, if you happen to know. If it's equally mysterious to you then you can put it out of your mind and resolve this comment :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the existing code it looks like this is an abstract unwinder native API that lets you smuggle arbitrary data through void* context, so since we're providing an IntPtr when calling it elsewhere in this file, it's probably fine? But in that case, I'd like to see us fail fast and fail loudly - we shouldn't return -1, we should probably bail out and throw an exception because somehow we passed ourselves something other than the context we expected. Maybe I'm not understanding this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No totally reasonable. I also wrote that code.

This is guarding against memory corruption / an invalid call. The intptr should always be a GCHandle with Target ContextCallback. I can add a comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. If we're concerned about memory corruption, I'm not certain that it's safe to turn arbitrary bytes into a GCHandle and dereference it. Hopefully we bounds-check the table lookup? I'll see if I can find out...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCHandle.ThrowIfInvalid throws for 0, so a null context would make us throw here instead of returning -1.
GCHandleInternalGet does not appear to do any verification or bounds checking. It calls ObjectFromHandle, which also does not appear to do any verification or bounds checking. So I think if we got something here that wasn't a GCHandle, Anything Could Happen. Someone who knows the VM better could correct me here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the issues is that this is a callback from a native code boundary. Therefore, we probably shouldn't throw because on non-Windows platforms that's a fail fast. Ideally, we'd return an error to the native code which would propagate that error back to the managed invoking function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like the right approach is to try and verify the handle in a way that allows you to quickly return an error code to the native caller, without throwing. Right now if the handle is 0, ThrowIfInvalid will throw and unwind the stack, which it sounds like we don't want.

One workaround would be to use a thread local stack of some sort to record valid handles so this function can look up the context in the stack before dereferencing it. That's complex and not speedy though, so I don't love it.

Another option would be to try and ensure - by construction - that this callback can't be invoked by anyone else, and then we know that the argument(s) are trusted. I'm not sure we can do that here though.

{
ContextHolder<AMD64Context> contextHolder => new AMD64FrameHandler(target, contextHolder),
ContextHolder<ARM64Context> contextHolder => new ARM64FrameHandler(target, contextHolder),
_ => throw new InvalidOperationException("Unsupported context type"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to say which type is unsupported, it will make failures in production more actionable. Not strictly necessary though.

@max-charlamb max-charlamb merged commit 8cd8921 into dotnet:main Mar 20, 2025
148 of 151 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 20, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Diagnostics-coreclr enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants