Skip to content

prevent manifest sending for special cased event sources #60357

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

josalem
Copy link
Contributor

@josalem josalem commented Oct 13, 2021

As a workaround for making NativeRuntimeEventSource and specifically Portable Threadpool events show up under the Microsoft-DotNETRuntime-Private provider in .net6, the GUID for the EventSource was set to be the same as the manually defined provider.

This had an unfortunate side-effect of tripping up some ETW based listeners that request the manifest if a provider is an EventSource. It appears there may be some issues in the generated code that would cause the manifest returned by the EventSource to not match the hand-written one most tools use. Typically, consumers of Microsoft-DotNETRuntime-Private will use a local copy of ClrEtwAll.man to parse the events. If a consumer uses both that and the manifest collected from the EventSource it can mess up parsing.

This patch doesn't attempt to sort out the issues in the generating code, but corrects the behavior to not send a manifest for the EventSource which is more correct for how this code is set up. It also keeps the change small which will be better for backporting/getting into 6.

The manifest generated from ClrEtwAll.man is too correct and marks parameters as unsigned integer values. The ClrTraceEventParser in TraceEvent treats them as signed integer values at the other end. When we send this manifest it causes TraceEvent to treat the provider as an EventSource which results in different parsing logic depending on how you register callbacks. If customer/tool code is expecting events from events with this provider name and event name to have signed values, but get unsigned, it will cause exceptions in their code.

using System;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Parsers.Clr;
using Microsoft.Diagnostics.Tracing.Session;

using (TraceEventSession session = new TraceEventSession("MySession"))
{
    DynamicTraceEventParser dynamicParser = new DynamicTraceEventParser(session.Source);
    dynamicParser.AddCallbackForProviderEvent(ClrTraceEventParser.ProviderName, "MethodJittingStarted_V1", (data) =>
    {
        // signed & named MethodJittingStarted_V1
    });
    dynamicParser.All += delegate (TraceEvent data)
    {
        if (data.EventName.StartsWith("MethodJitting"))
        {
            // signed & named MethodJittingStarted_V1
        }
    };

    session.Source.Clr.MethodJittingStarted += delegate (MethodJittingStartedTraceData data)
    {
        // unsigned & named Method/JittingStarted
    };
    session.EnableProvider(ClrTraceEventParser.ProviderGuid.ToString(), TraceEventLevel.Verbose, (ulong)ClrTraceEventParser.Keywords.Jit);
    session.Source.Process();
}

If customer or tool event handling code special cases the unboxing of the payload values based on the static manifest values, then they will hit cast failure exceptions in the Dynamic* cases above.

Reverting to draft - Looking for further validation from the issue reporter. This patch fixed their issue (hence the PR), but further investigation has made me doubt this is the correct fix to check in to the runtime.

CC @benmwatson @brianrob

@josalem josalem added this to the 7.0.0 milestone Oct 13, 2021
@josalem josalem requested review from noahfalk and a team October 13, 2021 18:50
@josalem josalem self-assigned this Oct 13, 2021
@ghost
Copy link

ghost commented Oct 13, 2021

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Issue Details

As a workaround for making NativeRuntimeEventSource and specifically Portable Threadpool events show up under the Microsoft-DotNETRuntime-Private provider in .net6, the GUID for the EventSource was set to be the same as the manually defined provider.

This had an unfortunate side-effect of tripping up some ETW based listeners that request the manifest if a provider is an EventSource. It appears there may be some issues in the generated code that would cause the manifest returned by the EventSource to not match the hand-written one most tools use. Typically, consumers of Microsoft-DotNETRuntime-Private will use a local copy of ClrEtwAll.man to parse the events. If a consumer uses both that and the manifest collected from the EventSource it can mess up parsing.

This patch doesn't attempt to sort out the issues in the generating code, but corrects the behavior to not send a manifest for the EventSource which is more correct for how this code is set up. It also keeps the change small which will be better for backporting/getting into 6.

CC @benmwatson @brianrob

Author: josalem
Assignees: josalem
Labels:

area-System.Diagnostics.Tracing

Milestone: 7.0.0

Co-authored-by: Brian Robbins <brianrob@microsoft.com>
@noahfalk
Copy link
Member

Is this intended as a short term workaround, or a long-term solution to the problem? As a short term workaround I'm happy to be more flexible how we approach it, but I'd also then want to understand how we might get to the long term solution.

In general I want it to be possible for the runtime to start sending an inline manifest for providers that we did not send a manifest for in the past. If the inline manifest currently being sent is wrong I understand not sending it until those issues can be corrected. Can you add more detail to this issue about what is problematic with the current manifest EventSource sends?

@josalem
Copy link
Contributor Author

josalem commented Oct 13, 2021

@noahfalk, taking a look at the generated managed code a little closer, I think the issue is a little more subtle than I made it out to be.

As an example, here is event 145 MethodJittingSTarted_V1.

ClrEtwAll.man:

<template tid="MethodJittingStarted_V1">
    <data name="MethodID" inType="win:UInt64" outType="win:HexInt64" />
    <data name="ModuleID" inType="win:UInt64" outType="win:HexInt64" />
    <data name="MethodToken" inType="win:UInt32" outType="win:HexInt32" />
    <data name="MethodILSize" inType="win:UInt32" outType="win:HexInt32" />
    <data name="MethodNamespace" inType="win:UnicodeString" />
    <data name="MethodName" inType="win:UnicodeString" />
    <data name="MethodSignature" inType="win:UnicodeString" />
    <data name="ClrInstanceID" inType="win:UInt16" />

    <!-- snip -->
</template>

Generated code in NativeRuntimeEventSource:

[Event(145, Version = 1, Level = EventLevel.Verbose, Keywords = Keywords.JitKeyword)]
private void MethodJittingStarted_V1(ulong MethodID, ulong ModuleID, uint MethodToken, uint MethodILSize, string MethodNamespace, string MethodName, string MethodSignature, ushort ClrInstanceID)
{
    // To have this event be emitted from managed side, refer to NativeRuntimeEventSource.cs
    throw new NotImplementedException();
}

and finally, the code used by ClrTraceEventParser class in the TraceEvent library:

    public sealed class MethodJittingStartedTraceData : TraceEvent
    {
        public long MethodID { get { return GetInt64At(0); } }
        public long ModuleID { get { return GetInt64At(8); } }
        public int MethodToken { get { return GetInt32At(16); } }
        public int MethodILSize { get { return GetInt32At(20); } }
        public string MethodNamespace { get { return GetUnicodeStringAt(24); } }
        public string MethodName { get { return GetUnicodeStringAt(SkipUnicodeString(24)); } }
        public string MethodSignature { get { return GetUnicodeStringAt(SkipUnicodeString(SkipUnicodeString(24))); } }
        public int ClrInstanceID { get { if (Version >= 1) { return GetInt16At(SkipUnicodeString(SkipUnicodeString(SkipUnicodeString(24)))); } return 0; } }
        // ...

Up till now, tools using TraceEvent would parse the Microsoft-DotNETRuntime-Private provider using the ClrTraceEventParser which extracts all the values as signed Int{32|64}s rather than UInt{64|32} as the hand-written manifest defines. When we added the workaround to allow for sending the threading events from managed code, we started generating all the events based on the manifest. When tooling using TraceEvent gets this manifest, it will have unsigned values since that is what ClrEtwAll.man defines the event as.

Put simply, ClrTraceEventParser represented integer values as signed while the new generated manifest is claiming they're unsigned. User code that is expecting the former may end up getting called (since the GUID is now the same) with TraceEvent claiming the latter, which may cause "unable to cast" exceptions if you aren't expecting it.

I would classify this change as a short term fix. The generated code is technically correct but doesn't agree with the historical parser which may cause ecosystem issues. A longer-term fix will require us to decide who is correct and whether we should modify the generated code to be like ClrTraceEventParser or vice versa.

For now, this change can keep existing parsers from breaking, while we make an ecosystem decision that comes with a tooling update to prevent breaking existing tools. I'm not sure that type of change should attempt to go in for .net6 servicing/GA (hence this change), but I could be convinced otherwise.

@noahfalk
Copy link
Member

noahfalk commented Oct 14, 2021

Thanks @josalem, if I follow correctly, both clretwall.man and the EventSource definition define the values as signed, but the hard coded event definition in TraceEvent defines it as unsigned. It also sounded like you are saying the TraceEvent library is no longer creating instances of MethodJittingStartedTraceData to represent these events, but is instead using a different derived class of TraceEvent, and that when user code queries the payload properties it gets an invalid cast trying to convert a uint to int. It would still be nice to fill in the details with a specific example (a few frames of a callstack for the relevant parsing code that leads to the exception) and whether it is TraceEvent itself or the user's code where the problematic cast is occuring.

In terms of resolving it, is there a reason to prefer changing this in the runtime vs. changing it in TraceEvent? Given how late we are in the release cycle its hard to imagine us changing 6.0 until a future servicing release a few months from now whereas an update to TraceEvent could probably be made available before before 6.0 RTMs. I may be lacking context, but a short term workaround in TraceEvent seems less risky and more expedient?

@josalem
Copy link
Contributor Author

josalem commented Oct 14, 2021

both clretwall.man and the EventSource definition define the values as signed, but the hard coded event definition in TraceEvent defines it as unsigned

The inverse of this. The manifest and the EventSource have the values defined at unsigned, while the TraceEvent parser has them defined as signed.

It also sounded like you are saying the TraceEvent library is no longer creating instances of MethodJittingStartedTraceData to represent these events, but is instead using a different derived class of TraceEvent

The behavior is that the same event will be parsed and typed differently depending on what parser you use.

If you register a DynamicParser.All or a DynamicParser.AddCallbackFroProviderEvent(ClrTraceEventParser.ProviderName, "MethodJittingStarted_V1", ...) callback, you be handed an event parsed with unsigned integer values. On the other hand, if you register a callback on ClrTraceEventParser.MethodJittingStarted, you will get an event parsed with signed integer values. Additionally, the name of the event shows up differently. Under ClrTraceEventParser, the name of MethodJittingStarted_V1, for instance, gets simplified to Method/JittingStarted, but other locations will show it as MethodJittingStarted_V1 (its name in the manifest and EventSource).

using System;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Parsers.Clr;
using Microsoft.Diagnostics.Tracing.Session;

using (TraceEventSession session = new TraceEventSession("MySession"))
{
    DynamicTraceEventParser dynamicParser = new DynamicTraceEventParser(session.Source);
    dynamicParser.AddCallbackForProviderEvent(ClrTraceEventParser.ProviderName, "MethodJittingStarted_V1", (data) =>
    {
        // signed
    });
    dynamicParser.All += delegate (TraceEvent data)
    {
        if (data.EventName.StartsWith("MethodJitting"))
        {
            // signed
        }
    };

    session.Source.Clr.MethodJittingStarted += delegate (MethodJittingStartedTraceData data)
    {
        // unsigned
    };
    session.EnableProvider(ClrTraceEventParser.ProviderGuid.ToString(), TraceEventLevel.Verbose, (ulong)ClrTraceEventParser.Keywords.Jit);
    session.Source.Process();
}

It's worth noting that the TraceEvent and MethodJittingStartedTraceData types box the payload values as object, so you need to cast them back to the appropriate types. This is where issues in tool and customer code can arise. For example, if you wrote code that special cased events from the runtime provider that attempts to cast the data to the types in the static manifest, but your code gets invoked in the dynamic callback, then you can incur cast failure exceptions, e.g., if you attempt to cast to Int64 in the Dynamic* callbacks above because you recognized the provider and event name.

Here is a sample output of some logging added to the sample above:

==============
(e13c0d23-ccbc-4e12-931b-d9cc2eee27e4) Microsoft-Windows-DotNETRuntime/Method/JittingStarted (145) (isClassicProvider: False)
Static: 1
	MethodID: type System.Int64
	ModuleID: type System.Int64
	MethodToken: type System.Int32
	MethodILSize: type System.Int32
	MethodNamespace: type System.String
	MethodName: type System.String
	MethodSignature: type System.String
	ClrInstanceID: type System.Int32
   at Program.<>c__DisplayClass0_0.<<Main>$>b__2(MethodJittingStartedTraceData data) in C:\Users\josalem\scratch\TestApps\EtwBehavior\Program.cs:line 69
   at Microsoft.Diagnostics.Tracing.Parsers.Clr.MethodJittingStartedTraceData.Dispatch()
   at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.DoDispatch(TraceEvent anEvent)
   at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.Dispatch(TraceEvent anEvent)
   at Microsoft.Diagnostics.Tracing.ETWTraceEventSource.RawDispatch(EVENT_RECORD* rawData)
   at Microsoft.Diagnostics.Tracing.TraceEventNativeMethods.ProcessTrace(UInt64[] handleArray, UInt32 handleCount, IntPtr StartTime, IntPtr EndTime)
   at Microsoft.Diagnostics.Tracing.TraceEventNativeMethods.ProcessTrace(UInt64[] handleArray, UInt32 handleCount, IntPtr StartTime, IntPtr EndTime)
   at Microsoft.Diagnostics.Tracing.ETWTraceEventSource.ProcessOneFile()
   at Microsoft.Diagnostics.Tracing.ETWTraceEventSource.Process()
   at Program.<Main>$(String[] args) in C:\Users\josalem\scratch\TestApps\EtwBehavior\Program.cs:line 75

==============
(e13c0d23-ccbc-4e12-931b-d9cc2eee27e4) Microsoft-Windows-DotNETRuntime/MethodJittingStarted_V1 (145) (isClassicProvider: False)
Dynamic: 1
	MethodID: type System.UInt64
	ModuleID: type System.UInt64
	MethodToken: type System.UInt32
	MethodILSize: type System.UInt32
	MethodNamespace: type System.String
	MethodName: type System.String
	MethodSignature: type System.String
	ClrInstanceID: type System.UInt16
   at Program.<>c__DisplayClass0_0.<<Main>$>b__0(TraceEvent data) in C:\Users\josalem\scratch\TestApps\EtwBehavior\Program.cs:line 31
   at Microsoft.Diagnostics.Tracing.Parsers.DynamicTraceEventData.Dispatch()
   at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.DoDispatch(TraceEvent anEvent)
   at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.Dispatch(TraceEvent anEvent)
   at Microsoft.Diagnostics.Tracing.ETWTraceEventSource.RawDispatch(EVENT_RECORD* rawData)
   at Microsoft.Diagnostics.Tracing.TraceEventNativeMethods.ProcessTrace(UInt64[] handleArray, UInt32 handleCount, IntPtr StartTime, IntPtr EndTime)
   at Microsoft.Diagnostics.Tracing.TraceEventNativeMethods.ProcessTrace(UInt64[] handleArray, UInt32 handleCount, IntPtr StartTime, IntPtr EndTime)
   at Microsoft.Diagnostics.Tracing.ETWTraceEventSource.ProcessOneFile()
   at Microsoft.Diagnostics.Tracing.ETWTraceEventSource.Process()
   at Program.<Main>$(String[] args) in C:\Users\josalem\scratch\TestApps\EtwBehavior\Program.cs:line 75

==============
(e13c0d23-ccbc-4e12-931b-d9cc2eee27e4) Microsoft-Windows-DotNETRuntime/MethodJittingStarted_V1 (145) (isClassicProvider: False)
DynamicAll: 1
	MethodID: type System.UInt64
	ModuleID: type System.UInt64
	MethodToken: type System.UInt32
	MethodILSize: type System.UInt32
	MethodNamespace: type System.String
	MethodName: type System.String
	MethodSignature: type System.String
	ClrInstanceID: type System.UInt16
   at Program.<>c__DisplayClass0_0.<<Main>$>b__1(TraceEvent data) in C:\Users\josalem\scratch\TestApps\EtwBehavior\Program.cs:line 50
   at Microsoft.Diagnostics.Tracing.Parsers.DynamicTraceEventData.Dispatch()
   at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.DoDispatch(TraceEvent anEvent)
   at Microsoft.Diagnostics.Tracing.TraceEventDispatcher.Dispatch(TraceEvent anEvent)
   at Microsoft.Diagnostics.Tracing.ETWTraceEventSource.RawDispatch(EVENT_RECORD* rawData)
   at Microsoft.Diagnostics.Tracing.TraceEventNativeMethods.ProcessTrace(UInt64[] handleArray, UInt32 handleCount, IntPtr StartTime, IntPtr EndTime)
   at Microsoft.Diagnostics.Tracing.TraceEventNativeMethods.ProcessTrace(UInt64[] handleArray, UInt32 handleCount, IntPtr StartTime, IntPtr EndTime)
   at Microsoft.Diagnostics.Tracing.ETWTraceEventSource.ProcessOneFile()
   at Microsoft.Diagnostics.Tracing.ETWTraceEventSource.Process()
   at Program.<Main>$(String[] args) in C:\Users\josalem\scratch\TestApps\EtwBehavior\Program.cs:line 75

Now it's worth noting that I ran this on .net5 and .net6. I observed the same behavior on both runtimes. When this was reported to me, it was reported as a regression in .net6, but I'm inclined to believe otherwise now. I'll loop back with the folks who encountered the issue and verify. This patch was verified to have fixed their issue when they ran their tests on top of it (hence the PR), but I think this is shaping up to be an issue on the TraceEvent side rather than something we should "fix" in the runtime. Let's hold off on merging this till I can get some confirmation/more clarity.

@josalem josalem marked this pull request as draft October 14, 2021 18:31
@benmwatson
Copy link
Member

I'm the one who reported this. I can definitely say that the observed behavior did not happen on .net 5. I can literally just change the , with no other changes, to .net 6, and our event parsing logic breaks.

Now, I can certainly work around this. Our code is hooking both the dynamic parser and the CLR parser. I can ignore the CLR event source when it starts coming through the dynamic event handler. And I can change our parsing logic to be more robust and handle either Int64/UInt64.

But this does seem like an unintended change, and who knows what else it will break.

@ghost ghost closed this Nov 13, 2021
@ghost
Copy link

ghost commented Nov 13, 2021

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2021
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants