
fix: Implement a max number of metric values dispatched per frame #1287


Merged: 13 commits merged into develop from fix/implement-max-metrics-per-frame on Oct 18, 2021

Conversation

@bendoyon (Contributor) commented Oct 12, 2021

Introduces a maximum number of metrics that can be recorded in a given frame. This is to avoid performance issues when a high volume of metrics is tracked. This is a band-aid fix for an issue that was reported in Slack.

https://unity.slack.com/archives/C01P4R0S2QH/p1634071340408400?thread_ts=1634052675.376600&cid=C01P4R0S2QH

Changelog

N/A: addresses a regression within the current release.

Testing and Documentation

  • Updates existing tests
  • No documentation changes or additions were necessary.

@bendoyon bendoyon changed the title Implement a max number of metric values dispatched per frame fix: Implement a max number of metric values dispatched per frame Oct 13, 2021
@bendoyon bendoyon marked this pull request as ready for review October 13, 2021 13:29
Benoit Doyon added 2 commits October 13, 2021 09:36
…y-Technologies/com.unity.multiplayer.mlapi into fix/implement-max-metrics-per-frame
Benoit Doyon added 2 commits October 15, 2021 14:56
# Conflicts:
#	com.unity.netcode.gameobjects/Tests/Runtime/Metrics/MetricsDispatchTests.cs
@andrews-unity (Contributor)

Please update the description of this PR using the PR template.

So if I am reaching the max number of events in a given frame, what is the likelihood I am going to reach the max in the next frame? Within a couple of frames? I ask because the LogWarning seems like a bit much to me: it's not actionable. If I, as a dev, happen to send a lot of messages (or whatever) that generate a lot of tracking events in SDK code, there isn't much I can do beyond seeing the warning. So my question is, do we need the log? I thought the profiler could provide UI that gives little signals around bad things; for example, in the CPU profiler you can see a warning symbol. Is there any reason we don't just push something to the profiler UI, where this information actually matters?

@becksebenius-unity (Contributor)

> Please update the description of this PR using the PR template.
>
> So if I am reaching the max number of events in a given frame, what is the likelihood I am going to reach the max in the next frame? Within a couple of frames? I ask because the LogWarning seems like a bit much to me: it's not actionable. If I, as a dev, happen to send a lot of messages (or whatever) that generate a lot of tracking events in SDK code, there isn't much I can do beyond seeing the warning. So my question is, do we need the log? I thought the profiler could provide UI that gives little signals around bad things; for example, in the CPU profiler you can see a warning symbol. Is there any reason we don't just push something to the profiler UI, where this information actually matters?

The profiler doesn't have anything built in for that; each of the profiler modules is pretty much custom. Ideally we would show this in the UI, but this issue was discovered late this week and we put this in as a band-aid for cases where there is a surge of metrics in a single frame.

Regarding the logs, thinking about it now, it is probably better to omit it for now to reduce the risk of creating spam. I think the likelihood is low, but that's an assumption. If we remove the log, we'll want to urgently follow up with some UX to tell the user that their metrics may be incomplete.

@andrews-unity (Contributor)

> > Please update the description of this PR using the PR template.
> >
> > So if I am reaching the max number of events in a given frame, what is the likelihood I am going to reach the max in the next frame? Within a couple of frames? I ask because the LogWarning seems like a bit much to me: it's not actionable. If I, as a dev, happen to send a lot of messages (or whatever) that generate a lot of tracking events in SDK code, there isn't much I can do beyond seeing the warning. So my question is, do we need the log? I thought the profiler could provide UI that gives little signals around bad things; for example, in the CPU profiler you can see a warning symbol. Is there any reason we don't just push something to the profiler UI, where this information actually matters?
>
> The profiler doesn't have anything built in for that; each of the profiler modules is pretty much custom. Ideally we would show this in the UI, but this issue was discovered late this week and we put this in as a band-aid for cases where there is a surge of metrics in a single frame.
>
> Regarding the logs, thinking about it now, it is probably better to omit it for now to reduce the risk of creating spam. I think the likelihood is low, but that's an assumption. If we remove the log, we'll want to urgently follow up with some UX to tell the user that their metrics may be incomplete.

Yeah, I am thinking about the Warning column in the Hierarchy view in the CPU profiler. But yeah, you are right, it is specific to that module as you said.

@andrews-unity (Contributor) commented Oct 17, 2021

I read the Slack message and I am trying to understand how this is related to that issue. And if so, is the issue that we just have a high message volume and profiler things just break? Or that we just don't want to track a high volume of messages?

I am also trying to understand why this check has to be done on the SDK side. Why isn't this an implementation detail of an IMetricDispatcher? It just seems odd to me that the SDK should care about this.

With that said, it does seem that many things get tracked in Dictionaries/Lists within the metrics tooling. Is there any reason we aren't just using some sort of fixed-size RingBuffer that kicks out the oldest message? This change would essentially drop the newest messages, but if I am getting spammed I am wondering if I just want the latest events and not the oldest ones.
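For reference, a rough sketch of the fixed-size ring-buffer idea suggested here, which keeps the newest entries by overwriting the oldest when full. `MetricRingBuffer` and its members are hypothetical names, not the existing NetStats API:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical fixed-capacity ring buffer that overwrites the oldest entry
// when full, so a surge of metrics keeps the newest events rather than the oldest.
// Illustrative only; this is not the existing NetStats implementation.
public class MetricRingBuffer<T>
{
    private readonly T[] m_Buffer;
    private int m_Head;   // index of the next write
    private int m_Count;  // number of valid entries (<= capacity)

    public MetricRingBuffer(int capacity)
    {
        if (capacity <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(capacity));
        }
        m_Buffer = new T[capacity];
    }

    public int Count => m_Count;

    public void Add(T item)
    {
        // When full, this overwrites the oldest entry instead of rejecting the new one.
        m_Buffer[m_Head] = item;
        m_Head = (m_Head + 1) % m_Buffer.Length;
        if (m_Count < m_Buffer.Length)
        {
            m_Count++;
        }
    }

    // Enumerates entries from oldest to newest.
    public IEnumerable<T> Items()
    {
        int start = (m_Head - m_Count + m_Buffer.Length) % m_Buffer.Length;
        for (int i = 0; i < m_Count; i++)
        {
            yield return m_Buffer[(start + i) % m_Buffer.Length];
        }
    }
}
```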

@bendoyon (Contributor, Author)

> I read the Slack message and I am trying to understand how this is related to that issue. And if so, is the issue that we just have a high message volume and profiler things just break? Or that we just don't want to track a high volume of messages?
>
> I am also trying to understand why this check has to be done on the SDK side. Why isn't this an implementation detail of an IMetricDispatcher? It just seems odd to me that the SDK should care about this.
>
> With that said, it does seem that many things get tracked in Dictionaries/Lists within the metrics tooling. Is there any reason we aren't just using some sort of fixed-size RingBuffer that kicks out the oldest message? This change would essentially drop the newest messages, but if I am getting spammed I am wondering if I just want the latest events and not the oldest ones.

Yes, this is a "day before shipping" fix, and not the final solution. We want a real mechanism to do this and test it well in NetStats, but this is the quickest way to work around the issue for now.

@andrews-unity (Contributor)

> > I read the Slack message and I am trying to understand how this is related to that issue. And if so, is the issue that we just have a high message volume and profiler things just break? Or that we just don't want to track a high volume of messages?
> >
> > I am also trying to understand why this check has to be done on the SDK side. Why isn't this an implementation detail of an IMetricDispatcher? It just seems odd to me that the SDK should care about this.
> >
> > With that said, it does seem that many things get tracked in Dictionaries/Lists within the metrics tooling. Is there any reason we aren't just using some sort of fixed-size RingBuffer that kicks out the oldest message? This change would essentially drop the newest messages, but if I am getting spammed I am wondering if I just want the latest events and not the oldest ones.
>
> Yes, this is a "day before shipping" fix, and not the final solution. We want a real mechanism to do this and test it well in NetStats, but this is the quickest way to work around the issue for now.

Just so I am clear... how is this issue related to the Slack link posted in the description? Was this causing some sort of crash? Or just that the metrics system was going crazy because of too much data?

@bendoyon (Contributor, Author)

> > > I read the Slack message and I am trying to understand how this is related to that issue. And if so, is the issue that we just have a high message volume and profiler things just break? Or that we just don't want to track a high volume of messages?
> > >
> > > I am also trying to understand why this check has to be done on the SDK side. Why isn't this an implementation detail of an IMetricDispatcher? It just seems odd to me that the SDK should care about this.
> > >
> > > With that said, it does seem that many things get tracked in Dictionaries/Lists within the metrics tooling. Is there any reason we aren't just using some sort of fixed-size RingBuffer that kicks out the oldest message? This change would essentially drop the newest messages, but if I am getting spammed I am wondering if I just want the latest events and not the oldest ones.
> >
> > Yes, this is a "day before shipping" fix, and not the final solution. We want a real mechanism to do this and test it well in NetStats, but this is the quickest way to work around the issue for now.
>
> Just so I am clear... how is this issue related to the Slack link posted in the description? Was this causing some sort of crash? Or just that the metrics system was going crazy because of too much data?

If you pause execution when profiling a client, then when you resume it tries to catch up on all the frames it missed, which means a burst of metrics. Because there are so many metrics to catch up on, there's a hitch, and then it has to catch up on the metrics from that hitch too. Basically, it can never catch up to a normal state.

This change prevents that by skipping further metrics for the rest of the frame once a certain number have been recorded (1000 in this case) and then resuming normal execution on the next frame.
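For illustration, here is a minimal sketch of that behavior: a counter caps how many metric values get recorded in a frame and resets when the frame is dispatched. The type and member names (`FrameMetricCollector`, `MetricValue`, `DispatchFrame`) are hypothetical and do not match the actual SDK/NetStats code:

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for whatever value type the metrics system records.
public struct MetricValue
{
    public string Name;
    public long Value;
}

// Rough sketch of the per-frame cap described above. Illustrative only.
public class FrameMetricCollector
{
    // Cap matching the number mentioned in this discussion.
    private const int k_MaxMetricValuesPerFrame = 1000;

    private readonly List<MetricValue> m_Values = new List<MetricValue>();
    private int m_RecordedThisFrame;

    public void Record(MetricValue value)
    {
        // Once the cap is reached, drop the remaining values for this frame.
        if (m_RecordedThisFrame >= k_MaxMetricValuesPerFrame)
        {
            return;
        }

        m_RecordedThisFrame++;
        m_Values.Add(value);
    }

    public void DispatchFrame()
    {
        // Hand the collected values off to the dispatcher/observers here, then
        // reset the counter so the next frame records normally again.
        m_Values.Clear();
        m_RecordedThisFrame = 0;
    }
}
```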

@andrews-unity (Contributor)

Thanks for the context... do you have a ticket created for the proper fix somewhere?

@bendoyon (Contributor, Author)

> Thanks for the context... do you have a ticket created for the proper fix somewhere?

Not a lot of details in the ticket, but we created this to track the tech debt: https://jira.unity3d.com/browse/MTT-1420

@andrews-unity andrews-unity merged commit 9570c47 into develop Oct 18, 2021
@andrews-unity andrews-unity deleted the fix/implement-max-metrics-per-frame branch October 18, 2021 15:15
mollstam pushed a commit to Keepsake-Games/com.unity.netcode.gameobjects that referenced this pull request Feb 13, 2023
…ity-Technologies#1287)

* Implement a max number of metric values dispatched per frame