Initialize values for yield normalization during startup #99991
Closed. eduardo-vp wants to merge 7 commits into dotnet:main from eduardo-vp:add-yield-normalization-initialization.
Commits:
cdebbfe Add initialization for yield normalization (eduardo-vp)
acd40a5 Fix code (eduardo-vp)
b37ffcd Add yieldprocessornormalized.h (eduardo-vp)
1a90ad8 Comment changes in ceemain.cpp to see effect on pipelines (eduardo-vp)
13c0161 Revert "Comment changes in ceemain.cpp to see effect on pipelines" (eduardo-vp)
ea46a91 Check IsGCHeapInitialized (eduardo-vp)
c3ffb0a Rename method (eduardo-vp)
This is somewhat pre-existing to the PR here, but I think there are some latent bugs in the general measurement logic.

Notably, Linux always reports 1 billion here for the frequency (1 nanosecond ticks), even if the underlying hardware timer is actually much slower.

Additionally, there are places below where we try to measure the length of time between two `QPC` calls, and while `QPC` is guaranteed to be monotonic (never decreasing), there is no guarantee that two separate calls will not return the same value (which can occur if they happen faster than the underlying timer resolution). A simple program that compares consecutive timestamp reads shows this on most modern hardware.

Additionally, on some hardware the underlying mechanism used by `QPC` could have a resolution as low as 10-20 cycles, or it could have a latency of nearly 100-300 ns (detailed in part here: https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps). Meanwhile the `pause` instruction (x86/x64) can execute in somewhere around 10-20 cycles, 40 cycles, or up to 150 cycles (depending on the hardware in question, with newer and higher-core-count machines often explicitly having longer delay times). All of these cycle times could be much less than the 100-300 ns latency a QPC call might have, which can lead to skewed measurements.

We've also seen that `yield` on Arm64 is known to be a hint which can often be ignored or is so cheap that many other ecosystems have switched to using the `isb` instruction instead, to achieve an effect similar to the x86/x64 `pause` instruction.

It might be good to take a deeper look at what this code is actually doing and how accurate it actually is, compared to mechanisms known to more accurately measure actual instruction latency in a platform-specific manner.
`Stopwatch.Frequency` also seems to report similarly. The tick count reports ticks in nanoseconds. I don't think this affects the measurements; the reported frequency just doesn't indicate the actual underlying frequency of the clock. The actual frequency of the clock also does not necessarily reflect the total overhead involved in getting the ticks.

I don't see any assumption being made in this code that the difference wouldn't be zero.

Yes, that's why the QPC overhead is measured and the measurement duration is adjusted (with some reasonable limits). Previously the measurement was taken over a 10 ms interval, but that was too long, and taking multiple shorter measurements as currently done showed similar results even on systems with higher QPC overhead.

The normalization should be able to figure that out and compensate by using more iterations. `isb` may be more useful when there is no normalization, to induce a more similar delay. `isb` appears to have other side effects (involving an instruction barrier) that may be unnecessary for spin-waiting.

Let me know if you see any specific issues, I can take a look.
For instance, suppose the actual clock resolution is 1 us or higher (quite high, though perhaps possible). Each measurement would then typically take less than that resolution, with some reported durations coming out shorter and some longer. The measurement chosen for normalization favors a lower ns-per-yield, which would come from the longer measurements and would be a bit closer to the actual delay.
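A rough sketch of the scheme described above: take several short measurements and keep the lowest observed ns-per-yield, which favors the longer, more accurate samples. The names and the dummy yield here are illustrative, not the runtime's actual code:

```cpp
#include <algorithm>
#include <chrono>

static volatile unsigned g_sink;                 // defeats optimizing the loop away
static inline void FakeYield() { g_sink += 1; }  // stand-in for pause/yield

// Take several short measurements and keep the minimum observed ns-per-yield.
// With a coarse clock, short samples tend to over-report per-yield time; the
// minimum comes from the longer samples and sits closer to the true delay.
double MinNsPerYield(int measurements, int yieldsPerMeasurement) {
    using clock = std::chrono::steady_clock;
    double best = 1e9;
    for (int m = 0; m < measurements; ++m) {
        auto start = clock::now();
        for (int i = 0; i < yieldsPerMeasurement; ++i)
            FakeYield();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      clock::now() - start).count();
        if (ns > 0)
            best = std::min(best, (double)ns / yieldsPerMeasurement);
    }
    return best;
}
```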
That's because `Stopwatch.Frequency` just calls into `QPF`, which is implemented in the PAL as a call to the underlying `CLOCK_MONOTONIC` (or similar), which always reports 1 ns ticks.

The general consideration here is that we have logic such as this: https://github.com/dotnet/runtime/blob/cf053aba5c42ad94fe696238ff40f28ec82c7362/src/coreclr/vm/yieldprocessornormalized.cpp#L148C13-L153C14

where the first check (did `QPF` fail) cannot be triggered on Windows, and where the latter check (`li.QuadPart < 1_000_000`) cannot be triggered on Linux. Given the intent of the check, the general logic here is "flawed" and isn't actually accounting for whether normalization can be properly done. This is particularly relevant as we're much more likely to encounter a low-clock-resolution device on Linux, where embedded and low-end/legacy device support is more prevalent. I think the fact that we aren't explicitly checking for and handling that is the problem.
This call is trying to detect overhead, and it's registering it as 0:

runtime/src/coreclr/vm/yieldprocessornormalized.cpp, lines 48 to 56 in cf053ab

This results in a `MeasureDuration` of 1 us, without actually confirming, for example, that 1 us is a reasonable measurement duration. This value is passed into `MeasureNsPerYield`: https://github.com/dotnet/runtime/blob/cf053aba5c42ad94fe696238ff40f28ec82c7362/src/coreclr/vm/yieldprocessornormalized.cpp#L59C46-L59C63. `yieldCount` will on many machines simply be 28 and `measureDurationTicks` will be 10.

On a machine like the Ryzen 7950X, where `pause` is documented to be 64 cycles and therefore to take ~14 ns to execute (base clock of 4.5 GHz, can boost up to 5.7 GHz), this is generally sufficient and gets roughly the right times. However, I've seen it compute anywhere up to 30 ns per `pause` and down to 10 ns, such as when the core it happens to measure is in a low-power state or has just finished a boost state. This variance can cause quite a bit of noise in some benchmarks, or even just in general app execution.

We then have some logic just below that which tries to account for inaccurate measurements:

runtime/src/coreclr/vm/yieldprocessornormalized.cpp, lines 103 to 110 in cf053ab
However, the numbers are a bit odd: there is no such thing as a 10 GHz processor, so 0.1 as a `MinNsPerYield` is a completely unrealistic value. Beyond that, a more typical processor is around 2.5 GHz (often more for desktop or workstation, sometimes less for mobile), so a reasonable yield time (based on actual known data) would likely be at a minimum of around 10 cycles, giving around 2-4 ns per yield as a reasonable minimum.

The main consideration is that if the underlying `yield`/`pause` instruction is truly a no-op, then we likely should be falling back to an alternative sequence that better matches the actual intent of the spin loop. If we're literally just executing what is functionally an empty `while()` loop, that's not great for the system.

There are a lot of cases, particularly for Arm64 systems, where we're seeing tons of context switches compared to an equivalent x64 machine. We're also spending far more time executing instructions in places like the underlying `Semaphore` spin loop used by the thread pool. So not only are we spinning without letting the processor acknowledge the potential for a proper lower-power state, but we're also doing a lot of potentially expensive work on a core that could be getting spent elsewhere.
For example, on my Volterra it's measuring around 0.5-5 ns per yield, with some of those measurements being done on big cores and some on LITTLE cores, which means that separate measurements of `yield` could be problematic. If you just spin running `MeasureNsPerYield` forever, measuring the min and max seen over time and the processor ID it's running on, the really high numbers (89 and 173-182) turn out to be caused by OS context switches. Similar happens on the 7950X.
I think these unaccounted-for nuances are causing us issues, especially for Arm64 on Linux, and are part of why the retired instruction counts, the number of context switches (which are currently more expensive for Arm64), and other metrics are so much higher.

I would imagine some of them would also cause issues for things like Intel Alder Lake CPUs, where you have Performance and Efficiency cores with very different cycle times and base/max clock speeds.

For example, the same code running on WSL or an actual direct boot to Linux (Ubuntu 22.04) reports notably different numbers for the Ryzen 7950X, and likewise for Volterra. So the same code on the same hardware is reporting significantly different results on Linux, and for Arm64 it isn't measuring something that really achieves what we want, as it's taking so little time.
@tannergooding, can you file an issue? We can continue the discussion there.
Logged #100242