Improve total L2 size calculation logic on osx-arm64 #75881
Conversation
I am not sure we're interested in total size. The L3 size that we use to calculate the gen0 budget is expected to be that of a single chip, i.e. per core group or a unified one. It's likely that we already return the total on other platforms, but we'd better fix those IMO. The idea, as I understand it, is to be able to fit the whole gen0 into a single piece of cache memory in order to walk it efficiently (especially on Workstation GC, since macOS is unlikely to use Server GC).
Core groups within the performance cluster on M1 chips (sorry, there's no official terminology, so it's confusing) work mostly, as far as I understand, as power/clock domains. Where they don't share L2, regular x86 parts don't do that either; within the same cluster they are similar to a single CCX of an AMD CPU rather than two different ones. This is mostly to account for the fact that no L3 data is available on this platform. My understanding is that other code paths already calculate total L2 size without even accounting for whether it's shared or not if the L3 number is unavailable, so I think this will report numbers closer to the x86_64 counterparts. If you have any benchmarks on hand, or other links to look into to gauge the difference pre and post this change, please let me know! As for which size of L2 is observable to CPU cores with low latency, it is unclear and reverse-engineered data varies. See: https://www.realworldtech.com/forum/?threadid=205277&curpostid=205283 and https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2 p.s.: fun cursed fact: performance cores on M1 have variable sizes of their respective L2 slices; M1 is 5-1-3-3 and M1 Pro/Max is 1-5-5-1 and 3-3-3-3 megabytes of L2$.
You can try these: #64576. And could you also measure the working-set size difference? Presumably it will be higher with this change.
Hi @neon-sunset, is this an experimental PR? We could mark it as a draft. Looks like you are still running some perf tests to check what the impact is here?
Yes, I haven't been able to work on it recently so if you could mark it as draft - please do and I will change it back to ready for review once there is data available to back up (or not) the suggested change. Thanks! |
Draft Pull Request was automatically closed after 30 days of inactivity. Please let us know if you'd like to reopen it.
While looking at #75854 and double-checking `sysctl` behavior, I've noticed that on M1 Pro the actual value reported for `hw.perflevel0.l2cachesize` is `12582912` instead of roughly 24 MB, which is its actual L2 cache size. However, `sysctl` has another key, `hw.perflevel0.cpusperl2`, which allows us to calculate the total size of L2 across all performance cores.