
Conversation

@neon-sunset
Contributor

While looking at #75854 and double-checking sysctl behavior, I noticed that on an M1 Pro the value reported for hw.perflevel0.l2cachesize is 12582912 (12MB, the per-cluster value) rather than roughly 24MB, which is the actual total L2 cache size of its performance cores.

However, sysctl exposes another key, hw.perflevel0.cpusperl2, which lets us calculate the total L2 size across all performance cores.

macOS 13.0 22A5342f arm64 | M1 Pro 2E+6P

hw.perflevel0.physicalcpu: 6
hw.perflevel0.physicalcpu_max: 6
hw.perflevel0.logicalcpu: 6
hw.perflevel0.logicalcpu_max: 6
hw.perflevel0.l1icachesize: 196608
hw.perflevel0.l1dcachesize: 131072
hw.perflevel0.l2cachesize: 12582912
hw.perflevel0.cpusperl2: 3
hw.perflevel0.name: Performance
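
For reference, a minimal sketch of how these keys could be combined (assumptions: the per-cluster hw.perflevel0.l2cachesize value scales by physicalcpu / cpusperl2, and the helper name below is illustrative; this is not the actual PR diff):

```c
// Sketch only: derive the total performance-core L2 size from the
// per-cluster value reported by hw.perflevel0.l2cachesize.
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

static int64_t ReadSysctlInt64(const char* name)
{
    // Zero-initialized 8-byte buffer: if the key is a 4-byte int, the
    // little-endian read below still yields the right value.
    int64_t value = 0;
    size_t len = sizeof(value);
    if (sysctlbyname(name, &value, &len, NULL, 0) != 0)
        return -1; // key unavailable (e.g. not on Apple Silicon)
    return value;
}

int main(void)
{
    int64_t l2PerCluster = ReadSysctlInt64("hw.perflevel0.l2cachesize"); // 12582912 on M1 Pro
    int64_t cpusPerL2    = ReadSysctlInt64("hw.perflevel0.cpusperl2");   // 3
    int64_t physicalCpu  = ReadSysctlInt64("hw.perflevel0.physicalcpu"); // 6

    if (l2PerCluster > 0 && cpusPerL2 > 0 && physicalCpu > 0)
    {
        // 12582912 * (6 / 3) = 25165824 bytes (~24MB) on an M1 Pro.
        int64_t totalL2 = l2PerCluster * (physicalCpu / cpusPerL2);
        printf("total performance-core L2: %lld bytes\n", (long long)totalL2);
    }
    return 0;
}
```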

@ghost ghost added community-contribution and area-PAL-coreclr labels Sep 20, 2022
@neon-sunset neon-sunset marked this pull request as ready for review September 20, 2022 11:50
@EgorBo
Member

EgorBo commented Sep 20, 2022

I am not sure we're interested in the total size; the L3 that we use to calculate the gen0 budget is expected to be a single cache, e.g. per core group or a unified one. It's likely that we already return the total on other platforms, but we'd better fix those too, IMO.

The idea, as I understand it, is to be able to fit the whole gen0 into a single piece of cache memory in order to walk it efficiently (especially with Workstation GC, since macOS is unlikely to use Server GC).
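
A purely hypothetical illustration of that concern, not the real GC code (ComputeGen0Budget and its inputs are made up for this comment):

```c
// Hypothetical sketch: if the reported "cache size" is actually the sum of
// several private L2 slices, a budget derived from it ends up larger than
// what a single core can walk without missing cache.
#include <stdint.h>

uint64_t ComputeGen0Budget(uint64_t reportedCacheSize, uint64_t minBudget)
{
    uint64_t budget = reportedCacheSize; // naive: assume gen0 fits in this cache
    return budget > minBudget ? budget : minBudget;
}
```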

@neon-sunset
Contributor Author

neon-sunset commented Sep 20, 2022

Core groups within the performance cluster on M1 chips (sorry, there's no official terminology, so this is confusing) mostly act, as far as I understand, as power/clock domains. While the groups don't share L2 with each other, regular x86 cores don't do that either. Being within the same cluster, they are more similar to a single CCX of an AMD CPU than to two separate ones.

This is mostly to account for the fact that no L3 data is available on osx-arm64. Even for accesses that miss L2, it appears that for now we can assume the SLC, which effectively works like a memory-side L3, is of similar or larger size than the total L2. The SLC on the M1 is 8MB (vs. 12MB of L2$), on the Pro it is 24MB, and on the Max it is 48MB.

My understanding is that other code paths already calculate the total L2 size when no L3 number is available, without even accounting for whether it is shared or not, so I think this will report numbers closer to the x86_64 counterparts.
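
Roughly the fallback shape being described, as a hypothetical sketch (names are made up; the real logic lives in the PAL/GC cache-size query):

```c
// Hypothetical sketch of the described fallback, not the actual PAL code:
// prefer a unified L3 size, and only fall back to the summed L2 when no L3
// size is reported (as on osx-arm64).
#include <stddef.h>

size_t ChooseGCCacheSize(size_t l3Size, size_t totalL2Size)
{
    if (l3Size != 0)
        return l3Size;      // unified L3 reported (typical on x86_64)
    return totalL2Size;     // no L3 key: use the summed per-cluster L2
}
```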

If you have any benchmarks on hand, or other links to look into, to gauge the difference pre- and post-change, please let me know!

As for how much of the L2 is observable to a single CPU core at low latency, it is unclear, and the reverse-engineered data varies.

See: https://www.realworldtech.com/forum/?threadid=205277&curpostid=205283 and https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2

P.S. Fun cursed fact: performance cores on M1 have variable sizes for their respective L2 slices; the M1 is 5-1-3-3 and the M1 Pro/Max are 1-5-5-1 and 3-3-3-3 megabytes of L2$.

@EgorBo
Member

EgorBo commented Sep 20, 2022

If you have any benchmarks on hand, or other links to look into, to gauge the difference pre- and post-change, please let me know!

You can try these #64576

And could you also measure the working set size difference? Presumably it will be higher with this change.

@mangod9
Member

mangod9 commented Oct 24, 2022

Hi @neon-sunset, is this an experimental PR? We could mark it as a draft. It looks like you are still running some perf tests to check what the impact is here?

@neon-sunset neon-sunset reopened this Oct 24, 2022
@neon-sunset
Contributor Author

neon-sunset commented Oct 24, 2022

Hi @neon-sunset, is this an experimental PR? We could mark it as a draft. It looks like you are still running some perf tests to check what the impact is here?

Yes, I haven't been able to work on it recently, so if you could mark it as a draft, please do, and I will change it back to ready for review once there is data available to back up (or not) the suggested change. Thanks!

@ghost

ghost commented Nov 23, 2022

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@ghost ghost closed this Nov 23, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 23, 2022
This pull request was closed.