Hetzner Cloud CCX33 instance machine details #189
Happy to, if I can. Any specific commands I should run whose output you'd like to see? |
Not sure what is best on Linux. If it's possible from inside a VM, I'd like to know the specific CPU core/architecture (Zen 3?), the cache configuration (L1/L2 etc.), frequency, memory configuration (channels, clock, bandwidth), and the supported ISA extensions (AVX, AVX-512?). |
It would be great if we could see the output of |
Here it is. I.e. EPYC-Milan:
Ah, that's interesting, where did you get that info from? Might explain the much better numbers which @ebarlas reported after setting up his own CCX33 instance (in another, newer Hetzner DC). |
@gunnarmorling
|
Ah, I see. Seems there's no way to find out which one you'd get? Unless
it's unique per DC. Kinda bizarre, definitely an interesting learning for
me from this challenge :)
|
Which DC did you rent from? |
It seems that Milan has no AVX-512 while Genoa does. Wikipedia claims that AVX-512 on EPYC is available on Zen 4 and later. Milan is Zen 3 and Genoa is Zen 4. |
AVX-512 is officially supported by Zen 4, look e.g. here: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf (look for 'avx-512' mentions). Zen 3 doesn't support AVX-512, but only AVX2 and below. Presence of AVX-512 will probably affect performance of all vectorized code (autovectorized and/or manually vectorized using Vector API from Project Panama). If you have AVX-512-capable CPU then you can measure the difference by running JVM with -XX:UseAVX=2 (or something like that) to limit the AVX level used by JVM (IIRC original AVX=1, AVX2=2, AVX-512=3). |
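For anyone who wants to check which AVX level their own box reports before experimenting with the JVM flag mentioned above, here is a minimal sketch. It assumes an x86 Linux machine where `/proc/cpuinfo` has a `flags` line; the function name `avx_level` is made up for illustration, and the 1/2/3 numbering simply mirrors the levels recalled above, it is not taken from JVM documentation.

```python
def avx_level(cpuinfo_text):
    """Return 0 (no AVX), 1 (AVX), 2 (AVX2) or 3 (AVX-512F).

    Assumption: the input is the text of /proc/cpuinfo on x86 Linux,
    where each logical CPU has a line like "flags : fpu vme ... avx2".
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Keep everything after the first colon and split on whitespace.
            flags = set(line.partition(":")[2].split())
            break  # All logical CPUs normally report the same flags.
    if "avx512f" in flags:
        return 3
    if "avx2" in flags:
        return 2
    if "avx" in flags:
        return 1
    return 0
```

On a Linux host this could be called as `avx_level(open("/proc/cpuinfo").read())`. Inside a VM the result reflects what the hypervisor exposes, which may be less than the physical CPU supports.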
Ah, that's great insight, thx for sharing! I'm a bit blocked right now with evaluations: it seems my instance got moved to a different host, as I'm observing substantially different (read, better) numbers as of today, making any new measurements not comparable with previous runs. I've opened a ticket with Hetzner to see what's going on, but I might have to look for a more reliable alternative. |
So I am considering getting an AMD EPYC 7401P from the Hetzner Server Auction. That's Zen 1, i.e. I reckon slower per core, but then it has 24 cores :) Like Zen 3, it has AVX2. Numbers wouldn't be comparable of course, but once we've set up hyperfine, it shouldn't be a problem to run all entries again and update the leader board accordingly (apart from the overall absolute shift, there might be relative changes in case different contenders handle the increased core count differently). My biggest question there is around administering the thing (e.g. how to disable turbo-boost and SMT, which would be a good idea), as I'm not super-savvy when it comes to that. The alternative would be to re-run everything on the existing instance (which is much faster as of today, no idea why). But I don't feel very confident about that, as I can't be sure performance won't change again. I've also asked the community for help, let's see what comes out of it. Open for any help and suggestions of course :) |
Zen 1 has very high penalties for inter-chiplet communication (actually even inter-CCX communication). Zen 2 brought the central IOD (IO die) and made inter-chiplet communication much more robust and faster. If you're going for Zen, I would recommend at least Zen 2. There's Zen 4 available in the form of the AX52 server with a Ryzen 7 7700: https://www.hetzner.com/dedicated-rootserver/matrix-ax . It has a single CCX, so it should be easy to tune multithreading for that chip. The server finder https://www.hetzner.com/dedicated-rootserver shows it's "available in few minutes", but a direct link to the search results (in that server finder) somehow doesn't work. |
A dedicated machine is by far the most important thing here, with AVX2 support as a minimum. I am not too worried about the cache hierarchy, given the highly parallelizable problem and all solutions processing chunks per processor. However, Zen 1 has some issues with certain SIMD/AVX2 instructions (high latencies etc.). I'm not sure any of those are or will be used here, though, given the simple usages: load, cmp, movemask, lzcnt, etc. Disks are not important, given that the entire file is cached in memory. The more cores, the smaller the difference in efficiency, probably, since runs will be limited more by memory bandwidth/cache. I don't know how good Java's AVX-512 support is, but I would not see it as a requirement. It is also harder for most people to test locally, since many don't have dev machines with it. I don't, for example. |
I would rather take a Zen 3 consumer dedicated machine than a server based
simply on this better matching what developers have at their disposal, this
can quickly become a race for who has access to certain machines or
similar. It already is a diverse set of CPUs out there of course, though.
Thermal throttling due to boosting is an issue on most modern CPUs anyway.
Better to invest in more rigorous and statistically sound benchmarking. In
dotnet we would always use BenchmarkDotNet, and forget about process
start/stop.
…On Mon, Jan 8, 2024, 22:51 Gunnar Morling ***@***.***> wrote:
There's Zen 4 available in the form of AX52 server with Ryzen 7 7700
@shipilev <https://github.com/shipilev> recommended to use EPYC rather
than Ryzen; the reasoning is a bit above my pay grade, though :) There's
the AX161 <https://www.hetzner.com/dedicated-rootserver/ax161> EPYC 7502P
(Zen2), though a bit towards the pricier end. Oh the options...
|
I have a script that sets my CPU to max non-turbo frequency:
Not sure if it would work on others' machines. As for SMT: why disable it? If someone doesn't want it, then setting CPU affinity mask would amount to the same. Details: https://linux.die.net/man/1/taskset
As I said before, you can run the JVM with a capped AVX level. AVX-512 adds support for lane masking, and that could potentially allow more interesting programs. But, OTOH, Zen 4 has some bad AVX-512 instruction implementations. https://www.hwcooling.net/en/how-good-is-amds-avx-512-does-it-improve-zen-4-performance/ says:
As for Ryzen vs Epyc choice: the Ryzen servers that Hetzner provides can have ECC enabled for RAM. The configuration page https://www.hetzner.com/dedicated-rootserver/ax52/configurator says:
That should make the server reliable enough for lots of heavy workload hammering :) I'm not sure what improvements Epyc would bring on top of that. |
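On the point above that an affinity mask can stand in for disabling SMT: a rough sketch of picking one logical CPU per physical core and pinning a process to that set. It assumes Linux, where each CPU exposes its SMT siblings in `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the helper names are hypothetical, and `os.sched_setaffinity` is only available on Linux.

```python
import os


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0-3,8,10-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A bare number like "8" has no upper bound, so reuse the lower one.
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def one_thread_per_core(sibling_lists):
    """Pick the lowest-numbered logical CPU from each distinct sibling group."""
    return {min(parse_cpu_list(s)) for s in sibling_lists}


def pin_current_process(cpus):
    """Restrict the calling process (pid 0 = self) to the given CPUs. Linux only."""
    os.sched_setaffinity(0, cpus)
```

The affinity is inherited by child processes, which is the same effect `taskset` gives from the shell. Note that the sibling hardware threads stay online, so other processes can still be scheduled on them; that is one way this differs from actually switching SMT off.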
I think we shouldn't disable SMT. Some solutions scale perfectly well with hyperthreading (i.e. nearly 2x performance when run at 16 threads on an 8c16t PC). I think at the end of the contest, maybe the top 10 solutions could be selected and run again on a dedicated physical machine (no remote server), with turbo boost disabled and as few background processes running as possible. |
I agree on not disabling SMT. SMT was used before and workload scales fine
on it.
AX41-NVMe Zen2 3600 or AX52 Zen4 7700 both seem like good enough options.
I'd then exclude AVX-512 usage, mainly because of how few devs have access to
Zen 4.
…On Tue, Jan 9, 2024, 06:10 lehuyduc ***@***.***> wrote:
My biggest question there is around administering the thing (e.g. how to
disable turbo-boost and SMT, which would be a good idea), as I'm not
super-savvy when it comes to that.
I think we shouldn't disable SMT. Some solutions scale perfectly well with
hyperthreading (i.e nearly 2x performance when run at 16 threads on a 8c16t
PC).
|
Ok, so if we want to stick to Hetzner (which I'd prefer, so as to limit the search space somewhat), it seems AX52 (AMD Ryzen™ 7 7700, Zen 4) would be the best fit. I'm just not sure whether turbo boost can be disabled in that setting? But I'm also not sure how much that may skew results? CC @rschwietzke
Ha yeah, would love that. Just would need to get my hands on one :) |
I think turbo boost should be left on. Thermal throttling is part of the game and rigorous benchmarking will show this, e.g. high variation in results or similar. It's a dedicated DC server, it should have consistent and reliable cooling. Little fluctuation. Handle it by better benchmarking. Make it part of the challenge. If someone can make a single threaded solution at highest single (or fewer) threaded boost clock that is faster than (all hardware threads) multi threaded lower boost clock then that's game. |
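To make "high variation in results" concrete, here is a small sketch that summarizes repeated wall-clock timings and flags a noisy series. The function names and the 5% coefficient-of-variation threshold are arbitrary choices for illustration, not something agreed in this thread.

```python
import statistics


def summarize_runs(seconds):
    """Summarize repeated timings (in seconds) of the same contender."""
    mean = statistics.mean(seconds)
    # Sample standard deviation needs at least two data points.
    stdev = statistics.stdev(seconds) if len(seconds) > 1 else 0.0
    return {
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation, unitless
        "min": min(seconds),
        "max": max(seconds),
    }


def looks_unstable(seconds, max_cv=0.05):
    """Return True when run-to-run spread exceeds max_cv (hypothetical cutoff)."""
    return summarize_runs(seconds)["cv"] > max_cv
```

A series that trips the check would be a hint that throttling, boost behaviour, or a noisy neighbour is affecting the runs, and that more repetitions or a cool-down between runs may be needed before comparing contenders.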
I would say off, because it is not that much of a boost for "desktop" CPUs, and it might turn the execution order into a factor, as well as the CPU usage. Turbo often works only when just a few cores are active. Cloud machines (which are the deployment norm) don't have turbo modes at all. And yes, we can turn that off for AMD (I do that for my notebook sometimes). |
Yeah, the motivation for turning it off would be better comparability between different contenders, in particular you'd want to avoid one subsequent run to suffer from being throttled due to a previous run. I suppose one could kinda get on top of it by pausing in between, but that's more voodoo than anything else.
Do you do that in the BIOS or at OS level? Because I reckon the former isn't available with Hetzner dedicated host (if one only could try out before committing to it...). |
OS level. SYSCTL settings. |
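One way this can look at OS level on Linux, as a hedged sketch: the kernel exposes a global boost switch through sysfs, and which file exists depends on the active cpufreq driver. The two paths below are the ones I know of (the generic `cpufreq/boost` file and `intel_pstate/no_turbo`); whether either is present on a given Hetzner machine is an assumption, this is not necessarily the mechanism meant in the comment above, and writing to them needs root. The function names are made up.

```python
from pathlib import Path

# (path relative to the sysfs root, value that turns boost OFF)
BOOST_KNOBS = [
    ("devices/system/cpu/cpufreq/boost", "0"),          # generic cpufreq switch
    ("devices/system/cpu/intel_pstate/no_turbo", "1"),  # intel_pstate driver
]


def find_boost_knob(sys_root="/sys"):
    """Return (path, off_value) for the first knob that exists, else None."""
    for rel, off_value in BOOST_KNOBS:
        path = Path(sys_root) / rel
        if path.exists():
            return path, off_value
    return None


def disable_boost(sys_root="/sys"):
    """Write the 'off' value to the boost knob. Returns False if none was found."""
    knob = find_boost_knob(sys_root)
    if knob is None:
        return False
    path, off_value = knob
    path.write_text(off_value)
    return True
```

The `sys_root` parameter exists only so the logic can be tried against a scratch directory first. The setting does not survive a reboot, so it would have to be reapplied, e.g. from a boot-time script.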
Alrighty, after conferring some more with @rschwietzke and @shipilev, I'm gonna set up an AX161 (AMD EPYC™ 7502P, 32 Core Rome (Zen2)). It's the same in terms of ISA as the original one (i.e. no AVX-512), which is also nice. I'll run on 8 cores, as before. |
I'm gonna close this one. We've moved to aforementioned instance AX161, and the leader board has been updated to reflect this move. |
that means 1 thread per each of the 8 cores? zen has 2 threads per core. is smt disabled now? the original 8 vcpu cloud machine probably had just 4 cores with 2 vcpu / core = 8 vcpus total. the
it says 4 cores per socket and 1 socket, so 4 cores total, but 2 threads per core = 8 threads total. |
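The arithmetic in the comment above can be sketched as follows. The parser assumes the usual `Key: value` layout of `lscpu` and its field names `Socket(s)`, `Core(s) per socket` and `Thread(s) per core`; the function names are hypothetical, and the sample text is reconstructed from the numbers quoted in the comment, not copied from real output.

```python
def parse_lscpu(text):
    """Turn "Key:   value" lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def topology(fields):
    """Return (physical_cores, hardware_threads) from parsed lscpu fields."""
    sockets = int(fields["Socket(s)"])
    cores = sockets * int(fields["Core(s) per socket"])
    threads = cores * int(fields["Thread(s) per core"])
    return cores, threads


# Illustrative input matching the figures discussed above.
SAMPLE = """\
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
"""

print(topology(parse_lscpu(SAMPLE)))  # (4, 8)
```

So an "8 vCPU" instance with these values has 4 physical cores, each contributing 2 hardware threads.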
Yes, SMT is disabled, and we run on eight cores out of 32 via numactl. This is the lscpu output of the new machine:
|
I am planning to run the Top 5 or so on all 32 cores (64 threads with SMT) towards the end of the challenge, so as to see how far we can push it below 1 sec :) |
@gunnarmorling could you perhaps post details on this machine. Hetzner info does not include specific CPU, memory configuration incl. bandwidth etc. Would be interesting to determine utilization etc.