
Hetzner Cloud CCX33 instance machine details #189

Closed · nietras opened this issue Jan 6, 2024 · 28 comments

@nietras

nietras commented Jan 6, 2024

@gunnarmorling could you perhaps post details on this machine? Hetzner's info does not include the specific CPU or memory configuration (incl. bandwidth etc.). It would be interesting for determining utilization and the like.

@gunnarmorling
Owner

Happy to, if I can. Any specific commands I should run whose output you'd like to see?

@nietras
Author

nietras commented Jan 7, 2024

Not sure what is best on Linux. lscpu as a minimum perhaps; hwinfo or similar if possible.

If possible in a VM, I'd like to know the specific CPU core/arch (Zen 3?), cache configuration (L1/L2 etc.), frequencies, memory configuration (channels, clock, bandwidth), and supported ISA (AVX, AVX-512?).
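
For reference, a minimal set of standard Linux commands should surface most of this (lstopo comes from the hwloc package; dmidecode needs root and often reports little inside a VM):

lscpu                      # CPU model, core/thread counts, cache sizes, ISA flags (AVX, AVX-512, ...)
lscpu --caches             # per-level cache details
numactl --hardware         # NUMA nodes and per-node memory
lstopo --no-io             # core/cache topology
sudo dmidecode -t memory   # DIMM channels and clocks; usually masked or empty in a VM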

@as-com
Contributor

as-com commented Jan 7, 2024

It would be great if we could see the output of lscpu; it seems Hetzner uses a mix of Milan and Genoa processors for their "dedicated vCPUs" instances.

@gunnarmorling
Owner

Here it is, i.e. EPYC-Milan:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC-Milan Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            4792.79
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
                          aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xget
                         bv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke rdpid fsrm
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)
  L3:                    32 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Vulnerable: Safe RET, no microcode
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

it seems Hetzner uses a mix of Milan and Genoa processors for their "dedicated vCPUs" instances

Ah, that's interesting, where did you get that info from? Might explain the much better numbers which @ebarlas reported after setting up his own CCX33 instance (in another, newer Hetzner DC).

@as-com
Contributor

as-com commented Jan 7, 2024

@gunnarmorling
https://www.hetzner.com/cloud in the dedicated vCPU tab:

Optimize your workload with AMD Milan EPYC™ 7003 and AMD Genoa EPYC™ 9654 processors.

@gunnarmorling
Owner

gunnarmorling commented Jan 7, 2024 via email

@sharpobject

Which DC did you rent from?

@fuzzypixelz

It seems that Milan has no AVX-512 while Genoa does. Wikipedia claims that AVX-512 on EPYC is available on Zen 4 and later. Milan is Zen 3 and Genoa is Zen 4.

@tarsa

tarsa commented Jan 8, 2024

It seems that Milan has no AVX-512 while Genoa does. Wikipedia claims that AVX-512 on EPYC is available on Zen 4 and later. Milan is Zen 3 and Genoa is Zen 4.

AVX-512 is officially supported by Zen 4, look e.g. here: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf (look for 'avx-512' mentions). Zen 3 doesn't support AVX-512, but only AVX2 and below.

Presence of AVX-512 will probably affect performance of all vectorized code (autovectorized and/or manually vectorized using Vector API from Project Panama).

If you have an AVX-512-capable CPU, you can measure the difference by running the JVM with -XX:UseAVX=2 (or something like that) to limit the AVX level used by the JVM (IIRC original AVX=1, AVX2=2, AVX-512=3).
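
As a rough sketch of that comparison (the jar name is illustrative, not the actual harness; -XX:UseAVX is a HotSpot flag on x86):

java -XX:UseAVX=3 -jar benchmark.jar   # allow AVX-512 where the hardware supports it
java -XX:UseAVX=2 -jar benchmark.jar   # cap the JIT's vectorization at AVX2 for comparison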

@gunnarmorling
Owner

Ah, that's great insight, thx for sharing!

I'm a bit blocked right now with evaluations: it seems my instance got moved to a different host, as I'm observing substantially different (read, better) numbers as of today, making any new measurements not comparable with previous runs. I've opened a ticket with Hetzner to see what's going on, but I might have to look for a more reliable alternative.

@gunnarmorling
Owner

So I am considering getting an AMD EPYC 7401P from the Hetzner Server Auction. That's Zen 1, i.e. I reckon slower per core, but then it has 24 cores :) Like Zen 3, it has AVX2. Numbers wouldn't be comparable of course, but once we've set up hyperfine, it shouldn't be a problem to run all entries again and update the leader board accordingly (apart from the overall absolute shift, there might be relative changes if different contenders handle the increased core count differently).
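
For context, a typical hyperfine invocation for re-running an entry might look like this (the script name is illustrative):

hyperfine --warmup 3 --runs 10 './calculate_average_xyz.sh'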

My biggest question there is around administering the thing (e.g. how to disable turbo-boost and SMT, which would be a good idea), as I'm not super-savvy when it comes to that.

The alternative would be to re-run everything on the existing instance (which is much faster as of today, no idea why). But I don't feel very confident about that, as I can't be sure performance won't change again. I've also asked the community for help; let's see what comes out of it.

Open for any help and suggestions of course :)

@tarsa

tarsa commented Jan 8, 2024

Zen 1 has very high penalties for inter-chiplet communication (actually even inter-CCX communication). Zen 2 brought the central IOD (IO die) and made the inter-chiplet communication much more robust and faster. If you're going for Zen, I would recommend at least Zen 2.

There's Zen 4 available in the form of the AX52 server with a Ryzen 7 7700: https://www.hetzner.com/dedicated-rootserver/matrix-ax . It has a single CCX, so it should be easy to tune multithreading for that chip. The server finder https://www.hetzner.com/dedicated-rootserver shows it's "available in few minutes", but the direct link to the search results (in that server finder) somehow doesn't work.

@nietras
Author

nietras commented Jan 8, 2024

A dedicated machine is by far the most important thing here, with AVX2 support as a minimum. I am not too worried about the cache hierarchy given the highly parallelizable problem and all solutions processing chunks per processor. However, Zen 1 has some issues with certain SIMD/AVX2 instructions (high latencies etc.). Not sure any such instructions are or will be used here, though, given the simple usage: load, cmp, movemask, lzcnt, etc.

Disks are not important given the entire file is cached in memory. The more cores, the less difference in efficiency, probably; it's more limited by memory bandwidth/cache.

I don't know how good Java's AVX-512 support is, but I would not see it as a requirement; it's also harder for most to test locally since many don't have dev machines with it. I don't, for example.

@gunnarmorling
Owner

There's Zen 4 available in the form of the AX52 server with a Ryzen 7 7700

@shipilev recommended using EPYC rather than Ryzen; the reasoning is a bit above my pay grade, though :) There's the AX161 with an EPYC 7502P (Zen 2), though it's a bit towards the pricier end. Oh, the options...

@nietras
Author

nietras commented Jan 8, 2024 via email

@tarsa

tarsa commented Jan 8, 2024

My biggest question there is around administering the thing (e.g. how to disable turbo-boost and SMT, which would be a good idea), as I'm not super-savvy when it comes to that.

I have a script that sets my CPU to max non-turbo frequency:

sudo cpupower frequency-set -g performance          # use the performance governor
sudo cpupower frequency-set -u 3400MHz -d 3400MHz   # fix the upper (-u) and lower (-d) frequency limits at the max non-turbo clock

Not sure if it would work on others' machines.

As for SMT: why disable it? If someone doesn't want it, setting the CPU affinity mask would amount to the same thing. Details: https://linux.die.net/man/1/taskset
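
For illustration, assuming logical CPUs 0-3 sit on four distinct physical cores (the actual mapping varies; lscpu -e shows it) and with an illustrative jar name:

lscpu -e                                 # shows which logical CPUs share a physical core
taskset -c 0-3 java -jar benchmark.jar   # pin to one logical CPU per core, ignoring the SMT siblings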

I don't know how good Java's AVX-512 support is, but I would not see it as a requirement; it's also harder for most to test locally since many don't have dev machines with it. I don't, for example.

As I said before, you can run the JVM with -XX:UseAVX=2 and avoid AVX-512-related surprises.

AVX-512 adds support for lane masking, and that could potentially allow more interesting programs. OTOH, Zen 4 implements some AVX-512 instructions poorly. https://www.hwcooling.net/en/how-good-is-amds-avx-512-does-it-improve-zen-4-performance/ says:

On the other hand there are some instructions that perform much worse than what is necessitated by the use of 256-bit units and 256bit load/store. That is a case with Compress (vcompressd) operations and Scatter/Gather performance is also poor. Scatter/Gather is also not great on Intel, but to a lesser extent.

As for the Ryzen vs. EPYC choice: the Ryzen servers that Hetzner provides can have ECC RAM enabled. The configuration page https://www.hetzner.com/dedicated-rootserver/ax52/configurator says:

Upgrade to ECC RAM €4.76 monthly

That should make the server reliable enough for lots of heavy workload hammering :) I'm not sure what improvements EPYC would bring on top of that.

@lehuyduc

lehuyduc commented Jan 9, 2024

My biggest question there is around administering the thing (e.g. how to disable turbo-boost and SMT, which would be a good idea), as I'm not super-savvy when it comes to that.

I think we shouldn't disable SMT. Some solutions scale perfectly well with hyperthreading (i.e. nearly 2x performance when running 16 threads on an 8c16t PC).

I think at the end of the contest, maybe the top 10 solutions could be selected and run again on a dedicated physical machine (not a remote server), with turbo boost disabled and as few background processes running as possible.

@nietras
Author

nietras commented Jan 9, 2024 via email

@gunnarmorling
Owner

Ok, so if we want to stick to Hetzner (which I'd prefer, so as to limit the search space somewhat), it seems AX52 (AMD Ryzen™ 7 7700, Zen 4) would be the best fit. I'm just not sure whether turbo boost can be disabled in that setting? But I'm also not sure how much that may skew results? CC @rschwietzke

I think at the end of the contest, maybe the top 10 solutions could be selected and run again on a dedicated physical machine (not a remote server), with turbo boost disabled and as few background processes running as possible.

Ha yeah, would love that. Just would need to get my hands on one :)

@nietras
Author

nietras commented Jan 9, 2024

I think turbo boost should be left on. Thermal throttling is part of the game, and rigorous benchmarking will show it, e.g. as high variation in results. It's a dedicated DC server; it should have consistent and reliable cooling and little fluctuation. Handle it with better benchmarking. Make it part of the challenge.

If someone can make a single-threaded solution running at the higher single-thread (or few-thread) boost clock that is faster than a multi-threaded solution using all hardware threads at a lower boost clock, then that's fair game.

@rschwietzke
Contributor

I would say off, because it is not that much of a boost for "desktop" CPUs, and it might turn execution order as well as CPU usage into a factor. Turbo often kicks in only when just a few cores are active.

Cloud machines (which are the deployment norm) don't have turbo modes at all.

And yes, we can turn that off for AMD (I do that for my notebook sometimes).

@gunnarmorling
Owner

Yeah, the motivation for turning it off would be better comparability between contenders; in particular, you'd want to avoid a subsequent run suffering from throttling caused by a previous run. I suppose one could kinda get on top of it by pausing in between, but that's more voodoo than anything else.

And yes, we can turn that off for AMD (I do that for my notebook sometimes).

Do you do that in the BIOS or at OS level? I reckon the former isn't available with a Hetzner dedicated host (if only one could try it out before committing to it...).

@rschwietzke
Contributor

OS level. SYSCTL settings.
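
On Linux these knobs are usually exposed via sysfs; a sketch of the common paths (the boost file depends on the cpufreq driver, e.g. acpi-cpufreq vs. amd-pstate):

echo 0   | sudo tee /sys/devices/system/cpu/cpufreq/boost   # disable turbo/boost (acpi-cpufreq)
echo off | sudo tee /sys/devices/system/cpu/smt/control     # disable SMT at runtime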

@gunnarmorling
Owner

Alrighty, after conferring some more with @rschwietzke and @shipilev, I'm gonna set up an AX161 (AMD EPYC™ 7502P, 32-core Rome (Zen 2)). It's the same in terms of ISA as the original one (i.e. no AVX-512), which is also nice. I'll run on 8 cores, as before.

@gunnarmorling
Owner

I'm gonna close this one. We've moved to the aforementioned AX161 instance, and the leader board has been updated to reflect this move.

@tarsa

tarsa commented Jan 12, 2024

(..) I'll run on 8 cores, as before.

Does that mean 1 thread per each of the 8 cores? Zen has 2 threads per core. Is SMT disabled now?

The original 8-vCPU cloud machine probably had just 4 cores with 2 vCPUs/core = 8 vCPUs total. The lscpu output for the previous cloud machine said:

CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC-Milan Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            4792.79
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
                          aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xget
                         bv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke rdpid fsrm
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)
  L3:                    32 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7

It says 4 cores per socket and 1 socket, so 4 cores total, but 2 threads per core = 8 threads total.
Also, the 4 instances of L1 and L2 cache indicate there were just 4 cores.

@gunnarmorling
Owner

Yes, SMT is disabled, and we run on eight cores out of 32 via numactl. This is the lscpu output of the new machine:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-31
  Off-line CPU(s) list:  32-63
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7502P 32-Core Processor
    CPU family:          23
    Model:               49
    Thread(s) per core:  1
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            0
    Frequency boost:     disabled
    CPU max MHz:         2500.0000
    CPU min MHz:         0.0000
    BogoMIPS:            4990.70
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 mov
                         be popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx
                         2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pa
                         usefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   1 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    16 MiB (32 instances)
  L3:                    128 MiB (8 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT disabled
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

@gunnarmorling
Owner

I am planning to run the Top 5 or so on all 32 cores (64 threads with SMT) towards the end of the challenge, so as to see how far we can push it below 1 sec :)

tarsa mentioned this issue Mar 8, 2024