
PoW modifications (shuffles and integer math) discussion and tests #1

Open

SChernykh opened this issue Jun 20, 2018 · 296 comments
SChernykh commented Jun 20, 2018

The original discussion starts here: monero-project/monero#3545 (comment)
GPU version of shuffle and integer math modifications is here: https://github.com/SChernykh/xmr-stak-amd

You can post your performance test results, also your suggestions and concerns here.

AMD Ryzen 7 1700 @ 3.6 GHz, 8 threads

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 600.8 H/s | 100.0% |
| INT_MATH | 588.0 H/s | 97.9% |
| SHUFFLE | 586.6 H/s | 97.6% |
| Both mods | 572.0 H/s | 95.2% |

AMD Ryzen 5 2600 @ 4.0 GHz, 1 thread

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 97.0 H/s | 100.0% |
| INT_MATH | 91.7 H/s | 94.5% |
| SHUFFLE | 94.6 H/s | 97.5% |
| Both mods | 91.3 H/s | 94.1% |
| Both mods (PGO build) | 93.5 H/s | 96.4% |
| Both mods (ASM optimized) | 94.8 H/s | 97.7% |

AMD Ryzen 5 2600 @ 4.0 GHz, 8 threads (affinity 0,2,4,5,6,8,10,11)

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 657.6 H/s | 100.0% |
| INT_MATH | 613.3 H/s | 93.3% |
| SHUFFLE | 647.0 H/s | 98.4% |
| Both mods | 612.3 H/s | 93.1% |
| Both mods (PGO build) | 622.4 H/s | 94.6% |
| Both mods (ASM optimized) | 636.0 H/s | 96.7% |

Intel Pentium G5400 (Coffee Lake, 2 cores, 4 MB Cache, 3.70 GHz), 2 threads

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 146.5 H/s | 100.0% |
| INT_MATH | 141.0 H/s | 96.2% |
| SHUFFLE | 145.3 H/s | 99.2% |
| Both mods | 140.5 H/s | 95.9% |

Intel Core i5 3210M (Ivy Bridge, 2 cores, 3 MB Cache, 2.80 GHz), 1 thread

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 72.7 H/s | 100.0% |
| INT_MATH | 66.3 H/s | 91.2% |
| SHUFFLE | 71.1 H/s | 97.8% |
| Both mods | 66.3 H/s | 91.2% |
| Both mods (PGO build) | 66.3 H/s | 91.2% |
| Both mods (ASM optimized) | 69.6 H/s | 95.7% |

Intel Core i7 2600K (Sandy Bridge, 4 cores, 8 MB Cache, 3.40 GHz), 1 thread

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 85.6 H/s | 100.0% |
| Both mods | 70.6 H/s | 82.5% |
| Both mods (PGO build) | 76.5 H/s | 89.4% |
| Both mods (ASM optimized) | 79.2 H/s | 92.5% |

Intel Core i7 7820X (Skylake-X, 8 cores, 11 MB Cache, 3.60 GHz), 1 thread

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 68.3 H/s | 100.0% |
| INT_MATH | 65.9 H/s | 96.5% |
| SHUFFLE | 67.3 H/s | 98.5% |
| Both mods | 65.0 H/s | 95.2% |

The XMR-STAK used is an old version, so don't expect the same numbers you get on your mining rigs. What matters here is the relative performance of the original and modified Cryptonight versions.

Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 477.1 H/s | 100.0% |
| INT_MATH | 448.4 H/s | 94.0% |
| SHUFFLE | 457.6 H/s | 95.9% |
| Both mods | 447.0 H/s | 93.7% |
| Both mods strided* | 469.8 H/s | 98.5% |

* strided_index = 2, mem_chunk = 2 (64 bytes)

Radeon RX 560 on Windows 10 (RX 550 simulation): core @ 595 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 394.3 H/s | 100.0% |
| INT_MATH | 357.4 H/s | 90.6% |
| SHUFFLE | 343.2 H/s | 87.0% |
| Both mods | 316.4 H/s | 80.2% |
| Both mods, intensity 1440* | 321.1 H/s | 81.4% |

* Increasing intensity to 1440 improved both mods performance, but made performance worse in other cases.

It looks like RX 550 needs GPU core overclocking to properly handle new modifications.

GeForce GTX 1080 Ti 11 GB on Windows 10: core 2000 MHz, memory 11800 MHz, monitor plugged in, intensity 1280, worksize 8:

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 908.4 H/s | 100.0% |
| INT_MATH | 902.7 H/s | 99.4% |
| SHUFFLE | 848.6 H/s | 93.4% |
| Both mods | 846.7 H/s | 93.2% |

GeForce GTX 1060 6 GB on Windows 10: all stock, monitor plugged in, intensity 800, worksize 8:

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 453.6 H/s | 100.0% |
| INT_MATH | 452.2 H/s | 99.7% |
| SHUFFLE | 422.6 H/s | 93.2% |
| Both mods | 421.5 H/s | 92.9% |

GeForce GTX 1050 2 GB on Windows 10: core 1721 MHz, memory 1877 MHz, monitor unplugged, intensity 448, worksize 8:

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 319.9 H/s | 100.0% |
| INT_MATH | 318.1 H/s | 99.4% |
| SHUFFLE | 292.5 H/s | 91.4% |
| Both mods | 291.0 H/s | 91.0% |
@tevador

tevador commented Jun 20, 2018

RX 550 (2 GB, 640 shaders) / Ubuntu 16.04

| Mode | Intensity/Worksize | Hashrate |
|---|---|---|
| - | 600/8 | 395 H/s |
| -DINT_MATH_MOD -DSQRT_OPT_LEVEL=0 | 760/32 | 277 H/s |
| -DINT_MATH_MOD -DSQRT_OPT_LEVEL=1 | 760/32 | 345 H/s |
| -DINT_MATH_MOD -DSQRT_OPT_LEVEL=2 | 760/32 | 319 H/s |
| -DSHUFFLE_MOD -DINT_MATH_MOD -DSQRT_OPT_LEVEL=0 | 760/32 | 218 H/s |
| -DSHUFFLE_MOD -DINT_MATH_MOD -DSQRT_OPT_LEVEL=1 | 760/32 | 254 H/s |
| -DSHUFFLE_MOD -DINT_MATH_MOD -DSQRT_OPT_LEVEL=2 | 760/32 | 259 H/s |

The results are a bit strange. Hashrate without any mods dropped from 425 to 395. INT_MATH with optimization 1 is faster without the shuffle mod, while optimization 2 is faster with it.

@SChernykh

> The results are a bit strange. Hashrate without any mods dropped from 425 to 395.

It was calculated incorrectly before.

> INT_MATH with optimization 1 is faster without shuffle, optimization 2 is faster with shuffle mod.

Optimization 1 is for NVIDIA cards only; AMD cards don't need it. Really strange, because optimization 2 actually does fewer computations than optimization 1.

@SChernykh

@tevador @MoneroCrusher I've improved my shuffle mod GPU code significantly. There is almost no slowdown with shuffle now and much better performance with both mods on RX 550. Can you check it? And we still need someone with Vega 56/64...

@MoneroCrusher

@SChernykh @tevador
I can check both RX 550s (8 CU & 10 CU) and the Vega 56 (also with the 64 BIOS flashed) in a couple of hours. I've only used the Vega on Windows so far. Are the Linux drivers finally up to date? What should I use?

@SChernykh

SChernykh commented Jun 26, 2018

@MoneroCrusher You can test on Windows as well, it's not a problem. I've added .sln file for Visual Studio so you can compile it.

P.S. Community edition of Visual Studio (which is free) should be enough for compiling.

@SChernykh

SChernykh commented Jun 26, 2018

In the meantime, I've tried to overclock memory on my RX 560 (only memory, I left GPU core clock at default 1196 MHz), here are the results:

Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2275 MHz, monitor plugged in, intensity 1000, worksize 32:

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 407.2 H/s | 100.0% |
| INT_MATH | 406.5 H/s | 99.8% |
| SHUFFLE | 389.0 H/s | 95.5% |
| Both mods | 386.3 H/s | 94.9% |

Is 2275 MHz a good speed for the memory on RX 560? I didn't get any CPU/GPU mismatch errors and I can't overclock it further - MSI Afterburner just doesn't let me do it.

@MoneroCrusher Did you try to test your Vega?

@MoneroCrusher

@SChernykh
You did those tests with 1-click timing straps?
Can you try 2 threads?
I haven't tested anything yet but will do now. I'd be happy if you could provide a Windows binary - I don't have Visual Studio.

@SChernykh

@MoneroCrusher

  • No idea about timing straps. Whatever is default on the stock card I guess. I only used MSI Afterburner and changed memory frequency, that's all.
  • Performance is the same with two threads at intensity 500.
  • Added the Windows binary.

@MoneroCrusher

MoneroCrusher commented Jun 27, 2018

@SChernykh
Can you do the tests with PBE 1 click timing straps? That's more realistic.
Thanks for the Windows binary, btw!

I did the tests now. I mistakenly used worksize 8 for the mods at first, not seeing you guys were using WS 32, so I corrected it afterwards and tried both WS 16 and 32.

Gigabyte RX 550 2 GB, 8 CU, 2 Threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04

| Mod | Hashrate (WS 8) | Hashrate (WS 16) | Hashrate (WS 32) |
|---|---|---|---|
| No Mod | 467 H/s | 440 H/s | Crash |
| SHUFFLE | 409 H/s | 453 H/s | Crash |
| INT_MATH | 223 H/s | 302 H/s | 360 H/s |
| Both mods | 202 H/s | 267 H/s | 316 H/s |

Sapphire RX 550 2 GB, 10 CU, 2 Threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04

| Mod | Hashrate (WS 8) | Hashrate (WS 16) | Hashrate (WS 32) |
|---|---|---|---|
| No Mod | 528 H/s | 479 H/s | 470 H/s |
| SHUFFLE | 419 H/s | 458 H/s | 419 H/s |
| INT_MATH | 229 H/s | 354 H/s | 353 H/s |
| Both mods | 217 H/s | 309 H/s | 315 H/s |

Vega RX 56, 56 CU, 2 Threads (2016/1716), 950/1417, Windows 10

| Mod | Hashrate (WS 8) | Hashrate (WS 16) | Hashrate (WS 32) |
|---|---|---|---|
| No Mod | 1650 H/s | 1632 H/s | 1613 H/s |
| SHUFFLE | 1588 H/s | 1639 H/s | 1591 H/s |
| INT_MATH | 1052 H/s | 1411 H/s | 1471 H/s |
| Both mods | 1026 H/s | 1321 H/s | 1303 H/s |

So worksize 32 helps INT_MATH most, worksize 16 helps SHUFFLE most, and worksize 8 helps the unmodified version most.
Is there some way to align them?
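(For reference, intensity and worksize are per-thread settings in xmr-stak's `amd.txt`. A sketch of the relevant fragment, with field names following the xmr-stak-amd config format and values taken from the runs above:)

```
"gpu_threads_conf" : [
  { "index" : 0, "intensity" : 1024, "worksize" : 32, "affine_to_cpu" : false },
],
```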

Also, could somebody ELI5 me why RandomJS permanently prevents ASICs?

@SChernykh

> Can you do the tests with PBE 1 click timing straps? That's more realistic.

I'll do it this evening. Hopefully it won't brick my card. Thanks for the Vega 56 numbers. It seems the Vega handles the shuffle mod perfectly. The integer math mod runs at 89% of the no-mods performance, and shuffle+int_math runs at 81% relative to shuffle alone. Can you tweak the parameters some more? Overclocking the GPU core should also really help. We need to know how well it can perform.

> Also, could somebody ELI5 me why RandomJS permanently prevents ASICs?

Any ASIC that can run random code is basically a CPU. Read this comment: monero-project/monero#3545 (comment)

@SChernykh

> Vega RX 56, 56 CU, 2 Threads (2016/1716), 950/1417

Are the last 2 numbers GPU core and memory clocks? You really need to push GPU core to the maximum - integer math mod adds a lot of computations.

@MoneroCrusher

MoneroCrusher commented Jun 27, 2018 via email

@SChernykh

> No, HBM mem has different clocks. 950 is mem and 1417 is core.

Ok, it looks like it's time to leave only 1 square root in int_math mod to make it easier for GPUs. Two square roots were kind of overkill anyway.

> Is it necessary for both mods to be implemented for ASIC resistance, or would one of them be enough?

They target different classes of ASICs/FPGAs. Shuffle mod targets devices with external memory, making them 4 times slower. Integer math mod targets devices with on-chip memory, making them 8-10 times slower because of high division and square root latency. They work best together. Remove one mod, and you will enable an efficient ASIC/FPGA again - either with on-chip SRAM or with HBM external memory.
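The shuffle mod's inner-loop shape can be sketched like this (a simplified illustration based on the description above, not the actual kernel; the chunk offsets, operand choice, and the 128-bit add are assumptions for this sketch):

```cpp
#include <cstdint>
#include <cstddef>

// Each 64-byte cache line holds four 16-byte chunks; idx selects one of them.
struct u128 { uint64_t lo, hi; };

static inline u128 add128(u128 a, u128 b) {
    // Two independent 64-bit lanes, as in the SIMD kernels.
    return { a.lo + b.lo, a.hi + b.hi };
}

// Illustrative shuffle step: besides the chunk at idx that the main loop
// already touches, read-modify-write the other three chunks of the same
// 64-byte line. A device with slow external memory now pays 4x the
// traffic per iteration instead of 1x.
void shuffle_mod(u128* scratchpad, size_t idx, u128 a, u128 b0, u128 b1) {
    const u128 chunk1 = scratchpad[idx ^ 1];
    const u128 chunk2 = scratchpad[idx ^ 2];
    const u128 chunk3 = scratchpad[idx ^ 3];
    scratchpad[idx ^ 1] = add128(chunk3, b1);
    scratchpad[idx ^ 2] = add128(chunk1, a);
    scratchpad[idx ^ 3] = add128(chunk2, b0);
}
```

The key property is that the three extra accesses hit the same cache line as the main access, so CPUs and GPUs with real caches pay almost nothing, while external-DRAM designs pay the full 4x price.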

@MoneroCrusher

MoneroCrusher commented Jun 27, 2018

Will going from 2 square roots to 1 make it an easier game for FPGAs?
Is it somehow possible to make the RX 550 better? Quite a few people are mining on those, and it would be a pity if they did 40-50% worse than other GPUs (pre-fork vs. post-fork).

@SChernykh

> Will going from 2 square roots to 1 make it an easier game for FPGAs?

Leaving just 1 square root won't make it easier for FPGAs. The point of having a square root is that they'll still need to implement it, wasting chip area on it, and that it has high computation latency.

> Is it somehow possible to make RX 550 better?

Leaving 1 square root should help. The problem with the RX 550 is that it's unbalanced, unlike other Radeons. If you calculate the GFLOPS-to-memory-bandwidth ratio, it's in the range of 20-25 GFLOPS per GB/s for all Radeons from the RX 560 up to the Vega 64. The RX 550 has only ~10 GFLOPS per GB/s - two times worse.
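That ratio is easy to reproduce from approximate reference-card specs (shader count x 2 FLOPs per clock x clock, divided by bandwidth; the clock and bandwidth figures below are rough reference values assumed for illustration, not measurements):

```cpp
// GFLOPS per GB/s of memory bandwidth:
// shaders * 2 FLOP/clock * clock (GHz) / bandwidth (GB/s).
static double compute_to_bw_ratio(int shaders, double clock_ghz, double bw_gbs) {
    return shaders * 2.0 * clock_ghz / bw_gbs;
}

// Approximate reference specs; both cards use the same 112 GB/s GDDR5:
//   RX 550: compute_to_bw_ratio(512, 1.183, 112.0)  -> ~10.8
//   RX 560: compute_to_bw_ratio(896, 1.275, 112.0)  -> ~20.4
```

With these assumed numbers the RX 550 lands near 10.8 and the RX 560 near 20.4, matching the "two times worse" figure above.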

@SChernykh

I think I'll make the number of square roots configurable for convenience. You'll be able to test 0, 1 or 2 square roots in int_math mod.

@MoneroCrusher

Please find a way to disadvantage all GPUs the same, if that's in any way possible!

@SChernykh

It's possible to slow down all GPUs from the RX 560 up to the Vega 64 by the same amount; I can't guarantee it for the RX 550. But we still have time to experiment - the next fork is in September/October.

@MoneroCrusher

MoneroCrusher commented Jun 27, 2018

@SChernykh
So nice that you came up with this algo. So in your opinion it will permanently keep ASICs and FPGAs off the network? It also uses much less power than ProgPoW. What's your opinion of ProgPoW?

I hope we can fork the PoW sooner if it's production-ready. There are reasons to believe FPGAs/ASICs are already on the network.
Edit: I really hope there is a way to disadvantage the 560-Vega range more than the 550 to balance it out... but let's test!

@SChernykh

SChernykh commented Jun 27, 2018

As for the FPGAs coming in August (BCU1525) - they'll be slowed down from 20 KH/s to less than 5 KH/s (even down to 2 KH/s if my assumptions about division and square root latencies are correct), which will make them much worse than the Vega 56/64 in terms of performance per $, so they won't be mining Cryptonight at all.

As for possible ASICs: devices with external memory will still be ~2.5 times faster than Vega 56/64 if they use the same HBM2 memory. Given that they'll certainly be more expensive, they won't be a serious competition. Devices with on-chip memory like those 220 KH/s Bitmain miners will be down to 20-30 KH/s, still at 550 watts. Much less dangerous to the network.

ProgPoW is perfectly tuned for GPUs, but it's not CPU-friendly like Cryptonight, which is a minus for decentralization. ProgPoW ASICs won't be economically viable at all.

@SChernykh

SChernykh commented Jun 27, 2018

I've tested RX 560 with one click timing straps, could overclock memory to 2200 MHz where it started giving CPU/GPU mismatch errors (1-3 errors per test run), but I could still test the performance.

Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1000, worksize 32:

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 448.5 H/s | 100.0% |
| INT_MATH | 447.3 H/s | 99.7% |
| SHUFFLE | 446.8 H/s | 99.6% |
| Both mods | 439.3 H/s | 97.9% |

It looks like the new memory timings made things better compared to the plain memory overclock: 97.9% vs 94.9% performance.

P.S. I didn't see any difference between 8, 16 and 32 worksizes for version without mods.

@SChernykh

I've tested it again at 2150 MHz memory - there were no CPU/GPU mismatch errors at all. I wanted to be sure that errors didn't influence test results.

Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2150 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:

| Mod | Hashrate | Performance level |
|---|---|---|
| - | 447.4 H/s | 100.0% |
| INT_MATH | 438.0 H/s | 97.9% |
| SHUFFLE | 440.6 H/s | 98.5% |
| Both mods | 434.9 H/s | 97.2% |

The no-mods version worked better with 2 threads @ 512 intensity; all versions with mods worked better with 1 thread @ 1024 intensity.

@SChernykh

> really hope there is a way to disadvantage 560-vega more than 550 to balance it out... but let's test!

It's just impossible, because the 560 has exactly the same memory but a GPU core twice as powerful. Whatever you do, the 560 will be faster than the 550.

@SChernykh

SChernykh commented Jun 27, 2018

I removed one square root from integer math mod, also tested different thread count, intensity and worksize for different versions. The best I could get:

  • 466.7 H/s (up from 447.4 H/s) for version without mods: 2 threads, intensity 512, worksize 8
  • 439.1 H/s (up from 434.9 H/s) for version with both mods (one square root per iteration): 1 thread, intensity 1024, worksize 32

This is for RX 560. @MoneroCrusher Can you test again on Vega 56 and RX 550? I've updated the repository.

@SChernykh

SChernykh commented Jun 28, 2018

@MoneroCrusher If you haven't started testing yet, hold off for now. I have some very cool changes incoming for the integer math mod. These changes both improve GPU performance AND slow down ASIC/FPGA implementations two times more, compared to the current integer math mod.

P.S. GPU performance didn't really improve, it stayed the same. But still very cool.

@MoneroCrusher

MoneroCrusher commented Jun 28, 2018

I started testing and it has gotten better (around 20% for the RX 550, but still not at parity with the 560/570/580; haven't tested the Vega yet), but I'll wait then.

Very cool! So the advantage of ASIC will only be 2-3x after your mod?

@SChernykh

> Very cool! So the advantage of ASIC will only be 2-3x after your mod?

Yes. We're now talking about a ~15x slowdown for the upcoming BCU1525 FPGA (20 KH/s -> 1.4 KH/s) and a similar slowdown for Bitmain ASICs (220 KH/s -> 15 KH/s). Strange thing: these changes improved int_math performance when applied alone (10% better), but int_math + shuffle stayed the same, even got 1% slower. I'm sure it can be improved further.

@SChernykh

I've committed it to the repository: SChernykh/xmr-stak-amd@566f30c

The trick was to prevent parallel calculation of division and square roots. Now they have to be done in sequence, effectively doubling the latency for ASIC/FPGA. You can start testing now.
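The idea can be sketched in a few lines (a simplified model of the serialized step; the exact operand wiring, the constants, and the use of a floating-point sqrt as a stand-in are all assumptions for illustration, not the committed kernel):

```cpp
#include <cstdint>
#include <cmath>

// Per-iteration integer math with a forced dependency: the sqrt input is
// built from the division result, so hardware must run the high-latency
// division and square root back to back instead of in parallel.
struct IntMathState {
    uint64_t division_result = 0;
    uint64_t sqrt_result = 0;
};

void int_math_step(IntMathState& s, uint64_t c0, uint64_t c1) {
    const uint64_t dividend = c1;
    // Keep the divisor odd and large enough that division by zero is impossible.
    const uint32_t divisor =
        static_cast<uint32_t>(c0 + (s.sqrt_result << 1)) | 0x80000001u;
    s.division_result = static_cast<uint32_t>(dividend / divisor)
                      + ((dividend % divisor) << 32);
    // The sqrt cannot start until the division above has finished.
    const uint64_t sqrt_input = c0 + s.division_result;
    s.sqrt_result =
        static_cast<uint64_t>(std::sqrt(static_cast<double>(sqrt_input)));
}
```

The measurable effect is exactly the one described above: the two long-latency units are chained, doubling the per-iteration latency for pipelined hardware while costing CPUs and GPUs almost nothing extra.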

@SChernykh

SChernykh commented Jun 28, 2018

I've managed to improve the integer math mod a bit more: from 254.3 H/s to 275.6 H/s on my simulated RX 550 when combined with the shuffle mod - an 8% speed-up. But I still need numbers for a real RX 550 and Vega 56 to know where it stands now.

P.S. And I've improved it some more, from 275.6 H/s up to 277.0 H/s, so a 9% speed-up compared to the current version. I don't know what else can be done there without making it easier for ASICs/FPGAs. I'm out of ideas for today & waiting for the numbers.

@madscientist159

@SChernykh Yes agreed regarding stalling, but the v8 algorithm is only running at about 60% of the v7 hashrate, so something's clearly still off. I'll poke around in perf a bit more and see if there's anything obvious.

@madscientist159

madscientist159 commented Oct 15, 2018

@SChernykh I'm wondering if the LUT is getting evicted from L1 cache. Perf output:

```
Percent│       return vec_xor(__A, __B);
  0.85 │       xxlxor vs44,vs0,vs44
       │     _mm_add_epi64():
       │       return (__m128i) ((__vector unsigned long long)__A + (__vector unsigned long long)__B);
  0.82 │       vaddudm v12,v1,v12
       │     _Z23cryptonight_single_hashILN5xmrig4AlgoE0ELb0ELNS0_7VariantE8EEvPKhmPhPP15cryptonight_ctx():
       │                 idx0 = d ^ q;
       │             }
       │             if (VARIANT == xmrig::VARIANT_2) {
       │                 bx1 = bx0;
       │             }
       │             bx0 = cx;
  0.85 │       xxlor  vs33,vs32,vs32
       │             ah0 ^= ch;
  0.67 │       xor    r6,r6,r0
       │                 VARIANT2_INTEGER_MATH(0, cl, cx);
  1.10 │       clrldi r8,r8,32
  0.50 │       rldicr r9,r9,32,31
  1.58 │       add    r9,r9,r8
  1.99 │       add    r3,r10,r9
  0.58 │       mtvsrdd vs62,r5,r9
       │                     const SqrtV2& r = ((SqrtV2*)SqrtV2Table)[n >> 53];
  1.14 │       rldicl r10,r3,11,53
       │                     const int64_t index1 = static_cast<int64_t>((n >> 25) & 268435455) - 134217728;
  0.65 │       rldicl r8,r3,39,36
       │                     const SqrtV2& r = ((SqrtV2*)SqrtV2Table)[n >> 53];
  1.11 │       rldicr r10,r10,3,60
       │                     const int64_t index1 = static_cast<int64_t>((n >> 25) & 268435455) - 134217728;
  0.65 │       addis  r8,r8,-2048
       │                     uint64_t x = static_cast<uint64_t>(r.c0) << 28;
  8.10 │       ldx    r9,r4,r10
       │                     const uint64_t index2 = static_cast<uint64_t>(index1 * index1);
  0.09 │       mulld  r10,r8,r8
       │     _mm_store_si128():
       │       vec_st(b, 0, a);
  0.29 │       stvx   v10,r31,r26
  0.12 │       stvx   v12,0,r25
  0.41 │       stvx   v13,r31,r27
       │     _Z23cryptonight_single_hashILN5xmrig4AlgoE0ELb0ELNS0_7VariantE8EEvPKhmPhPP15cryptonight_ctx():
       │             ((uint64_t*)&l0[idx0 & MASK])[0] = al0;
  0.27 │       stdx   r24,r31,r12
       │                 ((uint64_t*)&l0[idx0 & MASK])[1] = ah0;
  0.53 │       std    r0,8(r11)
       │                     x -= (static_cast<uint64_t>(r.c2) * index2) >> 28;
  0.14 │       rldicl r0,r9,9,55
       │                     uint64_t x = static_cast<uint64_t>(r.c0) << 28;
  0.19 │       rldic  r11,r9,28,3
       │                     x += static_cast<int64_t>(r.c1) * index1;
  0.43 │       rldicl r9,r9,31,42
       │                     x -= (static_cast<uint64_t>(r.c2) * index2) >> 28;
  1.45 │       mulld  r10,r10,r0
       │                     x += static_cast<int64_t>(r.c1) * index1;
  0.77 │       maddld r9,r9,r8,r11
       │                     x -= (static_cast<uint64_t>(r.c2) * index2) >> 28;
  1.03 │       rldicl r10,r10,36,28
  1.50 │       subf   r10,r10,r9
       │                     x >>= 29;
  1.41 │       rldicl r9,r10,35,29
       │                     const uint64_t s = x >> 1;
  0.46 │       rldicl r10,r10,34,30
       │                     const uint64_t b = x & 1;
```

Note the expense of ldx -- the only other time ldx is that expensive is during scratchpad access.

EDIT: Playing a bit with prefetch helped some, it's now at around 912H/s. Still a far cry from the original 1300H/s but showing that there's definitely some pressure on the LUT so maybe a smaller LUT will help.

@SChernykh
Copy link
Owner Author

@madscientist159 You can try https://github.com/SChernykh/sqrt_v2/blob/master/fast_sqrt_v2_small_LUT.h now. It's 1.5 KB LUT, I couldn't make it smaller.
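For intuition, the LUT-plus-interpolation idea behind that header can be sketched in a few lines. This is a floating-point toy, not the fixed-point code in fast_sqrt_v2_small_LUT.h: it reuses the 7-bit index, but with linear rather than cubic interpolation, so its error is much larger than the real thing.

```python
import math

INDEX_BITS = 7
N = 1 << INDEX_BITS                      # 128 segments over [1, 4)
LUT = [math.sqrt(1.0 + 3.0 * i / N) for i in range(N + 1)]

def lut_sqrt(x):
    """Approximate sqrt(x) for x in [1, 4) via LUT + linear interpolation."""
    t = (x - 1.0) * N / 3.0              # position in the table, in segments
    i = int(t)
    frac = t - i
    return LUT[i] + (LUT[i + 1] - LUT[i]) * frac

# Max relative error over a sweep of the input range
max_err = max(abs(lut_sqrt(x) - math.sqrt(x)) / math.sqrt(x)
              for x in (1.0 + 3.0 * k / 50000 for k in range(50000)))
```

Even this crude linear version stays within ~2e-5 of the true value with only 128 segments; cubic interpolation over the same 7-bit index is what buys the extra precision without growing the table.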

@madscientist159
Copy link

madscientist159 commented Oct 15, 2018

@SChernykh I've had a chance to play with both versions, and compare against native. The larger LUT version is actually a bit faster than the small one, which is surprising to me, but can't argue with the hardware!

Is there any way I could ask you for a rsqrt()-based version to test as well? I haven't had much luck porting the OpenCL variant; if you want to put a placeholder for sqrt() and rsqrt() I can implement the inline assembly tie-ins on this side.

Thanks!

EDIT: Here's a relative comparison of expense, FWIW:

  61.41%  xmrig    xmrig               [.] cryptonight_single_hash<(xmrig::Algo)0, false, (xmrig::Variant)8>
  12.70%  xmrig    xmrig               [.] SqrtV2SmallLUT::get
  11.30%  xmrig    xmrig               [.] SqrtV2::get
   6.87%  xmrig    xmrig               [.] cn_explode_scratchpad<(xmrig::Algo)0, 2097152ul, false>
   6.47%  xmrig    xmrig               [.] cn_implode_scratchpad<(xmrig::Algo)0, 2097152ul, false>

  59.81%  xmrig    xmrig              [.] cryptonight_single_hash<(xmrig::Algo)0, false, (xmrig::Variant)8>
  14.77%  xmrig    xmrig              [.] SqrtV2Native::get
  11.13%  xmrig    xmrig              [.] SqrtV2::get
   6.64%  xmrig    xmrig              [.] cn_explode_scratchpad<(xmrig::Algo)0, 2097152ul, false>
   6.27%  xmrig    xmrig              [.] cn_implode_scratchpad<(xmrig::Algo)0, 2097152ul, false>

The LUT version is definitely helping, but it's not enough to make a huge difference (only looking at ~10% improvement really).

@SChernykh
Copy link
Owner Author

@madscientist159 Luckily I found my old test code that I wrote before I ported it to OpenCL: https://github.com/SChernykh/sqrt_v2/blob/master/sqrt_v2_single_precision.h

It works with SSE intrinsics, and it does work despite the low rsqrt() precision - _mm_rsqrt_ss() gives only 12 bits of precision.
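A 12-bit seed is workable because one Newton-Raphson step, y' = y * (1.5 - 0.5 * x * y * y), roughly doubles the number of correct bits, refining 12 bits to full single precision. A small simulation (plain Python standing in for the SSE intrinsics, with the seed's precision loss mimicked by truncating the mantissa):

```python
import math

def rsqrt_seed_12bit(x):
    """Simulate a ~12-bit-accurate reciprocal square root seed."""
    y = 1.0 / math.sqrt(x)
    # Round the mantissa to 12 significant bits to mimic hardware rsqrt.
    m, e = math.frexp(y)
    return math.ldexp(round(m * (1 << 12)) / (1 << 12), e)

def rsqrt_refined(x):
    y = rsqrt_seed_12bit(x)
    return y * (1.5 - 0.5 * x * y * y)   # one Newton-Raphson iteration

x = 1.2345
seed_err = abs(rsqrt_seed_12bit(x) * math.sqrt(x) - 1.0)  # ~1e-4
ref_err  = abs(rsqrt_refined(x) * math.sqrt(x) - 1.0)     # ~1e-8
```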

@madscientist159
Copy link

@SChernykh After testing all methods, the fastest is the large LUT for these particular chips (the 4/8-core devices). The large LUT yields around 61% of the CNv7 performance, and I don't see that increasing at all given how CNv8 is just not a good match for SMT4 and the resource sharing it implies.

Where things may be less awful is on the larger POWER9 parts that were already cache, not core, limited. When the network switches over to v8 I'll try to report back with some data from the larger machines.

@madscientist159
Copy link

@SChernykh One question I do have, though, is whether we can get rid of any other arithmetic operations by increasing the LUT size; we might be able to eke out a little more performance that way, since the LUT isn't really sitting in L1 as far as I can tell.

@SChernykh
Copy link
Owner Author

SChernykh commented Oct 15, 2018

@madscientist159 I don't think so. The LUT grows faster than exponentially as we reduce the multiplication count:
- Cubic interpolation: 7-bit index × 96-bit entries = 1.5 KB LUT
- Quadratic interpolation: 11-bit index × 64-bit entries = 16 KB LUT
- Linear interpolation: 17-bit index × 64-bit entries = 1024 KB LUT
- No interpolation (with one correction multiplication at the end): 34-bit index × 32-bit entries = 64 GB
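The sizes follow directly from 2^(index bits) × entry size:

```python
# A table indexed by k bits with b-bit entries occupies 2**k * b / 8 bytes.
def lut_bytes(index_bits, entry_bits):
    return (1 << index_bits) * entry_bits // 8

sizes = {
    "cubic":     lut_bytes(7, 96),    # 1536 B  = 1.5 KB
    "quadratic": lut_bytes(11, 64),   # 16 KB
    "linear":    lut_bytes(17, 64),   # 1024 KB
    "none":      lut_bytes(34, 32),   # 64 GB
}
```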

@madscientist159
Copy link

@SChernykh We've got a "little" L3 left over (16 MB on the smallest parts), so I'd be curious whether the linear version would provide any speedup on these parts or not. Worst case, I'll probably end up adding an L3/core-count detect and selecting the appropriate method on the fly.

@SChernykh
Copy link
Owner Author

@madscientist159 Ok, I'll prepare linear version tomorrow. It should be simple to do.

@iamsmooth
Copy link

> Don't exaggerate, CPUs were never more than 10-15% of network hashrate ever since GPU miner software was created for Monero.

Kind of a side note to this discussion, but that's probably not accurate. It does look like botnets (assumed to be CPU) were pretty dominant 2015 to 2016 (pre-pump), and GPU mining software was available since 2014 (though not necessarily very good GPU mining software).

@SChernykh
Copy link
Owner Author

@madscientist159 I've added a linear interpolation version and managed to fit it in only 256 KB - a 15-bit index was enough thanks to the smoothness of sqrt(). You can try it; I also think that copying the LUT to "huge pages" memory (like the scratchpad) and accessing it there might give an additional speedup.
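A quick back-of-the-envelope check on why a 15-bit index suffices (this assumes the input is range-reduced to one octave pair [1, 4), the usual trick; the real table is fixed-point). Linear interpolation of a smooth f on segments of width h has error at most h² · max|f''| / 8, and sqrt'' peaks at x = 1:

```python
# Segment width with a 15-bit index over [1, 4)
h = 3.0 / (1 << 15)
max_f2 = 0.25                   # |d^2/dx^2 sqrt(x)| at x = 1 is 1/4
err_bound = h * h * max_f2 / 8  # ~2.6e-10, below 2**-31

table_bytes = (1 << 15) * 8     # 64-bit entries -> 256 KB, as stated
```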

@Bendr0id
Copy link

@SChernykh Thx again for the tweaks.

I integrated all of them into XMRigCC 1.8.0 (https://github.com/Bendr0id/xmrigCC/releases). Additionally, I added the asm version for cn-litev1 and XTL. The performance gain is quite nice.

@madscientist159
Copy link

> @madscientist159 I've added linear interpolation version and managed to fit it in only 256 KB - 15 bit index was enough thanks to smoothness of sqrt(). You can try it, and I also think that copying LUT to "huge pages" memory (like scratchpad) and accessing it there might give an additional speedup.

That is a very good point on the huge pages. I'll give that a shot tonight / tomorrow and report back.

@tevador
Copy link

tevador commented Oct 18, 2018

In case anyone is interested, I measured the power consumption increase caused by CNv2.

| PoW | Power [W] | Power |
|------|-----------|-------|
| CNv1 | 590 | 100% |
| CNv2 | 700 | 119% |

It's the combined power consumption of 2 mining rigs with a total of 9 GPUs (7x RX550, 1x RX560, 1x RX570) measured at the wall using a wattmeter (±5W).

Combined with a ~9% decrease in hashrate, the total efficiency in terms of hashes per Joule dropped by ~24%.
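The ~24% figure follows directly from the two ratios:

```python
# Efficiency (H/J) scales as hashrate / power.
hashrate_ratio = 0.91            # CNv2 vs CNv1 hashrate, ~9% drop
power_ratio = 700 / 590          # wall power from the table above, ~119%

efficiency_ratio = hashrate_ratio / power_ratio
drop = 1 - efficiency_ratio      # roughly a quarter of the H/J is lost
```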

@SChernykh
Copy link
Owner Author

Interesting. But I think more tuning will improve H/s/watt. I think we've yet to find optimal core MHz and voltage for CNv2.

@numerys
Copy link

numerys commented Oct 19, 2018

@SChernykh I've read this discussion and now understand a lot better why e.g. Opterons dropped so badly. Is there a chance to get more hashrate/optimization out of older Opterons?

I've heard this from a guy I work with. If it's technically possible and you could optimize to/near the old hashrates, the effort would be compensated.

@SChernykh
Copy link
Owner Author

@GVgit I don't think it's possible to get the old hashrates on old Opterons at all. They have weak CPU cores, and the new algorithm is heavy on computation. At best, an asm-optimized version can be 2-3% faster than what you get now. If anyone provides me with SSH access to an Opteron server, I can try to make an optimized version this weekend.

@SChernykh
Copy link
Owner Author

Hmm, I just looked at instruction latencies, and sqrtsd is really slow on Opterons. I think the same optimization that helped Power CPUs will help Opterons.

@numerys
Copy link

numerys commented Oct 19, 2018

Do you have access to an AMD Opteron?

@SChernykh
Copy link
Owner Author

@GVgit Not yet... Hopefully some Opteron owner will contact me on Reddit (u/sech1) soon - it's in their own interest.

@numerys
Copy link

numerys commented Oct 19, 2018

Should be done.

@kio3i0j9024vkoenio
Copy link

kio3i0j9024vkoenio commented Oct 20, 2018

> @GVgit Not yet... Hopefully some Opteron owner will contact me on Reddit (u/sech1) soon - it's in their own interest.

I do have a Dell R815 server with quad Opteron 6234 CPUs that is available.
I have sent a message to u/sech1 on Reddit.

@SChernykh
Copy link
Owner Author

I already have a server for testing, thanks anyway.

@SChernykh
Copy link
Owner Author

I've managed to make it 3% faster on Opteron 6276 (Bulldozer): f4dfc2b

This is not final version, I'll keep tweaking it.

@Bendr0id
Copy link

Bendr0id commented Oct 20, 2018 via email

@SChernykh
Copy link
Owner Author

@Bendr0id I think I already found the theoretical limit for CNv2 performance on Opteron 6276. I just did this test out of curiosity: I removed AES and shuffle operations from the main loop completely, and even this didn't make the main loop faster. So it's dominated by div+sqrt latency, I doubt it's possible to improve it further. Maybe I'll be able to squeeze another +0.1%, but that's it.

P.S. Regarding CNv1 (and Stellite variant) - I have one optimization that will make it 2-3% faster, but I think it's already added to xmrigcc?

@Bendr0id
Copy link

Bendr0id commented Oct 20, 2018 via email

@MoneroCrusher
Copy link

offtopic:
@mobilepolice I'm still interested in the W4032BABG-60-F memory timings for low clock hashpower as you have achieved it. I can offer things of great interest to you. Please leave me an email or handle.

@MoneroCrusher
Copy link

sorry for the repeated offtopic but I've seen @mobilepolice active again on CN/R
great things of interest include this (hope you know what that means):
[image]
So please leave a message, email, username on discord or anything really. Maybe we can just quickly chat and maybe I won't even need the straps, just need some numbers and I'll know.
