PoW modifications (shuffles and integer math) discussion and tests #1
RX 550 (2 GB, 640 shaders) / Ubuntu 16.04
The results are a bit strange. Hashrate without any mods dropped from 425 H/s to 395 H/s. INT_MATH with optimization 1 is faster without the shuffle mod; optimization 2 is faster with the shuffle mod.
It was calculated incorrectly before.
Optimization 1 is for NVIDIA cards only, AMD cards don't need it. Really strange, because optimization 2 actually does fewer computations than optimization 1.
@tevador @MoneroCrusher I've improved my shuffle mod GPU code significantly. There is almost no slowdown with shuffle now and much better performance with both mods on RX 550. Can you check it? And we still need someone with a Vega 56/64...
@SChernykh @tevador
@MoneroCrusher You can test on Windows as well, it's not a problem. I've added a .sln file for Visual Studio so you can compile it. P.S. The Community edition of Visual Studio (which is free) should be enough for compiling.
In the meantime, I've tried to overclock memory on my RX 560 (only memory, I left GPU core clock at default 1196 MHz), here are the results: Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2275 MHz, monitor plugged in, intensity 1000, worksize 32:
Is 2275 MHz a good speed for the memory on an RX 560? I didn't get any CPU/GPU mismatch errors and I can't overclock it further - MSI Afterburner just doesn't let me. @MoneroCrusher Did you try to test your Vega?
@SChernykh
@SChernykh I did the tests now and wrongly used worksize 8 first for the mods; I didn't see you guys were using worksize 32, so I corrected it afterwards and tried both worksize 16 and 32. Gigabyte RX 550 2 GB, 8 CU, 2 threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04
Sapphire RX 550 2 GB, 10 CU, 2 Threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04
RX Vega 56, 56 CU, 2 threads (2016/1716), 950/1417, Windows 10
So worksize 32 helps INT_MATH more, worksize 16 helps the shuffle mod more, and worksize 8 helps the unmodified version more. Also, could somebody ELI5 why RandomJS permanently prevents ASICs?
I'll do it this evening. Hopefully it won't brick my card. Thanks for the numbers for Vega 56. It seems that it can handle shuffle mod perfectly. As for integer math mod, it's 89% performance compared to no mods and 81% performance for shuffle+int_math compared to shuffle mod. Can you try to tweak parameters some more? Also overclocking GPU core should really help. We need to know how good it can perform.
Any ASIC that can run random code is basically a CPU. Read this comment: monero-project/monero#3545 (comment)
Are the last 2 numbers GPU core and memory clocks? You really need to push the GPU core to the maximum - the integer math mod adds a lot of computations.
@SChernykh
No, HBM mem has different clocks. 950 is mem and 1417 is core.
Is it necessary for both mods to be implemented for ASIC resistance, or would one of them be enough?
Ok, it looks like it's time to leave only 1 square root in int_math mod to make it easier for GPUs. Two square roots were kind of overkill anyway.
They target different classes of ASICs/FPGAs. Shuffle mod targets devices with external memory, making them 4 times slower. Integer math mod targets devices with on-chip memory, making them 8-10 times slower because of high division and square root latency. They work best together. Remove one mod, and you will enable an efficient ASIC/FPGA again - either with on-chip SRAM or with HBM external memory.
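For intuition, here is a rough C++ sketch of what the two mods add to each iteration of the main loop. This is only the shape of the idea, not the actual OpenCL kernels from the repo; all names and the exact mixing are illustrative:

```cpp
#include <cstdint>
#include <cmath>

// Illustrative only - the real kernels live in SChernykh/xmr-stak-amd.
void inner_loop_step(uint64_t* scratchpad, uint64_t idx,
                     uint64_t ax0, uint64_t bx0,
                     uint64_t& division_result, uint32_t& sqrt_result)
{
    // Shuffle mod: also touch the other 16-byte lanes of the current
    // 64-byte block, so an external-memory design must move ~4x the data.
    uint64_t* block = scratchpad + (idx & ~7ULL);  // 8 x uint64_t = 64 bytes
    for (int i = 0; i < 8; i += 2) {
        block[i]     += bx0;  // placeholder mixing; the real mod combines
        block[i + 1] ^= ax0;  // the running AES registers with each lane
    }

    // Integer math mod: a 64/32-bit division and an integer square root
    // folded back into the state, so an on-chip-memory design pays their
    // full latency on every iteration.
    const uint64_t dividend = block[1];
    const uint32_t divisor  = static_cast<uint32_t>(block[0] + sqrt_result) | 1u;
    division_result = dividend / divisor;
    sqrt_result = static_cast<uint32_t>(
        std::sqrt(static_cast<double>(block[0] + division_result)));
    scratchpad[idx] ^= division_result ^ (static_cast<uint64_t>(sqrt_result) << 32);
}
```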
Will going from 2 square roots to 1 square root make it an easier game for FPGAs?
Leaving just 1 square root won't make it easier for FPGAs. The point of having a square root is that they'll still need to implement it, wasting chip area on it, and that it has high computation latency.
Leaving 1 square root should help. The problem with the RX 550 is that it's unbalanced, unlike other Radeons. If you calculate the GFLOPS to memory bandwidth ratio, it's in the range of 20-25 GFLOPS per GB/s for all Radeons from the RX 560 up to the Vega 64. The RX 550 has only ~10 GFLOPS per GB/s - two times worse.
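As a rough sanity check with stock spec-sheet numbers (reference clocks assumed, so take the exact figures with a grain of salt): an RX 560 (1024 shaders at ~1.27 GHz) is about 2.6 TFLOPS over 112 GB/s of bandwidth, i.e. ~23 GFLOPS per GB/s, while a 512-shader RX 550 at ~1.18 GHz is about 1.2 TFLOPS over the same 112 GB/s, i.e. ~11; even the 640-shader RX 550 variant only reaches ~13-14. That is why the compute-heavy int_math mod starves the RX 550 first.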
I think I'll make the number of square roots configurable for convenience. You'll be able to test 0, 1 or 2 square roots in the int_math mod.
Please find a way to disadvantage all GPUs the same, if that's in any way possible!
It's possible to slow down all GPUs from the RX 560 up to the Vega 64 the same; I can't guarantee it for the RX 550. But we still have time for experimenting, the next fork is in September/October.
@SChernykh I hope we can fork the PoW sooner if it's production-ready. There are reasons to believe FPGAs/ASICs are already on the network...
As for the FPGAs that are coming in August (BCU1525): they'll be slowed down from 20 KH/s to less than 5 KH/s (even down to 2 KH/s if my assumptions about division and square root latencies are correct), which will make them much worse than a Vega 56/64 in terms of performance per $, so they won't be mining Cryptonight at all. As for possible ASICs: devices with external memory will still be ~2.5 times faster than a Vega 56/64 if they use the same HBM2 memory. Given that they'll certainly be more expensive, they won't be serious competition. Devices with on-chip memory like those 220 KH/s Bitmain miners will be down to 20-30 KH/s, still at 550 watts. Much less dangerous to the network. ProgPOW is perfectly tuned for GPUs, but it's not CPU-friendly like Cryptonight, which is a minus for decentralization. ProgPOW ASICs won't be economically viable at all.
I've tested RX 560 with one click timing straps, could overclock memory to 2200 MHz where it started giving CPU/GPU mismatch errors (1-3 errors per test run), but I could still test the performance. Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1000, worksize 32:
It looks like the new memory timings made things better compared to a plain memory overclock: 97.9% vs 94.9% performance. P.S. I didn't see any difference between worksizes 8, 16 and 32 for the version without mods.
I've tested it again at 2150 MHz memory - there were no CPU/GPU mismatch errors at all. I wanted to be sure that errors didn't influence test results. Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2150 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:
The no-mods version worked better with 2 threads @ 512 intensity; all versions with mods worked better with 1 thread @ 1024 intensity.
It's just impossible, because the 560 has exactly the same memory but a GPU core that is twice as powerful. Whatever you do, the 560 will be faster than the 550.
I removed one square root from integer math mod, also tested different thread count, intensity and worksize for different versions. The best I could get:
This is for the RX 560. @MoneroCrusher Can you test again on Vega 56 and RX 550? I've updated the repository.
@MoneroCrusher If you haven't started testing yet, don't do it for now. I have some very cool changes incoming for the integer math mod. These changes both improve GPU performance AND slow down ASIC/FPGA two times more compared to the current integer math mod. P.S. GPU performance didn't really improve, it stayed the same. But still very cool.
I started the tests and it has gotten better (around 20% for the RX 550, but still not parity like the 560/570/580; haven't tested the Vega yet), but I'll wait then. Very cool! So the advantage of ASICs will only be 2-3x after your mod?
Yes. We're now talking about a ~15x slowdown for the coming BCU1525 FPGA (20 KH/s -> 1.4 KH/s) and a similar slowdown for Bitmain ASICs (220 KH/s -> 15 KH/s). Strange thing: these changes improved int_math mod performance when it's applied alone (10% better), but int_math + shuffle stayed the same, even got 1% slower. I'm sure it can be improved further.
I've committed it to the repository: SChernykh/xmr-stak-amd@566f30c The trick was to prevent parallel calculation of division and square roots. Now they have to be done in sequence, effectively doubling the latency for ASIC/FPGA. You can start testing now.
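A minimal C++ sketch of the serialization trick (variable names are illustrative, not the committed kernel code):

```cpp
#include <cstdint>
#include <cmath>

// Before this change, the sqrt input did not depend on the division
// result, so an ASIC/FPGA could run both long-latency units in
// parallel. Chaining them means paying latency(div) + latency(sqrt)
// per iteration instead of max(div, sqrt).
void int_math_step(const uint64_t* line, uint64_t& division_result,
                   uint32_t& sqrt_result)
{
    const uint64_t dividend = line[1];
    const uint32_t divisor  =
        (static_cast<uint32_t>(line[0]) + (sqrt_result << 1)) | 0x80000001u;
    division_result = dividend / divisor;

    // sqrt_input now chains on division_result: no parallelism left.
    const uint64_t sqrt_input = line[0] + division_result;
    sqrt_result = static_cast<uint32_t>(
        std::sqrt(static_cast<double>(sqrt_input)));
}
```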
I've managed to improve the integer math mod a bit more: from 254.3 H/s to 275.6 H/s on my simulated RX 550 when combined with the shuffle mod, so it's an 8% speed-up. But I still need numbers for a real RX 550 and Vega 56 to know where it stands now. P.S. And I've improved it some more, from 275.6 H/s up to 277.0 H/s, so a 9% speed-up compared to the current version. I don't know what else can be done there without making it easier for ASIC/FPGA. I'm out of ideas for today and waiting for the numbers.
@SChernykh Yes agreed regarding stalling, but the v8 algorithm is only running at about 60% of the v7 hashrate, so something's clearly still off. I'll poke around in perf a bit more and see if there's anything obvious. |
@SChernykh I'm wondering if the LUT is getting evicted from L1 cache. Perf output:
Note the expense of ldx -- the only other time ldx is that expensive is during scratchpad access. EDIT: Playing a bit with prefetch helped some; it's now at around 912 H/s. Still a far cry from the original 1300 H/s, but it shows there's definitely some pressure on the LUT, so maybe a smaller LUT will help.
@madscientist159 You can try https://github.com/SChernykh/sqrt_v2/blob/master/fast_sqrt_v2_small_LUT.h now. It's a 1.5 KB LUT, I couldn't make it smaller.
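I won't reproduce the exact contents of fast_sqrt_v2_small_LUT.h here, but the general shape of a small-LUT integer sqrt looks something like this 32-bit toy version: a tiny table seeds a guess from the top mantissa bits, one Newton step refines it, and a clamp makes the result exact:

```cpp
#include <cstdint>
#include <cmath>

static uint16_t seed[192];  // one 2-byte seed per mantissa bucket

void init_seed()
{
    // Mantissa m covers [1, 4); store sqrt(m) in 8.8 fixed point.
    for (int i = 0; i < 192; ++i)
        seed[i] = static_cast<uint16_t>(std::sqrt(1.0 + i / 64.0) * 256.0);
}

uint32_t lut_sqrt32(uint32_t n)
{
    if (n < 64)  // tiny inputs: the table indexing below needs n >= 64
        return static_cast<uint32_t>(std::sqrt(static_cast<double>(n)));
    const int exp = (31 - __builtin_clz(n)) & ~1;  // even, 2^exp <= n < 2^(exp+2)
    const uint32_t idx = (n >> (exp - 6)) - 64;    // 0..191, top mantissa bits
    uint32_t x = (static_cast<uint32_t>(seed[idx]) << (exp / 2)) >> 8; // guess
    x = (x + n / x) >> 1;                          // one Newton-Raphson step
    while (static_cast<uint64_t>(x) * x > n) --x;            // clamp overshoot
    while (static_cast<uint64_t>(x + 1) * (x + 1) <= n) ++x; // and undershoot
    return x;
}
```

(`__builtin_clz` is the GCC/Clang count-leading-zeros builtin; the real 64-bit code targets the specific CNv2 sqrt input form, which this sketch ignores.)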
@SChernykh I've had a chance to play with both versions and compare against native. The larger LUT version is actually a bit faster than the small one, which is surprising to me, but you can't argue with the hardware! Is there any way I could ask you for an rsqrt()-based version to test as well? I haven't had much luck porting the OpenCL variant; if you want to put a placeholder for sqrt() and rsqrt(), I can implement the inline assembly tie-ins on this side. Thanks! EDIT: Here's a relative comparison of expense, FWIW:
The LUT version is definitely helping, but it's not enough to make a huge difference (only looking at ~10% improvement, really).
@madscientist159 Luckily I found my old test code that I wrote before I ported it to OpenCL: https://github.com/SChernykh/sqrt_v2/blob/master/sqrt_v2_single_precision.h It uses SSE intrinsics, and it does work with low rsqrt() precision - _mm_rsqrt_ss() gives only 12 bits of precision.
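For reference, the standard rsqrt refinement trick looks like this (a sketch assuming x > 0, not necessarily line-for-line what sqrt_v2_single_precision.h does):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// _mm_rsqrt_ss() returns a ~12-bit estimate of 1/sqrt(x); one
// Newton-Raphson step, y' = y * (3 - x*y*y) / 2, roughly doubles the
// number of good bits, and sqrt(x) is then recovered as x * rsqrt(x).
float fast_sqrt(float x)
{
    const __m128 vx = _mm_set_ss(x);
    __m128 y = _mm_rsqrt_ss(vx);                              // ~12-bit 1/sqrt(x)

    const __m128 half  = _mm_set_ss(0.5f);
    const __m128 three = _mm_set_ss(3.0f);
    const __m128 yy = _mm_mul_ss(y, y);                       // y*y
    const __m128 t  = _mm_sub_ss(three, _mm_mul_ss(vx, yy));  // 3 - x*y*y
    y = _mm_mul_ss(_mm_mul_ss(half, y), t);                   // y/2 * (3 - x*y*y)

    return _mm_cvtss_f32(_mm_mul_ss(vx, y));                  // x * 1/sqrt(x)
}
```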
@SChernykh After testing all methods, the fastest is the large LUT for these particular chips (the 4/8-core devices). The large LUT yields around 61% of the CNv7 performance, and I don't see that increasing at all given how CNv8 is just not a good match for SMT4 and the resource sharing it implies. Where things may be less awful is on the larger POWER9 parts that were already cache-limited, not core-limited. When the network switches over to v8, I'll try to report back with some data from the larger machines.
@SChernykh One question I do have though is whether we can get rid of any other arithmetic operations by increasing the LUT size; we might be able to eke out a little bit more performance that way since the LUT isn't really sitting in L1 as far as I can tell. |
@madscientist159 I don't think so. The LUT grows faster than exponentially if we start reducing the multiplication count:
@SChernykh We've got a "little" L3 left over (16 MB on the smallest parts), so I'd be curious whether the linear version would provide any speedup on these parts or not. Worst case, I'll probably end up adding an L3/core-count detection and selecting the appropriate method on the fly.
@madscientist159 Ok, I'll prepare the linear version tomorrow. It should be simple to do.
Kind of a side note to this discussion, but that's probably not accurate. It does look like botnets (assumed to be CPU) were pretty dominant 2015 to 2016 (pre-pump), and GPU mining software was available since 2014 (though not necessarily very good GPU mining software). |
@madscientist159 I've added the linear interpolation version and managed to fit it in only 256 KB - a 15-bit index was enough thanks to the smoothness of sqrt(). You can try it, and I also think that copying the LUT to "huge pages" memory (like the scratchpad) and accessing it there might give an additional speedup.
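The committed code is the reference, but the shape of a linear-interpolation LUT is roughly this 32-bit toy version: 2^15 segments at 8 bytes per entry gives exactly the 256 KB mentioned, and one multiply-add per lookup suffices because sqrt() is smooth within each segment:

```cpp
#include <cstdint>
#include <cmath>

struct Segment { uint32_t base; uint32_t slope; };  // 8 bytes per entry
static Segment lut[1 << 15];                        // 32768 * 8 = 256 KB

void init_lut()
{
    for (uint32_t i = 0; i < (1u << 15); ++i) {
        // Segment i covers inputs [i*2^17, (i+1)*2^17) of the 32-bit domain.
        const double x0 = static_cast<double>(i) * (1 << 17);
        const double x1 = x0 + (1 << 17);
        const double s0 = std::sqrt(x0), s1 = std::sqrt(x1);
        lut[i].base  = static_cast<uint32_t>(s0 * 256.0);        // 8 frac bits
        lut[i].slope = static_cast<uint32_t>((s1 - s0) * 256.0); // per segment
    }
}

uint32_t sqrt_interp(uint32_t n)
{
    const uint32_t idx  = n >> 17;               // 15-bit segment index
    const uint32_t frac = n & ((1u << 17) - 1);  // position inside segment
    const Segment& s = lut[idx];
    // base + slope * frac / 2^17, in 8-bit fixed point, then drop fraction.
    const uint64_t r = s.base + ((static_cast<uint64_t>(s.slope) * frac) >> 17);
    return static_cast<uint32_t>(r >> 8);
}
```

(The real version normalizes the 64-bit CNv2 sqrt input first; accuracy of this toy is poor near zero, where sqrt() is not locally linear.)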
@SChernykh Thx again for the tweaks. I integrated all of them into XMRigCC 1.8.0 (https://github.com/Bendr0id/xmrigCC/releases). Additionally, I added the asm version for cn-litev1 and XTL. The performance gain is quite nice.
That is a very good point on the huge pages. I'll give that a shot tonight / tomorrow and report back. |
In case anyone is interested, I measured the power consumption increase caused by CNv2.
It's the combined power consumption of 2 mining rigs with a total of 9 GPUs (7x RX 550, 1x RX 560, 1x RX 570), measured at the wall with a wattmeter (±5 W). Combined with a ~9% decrease in hashrate, the total efficiency in terms of hashes per Joule dropped by ~24%.
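To spell out the arithmetic behind that efficiency figure: hashrate scaled by ~0.91 and hashes per Joule by ~0.76, so the implied wall-power increase is 0.91 / 0.76 ≈ 1.20, i.e. the rigs drew roughly 20% more power under CNv2.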
Interesting. But I think more tuning will improve H/s per watt. I think we've yet to find the optimal core clock and voltage for CNv2.
@SChernykh I've read this discussion and now understand a lot better why e.g. Opterons dropped so badly. Is there a chance to get more hashrate/optimization out of older Opterons? I heard this question from a guy I work with. If it's technically possible and you could optimize to or near the old hashrates, the effort would be compensated.
@GVgit I don't think it's possible to get the old hashrates on old Opterons at all. They have weak CPU cores, and the new algorithm is heavy on computations. At best, an asm-optimized version can be 2-3% faster than what you get now. If anyone provides me with SSH access to an Opteron server, I can try to make an optimized version this weekend.
Hmm, I just looked at instruction latencies and
Did you get access to an AMD Opteron?
@GVgit Not yet... Hopefully some Opteron owner will contact me on Reddit (u/sech1) soon - it's in their own interest.
Should be done. |
I do have a Dell R815 server with quad Opteron 6234s that is available.
I already have a server for testing, thanks anyway. |
I've managed to make it 3% faster on an Opteron 6276 (Bulldozer): f4dfc2b This is not the final version, I'll keep tweaking it.
Awesome work, man!
If you some day have a little spare time, I would love to see a cnv1_doublehash_sandybridge. Or do you think it's not worth it?
@Bendr0id I think I already found the theoretical limit for CNv2 performance on the Opteron 6276. I just did this test out of curiosity: I removed the AES and shuffle operations from the main loop completely, and even that didn't make the main loop faster. So it's dominated by div+sqrt latency; I doubt it's possible to improve it further. Maybe I'll be able to squeeze out another +0.1%, but that's it. P.S. Regarding CNv1 (and the Stellite variant): I have one optimization that will make it 2-3% faster, but I think it's already added to xmrigcc?
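To illustrate why stripping independent work doesn't help a latency-bound loop, here is a toy stand-alone benchmark (not miner code, all constants illustrative) where each iteration's div and sqrt depend on the previous result, so the dependency chain alone sets the per-iteration floor:

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdio>

int main()
{
    uint64_t x = 0x0123456789ABCDEFull;
    const int iters = 10000000;

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        const uint32_t divisor = static_cast<uint32_t>(x) | 0x80000001u;
        const uint64_t q = x / divisor;                  // long-latency division
        const uint64_t s = static_cast<uint64_t>(
            std::sqrt(static_cast<double>(x + q)));      // long-latency sqrt
        // An LCG step keeps x 64 bits wide while preserving the chain.
        x = (x * 6364136223846793005ull + 1442695040888963407ull) ^ s ^ q;
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("%.2f ns per dependent div+sqrt step (x = %llu)\n",
                ns / iters, static_cast<unsigned long long>(x));
    return 0;
}
```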
@SChernykh Yes, that's implemented. I was asking for a double-hash version. I don't know if it would improve the hashrate.
Offtopic:
Sorry for the repeated offtopic, but I've seen @mobilepolice active again on CN/R
The original discussion starts here: monero-project/monero#3545 (comment)
GPU version of shuffle and integer math modifications is here: https://github.com/SChernykh/xmr-stak-amd
You can post your performance test results, as well as your suggestions and concerns, here.
AMD Ryzen 7 1700 @ 3.6 GHz, 8 threads
AMD Ryzen 5 2600 @ 4.0 GHz, 1 thread
AMD Ryzen 5 2600 @ 4.0 GHz, 8 threads (affinity 0,2,4,5,6,8,10,11)
Intel Pentium G5400 (Coffee Lake, 2 cores, 4 MB Cache, 3.70 GHz), 2 threads
Intel Core i5 3210M (Ivy Bridge, 2 cores, 3 MB Cache, 2.80 GHz), 1 thread
Intel Core i7 2600K (Sandy Bridge, 4 cores, 8 MB Cache, 3.40 GHz), 1 thread
Intel Core i7 7820X (Skylake-X, 8 cores, 11 MB Cache, 3.60 GHz), 1 thread
The XMR-STAK version used here is an old one, so don't expect the same numbers that you get on your mining rigs. What's important here are the relative numbers of the original and modified Cryptonight versions.
Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:
* strided_index = 2, mem_chunk = 2 (64 bytes)
Radeon RX 560 on Windows 10 (RX 550 simulation): core @ 595 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:
* Increasing intensity to 1440 improved the performance of both mods, but made performance worse in the other cases.
It looks like RX 550 needs GPU core overclocking to properly handle new modifications.
GeForce GTX 1080 Ti 11 GB on Windows 10: core 2000 MHz, memory 11800 MHz, monitor plugged in, intensity 1280, worksize 8:
GeForce GTX 1060 6 GB on Windows 10: all stock, monitor plugged in, intensity 800, worksize 8:
GeForce GTX 1050 2 GB on Windows 10: core 1721 MHz, memory 1877 MHz, monitor unplugged, intensity 448, worksize 8:
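For reference, here is roughly how the settings listed above map onto an xmr-stak-style amd.txt thread entry. This is an assumption for illustration: the field names (intensity, worksize, affine_to_cpu, strided_index, mem_chunk) follow later mainline xmr-stak releases, and this older fork's config file may differ:

```
"gpu_threads_conf" : [
  // intensity 1024, worksize 32, strided_index 2 with 64-byte memory chunks
  { "index" : 0, "intensity" : 1024, "worksize" : 32,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2 },
],
```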