cryptonight_v8 version 2 #1850
a39663c to 0fef2cf
@psychocrypt Don't leave out the asm-optimized double hash version. The same version can be used for both Intel and AMD, and it's much faster than the C++ code on both platforms.
@kio3i0j9024vkoenio Try with either envvar Or whatever By default install, You still need the CUDA SDK ( |
cmake .. -DOpenCL_INCLUDE_DIR=/usr/local/cuda/targets/x86_64-linux/include -DOpenCL_LIBRARY=/usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so.1.0.0
Fixed the problem. Thanks
I second this request. I am only getting cryptonight_v8: 1293 H/s using the latest XMR-Stak 1850/1851 version, whereas I am getting 1525 H/s using the latest SChernykh XMR-Stak-CPU code with all the same v8 changes and the optimized asm for 1x and 2x threads.
Yes, I will add this soon.
@kio3i0j9024vkoenio The "double hash asm" is only for CPUs; it won't help your OpenCL hashrates at all.
Yes, I know that.
Update: I am still implementing the native NVIDIA backend. Everything takes longer than expected because I may have found a bug in the NVIDIA compiler nvcc.
Has the optimized double hash asm for CPU been added?
Currently not. But do not worry, I will do it soon.
@psychocrypt Is the NVIDIA backend issue something I can help with?
Currently not. I worked around the bug by adding a sync in between to avoid
a memory optimization which is wrong. I now need to clean up the code before I
can push it. After that I can have a deeper look into the performance. All
in all, the biggest issue is the high register usage compared to v7.
Is it faster than the current OpenCL version? If not, maybe it's better to port OpenCL instead?
I also ported a single-thread version which is similar to OpenCL, but in CUDA
it is not faster. One reason may be that in OpenCL we had much more
compile-time information due to the just-in-time compilation, where cfg
values are translated into defines.
I also tried to profile OpenCL, but nvprof no longer supports OpenCL.
Nevertheless, it currently looks like I am only 40 hashes away from OpenCL
(500 H/s for the GTX 1080) with the default config. I have not played around with
different configs. I also use your shared memory chunk read version from
OpenCL but must check for shared memory bank conflicts.
I will try to bring a PR this evening (German time).
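The remark about compile-time information can be illustrated with a minimal C++ sketch. The function names and the mask value are hypothetical and not taken from xmr-stak. The idea is only that a value baked in at compile time (as a JIT-supplied define would be) can be folded into the generated code, while the same value passed at runtime must be carried in a register.

```cpp
#include <cstdint>

// Hypothetical stand-in for a kernel helper; not xmr-stak code.
// Runtime variant: `mask` is only known when the function is called,
// so the compiler has to treat it as an ordinary variable.
constexpr std::uint32_t scratchpad_index_runtime(std::uint64_t value, std::uint32_t mask)
{
    return static_cast<std::uint32_t>(value) & mask;
}

// Compile-time variant: MASK plays the role of a `-DMASK=...` define handed
// to a just-in-time compiler, so it is a constant the optimizer can fold.
template <std::uint32_t MASK>
constexpr std::uint32_t scratchpad_index_static(std::uint64_t value)
{
    return static_cast<std::uint32_t>(value) & MASK;
}

// Both variants compute the same result; only the time at which the
// mask becomes known differs. 0x1FFFF0 is an assumed example value.
static_assert(scratchpad_index_static<0x1FFFF0u>(0x123456789ABCDEFULL) ==
                  scratchpad_index_runtime(0x123456789ABCDEFULL, 0x1FFFF0u),
              "compile-time and runtime variants must agree");
```

In ahead-of-time compiled CUDA the closest equivalent is one template instantiation per supported configuration, which trades binary size for the constant folding that OpenCL gets for free at runtime.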
@psychocrypt There is always a way to make CUDA at least exactly as fast as OpenCL: create a separate kernel for CNv2, then look at the generated PTX assembly and fix all the differences.
I certainly hope that the CUDA version will be faster than the OpenCL version. The OpenCL V8 version runs terribly on Nvidia GTX 750/750 Ti cards: V8 performance is only 73.4% of what V7 produces.
@SChernykh Bad news: I cannot push the CUDA code today. I now get invalid results on my GTX 1080. I think I know where it is coming from, but I need to test whether I am right (this takes some time).
add cpu implementation for the final monero POW
apply optimizations Co-authored-by: SChernykh <sergey.v.chernykh@gmail.com>
- introduce a new schema where two threads work together on one hash
- update autoadjustment
- remove a mistake where shared memory was shrunk for gpus < sm_70
In the auto adjust without hwlock the asm entry was missing
0fef2cf to 010cbd9
@SChernykh I have now added the CUDA code. On my GTX 1080 it is a few hashes faster than the OpenCL version.
Issue building:
EDIT: FYI, using CUDA9.2. Should I revert to 9.1? |
Getting same error, where in the code do you specify the CUDA version? |
@Bathmat Line 213 on |
Cool, that fixed it. Thanks @plavirudar. Yes, compiling on Win10 v1803. Will test and publish results a little later (hopefully tonight).
@psychocrypt Cool, but it is still doing worse compared to cn/7 (my previous tests: #1832 (comment)) and a bit better than the OpenCL kernel. I'm testing on Windows with @plavirudar's recommended fix.
@blitss Here are my tests with my GTX-1060. CNv8 is slower than v7 by design, but nvidia does appear to have a bigger drop in hashrate compared to CPU or AMD.
@Bathmat Well, I was trying to adjust threads but it didn't work for me. Setting threads to 30 made it 380 H/s.
@Bathmat Is your card overclocked? Are you using CUDA 9 or 10 during the build? I have exactly the same card but not the same results.
@blitss Yes, overclocked. +150 core, +500 mem = 2000 core, 4300 mem
@Bathmat I got slightly better results with the same overclock: 461.5 H/s against 508 H/s on cn/7. About 10% lower. I'm going to check it on OS X later.
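For comparing the hashrates quoted in this thread, the relative drop can be computed as (baseline - current) / baseline. The helper below is a small illustrative sketch with a hypothetical name; it is not part of xmr-stak.

```cpp
// Relative slowdown of `current` versus `baseline`, in percent.
// Hypothetical helper for comparing reported hashrates.
constexpr double relative_drop_percent(double baseline, double current)
{
    return (baseline - current) / baseline * 100.0;
}
```

With the figures above, 508 H/s on cn/7 against 461.5 H/s on CNv8 is a drop of (508 - 461.5) / 508, roughly 9.2%.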
`uint` is unknown in windows, therefore switch to the better type `uint32_t`
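As background for this commit: `uint` is a non-standard typedef that some POSIX system headers provide, so code using it fails to compile with MSVC, while `uint32_t` from `<cstdint>` is standard and has a fixed width. The sketch below uses a hypothetical helper (not taken from the PR) to show the portable form.

```cpp
#include <cstdint>
#include <limits>

using std::uint32_t;

// uint32_t is guaranteed to have exactly 32 value bits wherever it exists,
// which a plain `unsigned int` (or the non-standard `uint`) does not promise.
static_assert(std::numeric_limits<uint32_t>::digits == 32,
              "uint32_t has exactly 32 value bits");

// Hypothetical helper, for illustration only: rotate a 32-bit word left.
// The masking keeps both shift counts in the range 0..31, so a shift of 0
// never produces an undefined 32-bit shift.
inline uint32_t rotate_left(uint32_t value, unsigned shift)
{
    shift &= 31u;
    return (value << shift) | (value >> ((32u - shift) & 31u));
}
```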
@plavirudar I fixed the |
- restructure asm preparation function
- add double hash asm code
@Bathmat Don't forget the Pascal/10xx P2-lock, and be sure to disable it (this tool). I have used some 1060 6GB dual-fan full-length PNY cards, and they have P0 and P2 clocked identically, so it doesn't matter. Blitss may have a card with a "good" BIOS like that.
Since nvidia clocking is by offset, even if you "have the same clocking", your base clock may still be lower, so the real effective clock will not be the same as someone else's because of a different BIOS (base clocks). You can't simply raise your offsets by 700 to match the effective clock, because the card goes P2->P0->P8 when you exit compute applications, which will nuke the card off the bus (P0 + 700 = waaaay too much, for example, like P2 + 1400).
Check with nvidia-smi while the miner is running: start->run->cmd, then cd to the nvidia program files (I forget if they are in x86 or not), whichever has the NVSMI folder, and go in there. Then you can run
I disabled a few algorithms for faster compile and forgot to re-enable them.
I added the Intel double hash ASM version from @SChernykh (currently not tested under Windows, feedback is welcome).
@Spudz76 Looks like the P2 vs P0 difference on my GTX-1060 isn't that drastic (200 MHz on the mem clock). I set it to P0, applied the same +150 core/+500 mem and got 2000 core/4500 mem. This gave +20 H/s on both CNv7 and CNv8 (520 to 540 for CNv7, 460 to 480 for CNv8), so only about a 4% boost. Hopefully the extra 200 MHz won't cause stability issues, but I doubt it. Thanks. EDIT:
Lol, guess that extra 200 MHz does cause issues.
@Bathmat I had similar "fuzzy edge" on Hynix memory (also the MSI) whereas I can go like +500 over those on the Samsung (PNY)
Gotta love Samsung memory... although, my Samsung AMD GPUs are taking the biggest hit with CNv8. But really, CNv8 levels the playing field for all mem brands.
Yeah I keep remembering, all ships sink the same with this fork, so the rate difference is somewhat negated anyway (other than mixing up which cards are "best" hash per watt a little bit, I guess)
NVRTC does exactly what AMD OpenCL runtime compiler does, and is just nvcc wrapped in the driver runtime.
Version: xmr-stak 2.4.7 a6ecf8d on config with: sometimes segfault compiled as older versions with regards |
compiling the same code with: gcc version 6.3.1 20170216 (Red Hat 6.3.1-3) (GCC) both v8 and v7 are hashing OK |
This PR adds the final changes for the upcoming cryptonight_v8 to the CPU and OpenCL backend.
Mining cryptonight_v8 with the CUDA backend is currently not allowed because the CUDA backend has not been updated yet.
Performance can be reported in #1851.
Please report here only code suggestions or bugs.
todo