cryptonight_v8 version 2 #1850

psychocrypt · 2018-09-24T18:53:15Z

This PR adds the final changes for the upcoming cryptonight_v8 to the CPU and OpenCL backends.
Mining cryptonight_v8 with the CUDA backend is currently not allowed because the CUDA backend has not been updated yet.

Performance can be reported in #1851.
Please report here only code suggestions or bugs.

- For CUDA, the changes from Monero POW v2 #1831 are reverted and a new schema with two threads per hash is introduced.
- Add asm for double hash (Intel).

Todo:

- add native CUDA version
- test double hash asm under Windows

SChernykh · 2018-09-24T21:44:28Z

@psychocrypt Don't leave out double hash asm optimized version. The same version can be used both for Intel and AMD, and it's much faster than C++ code on both platforms.

Spudz76 · 2018-09-25T00:39:26Z

@kio3i0j9024vkoenio Try setting the environment variable `OpenCL_ROOT=/usr/local/cuda/targets/x86_64-linux`, or, if that doesn't help, force it by adding `-DOpenCL_INCLUDE_DIR=/usr/local/cuda/targets/x86_64-linux/include -DOpenCL_LIBRARY=/usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so.1.0.0` to your cmake command.

Or use whatever paths `sudo updatedb` followed by `locate include/CL` and `locate libOpenCL.so` give. Avoid the system `/usr/include/CL` and whatever is in `/usr/lib/`; those are usually Mesa/Clover unless you've purged them on purpose. Add-on drivers usually don't replace an existing OpenCL implementation's includes/libs; they just put their ICD pointer into `/etc/OpenCL/vendors/`.

With a default install, /usr/local/cuda is a link to whichever CUDA SDK is active, but locate does not show links. So I replaced the /usr/local/cuda-9.1/blahblah in my locate output with just /cuda/, which will follow whichever SDK is active after an upgrade.

You still need the CUDA SDK (generally the cuda-*-dev packages) as far as I know, but not the AMD APP SDK. It is only needed for the headers; I think libOpenCL.so comes with the drivers (nvidia-dkms or similar packages). However, when nothing is found, the build attempts to force the APP SDK, which is where it fails (it is probably not installed). You could also just use the OpenCL headers from Khronos, which are supposed to be the new "universal" headers for everyone and everything (I have had great results with them targeting both AMD and NVIDIA libraries). Grab a ZIP from the download, unpack it, and use the CL dir to replace /usr/include/CL; then your system will have the right headers for everything (Debian/Ubuntu will probably just wrap those headers in upcoming opencl-headers packaging; currently they ship some braindead old Mesa OpenCL headers that don't work).

kio3i0j9024vkoenio · 2018-09-25T03:39:51Z

cmake .. -DOpenCL_INCLUDE_DIR=/usr/local/cuda/targets/x86_64-linux/include -DOpenCL_LIBRARY=/usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so.1.0.0

Fixed the problem.

Thanks

kio3i0j9024vkoenio · 2018-09-25T16:43:03Z

> @psychocrypt Don't leave out double hash asm optimized version. The same version can be used both for Intel and AMD, and it's much faster than C++ code on both platforms.

I second this request. I am only getting:

cryptonight_v8: 1293 H/s using XMR-Stak 1850/1851 latest version

whereas I am getting 1525 H/s using SChernykh XMR-Stak-CPU latest code with all the same v8 changes and the optimized asm for 1x and 2x threads.

#1851 (comment)

psychocrypt · 2018-09-25T17:23:19Z

Yes, I will add this soon.

Spudz76 · 2018-09-26T12:26:56Z

@kio3i0j9024vkoenio The "double hash asm" is only for CPUs, it won't help your OpenCL hashrates at all.

kio3i0j9024vkoenio · 2018-09-26T12:36:30Z

> @kio3i0j9024vkoenio The "double hash asm" is only for CPUs, it won't help your OpenCL hashrates at all.

Yes I know that.

psychocrypt · 2018-09-28T19:52:35Z

Update: I am still implementing the native NVIDIA backend. Everything is taking longer than expected. The reason is that I may have found a bug in the NVIDIA compiler, nvcc.

kio3i0j9024vkoenio · 2018-09-29T04:16:45Z

Has the optimized double hash asm for CPU been added?

psychocrypt · 2018-09-29T06:51:24Z

Currently not. But do not worry, I will do it soon.

SChernykh · 2018-09-29T06:53:20Z

@psychocrypt Is NVIDIA backend issue something I can help with?

psychocrypt · 2018-09-29T07:07:50Z

Currently not. I worked around the bug by adding a sync in between to avoid an incorrect memory optimization. I now need to clean up the code before I can push it. After that I can take a deeper look at the performance. All in all, the biggest issue is the high register usage compared to v7.

SChernykh · 2018-09-29T07:13:34Z

Is it faster than current OpenCL version? If not, maybe it's better to port OpenCL instead?

psychocrypt · 2018-09-29T07:22:10Z

I also ported a single-thread version, which is similar to OpenCL, but in CUDA it is not faster. One reason could be that in OpenCL we have much more compile-time information because of the just-in-time compilation, where cfg values are translated into defines. I also tried to profile OpenCL, but nvprof no longer supports OpenCL. Nevertheless, it currently looks like I am only 40 hashes away from OpenCL (500 H/s for a GTX 1080) with the default config. I have not played around with different configs. I also use your shared memory chunk read version from OpenCL but must check for shared memory bank conflicts. I will try to bring a PR this evening (German time).
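
A minimal sketch of what "cfg values are translated into defines" can look like on the host side, assuming a hypothetical `KernelConfig` struct and made-up macro names (`WORKSIZE`, `MEMORY`, `UNROLL`); none of these are taken from the xmr-stak sources. The resulting string is the kind of thing that can be passed as the `options` argument of `clBuildProgram`, so the kernel compiler sees the values as compile-time constants:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical runtime configuration (field names are illustrative only).
struct KernelConfig {
    std::size_t worksize;      // threads per work-group
    std::size_t memory_bytes;  // scratchpad size per hash
    int unroll;                // loop unroll factor
};

// Turn runtime config values into "-DNAME=value" compiler options so that a
// just-in-time compiler can treat them as compile-time constants.
std::string build_defines(const KernelConfig& cfg) {
    assert(cfg.worksize > 0 && "worksize must be positive");
    std::string opts;
    opts += "-DWORKSIZE=" + std::to_string(cfg.worksize);
    opts += " -DMEMORY=" + std::to_string(cfg.memory_bytes);
    opts += " -DUNROLL=" + std::to_string(cfg.unroll);
    return opts;
}
```

In an ahead-of-time nvcc build the same values stay ordinary runtime variables unless they are turned into macros or template parameters, which is one possible reason for the difference described above.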

SChernykh · 2018-09-29T07:24:40Z

@psychocrypt There is always a way to make CUDA at least exactly as fast as OpenCL: create separate kernel for CNv2, then look at generated PTX assembly and fix all the differences.

kio3i0j9024vkoenio · 2018-09-29T11:52:39Z

> Is it faster than current OpenCL version? If not, maybe it's better to port OpenCL instead?

I certainly hope that the CUDA version will be faster than the OpenCL version, as the OpenCL v8 version performs terribly on NVIDIA GTX 750/750 Ti cards: v8 performance is only 73.4% of what v7 produces.

#1851 (comment)

psychocrypt · 2018-09-29T21:44:47Z

@SChernykh Bad news: I cannot push the CUDA code today. I now get invalid results on my GTX 1080. I think I know where they are coming from, but I need to test whether I am right (this takes some time).

psychocrypt · 2018-09-30T21:26:36Z

@SChernykh I have now added the CUDA code. On my GTX 1080 it is a few hashes faster than the OpenCL version.
I still need to check some performance-critical parts of the CUDA code, but for now it should be OK.

Bathmat · 2018-09-30T21:49:44Z

Issue building:

"C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\install.vcxproj" (default target) (1) ->
"C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\ALL_BUILD.vcxproj" (default target) (3) ->
"C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\xmrstak_cuda_backend.vcxproj" (default target) (8) ->
(CustomBuild target) ->
  C:/xmr-stak/xmr-stak-topic-cn8Version2/xmrstak/backend/nvidia/nvcc_code/cuda_core.cu(213): error : identifier "uint"
is undefined [C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\xmrstak_cuda_backend.vcxproj]
  C:/xmr-stak/xmr-stak-topic-cn8Version2/xmrstak/backend/nvidia/nvcc_code/cuda_core.cu(213): error : identifier "uint"
is undefined [C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\xmrstak_cuda_backend.vcxproj]

EDIT: FYI, I am using CUDA 9.2. Should I revert to 9.1?

plavirudar · 2018-09-30T23:09:24Z

Getting same error, where in the code do you specify the CUDA version?

plavirudar · 2018-09-30T23:35:30Z

@Bathmat Line 213 of cuda_core.cu doesn't compile when uint is used on certain platforms (Windows?). Changing it to unsigned int makes it compile, as per nitishsrivastava/deepnet#57.
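
For context, `uint` is not a standard C or C++ type; it is a convenience typedef that some system headers (for example glibc's `<sys/types.h>`) provide, which would explain why the same line builds on Linux but not with MSVC. A minimal sketch of the portable alternative, using the fixed-width types from `<cstdint>`; the function `low_word` is a made-up example and is not the code at line 213:

```cpp
#include <cstdint>

// Non-portable: `uint` only exists where a system header happens to define it.
//   uint low = static_cast<uint>(value);
//
// Portable: std::uint32_t comes from <cstdint> and does not depend on
// platform-specific typedefs.

// Hypothetical helper: return the low 32 bits of a 64-bit value.
inline std::uint32_t low_word(std::uint64_t value) {
    return static_cast<std::uint32_t>(value & 0xFFFFFFFFu);
}
```

Either `unsigned int` or `std::uint32_t` avoids the error; the fixed-width type additionally documents the intended size.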

Bathmat · 2018-10-01T00:10:21Z

> @Bathmat Line 213 of cuda_core.cu doesn't compile when uint is used on certain platforms (Windows?). Changing it to unsigned int makes it compile, as per nitishsrivastava/deepnet#57.

Cool, that fixed it. Thanks @plavirudar. Yes, compiling on Win10 v1803. Will test and publish results a little later (hopefully tonight).

blitss · 2018-10-01T16:53:09Z

@psychocrypt Cool, but it still does worse compared to cn/7 (my previous tests: #1832 (comment)) and is only a bit better than the OpenCL kernel.

I'm testing on Windows with @plavirudar's recommended fix.

PS C:\Users\Andrey\Documents\xmr-stak-cn8\xmr-stak\build\bin\Release> ./xmr-stak --currency cryptonight_v8 --benchmark 8 --benchwait 20 --benchwork 30 --noCPU --noAMD
-------------------------------------------------------------------
xmr-stak 2.4.7 010cbd9

Brought to you by fireice_uk and psychocrypt under GPLv3.
Based on CPU mining code by wolf9466 (heavily optimized by fireice_uk).
Based on NVIDIA mining code by KlausT and psychocrypt.
Based on OpenCL mining code by wolf9466.

Configurable dev donation level is set to 2.0%

You can use following keys to display reports:
'h' - hashrate
'r' - results
'c' - connection
-------------------------------------------------------------------
[2018-10-01 20:50:32] : Mining coin: cryptonight_v8
!!!! Doing only a benchmark and exiting. To mine, remove the '--benchmark' option. !!!!
[2018-10-01 20:50:32] : Prepare benchmark for block version 8
[2018-10-01 20:50:32] : Starting NVIDIA GPU thread 0, no affinity.
CUDA [10.0/10.0] GPU#0, device architecture 61: "GeForce GTX 1060 6GB"... device init succeeded
[2018-10-01 20:50:32] : Wait 20 sec until all backends are initialized
[2018-10-01 20:50:52] : Start a 30 second benchmark...
[2018-10-01 20:51:22] : Benchmark Thread 0 nvidia: 393.5 H/S
[2018-10-01 20:51:22] : Benchmark Total: 393.5 H/S

Bathmat · 2018-10-01T17:04:38Z

@blitss here are my tests with my GTX-1060. CNv8 is slower than v7 by design, but nvidia does appear to have a bigger drop in hashrate compared to CPU or AMD.
CUDA test 1
CUDA test 2

blitss · 2018-10-01T17:05:29Z

@Bathmat Well, I was trying to adjust threads but it didn't work for me. Setting threads to 30 gave 380 H/s.

blitss · 2018-10-01T17:08:28Z

@Bathmat Is your card overclocked? Are you using CUDA 9 or 10 for the build? I have exactly the same card but not the same results.
I used the same config and got only 408.8 H/s.

Bathmat · 2018-10-01T17:10:09Z

@blitss yes overclocked. +150 core, +500 mem = 2000 core, 4300 mem
I compiled with CUDA 9.2 and am using 397.64 driver

blitss · 2018-10-01T17:16:49Z

@Bathmat I got slightly better results with the same overclock: 461.5 H/s against 508 H/s on cn/7. About 9% lower.
I'm using CUDA 10 and 411.63 drivers.

I'm going to check it on OS X later.
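
To make such comparisons reproducible, the relative change can be computed explicitly. A minimal sketch (the helper name `percent_change` is an assumption for illustration); with the figures above, 508 H/s on cn/7 versus 461.5 H/s on cn/8 works out to a drop of roughly 9.2%:

```cpp
#include <cassert>

// Relative change of `measured` versus `baseline`, in percent.
// Negative values mean the measured hashrate is lower than the baseline.
double percent_change(double baseline, double measured) {
    assert(baseline > 0.0 && "baseline hashrate must be positive");
    return (measured - baseline) / baseline * 100.0;
}
```

For example, `percent_change(508.0, 461.5)` evaluates to approximately -9.2.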

psychocrypt · 2018-10-01T18:05:21Z

@plavirudar I fixed the uint issue. It should now build without any issues under Windows.

Spudz76 · 2018-10-01T20:12:33Z

@Bathmat Don't forget the Pascal/10xx P2-lock and be sure to disable it (this tool)

I have used some 1060 6GB dual-fan full-length PNY cards, and they have P0 and P2 clocked identically, so it doesn't matter. Blitss may have a card with a "good" BIOS like that.
I have also used some 1060 6GB single-fan short-layout MSI cards, and they have really garbage P2 clocking, so unless I unlock them they run as if they were half on power management (memory severely underclocked "for accuracy").

Since NVIDIA clocking is by offset, even if you "have the same clocking" your base clock may still be lower, and thus the real effective clock will not be the same as someone else's because of a different BIOS (base clocks). You can't simply raise your offsets by 700 to match the effective clock, because the card goes P2->P0->P8 when you exit compute applications, which will knock the card off the bus (P0 + 700 = waaaay too much, for example, like P2 + 1400).

Check with nvidia-smi while miner is running, start->run->cmd then cd to the nvidia program files (I forget if they are in x86 or not) whichever has NVSMI folder, go in there. Then you can run nvidia-smi -q while the miner is running to see what mode and actual clocks it is running (among tons of other info).

psychocrypt · 2018-10-01T21:12:41Z

I added the Intel double hash asm version from @SChernykh. (It is currently not tested under Windows; feedback is welcome.)

Bathmat · 2018-10-01T22:12:46Z

@Spudz76 Looks like the P2 vs P0 difference on my GTX 1060 isn't that drastic (200 MHz on the mem clock). I set it to P0, applied the same +150 core/+500 mem, and got 2000 core/4500 mem. This gave +20 H/s on both CNv7 and CNv8 (520 to 540 for CNv7, 460 to 480 for CNv8), so only about a 4% boost. Hopefully the extra 200 MHz won't cause stability issues, but I doubt it will. Thanks.

EDIT:

Error details:
| Count | Error text                       | Last seen           |
|    12 | NVIDIA Invalid Result GPU ID 0   | 2018-10-01 17:12:48 |

Lol, guess that extra 200 MHz does cause issues.
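
The offset behaviour discussed above can be written down as simple arithmetic: the effective clock is the base clock of the active P-state plus the user offset. A minimal sketch with hypothetical helper names; the base memory clocks of 3800 MHz (P2) and 4000 MHz (P0) are not stated anywhere in this thread but are inferred from the reported +500 offset giving 4300 and 4500 MHz:

```cpp
#include <cassert>

// Effective clock = base clock of the active P-state + user-applied offset.
int effective_clock_mhz(int base_mhz, int offset_mhz) {
    assert(base_mhz > 0 && "base clock must be positive");
    return base_mhz + offset_mhz;
}

// Inverse: recover the base clock implied by a reported effective clock.
int implied_base_mhz(int effective_mhz, int offset_mhz) {
    return effective_mhz - offset_mhz;
}
```

The same +500 offset therefore yields 4300 MHz in P2 and 4500 MHz in P0 on this card, which is why identical offsets on cards with different BIOS base clocks do not give identical effective clocks.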

Spudz76 · 2018-10-02T00:30:54Z

@Bathmat I had similar "fuzzy edge" on Hynix memory (also the MSI) whereas I can go like +500 over those on the Samsung (PNY)

Bathmat · 2018-10-02T00:33:47Z

> @Bathmat I had similar "fuzzy edge" on Hynix memory (also the MSI) whereas I can go like +500 over those on the Samsung (PNY)

Gotta love Samsung memory... although, my Samsung AMD GPUs are taking the biggest hit with CNv8. But really, CNv8 levels the playing field for all mem brands.

Spudz76 · 2018-10-02T00:46:52Z

Yeah I keep remembering, all ships sink the same with this fork, so the rate difference is somewhat negated anyway (other than mixing up which cards are "best" hash per watt a little bit, I guess)

Spudz76 · 2018-10-02T02:31:29Z

> just-in-time compilation, where cfg values are translated into defines

NVRTC does exactly what AMD OpenCL runtime compiler does, and is just nvcc wrapped in the driver runtime.

srwx666 · 2018-10-05T12:09:17Z

Version: xmr-stak 2.4.7 a6ecf8d

on config with:
monero7 -> hashes OK
monero8 -> error : Cryptonight hash self-test failed. This might be caused by bad compiler optimizations.

sometimes segfault
Oct 5 13:24:19 localhost kernel: in xmr-stak[400000+15b000]
Oct 5 13:26:22 localhost kernel: xmr-stak[46693]: segfault at 0 ip 000000000043d4a8 sp 00007f409fffeb50 error 6
Oct 5 13:26:22 localhost kernel: xmr-stak[46692]: segfault at 0 ip 000000000043d4a8 sp 00007f40a4906b50 error 6 in xmr-stak[400000+15b000]

compiled as older versions with
gcc version 7.1.1 20170526 (Red Hat 7.1.1-2) (GCC)

regards
A

srwx666 · 2018-10-05T13:56:56Z

compiling the same code with:

gcc version 6.3.1 20170216 (Red Hat 6.3.1-3) (GCC)

both v8 and v7 are hashing OK

psychocrypt added the enhancement label Sep 24, 2018

psychocrypt assigned fireice-uk Sep 24, 2018

psychocrypt requested a review from fireice-uk September 24, 2018 18:53

This was referenced Sep 24, 2018

Test upcoming Monero cryptonight_v8 #1832

Closed

Test final Monero POW cryptonight_v8 #1851

Open

psychocrypt force-pushed the topic-cn8Version2 branch from a39663c to 0fef2cf Compare September 24, 2018 19:02

psychocrypt and others added 5 commits September 30, 2018 23:10

iadd cryptonight_v8 tweak 2.2

cac26b9

add cpu implementation for the final monero POW

disbale CUDA backend for cryptonight_v8

915c868

optimize asm code cryptonight_v8

5003079

apply optimizations Co-authored-by: SChernykh <sergey.v.chernykh@gmail.com>

cuda: implement cryptonight_v8

5db405c

- introduce a new schema where two threads work together on one hash
- update autoadjustment
- remove a mistake where shared memory was shrunk for GPUs < sm_70

cpu: fix missing asm autoadjust

010cbd9

In the auto-adjust code path without hwloc, the asm entry was missing.

psychocrypt force-pushed the topic-cn8Version2 branch from 0fef2cf to 010cbd9 Compare September 30, 2018 21:19

remove using of type uint

22e63ce

`uint` is unknown on Windows, therefore switch to the better type `uint32_t`

cpu: asm double hash

25634d4

- restructure asm preparation function
- add double hash asm code

re-enable algorithm for cuda

1e5bb80

I disabled a few algorithms for faster compilation and forgot to re-enable them.

psychocrypt changed the title ~~[WIP] cryptonight_v8 version 2~~ cryptonight_v8 version 2 Oct 3, 2018

fireice-uk approved these changes Oct 3, 2018

fireice-uk merged commit 98554a0 into fireice-uk:dev Oct 3, 2018
