cryptonight_v8 version 2 #1850

Merged: 8 commits, Oct 3, 2018

@psychocrypt
Collaborator

psychocrypt commented Sep 24, 2018

This PR adds the final changes for the upcoming cryptonight_v8 to the CPU and OpenCL backends.
Mining cryptonight_v8 with the CUDA backend is currently not allowed; the CUDA backend has not been updated yet.

Performance can be reported in #1851.
Please report only code suggestions or bugs here.

  • for CUDA, the changes from Monero POW v2 #1831 are reverted and a new schema with two threads per hash is introduced
  • add asm for double hash (Intel)

todo

  • add native CUDA version
  • test double hash asm under Windows

@SChernykh
Contributor

@psychocrypt Don't leave out double hash asm optimized version. The same version can be used both for Intel and AMD, and it's much faster than C++ code on both platforms.

@Spudz76
Contributor

Spudz76 commented Sep 25, 2018

@kio3i0j9024vkoenio Try with either envvar OpenCL_ROOT=/usr/local/cuda/targets/x86_64-linux set, or if that doesn't help then full force with -DOpenCL_INCLUDE_DIR=/usr/local/cuda/targets/x86_64-linux/include -DOpenCL_LIBRARY=/usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so.1.0.0 added to your cmake command

Or use whatever paths sudo updatedb followed by locate include/CL and locate libOpenCL.so give you. Avoid the system /usr/include/CL and whatever is in /usr/lib/; those are usually Mesa/Clover unless you've purged them on purpose. Add-on drivers usually don't replace an existing OpenCL implementation's includes/libs; they just put their ICD pointer into /etc/OpenCL/vendors/.

In a default install, /usr/local/cuda is a link to whichever CUDA SDK is active, but locate does not show links. So I've replaced /usr/local/cuda-9.1/blahblah from my locate output with just /cuda/, so that it follows whichever SDK is active after an upgrade.

As far as I know you still need the CUDA SDK (generally the cuda-*-dev packages), but not the AMD APP SDK. You need it only for the headers; I think libOpenCL.so comes with the drivers (nvidia-dkms or similar packages). However, when nothing is found, the build attempts to force the APP SDK, which is where it fails (it is probably not installed). You could also just use the OpenCL headers from Khronos, which are supposed to be the new "universal" headers for everyone and everything (I have had great results with them targeting both AMD and NVIDIA libraries). Grab a ZIP from the download, unpack it, and use the CL dir to replace /usr/include/CL; then your system will have the right headers for everything (Debian/Ubuntu will probably just wrap those headers in upcoming opencl-headers packaging; currently they pack some braindead old Mesa OpenCL headers that don't work).
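
A minimal sketch of the lookup described above, for anyone who would rather script it than read locate output by hand. The helper names `find_opencl` and `cmake_flags` are hypothetical, and the `include/CL/cl.h` plus `lib/libOpenCL.so*` layout is an assumption based on the paths quoted in this comment; only the two `-DOpenCL_...` options come from the thread.

```python
from pathlib import Path


def find_opencl(roots):
    """Return (include_dir, library) for the first root that has both, else None."""
    for root in map(Path, roots):
        include_dir = root / "include"
        lib_dir = root / "lib"
        header = include_dir / "CL" / "cl.h"
        if not header.is_file() or not lib_dir.is_dir():
            continue
        # Accept any versioned library name, e.g. libOpenCL.so.1.0.0
        libs = sorted(lib_dir.glob("libOpenCL.so*"))
        if libs:
            return include_dir, libs[0]
    return None


def cmake_flags(include_dir, library):
    """Build the two cmake cache options quoted in the comment above."""
    return [
        f"-DOpenCL_INCLUDE_DIR={include_dir}",
        f"-DOpenCL_LIBRARY={library}",
    ]


if __name__ == "__main__":
    # The candidate root is an assumption; use whatever `locate` reports.
    found = find_opencl(["/usr/local/cuda/targets/x86_64-linux"])
    if found:
        print("cmake ..", *cmake_flags(*found))
    else:
        print("no OpenCL headers/library found under the given roots")
```

Passing only vendor SDK roots, and leaving out /usr and /usr/lib, keeps the Mesa/Clover headers mentioned above out of the result.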

@kio3i0j9024vkoenio

cmake .. -DOpenCL_INCLUDE_DIR=/usr/local/cuda/targets/x86_64-linux/include -DOpenCL_LIBRARY=/usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so.1.0.0

Fixed the problem.

Thanks

@kio3i0j9024vkoenio

kio3i0j9024vkoenio commented Sep 25, 2018

> @psychocrypt Don't leave out double hash asm optimized version. The same version can be used both for Intel and AMD, and it's much faster than C++ code on both platforms.

I second this request. I am only getting:

cryptonight_v8: 1293 H/s using XMR-Stak 1850/1851 latest version

whereas I am getting 1525 H/s using SChernykh XMR-Stak-CPU latest code with all the same v8 changes and the optimized asm for 1x and 2x threads.

#1851 (comment)

@psychocrypt
Collaborator Author

psychocrypt commented Sep 25, 2018 via email

@Spudz76
Contributor

Spudz76 commented Sep 26, 2018

@kio3i0j9024vkoenio The "double hash asm" is only for CPUs, it won't help your OpenCL hashrates at all.

@kio3i0j9024vkoenio

> @kio3i0j9024vkoenio The "double hash asm" is only for CPUs, it won't help your OpenCL hashrates at all.

Yes I know that.

@psychocrypt
Collaborator Author

Update: I am still implementing the native NVIDIA backend. Everything is taking longer than expected because I may have found a bug in the NVIDIA compiler nvcc.

@kio3i0j9024vkoenio

Has the optimized double hash asm for CPU been added?

@psychocrypt
Collaborator Author

psychocrypt commented Sep 29, 2018 via email

@SChernykh
Contributor

@psychocrypt Is NVIDIA backend issue something I can help with?

@psychocrypt
Collaborator Author

psychocrypt commented Sep 29, 2018 via email

@SChernykh
Contributor

Is it faster than current OpenCL version? If not, maybe it's better to port OpenCL instead?

@psychocrypt
Collaborator Author

psychocrypt commented Sep 29, 2018 via email

@SChernykh
Contributor

@psychocrypt There is always a way to make CUDA at least as fast as OpenCL: create a separate kernel for CNv2, then look at the generated PTX assembly and fix all the differences.

@kio3i0j9024vkoenio

kio3i0j9024vkoenio commented Sep 29, 2018

> Is it faster than current OpenCL version? If not, maybe it's better to port OpenCL instead?

I certainly hope the CUDA version will be faster than the OpenCL version: the OpenCL v8 version performs terribly on NVIDIA GTX 750/750 Ti cards, reaching only 73.4% of the v7 hashrate.

#1851 (comment)

@psychocrypt
Collaborator Author

@SChernykh Bad news: I cannot push the CUDA code today. I now get invalid results on my GTX 1080. I think I know where they come from, but I need to test whether I am right, which takes some time.

psychocrypt and others added 5 commits September 30, 2018 23:10
add cpu implementation for the final monero POW
apply optimizations

Co-authored-by: SChernykh <sergey.v.chernykh@gmail.com>
- introduce a new schema where two threads work together on one hash
- update autoadjustment
- remove a mistake where shared memory was shrunk for gpus < sm_70
In the auto adjust without hwlock the asm entry was missing
@psychocrypt
Collaborator Author

@SChernykh I have now added the CUDA code. On my GTX 1080 it is a few H/s faster than the OpenCL version.
I still need to check some performance-critical parts of the CUDA code, but for now it should be OK.

@Bathmat

Bathmat commented Sep 30, 2018

Issue building:

"C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\install.vcxproj" (default target) (1) ->
"C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\ALL_BUILD.vcxproj" (default target) (3) ->
"C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\xmrstak_cuda_backend.vcxproj" (default target) (8) ->
(CustomBuild target) ->
  C:/xmr-stak/xmr-stak-topic-cn8Version2/xmrstak/backend/nvidia/nvcc_code/cuda_core.cu(213): error : identifier "uint" is undefined [C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\xmrstak_cuda_backend.vcxproj]
  C:/xmr-stak/xmr-stak-topic-cn8Version2/xmrstak/backend/nvidia/nvcc_code/cuda_core.cu(213): error : identifier "uint" is undefined [C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\xmrstak_cuda_backend.vcxproj]

EDIT: FYI, using CUDA9.2. Should I revert to 9.1?

@plavirudar

> Issue building:
>
> "C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\install.vcxproj" (default target) (1) ->
> "C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\ALL_BUILD.vcxproj" (default target) (3) ->
> "C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\xmrstak_cuda_backend.vcxproj" (default target) (8) ->
> (CustomBuild target) ->
>   C:/xmr-stak/xmr-stak-topic-cn8Version2/xmrstak/backend/nvidia/nvcc_code/cuda_core.cu(213): error : identifier "uint" is undefined [C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\xmrstak_cuda_backend.vcxproj]
>   C:/xmr-stak/xmr-stak-topic-cn8Version2/xmrstak/backend/nvidia/nvcc_code/cuda_core.cu(213): error : identifier "uint" is undefined [C:\xmr-stak\xmr-stak-topic-cn8Version2\build-cuda\xmrstak_cuda_backend.vcxproj]
>
> EDIT: FYI, using CUDA9.2. Should I revert to 9.1?

Getting same error, where in the code do you specify the CUDA version?

@plavirudar

@Bathmat Line 213 of cuda_core.cu doesn't compile when uint is used on certain platforms (Windows?); changing it to unsigned int makes it compile, as per nitishsrivastava/deepnet#57

@Bathmat

Bathmat commented Oct 1, 2018

> @Bathmat Line 213 of cuda_core.cu doesn't compile when uint is used on certain platforms (Windows?); changing it to unsigned int makes it compile, as per nitishsrivastava/deepnet#57

Cool, that fixed it. Thanks @plavirudar. Yes, compiling on Win10 v1803. Will test and publish results a little later (hopefully tonight).

@blitss
Contributor

blitss commented Oct 1, 2018

@psychocrypt Cool, but it still does worse compared to cn/7 (my previous tests: #1832 (comment)) and only a bit better than the OpenCL kernel.

I'm testing on Windows with @plavirudar's recommended fix.

PS C:\Users\Andrey\Documents\xmr-stak-cn8\xmr-stak\build\bin\Release> ./xmr-stak --currency cryptonight_v8 --benchmark 8 --benchwait 20 --benchwork 30 --noCPU --noAMD
-------------------------------------------------------------------
xmr-stak 2.4.7 010cbd9

Brought to you by fireice_uk and psychocrypt under GPLv3.
Based on CPU mining code by wolf9466 (heavily optimized by fireice_uk).
Based on NVIDIA mining code by KlausT and psychocrypt.
Based on OpenCL mining code by wolf9466.

Configurable dev donation level is set to 2.0%

You can use following keys to display reports:
'h' - hashrate
'r' - results
'c' - connection
-------------------------------------------------------------------
[2018-10-01 20:50:32] : Mining coin: cryptonight_v8
!!!! Doing only a benchmark and exiting. To mine, remove the '--benchmark' option. !!!!
[2018-10-01 20:50:32] : Prepare benchmark for block version 8
[2018-10-01 20:50:32] : Starting NVIDIA GPU thread 0, no affinity.
CUDA [10.0/10.0] GPU#0, device architecture 61: "GeForce GTX 1060 6GB"... device init succeeded
[2018-10-01 20:50:32] : Wait 20 sec until all backends are initialized
[2018-10-01 20:50:52] : Start a 30 second benchmark...
[2018-10-01 20:51:22] : Benchmark Thread 0 nvidia: 393.5 H/S
[2018-10-01 20:51:22] : Benchmark Total: 393.5 H/S
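
When comparing several runs like the one above, it can help to pull the hashrates out of the log automatically. The following is a sketch only: the regular expression is modelled on the two `Benchmark ...` lines in this log and is an assumption about the format, which other xmr-stak versions may not share. `parse_benchmark` is a hypothetical helper name.

```python
import re

# Matches "Benchmark Thread <n> <backend>: <rate> H/S" and "Benchmark Total: <rate> H/S".
BENCH_LINE = re.compile(
    r"Benchmark (?:Thread (?P<thread>\d+) (?P<backend>\w+)|Total): (?P<rate>[\d.]+) H/S"
)


def parse_benchmark(log_text):
    """Return ({(backend, thread): rate}, total) parsed from benchmark output."""
    per_thread = {}
    total = None
    for match in BENCH_LINE.finditer(log_text):
        rate = float(match["rate"])
        if match["thread"] is None:
            total = rate  # the "Benchmark Total" line
        else:
            per_thread[(match["backend"], int(match["thread"]))] = rate
    return per_thread, total


sample = """\
[2018-10-01 20:51:22] : Benchmark Thread 0 nvidia: 393.5 H/S
[2018-10-01 20:51:22] : Benchmark Total: 393.5 H/S
"""
per_thread, total = parse_benchmark(sample)
print(per_thread, total)  # {('nvidia', 0): 393.5} 393.5
```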

@Bathmat

Bathmat commented Oct 1, 2018

@blitss here are my tests with my GTX-1060. CNv8 is slower than v7 by design, but nvidia does appear to have a bigger drop in hashrate compared to CPU or AMD.
CUDA test 1
CUDA test 2

@blitss
Contributor

blitss commented Oct 1, 2018

@Bathmat Well, I tried adjusting threads but it didn't work for me. Setting threads to 30 gave 380 H/s.

@blitss
Contributor

blitss commented Oct 1, 2018

@Bathmat Is your card overclocked? Did you build with CUDA 9 or 10? I have exactly the same card but not the same results.
I used the same config and got only 408.8 H/s
image

@Bathmat

Bathmat commented Oct 1, 2018

@blitss yes overclocked. +150 core, +500 mem = 2000 core, 4300 mem
I compiled with CUDA 9.2 and am using 397.64 driver

@blitss
Contributor

blitss commented Oct 1, 2018

@Bathmat I got somewhat better results with the same overclock: 461.5 H/s against 508 H/s on cn/7, about 9% lower.
I'm using CUDA 10 and 411.63 drivers.

I'm going to check it on OS X later.
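
The relative figures quoted in this thread can be checked with a one-line formula. This is a small sketch with a hypothetical helper name, `percent_change`; the hashrates are the ones posted in the comments above.

```python
def percent_change(reference, measured):
    """Signed change of `measured` relative to `reference`, in percent."""
    if reference <= 0:
        raise ValueError("reference hashrate must be positive")
    return (measured - reference) / reference * 100.0


# cn/7 (508 H/s) vs. cryptonight_v8 (461.5 H/s) on the overclocked GTX 1060
print(f"{percent_change(508.0, 461.5):+.1f}%")    # -9.2%
# XMR-Stak C++ path (1293 H/s) vs. the asm-optimized build (1525 H/s)
print(f"{percent_change(1293.0, 1525.0):+.1f}%")  # +17.9%
```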

`uint` is unknown in windows, therefore switch to the better type `uint32_t`
@psychocrypt
Collaborator Author

@plavirudar I fixed the uint issue. It should now build without any issues under Windows.

- restructure asm preparation function
- add double hash asm code
@Spudz76
Contributor

Spudz76 commented Oct 1, 2018

@Bathmat Don't forget the Pascal/10xx P2-lock and be sure to disable it (this tool)

I have used some full-length dual-fan PNY 1060 6GB cards, and they have P0 and P2 clocked identically, so it doesn't matter there. Blitss may have a card with a "good" BIOS like that.
I have also used some short-layout single-fan MSI 1060 6GB cards, and they have really garbage P2 clocks, so unless I unlock them they run as if half on power management (memory severely underclocked "for accuracy").

Since NVIDIA clocking is by offset, even if you "have the same clocking" your base clock may still be lower, and thus the real effective clock will not be the same as someone else's due to a different BIOS (base clocks). You can't simply raise your offsets by 700 to match the effective clock, because the card goes P2->P0->P8 when you exit compute applications, which will nuke the card off the bus (P0 + 700 = waaaay too much, for example, like P2 + 1400).

Check with nvidia-smi while miner is running, start->run->cmd then cd to the nvidia program files (I forget if they are in x86 or not) whichever has NVSMI folder, go in there. Then you can run nvidia-smi -q while the miner is running to see what mode and actual clocks it is running (among tons of other info).
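
The offset argument in this comment can be made concrete with a toy model. Everything numeric below is hypothetical: the base clocks are invented for illustration and do not come from the thread, and the model simply assumes that one offset is added to the base clock of every performance state.

```python
# Hypothetical memory base clocks in MHz per performance state.
# Real values depend on the card's BIOS and are NOT taken from the thread.
CARD_A = {"P0": 4000, "P2": 4000}  # P0 and P2 clocked identically
CARD_B = {"P0": 4000, "P2": 3300}  # P2 sits 700 MHz below P0


def effective_clocks(base_clocks, offset):
    """Apply a single offset to every performance state (simplified model)."""
    return {state: base + offset for state, base in base_clocks.items()}


offset = 500
a = effective_clocks(CARD_A, offset)
b = effective_clocks(CARD_B, offset)
print(a["P2"], b["P2"])  # same offset, different effective P2 clock

# Raising card B's offset by the 700 MHz gap matches card A in P2 ...
b_matched = effective_clocks(CARD_B, offset + 700)
print(b_matched["P2"] == a["P2"])
# ... but the same offset also applies in P0, where it overshoots by 700 MHz.
print(b_matched["P0"] - a["P0"])
```

Under this model, matching someone else's P2 clock by offset alone leaves the card far too high once it drops back to P0, which is the failure the comment warns about.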

I disabled a few algorithms for faster compile and forgot to re-enable them.
@psychocrypt
Collaborator Author

I added the Intel double hash asm version from @SChernykh (currently not tested under Windows; feedback is welcome).

@Bathmat

Bathmat commented Oct 1, 2018

@Spudz76 Looks like the P2 vs P0 difference on my GTX-1060 isn't that drastic (200 MHz on the mem clock). I set it to P0, applied the same +150 core/+500 mem, and got 2000 core/4500 mem. This gave +20 H/s on both CNv7 and CNv8 (520 to 540 for CNv7, 460 to 480 for CNv8), so only about a 4% boost. Hopefully the extra 200 MHz won't cause stability issues, but I doubt it. Thanks.

EDIT:

Error details:
| Count | Error text                       | Last seen           |
|    12 | NVIDIA Invalid Result GPU ID 0   | 2018-10-01 17:12:48 |

Lol, guess that extra 200 MHz does cause issues.

@Spudz76
Contributor

Spudz76 commented Oct 2, 2018

@Bathmat I had similar "fuzzy edge" on Hynix memory (also the MSI) whereas I can go like +500 over those on the Samsung (PNY)

@Bathmat

Bathmat commented Oct 2, 2018

> @Bathmat I had similar "fuzzy edge" on Hynix memory (also the MSI) whereas I can go like +500 over those on the Samsung (PNY)

Gotta love Samsung memory... although, my Samsung AMD GPUs are taking the biggest hit with CNv8. But really, CNv8 levels the playing field for all mem brands.

@Spudz76
Contributor

Spudz76 commented Oct 2, 2018

Yeah I keep remembering, all ships sink the same with this fork, so the rate difference is somewhat negated anyway (other than mixing up which cards are "best" hash per watt a little bit, I guess)

@Spudz76
Contributor

Spudz76 commented Oct 2, 2018

> just in time compilation where cfg values are translated into defines

NVRTC does exactly what AMD OpenCL runtime compiler does, and is just nvcc wrapped in the driver runtime.
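
To illustrate what "cfg values are translated into defines" can look like in practice, here is a sketch that turns a config mapping into `-DNAME=value` options of the kind passed to a runtime compiler. The config keys and the helper name `config_to_defines` are hypothetical and chosen only for illustration; this is not xmr-stak's actual code.

```python
def config_to_defines(cfg):
    """Translate config values into -D options for a runtime kernel compiler."""
    options = []
    for name, value in sorted(cfg.items()):
        if isinstance(value, bool):
            value = int(value)  # booleans become 0/1 macros
        options.append(f"-D{name}={value}")
    return options


# Hypothetical config keys, used only for illustration.
cfg = {"THREADS": 30, "BLOCKS": 9, "ALGO": 8, "USE_FAST_PATH": True}
print(config_to_defines(cfg))
# ['-DALGO=8', '-DBLOCKS=9', '-DTHREADS=30', '-DUSE_FAST_PATH=1']
```

Because the values become compile-time constants, the kernel compiler can specialize the generated code for one configuration instead of reading the settings at run time.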

@psychocrypt psychocrypt changed the title [WIP] cryptonight_v8 version 2 cryptonight_v8 version 2 Oct 3, 2018
@fireice-uk fireice-uk merged commit 98554a0 into fireice-uk:dev Oct 3, 2018
@srwx666

srwx666 commented Oct 5, 2018

Version: xmr-stak 2.4.7 a6ecf8d

on config with:
monero7 -> hashes OK
monero8 -> error : Cryptonight hash self-test failed. This might be caused by bad compiler optimizations.

sometimes segfault
Oct 5 13:24:19 localhost kernel: in xmr-stak[400000+15b000]
Oct 5 13:26:22 localhost kernel: xmr-stak[46693]: segfault at 0 ip 000000000043d4a8 sp 00007f409fffeb50 error 6
Oct 5 13:26:22 localhost kernel: xmr-stak[46692]: segfault at 0 ip 000000000043d4a8 sp 00007f40a4906b50 error 6 in xmr-stak[400000+15b000]

compiled as older versions with
gcc version 7.1.1 20170526 (Red Hat 7.1.1-2) (GCC)

regards
A

@srwx666

srwx666 commented Oct 5, 2018

compiling the same code with:

gcc version 6.3.1 20170216 (Red Hat 6.3.1-3) (GCC)

both v8 and v7 are hashing OK
