It's based on the old xmr-stak-cpu repo. Only the "benchmark_mode config.txt" command line is supported.
Cryptonight is memory-intensive in terms of memory latency, but not bandwidth. Modern CPUs use 64-byte wide cache lines, but Cryptonight only loads/stores 16 bytes at a time, so 75% of the available CPU cache bandwidth is wasted. ASICs are optimized for these 16-byte-wide memory accesses, so they always use 100% of whatever memory bandwidth they have.
The idea is to do something computationally light and safe with the other 48 bytes of the same cache line (which is loaded into L1 cache anyway) on each step. The shuffle modification, as its name suggests, treats these 48 bytes as three 16-byte elements, shuffles them, and performs six 64-bit integer additions on them to make sure an ASIC can't do it via simple rewiring.
The actual shuffle-and-add logic has 4 different cases depending on where in the 64-byte line the AES round is performed. Here are these 4 cases, with variable names matching the code. The "+" operation on a 16-byte chunk means it's treated as 2 separate 64-bit unsigned integers.
- | Bytes 0-15 | Bytes 16-31 | Bytes 32-47 | Bytes 48-63 |
---|---|---|---|---|
Before | AES input | chunk1 | chunk2 | chunk3 |
After | AES output | chunk3+b1 | chunk1+b | chunk2+a |
- | Bytes 0-15 | Bytes 16-31 | Bytes 32-47 | Bytes 48-63 |
---|---|---|---|---|
Before | chunk1 | AES input | chunk3 | chunk2 |
After | chunk3+b1 | AES output | chunk2+a | chunk1+b |
- | Bytes 0-15 | Bytes 16-31 | Bytes 32-47 | Bytes 48-63 |
---|---|---|---|---|
Before | chunk2 | chunk3 | AES input | chunk1 |
After | chunk1+b | chunk2+a | AES output | chunk3+b1 |
- | Bytes 0-15 | Bytes 16-31 | Bytes 32-47 | Bytes 48-63 |
---|---|---|---|---|
Before | chunk3 | chunk2 | chunk1 | AES input |
After | chunk2+a | chunk1+b | chunk3+b1 | AES output |
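The first case from the tables above (AES round on bytes 0-15) can be sketched in Python. This is only an illustration of the shuffle-and-add step, not the reference implementation; the function names `add16` and `shuffle_case0` are mine, and `a`, `b`, `b1` stand for the 16-byte state values referenced in the tables.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF

def add16(chunk, v):
    """'+' from the tables: treat two 16-byte values as pairs of
    64-bit unsigned integers and add them with 64-bit wraparound."""
    c0, c1 = struct.unpack('<QQ', chunk)
    v0, v1 = struct.unpack('<QQ', v)
    return struct.pack('<QQ', (c0 + v0) & MASK64, (c1 + v1) & MASK64)

def shuffle_case0(line, a, b, b1):
    """Case 1: the AES round operates on bytes 0-15 of the 64-byte line;
    chunk1/chunk2/chunk3 occupy bytes 16-31, 32-47, 48-63 before the shuffle."""
    chunk1, chunk2, chunk3 = line[16:32], line[32:48], line[48:64]
    # After: bytes 16-31 = chunk3+b1, bytes 32-47 = chunk1+b, bytes 48-63 = chunk2+a
    return line[0:16] + add16(chunk3, b1) + add16(chunk1, b) + add16(chunk2, a)
```

The other three cases only permute which byte range holds the AES input and which chunk lands where, exactly as listed in the tables.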
The shuffle modification makes Cryptonight 4 times more demanding of memory bandwidth, making an ASIC/FPGA 4 times slower*. At the same time, CPU/GPU performance stays almost the same because this bandwidth is already there - it's just not used yet. The shuffle can also be done in parallel with the existing Cryptonight calculations.
* The 4 times slowdown applies only to devices (ASIC/FPGA) that use external memory for storing the scratchpad and saturate this memory's bandwidth. Devices that use on-chip memory have no problems with bandwidth, but they'll still have to do 4 times more memory reads/writes, so they'll also become somewhat slower.
The INT_MATH modification adds one 64:32-bit integer division and one 64-bit integer square root per iteration.
The integer division is defined as follows:
Divisor(bits 0-31) = (AES_ROUND_OUTPUT(bits 0-31) + sqrt_result * 2) | 0x80000001

where sqrt_result is the 32-bit unsigned integer from the previous iteration. Divisor being dependent on sqrt_result creates a dependency chain and ensures that two different iterations can't be run in parallel.

Dividend(bits 0-63) = AES_ROUND_OUTPUT(bits 64-127)
Quotient(bits 0-31 of division_result) = (Dividend / Divisor)(bits 0-31)
Remainder(bits 32-63 of division_result) = (Dividend % Divisor)(bits 0-31)
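The division step defined above can be sketched in Python. This is a sketch, not the reference implementation: `AES_ROUND_OUTPUT` is modelled as a single 128-bit integer and the function name `int_math_division` is my own. Note that the `| 0x80000001` forces the divisor to be odd and at least 2^31, so the division is always well-defined and the quotient always fits in 32 bits.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def int_math_division(aes_round_output, sqrt_result):
    """aes_round_output: 128-bit integer; sqrt_result: the 32-bit
    result from the previous iteration. Returns the 64-bit division_result."""
    # Divisor(bits 0-31) = (AES_ROUND_OUTPUT(bits 0-31) + sqrt_result * 2) | 0x80000001
    divisor = (((aes_round_output & MASK32) + sqrt_result * 2) | 0x80000001) & MASK32
    # Dividend(bits 0-63) = AES_ROUND_OUTPUT(bits 64-127)
    dividend = (aes_round_output >> 64) & MASK64
    quotient = (dividend // divisor) & MASK32   # bits 0-31 of division_result
    remainder = (dividend % divisor) & MASK32   # bits 32-63 of division_result
    return quotient | (remainder << 32)
```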
The square root is defined as follows:

sqrt_input(bits 0-63) = AES_ROUND_OUTPUT(bits 0-63) + division_result(bits 0-63)

Sqrt input being dependent on division_result creates a dependency chain and ensures that the square root can't be run in parallel with the division.

sqrt_result(bits 0-31) = Integer part of "sqrt(2^64 + sqrt_input) * 2 - 2^33"
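The square root step can be computed with exact integer arithmetic, using the identity floor(2·sqrt(x)) = isqrt(4·x). Again a sketch with my own naming, not the repo's code:

```python
import math

MASK64 = 0xFFFFFFFFFFFFFFFF

def int_math_sqrt(aes_round_output, division_result):
    """Returns the 32-bit sqrt_result for the next iteration."""
    # sqrt_input(bits 0-63) = AES_ROUND_OUTPUT(bits 0-63) + division_result(bits 0-63)
    sqrt_input = ((aes_round_output & MASK64) + division_result) & MASK64
    # Integer part of sqrt(2^64 + sqrt_input) * 2 - 2^33, computed exactly:
    # floor(2 * sqrt(x)) == isqrt(4 * x)
    return math.isqrt(4 * ((1 << 64) + sqrt_input)) - (1 << 33)
```

For the whole 64-bit range of sqrt_input the result stays within 32 bits, matching the definition of sqrt_result above.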
Both division_result and sqrt_result are blended into the main loop of Cryptonight as follows:

- Data from the second memory read is taken - it's 16 bytes, but only the first 8 bytes are changed
- Bits 0-63 are XOR'ed with bits 0-63 of division_result
- Bits 32-63 are XOR'ed with bits 0-31 of sqrt_result

Changing bits 0-63 of the second memory read's data guarantees that every bit of division_result and sqrt_result influences all bits of the address for the next memory read and is also stored to the scratchpad - see the code of the original Cryptonight algorithm. An ASIC can't skip calculating div+sqrt and still continue the main loop.
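The blending step above can be sketched as follows (the 16-byte second-read value is modelled as a 128-bit integer; the function name is mine):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def blend(second_read_data, division_result, sqrt_result):
    """XOR division_result (64 bits) and sqrt_result (32 bits) into the
    low 8 bytes of the 16-byte value from the second memory read."""
    low = second_read_data & MASK64
    # Bits 0-63 are XOR'ed with division_result;
    # bits 32-63 are additionally XOR'ed with sqrt_result
    low ^= division_result ^ (sqrt_result << 32)
    # The high 8 bytes stay unchanged
    return (second_read_data & (MASK64 << 64)) | low
```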
Adding integer division and integer square roots to the main loop ramps up the complexity of an ASIC/FPGA and the silicon area needed to implement it, so they'll be much less efficient with the same transistor budget and/or power consumption. The most common hardware implementations of division and square roots require many clock cycles of latency: at least 8 cycles for a 64:32-bit division and the same 8 cycles for a 64-bit square root. These latencies are achieved by the best and fastest hardware implementations I could find. And the way this modification is made ensures that the division and square root from the same main loop iteration can't be done in parallel, so their latencies add up, making it a staggering 16 cycles per iteration in the best case, compared to 1 cycle per iteration in the original Cryptonight. Even though this latency can be hidden in pipelined implementations, it will require A LOT of logic elements/silicon area. Hiding the latency will also require many parallel scratchpads, which will be a strong limiting factor for hardware with on-chip memory: it just doesn't have enough memory to hide the latency entirely. My rough estimates show a ~15x slowdown in this case.
The good news for CPUs and GPUs is that division and square roots can be added to the main loop in such a way that their latency is completely hidden, so again there is almost no slowdown.
Update 2018-09-13: my latest test shows that AMD Radeon RX 560 can do 98.5% of its current Cryptonight hashrate with strided_index=2, mem_chunk=2, so all GPU numbers below are now obsolete. Hopefully, all GPUs will be able to retain at least 95% of their current hashrate.
Overall, it seems that all Radeon cards will end up in the 85-90+% range of their CryptonightV1 performance after the community has had more time to get familiar with these mods and come up with tuning manuals. All GeForce cards will be in the 90-95+% range of their CryptonightV1 performance, so their relative performance will improve by a few percent compared to Radeons. And there is always some room for further performance improvement in the mining software itself, so the overall network hashrate will only drop 5-10%.
On the other hand, ASICs/FPGAs which use external memory for the scratchpad will get 4 times slower due to increased bandwidth usage. ASICs/FPGAs which use on-chip memory for the scratchpad will get ~15 times slower because of the high latencies introduced by the division and square root calculations: they just don't have enough on-chip memory to hide these latencies with many parallel Cryptonight calculations.
AMD Ryzen 7 1700 @ 3.6 GHz, 8 threads
Mod | Hashrate | Performance level |
---|---|---|
- | 600.8 H/s | 100.0% |
INT_MATH | 588.0 H/s | 97.9% |
SHUFFLE | 586.6 H/s | 97.6% |
Both mods | 572.0 H/s | 95.2% |
AMD Ryzen 5 2600 @ 4.0 GHz, 1 thread
Mod | Hashrate | Performance level |
---|---|---|
- | 97.0 H/s | 100.0% |
INT_MATH | 91.7 H/s | 94.5% |
SHUFFLE | 94.6 H/s | 97.5% |
Both mods | 91.3 H/s | 94.1% |
Both mods (PGO build) | 93.5 H/s | 96.4% |
Both mods (ASM optimized) | 94.8 H/s | 97.7% |
AMD Ryzen 5 2600 @ 4.0 GHz, 8 threads (affinity 0,2,4,5,6,8,10,11)
Mod | Hashrate | Performance level |
---|---|---|
- | 657.6 H/s | 100.0% |
INT_MATH | 613.3 H/s | 93.3% |
SHUFFLE | 647.0 H/s | 98.4% |
Both mods | 612.3 H/s | 93.1% |
Both mods (PGO build) | 622.4 H/s | 94.6% |
Both mods (ASM optimized) | 636.0 H/s | 96.7% |
Intel Pentium G5400 (Coffee Lake, 2 cores, 4 MB Cache, 3.70 GHz), 2 threads
Mod | Hashrate | Performance level |
---|---|---|
- | 146.5 H/s | 100.0% |
INT_MATH | 141.0 H/s | 96.2% |
SHUFFLE | 145.3 H/s | 99.2% |
Both mods | 140.5 H/s | 95.9% |
Intel Core i5 3210M (Ivy Bridge, 2 cores, 3 MB Cache, 2.80 GHz), 1 thread
Mod | Hashrate | Performance level |
---|---|---|
- | 72.7 H/s | 100.0% |
INT_MATH | 66.3 H/s | 91.2% |
SHUFFLE | 71.1 H/s | 97.8% |
Both mods | 66.3 H/s | 91.2% |
Both mods (PGO build) | 66.3 H/s | 91.2% |
Both mods (ASM optimized) | 69.6 H/s | 95.7% |
Intel Core i7 2600K (Sandy Bridge, 4 cores, 8 MB Cache, 3.40 GHz), 1 thread
Mod | Hashrate | Performance level |
---|---|---|
- | 85.6 H/s | 100.0% |
Both mods | 70.6 H/s | 82.5% |
Both mods (PGO build) | 76.5 H/s | 89.4% |
Both mods (ASM optimized) | 79.2 H/s | 92.5% |
Intel Core i7 7820X (Skylake-X, 8 cores, 11 MB Cache, 3.60 GHz), 1 thread
Mod | Hashrate | Performance level |
---|---|---|
- | 68.3 H/s | 100.0% |
INT_MATH | 65.9 H/s | 96.5% |
SHUFFLE | 67.3 H/s | 98.5% |
Both mods | 65.0 H/s | 95.2% |
Quad Xeon E7-8837 HP DL580 G7 Server, 16 single + 16 double threads
Mod | Hashrate | Performance level |
---|---|---|
- | 1654.2 H/s | 100.0% |
INT_MATH | 1586.3 H/s | 95.9% |
SHUFFLE | 1606.8 H/s | 97.1% |
Both mods | 1554.4 H/s | 94.0% |
GPU performance was tested using the code from this repository: https://github.com/SChernykh/xmr-stak-amd
The XMR-STAK version used is old, so don't expect the same numbers you get on your mining rigs. What's important here are the relative numbers of the original and modified Cryptonight versions.
Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:
Mod | Hashrate | Performance level |
---|---|---|
- | 477.1 H/s | 100.0% |
INT_MATH | 448.4 H/s | 94.0% |
SHUFFLE | 457.6 H/s | 95.9% |
Both mods | 447.0 H/s | 93.7% |
Both mods strided* | 469.8 H/s | 98.5% |
* strided_index = 2, mem_chunk = 2 (64 bytes) - will be available in xmrig-amd
GeForce GTX 1080 Ti 11 GB on Windows 10: core 2000 MHz, memory 11800 MHz, monitor plugged in, intensity 1280, worksize 8:
Mod | Hashrate | Performance level |
---|---|---|
- | 908.4 H/s | 100.0% |
INT_MATH | 902.7 H/s | 99.4% |
SHUFFLE | 848.6 H/s | 93.4% |
Both mods | 846.7 H/s | 93.2% |
GeForce GTX 1060 6 GB on Windows 10: all stock, monitor plugged in, intensity 800, worksize 8:
Mod | Hashrate | Performance level |
---|---|---|
- | 453.6 H/s | 100.0% |
INT_MATH | 452.2 H/s | 99.7% |
SHUFFLE | 422.6 H/s | 93.2% |
Both mods | 421.5 H/s | 92.9% |
GeForce GTX 1050 2 GB on Windows 10: core 1721 MHz, memory 1877 MHz, monitor unplugged, intensity 448, worksize 8:
Mod | Hashrate | Performance level |
---|---|---|
- | 319.9 H/s | 100.0% |
INT_MATH | 318.1 H/s | 99.4% |
SHUFFLE | 292.5 H/s | 91.4% |
Both mods | 291.0 H/s | 91.0% |
Results from @mobilepolice
XFX RX560D 14CU, 1150 MHz core, 1750 MHz memory https://pastebin.com/HC1TchsL
Best Result:
threads | intensity | worksize | unroll | math | shuffle | hashrate | performance |
---|---|---|---|---|---|---|---|
2 | 448 | 8 | 4 | FALSE | TRUE | 503.6 | 100.0% |
Best Result with Mods:
threads | intensity | worksize | unroll | math | shuffle | hashrate | performance |
---|---|---|---|---|---|---|---|
2 | 448 | 8 | 8 | TRUE | TRUE | 447.3 | 88.8% |
2 | 448 | 32 | 2 | TRUE | TRUE | 446.5 | 88.7% |
XFX RX560 16CU, 1150 MHz core, 1750 MHz memory https://pastebin.com/pfajCwkC
Best Result:
threads | intensity | worksize | unroll | math | shuffle | hashrate | performance |
---|---|---|---|---|---|---|---|
2 | 448 | 4 | 2 | FALSE | FALSE | 530.8 | 100.0% |
Best Result with Mods:
threads | intensity | worksize | unroll | math | shuffle | hashrate | performance |
---|---|---|---|---|---|---|---|
2 | 448 | 8 | 8 | TRUE | TRUE | 454.7 | 85.7% |
Sapphire RX550 8CU, 1325 MHz core, 1760 MHz memory https://pastebin.com/NZbqLcV4
Best Result:
threads | intensity | worksize | unroll | math | shuffle | hashrate | performance |
---|---|---|---|---|---|---|---|
2 | 448 | 16 | 8 | FALSE | TRUE | 496.1 | 100.0% |
Best Result with Mods:
threads | intensity | worksize | unroll | math | shuffle | hashrate | performance |
---|---|---|---|---|---|---|---|
2 | 448 | 16 | 4 | TRUE | TRUE | 444.1 | 89.5% |
Results from @MoneroCrusher
RX 550 Gigabyte, 8 CU, 2200 mckl, 1250 sckl, 1 click PBE timings, Ubuntu 16.04 LTS Server, same settings as CN7 (low electricity consumption, ~40-45W at the wall). Best results (full results: https://pastebin.com/Nr2N139a):
Mod | Threads/Intensity | Worksize | Unroll | Hashrate | Performance Level |
---|---|---|---|---|---|
Reference (latest xmr-stak) | 2/432 | 8 | - | 525 H/s | - |
No Mod | 2/432 | 8 | - | 507.8 H/s | 100% |
SHUFFLE | 2/448 | 32 | 1 | 459.7 H/s | 90.5% |
SHUFFLE | 2/448 | 32 | 8 | 459.7 H/s | 90.5% |
INT_MATH | 2/448 | 16 | 8 | 464 H/s | 91.4% |
Both mods | 2/448 | 32 | 1 | 417.2 H/s | 82.2% |
Both mods | 2/448 | 32 | 8 | 415.2 H/s | 81.8% |
Here I have to note that 432 intensity works better for no mods while 448 works better for mods (about 3% better).
RX Vega 56, 56 CU, 915mckl, 1590sckl, stock BIOS, Windows 10, 18.5.2 Drivers, Best Results (full results: https://pastebin.com/kz0fYhrr)
Mod | Threads/Intensity | Worksize | Unroll | Hashrate | Performance Level |
---|---|---|---|---|---|
Reference (latest xmr-stak) | 2/1800 | 8 | - | 1865 H/s | - |
No Mod | 2/1800 | 16 | - | 1869 H/s | 100% |
SHUFFLE | 2/1800 | 16 | 8 | 1714 H/s | 92.2% |
INT_MATH | 2/1800 | 16 | 1 | 1763 H/s | 94.3% |
Both mods | 2/1800 | 32 | 8 | 1595 H/s | 85.3% |
Results from @Bathmat
RX470 4GB (Hynix mem, 2100 clock, PBE one-click timings) Full results: #1 (comment)
Mod | Hashrate | Performance level |
---|---|---|
- | 934 H/s | 100.0% |
INT_MATH | 855 H/s | 91.5% |
SHUFFLE | 880 H/s | 94.2% |
Both mods | 877 H/s | 93.9% |
RX570 4GB Sapphire ITX, Hynix mem, 2000 mem clock, one click timings #1 (comment)
Mod | Hashrate | Performance level |
---|---|---|
- | 862 H/s | 100.0% |
Both mods | 825 H/s | 95.7% |
RX480 with Hynix (2100 mem, one-click timings)
Mod | Hashrate | Performance level |
---|---|---|
- | 884 H/s | 100.0% |
Both mods | 886 H/s | 100.0% |
RX480 with Samsung (2000 mem, one-click Uber 3.1 timings)
Mod | Hashrate | Performance level |
---|---|---|
- | 939 H/s | 100.0% |
Both mods | 852 H/s | 90.7% |