cryptonote tweak v2.2 #4404
Conversation
@vtnerd Correct me if I'm wrong, here is my understanding of this code: 128-bit multiplication result is XORed into the shuffle's 16-byte line with index 1 (j^0x10) and then (after the shuffle) this 128-bit result is XORed with shuffle's 16-byte line with index 2 (j ^ 0x20), which is essentially the same original line + value of _b.
So GPU performance will not be affected, that is 100%. CPUs will become slower, though I need to modify my assembler versions to see the impact. P.S. Was it intended that the second XOR is done essentially with the same 16-byte line? Maybe change 0x20 to 0x30? |
Edit: this is wrong |
Oh, I misunderstood it. (hi, lo) pair is still used locally (for the second shuffle), so performance impact for CPU will be minimal. |
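For readers following the thread: pieced together from the description above and the diff fragments quoted further down, the proposed macros appear to do roughly the following. This is a sketch, not the exact patch; `U64`, `hp_state`, `j`, `hi` and `lo` are the names used in slow-hash.c.

```c
/* Sketch of the proposed tweak (reconstruction, not the verbatim patch).
 * (hi, lo) is the 128-bit result of the 64x64-bit multiplication. */

#define VARIANT2_2_1() /* applied before the shuffle */ \
  do if (variant >= 2) \
  { \
    *U64(hp_state + (j ^ 0x10)) ^= hi; \
    *(U64(hp_state + (j ^ 0x10)) + 1) ^= lo; \
  } while (0)

#define VARIANT2_2_2() /* applied after the shuffle, before the 8-byte add */ \
  do if (variant >= 2) \
  { \
    hi ^= *U64(hp_state + (j ^ 0x20)); \
    lo ^= *(U64(hp_state + (j ^ 0x20)) + 1); \
  } while (0)
```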
Only a note: I think this change has a larger impact on performance for NVIDIA GPUs, mostly for the older Kepler generation. The reason is that we already use too many registers for CNv2.
@psychocrypt It should be fine because it's local. Those values are already available in that place. |
@SChernykh Not for the CUDA implementation, because the values are stored in a distributed way. 4 threads work on one hash to reduce the register count per thread and get better memory load efficiency.
@vtnerd I've added this tweak to my Intel-optimized assembler code and tested it on my notebook (Core i5 3210M). Hashrate went down from 69.6 H/s to 63.1 H/s - 10% slower! 👎 This is a disaster - a 10% slowdown for CPU without much impact on ASIC performance. I'll try to tune my code tomorrow, but from my experience I can say it's impossible to gain more than 2-3%; in other words, 65 H/s will be the new theoretical limit for this CPU.
I now understand why it's so bad for performance - my shuffle mod was designed to run in parallel with AES (first shuffle) and DIV->SQRT->MUL->ADD (second shuffle). This change entangles MUL->ADD and the second shuffle, so they can't run in parallel on CPU. GPUs will probably be hurt too because of this. Damn, the second shuffle even has to wait for the MUL result, which depends on the previous DIV->SQRT, so no parallel execution here for CPU/GPU at all. ASICs will just add one XOR gate (per bit) before the shuffle and one XOR gate (per bit) after the shuffle; the performance change will be minimal.
This is the only thing that might slow down ASIC a tiny bit. Addition is 3-4 times faster than multiplication, so this specific part of the ASIC circuit will be 25% slower because addition can't be merged with multiplication. But it won't help anyway, because it is a part of AES+MUL+ADD sequence which will still be much faster than DIV+SQRT sequence. There will be no slowdown for ASIC at all. |
@SChernykh didn't developers recover roughly 10% of performance through optimizations of the code you put forth for CNv2? If so, this 10% performance hit (pre-optimization, perhaps?) brings us back to where we were before optimizations were made for your CNv2 code. If those things are true, would it be safe to argue that the slowdowns, while having a negative effect on CPU/GPU vs ASIC, overall leave us at the same place we were comfortable with just a short month ago? If that can be assumed, then would it be fair to suggest that there are optimizations that can be made with this code as well? I'm just trying to mitigate the notion that this is a disaster.
@mobilepolice This is an additional performance hit where there shouldn't be any (ideally), that's the problem. On the bright side, Radeon RX 560 is unaffected - performance is the same. So GPUs will most likely do fine with this tweak. |
Ryzen gets 14% performance hit - down from 94.8 H/s to 83.1 H/s (single thread). So all CPUs will get 10% or more additional hit after this tweak. |
Assembly-level micro-optimizations can bring back up to 2-3%, but a 10% hit for Intel and a 14% hit for Ryzen is too big to compensate with these optimizations. The problem with this tweak is that it adds new data dependencies in the critical code path; there is no way to work around it. Out-of-order execution on the CPU, which worked extremely well in this part of the code, is now stalled there.
@SChernykh how do you manage to run them in parallel? In my implementation for this tweak, I store the result of `byte_mul` like this:

```asm
; byte_mul
movq C, %rax
movq D, %rbx
mulq %rbx
movq %rdx, %xmm0
movq %rax, %xmm1
movlhps %xmm1, %xmm0
; VARIANT2_SHUFFLE_ADD
movq OFFSET, %rbx
xorq $0x10, %rbx
leaq 0(STATE, %rbx), %rbx
movdqa 0(%rbx), %xmm1 ; chunk0
; VARIANT2_2_1
pxor %xmm0, %xmm1
; shuffle other chunks
; ...
; VARIANT2_2_2
pxor %xmm1, %xmm0
; ...
; byte_add
paddq %xmm0, A
```

I don't know if it helps.
@Equim-chan CPU runs them in parallel - shuffle and AES work on different independent parts of 64-byte cache line, the second shuffle also works on different parts of data. DIV->SQRT is used on the next iteration, so it runs in parallel with other stuff on CPU throughout the entire main loop - this is why there is no slowdown for CPU (without this tweak). P.S. This is by design - shuffle is only there to increase required memory bandwidth by 4x, not to mess with data dependencies in the main loop. This tweak breaks CNv2 design in this aspect and CPUs immediately take 10%-15% hit as a result. |
Please revise your tweak so that it doesn't bring a 10-15% hit for CPUs out of thin air.
src/crypto/slow-hash.c
Outdated
@@ -198,6 +198,21 @@ extern int aesb_pseudo_round(const uint8_t *in, uint8_t *out, const uint8_t *exp
} while (0)
#endif

#define VARIANT2_2_1() \
Adding to what was said in comments: VARIANT2_2_1 seems to be excessive and useless here. It just XORs data into the scratchpad; the result of this XOR is not used until we hit the same 64-byte line on some later iteration. That will be ~100 iterations later, at best, given the 2 MB scratchpad size. The delay between store and use allows it to be done fully in the background on an ASIC, while writing back to memory and in parallel with all other calculations. Only VARIANT2_2_2 does the job of breaking a fused hardware implementation of "8-byte multiply then 8-byte add".
src/crypto/slow-hash.c
Outdated
do if (variant >= 2) \
{ \
hi ^= *U64(hp_state + (j ^ 0x20)); \
lo ^= *(U64(hp_state + (j ^ 0x20)) + 1); \
The idea to use XOR to break fused MUL+ADD implementation is great, but do we need to use scratchpad for that? We have plenty of data readily available in registers here - result from previous AES, division_result, sqrt_result etc. It will be much smaller hit for CPU if we don't do XOR with memory location. XOR will still break fused MUL+ADD as long as it's done with some non-constant value.
Actually, division and sqrt results are not yet available here because of their high calculation latency. But everything else is available and can be used for the XOR.
The simple `hi ^= something1` and `lo ^= something2` will be enough to break fused MUL+ADD. Let's keep this tweak simple; the current version seems overcomplicated without any visible benefit to outweigh this complexity.
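A minimal, self-contained illustration of that suggestion (a sketch with hypothetical names such as `mul_xor_add` and `mix_hi`/`mix_lo`; not code from this PR):

```c
#include <stdint.h>

/* Minimal sketch of the suggested simplification (hypothetical helper and
 * parameter names, not code from this PR). XORing hi/lo with non-constant
 * values that are already in registers (e.g. part of the previous AES result)
 * is enough to break a fused multiply-then-add circuit, because the adder can
 * no longer be wired directly to the multiplier output. */
static inline void mul_xor_add(uint64_t a, uint64_t b,
                               uint64_t mix_hi, uint64_t mix_lo,
                               uint64_t dst[2])
{
    unsigned __int128 product = (unsigned __int128)a * b;  /* GCC/Clang */
    uint64_t hi = (uint64_t)(product >> 64);
    uint64_t lo = (uint64_t)product;

    hi ^= mix_hi;   /* "hi ^= something1" */
    lo ^= mix_lo;   /* "lo ^= something2" */

    dst[0] += hi;   /* the 8-byte add from the Cryptonight main loop */
    dst[1] += lo;
}
```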
The idea to use XOR to break fused MUL+ADD implementation is great, but do we need to use scratchpad for that? We have plenty of data readily available in registers here - result from previous AES, division_result, sqrt_result etc. It will be much smaller hit for CPU if we don't do XOR with memory location. XOR will still break fused MUL+ADD as long as it's done with some non-constant value.
All of the things mentioned in this list should be easier to "tweak" for anyone "sitting" on a design for CNv2, since they are already required for the computation of the hi/lo value. I will inspect a bit further though.
The simple `hi ^= something1` and `lo ^= something2` will be enough to break fused MUL+ADD. Let's keep this tweak simple; the current version seems overcomplicated without any visible benefit to outweigh this complexity.
This tweak is very simple. However, it is reducing some computations that can be done in parallel. I don't see how this type of parallelism is a magic feat that only a full CPU/GPU can do ... ?
I don't see how this type of parallelism is a magic feat that only a full CPU/GPU can do ... ?
You understand it the wrong way. This type of parallelism (with shuffles) is the only one a CPU/GPU can do; ASICs are capable of much more. So this is the only way a CPU/GPU can be competitive.
P.S. Do you know how the original Cryptonight ASICs achieved such high hashrates? They reduced the critical path to the 20-27 bits needed to calculate indices in the scratchpad and did everything else in parallel in the background. Major parts of AES and MUL weren't needed to progress the main loop; they could be done with lag, because they were only needed when the same 16-byte chunk was hit again. No CPU/GPU can do that.
I don't see how this type of parallelism is a magic feat that only a full CPU/GPU can do ... ?
You understand it the wrong way. This type of parallelism (with shuffles) is the only one a CPU/GPU can do; ASICs are capable of much more. So this is the only way a CPU/GPU can be competitive.
This seems extremely close to deflecting from my argument (because an ASIC can still outperform either way). I think what you are getting at is that the CPU already has more available fetch capabilities from L3/memory (and so do many GPUs), and so this is increasing the relative cost of production of an ASIC whereas it's fixed for CPU/GPU. Something like that?
P.S. Do you know how the original Cryptonight ASICs achieved such high hashrates? They reduced the critical path to the 20-27 bits needed to calculate indices in the scratchpad and did everything else in parallel in the background. Major parts of AES and MUL weren't needed to progress the main loop; they could be done with lag, because they were only needed when the same 16-byte chunk was hit again. No CPU/GPU can do that.
I don't see how this was possible. Selecting an address depends on AES or mul/add results from the immediate prior step. Unless there is some shortcut through the AES stage or something ... ?
The ASIC could delay/queue writes, but you seem to be describing something different.
I think what you are getting at is that the CPU already has more available fetch capabilities from L3/memory (and so do many GPUs), and so this is increasing the relative cost of production of an ASIC whereas it's fixed for CPU/GPU. Something like that?
Yes, pretty close. Please comment on my proposed change to VARIANT2_2 (scroll down the page to see my last comments).
I don't see how this was possible
Me neither, I got "enlightened" by FPGA developers - they use this technique in their 22 KH/s CNv1 (current Monero PoW) bitstream for FPGA.
Me neither, I got "enlightened" by FPGA developers - they use this technique in their 22 KH/s CNv1 (current Monero PoW) bitstream for FPGA.
No, I think you are full of shit. What did they do.
I don't know, they didn't tell me the details. I'm not an FPGA developer, but they do have a 22 KH/s bitstream for CNv1. Maybe it was misinformation (as has already happened before). And please don't use swear words.
src/crypto/slow-hash.c
Outdated
VARIANT2_SHUFFLE_ADD_NEON(hp_state, j); \
VARIANT2_2_2(); \
I remember there were two implementations for ARM; you updated only one of them.
Is it impossible to develop an ASIC to compute both shuffles in parallel too? An entire super-scalar pipeline should not be necessary ... ? If so, then this is not an "apples-to-apples" comparison. Now perhaps your argument was comparing some kind of modifiable microcoded design? Expand on this some more ...
This is a bit outdated since the entire proposal may require bigger tweaking - the operations cannot be reordered AFAIK, but it might be better to do 0x30.
The entanglement was the intent, although the performance hit isn't so nice.
An ASIC is always going to outperform. This is something that has been repeated frequently in discussions on "ASIC resistance". The goal was to break existing boards and (hopefully) require major re-designs too. Make it harder to "tweak" a design to increase the delay until fabrication. |
@vtnerd No, there is no need for a superscalar design; shuffles are not there to add complexity to ASIC logic. If you look at how shuffles are done in my design, you'll see that they are completely independent from the rest of the main loop. Their only purpose is to read-modify-write 64 bytes instead of 16 bytes, requiring 4 times more memory bandwidth from all devices with external memory. They are independent so that all CPUs and GPUs can run them in parallel with the main loop - specifically to avoid any slowdown while taking advantage of available (and unused) bandwidth on CPUs and GPUs. Shuffles do nothing (and are not supposed to do anything) against ASICs with on-chip memory, but they still serve their purpose to slow down ASICs with external memory - 4 times. Now, the computationally hard part of CNv2 is integer math - this is what will slow down ASICs with on-chip memory, because shuffles (even entangled with your tweak) can't do that. I don't see any point in entangling the second shuffle, because it only makes things worse for CPU and will not force a major redesign. ASICs will be mainly limited by integer math performance and can be quite straightforward in the rest of the loop, not even needing optimizations like fused MUL+ADD. |
Existing boards can be broken without hurting CPU performance even by the smallest tweak, like changing "+" to "-" or to another XOR anywhere, or changing the order of arguments - there are multiple ways to do it. A change that will require major redesign is more tricky, but what I'm thinking of - two shuffles in my (algorithm) design are identical, so they can (and will) use the same physical circuit in any (ASIC) design. If the second shuffle is changed to do something different, it will require different physical circuit. The only requirement for a shuffle is to read-modify-write 48 bytes that are untouched by the original Cryptonight. How it achieves this goal is irrelevant. |
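To make the "64 bytes instead of 16" point concrete, this is roughly what the SSE2 shuffle in slow-hash.c does - it reads and rewrites the three 16-byte chunks of the 64-byte line that the main loop itself does not touch. A sketch from memory (the exact pairing of chunks and addends may differ); `_a`, `_b`, `_b1` are the loop registers in slow-hash.c and the intrinsics come from `<emmintrin.h>`:

```c
/* Sketch of the CNv2 shuffle (simplified from the SSE2 version in slow-hash.c):
 * read-modify-write the three 16-byte chunks at offset^0x10, ^0x20, ^0x30,
 * i.e. the 48 bytes of the 64-byte line untouched by the original Cryptonight. */
#define VARIANT2_SHUFFLE_ADD_SSE2(base_ptr, offset) \
  do if (variant >= 2) \
  { \
    const __m128i chunk1 = _mm_load_si128((__m128i *)((base_ptr) + ((offset) ^ 0x10))); \
    const __m128i chunk2 = _mm_load_si128((__m128i *)((base_ptr) + ((offset) ^ 0x20))); \
    const __m128i chunk3 = _mm_load_si128((__m128i *)((base_ptr) + ((offset) ^ 0x30))); \
    _mm_store_si128((__m128i *)((base_ptr) + ((offset) ^ 0x10)), _mm_add_epi64(chunk3, _b1)); \
    _mm_store_si128((__m128i *)((base_ptr) + ((offset) ^ 0x20)), _mm_add_epi64(chunk1, _b)); \
    _mm_store_si128((__m128i *)((base_ptr) + ((offset) ^ 0x30)), _mm_add_epi64(chunk2, _a)); \
  } while (0)
```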
@vtnerd I've been playing with your code and discovered that if I remove VARIANT2_2_2, performance hit is less than 1%. So VARIANT2_2_1 is totally fine in this aspect - it's the second one that is responsible for all the performance hit. |
Forget it, there was a bug in my code - I missed XOR in VARIANT2_2_2. With XOR it doesn't improve anything, same 14% performance hit. But with VARIANT2_2_1 only there is almost no performance hit, so if you leave VARIANT2_2_1 as it is now and do "hi ^= something1 and lo ^= something2" for VARIANT2_2_2 it might work out fine. |
@vtnerd I keep experimenting. This code seems to work fine (92.8 H/s, only 2.1% slowdown):
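(The code block from this comment did not survive extraction. Below is a sketch of the change as described in the next sentence - both XOR steps done before the shuffle - using the same names as the earlier sketches and folded into a single hypothetical VARIANT2_2() macro for brevity.)

```c
/* Sketch (reconstruction) of the revised tweak: (hi, lo) is XORed into the
 * chunk at j^0x10 and mixed with the chunk at j^0x20 *before* the shuffle
 * runs, so the 8-byte add no longer has to wait for the shuffle's output. */
#define VARIANT2_2() \
  do if (variant >= 2) \
  { \
    *U64(hp_state + (j ^ 0x10)) ^= hi; \
    *(U64(hp_state + (j ^ 0x10)) + 1) ^= lo; \
    hi ^= *U64(hp_state + (j ^ 0x20)); \
    lo ^= *(U64(hp_state + (j ^ 0x20)) + 1); \
  } while (0)
```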
It's still the same basic structure; the only change is that (hi, lo) are XORed with data from the scratchpad before the shuffle. P.S. I'll generate test vectors with this code to be 100% sure that my assembler testing code is correct. But this code looks promising. P.P.S. And it still achieves all desired goals for this tweak: existing boards are broken, and a new design is required because:
@vtnerd Are you good with this change (see previous post)? I don't want to waste my time on optimizing it now if this change is not ok for you. |
I've tested this change on Intel Core i5 3210M (Ivy Bridge) and there was no slowdown at all, so this change fixes CPU performance hit. AMD Ryzen 5 2600 slowed down 2.1% but I think it can be fixed by fine-tuning the code. |
This is a solid argument. An ASIC can still be produced, and it will be faster, but the production cost (for materials) should increase too. The CPU/GPU cost is fixed. The "problem" is those shuffles are very independent. Nothing directly depends on the result, only indirectly through the scratchpad. So the technique you described above would apply to the shuffle portion. An ASIC (and perhaps even slickly written CPU versions) can store a/b from the previous round, fetch the 3 16-byte blocks, "shuffle", and write back while a separate portion is calculating the aes OR mul/div/sqrt/add portion from the "next" iteration. When the memory selections didn't overlap, the aes OR mul/div/sqrt/add cycle count is so high that the ASIC has plenty of time to fetch that data.

So ... after even deeper thought, this technique should crush the shit out of CPUs? Unless that could be duplicated in software, but it's difficult because it's not possible to precisely request the execution cores on a CPU. The CUDA implementation might be crushing it with this technique too.
Because it reduces the probability of waiting on another cacheline when computing the 8-byte add? It's still waiting on the div/sqrt result...
I also investigated this. The important thing seemed to be creating an immediate dependency with the shuffle ... somewhere. See above.
The output of the shuffle routine isn't used directly ... |
Actually this changes things, because additional memory still has to be fetched for the mul/div/sqrt/add portion. So it increases the reads needed specifically for that section.
@kio3i0j9024vkoenio Test using xmr-stak: fireice-uk/xmr-stak#1832 |
Is the test pool killallasics already updated to the new PoW?
@SChernykh Is there a plan to create double, triple, ... asm implementations for Ryzen and Intel? I can only find the double hash for Intel...
@Bendr0id The problem with double hash on Ryzen is that two single hash threads on the same core are significantly faster, so I don't see a point in tuning it for Ryzen. There are no Ryzen models with more than 2 MB of cache per core like some Xeons have, so what would be the use case for 2x, 3x, ... etc. modes on Ryzen?
E.g. less power consumption, lower temperature and lower CPU usage. IMHO it makes sense to have it.
Last time I tried to tweak the Intel double hash version for Ryzen, I was only able to get a <1% improvement; this is why I didn't release it (and for the reasons described above). So a double hash version only makes sense if it has better hash/watt performance. I can try, but I can't guarantee that I can get anything better than the existing Intel double hash version.
OK, maybe for Ryzen it's not that mission-critical, but x3, x4, x5 for Intel makes sense.
@SChernykh the 5x thread mode for Intel was initially released on XMR-Stak for the Crystalwell 4xxxR series (such as the i5-4570R) that have 128MB of L4 cache. These CPUs can hash much better with high scratchpad count since they have the additional cache. |
@psychocrypt It looks like they are now. And don't forget to integrate my latest asm code (submitted just a few minutes ago). |
@SChernykh, @vtnerd, @psychocrypt, killallasics should be on the right fork and the right pool code for further testing and optimizing etc. Thanks everyone!!
I've started doing extensive performance testing. While older Intel CPUs (Core i7 2600k for example) generally show 90-95% of their original variant 1 performance - which is on par with GPUs, newer CPUs are doing much better. I've tested AMD Ryzen 5 2600 and final performance numbers show negligible performance difference: 628.3 -> 627.8 H/s for variant 1 -> variant 2 change. @xmrig tested Intel Core i7 7700 and his numbers showed even better performance: 302 -> 305 H/s. Yes, there is no mistake - variant 2 is even faster on the latest Intel processors. |
@SChernykh really good news! Do you have a chance to look into x3, x4, x5 for Intel?
@Bendr0id I can look into x3 and x4, but x5 requires 10 MB of cache and I don't have such hardware at hand.
@SChernykh and others, where is the optimization happening? I'm seeing some impressive numbers on killallasics, just curious which software is getting worked on and where |
@Gingeropolous It's xmr-stak and xmrig as far as I know. |
Don't you already have a Ryzen 2600? That should have 8M cache in each of its two CCXs for 16M L3 total, so you can test up to 4x directly. You could also reduce the scratchpad size to 1MB like in the CN-lite PoW if you wanted to test scratchpad counts larger than that. |
@plavirudar I need an Intel processor (preferably the exact model) to optimize x3, x4, x5 modes for Intel. I have an i7 2600K with 8 MB cache, so I can do x3 and x4 modes. I also have a Ryzen, but I can only use it to make Ryzen-optimized code.
I see, I must have misread your initial comment. In any case, I think it would certainly be helpful to have up to 4x scratchpad code first for both Intel and AMD. Since there is so much additional compute involved per scratchpad, at some point increasing cache won't help anyway since memory access is no longer the bottleneck, so 4x may even be the sweet spot in this case. |
RE: superthreading on purpose (with the current v1 PoW). Explain how, if I'm abusing cache so hard, the hashrate goes up so much? The cores trade off about 60/30 (so lose 12 H/s off core0 to gain 30 H/s off core1) and the rate moves around between the two (cache migrations?). Running 1x and 1x does not do the same; it murders both cores and overall is about 80 (+8 over single thread, single core).

Either that, or it chokes so hard that the telemetry just shows improvement while it doesn't actually. It will show peaks of 108 H/s and valleys of 90 H/s. Might also chew watts times ten, but I have no tools nor care to measure. But the pool-side seems to match. And both those numbers are way better than 72. Windows and Linux both show identical performance for the overcommit setup (the win7 box might be a 4130 / same but less clock).

I used to have a 10x patch which did similar speeds to the 5x but even slightly smoother. I run the 5x now since it's the max available without retooling the old patch to match the reorganized cpu miner code in xmr-stak. Plus this PoW update may destroy whatever side-bandwidth is giving me gains for abusing CPUs.
Actually the Xeon E7-8837 is Westmere-EX, not Nehalem, and was released April 3, 2011, whereas Sandy Bridge was released January 9, 2011. http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E7-8837.html http://www.cpu-world.com/Cores/Sandy%20Bridge.html

Some of the improvements in Westmere are:
- 32 nm die shrink of Nehalem
- A new set of instructions that gives over 3x the encryption and decryption rate of Advanced Encryption Standard (AES) processes compared to before. Delivers seven new instructions (AES instruction set or AES-NI) that can be used by the AES algorithm.
- Also an instruction called PCLMULQDQ (see CLMUL instruction set) that will perform carry-less multiplication for use in cryptography and data compression.
@kio3i0j9024vkoenio Westmere is Nehalem, even the Wikipedia link says it. A die shrink = exactly the same execution speed for the same code, just a physically smaller CPU core.
Wikipedia says: "Westmere (formerly Nehalem-C)". So who really knows what the "C" means, but since it performs much better than you expected, something must have improved.
@kio3i0j9024vkoenio Yes, actually the main difference is AES support. Other differences are not relevant here and don't change Cryptonight performance per core. |
credit: monero monero-project/monero#4404
Reference code: monero-project/monero#4404. I tested it on x86 with av=1-10 and on ARM with av=1-4; the self test passed.
This is a proposed tweak to the cryptonote algorithm for the next release. This was designed to "augment" CNv2, and reduce the chances of pre-built boards for those changes. Advantages:
Negatives:
Feedback from the community is strongly encouraged. Possible additional techniques could involve mixing information from earlier stages of the pipeline or CNv2 stages.