
cryptonote tweak v2.2 #4404

Merged Sep 22, 2018 (1 commit)

Conversation

@vtnerd (Contributor) commented Sep 19, 2018

This is a proposed tweak to the cryptonote algorithm for the next release. It was designed to "augment" CNv2 and to reduce the chance that boards pre-built for those changes will work. Advantages:

  • Works with or without the CNv2 changes by @SChernykh.
  • Any manufactured boards that implement one pass of the primary loop of the original, v1, or CNv2 algorithm are unlikely to work.
  • Any manufactured boards that have an implementation of the 8-byte multiply then 8-byte add routine (which could be leveraged in v0, v1, or v2) are unlikely to work.
  • XOR was used because it is neither distributive nor associative over addition or multiplication, so it should force an ordering of the operations around the multiply/add sections of the original and CNv2 designs.

Negatives:

  • Any existing design for the original, v1, or CNv2 is probably easy to update.
  • Any highly custom RISC/microcoded chip is unlikely to have much additional penalty (i.e. if some chips were designed for use with CNv2 instructions, this doesn't change performance much).

Feedback from the community is strongly encouraged. Possible additional techniques could involve mixing information from earlier stages of the pipeline or CNv2 stages.
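Roughly, the proposed macros sit around the second shuffle like this (a sketch reconstructed from the diff below; the surrounding multiply/add code is abbreviated and operand names may differ slightly):

    /* (hi, lo) = 128-bit result of the 64x64-bit multiply; j = current scratchpad offset */
    VARIANT2_2_1();                          /* *U64(hp_state + (j ^ 0x10))       ^= hi;  */
                                             /* *(U64(hp_state + (j ^ 0x10)) + 1) ^= lo;  */
    VARIANT2_SHUFFLE_ADD_SSE2(hp_state, j);  /* existing CNv2 64-byte shuffle             */
    VARIANT2_2_2();                          /* hi ^= *U64(hp_state + (j ^ 0x20));        */
                                             /* lo ^= *(U64(hp_state + (j ^ 0x20)) + 1);  */
    a[0] += hi;                              /* the 8-byte add now depends on the XORs,   */
    a[1] += lo;                              /* so it cannot be fused with the multiply   */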

@SChernykh (Contributor) commented Sep 19, 2018

@vtnerd Correct me if I'm wrong; here is my understanding of this code: the 128-bit multiplication result is XORed into the shuffle's 16-byte line with index 1 (j ^ 0x10), and then (after the shuffle) this 128-bit result is XORed with the shuffle's 16-byte line with index 2 (j ^ 0x20), which is essentially the same original line plus the value of _b.

tmp = chunk1_before
chunk1_before ^= (hi, lo)
SHUFFLE
(hi, lo) ^= chunk2_after (the same as chunk1_before+_b)

So hi2 = ((tmp[0] ^ hi1)+_b[0]) ^ hi1, the same for lo1 -> lo2.
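In plain C, the data flow for one 64-bit half looks like this (illustration only, using the variable names from the pseudocode above; b0 stands for _b[0]):

    #include <stdint.h>

    /* Follow one 64-bit half of (hi, lo) through the tweak. */
    static uint64_t tweak_half(uint64_t chunk1_before, uint64_t hi1, uint64_t b0)
    {
        uint64_t tmp = chunk1_before;                /* original value at j ^ 0x10             */
        chunk1_before ^= hi1;                        /* first XOR, before the shuffle          */
        uint64_t chunk2_after = chunk1_before + b0;  /* shuffle writes chunk1 + _b to j ^ 0x20 */
        uint64_t hi2 = hi1 ^ chunk2_after;           /* second XOR, after the shuffle          */
        /* hence hi2 == ((tmp ^ hi1) + b0) ^ hi1 */
        (void)tmp;
        return hi2;
    }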

GPU performance will not be affected, that is 100%. CPUs will become slower, though I need to modify my assembler versions to see the impact. The (hi, lo) pair will have to be carried through the entire main loop, while it was only used locally before. There might not be enough x86 registers to do that. Even if there are enough registers, two 128-bit XORs will slow down the code at least 1-2%. I'll implement it tomorrow for CPU and see how good/bad it is for performance.

P.S. Was it intended that the second XOR is done essentially with the same 16-byte line? Maybe change 0x20 to 0x30?

@SChernykh (Contributor) commented Sep 19, 2018

On the other hand, an ASIC design will have to be updated to carry the (hi, lo) 128-bit pair through the entire main loop as well. That's a bigger redesign than just adding two XORs in two places, so this change is stronger than it seems at first sight. The size of the main loop state is increased by 128 bits.

Edit: this is wrong

@SChernykh (Contributor):

Oh, I misunderstood it. The (hi, lo) pair is still used locally (for the second shuffle), so the performance impact for CPUs will be minimal.

@psychocrypt:

Only a note: I think this change has a larger impact on performance for NVIDIA GPUs, mostly for the older Kepler generation. The reason is that we still use too many registers for CNv2.

@SChernykh (Contributor):

@psychocrypt It should be fine because it's local. Those values are already available in that place.

@psychocrypt:

@SChernykh Not for the CUDA implementation, because the values are stored distributed across threads: 4 threads work on one hash to reduce the register count per thread and get better memory load efficiency.

@SChernykh (Contributor) commented Sep 19, 2018

@vtnerd I've added this tweak to my Intel-optimized assembler code and tested it on my notebook (Core i5 3210M). Hashrate went down from 69.6 H/s to 63.1 H/s - 10% slower! 👎 This is a disaster - a 10% slowdown for CPUs without much impact on ASIC performance. I'll try to tune my code tomorrow, but from experience I can say it's impossible to make it more than 2-3% faster; in other words, 65 H/s will be the new theoretical limit for this CPU.

@SChernykh (Contributor):

I now understand why it's so bad for performance - my shuffle mod was designed to run in parallel with AES (first shuffle) and DIV->SQRT->MUL->ADD (second shuffle). This change entangles MUL->ADD and the second shuffle, so they can't run in parallel on CPU. GPUs will be probably hurt too because of this.

Damn, second shuffle even has to wait for MUL result which depends on previous DIV->SQRT, so no parallel execution here for CPU/GPU at all.

ASICs will just add one XOR gate (per bit) before the shuffle and one XOR gate (per bit) after the shuffle, performance change will be minimal.

@SChernykh (Contributor):

Any manufactured boards that have an implementation of the 8-byte multiply then 8-byte add routine (which could be leveraged in in v0, v1, or v2) is unlikely to work.

This is the only thing that might slow down ASIC a tiny bit. Addition is 3-4 times faster than multiplication, so this specific part of the ASIC circuit will be 25% slower because addition can't be merged with multiplication. But it won't help anyway, because it is a part of AES+MUL+ADD sequence which will still be much faster than DIV+SQRT sequence. There will be no slowdown for ASIC at all.

@mobilepolice:

@SChernykh didn't developers recover roughly 10% of performance through optimizations of the code you put forth for CNv2?

If so, this 10% performance hit (pre-optimization, perhaps?) brings us back to where we were before optimizations were made for your CNv2 code.

If those things are true, would it be safe to argue that the slowdowns, while having a negative effect on CPU/GPU vs ASIC, overall leave us at the same place we were comfortable at just a short month ago?

If that can be assumed, then would it be fair to suggest that there are optimizations that can be made with this code as well?

I'm just trying to mitigate the notion that this is a disaster.

@SChernykh (Contributor):

@mobilepolice This is an additional performance hit where there shouldn't be any (ideally), that's the problem. On the bright side, Radeon RX 560 is unaffected - performance is the same. So GPUs will most likely do fine with this tweak.

@SChernykh (Contributor):

Ryzen gets a 14% performance hit - down from 94.8 H/s to 83.1 H/s (single thread). So all CPUs will take an additional hit of 10% or more after this tweak.

@SChernykh (Contributor) commented Sep 20, 2018

@vtnerd #4218 (comment)

I understand that the final tweak has to be independent of me - that is totally OK and must be done. But it has to be carefully designed - with performance in mind. If the target is to break existing manufactured boards for CNv2 (I'm 100% sure there aren't any) or existing complete designs just waiting to be manufactured, it can be achieved without any performance hit for CPU/GPU at all. I know at least 5 different ways to do it, but I can't propose them for obvious reasons.

@SChernykh (Contributor):

@mobilepolice

If that can be assumed, then would it be fair to suggest that there are optimizations that can be made with this code as well?

Assembly-level micro-optimizations can bring back up to 2-3%, but a 10% hit for Intel and a 14% hit for Ryzen is too big to compensate with such optimizations. The problem with this tweak is that it adds new data dependencies in the critical code path, and there is no way to work around that. Out-of-order execution on the CPU, which worked extremely well in this part of the code, is now stalled there.

@Equim-chan commented Sep 20, 2018

my shuffle mod was designed to run in parallel with AES (first shuffle) and DIV->SQRT->MUL->ADD (second shuffle)

@SChernykh how do you manage to run them in parallel?

In my implementation of this tweak, I store the result of mulq into an XMM register, and when the first chunk is loaded I can pxor it in. In the end, a can be added with paddq as well; it looks like this:

# byte_mul
movq    C, %rax
movq    D, %rbx
mulq    %rbx
movq    %rdx, %xmm0
movq    %rax, %xmm1
movlhps %xmm1, %xmm0

# VARIANT2_SHUFFLE_ADD
movq    OFFSET, %rbx
xorq    $0x10, %rbx
leaq    0(STATE, %rbx), %rbx
movdqa  0(%rbx), %xmm1        # chunk0: the 16-byte line at j ^ 0x10
# VARIANT2_2_1
pxor    %xmm0, %xmm1
# shuffle other chunks
# ...

# VARIANT2_2_2
pxor    %xmm1, %xmm0
# ...

# byte_add
paddq   %xmm0, A

I don't know if it helps.

@SChernykh (Contributor) commented Sep 20, 2018

@Equim-chan The CPU runs them in parallel - the shuffle and AES work on different, independent parts of the 64-byte cache line, and the second shuffle also works on different parts of the data. DIV->SQRT is used on the next iteration, so it runs in parallel with other work throughout the entire main loop - this is why there is no slowdown for CPUs (without this tweak).

P.S. This is by design - the shuffle is only there to increase the required memory bandwidth by 4x, not to mess with data dependencies in the main loop. This tweak breaks the CNv2 design in this respect, and CPUs immediately take a 10-15% hit as a result.
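To make the dependency argument concrete, here is a rough (and deliberately simplified) pseudocode view of one CNv2 main-loop iteration; stage names follow the discussion rather than the exact source:

    c        = AES(scratchpad[j], a)
    SHUFFLE(j)                        ; shuffle #1: the other 48 bytes of the cache line,
                                      ;             independent of the AES result
    scratchpad[j] = c ^ b
    j        = low_bits(c)            ; next offset needs only the low bits of c
    d        = scratchpad[j]
    d       ^= f(division_result, sqrt_result)   ; CNv2 integer math, started one iteration earlier
    (hi, lo) = mul_128(c, d)
    SHUFFLE(j)                        ; shuffle #2: independent of (hi, lo) in plain CNv2;
                                      ; the proposed tweak XORs (hi, lo) into and out of this
                                      ; step, serializing MUL -> XOR -> shuffle -> XOR -> ADD
    a       += (hi, lo)
    scratchpad[j] = a;  a ^= d        ; write back and feed the next iteration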

@SChernykh (Contributor) left a comment

Please revise your tweak so that it doesn't bring a 10-15% hit for CPUs out of thin air.

@@ -198,6 +198,21 @@ extern int aesb_pseudo_round(const uint8_t *in, uint8_t *out, const uint8_t *exp
} while (0)
#endif

#define VARIANT2_2_1() \
@SChernykh (Contributor):

Adding to what was said in the comments: VARIANT2_2_1 seems excessive and useless here. It just XORs data into the scratchpad, and the result of this XOR is not used until we hit the same 64-byte line on some later iteration. That will be ~100 iterations later, at best, given the 2 MB scratchpad size. The delay between store and use allows an ASIC to do it fully in the background, while writing back to memory and in parallel with all other calculations. Only VARIANT2_2_2 does the job of breaking a fused hardware implementation of "8-byte multiply then 8-byte add".

do if (variant >= 2) \
{ \
hi ^= *U64(hp_state + (j ^ 0x20)); \
lo ^= *(U64(hp_state + (j ^ 0x20)) + 1); \
@SChernykh (Contributor):

The idea of using XOR to break a fused MUL+ADD implementation is great, but do we need to use the scratchpad for that? We have plenty of data readily available in registers here - the result from the previous AES, division_result, sqrt_result, etc. It will be a much smaller hit for the CPU if we don't XOR with a memory location. XOR will still break fused MUL+ADD as long as it's done with some non-constant value.

@SChernykh (Contributor):

Although, division and sqrt results are not yet available here because of high calculation latency. But everything else is available and can be used for XOR.

@SChernykh (Contributor):

A simple hi ^= something1 and lo ^= something2 will be enough to break fused MUL+ADD. Let's keep this tweak simple; the current version seems over-complicated without any visible benefit to outweigh the complexity.
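For illustration, a register-only variant along these lines could look like the following (mul_64x64 is a placeholder name for the 64x64 -> 128-bit multiply helper, and the choice of b[1]/c[1] as XOR operands is purely hypothetical - any live, non-constant values would do):

    lo = mul_64x64(c[0], b[0], &hi);  /* 128-bit multiply result               */
    hi ^= b[1];                       /* "something1": already in a register   */
    lo ^= c[1];                       /* "something2": no scratchpad read      */
    a[0] += hi;                       /* the add still can't be fused with     */
    a[1] += lo;                       /* the multiply because of the XORs      */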

@vtnerd (Contributor Author):

The idea of using XOR to break a fused MUL+ADD implementation is great, but do we need to use the scratchpad for that? We have plenty of data readily available in registers here - the result from the previous AES, division_result, sqrt_result, etc. It will be a much smaller hit for the CPU if we don't XOR with a memory location. XOR will still break fused MUL+ADD as long as it's done with some non-constant value.

All of the things mentioned in this list should be easier to "tweak" around for anyone "sitting" on a design for CNv2, since they are already required for the computation of the hi/lo value. I will inspect a bit further though.

A simple hi ^= something1 and lo ^= something2 will be enough to break fused MUL+ADD. Let's keep this tweak simple; the current version seems over-complicated without any visible benefit to outweigh the complexity.

This tweak is very simple. However, it does reduce the amount of computation that can be done in parallel. I don't see how this type of parallelism is a magic feat that only a full CPU/GPU can do ... ?

@SChernykh (Contributor):

I don't see how this type of parallelism is a magic feat that only a full CPU/GPU can do ... ?

You understand it the wrong way. This type of parallelism (with shuffles) is the only one CPU/GPU can do, ASIC are capable of much more. So this is the only way how CPU/GPU can be competitive.

P.S. Do you know how original Cryptonight ASICs achieved such high hashrates? They reduced the critical path to 20-27 bits needed to calculate indices in the scratchpad and did everything else in parallel in the background. Major parts of AES and MUL weren't needed to progress the main loop, they could be done with lag because they were only needed when the same 16-byte chunk was hit again. No CPU/GPU can do that.
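To illustrate the critical-path point: with a 2 MB scratchpad and 16-byte-aligned accesses, the next offset needs only the low bits of the freshly computed value (the mask below is the standard Cryptonight one), so the full 64/128-bit results can be resolved lazily:

    /* 2 MB scratchpad, 16-byte aligned: only bits 4..20 of the result are
       needed to start the next memory access.                              */
    j = c[0] & 0x1FFFF0;   /* next scratchpad offset                        */
    /* the remaining bits of c, and the full MUL/ADD results, are consumed
       only when the same 16-byte chunk is hit again, so they can lag.      */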

@vtnerd (Contributor Author):

I don't see how this type of parallelism is a magic feat that only a full CPU/GPU can do ... ?

You understand it the wrong way. This type of parallelism (with shuffles) is the only one CPU/GPU can do, ASIC are capable of much more. So this is the only way how CPU/GPU can be competitive.

This seems extremely close to deflecting from my argument (because an ASIC can still outperform either way). I think what you are getting at is that the CPU already has more available fetch capability from L3/memory (and so do many GPUs), so this increases the relative production cost of an ASIC whereas it's fixed for CPU/GPU. Something like that?

P.S. Do you know how original Cryptonight ASICs achieved such high hashrates? They reduced the critical path to 20-27 bits needed to calculate indices in the scratchpad and did everything else in parallel in the background. Major parts of AES and MUL weren't needed to progress the main loop, they could be done with lag because they were only needed when the same 16-byte chunk was hit again. No CPU/GPU can do that.

I don't see how this was possible. Selecting an address depends on AES or mul/add results from the immediate prior step. Unless there is some shortcut through the AES stage or something ... ?

The ASIC could delay/queue writes, but you seem to be describing something different.

@SChernykh (Contributor):

I think what you are getting is that the CPU already has more available fetch capabilities from L3/memory (and so do many GPUs), and so this is increasing the relative cost of production of an ASIC whereas its fixed for CPU/GPU. Something like that?

Yes, pretty close. Please comment on my proposed change to VARIANT2_2 (scroll down the page to see my last comments).

@SChernykh (Contributor):

I don't see how this was possible

Me neither, I got "enlightened" by FPGA developers - they use this technique in their 22 KH/s CNv1 (current Monero PoW) bitstream for FPGA.

@vtnerd (Contributor Author):

Me neither, I got "enlightened" by FPGA developers - they use this technique in their 22 KH/s CNv1 (current Monero PoW) bitstream for FPGA.

No, I think you are full of shit. What did they do.

@SChernykh (Contributor):

I don't know; they didn't tell me the details. I'm not an FPGA developer, but they do have a 22 KH/s bitstream for CNv1. Maybe it was misinformation (as has already happened before). And please don't use swear words.

VARIANT2_SHUFFLE_ADD_NEON(hp_state, j); \
VARIANT2_2_2(); \
@SChernykh (Contributor):

I remember there were two implementations for ARM, you updated only one of them.

@vtnerd (Contributor Author) commented Sep 20, 2018

I now understand why it's so bad for performance - my shuffle mod was designed to run in parallel with AES (first shuffle) and DIV->SQRT->MUL->ADD (second shuffle). This change entangles MUL->ADD and the second shuffle, so they can't run in parallel on CPU. GPUs will be probably hurt too because of this.

Damn, second shuffle even has to wait for MUL result which depends on previous DIV->SQRT, so no parallel execution here for CPU/GPU at all.

ASICs will just add one XOR gate (per bit) before the shuffle and one XOR gate (per bit) after the shuffle, performance change will be minimal.

Is it impossible to develop an ASIC that computes both shuffles in parallel too? An entire super-scalar pipeline should not be necessary ... ? If so, then this is not an "apples-to-apples" comparison.

Now perhaps your argument was comparing some kind of modifiable microcoded design? Expand on this some more ...

P.S. Was it intended that the second XOR is done essentially with the same 16-byte line? Maybe change 0x20 to 0x30?

This is a bit outdated since the entire proposal may require bigger tweaking - the order of operations cannot be reordered AFAIK, but it might be better to do 0x30.

I now understand why it's so bad for performance - my shuffle mod was designed to run in parallel with AES (first shuffle) and DIV->SQRT->MUL->ADD (second shuffle). This change entangles MUL->ADD and the second shuffle, so they can't run in parallel on CPU. GPUs will be probably hurt too because of this.

The entanglement was the intent, although the performance hit isn't so nice.

This is the only thing that might slow down ASIC a tiny bit. Addition is 3-4 times faster than multiplication, so this specific part of the ASIC circuit will be 25% slower because addition can't be merged with multiplication. But it won't help anyway, because it is a part of AES+MUL+ADD sequence which will still be much faster than DIV+SQRT sequence. There will be no slowdown for ASIC at all.

An ASIC is always going to outperform. This is something that has been repeated frequently in discussions on "ASIC resistance". The goal was to break existing boards and (hopefully) require major re-designs too. Make it harder to "tweak" a design to increase the delay until fabrication.

@SChernykh (Contributor):

@vtnerd No, there is no need for a superscalar design; the shuffles are not there to add complexity to ASIC logic.

If you look at how the shuffles are done in my design, you'll see that they are completely independent from the rest of the main loop. Their only purpose is to read-modify-write 64 bytes instead of 16 bytes, requiring 4 times more memory bandwidth from all devices with external memory.

They are independent so that all CPUs and GPUs can run them in parallel with the main loop - specifically to avoid any slowdown while taking advantage of available (and unused) bandwidth on CPUs and GPUs.

Shuffles do nothing (and are not supposed to do anything) against ASICs with on-chip memory, but they still serve their purpose to slow down ASICs with external memory - 4 times.

Now, the computationally hard part of CNv2 is the integer math - that is what will slow down ASICs with on-chip memory, because shuffles (even entangled with your tweak) can't do that. I don't see any point in entangling the second shuffle, because it only makes things worse for CPUs and will not force a major redesign. ASICs will mainly be limited by integer math performance and can be quite straightforward in the rest of the loop, not even needing optimizations like fused MUL+ADD.
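For reference, the shuffle being discussed is a read-modify-write over the three 16-byte chunks of the cache line that the AES step does not touch; roughly (reconstructed from the CNv2 SSE2 macro, so treat the exact operand pairing as approximate):

    /* read the three sibling chunks of the 64-byte cache line */
    chunk1 = load(hp_state + (j ^ 0x10));
    chunk2 = load(hp_state + (j ^ 0x20));
    chunk3 = load(hp_state + (j ^ 0x30));
    /* write them back rotated, with 128-bit adds of values already in registers */
    store(hp_state + (j ^ 0x10), chunk3 + _b1);
    store(hp_state + (j ^ 0x20), chunk1 + _b);
    store(hp_state + (j ^ 0x30), chunk2 + _a);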

@SChernykh (Contributor) commented Sep 20, 2018

The goal was to break existing boards and (hopefully) require major re-designs too.

Existing boards can be broken without hurting CPU performance even by the smallest tweak, like changing "+" to "-" or to another XOR anywhere, or changing the order of arguments - there are multiple ways to do it.

A change that would require a major redesign is more tricky, but here is what I'm thinking of: the two shuffles in my (algorithm) design are identical, so they can (and will) use the same physical circuit in any (ASIC) design. If the second shuffle is changed to do something different, it will require a different physical circuit. The only requirement for a shuffle is to read-modify-write the 48 bytes that are untouched by the original Cryptonight; how it achieves that goal is irrelevant.

@SChernykh (Contributor):

@vtnerd I've been playing with your code and discovered that if I remove VARIANT2_2_2, the performance hit is less than 1%. So VARIANT2_2_1 is totally fine in this respect - it's the second one that is responsible for all of the performance hit.

@SChernykh (Contributor) commented Sep 20, 2018

Forget it, there was a bug in my code - I missed XOR in VARIANT2_2_2. With XOR it doesn't improve anything, same 14% performance hit.

But with VARIANT2_2_1 only there is almost no performance hit, so if you leave VARIANT2_2_1 as it is now and do "hi ^= something1 and lo ^= something2" for VARIANT2_2_2 it might work out fine.

@SChernykh (Contributor) commented Sep 20, 2018

@vtnerd I keep experimenting. This code seems to work fine (92.8 H/s, only 2.1% slowdown):

#define VARIANT2_2() \
do if (variant >= 2) \
{ \
    *U64(hp_state + (j ^ 0x10)) ^= hi; \
    *(U64(hp_state + (j ^ 0x10)) + 1) ^= lo; \
    hi ^= *U64(hp_state + (j ^ 0x20)); \
    lo ^= *(U64(hp_state + (j ^ 0x20)) + 1); \
} while (0)

It's still the same basic structure, the only change is that (hi, lo) are XORed with data from scratchpad before the shuffle.

P.S. I'll generate test vectors with this code to be 100% sure that my assembler testing code is correct. But this code looks promising.

P.P.S. And it still achieves all the desired goals for this tweak: existing boards are broken and a new design is required, because:

  • MUL+ADD can't be merged anymore
  • The second shuffle is changed (the first XOR with memory at 0x10 is essentially a new part of it), so the same physical circuit can't be reused for both shuffles.
  • A new 128-bit data path between (hi, lo) and the second shuffle circuit needs to be created.
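For context, this is roughly how the combined macro would sit in the multiply/add section of the main loop (ordering inferred from the description above, and mul128 is a placeholder name for the 64x64 multiply helper; treat this as a sketch rather than the final diff):

    lo = mul128(c[0], b[0], &hi);            /* 64x64 -> 128-bit multiply            */
    VARIANT2_2();                            /* XOR (hi, lo) into j ^ 0x10, then XOR  */
                                             /* the still-unshuffled j ^ 0x20 chunk   */
                                             /* back into (hi, lo)                    */
    VARIANT2_SHUFFLE_ADD_SSE2(hp_state, j);  /* 64-byte read-modify-write shuffle     */
    a[0] += hi;                              /* 8-byte adds now wait for the XORs     */
    a[1] += lo;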

@SChernykh (Contributor):

@vtnerd Are you good with this change (see previous post)? I don't want to waste my time on optimizing it now if this change is not ok for you.

@SChernykh (Contributor):

I've tested this change on an Intel Core i5 3210M (Ivy Bridge) and there was no slowdown at all, so this change fixes the CPU performance hit. AMD Ryzen 5 2600 slowed down by 2.1%, but I think that can be fixed by fine-tuning the code.

@vtnerd (Contributor Author) commented Sep 20, 2018

They are independent so that all CPUs and GPUs can run them in parallel with the main loop - specifically to avoid any slowdown while taking advantage of available (and unused) bandwidth on CPUs and GPUs.

Shuffles do nothing (and are not supposed to do anything) against ASICs with on-chip memory, but they still serve their purpose to slow down ASICs with external memory - 4 times.

This is a solid argument. An ASIC can still be produced, and it will be faster, but the production cost (for materials) should increase too. The CPU/GPU cost is fixed.

The "problem" is those shuffles are very independent. Nothing directly depends on the result, only indirectly through the scratchpad. So the technique you described above would apply to the shuffle portion. An ASIC (and perhaps even slickly written CPU versions) can store a/b from previous round, fetch the 3 16-byte blocks, "shuffle", and writeback while a separate portion is calculating the aes OR mul/div/sqrt/add portion from the "next" iteration. When the memory selection didn't overlap, the aes OR mul/div/sqrt/add cycles is so high that the ASIC has plenty of time to fetch that data. So ... after even deeper thought, this technique should crush the shit of CPUs? Unless that could be duplicated in software, buts its difficult because its not possible to precisely request the execution cores on a CPU. The CUDA might be crushing with this technique too.

So I did this and was able to achieve 90.6 H/s with both VARIANT2_2_1 and VARIANT2_2_2. I can safely assume that with further fine-tuning I will be able to get 91 H/s, compared to 94.8 H/s without these tweaks. Only 4% slow down for Ryzen instead of 14% - much better.

Do you agree to change 0x20 to 0x30 for VARIANT2_2_2?
I was wrong, it was actually 0x10 for VARIANT2_2_2 which gave this result.

Because it reduces the probability of waiting on another cacheline when computing the 8-byte add? It's still waiting on the div/sqrt result...

Shuffles do nothing (and are not supposed to do anything) against ASICs with on-chip memory, but they still serve their purpose to slow down ASICs with external memory - 4 times.

A change that will require major redesign is more tricky, but what I'm thinking of - two shuffles in my (algorithm) design are identical, so they can (and will) use the same physical circuit in any (ASIC) design. If the second shuffle is changed to do something different, it will require different physical circuit. The only requirement for a shuffle is to read-modify-write 48 bytes that are untouched by the original Cryptonight. How it achieves this goal is irrelevant.

I also investigated this. The immediately important thing seemed to be creating an immediate dependency with the shuffle ... somewhere. See above.

@vtnerd I keep experimenting. This code seems to work fine (92.5 H/s, only 2.5% slowdown):
...
It's still the same basic structure, the only change is that (hi, lo) are XORed with data from scratchpad before the shuffle.

@vtnerd Are you good with this change (see previous post)? I don't want to waste my time on optimizing it now if this change is not ok for you.

The output of the shuffle routine isn't used directly ...

@vtnerd (Contributor Author) commented Sep 20, 2018

@vtnerd I keep experimenting. This code seems to work fine (92.5 H/s, only 2.5% slowdown):
...
It's still the same basic structure, the only change is that (hi, lo) are XORed with data from scratchpad before the shuffle.

@vtnerd Are you good with this change (see previous post)? I don't want to waste my time on optimizing it now if this change is not ok for you.

The output of the shuffle routine isn't used directly ...

Actually, this changes things, because additional memory still has to be fetched for the mul/div/sqrt/add portion. So it increases the reads needed specifically for that section.

@SChernykh (Contributor):

@kio3i0j9024vkoenio Test using xmr-stak: fireice-uk/xmr-stak#1832
There are instructions there on how to do it using OpenCL.

zone117x pushed a commit to zone117x/node-multi-hashing that referenced this pull request Sep 22, 2018
@psychocrypt commented Sep 23, 2018 via email

@Bendr0id:

@SChernykh Is there a plan to create double, triple, ..., asm implementations for Ryzen and Intel? I can only find the double hash for Intel...

@SChernykh (Contributor):

@Bendr0id The problem with double hash on Ryzen is that two single-hash threads on the same core are significantly faster, so I don't see a point in tuning it for Ryzen. There are no Ryzen models with more than 2 MB of cache per core like some Xeons have, so what would be the use case for 2x, 3x, ... etc. modes on Ryzen?

@Bendr0id:

For example: less power consumption, lower temperature and lower CPU usage. IMHO it makes sense to have it.

@SChernykh (Contributor):

Last time I tried to tweak the Intel double hash version for Ryzen, I was only able to get a <1% improvement, which is why I didn't release it (and for the reasons described above). So a double hash version only makes sense if it has better hash/watt performance. I can try, but I can't guarantee that I can get anything better than the existing Intel double hash version.

@Bendr0id:

OK, maybe for Ryzen it's not that mission-critical, but x3, x4, x5 for Intel makes sense.

@plavirudar:

@SChernykh the 5x thread mode for Intel was initially released in XMR-Stak for the Crystalwell 4xxxR series (such as the i5-4570R), which has 128 MB of L4 cache. These CPUs can hash much better with a high scratchpad count since they have the additional cache.

@SChernykh (Contributor) commented Sep 23, 2018

is the test pool killallasics already updated to the new pow?

@psychocrypt It looks like they are now. And don't forget to integrate my latest asm code (submitted just a few minutes ago).

@Gingeropolous (Collaborator):

@SChernykh, @vtnerd, @psychocrypt - killallasics should be on the right fork and have the right pool code for further testing, optimizing, etc. Thanks everyone!!

@SChernykh (Contributor):

@vtnerd @moneromooo-monero

I've started doing extensive performance testing. While older Intel CPUs (Core i7 2600K, for example) generally show 90-95% of their original variant 1 performance - which is on par with GPUs - newer CPUs are doing much better.

I've tested AMD Ryzen 5 2600 and final performance numbers show negligible performance difference: 628.3 -> 627.8 H/s for variant 1 -> variant 2 change.

@xmrig tested Intel Core i7 7700 and his numbers showed even better performance: 302 -> 305 H/s. Yes, there is no mistake - variant 2 is even faster on the latest Intel processors.

@Bendr0id:

@SChernykh really good news! Do you have a chance to look into x3, x4, x5 for Intel?

@SChernykh (Contributor):

@Bendr0id I can look into x3 and x4, but x5 requires 10 MB of cache and I don't have such hardware at hand.

@Gingeropolous (Collaborator):

@SChernykh and others, where is the optimization happening? I'm seeing some impressive numbers on killallasics; just curious which software is getting worked on, and where.

@SChernykh (Contributor):

@Gingeropolous It's xmr-stak and xmrig as far as I know.
fireice-uk/xmr-stak#1851
xmrig/xmrig#753

@plavirudar commented Sep 26, 2018

@Bendr0id I can look into x3 and x4, but x5 requires 10 MB cache, I don't have such hardware at hand.

Don't you already have a Ryzen 2600? That should have 8 MB of cache in each of its two CCXs, for 16 MB of L3 total, so you can test up to 4x directly. You could also reduce the scratchpad size to 1 MB, as in the CN-lite PoW, if you wanted to test scratchpad counts larger than that.

@SChernykh (Contributor) commented Sep 26, 2018

@plavirudar I need an Intel processor (preferably the exact model) to optimize the x3, x4, x5 modes for Intel. I have an i7 2600K with 8 MB cache, so I can do the x3 and x4 modes. I also have a Ryzen, but I can only use it to make Ryzen-optimized code.

@plavirudar:

@plavirudar I need an Intel processor (preferably the exact model) to optimize the x3, x4, x5 modes for Intel. I have an i7 2600K with 8 MB cache, so I can do the x3 and x4 modes. I also have a Ryzen, but I can only use it to make Ryzen-optimized code.

I see, I must have misread your initial comment.

In any case, I think it would certainly be helpful to have up-to-4x scratchpad code first for both Intel and AMD. Since there is so much additional compute involved per scratchpad, at some point increasing cache won't help anyway because memory access is no longer the bottleneck, so 4x may even be the sweet spot in this case.

@Spudz76 commented Sep 26, 2018

RE: superthreading on purpose (with the current v1 PoW)
i3-4160 CPU @ 3.60GHz, 3 MB cache, +AES
It gets around 72 H/s running the "appropriate" single thread on one core (leaving 1 MB of cache open).
It gets 96 H/s running 5x threads on core0 and a 1x thread on core1 (4x overcommit for cache).
The same CPU is funneling Ethereum work to 6x GPUs, and that's all it does otherwise.

Explain how, if I'm abusing cache so hard, the hashrate goes up so much? The cores trade off at about 60/30 (so I lose 12 H/s off core0 to gain 30 H/s off core1), and the rate moves around between the two (cache migrations?). Running 1x and 1x does not do the same; it murders both cores and overall is about 80 H/s (+8 over single thread, single core).

Either that, or it chokes so hard that the telemetry just shows improvement while it doesn't actually improve. It shows peaks of 108 H/s and valleys of 90 H/s. It might also chew ten times the watts, but I have no tools, nor do I care to measure. The pool-side numbers seem to match, though, and both of those numbers are way better than 72.

Windows and Linux both show identical performance for the overcommit setup (the win7 box might be a 4130 / same but less clock).

I used to have a 10x patch which did similar speeds to the 5x but was even slightly smoother. I run the 5x now since it's the maximum available without retooling the old patch to match the reorganized CPU miner code in xmr-stak. Plus, this PoW update may destroy whatever side-bandwidth is giving me gains for abusing CPUs.

@vtnerd vtnerd deleted the cryptonote2_2 branch September 29, 2018 01:54
@kio3i0j9024vkoenio commented Oct 12, 2018

@kio3i0j9024vkoenio

does the current code have the asm optimized for both single and double threads?

I wouldn't call it optimized yet. I just did a quick round on the single and double hash versions to optimize them for Sandy Bridge/Ivy Bridge (asm_version=1). I didn't even touch the Ryzen versions (asm_version=2); they're not optimized at all yet. Your server is based on Nehalem, which is older than Sandy Bridge, so I'm even surprised it got such a small performance hit. This code wasn't supposed to be efficient on Nehalem. I'm sure that if I optimized this code for Nehalem, the performance hit would be smaller than the additional 4.7%.

Actually, the Xeon E7-8837 is Westmere-EX, not Nehalem, and was released April 3, 2011, whereas Sandy Bridge was released January 9, 2011.

http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E7-8837.html

http://www.cpu-world.com/Cores/Sandy%20Bridge.html

Some of the improvements inside Westmere are:

32 nm die shrink of Nehalem

A new set of instructions that gives over 3x the encryption and decryption rate of Advanced Encryption Standard (AES) processes compared to before.

Delivers seven new instructions (AES instruction set or AES-NI) that can be used by the AES algorithm. Also an instruction called PCLMULQDQ (see CLMUL instruction set) that will perform carry-less multiplication for use in cryptography and data compression.

https://en.wikipedia.org/wiki/Westmere_(microarchitecture)

@SChernykh (Contributor):

@kio3i0j9024vkoenio Westmere is Nehalem, even Wikipedia link says it. Die shrink = exactly the same execution speed of the same code, just physically smaller CPU core.

@kio3i0j9024vkoenio:

@kio3i0j9024vkoenio Westmere is Nehalem, even Wikipedia link says it. Die shrink = exactly the same execution speed of the same code, just physically smaller CPU core.

Wikipedia says: Westmere (formerly Nehalem-C)

Your server is based on Nehalem, which is older than Sandy Bridge, so I'm even surprised it got such a small performance hit. This code wasn't supposed to be efficient on Nehalem.

So who really knows what the "C" means, but since it performs much better than you expected, something must have improved.

@SChernykh (Contributor):

@kio3i0j9024vkoenio Yes, actually the main difference is AES support. Other differences are not relevant here and don't change Cryptonight performance per core.

shopglobal added a commit to cryptonote-labs/electronero that referenced this pull request Dec 17, 2018
VIP21 pushed a commit to VIP21/xmrig that referenced this pull request Jan 21, 2019
Reference code: monero-project/monero#4404

I tested it on x86 with av=1-10 and on ARM with av=1-4, self test passed.