
[RISC-V] Use getUnrollThreshold() to get unroll limits #98798

Merged: 3 commits into dotnet:main, Mar 6, 2024

Conversation

@MinxuanZ (Contributor)

@ryujit-bot

Diff results for #98798

Throughput diffs for windows/arm64 ran on windows/x64:

MinOpts (-0.00% to +0.01%)
  libraries.pmi.windows.arm64.checked.mch: +0.01%


@EgorBo (Member) left a comment

LGTM. Btw, I was wondering about this: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffer.cs#L5

It enables block store/copy on LoongArch but doesn't on RISC-V; is that expected?

@EgorBo (Member) commented Feb 22, 2024

Or in other words, what is the RISC-V codegen for:

void Foo(ref byte dst, ref byte src)
{
    Unsafe.CopyBlockUnaligned(ref dst, ref src, 64);
}

?

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label Feb 23, 2024
@ghost commented Feb 23, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details: #83274 (Author: monstercatss; Labels: area-CodeGen-coreclr; Milestone: none)

@ryujit-bot

Diff results for #98798

Throughput diffs for osx/arm64 ran on linux/x64:

MinOpts (-0.00% to +0.01%)
  libraries.pmi.osx.arm64.checked.mch: +0.01%

@MinxuanZ (Contributor, Author)

Hi @clamp03, could you please review and help answer? I don't have a RISC-V machine yet. It may call memcpy.

@clamp03 added the arch-riscv label Feb 23, 2024
@clamp03 (Member) commented Feb 23, 2024

@monstercatss Thank you so much!
@bartlomiejko Could you (or someone in SRPOL) review and help @monstercatss with testing? Also, I think we missed some LoongArch PRs (like #62888, which is related to @EgorBo's comment); please check. CC @sirntar @tomeksowi

@ryujit-bot

Diff results for #98798

Throughput diffs for osx/arm64 ran on windows/x64:

MinOpts (-0.00% to +0.01%)
  libraries.pmi.osx.arm64.checked.mch: +0.01%

Throughput diffs for windows/arm64 ran on windows/x64:

MinOpts (-0.00% to +0.01%)
  libraries.pmi.windows.arm64.checked.mch: +0.01%

@tomeksowi (Contributor)

> LGTM. Btw, I was wondering about this: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffer.cs#L5
> It enables block store/copy on LoongArch but doesn't on RISC-V; is that expected?

Yes, the ISA we're currently targeting is RV64GC, no "V" (vector ops) extension.

@EgorBo (Member) commented Feb 23, 2024

> > LGTM. Btw, I was wondering about this: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffer.cs#L5
> > It enables block store/copy on LoongArch but doesn't on RISC-V; is that expected?
>
> Yes, the ISA we're currently targeting is RV64GC, no "V" (vector ops) extension.

SIMD is not necessary here; HAS_CUSTOM_BLOCKS implies that e.g.

void Foo(ref byte dst, ref byte src)
{
    Unsafe.CopyBlockUnaligned(ref dst, ref src, 64);
}

is inlined in codegen as either a set of SIMD or GPR stores (if the target supports such blocks) or a MEMSET call (if it doesn't). According to the

+ // | riscv64 | 64 | 32 | no SIMD support

change in this PR, it should support them.
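
[Editor's note: to make that concrete, here is a rough sketch (mine, not actual JIT output; the method name is hypothetical) of what "a set of GPR stores" means for the 64-byte copy above. The JIT would emit the loop body fully unrolled, as straight-line 8-byte loads and stores with no loop or helper call.]

using System.Runtime.CompilerServices;

// Illustration only: conceptually what unrolled GPR block-copy codegen does.
static void CopyBlock64Unrolled(ref byte dst, ref byte src)
{
    for (nuint offset = 0; offset < 64; offset += 8)
    {
        // 8-byte GPR load from src + offset, then store to dst + offset
        ulong v = Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref src, offset));
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, offset), v);
    }
}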

@tomeksowi (Contributor)

> SIMD is not necessary here; HAS_CUSTOM_BLOCKS implies that e.g.
>
>     void Foo(ref byte dst, ref byte src)
>     {
>         Unsafe.CopyBlockUnaligned(ref dst, ref src, 64);
>     }
>
> is inlined in codegen as either a set of SIMD or GPR stores (if the target supports such blocks) or a MEMSET call (if it doesn't). According to the
>
>     + // | riscv64 | 64 | 32 | no SIMD support
>
> change in this PR, it should support them.

In that case, I think you're right. The only caveat I can think of is the one discussed here. Currently Unsafe.CopyBlockUnaligned generates misaligned GPR loads/stores; in theory, the ISA we're compiling against permits not supporting them (that's RV32, and I don't see RV64 overriding it). In practice, I don't know of a general-purpose 64-bit RISC-V implementation that doesn't support misaligned loads/stores for non-atomic instructions; the RVA20U64 profile even mandates it.

We're currently focusing on bug fixes and missing features; codegen optimizations will come later. We'll have to think about what to do here; maybe .NET should simply require misaligned-access support and compile with -mno-strict-align. @brucehoult, any thoughts?

@EgorBo (Member) commented Feb 26, 2024

Totally makes sense!

> Currently Unsafe.CopyBlockUnaligned generates misaligned GPR loads/stores; […]

I presume it can emit a set of byte stores instead (for that, you will need to decrease the threshold in getUnrollThreshold to avoid emitting 64 byte stores for 64-byte blocks 🙂), but that can be done separately. Also, I presume it will make sense to revise the base class library to find places where unaligned loads are assumed to be cheap.
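
[Editor's note: a minimal sketch of the byte-store alternative mentioned here (my illustration, not the actual JIT lowering; the method name is hypothetical). Every access is single-byte and therefore always aligned, but a 64-byte block then costs 64 loads and 64 stores, which is why the unroll threshold would need to shrink before this beats a memcpy call.]

using System.Runtime.CompilerServices;

// Illustration only: a strict-alignment-safe expansion using byte stores.
static void CopyBlock64ByteWise(ref byte dst, ref byte src)
{
    for (nuint i = 0; i < 64; i++)
        Unsafe.Add(ref dst, i) = Unsafe.Add(ref src, i); // one byte at a time
}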

@brucehoult commented Feb 26, 2024

> SIMD is not necessary here; HAS_CUSTOM_BLOCKS implies that e.g.
>
>     void Foo(ref byte dst, ref byte src)
>     {
>         Unsafe.CopyBlockUnaligned(ref dst, ref src, 64);
>     }

If it's RV64GC, then there are really only two sensible things to do for larger blocks:

  1. an inline byte-by-byte loop, or

  2. call memcpy()

> In that case, I think you're right. The only caveat I can think of is the one discussed here. Currently Unsafe.CopyBlockUnaligned generates misaligned GPR loads/stores; in theory, the ISA we're compiling against permits not supporting them (that's RV32, and I don't see RV64 overriding it). In practice, I don't know of a general-purpose 64-bit RISC-V implementation that doesn't support misaligned loads/stores for non-atomic instructions; the RVA20U64 profile even mandates it.

The profile mandates that non-aligned loads/stores don't crash user-mode programs. Implementations are free to trap to M mode and have SBI emulate them. The necessary emulation code is always present, just in case, so, for example, accesses that cross a cache line or VM page boundary might trap even if others don't.

It seems the VisionFive 2 does support misaligned access, but not very quickly.

#include <stdio.h>

int main(){
    char *s = "The quick brown fox";
    unsigned long t = 0;
    /* one billion misaligned 8-byte loads from s+1; volatile keeps the
       load from being hoisted out of the loop */
    for (int i=0; i<1000000000; ++i){
      t += *(volatile unsigned long*)(s+1);
    }
    printf("%ld\n", t);
    return 0;
}

With -O1 this takes 11.4s vs 2.0s if the amount added to s is a multiple of 8.

(I checked the codegen and it's the expected "ld;add;addiw -1;bnez")

On the Lichee Pi 4A (THead C910) it's 4.65s vs 1.1s

On my x86 box (2990wx) it's 0.27s either way.

As a comparison, using the following takes 3.36s on VF2, so it's a bit faster than doing misaligned in hardware:

t += ((*(volatile unsigned long*)(s)) >> 8) | ((*(volatile unsigned long*)(s+8)) << 56);

On the Lichee Pi it takes 1.65s, almost three times faster than the misaligned access. On the AMD it takes 0.51s, almost twice as slow as the misaligned access.

So ... avoid misaligned access if possible, but it's not a complete disaster if it happens sometimes, on the cores we have now, and future cores.

If it's in a loop, then it's better to call the system memcpy(), I think. That will also automatically pick up a vectorised version on future CPU cores.
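
[Editor's note: for reference, the shifted-load trick benchmarked above can be written as a small C# helper. This is a sketch under two assumptions: little-endian byte order and an 8-byte-aligned p; the method name is mine.]

// Reconstruct the misaligned 8 bytes at p+1 from the two aligned words
// containing them; equivalent to *(ulong*)(p + 1) on little-endian targets.
static unsafe ulong LoadAtPlus1ViaAlignedLoads(byte* p) // p is 8-byte aligned
{
    ulong lo = *(ulong*)p;         // aligned load of bytes [0..7]
    ulong hi = *(ulong*)(p + 8);   // aligned load of bytes [8..15]
    return (lo >> 8) | (hi << 56); // bytes [1..8]
}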

@MinxuanZ (Contributor, Author)

> Totally makes sense!
>
> > Currently Unsafe.CopyBlockUnaligned generates misaligned GPR loads/stores; […]
>
> I presume it can emit a set of byte stores instead (for that, you will need to decrease the threshold in getUnrollThreshold to avoid emitting 64 byte stores for 64-byte blocks 🙂), but that can be done separately. Also, I presume it will make sense to revise the base class library to find places where unaligned loads are assumed to be cheap.

Btw, is the threshold based purely on performance? Is it faster to use a memset call when the size is larger than 512 on x86 with AVX-512?

@EgorBo (Member) commented Feb 26, 2024

> Btw, is the threshold based purely on performance? Is it faster to use a memset call when the size is larger than 512 on x86 with AVX-512?

Native compilers prefer to unroll it too: https://godbolt.org/z/vTYMP5dYc (without alignment hints, so I presume even 8 unaligned stores are better). Also, previously, a memset call was non-interruptible, so the GC had to wait for it to finish because of cooperative mode.

@jakobbotsch (Member)

Can you please fix the formatting errors and conflicts?

@MinxuanZ (Contributor, Author) commented Mar 6, 2024

> Can you please fix the formatting errors and conflicts?

Done.

@jakobbotsch merged commit 3ddaeae into dotnet:main Mar 6, 2024
129 checks passed
@MinxuanZ deleted the riscv-unrolllimits branch March 6, 2024 10:22