Skip to content

Conversation

usamoi
Copy link
Contributor

@usamoi usamoi commented Sep 20, 2025

I'm surprised to find that intrinsic-test is testing core::arch, not core_arch::arch. This caused the failure that should have been detected in #1919 to go unnoticed until rust-lang/rust#146709.

I guess the compiler's previous behavior was most likely correct, so we probably need to revert #1919. However, I really can't find any specification that describes the behavior of ARM NEON in big-endian mode.

cc @sayantn @adamgemmell @folkertdev @Jamesbarford

@rustbot
Copy link
Collaborator

rustbot commented Sep 20, 2025

r? @sayantn

rustbot has assigned @sayantn.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@usamoi usamoi force-pushed the intrinsic-test-fix branch 2 times, most recently from e5d81ff to a48dbcc Compare September 20, 2025 10:29
@usamoi usamoi marked this pull request as draft September 20, 2025 11:28
@folkertdev
Copy link
Contributor

The problem analysis is correct here, I think: that test should be using the local build of stdarch.

But we did all agree that #1919 makes a correct change according to the documentation. I'll see if I can actually compare the generated assembly for clang and rustc for some of the failing functions.

@usamoi
Copy link
Contributor Author

usamoi commented Sep 20, 2025

I'll see if I can actually compare the generated assembly for clang and rustc for some of the failing functions.

I checked a simplified version of generated assembly of run_vreinterpret_u16_u8. This code

__attribute__((noinline)) void run_vreinterpret_u16_u8(uint8_t *a_vals,
                                                      uint16_t *b_vals) {
  vst1_u16(b_vals, vreinterpret_u16_u8(vld1_u8(a_vals)));
}

compiles to

0000000000400b8c <_Z23run_vreinterpret_u16_u8PhPt>:
  400b8c:	0c40701f 	ld1	{v31.8b}, [x0]
  400b90:	0c00743f 	st1	{v31.4h}, [x1]
  400b94:	d65f03c0 	ret

with gcc

and

00000000000106a4 <run_vreinterpret_u16_u8>:
   106a4:	0c407000 	ld1	{v0.8b}, [x0]
   106a8:	0e201800 	rev16	v0.8b, v0.8b
   106ac:	0c007020 	st1	{v0.8b}, [x1]
   106b0:	d65f03c0 	ret

with clang.

According to https://developer.arm.com/documentation/102159/0400/Load-and-store---data-structures,

Element size also affects endianness handling. In general, if you specify the correct element size to the load and store instructions, bytes are read from memory in the appropriate order. This means that the same code works on little-endian systems and big-endian systems.

Reversing the byte order is reasonable. Moreover, GCC indeed did not generate any instructions for vreinterpret.

@folkertdev
Copy link
Contributor

Right, I think we just missed some? e.g. vreinterpretq_u8_s32 fails in CI, and it still has its custom shuffling implementation while it should really just be a transmute based on what clang does https://godbolt.org/z/r8eY41vsb.

@folkertdev
Copy link
Contributor

folkertdev commented Sep 20, 2025

Ah, no, I was using an old branch, and got confused. I actually think it might be an ABI difference that bites us here, consider:

use core::arch::aarch64::{int32x4_t, uint8x16_t, vreinterpretq_u8_s32};

#[no_mangle]
#[target_feature(enable = "neon")]
pub unsafe extern "C" fn c(x: int32x4_t) -> uint8x16_t {
    vreinterpretq_u8_s32(x)
}

#[no_mangle]
#[target_feature(enable = "neon")]
pub unsafe fn rust(x: int32x4_t) -> uint8x16_t {
    vreinterpretq_u8_s32(x)
}
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define <16 x i8> @c(<4 x i32> %x) unnamed_addr #0 !dbg !6 {
start:
    #dbg_value(<4 x i32> %x, !32, !DIExpression(), !33)
    #dbg_value(<4 x i32> %x, !34, !DIExpression(), !39)
  %_0 = bitcast <4 x i32> %x to <16 x i8>, !dbg !41
  ret <16 x i8> %_0, !dbg !42
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: readwrite)
define void @rust(ptr dead_on_unwind noalias noundef writable writeonly sret([16 x i8]) align 16 captures(none) dereferenceable(16) initializes((0, 16)) %_0, ptr dead_on_return noalias noundef readonly align 16 captures(none) dereferenceable(16) %x) unnamed_addr #1 !dbg !43 {
start:
    #dbg_declare(ptr %x, !45, !DIExpression(), !46)
    #dbg_declare(ptr %x, !47, !DIExpression(), !50)
  %0 = load <4 x i32>, ptr %x, align 16, !dbg !52
  store <4 x i32> %0, ptr %_0, align 16, !dbg !52
  ret void, !dbg !53
}

with the rust ABI a load and store are performed, while with extern "C" the function just becomes a transmute.

@sayantn
Copy link
Contributor

sayantn commented Sep 20, 2025

As @usamoi's test with gcc and clang shows, they have different behavior for this intrinsic. The ARM spec explicitly says that it should generate NOP. But we see that only the GCC one adheres to this. I think clang has a bug here, not us.

@usamoi
Copy link
Contributor Author

usamoi commented Sep 20, 2025

As @usamoi's test with gcc and clang shows, they have different behavior for this intrinsic. The ARM spec explicitly says that it should generate NOP. But we see that only the GCC one adheres to this. I think clang has a bug here, not us.

It should be emphasized that although the assembly generated by GCC does not contain vrev, it still reverses the byte order, on QEMU.

@usamoi
Copy link
Contributor Author

usamoi commented Sep 20, 2025

I guess ld1 and st1 reverse the byte order on big endian based on the element size. clang loads the vector of bytes, instead of the vector of halfwords, so it generates a vrev to emulate it.

Edit: see also https://llvm.org/docs/BigEndianNEON.html

@sayantn
Copy link
Contributor

sayantn commented Sep 21, 2025

But in C, vreinterpretq and a simple transmute (*(T*)&) generates the same code, the revs we see are merely because of the vld/vsts, not the vreinterpret. https://godbolt.org/z/ax4hK587d

edit: the c++ code, using reinterpret_cast, also gives the same output https://godbolt.org/z/Gh5aGvhvh

edit: I believe the error is in the generated rust files, we print the output of rust using transmute and array indexing, whereas C uses vget_lane

edit: it seems that vgetq_lane bswaps the elements https://godbolt.org/z/8rzTj3WM9

@usamoi
Copy link
Contributor Author

usamoi commented Sep 21, 2025

#[unsafe(no_mangle)]
extern "C" fn run_vreinterpret_u16_u8(a_vals: *mut u8, b_vals: *mut u16) {
    use core::arch::aarch64::*;
    unsafe {
        vst1_u16(b_vals, vreinterpret_u16_u8(vld1_u8(a_vals)));
    }
}
0000000000028ce8 <run_vreinterpret_u16_u8>:
   28ce8: f9400008     	ldr	x8, [x0]
   28cec: f9000028     	str	x8, [x1]
   28cf0: d65f03c0     	ret

It's asm generated by 1.92.0-nightly (dd7fda570 2025-09-20). We can see that it generated ldr and str instead of ld1 and st1. This is incorrect.

@sayantn
Copy link
Contributor

sayantn commented Sep 21, 2025

Ok, so the problem seems to be in vld1 and vst1. They simply transmute and store, where they should bswap first. But the test harness would also need to be changed, because we use transmute to generate the Rust output, whereas we should use vst1.

edit: Just realized, if that is the case, why has been our test suite passing at all??

@folkertdev
Copy link
Contributor

it is probably cleanest to revert #1919, merge this PR, and then properly fix the intrinsics? unless the fix is simple?

@sayantn
Copy link
Contributor

sayantn commented Sep 22, 2025

it is probably cleanest to revert #1919, merge this PR, and then properly fix the intrinsics? unless the fix is simple?

Atp that seems to be the best route, this would probably need more discussions. I am opening a pr to revert #1919, and reopened #1904

@sayantn
Copy link
Contributor

sayantn commented Sep 22, 2025

@usamoi could you add the revert of #1919 here? The PR requires this test fix

@usamoi usamoi marked this pull request as ready for review September 23, 2025 02:05
@usamoi
Copy link
Contributor Author

usamoi commented Sep 23, 2025

@usamoi could you add the revert of #1919 here? The PR requires this test fix

Done.

@sayantn sayantn added this pull request to the merge queue Sep 23, 2025
Merged via the queue into rust-lang:master with commit 999b963 Sep 23, 2025
63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants