Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[x86 & HVX & WASM] Use bounds inference for saturating_narrow instruction selection #7805

Merged
merged 17 commits into from
Apr 30, 2024

Conversation

rootjalex
Copy link
Member

@rootjalex rootjalex commented Aug 24, 2023

There are cases where bounds inference allows us to safely reinterpret unsigned values as signed values, in order to target x86 saturating narrow instructions. I give an example below. I don't add tests to simd_op_check_x86 because we can't test for removing min instructions via the existing mechanism.

Example:

Var x("x");
ImageParam in0(UInt(8), 1);
Func out("out");
out(x) = u8_sat(u16(in0(x)) * 5);
out.vectorize(x, 32);
Target x86("x86-64-linux-avx-avx2-fma-sse41");
out.compile_to_assembly("x86-test0.asm", out.infer_arguments(), x86);

Before:

vpmovzxbw	(%rax,%rcx), %ymm2      # ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
vpmovzxbw	16(%rax,%rcx), %ymm3    # ymm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
vpmullw	%ymm0, %ymm3, %ymm3
vpmullw	%ymm0, %ymm2, %ymm2
vpminuw	%ymm1, %ymm2, %ymm2
vpminuw	%ymm1, %ymm3, %ymm3
vpackuswb	%ymm3, %ymm2, %ymm2
vpermq	$216, %ymm2, %ymm2              # ymm2 = ymm2[0,2,1,3]
vmovdqu	%ymm2, (%r13,%rcx)
addq	$32, %rcx
cmpq	%rcx, %rdx
jne	.LBB219_44

After (removed 2 vpminuw instructions):

vpmovzxbw	(%rax,%rcx), %ymm1      # ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
vpmovzxbw	16(%rax,%rcx), %ymm2    # ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
vpmullw	%ymm0, %ymm2, %ymm2
vpmullw	%ymm0, %ymm1, %ymm1
vpackuswb	%ymm2, %ymm1, %ymm1
vpermq	$216, %ymm1, %ymm1              # ymm1 = ymm1[0,2,1,3]
vmovdqu	%ymm1, (%r13,%rcx)
addq	$32, %rcx
cmpq	%rcx, %rdx
jne	.LBB219_44

@steven-johnson
Copy link
Contributor

I don't add tests to simd_op_check_x86

Understood, but is there any other reasonable way to add checks for this? This sort of stuff is often pretty fragile.

src/Type.cpp Outdated Show resolved Hide resolved
@steven-johnson steven-johnson self-requested a review August 24, 2023 22:12
Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I feel like if we don't have some way to test this, we'll eventually find it broken :-/

@rootjalex
Copy link
Member Author

I don't add tests to simd_op_check_x86

Understood, but is there any other reasonable way to add checks for this? This sort of stuff is often pretty fragile.

I thought about adding a "check_not" to simd_op_check, because that's the only way I could think to test this sort of optimization, but that seems like heavy functionality to use for just this case (I'm also not sure we can verify that this optimization runs by saying "don't give me a vpmin", because other parts of the generated code could contain a vpmin and we don't want false failures). I'm open to suggestions (especially because I intend to make a similar PR for HVX).

@rootjalex rootjalex changed the title [x86] Use bounds inference for saturating_narrow instruction selection [x86 & HVX] Use bounds inference for saturating_narrow instruction selection Aug 25, 2023
@rootjalex
Copy link
Member Author

We can do the same thing for HVX saturating narrow instructions, which I was going to do in a separate PR, but it requires the same changes to Bounds.cpp and Type.h that that the x86 version does, so I am making this one PR.

@rootjalex
Copy link
Member Author

@steven-johnson Luckily, the improvement to HVX can be tested (unlike x86). Unluckily, this revealed that constant bounds inference does not work on u32 like I expected (which is part of why @abadams opened #7807)

src/CodeGen_X86.cpp Outdated Show resolved Hide resolved
src/HexagonOptimize.cpp Outdated Show resolved Hide resolved
src/HexagonOptimize.cpp Outdated Show resolved Hide resolved
@steven-johnson
Copy link
Contributor

Is this PR still an active one? It seems pretty stale.

@steven-johnson
Copy link
Contributor

What's the story, needs more work?

@rootjalex
Copy link
Member Author

This has gotten stale, but I am trying to find time to update it. Andrew and I discussed some issues with using bounds inference in this way, but I think they are fixed by using the new infrastructure in #8179. I will aim to do this in the next two weeks, I've finally started to catch up on my insane backlog.

@abadams
Copy link
Member

abadams commented Apr 28, 2024

I think it should be fairly straight-forward to adapt it to use ConstantInterval, now that the ConstantInterval refactor is in. You could do it now and there wouldn't be much of a merge conflict, or you could wait for #8155

Here's how I use ConstantInterval to do something similar: https://github.com/halide/Halide/pull/8155/files#diff-6cd3cc01186f9ce4abaa1a98c9201bd440983ecc5620fbccae383df1116bdce6R704

All your uses are about something being safe to reinterpret as signed/unsigned, so I think you want a few instances of const bool safe_to_reinterpret = some_expr.type().with_code(...).can_represent(constant_integer_bounds(some_expr))

@rootjalex
Copy link
Member Author

Yeah, realized it wasn't a ton of work, and was bored during an ASPLOS workshop. This is good for review now! Also, the u32 HVX tests now work, because the new ConstantInterval stuff is stronger than find_constant_bounds on u32. Nice side effect.

src/CodeGen_X86.cpp Outdated Show resolved Hide resolved
src/HexagonOptimize.cpp Outdated Show resolved Hide resolved
src/HexagonOptimize.cpp Outdated Show resolved Hide resolved
@rootjalex rootjalex changed the title [x86 & HVX] Use bounds inference for saturating_narrow instruction selection [x86 & HVX & WASM] Use bounds inference for saturating_narrow instruction selection Apr 29, 2024
@rootjalex
Copy link
Member Author

@abadams Sorry to add a change after your approval, but I just saw a wasm talk and was reminded that we can do this for wasm too. It's basically a copy-paste from x86, but would be nice to have eyes just to be safe.
@steven-johnson just want to ping you about the wasm change, if you're curious to look

@abadams
Copy link
Member

abadams commented Apr 29, 2024

The big TODO for wasm now is exploiting constant_integer_bounds to hit the relaxed simd instructions in case where we know they don't have UB

@rootjalex
Copy link
Member Author

The big TODO for wasm now is exploiting constant_integer_bounds to hit the relaxed simd instructions in case where we know they don't have UB

Yep, I think #7312 is for tracking that. This is still a nice side benefit of constant bounds for wasm though

@rootjalex
Copy link
Member Author

I don't think relaxed SIMD has actually shipped in wasm yet, so that's still waiting on the wasm folks

@steven-johnson
Copy link
Contributor

I don't think relaxed SIMD has actually shipped in wasm yet, so that's still waiting on the wasm folks

https://webassembly.org/features/

TL;DR: shipped in Chrome, behind flags elsewhere

@rootjalex rootjalex merged commit 8141197 into main Apr 30, 2024
19 checks passed
@rootjalex rootjalex deleted the rootjalex/x86-sat branch April 30, 2024 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants