Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AArch64] Fold and and cmp into tst #102703

Closed
Kmeakin opened this issue Aug 9, 2024 · 4 comments · Fixed by #110347
Closed

[AArch64] Fold and and cmp into tst #102703

Kmeakin opened this issue Aug 9, 2024 · 4 comments · Fixed by #110347

Comments

@Kmeakin
Copy link
Contributor

Kmeakin commented Aug 9, 2024

https://godbolt.org/z/Mh7q8TY99

InstCombine is able to fold (x & 0xFF) < C) into (x & -C) == 0

eg:

bool ult32_u32(u32 x) { return (x & 0xFF) < 32; }

produces

ult32_u32:
        tst     w0, #0xe0
        cset    w0, eq
        ret

But the same transform is not done in later stages, so if the and is introduced due to eg passing a u8 in a u32 register, the fold is not performed:

bool ult32_u8(u8 x) { return x < 32; }
ult32_u8:
        and     w8, w0, #0xff
        cmp     w8, #32
        cset    w0, lo
        ret
@llvmbot
Copy link
Collaborator

llvmbot commented Aug 9, 2024

@llvm/issue-subscribers-backend-aarch64

Author: Karl Meakin (Kmeakin)

https://godbolt.org/z/Mh7q8TY99

InstCombine is able to fold (x &amp; 0xFF) &lt; C) into (x &amp; -C) != 0

eg:

bool ult32_u32(u32 x) { return (x &amp; 0xFF) &lt; 32; }

produces

ult32_u32:
        tst     w0, #<!-- -->0xe0
        cset    w0, eq
        ret

But the same transform is not done in later stages, so if the and is introduced due to eg passing a u8 in a u32 register, the fold is not performed:

bool ult32_u8(u8 x) { return x &lt; 32; }
ult32_u8:
        and     w8, w0, #<!-- -->0xff
        cmp     w8, #<!-- -->32
        cset    w0, lo
        ret

@pvimal816
Copy link

pvimal816 commented Aug 12, 2024

Hi @Kmeakin
Orthogonal to the root issue, shouldn't (x & -C) != 0 be (x & -C) == 0 instead?

Because, as per my understanding, tst a, b will set Z flag only if a&b is zero, and unless Z flag is set cset a, eq will not set a.

@jf-botto
Copy link
Contributor

I'd like to work on this issue, if possible? Please assign me?

@Kmeakin
Copy link
Contributor Author

Kmeakin commented Sep 20, 2024

Hi @Kmeakin Orthogonal to the root issue, shouldn't (x & -C) != 0 be (x & -C) == 0 instead?

Because, as per my understanding, tst a, b will set Z flag only if a&b is zero, and unless Z flag is set cset a, eq will not set a.

Yes, that was a typo, thanks

davemgreen pushed a commit that referenced this issue Oct 3, 2024
Fixes #102703.

https://godbolt.org/z/nfj8xsb1Y

The following pattern:

```
%2 = and i32 %0, 254
%3 = icmp eq i32 %2, 0
```
is optimised by instcombine into:

```%3 = icmp ult i32 %0, 2```

However, post instcombine leads to worse aarch64 than the unoptimised version.

Pre instcombine:
```
        tst     w0, #0xfe
        cset    w0, eq
        ret
```
Post instcombine:
```
        and     w8, w0, #0xff
        cmp     w8, #2
        cset    w0, lo
        ret
```


In the unoptimised version, SelectionDAG converts `SETCC (AND X 254) 0 EQ` into `CSEL 0 1 1 (ANDS X 254)`, which gets emitted as a `tst`.

In the optimised version, SelectionDAG converts `SETCC (AND X 255) 2 ULT` into `CSEL 0 1 2 (SUBS (AND X 255) 2)`, which gets emitted as an `and`/`cmp`.

This PR adds an optimisation to `AArch64ISelLowering`, converting `SETCC (AND X Y) Z ULT` into `SETCC (AND X (Y & ~(Z - 1))) 0 EQ` when `Z` is a power of two. This makes SelectionDAG/Codegen produce the same optimised code for both examples.
xgupta pushed a commit to xgupta/llvm-project that referenced this issue Oct 4, 2024
Fixes llvm#102703.

https://godbolt.org/z/nfj8xsb1Y

The following pattern:

```
%2 = and i32 %0, 254
%3 = icmp eq i32 %2, 0
```
is optimised by instcombine into:

```%3 = icmp ult i32 %0, 2```

However, post instcombine leads to worse aarch64 than the unoptimised version.

Pre instcombine:
```
        tst     w0, #0xfe
        cset    w0, eq
        ret
```
Post instcombine:
```
        and     w8, w0, #0xff
        cmp     w8, llvm#2
        cset    w0, lo
        ret
```


In the unoptimised version, SelectionDAG converts `SETCC (AND X 254) 0 EQ` into `CSEL 0 1 1 (ANDS X 254)`, which gets emitted as a `tst`.

In the optimised version, SelectionDAG converts `SETCC (AND X 255) 2 ULT` into `CSEL 0 1 2 (SUBS (AND X 255) 2)`, which gets emitted as an `and`/`cmp`.

This PR adds an optimisation to `AArch64ISelLowering`, converting `SETCC (AND X Y) Z ULT` into `SETCC (AND X (Y & ~(Z - 1))) 0 EQ` when `Z` is a power of two. This makes SelectionDAG/Codegen produce the same optimised code for both examples.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants