
[RISC-V] suboptimal loop transformation with some for loops #146241

Open
@tom-rein

Description

The following code illustrates the problem (godbolt):

#include <cstddef>
#include <span>

using std::size_t;

void add1(int *x, int *end) {
    for (; x != end; ++x) {
        ++*x;
    }
}

void add1_i(int *x, size_t n) {
    for (size_t i = 0; i != n; ++i) {
        ++x[i];
    }
}

void add1_n(int *x, size_t n) {
    add1(x, x + n);
}

void add1_2(int *x, int *end) {
    size_t n = end - x;
    for (size_t i = 0; i != n; ++i) {
        ++x[i];
    }
}

void add1_3(int *x, int *end) {
    add1_i(x, end - x);
}

void add1_s(std::span<int> r) {
    for (int &x : r) {
        ++x;
    }
}

add1 and add1_i generate optimal code, with a single induction variable and a bne exit test.

add1(int*, int*):
        beq     a0, a1, .LBB0_2
.LBB0_1:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB0_1
.LBB0_2:
        ret

add1_i(int*, unsigned long):
        beqz    a1, .LBB1_3
        slli    a1, a1, 2
        add     a1, a1, a0
.LBB1_2:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB1_2
.LBB1_3:
        ret

I would expect add1_n to be identical to add1_i, but instead this is generated:

add1_n(int*, unsigned long):
        beqz    a1, .LBB2_3
        slli    a1, a1, 2
.LBB2_2:
        lw      a2, 0(a0)
        addi    a1, a1, -4
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bnez    a1, .LBB2_2
.LBB2_3:
        ret

That is one extra add per iteration and a completely useless left shift (useless because a1 is only used as a loop counter, never for indexing).
add1_2 generates code similar to add1_n, with a subtract (and a useless right shift) to compute the length, even though it should be the same as add1.
add1_3 is almost correct: the loop body is right, but it subtracts a0 and then adds it back, which should have been removed (a sketch of the expected output follows the listing below):

add1_3(int*, int*):
        beq     a1, a0, .LBB4_3
        sub     a1, a1, a0
        srai    a1, a1, 2
        slli    a1, a1, 2
        add     a1, a1, a0
.LBB4_2:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB4_2
.LBB4_3:
        ret
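
For comparison, if that round trip through the element count were folded away, add1_3 would reduce to the same loop as add1. This is roughly the output I would expect (a hand-written sketch, not actual compiler output):

add1_3(int*, int*):                     # expected: same as add1
        beq     a0, a1, .LBB4_3         # n == 0, i.e. x == end
.LBB4_2:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB4_2
.LBB4_3:
        ret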

add1_s compiles to the same code as add1_n, even though the source does almost exactly what the generated add1_i does.
The redundant add and subtract seem to happen only on RISC-V.
The variation in the loop body also happens on aarch64 (when I disable vectorization), but I don't know whether there is a performance difference.
The same useless shifts also show up on aarch64 whenever the loop counter is not used for indexing.
