
[RISC-V] suboptimal loop transformation with some for loops #146241

Open
@tom-rein

Description

The following code illustrates the problem (godbolt):

#include <cstddef>
#include <span>

using std::size_t;

void add1(int *x, int *end) {
    for (; x != end; ++x) {
        ++*x;
    }
}

void add1_i(int *x, size_t n) {
    for (size_t i = 0; i != n; ++i) {
        ++x[i];
    }
}

void add1_n(int *x, size_t n) {
    add1(x, x + n);
}

void add1_2(int *x, int *end) {
    size_t n = end - x;
    for (size_t i = 0; i != n; ++i) {
        ++x[i];
    }
}

void add1_3(int *x, int *end) {
    add1_i(x, end - x);
}

void add1_s(std::span<int> r) {
    for (int &x : r) {
        ++x;
    }
}

add1 and add1_i generate optimal code, with a single induction variable and a bne exit test.

add1(int*, int*):
        beq     a0, a1, .LBB0_2
.LBB0_1:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB0_1
.LBB0_2:
        ret

add1_i(int*, unsigned long):
        beqz    a1, .LBB1_3
        slli    a1, a1, 2
        add     a1, a1, a0
.LBB1_2:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB1_2
.LBB1_3:
        ret

I would expect add1_n to be identical to add1_i, but instead this is generated:

add1_n(int*, unsigned long):
        beqz    a1, .LBB2_3
        slli    a1, a1, 2
.LBB2_2:
        lw      a2, 0(a0)
        addi    a1, a1, -4
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bnez    a1, .LBB2_2
.LBB2_3:
        ret

That is one extra add per iteration and a completely useless left shift (useless because a1 is only used as a loop counter, never for indexing).
add1_2 generates code similar to add1_n, with a subtract (and a useless right shift) to compute the length, even though it should be the same as add1.
add1_3 is almost correct: the loop body is right, but it subtracts a0 and then adds it back, which should have been removed (a sketch of the expected output follows the listing below):

add1_3(int*, int*):
        beq     a1, a0, .LBB4_3
        sub     a1, a1, a0
        srai    a1, a1, 2
        slli    a1, a1, 2
        add     a1, a1, a0
.LBB4_2:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB4_2
.LBB4_3:
        ret
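
For comparison, if that round trip through the element count were folded away, add1_3 would reduce to the same loop as add1. This is roughly the output I would expect (a hand-written sketch, not actual compiler output):

add1_3(int*, int*):                     # expected: same as add1
        beq     a0, a1, .LBB4_3         # n == 0, i.e. x == end
.LBB4_2:
        lw      a2, 0(a0)
        addi    a2, a2, 1
        sw      a2, 0(a0)
        addi    a0, a0, 4
        bne     a0, a1, .LBB4_2
.LBB4_3:
        ret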

add1_s compiles to the same code as add1_n, even though the source does almost exactly what the generated add1_i does.
The redundant add and subtract seem to happen only on RISC-V.
The variation in the loop body also happens on aarch64 (when I disable vectorization), but I don't know whether there is a performance difference.
The same useless shifts also show up on aarch64 whenever the loop counter is not used for indexing.
