
Suboptimal codegen for llvm.vector.reduce of <N x i1> #50466

Open
@calebzulawski

Description

Bugzilla Link 51122
Version 12.0
OS All
CC @Arnaud-de-Grandmaison-ARM,@DMG862,@RKSimon,@smithp35

Extended Description

The binary reduction intrinsics on AArch64 (and ARM) produce suboptimal code for vectors of i1. This issue is similar to #38188.

declare i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %a)

define i1 @mask_reduce_or(<8 x i8> %mask) {
    %mask1 = trunc <8 x i8> %mask to <8 x i1>
    %reduced = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %mask1)
    ret i1 %reduced
}

produces

mask_reduce_or:                         // @mask_reduce_or
        umov    w14, v0.b[1]
        umov    w15, v0.b[0]
        umov    w13, v0.b[2]
        orr     w14, w15, w14
        umov    w12, v0.b[3]
        orr     w13, w14, w13
        umov    w11, v0.b[4]
        orr     w12, w13, w12
        umov    w10, v0.b[5]
        orr     w11, w12, w11
        umov    w9, v0.b[6]
        orr     w10, w11, w10
        umov    w8, v0.b[7]
        orr     w9, w10, w9
        orr     w8, w9, w8
        and     w0, w8, #0x1
        ret

when it could instead use vmaxvq (or vpmax on ARM).
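
For reference, a hand-written sketch of the kind of code this is asking for (not compiler output; vmaxvq corresponds to the umaxv instruction, and the sketch assumes the i1 value arrives in bit 0 of each byte, as in the trunc above):

mask_reduce_or_sketch:                  // hypothetical lowering, not produced by LLVM
        movi    v1.8b, #1               // keep only bit 0 of each lane (the i1 payload)
        and     v0.8b, v0.8b, v1.8b
        umaxv   b0, v0.8b               // horizontal max: 1 if any lane is set, 0 otherwise
        fmov    w0, s0
        ret

If the mask is already a normalized comparison result (lanes all-zeros or all-ones), the movi/and pair can be dropped and the whole reduction is umaxv plus a move and a final and with #0x1.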

The same goes for vector.reduce.and with vminvq (or vpmin on ARM).
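
The analogous sketch for the and-reduction, under the same assumption that the i1 value lives in bit 0 of each byte (vminvq corresponds to uminv):

mask_reduce_and_sketch:                 // hypothetical lowering, not produced by LLVM
        movi    v1.8b, #1               // keep only bit 0 of each lane
        and     v0.8b, v0.8b, v1.8b
        uminv   b0, v0.8b               // horizontal min: 0 if any lane is clear, 1 otherwise
        fmov    w0, s0
        ret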
