Skip to content

[NVPTX] Incomplete and inconsistent mechanisms for lowering vectors loads/stores with sub-32-bit values #118851

Closed
@dakersnar

Description

@dakersnar

I started this discussion here: #67322 (comment), and it was requested that I open a new issue. Let me know if there is anything I need to adjust (labels, formatting, etc), I'm relatively new to open source LLVM.

CC @Artem-B, here is a simple repro that demonstrates the issue: https://godbolt.org/z/no5fEGo3h.

There are two major pieces to this issue that I'm hoping to solve:

  1. Support for efficiently lowering load/store v8i8 and store v16i8 is missing from the current backend. These vectors currently get split into two or four scalar loads/stores.
  2. Existing lowering for sub-32-bit load/store vectors is handled inconsistently, load v16xi8 is handled by a DagCombine ([NVPTX] Preserve v16i8 vector loads when legalizing #67322), while lowering for `load/store v8x[i16,f16,bf16] is handled via the type legalizer (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp#L6322).

Given that, I think the goal should be to unify the approaches into a single mechanism, whichever of the two we prefer, and extend it to support the missing vector types.

A clear demonstration of a valid use case, and a patch demonstrating improved code generation would be welcome, and we can discuss the details there. Without actually implementing a prototype and seeing how clean/messy it is, it's hard to tell ahead of time what would work best here.

I actually have two very rough prototypes working with both the DagCombine approach and the Type Legalizer approach. The Dag Combine approach is very slightly cleaner, and the generated PTX is near identical in both cases (with the only real difference being whether the generated vector is made up of u32s or b32s). However, it generates a slightly different SelectionDAG pattern post type-legalization, so it would represent a bit more churn if we were to transition the existing 16-bit vector handling over to it. I was hoping to lean on @Artem-B's experience in this space to see if you had any gut feelings for which approach would be preferred, and also whether you share my thinking that unifying them is a good idea at all.

As a quick example, here is the post-type-legalizer diff of the selectiondag for the approaches, when the input is the "generic_8xi16" function from the compiler explorer link.

Image
Image

Left is with the current Type Legalizer handling and right is with the DagCombine handling (I just changed it for the load, the store is untouched). The PTX ends up near-identical in this simple case, but I'm not sure whether the left side is enabling specific optimizations for v2x[i/f/bf]16 in other cases, as I know there was a lot of work done in that space that I haven't dug super deep into yet, e.g. 620db1f#diff-6fda74ef428299644e9f49a2b0994c0d850a760b89828f655030a114060d075a

Anyway, I can put up a patch for one of the approaches if that would be helpful, or possibly both, if you don't lean in one direction. I'll also note that the Dag Combine approach could be rewritten to match the output of the Type Legalizer approach, and vice versa, so there might be two questions at play here: 1) where do we prefer this lives and 2) what do we prefer the type-legalizer output to look like.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions