Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[spv-out] implement array value indexing #723

Merged
merged 1 commit into from
Apr 16, 2021
Merged

Conversation

kvark
Copy link
Member

@kvark kvark commented Apr 16, 2021

Fixes #635
Alternative to #669

This is a quick and dirty solution for array indexing. It unconditionally creates a function-local variable, stores the whole array in there, then indexes with OpAccessChain, and then loads the result back. This is likely not the most efficient way to go in a lot of situations, but it works. Also, there is a fairly good chance that LLVM optimizer in the driver will see the variable and eliminate it if the HW supports indexing directly. It would be amazing if we could verify this.

@kvark
Copy link
Member Author

kvark commented Apr 16, 2021

Here is RDNA2 assembly for the access.wgsl shader:

s_setprio             0x0003
s_mov_b64             exec, -1
s_getpc_b64           s[20:21]
s_inst_prefetch       0x0003
s_bfe_u32             s0, s2, lit(0x00090016)
s_bfe_u32             s1, s2, lit(0x0009000c)
s_bfe_u32             s2, s3, lit(0x00040018)
s_cmp_lt_u32          s2, 1
s_cbranch_scc0        label_003C
s_lshl_b32            s3, s0, 12
s_or_b32              m0, s1, s3
s_sendmsg             sendmsg(MSG_GS_ALLOC_REQ, GS_OP_NOP, 0)
label_003C:
v_mbcnt_lo_u32_b32    v1, -1, 0
v_mbcnt_hi_u32_b32    v1, -1, v1
s_setprio             0x0000
v_mad_u32_u24         v1, s2, 64, v1
s_mov_b64             s[2:3], exec
v_cmpx_lt_u32         exec, v1, s1
v_add_nc_u32          v2, s10, v5
v_lshlrev_b32         v2, 2, v2
s_cbranch_execz       label_009C
s_getpc_b64           s[4:5]
s_add_u32             s4, s4, lit(0x00000160)
s_addc_u32            s5, s5, 0
v_min_u32             v2, 20, v2
global_load_dword     v2, v2, s[4:5]
s_waitcnt             vmcnt(0)
v_cvt_f32_i32         v3, v2
v_xor_b32             v4, lit(0x80000000), v3
label_009C:
s_mov_b64             exec, s[2:3]
v_cmp_gt_u32          vcc, s0, v1
s_and_b64             exec, s[2:3], vcc
s_cbranch_execz       label_00B4
exp                   prim, v0, off, off, off done
label_00B4:
s_waitcnt             expcnt(0)
s_mov_b64             exec, s[2:3]
v_cmp_gt_u32          vcc, s1, v1
s_and_b64             exec, s[2:3], vcc
s_cbranch_execz       label_00D0
exp                   pos0, v3, v4, v3, v3 done
label_00D0:
s_endpgm              

@kvark
Copy link
Member Author

kvark commented Apr 16, 2021

@acowley it's a bit hard to read (for me). It appears that the driver has put the array into internal memory (s4 + 0x00000160), and then did a single load from it (global_load_dword) based on the vertex index. So it's definitely optimizing away those unnecessary load/stores that our naive code generates.

@acowley
Copy link
Contributor

acowley commented Apr 16, 2021

That’s great! How do nested accesses look? Is using OpAccessChain any different than loading and accessing intermediate composites as is done in this PR?

@kvark
Copy link
Member Author

kvark commented Apr 16, 2021

I tried nested accesses on the following WGSL:

[[stage(vertex)]]
fn foo([[builtin(vertex_index)]] vi: u32) -> [[builtin(position)]] vec4<f32> {
	let array = array<array<i32, 2>, 2>(array<i32, 2>(1, 2), array<i32, 2>(3, 4));
	let value = array[vi][vi];
	return vec4<f32>(vec4<i32>(value));
}

It generates the following RDNA2 assembly:

s_setprio             0x0003                                     Flow Control
s_mov_b64             exec, -1                                   Scalar ALU
s_getpc_b64           s[20:21]                                   Vector ALU
s_inst_prefetch       0x0003                                     Scalar ALU
s_bfe_u32             s0, s2, lit(0x00090016)                    Scalar ALU
s_bfe_u32             s1, s2, lit(0x0009000c)                    Scalar ALU
s_bfe_u32             s2, s3, lit(0x00040018)                    Scalar ALU
s_cmp_lt_u32          s2, 1                                      Scalar ALU
s_cbranch_scc0        label_003C                                 Branch
s_lshl_b32            s3, s0, 12                                 Scalar ALU
s_or_b32              m0, s1, s3                                 Scalar ALU
s_sendmsg             sendmsg(MSG_GS_ALLOC_REQ, GS_OP_NOP, 0)    Flow Control
label_003C:
v_mbcnt_lo_u32_b32    v1, -1, 0                                  Vector ALU
v_mbcnt_hi_u32_b32    v1, -1, v1                                 Vector ALU
s_setprio             0x0000                                     Flow Control
v_mad_u32_u24         v1, s2, 64, v1                             Vector ALU
s_mov_b64             s[2:3], exec                               Scalar ALU
v_cmpx_lt_u32         exec, v1, s1                               Vector ALU
v_add_nc_u32          v2, s10, v5                                Vector ALU
s_cbranch_execz       label_010C                                 Branch
s_getpc_b64           s[4:5]                                     Vector ALU
s_add_u32             s4, s4, lit(0x000001d4)                    Scalar ALU
s_addc_u32            s5, s5, 0                                  Scalar ALU
s_movk_i32            s6, 0x0000                                 Scalar ALU
v_lshlrev_b32         v3, 3, v2                                  Vector ALU
v_mul_u32_u24         v2, 4, v2                                  Vector ALU
v_min_u32             v4, 16, v3                                 Vector ALU
v_add_nc_i32          v3, v3, 4                                  Vector ALU
v_add_nc_u32          v2, 16, v2                                 Vector ALU
v_min_u32             v3, 16, v3                                 Vector ALU
s_clause              0x0001                                     Scalar ALU
global_load_dword     v4, v4, s[4:5]                             Vector Memory
global_load_dword     v3, v3, s[4:5]                             Vector Memory
v_cmp_gt_u32          vcc, 32, v2                                Vector ALU
s_waitcnt             vmcnt(1)                                   Flow Control
v_mov_b32             v10, v4                                    Vector ALU
s_waitcnt             vmcnt(0)                                   Flow Control
v_mov_b32             v11, v3                                    Vector ALU
s_and_saveexec_b64    s[4:5], vcc                                Scalar ALU
v_cmp_eq_i32          s[24:25], 20, v2                           Vector ALU
v_cndmask_b32         v9, v10, v11, s[24:25]                     Vector ALU
v_cmp_eq_i32          s[24:25], 24, v2                           Vector ALU
v_cndmask_b32         v9, v9, v12, s[24:25]                      Vector ALU
v_cmp_eq_i32          s[24:25], 28, v2                           Vector ALU
v_cndmask_b32         v2, v9, v13, s[24:25]                      Vector ALU
s_mov_b64             exec, s[4:5]                               Scalar ALU
v_cndmask_b32         v2, 0, v2, vcc                             Vector ALU
v_cvt_f32_i32         v3, v2                                     Vector ALU
v_xor_b32             v4, lit(0x80000000), v3                    Vector ALU
label_010C:
s_mov_b64             exec, s[2:3]                               Scalar ALU
v_cmp_gt_u32          vcc, s0, v1                                Vector ALU
s_and_b64             exec, s[2:3], vcc                          Scalar ALU
s_cbranch_execz       label_0124                                 Branch
exp                   prim, v0, off, off, off done               GDS/Export
label_0124:
s_waitcnt             expcnt(0)                                  Flow Control
s_mov_b64             exec, s[2:3]                               Scalar ALU
v_cmp_gt_u32          vcc, s1, v1                                Vector ALU
s_and_b64             exec, s[2:3], vcc                          Scalar ALU
s_cbranch_execz       label_0140                                 Branch
exp                   pos0, v3, v4, v3, v3 done                  GDS/Export
label_0140:
s_endpgm                                                         Flow Control

Can't say it makes sense to me, but the assembly has exactly 2 memory loads, as we'd optimally expect in this case.

Edit; what's most confusing to me is that these accesses don't look dependent on each other.

@kvark
Copy link
Member Author

kvark commented Apr 16, 2021

The PR is now updated to have our snapshot test access.wgsl to have nested indexing.

@kvark
Copy link
Member Author

kvark commented Apr 16, 2021

@cwfitzgerald noted that there is only vi=0 and vi=1 cases. I modified the test to have 9 different paths now

// This snapshot tests accessing various containers, dereferencing pointers.

[[stage(vertex)]]
fn foo(
    [[builtin(vertex_index)]] vi: u32,
    [[builtin(instance_index)]] ii: u32,
) -> [[builtin(position)]] vec4<f32> {
    let array = array<array<i32, 3>, 3>(
        array<i32, 3>(1, 2, 3),
        array<i32, 3>(4, 5, 6),
        array<i32, 3>(7, 8, 9)
    );
    let value = array[ii][vi];
    return vec4<f32>(vec4<i32>(value));
}

RDNA2 assembly:

s_setprio             0x0003                                     Flow Control
s_mov_b64             exec, -1                                   Scalar ALU
s_getpc_b64           s[20:21]                                   Vector ALU
s_inst_prefetch       0x0003                                     Scalar ALU
s_bfe_u32             s0, s2, lit(0x00090016)                    Scalar ALU
s_bfe_u32             s1, s2, lit(0x0009000c)                    Scalar ALU
s_bfe_u32             s2, s3, lit(0x00040018)                    Scalar ALU
s_cmp_lt_u32          s2, 1                                      Scalar ALU
s_cbranch_scc0        label_003C                                 Branch
s_lshl_b32            s3, s0, 12                                 Scalar ALU
s_or_b32              m0, s1, s3                                 Scalar ALU
s_sendmsg             sendmsg(MSG_GS_ALLOC_REQ, GS_OP_NOP, 0)    Flow Control
label_003C:
v_mbcnt_lo_u32_b32    v1, -1, 0                                  Vector ALU
v_mbcnt_hi_u32_b32    v1, -1, v1                                 Vector ALU
s_setprio             0x0000                                     Flow Control
v_mad_u32_u24         v1, s2, 64, v1                             Vector ALU
s_mov_b64             s[2:3], exec                               Scalar ALU
v_cmpx_lt_u32         exec, v1, s1                               Vector ALU
v_add_nc_u32          v3, s11, v8                                Vector ALU
v_add_nc_u32          v2, s10, v5                                Vector ALU
s_cbranch_execz       label_0134                                 Branch
s_getpc_b64           s[4:5]                                     Vector ALU
s_add_u32             s4, s4, lit(0x000001f8)                    Scalar ALU
s_addc_u32            s5, s5, 0                                  Scalar ALU
s_movk_i32            s6, 0x0000                                 Scalar ALU
v_lshl_add_u32        v3, v3, 1, v3                              Vector ALU
v_mul_u32_u24         v2, 4, v2                                  Vector ALU
v_lshlrev_b32         v3, 2, v3                                  Vector ALU
v_add_nc_u32          v2, 48, v2                                 Vector ALU
v_min_u32             v4, 36, v3                                 Vector ALU
v_add_nc_i32          v5, v3, 4                                  Vector ALU
v_add_nc_i32          v3, v3, 8                                  Vector ALU
v_cmp_gt_u32          vcc, 64, v2                                Vector ALU
v_min_u32             v5, 36, v5                                 Vector ALU
v_min_u32             v3, 36, v3                                 Vector ALU
s_clause              0x0002                                     Scalar ALU
global_load_dword     v4, v4, s[4:5]                             Vector Memory
global_load_dword     v5, v5, s[4:5]                             Vector Memory
global_load_dword     v3, v3, s[4:5]                             Vector Memory
s_waitcnt             vmcnt(2)                                   Flow Control
v_mov_b32             v10, v4                                    Vector ALU
s_waitcnt             vmcnt(1)                                   Flow Control
v_mov_b32             v11, v5                                    Vector ALU
s_waitcnt             vmcnt(0)                                   Flow Control
v_mov_b32             v12, v3                                    Vector ALU
s_and_saveexec_b64    s[4:5], vcc                                Scalar ALU
v_cmp_eq_i32          s[24:25], 52, v2                           Vector ALU
v_cndmask_b32         v9, v10, v11, s[24:25]                     Vector ALU
v_cmp_eq_i32          s[24:25], 56, v2                           Vector ALU
v_cndmask_b32         v9, v9, v12, s[24:25]                      Vector ALU
v_cmp_eq_i32          s[24:25], 60, v2                           Vector ALU
v_cndmask_b32         v2, v9, v13, s[24:25]                      Vector ALU
s_mov_b64             exec, s[4:5]                               Scalar ALU
v_cndmask_b32         v2, 0, v2, vcc                             Vector ALU
v_cvt_f32_i32         v3, v2                                     Vector ALU
v_xor_b32             v4, lit(0x80000000), v3                    Vector ALU
label_0134:
s_mov_b64             exec, s[2:3]                               Scalar ALU
v_cmp_gt_u32          vcc, s0, v1                                Vector ALU
s_and_b64             exec, s[2:3], vcc                          Scalar ALU
s_cbranch_execz       label_014C                                 Branch
exp                   prim, v0, off, off, off done               GDS/Export
label_014C:
s_waitcnt             expcnt(0)                                  Flow Control
s_mov_b64             exec, s[2:3]                               Scalar ALU
v_cmp_gt_u32          vcc, s1, v1                                Vector ALU
s_and_b64             exec, s[2:3], vcc                          Scalar ALU
s_cbranch_execz       label_0168                                 Branch
exp                   pos0, v3, v4, v3, v3 done                  GDS/Export
label_0168:
s_endpgm                                                         Flow Control

Copy link
Collaborator

@Napokue Napokue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

Copy link
Member

@jimblandy jimblandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Some minor suggestions.

src/back/spv/writer.rs Outdated Show resolved Hide resolved
src/back/spv/writer.rs Outdated Show resolved Hide resolved
@kvark kvark merged commit bb716f9 into gfx-rs:master Apr 16, 2021
@kvark kvark deleted the spv-array branch April 16, 2021 21:11
sagudev added a commit to sagudev/wgpu that referenced this pull request Sep 1, 2024
gfx-rs/naga#723 back from dead

Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>
Signed-off-by: sagudev <16504129+sagudev@users.noreply.github.com>
sagudev added a commit to sagudev/wgpu that referenced this pull request Sep 2, 2024
gfx-rs/naga#723 back from dead

Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>
Signed-off-by: sagudev <16504129+sagudev@users.noreply.github.com>
ErichDonGubler pushed a commit to sagudev/wgpu that referenced this pull request Sep 4, 2024
gfx-rs/naga#723 back from dead

Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>
Signed-off-by: sagudev <16504129+sagudev@users.noreply.github.com>
jimblandy pushed a commit to sagudev/wgpu that referenced this pull request Sep 5, 2024
gfx-rs/naga#723 back from dead

Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>
Signed-off-by: sagudev <16504129+sagudev@users.noreply.github.com>
jimblandy added a commit to sagudev/wgpu that referenced this pull request Oct 3, 2024
Bring gfx-rs/naga#723 back from the dead.

Signed-off-by: sagudev <16504129+sagudev@users.noreply.github.com>
Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>
Co-authored-by: Jim Blandy <jimb@red-bean.com>
jimblandy added a commit to gfx-rs/wgpu that referenced this pull request Oct 3, 2024
Bring gfx-rs/naga#723 back from the dead.

Signed-off-by: sagudev <16504129+sagudev@users.noreply.github.com>
Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>
Co-authored-by: Jim Blandy <jimb@red-bean.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Indexing constant arrays in SPIR-V
4 participants