Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DX12 Sampler Heap Requirements Overly Strict #3350

Open
Tracked by #2719
cwfitzgerald opened this issue Jan 4, 2023 · 10 comments
Open
Tracked by #2719

DX12 Sampler Heap Requirements Overly Strict #3350

cwfitzgerald opened this issue Jan 4, 2023 · 10 comments
Assignees
Labels
api: dx12 Issues with DX12 or DXGI help required We need community help to make this happen. type: enhancement New feature or request

Comments

@cwfitzgerald
Copy link
Member

cwfitzgerald commented Jan 4, 2023

DX12 is totally weird about samplers. At any one time there may be exactly one sampler heap bound to a command buffer. Each sampler heap may contain no more than 2048 sampler descriptors. Currently we only ever use a single sampler heap. This means that the maximum amount of sampler bindings we can allocate in all live bind groups is 2048. This is extremely restrictive, especially when compared to other apis.

You can switch which sampler heap is bound to the command buffer at any time, but according to the command list documentation this may be expensive and cause flushes.

We need a better strategy of dealing with sampler descriptors. I suspect this will involve some kind of sampler descriptor cache where we try to re-use the same sampler heap as much as possible, only overflowing if we run out of space, then just switching sampler heaps as rarely as possible.

@cwfitzgerald cwfitzgerald added type: enhancement New feature or request help required We need community help to make this happen. api: dx12 Issues with DX12 or DXGI labels Jan 4, 2023
@Davidster
Copy link
Contributor

Davidster commented Mar 9, 2023

After experimenting with PR #3499 and some discussions in Matrix, we've made a more concrete plan for how we'll attempt to solve this.

The naïve solution is to dedupe the samplers that are uploaded to the GPU sampler heap. The problem with this is that the root signature is currently set up to have 1 descriptor table for every bind group in the pipeline where that bind group's samplers are stored. This means that the samplers of each bind group currently need to be contiguous in the heap, since descriptor tables are contiguous ranges in memory. This is a problem for deduping, since the samplers would no longer be contiguous.

One option would be to add more descriptor tables, but this would mean that we'd have one descriptor table per sample in the root signature. This is bad since the root signature is precious space, containing a max of 64 DWORDS where each descriptor table takes up 1 DWORD. As is recommended by the dx12 docs, we don't want to waste any space in the root signature as it is used for lots of stuff such as push constants.

One proposed solution to this is to use root constants instead of descriptor tables to store indices into the sampler heap. The entire sampler heap will be bound as a single descriptor table pointing to an array all the samplers. Since there's a maximum of 2048 samplers in the heap, we would only need 12 bits to represent the sampler's index in the heap. This means that we could store 32 sampler indices in the root signature in 12 root constants (root constants are 1 DWORD or 32 bits each, and 32*12bits == 12*32bits). So that would be 13 DWORDS (12 root constants + 1 descriptor table) instead of 32 DWORDS if we added 1 descriptor table per sampler as per the previous paragraph.

Another similar solution is to allocate an extra buffer which contains arrays of sampler indices. We can use that to store the list of samplers for each bind group and then store a pointer to that list in the root signature. This would add an extra indirection to fetching the samplers but would take up less space in the root signature (1 DWORD per bind group + 1 for the sampler array). It's also a simpler solution that would work for both fixed samplers and also sampler arrays. In fact, even if we use the other approach for fixed samplers, we would still need to do this to keep supporting sampler arrays. @cwfitzgerald and I are in agreement that we start with this approach for all cases and see how it goes.

Some other relevant info:

  • The WebGPU standard requires that we support at least 32 samplers per pipeline.
  • WebGPU already requires Tier 2 hardware, so we can rely on being able to have 2048 samplers.
    • it's not actually 2048, it's really 2032 because 16 samplers are reserved by dx12
  • For a maximum bind group count of 8, we use 17 dwords for all the bindings. For max of 4, we use 9.

@teoxoy
Copy link
Member

teoxoy commented Mar 9, 2023

Thanks for the investigation; I'd also say that the 2nd approach sounds like a good start!

@cwfitzgerald
Copy link
Member Author

Nice! Also looping in @jimblandy on this

@jimblandy
Copy link
Member

@teoxoy If the CTS were running into problems caused by this not-great allocation system in Mozilla CI, what would the symptoms be?

Basically, as long as this is an optimality problem, it's low priority; once it's a blocker to CTS health, then it's a higher priority.

@Davidster
Copy link
Contributor

you would get a panic with the "not enough memory left" error message, as in this issue: #2857

@teoxoy
Copy link
Member

teoxoy commented Dec 4, 2023

WebGPU requires FL11_1 or (FL11_0 + resource binding tier 2) (ref).

The table here lists FL11_1 as having a minimum resource binding tier of 3, so I thought we can assume tier 2 as a minimum but the table here lists haswell & broadwell as being FL11_1 with a resource binding tier of 1.

We already dropped support for haswell on d3d12 (#4709) but I thought we could still support broadwell.

Is the table on wikipedia wrong? If not, is it valid that broadwell only supports resource binding tier 1 given that the microsoft docs say FL11_1 should support tier 3?

@cwfitzgerald
Copy link
Member Author

The table is probably wrong in some way - I'm not particularly interested in supporting it, especially when there's no good solution for samplers without at least RBT2.

@teoxoy
Copy link
Member

teoxoy commented Dec 4, 2023

I guess dropping it would be fine since tier 1 has the additional limit of:

Max Samplers in all descriptor tables per shader stage: 16

from https://microsoft.github.io/DirectX-Specs/d3d/ResourceBinding.html#levels-of-hardware-support

which sounds even more restricting.

It would be good to know if the table was wrong or it's actually the case that it only supports tier 1.
Are you aware of any other resources that list this kind of information? That table on wikipedia is the only resource I could find.

@cwfitzgerald
Copy link
Member Author

cwfitzgerald commented Jan 11, 2024

Detailing the resolved plan a little more:

Samplers are fully de-duplicated in the backend. On creation, they are added to the sampler heap. On destruction, that slot in the sampler heap is up for re-use. Sampler creation fails if there is no more slots on the heap. This is allowed because we are updating descriptors that are not actively used by the gpu.

Globally, for all shaders that bind samplers, there is a single descriptor table which binds the entire sampler heap. This is not part of a bind group.

For each bindgroup that contains a sampler, there is a single descriptor pointing to a buffer with u32s. Each u32 is an index into the sampler heap corresponding to that sampler in the bind group. When accessing the sampler, the appropriate position in the buffer is read and the index is used to index into the sampler descriptor set binding.

When creating a bind group, this sampler index buffer is created and populated with all the appropriate sampler indexes. This is then bound as part of binding a bind group.

@group(0) @binding(0) var sampler_0: sampler; // sampler buffer 0 index 0 
@group(0) @binding(1) var texture_0: texture2d<f32>;
@group(0) @binding(2) var sampler_1: sampler; // sampler buffer 0 index 1

@group(1) @binding(0) var sampler_2: sampler; // sampler buffer 1 index 0 
@group(1) @binding(1) var sampler_3: sampler; // sampler buffer 1 index 1
@group(1) @binding(2) var sampler_4: sampler; // sampler buffer 1 index 2

// In the HLSL output, the equiv of:
var sampler_descriptors: binding_array<sampler>; // Contains 2032 possibly invalid samplers.
@group(0) @binding(wherever) var<storage> sampler_array_bind_group_0: array<u32>; // Contains two values in this example.
@group(1) @binding(wherever) var<storage> sampler_array_bind_group_1: array<u32>; // Contains three values in this example.

// When reading a sampler, wgsl of this:
textureSample(texture_0, sampler_1, uv);
textureSample(texture_0, sampler_4, uv);

// Gets translated to:
textureSample(texture_0, sampler_descriptors[sampler_array_bind_group_0[1]], uv);
textureSample(texture_0, sampler_descriptors[sampler_array_bind_group_1[2]], uv);

Note for the Future

If we use SM6.6 bindless, we can index directly into the sampler heap, skipping the sampler_descriptor binding.

@nical
Copy link
Contributor

nical commented Jan 17, 2024

I suspect that some bevy demos such as https://bevyengine.org/examples-webgpu/3D%20Rendering/load-gltf/ are hitting this issue very easily.

@ErichDonGubler ErichDonGubler removed their assignment Feb 7, 2024
@teoxoy teoxoy self-assigned this Jul 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: dx12 Issues with DX12 or DXGI help required We need community help to make this happen. type: enhancement New feature or request
Projects
Status: Todo
Development

No branches or pull requests

6 participants