
Add a compute shader sample #1

Merged · 4 commits · Feb 20, 2024

Conversation

goodartistscopy
Contributor

The sample generates an animated gif of the evolution of a Game of Life automaton. It demonstrates:

  • The use of a compute shader that writes into a storage texture
  • The classic "ping-pong" technique between two textures for iterative computation
  • The use of local workgroup memory (optional: see the shader code for details)
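
Roughly, the first two points boil down to something like this (untested sketch in TypeScript/WebGPU, not the code in this PR; `device`, `pipeline`, the two bind groups, `width`, `height` and `numSteps` are assumed to exist):

```ts
// Sketch only: a compute shader writing the next generation into a storage
// texture, driven by a host-side ping-pong loop.
const shaderSource = /* wgsl */ `
  @group(0) @binding(0) var src : texture_2d<u32>;                    // previous generation
  @group(0) @binding(1) var dst : texture_storage_2d<r32uint, write>; // next generation

  @compute @workgroup_size(8, 8)
  fn main(@builtin(global_invocation_id) id : vec3u) {
    let dims = vec2i(textureDimensions(src));
    var alive = 0u;
    // Count the 8 neighbours, wrapping coordinates around the edges.
    for (var dy = -1; dy <= 1; dy++) {
      for (var dx = -1; dx <= 1; dx++) {
        if (dx != 0 || dy != 0) {
          let p = (vec2i(id.xy) + vec2i(dx, dy) + dims) % dims;
          alive += textureLoad(src, p, 0).r;
        }
      }
    }
    let current = textureLoad(src, vec2i(id.xy), 0).r;
    // Game of Life rule: a live cell survives with 2 or 3 neighbours,
    // a dead cell is born with exactly 3.
    let next = select(u32(alive == 3u), u32(alive == 2u || alive == 3u), current == 1u);
    textureStore(dst, vec2i(id.xy), vec4u(next, 0u, 0u, 0u));
  }
`;

// Host-side ping-pong: two textures and two bind groups with the textures in
// opposite roles; each generation reads the texture written by the previous one.
for (let step = 0; step < numSteps; step++) {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, step % 2 === 0 ? bindGroupAtoB : bindGroupBtoA);
  pass.dispatchWorkgroups(Math.ceil(width / 8), Math.ceil(height / 8));
  pass.end();
  device.queue.submit([encoder.finish()]);
  // ... copy the freshly written texture out and encode a gif frame here ...
}
```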

@mpizenberg
Owner

Thanks a lot for this example. I have a few questions.

  1. Regarding textures, isn't there a setting that enables automatic texture wrapping for coordinates that fall outside of texture limits (<0, >width, ...)?
  2. The staging buffer is only used for the optimized shader, right?
  3. Since the execution time is dominated by the gif creation, we can't really see the effect of the local cache optimization. How to make it matter more?

@goodartistscopy
Contributor Author

1. Regarding textures, isn't there a setting that enables automatic texture wrapping for coordinates that fall outside of texture limits (<0, >width, ...)?

There is, when the texture is accessed through a sampler using one of the textureSample*() functions. You need to create a GPUSampler and bind it like any other resource; with its address mode set to "repeat" (the default is "clamp-to-edge"), it implements exactly the donut coordinate system.
Here I'm just accessing the texels without interpolation, so I figured using a sampler would be a waste.
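
For reference, the sampler route would look roughly like this (untested sketch, not code from this PR; it assumes the state were stored in a float format like r8unorm, since the textureSample*() functions only work on float sampled textures):

```ts
// With the address mode set to "repeat", the wrap-around happens in the
// sampler instead of in shader arithmetic.
const sampler = device.createSampler({
  addressModeU: "repeat",
  addressModeV: "repeat",
  magFilter: "nearest", // no interpolation, we only want the wrapping
  minFilter: "nearest",
});

const shaderSource = /* wgsl */ `
  @group(0) @binding(0) var src : texture_2d<f32>;
  @group(0) @binding(1) var samp : sampler;

  // Compute shaders can only use the explicit-level sampling variants,
  // hence textureSampleLevel rather than textureSample.
  fn cell(uv : vec2f) -> f32 { // uv in normalized [0, 1] coordinates
    return textureSampleLevel(src, samp, uv, 0.0).r;
  }
`;
```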

2. The staging buffer is only used for the optimized shader, right?

The staging buffer is used as a CPU-mappable copy of the output storage texture (textures can't be mapped for reading on the CPU, only buffers can, so the result has to be copied into a buffer first). So it's used in both paths.
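
Roughly, the readback path looks like this (simplified sketch with made-up names, not the exact code in the PR):

```ts
// The storage texture itself cannot be mapped, so its contents are copied
// into a MAP_READ staging buffer and read from there.
const bytesPerRow = Math.ceil((width * 4) / 256) * 256; // 256-byte row alignment, assuming a 4-byte texel format
const staging = device.createBuffer({
  size: bytesPerRow * height,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});

const encoder = device.createCommandEncoder();
encoder.copyTextureToBuffer(
  { texture: outputTexture },         // the storage texture written by the shader
  { buffer: staging, bytesPerRow },
  { width, height }
);
device.queue.submit([encoder.finish()]);

await staging.mapAsync(GPUMapMode.READ);
const frame = new Uint8Array(staging.getMappedRange()).slice(); // copy out before unmapping
staging.unmap();
// ... feed `frame` to the gif encoder ...
```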

3. Since the execution time is dominated by the gif creation, we can't really see the effect of the local cache optimization. How to make it matter more?

Right. It's mostly meant to illustrate how to use the local shared memory as a user-managed cache. I'm not sure the shader does enough repeated accesses to the input texture for it to really matter (reads are only amplified by a factor of 9, and besides, they're very regular).
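
For context, the cached variant boils down to something like this (sketch only, not the exact shader in the PR):

```ts
// Each 8x8 workgroup first copies the 10x10 tile it needs (its own cells plus
// a 1-cell halo) into workgroup memory, so the 9 reads per cell then hit the
// scratchpad instead of the texture.
const cachedShaderSource = /* wgsl */ `
  @group(0) @binding(0) var src : texture_2d<u32>;
  @group(0) @binding(1) var dst : texture_storage_2d<r32uint, write>;

  var<workgroup> tile : array<array<u32, 10>, 10>;

  @compute @workgroup_size(8, 8)
  fn main(@builtin(global_invocation_id) gid : vec3u,
          @builtin(local_invocation_id) lid : vec3u) {
    let dims = vec2i(textureDimensions(src));
    let origin = vec2i(gid.xy) - vec2i(lid.xy) - vec2i(1, 1); // texel mapped to tile[0][0]

    // Cooperative load: the 64 threads fill the 100 tile entries.
    var i = lid.y * 8u + lid.x;
    while (i < 100u) {
      let t = vec2i(i32(i % 10u), i32(i / 10u));
      let p = (origin + t + dims) % dims; // wrap around the edges
      tile[t.y][t.x] = textureLoad(src, p, 0).r;
      i += 64u;
    }
    workgroupBarrier(); // every thread must see the complete tile

    // Same rule as before, but the neighbour reads come from workgroup memory.
    let c = vec2i(lid.xy) + vec2i(1, 1);
    var alive = 0u;
    for (var dy = -1; dy <= 1; dy++) {
      for (var dx = -1; dx <= 1; dx++) {
        if (dx != 0 || dy != 0) {
          alive += tile[c.y + dy][c.x + dx];
        }
      }
    }
    let current = tile[c.y][c.x];
    let next = select(u32(alive == 3u), u32(alive == 2u || alive == 3u), current == 1u);
    textureStore(dst, vec2i(gid.xy), vec4u(next, 0u, 0u, 0u));
  }
`;
```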

Another factor is that modern GPUs have caches, so the "slow" implementation might actually be as fast as the other one. It would be interesting to benchmark, using the "timestamp query" feature to measure the time spent in the compute shader.
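
Something along these lines would do it (sketch; it assumes the "timestamp-query" feature was requested when creating the device):

```ts
// Timestamps written at the start and end of the compute pass give the
// GPU-side duration of the shader, independent of the gif encoding.
const querySet = device.createQuerySet({ type: "timestamp", count: 2 });
const resolveBuffer = device.createBuffer({
  size: 2 * 8, // two 64-bit timestamps
  usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
});
const timeReadback = device.createBuffer({
  size: 2 * 8,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass({
  timestampWrites: {
    querySet,
    beginningOfPassWriteIndex: 0,
    endOfPassWriteIndex: 1,
  },
});
// ... setPipeline / setBindGroup / dispatchWorkgroups as usual ...
pass.end();
encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0);
encoder.copyBufferToBuffer(resolveBuffer, 0, timeReadback, 0, 2 * 8);
device.queue.submit([encoder.finish()]);

await timeReadback.mapAsync(GPUMapMode.READ);
const [start, end] = new BigUint64Array(timeReadback.getMappedRange());
console.log(`compute pass: ${Number(end - start) / 1e6} ms`); // timestamps are in nanoseconds
timeReadback.unmap();
```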

Where it becomes important is when the threads do many incoherent reads and/or writes. Then a custom-managed scratchpad can beat the general-purpose caches of the memory subsystem more decisively.
So the answer to "How to make it matter more?" would be "Use a more complex workload" :) (a sort algorithm may be a good candidate).

@mpizenberg
Owner

Here I'm just accessing the texels without interpolation, so I figured using a sampler would be a waste.

Ah yes, ok I understand.

a sort algorithm may be a good candidate

XD ok ok

@mpizenberg merged commit ca29e70 into mpizenberg:main on Feb 20, 2024