Use instancing for sprite rendering #8872

opstic · 2023-06-17T18:43:36Z

Objective

Improve sprite rendering performance

Solution

Write one record of per-instance data to the vertex buffer for each sprite instead of 6 vertices, and change the VertexStepMode to Instance.
This reduces the amount of data transferred to the GPU.
Originally, 1728 bits (216 bytes) had to be transferred to the GPU for each sprite.
Now, only 768 bits (96 bytes) are needed per-sprite.
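
As a rough check of the byte counts above, here is a minimal sketch using two hypothetical layouts. The field choices in `SpriteVertex` and `SpriteInstance` are assumptions picked to match the stated sizes, not the exact structs from this PR.

```rust
use std::mem::size_of;

/// Hypothetical pre-instancing vertex: 9 floats = 36 bytes.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct SpriteVertex {
    position: [f32; 3], // 12 bytes
    uv: [f32; 2],       // 8 bytes
    color: [f32; 4],    // 16 bytes
}

/// Hypothetical per-instance record: 24 floats = 96 bytes.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct SpriteInstance {
    transform: [[f32; 4]; 4],  // assumed 4x4 matrix, 64 bytes
    color: [f32; 4],           // 16 bytes
    uv_offset_scale: [f32; 4], // 16 bytes
}

/// Bytes written per sprite when every sprite emits 6 full vertices.
fn bytes_per_sprite_vertices() -> usize {
    6 * size_of::<SpriteVertex>()
}

/// Bytes written per sprite when every sprite emits one instance record.
fn bytes_per_sprite_instanced() -> usize {
    size_of::<SpriteInstance>()
}

fn main() {
    println!(
        "6 vertices: {} bytes ({} bits)",
        bytes_per_sprite_vertices(),
        bytes_per_sprite_vertices() * 8
    );
    println!(
        "1 instance: {} bytes ({} bits)",
        bytes_per_sprite_instanced(),
        bytes_per_sprite_instanced() * 8
    );
}
```

With these assumed layouts, the per-sprite cost drops from 6 × 36 bytes to a single 96-byte record.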

Performance

(Sprite size set to 0.01 to minimize sampler performance impact; buffer sizes captured with Intel GPA)

10000 sprites in bevymark
Original: 2150408 bytes transferred per frame, 145-147 fps.
Instanced: 960004 bytes transferred per frame, 158-161 fps.
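
For comparison, a small sketch that sets the measured transfer sizes against the naive per-sprite estimate (sprite count times bytes per sprite). The function names are hypothetical. The measured figures do not match the naive products exactly, and the sketch does not try to explain the gap.

```rust
/// Naive estimate: every sprite contributes a fixed number of bytes.
fn theoretical_bytes(sprites: u64, bytes_per_sprite: u64) -> u64 {
    sprites * bytes_per_sprite
}

/// Fraction of the original transfer size that remains after the change.
fn ratio(new_bytes: u64, old_bytes: u64) -> f64 {
    new_bytes as f64 / old_bytes as f64
}

fn main() {
    let sprites = 10_000;

    // Measured values quoted in the PR description.
    let measured_original = 2_150_408;
    let measured_instanced = 960_004;

    println!(
        "original:  estimated {} bytes, measured {} bytes",
        theoretical_bytes(sprites, 216),
        measured_original
    );
    println!(
        "instanced: estimated {} bytes, measured {} bytes",
        theoretical_bytes(sprites, 96),
        measured_instanced
    );
    println!(
        "instanced transfer is {:.1}% of the original",
        100.0 * ratio(measured_instanced, measured_original)
    );
}
```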

mockersf · 2023-06-22T10:50:55Z

There are FPS regressions for me on both bevymark and many_sprites:

many_sprites from ~820 fps to ~760 fps
bevymark -- 1000 100 from 45 fps to 39 fps

Elabajaba · 2023-06-25T20:09:09Z

This is (surprisingly) faster on my system. For bevymark, I'm seeing slightly better minimums and ~112 fps average vs ~90 fps average for our current batching; for many_sprites, perf is the same.

The old wisdom was that you want to use batching for sprites because you're drawing quads, and GPUs run wavefronts of 32-64 vertices (so you end up with very low utilization per wavefront). However, these days (some? AMD, maybe Nvidia, others idk) GPU drivers can merge instances into wavefronts. (but sometimes it merges and it's still slow? idk)

Basically, need people to test this on a bunch of different hardware and check wavefront occupancy to figure out who does merging and who doesn't and if we're okay with the performance tradeoffs. (or maybe even have both paths, but have it be an option?)
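
To put numbers on the utilization argument in the comment above, here is a minimal sketch. It assumes the simplified model that, without merging, each instance's vertices occupy their own wavefront(s); real hardware scheduling is more involved, and `lane_utilization` is a hypothetical helper.

```rust
/// Fraction of wavefront lanes doing useful vertex work when
/// `vertices` vertices are packed into wavefronts of `wave_size` lanes.
fn lane_utilization(vertices: u32, wave_size: u32) -> f64 {
    // Number of wavefronts needed, rounding up.
    let waves = (vertices + wave_size - 1) / wave_size;
    vertices as f64 / (waves * wave_size) as f64
}

fn main() {
    for wave_size in [32, 64] {
        // One quad instance per wavefront (no merging): 4 unique vertices.
        let unmerged = lane_utilization(4, wave_size);
        // Instances merged until the wavefront is full.
        let merged = lane_utilization(4 * (wave_size / 4), wave_size);
        println!(
            "wave size {wave_size}: unmerged {:.1}%, merged {:.1}%",
            unmerged * 100.0,
            merged * 100.0
        );
    }
}
```

Under this model a single quad fills only 4 of 32 or 64 lanes, which is why whether the driver merges instances matters so much for the instanced path.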

# Objective

- Supersedes #8872
- Improve sprite rendering performance after the regression in #9236

## Solution

- Use an instance-rate vertex buffer to store per-instance data.
- Store color, UV offset and scale, and a transform per instance.
- Convert Sprite rect, custom_size, anchor, and flip_x/_y to an affine 3x4 matrix and store the transpose of that in the per-instance data. This is similar to how MeshUniform uses transpose affine matrices.
- Use a special index buffer that has batches of 6 indices referencing 4 vertices. The lower 2 bits indicate the x and y of a quad such that the corners are:
  ```
  10 11
  00 01
  ```
  UVs are implicit but get modified by UV offset and scale. The remaining upper bits contain the instance index.

## Benchmarks

I will compare versus `main` before #9236 because the results should be as good as or faster than that.

Running `bevymark -- 10000 16` on an M1 Max with `main` at `e8b38925` in yellow, this PR in red:

![Screenshot 2023-08-27 at 18 44 10](https://github.com/bevyengine/bevy/assets/302146/bdc5c929-d547-44bb-b519-20dce676a316)

Looking at the median frame times, that's a 37% reduction from before.

---

## Changelog

- Changed: Improved sprite rendering performance by leveraging an instance-rate vertex buffer.

---------

Co-authored-by: Giacomo Stevanato <giaco.stevanato@gmail.com>
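
The index encoding described in the Solution section above can be sketched as follows. This is an illustration under stated assumptions, not the code from #9597: the helper names are hypothetical, and the triangle order in `QUAD_CORNERS` is one plausible counter-clockwise winding that the description does not specify.

```rust
/// Corner codes for the two triangles of a quad.
/// Bit 0 is x and bit 1 is y, so 0b00 = bottom-left, 0b01 = bottom-right,
/// 0b10 = top-left, 0b11 = top-right.
/// The order below is an assumed counter-clockwise winding.
const QUAD_CORNERS: [u32; 6] = [2, 0, 1, 1, 3, 2];

/// Build an index buffer with 6 indices per quad, where each index packs
/// the instance number in the upper bits and the corner in the lower 2 bits.
fn build_quad_indices(quad_count: u32) -> Vec<u32> {
    let mut indices = Vec::with_capacity(quad_count as usize * QUAD_CORNERS.len());
    for instance in 0..quad_count {
        for corner in QUAD_CORNERS {
            indices.push((instance << 2) | corner);
        }
    }
    indices
}

/// Recover (instance index, x, y) from a packed index, as a vertex
/// shader could do with the incoming vertex index.
fn decode_index(index: u32) -> (u32, f32, f32) {
    let instance = index >> 2;
    let x = (index & 1) as f32;
    let y = ((index >> 1) & 1) as f32;
    (instance, x, y)
}

fn main() {
    let indices = build_quad_indices(2);
    for index in &indices {
        let (instance, x, y) = decode_index(*index);
        println!("index {index:2} -> instance {instance}, corner ({x}, {y})");
    }
}
```

Because the corner position is derived from the index itself, no per-vertex position or UV data needs to be stored; only the per-instance record varies between sprites.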

superdump · 2023-09-02T22:21:07Z

Thanks for the contribution. Closing this as #9597 was merged.

opstic added 2 commits June 17, 2023 11:44

Use instancing for sprite rendering

60738c0

Name transform matrix variables better

988dcda

opstic marked this pull request as ready for review June 17, 2023 19:15

alice-i-cecile added the C-Performance (A change motivated by improving speed, memory usage or compile times) and A-Rendering (Drawing game state to the screen) labels Jun 17, 2023

opstic added 3 commits July 12, 2023 00:44

Merge branch 'bevyengine:main' into sprite-instancing

bd0cf5a

Merge branch 'bevyengine:main' into sprite-instancing

f0d077b

Merge branch 'bevyengine:main' into sprite-instancing

423f6bc

superdump mentioned this pull request Aug 27, 2023

Use instancing for sprites #9597

Merged

opstic closed this by deleting the head repository Sep 16, 2023
