Use instancing for sprite rendering #8872
there are FPS regressions for me on both
This is (surprisingly) faster on my system. For bevymark, I'm seeing slightly better minimums and ~112 fps average vs ~90 fps average for our current batching; for many_sprites, perf is the same.

The old wisdom was that you want to use batching for sprites because you're drawing quads, and GPUs run wavefronts of 32-64 vertices, so a 4-vertex instance gives very low utilization per wavefront. However, these days some GPU drivers (AMD, maybe Nvidia, others idk) can merge instances into wavefronts. (But sometimes it merges and it's still slow? idk)

Basically, we need people to test this on a bunch of different hardware and check wavefront occupancy to figure out who does merging and who doesn't, and whether we're okay with the performance tradeoffs. (Or maybe even have both paths, with an option to choose?)
# Objective

- Supersedes #8872
- Improve sprite rendering performance after the regression in #9236

## Solution

- Use an instance-rate vertex buffer to store per-instance data.
- Store color, UV offset and scale, and a transform per instance.
- Convert Sprite rect, custom_size, anchor, and flip_x/_y to an affine 3x4 matrix and store the transpose of that in the per-instance data. This is similar to how MeshUniform uses transpose affine matrices.
- Use a special index buffer that has batches of 6 indices referencing 4 vertices. The lower 2 bits indicate the x and y of a quad such that the corners are:
  ```
  10 11
  00 01
  ```
  UVs are implicit but get modified by UV offset and scale. The remaining upper bits contain the instance index.

## Benchmarks

I will compare versus `main` before #9236 because the results should be as good as or faster than that. Running `bevymark -- 10000 16` on an M1 Max, with `main` at `e8b38925` in yellow and this PR in red. Looking at the median frame times, that's a 37% reduction from before.

---

## Changelog

- Changed: Improved sprite rendering performance by leveraging an instance-rate vertex buffer.

---------

Co-authored-by: Giacomo Stevanato <giaco.stevanato@gmail.com>
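The index encoding described in the commit message above can be sketched in plain Rust. This is only an illustration of the bit packing (corner in the low 2 bits, instance index in the remaining upper bits), not the code from the commit: the function names `build_quad_indices` and `decode_index` are hypothetical, and the triangle winding order chosen here is an assumption.

```rust
/// Corner codes for the two triangles of one quad.
/// Bit 0 is x and bit 1 is y, matching the corner diagram (10 11 / 00 01).
/// ASSUMPTION: this particular winding order is illustrative only.
const CORNERS: [u32; 6] = [0b00, 0b01, 0b11, 0b11, 0b10, 0b00];

/// Build an index buffer with 6 indices per sprite.
/// Each index packs the corner into the low 2 bits and the
/// instance index into the remaining upper bits.
fn build_quad_indices(sprite_count: u32) -> Vec<u32> {
    let mut indices = Vec::with_capacity(sprite_count as usize * 6);
    for instance in 0..sprite_count {
        for &corner in CORNERS.iter() {
            indices.push((instance << 2) | corner);
        }
    }
    indices
}

/// Recover (instance index, x, y) from a packed index, as a vertex
/// shader would before applying the per-instance transform.
fn decode_index(index: u32) -> (u32, f32, f32) {
    let x = (index & 1) as f32;
    let y = ((index >> 1) & 1) as f32;
    (index >> 2, x, y)
}
```

With this packing, each batch of 6 indices references only 4 distinct values, and with 32-bit indices the 30 upper bits leave room for roughly a billion instances.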
Thanks for the contribution. Closing this as #9597 was merged.
Objective
Solution
Store per-instance data instead of 6 vertices in the vertex buffer and change the `VertexStepMode` to `Instance`. This allows less data to be transferred.
Originally, 1728 bits (216 bytes) had to be transferred to the GPU for each sprite.
Now, only 768 bits (96 bytes) are needed per-sprite.
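The per-sprite figures can be checked with a little arithmetic. The sketch below assumes an old per-vertex layout of position (3 floats), UV (2 floats), and color (4 floats); the PR does not state that layout, it is simply one that adds up to the quoted 216 bytes across 6 vertices. The 96-byte instance size is taken from the description as-is, and all names are hypothetical.

```rust
/// Size of one `f32` in bytes.
const F32_BYTES: usize = std::mem::size_of::<f32>();

/// ASSUMED old per-vertex layout: position (3), UV (2), color (4) floats.
const OLD_FLOATS_PER_VERTEX: usize = 3 + 2 + 4;
/// Two triangles per sprite, no index buffer.
const OLD_VERTICES_PER_SPRITE: usize = 6;

/// Per-instance size quoted in the PR description (768 bits).
const INSTANCED_BYTES_PER_SPRITE: usize = 768 / 8;

/// Bytes uploaded per sprite with the old 6-vertex path.
fn old_bytes_per_sprite() -> usize {
    OLD_VERTICES_PER_SPRITE * OLD_FLOATS_PER_VERTEX * F32_BYTES
}

/// Total vertex-buffer bytes for a frame with `sprite_count` sprites.
fn upload_bytes(sprite_count: usize, bytes_per_sprite: usize) -> usize {
    sprite_count * bytes_per_sprite
}

/// Fraction of the old upload size that is saved by the new one.
fn reduction_fraction(old: usize, new: usize) -> f64 {
    1.0 - new as f64 / old as f64
}
```

Under these assumptions, `reduction_fraction(216, 96)` comes out to about 0.56, meaning the instanced path uploads a bit under half of the original data per sprite.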
Performance
(Sprite size set to 0.01 to minimize sampler performance impact, buffer size captured from Intel GPA)
bevymark
Original: 2150408 bytes transferred per frame, 145-147 fps.
Instanced: 960004 bytes transferred per frame, 158-161 fps.