Use instancing for sprite rendering #8872

Closed
wants to merge 5 commits into from

opstic
Contributor

@opstic commented Jun 17, 2023

Objective

  • Improve sprite rendering performance

Solution

Store per-instance data in the vertex buffer instead of 6 vertices per sprite, and change the VertexStepMode to Instance.
This reduces the amount of data transferred to the GPU.
Originally, 1728 bits (216 bytes) had to be transferred for each sprite.
Now, only 768 bits (96 bytes) are needed per sprite.
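
The per-sprite figures above can be checked with a little arithmetic. The following is a minimal sketch, not code from this PR: the constant and function names are hypothetical, and the 36-byte per-vertex size is derived from 216 / 6 rather than stated anywhere in the description.

```rust
/// Bytes uploaded per sprite before this change, as stated in the PR description.
pub const OLD_BYTES_PER_SPRITE: usize = 216;
/// Vertices written per sprite before this change, as stated in the PR description.
pub const OLD_VERTICES_PER_SPRITE: usize = 6;
/// Bytes uploaded per sprite with per-instance data, as stated in the PR description.
pub const NEW_BYTES_PER_SPRITE: usize = 96;

/// Size of one vertex in the old layout.
/// ASSUMPTION: derived by division; the PR does not state this number.
pub const fn old_bytes_per_vertex() -> usize {
    OLD_BYTES_PER_SPRITE / OLD_VERTICES_PER_SPRITE
}

/// Vertex-buffer bytes needed for `sprites` sprites at `bytes_per_sprite` each.
pub const fn buffer_bytes(sprites: usize, bytes_per_sprite: usize) -> usize {
    sprites * bytes_per_sprite
}

/// Fraction of the old per-sprite upload that the instanced layout saves.
pub fn savings_fraction() -> f64 {
    1.0 - NEW_BYTES_PER_SPRITE as f64 / OLD_BYTES_PER_SPRITE as f64
}
```

For 10000 sprites, `buffer_bytes` gives 2160000 bytes for the old layout and 960000 bytes for the instanced one, which is close to the per-frame transfer sizes reported under Performance; `savings_fraction()` works out to roughly 0.56.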

Performance

(Sprite size set to 0.01 to minimize sampler performance impact, buffer size captured from Intel GPA)

  • 10000 sprites in bevymark
    Original: 2150408 bytes transferred per frame, 145-147 fps.
    Instanced: 960004 bytes transferred per frame, 158-161 fps.

@opstic marked this pull request as ready for review June 17, 2023 19:15
@alice-i-cecile added the C-Performance (a change motivated by improving speed, memory usage or compile times) and A-Rendering (drawing game state to the screen) labels Jun 17, 2023
@mockersf
Member

there are FPS regressions for me on both bevymark and many_sprites:

  • many_sprites from ~820 fps to ~760 fps
  • bevymark -- 1000 100 from 45 fps to 39 fps
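
FPS differences are easier to compare once converted to frame times, since the same FPS drop costs very different amounts of time at different baselines. A small sketch of that conversion (the function names are hypothetical, not part of Bevy):

```rust
/// Convert frames per second to milliseconds per frame.
pub fn frame_time_ms(fps: f64) -> f64 {
    1000.0 / fps
}

/// Extra milliseconds per frame when going from `before_fps` to `after_fps`.
/// A positive result means each frame got slower.
pub fn frame_time_delta_ms(before_fps: f64, after_fps: f64) -> f64 {
    frame_time_ms(after_fps) - frame_time_ms(before_fps)
}
```

With the numbers above, `frame_time_delta_ms(820.0, 760.0)` is about 0.1 ms per frame for many_sprites, while `frame_time_delta_ms(45.0, 39.0)` is about 3.4 ms per frame for bevymark.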

@Elabajaba
Contributor

This is (surprisingly) faster on my system. (For bevymark, I'm seeing slightly better minimums and ~112 fps average vs ~90 fps average for our current batching; for many_sprites, perf is the same.)

The old wisdom was that you want to use batching for sprites because you're drawing quads, and GPUs run wavefronts of 32-64 vertices (so you end up with very low utilization per wavefront). However, these days (some? AMD, maybe Nvidia, others idk) GPU drivers can merge instances into wavefronts. (but sometimes it merges and it's still slow? idk)

Basically, need people to test this on a bunch of different hardware and check wavefront occupancy to figure out who does merging and who doesn't and if we're okay with the performance tradeoffs. (or maybe even have both paths, but have it be an option?)

github-merge-queue bot pushed a commit that referenced this pull request Sep 2, 2023
# Objective

- Supersedes #8872
- Improve sprite rendering performance after the regression in #9236 

## Solution

- Use an instance-rate vertex buffer to store per-instance data.
- Store color, UV offset and scale, and a transform per instance.
- Convert Sprite rect, custom_size, anchor, and flip_x/_y to an affine
3x4 matrix and store the transpose of that in the per-instance data.
This is similar to how MeshUniform uses transpose affine matrices.
- Use a special index buffer that has batches of 6 indices referencing 4
vertices. The lower 2 bits indicate the x and y of a quad such that the
corners are:
  ```
  10    11

  00    01
  ```
UVs are implicit but get modified by UV offset and scale. The remaining
upper bits contain the instance index.
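
The index encoding described in the last bullet can be sketched on the CPU side as follows. This is an illustration under assumptions, not the code from the commit: the names are hypothetical, the triangle order in `QUAD_CORNERS` is one possible choice, reading bit 0 as x and bit 1 as y is inferred from the corner diagram, and the `offset + corner * scale` UV formula is a guess at what "modified by UV offset and scale" means.

```rust
/// Indices per quad: two triangles that share two of the four corners.
pub const INDICES_PER_QUAD: usize = 6;

/// Corner ids in draw order. Bit 0 is read as x and bit 1 as y, so that
/// 0b00 / 0b01 form the lower row of the diagram and 0b10 / 0b11 the upper row.
/// ASSUMPTION: this particular triangle order is illustrative only.
pub const QUAD_CORNERS: [u32; INDICES_PER_QUAD] = [0b10, 0b00, 0b01, 0b01, 0b11, 0b10];

/// Build an index buffer for `quad_count` quads. The lower 2 bits of each
/// index hold the corner id and the remaining upper bits hold the instance
/// index. `quad_count` must stay below 2^30 so the shift does not lose bits.
pub fn build_index_buffer(quad_count: u32) -> Vec<u32> {
    let mut indices = Vec::with_capacity(quad_count as usize * INDICES_PER_QUAD);
    for instance in 0..quad_count {
        for &corner in QUAD_CORNERS.iter() {
            indices.push((instance << 2) | corner);
        }
    }
    indices
}

/// Split an index back into `(instance, x, y)`, as a vertex shader would.
pub fn decode_index(index: u32) -> (u32, u32, u32) {
    (index >> 2, index & 1, (index >> 1) & 1)
}

/// Derive the UV for a corner from the per-instance UV offset and scale.
/// ASSUMPTION: the formula `offset + corner * scale` is a guess.
pub fn corner_uv(x: u32, y: u32, uv_offset: [f32; 2], uv_scale: [f32; 2]) -> [f32; 2] {
    [
        uv_offset[0] + x as f32 * uv_scale[0],
        uv_offset[1] + y as f32 * uv_scale[1],
    ]
}
```

Each quad contributes 6 indices but only 4 distinct values, which matches "batches of 6 indices referencing 4 vertices". Decoding any index recovers both the instance to read per-instance data from and the corner of the quad being processed.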

## Benchmarks

I will compare versus `main` before #9236 because the results should be
as good as or faster than that. Running `bevymark -- 10000 16` on an M1
Max with `main` at `e8b38925` in yellow, this PR in red:

![Screenshot 2023-08-27 at 18 44 10](https://github.com/bevyengine/bevy/assets/302146/bdc5c929-d547-44bb-b519-20dce676a316)

Looking at the median frame times, that's a 37% reduction from before.
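
To relate that figure to frame rate: a 37% reduction in frame time corresponds to roughly a 59% increase in frames per second. This assumes the median frame time is representative of the whole run, and the symbols below are introduced only for this illustration.

```latex
t_{\text{new}} = (1 - 0.37)\, t_{\text{old}} = 0.63\, t_{\text{old}},
\qquad
\frac{\text{fps}_{\text{new}}}{\text{fps}_{\text{old}}}
  = \frac{t_{\text{old}}}{t_{\text{new}}}
  = \frac{1}{0.63} \approx 1.59
```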

---

## Changelog

- Changed: Improved sprite rendering performance by leveraging an
instance-rate vertex buffer.

---------

Co-authored-by: Giacomo Stevanato <giaco.stevanato@gmail.com>
@superdump
Contributor

Thanks for the contribution. Closing this as #9597 was merged.

@opstic closed this by deleting the head repository Sep 16, 2023
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024