Use radix sort for sort phase and sprite sorting

## Background

We currently use `sort_by_key` to sort `PhaseItem`s in the sort phase, and `sort_unstable_by` to sort extracted sprites in `queue_sprites`.
* `sort_by_key` - _This sort is stable (i.e., does not reorder equal elements) and O(m * n * log(n)) worst-case, where the key function is O(m)._
* `sort_unstable_by` - _This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case._

For large numbers of items to sort, such as in `bevymark` or `many_sprites` where `n > 100_000`, and given sort keys that are a small and fixed number of bits such as the f32 that we are commonly using, a radix sort could/should be much faster.

## crates

I investigated sorting 10M random `f32`s in the range `0.1f32..1000.0f32` (the default perspective camera near and far planes) and observed the following with the available radix sort crates:
```
1053.388ms      stable
391.499ms       unstable
60.359ms        rdst_single
16.067ms        rdst
79.550ms        voracious_sort
79.945ms        voracious_stable_sort
13.895ms        voracious_mt_sort
65.653ms        radsort
```

* multi-threading
  * `rdst` and `voracious_mt_sort` results above are multi-threaded using `rayon`, all the rest of the results are single-threaded
  * `rdst` has a hard dependency on `rayon`, though it can be configured to do single-threaded sorting. I created an issue about making multi-threading optional as we already have a threadpool in bevy and maybe we don't want to add another: https://github.com/Nessex/rdst/issues/2
  * `voracious_radix_sort` has a non-default multi-threading feature
  * `radsort` is single-threaded only
* According to crates.io, the crates have the following footprints:
  * `radsort` - 17.1kB
  * `rdst` - 34.3kB
  * `voracious_radix_sort` - 178kB
* Flexibility
  * `radsort` was the easiest to integrate using its `sort_by_key` function
  * `rdst` and `voracious_radix_sort` both require implementation of some traits on the type being sorted, and require that the type implement `Copy`. This cannot be done for batched sprites currently because `BatchedPhaseItem` contains a `Range<f32>` for which it is not possible to implement `Copy`

## Proof of Concept

I made a branch here that uses `radsort`: https://github.com/superdump/bevy/tree/render-radix-sort .

The sorting of `ExtractedSprites` in `queue_sprites` is significantly improved by using `radsort::sort_by_key` like this:
```rust
            radsort::sort_by_key(extracted_sprites, |extracted_sprite| {
                (
                    extracted_sprite.transform.translation.z,
                    match extracted_sprite.image_handle_id {
                        HandleId::Id(uuid, id) => {
                            ((uuid.as_u128() & ((1 << 64) - 1)) << 64) | (id as u128)
                        }
                        HandleId::AssetPathId(id) => {
                            ((id.source_path_id().value() as u128) << 64)
                                | (id.label_id().value() as u128)
                        }
                    },
                )
            });
```

With that, on an M1 Max, the median execution time of `queue_sprites` in `bevymark -- 20000 8` over 1500 frames increased from 9.82ms to 17.09ms, which makes no sense to me. In `many_sprites` it decreased from 11.34ms to 9.47ms.

The median execution time of `sort_phase_system` for the relevant phase (`Transparent2d` for sprites, `Opaque3d` for 3D meshes) in `many_cubes -- sphere` it decreased from 0.728ms to 0.094ms. In `bevymark -- 20000 8` it increased from 0.184ms to 1.95ms, which makes no sense. And in `many_sprites` it increased from 0.106ms to 1.19ms.

This is quite confusing. I haven't been able to figure out why the sort performance gets worse. `radsort` claims both best and worst execution time of O(n), space complexity of O(n), and that it is a stable sort so it does not reorder equal items. On the branch, all the `sort_key` implementations are inlined.

## Next steps

* Figure out why `radsort` is _much_ slower in many cases
* Try `voracious_radix_sort` or `rdst` to see if they perform consistently better, though `rdst` would likely not be approved unless its multi-threading were made optional.
  * Make multi-threading optional in `rdst` if it turns out to be a preferable solution due to the smaller crate footprint

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Use radix sort for sort phase and sprite sorting #4291

Background

crates

Proof of Concept

Next steps

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Use radix sort for sort phase and sprite sorting #4291

Description

Background

crates

Proof of Concept

Next steps

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions