Skip to content

Use radix sort for sort phase and sprite sorting #4291

@superdump

Description

@superdump

Background

We currently use sort_by_key to sort PhaseItems in the sort phase, and sort_unstable_by to sort extracted sprites in queue_sprites.

  • sort_by_key - This sort is stable (i.e., does not reorder equal elements) and O(m * n * log(n)) worst-case, where the key function is O(m).
  • sort_unstable_by - This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case.

For large numbers of items to sort, such as in bevymark or many_sprites where n > 100_000, and given sort keys that are a small and fixed number of bits such as the f32 that we are commonly using, a radix sort could/should be much faster.

crates

I investigated sorting 10M random f32s in the range 0.1f32..1000.0f32 (the default perspective camera near and far planes) and observed the following with the available radix sort crates:

1053.388ms      stable
391.499ms       unstable
60.359ms        rdst_single
16.067ms        rdst
79.550ms        voracious_sort
79.945ms        voracious_stable_sort
13.895ms        voracious_mt_sort
65.653ms        radsort
  • multi-threading
    • rdst and voracious_mt_sort results above are multi-threaded using rayon, all the rest of the results are single-threaded
    • rdst has a hard dependency on rayon, though it can be configured to do single-threaded sorting. I created an issue about making multi-threading optional as we already have a threadpool in bevy and maybe we don't want to add another: Make rayon optional nessex/rdst#2
    • voracious_radix_sort has a non-default multi-threading feature
    • radsort is single-threaded only
  • According to crates.io, the crates have the following footprints:
    • radsort - 17.1kB
    • rdst - 34.3kB
    • voracious_radix_sort - 178kB
  • Flexibility
    • radsort was the easiest to integrate using its sort_by_key function
    • rdst and voracious_radix_sort both require implementation of some traits on the type being sorted, and require that the type implement Copy. This cannot be done for batched sprites currently because BatchedPhaseItem contains a Range<f32> for which it is not possible to implement Copy

Proof of Concept

I made a branch here that uses radsort: https://github.com/superdump/bevy/tree/render-radix-sort .

The sorting of ExtractedSprites in queue_sprites is significantly improved by using radsort::sort_by_key like this:

            radsort::sort_by_key(extracted_sprites, |extracted_sprite| {
                (
                    extracted_sprite.transform.translation.z,
                    match extracted_sprite.image_handle_id {
                        HandleId::Id(uuid, id) => {
                            ((uuid.as_u128() & ((1 << 64) - 1)) << 64) | (id as u128)
                        }
                        HandleId::AssetPathId(id) => {
                            ((id.source_path_id().value() as u128) << 64)
                                | (id.label_id().value() as u128)
                        }
                    },
                )
            });

With that, on an M1 Max, the median execution time of queue_sprites in bevymark -- 20000 8 over 1500 frames increased from 9.82ms to 17.09ms, which makes no sense to me. In many_sprites it decreased from 11.34ms to 9.47ms.

The median execution time of sort_phase_system for the relevant phase (Transparent2d for sprites, Opaque3d for 3D meshes) in many_cubes -- sphere it decreased from 0.728ms to 0.094ms. In bevymark -- 20000 8 it increased from 0.184ms to 1.95ms, which makes no sense. And in many_sprites it increased from 0.106ms to 1.19ms.

This is quite confusing. I haven't been able to figure out why the sort performance gets worse. radsort claims both best and worst execution time of O(n), space complexity of O(n), and that it is a stable sort so it does not reorder equal items. On the branch, all the sort_key implementations are inlined.

Next steps

  • Figure out why radsort is much slower in many cases
  • Try voracious_radix_sort or rdst to see if they perform consistently better, though rdst would likely not be approved unless its multi-threading were made optional.
    • Make multi-threading optional in rdst if it turns out to be a preferable solution due to the smaller crate footprint

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-RenderingDrawing game state to the screenC-PerformanceA change motivated by improving speed, memory usage or compile times

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions