-
-
Notifications
You must be signed in to change notification settings - Fork 4.1k
Description
Background
We currently use sort_by_key
to sort PhaseItem
s in the sort phase, and sort_unstable_by
to sort extracted sprites in queue_sprites
.
sort_by_key
- This sort is stable (i.e., does not reorder equal elements) and O(m * n * log(n)) worst-case, where the key function is O(m).sort_unstable_by
- This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case.
For large numbers of items to sort, such as in bevymark
or many_sprites
where n > 100_000
, and given sort keys that are a small and fixed number of bits such as the f32 that we are commonly using, a radix sort could/should be much faster.
crates
I investigated sorting 10M random f32
s in the range 0.1f32..1000.0f32
(the default perspective camera near and far planes) and observed the following with the available radix sort crates:
1053.388ms stable
391.499ms unstable
60.359ms rdst_single
16.067ms rdst
79.550ms voracious_sort
79.945ms voracious_stable_sort
13.895ms voracious_mt_sort
65.653ms radsort
- multi-threading
rdst
andvoracious_mt_sort
results above are multi-threaded usingrayon
, all the rest of the results are single-threadedrdst
has a hard dependency onrayon
, though it can be configured to do single-threaded sorting. I created an issue about making multi-threading optional as we already have a threadpool in bevy and maybe we don't want to add another: Make rayon optional nessex/rdst#2voracious_radix_sort
has a non-default multi-threading featureradsort
is single-threaded only
- According to crates.io, the crates have the following footprints:
radsort
- 17.1kBrdst
- 34.3kBvoracious_radix_sort
- 178kB
- Flexibility
radsort
was the easiest to integrate using itssort_by_key
functionrdst
andvoracious_radix_sort
both require implementation of some traits on the type being sorted, and require that the type implementCopy
. This cannot be done for batched sprites currently becauseBatchedPhaseItem
contains aRange<f32>
for which it is not possible to implementCopy
Proof of Concept
I made a branch here that uses radsort
: https://github.com/superdump/bevy/tree/render-radix-sort .
The sorting of ExtractedSprites
in queue_sprites
is significantly improved by using radsort::sort_by_key
like this:
radsort::sort_by_key(extracted_sprites, |extracted_sprite| {
(
extracted_sprite.transform.translation.z,
match extracted_sprite.image_handle_id {
HandleId::Id(uuid, id) => {
((uuid.as_u128() & ((1 << 64) - 1)) << 64) | (id as u128)
}
HandleId::AssetPathId(id) => {
((id.source_path_id().value() as u128) << 64)
| (id.label_id().value() as u128)
}
},
)
});
With that, on an M1 Max, the median execution time of queue_sprites
in bevymark -- 20000 8
over 1500 frames increased from 9.82ms to 17.09ms, which makes no sense to me. In many_sprites
it decreased from 11.34ms to 9.47ms.
The median execution time of sort_phase_system
for the relevant phase (Transparent2d
for sprites, Opaque3d
for 3D meshes) in many_cubes -- sphere
it decreased from 0.728ms to 0.094ms. In bevymark -- 20000 8
it increased from 0.184ms to 1.95ms, which makes no sense. And in many_sprites
it increased from 0.106ms to 1.19ms.
This is quite confusing. I haven't been able to figure out why the sort performance gets worse. radsort
claims both best and worst execution time of O(n), space complexity of O(n), and that it is a stable sort so it does not reorder equal items. On the branch, all the sort_key
implementations are inlined.
Next steps
- Figure out why
radsort
is much slower in many cases - Try
voracious_radix_sort
orrdst
to see if they perform consistently better, thoughrdst
would likely not be approved unless its multi-threading were made optional.- Make multi-threading optional in
rdst
if it turns out to be a preferable solution due to the smaller crate footprint
- Make multi-threading optional in
Metadata
Metadata
Assignees
Labels
Type
Projects
Status