Description
What problem does this solve or what need does it fill?
Further optimizations would be possible in systems that could iterate over query results in a batched, packed way rather than one item at a time. SIMD is the clearest example.
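For concreteness, here is a rough sketch of the access pattern this would unlock, using plain `f32` slices instead of real components and assuming the matched rows are stored contiguously: a loop over fixed-size chunks is much easier for the compiler (or hand-written SIMD) to vectorize than a per-entity iterator.

```rust
// Hypothetical free function, not Bevy API: process contiguous component
// data in chunks of four, with a scalar fallback for the leftover tail.
fn integrate_batched(positions: &mut [f32], velocities: &[f32], dt: f32) {
    assert_eq!(positions.len(), velocities.len());
    let mut p_chunks = positions.chunks_exact_mut(4);
    let mut v_chunks = velocities.chunks_exact(4);
    // Batched fast path: four lanes at a time, which the optimizer can
    // lower to SIMD loads/stores.
    for (p_chunk, v_chunk) in (&mut p_chunks).zip(&mut v_chunks) {
        for (p, v) in p_chunk.iter_mut().zip(v_chunk) {
            *p += *v * dt;
        }
    }
    // Scalar slow path for whatever didn't fill a whole chunk.
    for (p, v) in p_chunks.into_remainder().iter_mut().zip(v_chunks.remainder()) {
        *p += *v * dt;
    }
}
```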
What solution would you like?
I'd like to be able to use queries in the following way:
```rust
world
    .query_filtered::<Batched<(&mut Position, &Velocity)>, Without<FrozenInPlace>>()
    .par_for_each_mut(
        &task_pool,
        4,
        |(mut positions, velocities): (impl DerefMut<&mut [Position]>, impl Deref<&[Velocity]>)| {
            use core::intrinsics::{assume, likely};
            if unsafe { likely(positions.len() == 4) } {
                unsafe { assume(velocities.len() == 4); } // hopefully done by Bevy
                // some batch fast path
            } else {
                for (mut pos, vel) in positions.iter_mut().zip(velocities.iter()) {
                    // fallback scalar slow path
                }
            }
        },
    );
```
I'm using `Deref` and `DerefMut` informally in this example. I imagine it would be some wrapper like `Res`, but I don't want to suggest any names.
What alternative(s) have you considered?
Manual buffering to reconstruct this information, but it's hard to implement, especially for parallel iteration.
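As a rough illustration of that workaround (component names and the fixed timestep are made up), a system can gather the matching components into contiguous scratch buffers, run the batched pass over plain slices, and scatter the results back, at the cost of an extra copy every frame; doing the same across the threads of `par_for_each_mut` is where it gets hard.

```rust
use bevy::prelude::*;

#[derive(Component, Clone, Copy)]
struct Position(Vec3);

#[derive(Component, Clone, Copy)]
struct Velocity(Vec3);

const DT: f32 = 1.0 / 60.0; // placeholder timestep for the sketch

fn batched_movement(mut query: Query<(&mut Position, &Velocity)>) {
    // Gather: copy matching components into contiguous scratch buffers
    // (the extra copy this feature request wants to avoid).
    let mut positions: Vec<Vec3> = Vec::new();
    let mut velocities: Vec<Vec3> = Vec::new();
    for (pos, vel) in query.iter_mut() {
        positions.push(pos.0);
        velocities.push(vel.0);
    }

    // Process: a batched / SIMD-friendly pass over plain slices.
    for (p, v) in positions.iter_mut().zip(velocities.iter()) {
        *p += *v * DT;
    }

    // Scatter: write the results back, relying on the query visiting
    // entities in the same order as the gather pass within this run.
    for ((mut pos, _), p) in query.iter_mut().zip(positions.iter()) {
        pos.0 = *p;
    }
}
```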
Additional context
For densely stored archetypes, I would like it to return as many spatially contiguous items as it is able to, up to the `batch_size` setting of course. If the total number of compatible rows for a batched query equals `batch_size` but is spread across different archetypes/tables and therefore not contiguous, I would prefer to get N separate batches, each under `batch_size`. For example, with `batch_size = 4` and four matching entities split 3/1 across two tables, the closure would run twice, with batches of length 3 and 1, rather than once with a stitched-together batch of 4. That is what the `&[Component]` API would require anyway.
It would also be great if Bevy could prove to the compiler that all of these component-slices are of the same length. That would perhaps improve optimization.
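As a small illustration of why that helps, a single up-front length check (or, better, a guarantee provided by Bevy itself) lets the compiler drop per-element bounds checks and the length reconciliation inside `zip`, which makes auto-vectorization much more likely. Plain `f32` slices stand in for component storage here:

```rust
fn add_scaled(positions: &mut [f32], velocities: &[f32], dt: f32) {
    // If Bevy could prove the slices are the same length, user code would
    // not need this assert (or an unsafe `assume`) to get the good codegen.
    assert_eq!(positions.len(), velocities.len());
    for (p, v) in positions.iter_mut().zip(velocities.iter()) {
        *p += *v * dt;
    }
}
```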