Skip to content

Conversation

lawnjelly
Copy link
Member

@lawnjelly lawnjelly commented May 9, 2025

Adds new allocation-free Node function for retrieving children from nodes efficiently using LocalVector.

Fixes worst aspect of #106185

MRP

ManyStaticColliders2.zip

(release build)
Before this PR: 160fps
After this PR: 360fps
Physics Interpolation off: 360fps

Note that this PR likely doesn't solve all performance issues in #106185, but seems to ameliorate a particular problem in 4.x. I had to double check the timings, because the difference is so pronounced.

Discussion

#104269 introduced the new 3D FTI and uses a reference approach traversing the entire scene tree which is expected to not be super fast with large numbers of nodes in the tree. This is to be considerably optimized in #105728 (which only updates branches that have changed) and #105901.

However the reason for most of the slowdown in #106185 was unexpected - it was due to changes made for 4.x in #75627 which seem to make it (very?) inefficient to access children via get_child_count() and get_child(). This problem does not exist in 3.x.

Profiling shows this to be the case (this is debug, but should represent release to an extent):

profile_traversal_master

Investigating suggested that get_child_count() and get_child() could be very inefficient, especially in debug with thread checks.

Notes

  • I don't use 4.x regularly, so was somewhat surprised by this inefficiency.
  • Welcome any alternative naming.
  • We should probably review existing code in 4.x that uses the get_child_count() / get_child() pattern and consider replacing.
  • The existing Node::get_children() command should probably become a wrapper around get_children_fast() or use a similar approach, because it is likely also inefficient.

Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. I didn't run the code locally, but I compared it against the existing get_children() and this appears to do the same thing without most of the overhead.

@lawnjelly lawnjelly marked this pull request as draft May 9, 2025 17:50
@lawnjelly lawnjelly force-pushed the fti_get_children_fast branch from b54b49b to 432f7ae Compare May 9, 2025 17:52
@lawnjelly
Copy link
Member Author

lawnjelly commented May 9, 2025

I figured out why the timings were so good! 😁
I forgot an & on the parameter so it was returning 0 each time!

Fixed now, so the improvement won't be quite as good lol.

Tested and it's still loads faster. 👍

Yes the results are still 360fps after the PR (as before), and a benchmark testing just the iteration of the children (20000) shows get_children_fast() is approx 16x faster than the old get_child_count() and get_child() approach.

UPDATE:
Ah, another bug to fix yet in the SceneTreeFTI, just fixing. Can't use a single Vector as the scene tree is recursive.
May have to finish this up tomorrow, with FixedVector or a LocalVector pool.

@lawnjelly lawnjelly marked this pull request as ready for review May 9, 2025 17:55
@lawnjelly lawnjelly marked this pull request as draft May 9, 2025 18:17
@KoBeWi
Copy link
Member

KoBeWi commented May 9, 2025

You could skip children cache if you don't care about order of nodes and don't need to differentiate internal nodes.

if (!s) {
for (int n = 0; n < p_node->get_child_count(); n++) {
_update_dirty_nodes(p_node->get_child(n), p_current_frame, p_interpolation_fraction, p_active, nullptr, p_depth + 1);
for (uint32_t n = 0; n < data.temp_child_list.size(); n++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be changed to range loop.

if (p_include_internal) {
uint32_t num_children = data.children_cache.size();
r_children.resize(num_children);
for (uint32_t n = 0; n < num_children; n++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth tring a memcpy here, it would have a very good chance of using SIMD for the copy.

@lawnjelly
Copy link
Member Author

Actually on reflection based on the recursion problem, I think I might try a simpler approach to this problem tomorrow, and just provide fast access directly to the children cache for SceneTreeFTI, I have a draft PR for this.

It seems it is already cached as a LocalVector, so we can avoid the recursion problem by accessing it directly and avoiding the safety protections which are what is slowing this down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants