-
-
Notifications
You must be signed in to change notification settings - Fork 23.4k
FTI - Add Node::get_children_fast()
for SceneTreeFTI
#106214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. I didn't run the code locally, but I compared it against the existing get_children() and this appears to do the same thing without most of the overhead.
b54b49b
to
432f7ae
Compare
I figured out why the timings were so good! 😁 Fixed now, so the improvement won't be quite as good lol. Tested and it's still loads faster. 👍 Yes the results are still 360fps after the PR (as before), and a benchmark testing just the iteration of the children (20000) shows UPDATE: |
You could skip children cache if you don't care about order of nodes and don't need to differentiate internal nodes. |
if (!s) { | ||
for (int n = 0; n < p_node->get_child_count(); n++) { | ||
_update_dirty_nodes(p_node->get_child(n), p_current_frame, p_interpolation_fraction, p_active, nullptr, p_depth + 1); | ||
for (uint32_t n = 0; n < data.temp_child_list.size(); n++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be changed to range loop.
if (p_include_internal) { | ||
uint32_t num_children = data.children_cache.size(); | ||
r_children.resize(num_children); | ||
for (uint32_t n = 0; n < num_children; n++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth tring a memcpy here, it would have a very good chance of using SIMD for the copy.
Actually on reflection based on the recursion problem, I think I might try a simpler approach to this problem tomorrow, and just provide fast access directly to the children cache for It seems it is already cached as a |
Adds new allocation-free
Node
function for retrieving children from nodes efficiently usingLocalVector
.Fixes worst aspect of #106185
MRP
ManyStaticColliders2.zip
(release build)
Before this PR: 160fps
After this PR: 360fps
Physics Interpolation off: 360fps
Note that this PR likely doesn't solve all performance issues in #106185, but seems to ameliorate a particular problem in 4.x. I had to double check the timings, because the difference is so pronounced.
Discussion
#104269 introduced the new 3D FTI and uses a reference approach traversing the entire scene tree which is expected to not be super fast with large numbers of nodes in the tree. This is to be considerably optimized in #105728 (which only updates branches that have changed) and #105901.
However the reason for most of the slowdown in #106185 was unexpected - it was due to changes made for 4.x in #75627 which seem to make it (very?) inefficient to access children via
get_child_count()
andget_child()
. This problem does not exist in 3.x.Profiling shows this to be the case (this is debug, but should represent release to an extent):
Investigating suggested that
get_child_count()
andget_child()
could be very inefficient, especially in debug with thread checks.Notes
get_child_count()
/get_child()
pattern and consider replacing.Node::get_children()
command should probably become a wrapper aroundget_children_fast()
or use a similar approach, because it is likely also inefficient.