Packed backward pass speedup via unrolled camera position indexing #831
Conversation
Co-authored-by: AbhinavGrover <abhinav.grover@applied.co>
@liruilong940607 Please take a look and let me know if the PR text needs some theoretical justification too.
This is a great finding! But I'm concerned that the current way of writing it (loop over B and C) might lead to very slow speed when B and C are large. Ideally we should vectorize the compute there.
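One way the accumulation could be vectorized, sketched here with made-up shapes and names (`v_dirs`, `camera_ids`, `C` are illustrative, not from the PR), is a single `index_add_` over the flattened camera ids:

```python
import torch

# Hypothetical setup: `v_dirs` holds per-Gaussian gradients and
# `camera_ids` maps each Gaussian to its camera. index_add_ performs
# the per-camera accumulation in one call instead of a Python loop
# over B and C (note: on CUDA, index_add_ still uses atomics
# internally, so this trades loop overhead against write contention).
C = 2
camera_ids = torch.tensor([0, 0, 1, 1, 1])
v_dirs = torch.randn(5, 3)
v_campos = torch.zeros(C, 3).index_add_(0, camera_ids, v_dirs)

# Camera 1 receives the sum of the gradients of its three Gaussians.
assert torch.allclose(v_campos[1], v_dirs[2:].sum(0))
```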
btw, please run the formatter to pass the test
I was using Python 3.10's formatter, which is why the GitHub tests failed. I've switched to Python 3.8's formatter now, so it should pass.
Merging this now! Thank you for looking into this.
…erfstudio-project#831)

* Optimize packed viewdir pass by reducing many-to-one indexing operations
* Copy the indptr to CPU to avoid GPU sync in the loop
* Resolved PR comments
* Format
* typo

Co-authored-by: AbhinavGrover <abhinav.grover@applied.co>
Description
When running 3DGS with `packed=True, pose_opt=True` on a CUDA device, this campos indexing operation in the backward pass takes a huge amount of time. The reason is that the gradients of ALL dirs must be accumulated into the gradients of the camera positions (which are far fewer in number) during the backward pass, and the PyTorch backward CUDA kernel for this operation is very expensive due to the large number of GPU atomic operations it issues. This PR unrolls the indexing and relies on PyTorch's broadcasting for a faster backward pass, while keeping the overall numerical calculation exactly the same.
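A minimal sketch of the idea, with illustrative names (`dirs_packed_gather`, `dirs_packed_unrolled`, and the tiny tensors are assumptions, not the PR's actual code). The baseline gathers `campos[camera_ids]`, whose backward scatters every per-dir gradient with atomic adds; the unrolled form subtracts each camera position from its contiguous slice via broadcasting, so the backward reduces to a per-camera sum:

```python
import torch

def dirs_packed_gather(means, campos, camera_ids):
    # Baseline: many-to-one gather. Its backward scatters every dir's
    # gradient into campos.grad, which on CUDA means many atomic adds.
    return means - campos[camera_ids]

def dirs_packed_unrolled(means, campos, indptr):
    # Unrolled variant: `indptr` is a CSR-style offset list kept on the
    # CPU, so slicing inside the loop does not force a GPU sync each
    # iteration. Broadcasting campos[c] over one contiguous slice makes
    # the backward a cheap sum-reduction per camera instead of atomics.
    return torch.cat([
        means[indptr[c]:indptr[c + 1]] - campos[c]
        for c in range(campos.shape[0])
    ])

means = torch.randn(6, 3)
campos = torch.randn(2, 3, requires_grad=True)
camera_ids = torch.tensor([0, 0, 0, 1, 1, 1])
indptr = [0, 3, 6]  # plain Python ints, i.e. already on the CPU

# Both variants produce identical values and identical gradients.
a = dirs_packed_gather(means, campos, camera_ids)
b = dirs_packed_unrolled(means, campos, indptr)
assert torch.allclose(a, b)
ga, = torch.autograd.grad(a.sum(), campos)
gb, = torch.autograd.grad(b.sum(), campos)
assert torch.allclose(ga, gb)
```

The forward math is unchanged; only the autograd path differs, which is why the numerical results stay exactly the same.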
Training times
I ran `examples/benchmarks/basic.sh` on an RTX 3090. There is a major improvement in training time when `pose_opt` and `packed` are true, and no significant performance effect in the other case.

`batch_size=1 max_steps=5000 packed=true pose_opt=true`
Before this change
After this change
batch_size=1 max_steps=5000 packed=false pose_opt=false
Before this change
After this change