Packed backward pass speedup via unrolled camera position indexing #831
Conversation
Co-authored-by: AbhinavGrover <abhinav.grover@applied.co>
@liruilong940607 Please take a look and let me know if the PR text needs some theoretical justification too.
This is a great finding! But I'm concerned that the current way of writing it (loop over B and C) might lead to very slow speed when B and C are large. Ideally we should vectorize the compute there.
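One way the accumulation could be vectorized, sketched here with made-up shapes and names (`v_dirs`, `camera_ids`, `C` are illustrative, not from the PR), is a single `index_add_` over the flattened camera ids:

```python
import torch

# Hypothetical setup: `v_dirs` holds per-Gaussian gradients and
# `camera_ids` maps each Gaussian to its camera. index_add_ performs
# the per-camera accumulation in one call instead of a Python loop
# over B and C (note: on CUDA, index_add_ still uses atomics
# internally, so this trades loop overhead against write contention).
C = 2
camera_ids = torch.tensor([0, 0, 1, 1, 1])
v_dirs = torch.randn(5, 3)
v_campos = torch.zeros(C, 3).index_add_(0, camera_ids, v_dirs)

# Camera 1 receives the sum of the gradients of its three Gaussians.
assert torch.allclose(v_campos[1], v_dirs[2:].sum(0))
```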
btw, please run the formatter to pass the test
I was using Python 3.10's formatter, which is why the GitHub tests failed. I've switched to Python 3.8's formatter now, so it should pass.
Merging this now! Thank you for looking into this.
…erfstudio-project#831)

* Optimize packed viewdir pass by reducing many-to-one indexing operations
* Copy the indptr to CPU to avoid GPU sync in the loop
* Resolved PR comments
* Format
* typo

Co-authored-by: AbhinavGrover <abhinav.grover@applied.co>
Description
When running 3DGS with `packed=True, pose_opt=True` on a CUDA device, this campos indexing operation in the backward pass takes a huge amount of time. The reason is that the gradients of ALL dirs must be accumulated into the gradients of the camera positions (which are far fewer in number) during the backward pass, and the PyTorch backward CUDA kernel for this operation is very expensive due to the large number of GPU atomic operations it issues. This PR unrolls the indexing and relies on PyTorch's broadcasting for a faster backward pass, while keeping the overall numerical calculation exactly the same.
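A minimal sketch of the idea, with illustrative names (`dirs_packed_gather`, `dirs_packed_unrolled`, and the tiny tensors are assumptions, not the PR's actual code). The baseline gathers `campos[camera_ids]`, whose backward scatters every per-dir gradient with atomic adds; the unrolled form subtracts each camera position from its contiguous slice via broadcasting, so the backward reduces to a per-camera sum:

```python
import torch

def dirs_packed_gather(means, campos, camera_ids):
    # Baseline: many-to-one gather. Its backward scatters every dir's
    # gradient into campos.grad, which on CUDA means many atomic adds.
    return means - campos[camera_ids]

def dirs_packed_unrolled(means, campos, indptr):
    # Unrolled variant: `indptr` is a CSR-style offset list kept on the
    # CPU, so slicing inside the loop does not force a GPU sync each
    # iteration. Broadcasting campos[c] over one contiguous slice makes
    # the backward a cheap sum-reduction per camera instead of atomics.
    return torch.cat([
        means[indptr[c]:indptr[c + 1]] - campos[c]
        for c in range(campos.shape[0])
    ])

means = torch.randn(6, 3)
campos = torch.randn(2, 3, requires_grad=True)
camera_ids = torch.tensor([0, 0, 0, 1, 1, 1])
indptr = [0, 3, 6]  # plain Python ints, i.e. already on the CPU

# Both variants produce identical values and identical gradients.
a = dirs_packed_gather(means, campos, camera_ids)
b = dirs_packed_unrolled(means, campos, indptr)
assert torch.allclose(a, b)
ga, = torch.autograd.grad(a.sum(), campos)
gb, = torch.autograd.grad(b.sum(), campos)
assert torch.allclose(ga, gb)
```

The forward math is unchanged; only the autograd path differs, which is why the numerical results stay exactly the same.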
Training times
I ran `examples/benchmarks/basic.sh` on an RTX 3090. There is a major improvement in training time when `pose_opt` and `packed` are true, and no significant performance effect in the other case.

`batch_size=1 max_steps=5000 packed=true pose_opt=true`
Before this change
After this change
batch_size=1 max_steps=5000 packed=false pose_opt=false
Before this change
After this change