Support Multi-GPU training based on the paper "On Scaling Up 3D Gaussian Splatting Training" #253
Conversation
WIP on cleaning up

It is strange that ... For example, the Egypt dataset from ...

As I have noted, I'm NOT comparing the two, as they are clearly in different settings (the numbers of converged GSs are different).
# Distribute the GSs to different ranks (also works for single rank)
points = points[world_rank::world_size]
rgbs = rgbs[world_rank::world_size]
scales = scales[world_rank::world_size]
@TarzanZhao One question here. Any idea on the best way to split the GSs for multi-GPU initialization? I'm currently essentially splitting them randomly, which might not be ideal for minimizing data transfer?
In our Grendel code, we just assign each GPU a contiguous chunk of the point cloud from colmap. You can try our implementation here: https://github.com/nyu-systems/Grendel-GS/blob/0ea84e456d58946aa9708e1932d7c3466edd6a98/scene/gaussian_model.py#L181
Neither our current solution nor yours may be optimal. However, because Grendel implements load-balancing techniques for the Gaussian distribution during training, an uneven split at the beginning does not significantly impact the overall training speed.
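To make the two options concrete, here is a minimal sketch (not the actual gsplat or Grendel code) contrasting the strided split used in this PR with a contiguous-chunk split; the dummy points tensor and the chunking arithmetic are illustrative assumptions:

import torch

# Illustrative stand-ins: an [N, 3] point cloud and a 4-rank setup.
points = torch.rand(10_000, 3)
world_rank, world_size = 1, 4

# Strided split (this PR): rank r takes points r, r + world_size, r + 2 * world_size, ...
points_strided = points[world_rank::world_size]

# Contiguous-chunk split (Grendel-GS style): rank r takes one contiguous slice.
chunk = (points.shape[0] + world_size - 1) // world_size  # ceil division
points_chunk = points[world_rank * chunk : (world_rank + 1) * chunk]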
gsplat/rendering.py
Outdated
if distributed:
    world_rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()
    N_world = [None] * world_size
    C_world = [None] * world_size
    torch.distributed.all_gather_object(N_world, N)
    torch.distributed.all_gather_object(C_world, C)

    # [TODO] `all_gather` is not differentiable w.r.t. viewmats and Ks
    out_tensor_list = [torch.empty((C_i, 4, 4), device=device) for C_i in C_world]
    torch.distributed.all_gather(out_tensor_list, viewmats.contiguous())
    viewmats = torch.cat(out_tensor_list, dim=0)
    out_tensor_list = [torch.empty((C_i, 3, 3), device=device) for C_i in C_world]
    torch.distributed.all_gather(out_tensor_list, Ks.contiguous())
    Ks = torch.cat(out_tensor_list, dim=0)
    C = len(viewmats)  # Silently change C from local #Cameras to global #Cameras.
@TarzanZhao The current design of this API is that it takes in the local cameras that need to be rendered by this rank. So before the projection stage, an all_gather needs to be done for each rank to get access to all cameras.
I guess another option for the API design is that this function takes in the "global" cameras (all cameras that need to be rendered by all ranks), which would avoid this all_gather operation. But that would mean users have to load the same global set of cameras on each rank and only supervise on a subset of them, which feels a bit counterintuitive from a user-experience perspective.
I personally prefer the first design choice, but that means two all_gather and sync operations here. Do you think the effect of this would be large, in your experience?
Using all_gather_object() is often slow in my experience because it transfers data from the CPU to the GPU, then to another GPU, and finally back to the CPU. It's advisable to avoid this function as much as possible. If you prefer the first design, then please at least try to avoid all_gather_object().
In Grendel's implementation, every GPU keeps the same dataset and uses the same random seed, so all GPUs generate the same batch every time. This is the other option you mentioned. Then there is no need to all-gather cameras.
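If the first design is kept, one way to drop all_gather_object() is to all-gather a small GPU tensor of per-rank sizes and copy it to the CPU once. A minimal sketch, assuming torch.distributed (NCCL) is already initialized and N, C, device are the local values from the snippet above:

import torch
import torch.distributed as dist

def gather_sizes(N: int, C: int, device: torch.device):
    # All-gather the local (N, C) as a small int64 tensor; this stays on the GPU.
    world_size = dist.get_world_size()
    local = torch.tensor([N, C], dtype=torch.int64, device=device)
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)
    # Single device-to-host copy instead of per-object CPU round trips.
    sizes = torch.stack(gathered).cpu()
    N_world = sizes[:, 0].tolist()
    C_world = sizes[:, 1].tolist()
    return N_world, C_world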
# on which cameras they are visible to, which we already figured out in the projection
# stage.
if distributed:
    if packed:
@TarzanZhao This is the sparse all-to-all logic, which we call packed mode (only visible GSs are returned from the projection function).
I'm not sure if I'm implementing this part in the most efficient way. Would love to get your thoughts.
- depth and radii do not have to use the functional version of all-to-all.
- collected_cnts does not need the functional version of all-to-all either.
- [cnt.item() for cnt in cnts] will invoke many separate GPU-to-CPU transfers, so the total latency will be large. You can instead transfer cnts to the CPU with a single call (see the sketch below).
In general, this code launches more kernels than ours: https://github.com/nyu-systems/Grendel-GS/blob/0ea84e456d58946aa9708e1932d7c3466edd6a98/gaussian_renderer/__init__.py#L176
More kernels add significant kernel-launch overhead, especially when images have small resolutions.
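For the third point, a minimal sketch of replacing the per-element .item() calls with one device-to-host transfer; the example cnts values are made up:

import torch

# Per-rank visible-GS counts living on the GPU (dummy values).
cnts = torch.tensor([120, 85, 200, 64], device="cuda")

# Before: one GPU-to-CPU sync per element.
# split_sizes = [cnt.item() for cnt in cnts]

# After: a single GPU-to-CPU transfer for the whole tensor.
split_sizes = cnts.cpu().tolist()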
@TarzanZhao My biggest problem with this PR in its current stage is that it's far from getting a 3.5x speedup on 4 GPUs. I'm not sure how much NVLink plays a role in this, which I don't have in my server setup. But I would love to know if anything can be improved on the implementation side, which I think is mostly just the update in the ...
I think I can live with a ~3x speedup on 4 GPUs, since I'm not implementing the tile-based balancing and GS rebalancing logic. Is that a reasonable expectation?
Using packed mode:
CUDA_VISIBLE_DEVICES=4,5,6,7 python simple_trainer.py --steps_scaler 0.25 --eval_steps -1 --disable_viewer --packed
>>> Step: 7499 {'mem': 2.944546699523926, 'ellipse_time': 1661.8717801570892, 'num_GS': 1356072}
>>> Step: 7499 {'mem': 3.056483268737793, 'ellipse_time': 1661.9940576553345, 'num_GS': 1432245}
>>> Step: 7499 {'mem': 3.076539993286133, 'ellipse_time': 1661.1862392425537, 'num_GS': 1430476}
>>> Step: 7499 {'mem': 3.133754253387451, 'ellipse_time': 1661.065690279007, 'num_GS': 1470615}
>>> PSNR: 27.431, SSIM: 0.8689, LPIPS: 0.075 Time: 0.087s/image Number of GS: 1356072
CUDA_VISIBLE_DEVICES=4,5 python simple_trainer.py --steps_scaler 0.25 --eval_steps -1 --disable_viewer --packed --batch_size 2
>>> Step: 7499 {'mem': 6.096761226654053, 'ellipse_time': 1792.5143909454346, 'num_GS': 2868191}
>>> Step: 7499 {'mem': 6.014559268951416, 'ellipse_time': 1791.8004696369171, 'num_GS': 2786602}
>>> PSNR: 27.403, SSIM: 0.8682, LPIPS: 0.075 Time: 0.075s/image Number of GS: 2786602
As a reference, running the official repo in the same GPU environment (without NVLink) gets:
Note that the LPIPS is not the same LPIPS implementation as in this repo.
TarzanZhao left a comment:
Cool!! I'm very excited that our paper's distributed strategy can be adopted by gsplat so quickly. Thanks so much! I've left some comments and reference code and hope they can help a little bit.
gsplat/rendering.py
Outdated
) # [C_i, N, :]

# collected contains:
radii = collected[..., 0].int()
You can use torch.split() to avoid many kernel launches here.
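A minimal sketch of the torch.split() suggestion; the field widths below are hypothetical and would have to match the actual packing order of collected:

import torch

# Hypothetical packed layout per GS: radii(1), means2d(2), depths(1), conics(3).
collected = torch.rand(4, 100, 7)  # [C_i, N, 7], dummy data

# One split call replaces several per-attribute slicing kernels.
radii, means2d, depths, conics = torch.split(collected, [1, 2, 1, 3], dim=-1)
radii = radii.squeeze(-1).int()
depths = depths.squeeze(-1)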
Overall, I think there are two major reasons:
After some optimization, the training time on 4 GPUs is brought down to 16m25s!
CUDA_VISIBLE_DEVICES=4,5,6,7 python simple_trainer.py --steps_scaler 0.25 --eval_steps -1 --disable_viewer --packed
>>> Step: 7499 {'mem': 2.8663759231567383, 'ellipse_time': 985.6521210670471, 'num_GS': 1352151}
>>> Step: 7499 {'mem': 2.971991539001465, 'ellipse_time': 984.3997764587402, 'num_GS': 1406792}
>>> Step: 7499 {'mem': 3.0060572624206543, 'ellipse_time': 985.4934339523315, 'num_GS': 1454676}
>>> Step: 7499 {'mem': 3.0007195472717285, 'ellipse_time': 985.5467941761017, 'num_GS': 1428084}
CUDA_VISIBLE_DEVICES=4,5 python simple_trainer.py --steps_scaler 0.25 --eval_steps -1 --disable_viewer --packed --batch_size 2
>>> Step: 7499 {'mem': 5.9516401290893555, 'ellipse_time': 1507.71635222435, 'num_GS': 2866948}
>>> Step: 7499 {'mem': 5.851601600646973, 'ellipse_time': 1506.6484203338623, 'num_GS': 2769727}
MCMC on 4 GPUs:
CUDA_VISIBLE_DEVICES=4,5,6,7 python simple_trainer_mcmc.py --steps_scaler 0.25 --eval_steps -1 --packed
>>> Step: 7499 {'mem': 2.4360504150390625, 'ellipse_time': 991.8759768009186, 'num_GS': 1000000}
>>> Step: 7499 {'mem': 2.403714179992676, 'ellipse_time': 992.4375550746918, 'num_GS': 1000000}
>>> Step: 7499 {'mem': 2.42301607131958, 'ellipse_time': 992.5172688961029, 'num_GS': 1000000}
>>> Step: 7499 {'mem': 2.411731243133545, 'ellipse_time': 991.4808518886566, 'num_GS': 1000000}
PSNR: 27.714, SSIM: 0.8744, LPIPS: 0.079 Time: 0.074s/image Number of GS: 1000000
This PR is great! I wonder whether it can be used in a way that lets 1 GPU process multiple batches?
@Ben-Mack For a single GPU, I guess there isn't much you can do other than looping over images?
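A rough, hypothetical sketch of what looping over images on one GPU could look like, accumulating gradients across the batch; rasterize_image and image_loss are placeholder names, not actual gsplat API calls:

def train_step(optimizer, viewmats, Ks, gt_images, width, height, **splats):
    # Process a multi-image batch on a single GPU by looping and accumulating gradients.
    optimizer.zero_grad()
    batch_size = len(gt_images)
    for viewmat, K, gt in zip(viewmats, Ks, gt_images):
        # rasterize_image / image_loss stand in for the real render and loss calls.
        render = rasterize_image(viewmat[None], K[None], width, height, **splats)
        loss = image_loss(render, gt) / batch_size  # average the loss over the batch
        loss.backward()  # gradients accumulate across iterations
    optimizer.step()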
Support Multi-GPU training based on the paper "On Scaling Up 3D Gaussian Splatting Training" (nerfstudio-project#253)
* checkin the code
* nicer API
* mcmc script now works with multigpu
* trainer supports multi gpu
* get rid of redundant code
* func doc
* support packed mode
* format
* more exp
* multi GPU viewer
* optim
* cleanup
* cleanup
* merge main
* MCMC
* doc
* scripts
* scripts and performance
Co-authored-by: Ruilong Li <397653553@qq.com>
Paper link: https://daohanlu.github.io/scaling-up-3dgs/
Latest results:
Scripts:
bash benchmarks/basic_4gpus.sh