
Conversation

@liruilong940607 (Collaborator) commented Jul 3, 2024

Paper link: https://daohanlu.github.io/scaling-up-3dgs/

Latest results:

[Screenshot: latest benchmark results table, 2024-08-03]

Scripts: bash benchmarks/basic_4gpus.sh

@liruilong940607 (Collaborator, Author)

WIP on cleaning up

@ichsan2895 (Contributor) commented Jul 8, 2024

It is strange that Gsplat-MCMC performs worse than Gsplat-Default.
I have shown some experiments on the nerfstudio Discord server: Gsplat-MCMC got better results despite having a lower GS count. If it performs worse, try changing a hyperparameter such as --opacity_reg 0.001 instead of the default 0.01.

For example, on the Egypt dataset from ns-download-data:
python3 gsplat/examples/simple_trainer.py --data-dir path/to/egypt --data_factor 1
PSNR: 22.138, SSIM: 0.7643, LPIPS: 0.272 Time: 0.063s/image Number of GS: 5803430

python3 gsplat/examples/simple_trainer_mcmc.py --data-dir path/to/egypt --data_factor 1
PSNR: 24.001, SSIM: 0.7919, LPIPS: 0.277 Time: 0.026s/image Number of GS: 1000000

@liruilong940607 (Collaborator, Author)

> It is strange that Gsplat-MCMC performs worse than Gsplat-Default. […]

As I noted, I'm NOT comparing the two, since they are clearly in different settings (the number of GSs they converge to is different).

Comment on lines +193 to +196
# Distribute the GSs to different ranks (also works for single rank)
points = points[world_rank::world_size]
rgbs = rgbs[world_rank::world_size]
scales = scales[world_rank::world_size]
Collaborator (Author):

@TarzanZhao One question here. Any idea on the best way to split the GSs for multi-GPU initialization? I'm currently splitting them essentially at random, which might not be ideal for minimizing data transfer.

Collaborator:

In our Grendel code, we just assign each GPU a contiguous chunk of the point cloud from colmap. You can try our implementation here: https://github.com/nyu-systems/Grendel-GS/blob/0ea84e456d58946aa9708e1932d7c3466edd6a98/scene/gaussian_model.py#L181

Neither our current solution nor yours may be optimal. However, because Grendel implements load-balancing techniques for the Gaussian distribution during training, an uneven split at the beginning does not significantly impact overall training speed.
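
A minimal sketch of the two splitting strategies being discussed (the PR's strided slicing versus a Grendel-style contiguous chunk per rank); it assumes `points` is the [N, 3] COLMAP point cloud and that `world_rank`/`world_size` come from torch.distributed, and is illustrative rather than the exact code of either repo:

import torch

def split_strided(points: torch.Tensor, world_rank: int, world_size: int) -> torch.Tensor:
    # What the PR snippet above does: every rank takes every world_size-th point.
    return points[world_rank::world_size]

def split_contiguous(points: torch.Tensor, world_rank: int, world_size: int) -> torch.Tensor:
    # Grendel-style: each rank takes one contiguous chunk of the point cloud.
    chunk = (points.shape[0] + world_size - 1) // world_size
    return points[world_rank * chunk : (world_rank + 1) * chunk]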

Comment on lines 243 to 258
if distributed:
    world_rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()
    N_world = [None] * world_size
    C_world = [None] * world_size
    torch.distributed.all_gather_object(N_world, N)
    torch.distributed.all_gather_object(C_world, C)

    # [TODO] `all_gather` is not differentiable w.r.t. viewmats and Ks
    out_tensor_list = [torch.empty((C_i, 4, 4), device=device) for C_i in C_world]
    torch.distributed.all_gather(out_tensor_list, viewmats.contiguous())
    viewmats = torch.cat(out_tensor_list, dim=0)
    out_tensor_list = [torch.empty((C_i, 3, 3), device=device) for C_i in C_world]
    torch.distributed.all_gather(out_tensor_list, Ks.contiguous())
    Ks = torch.cat(out_tensor_list, dim=0)
    C = len(viewmats)  # Silently change C from local #Cameras to global #Cameras.
Collaborator (Author):

@TarzanZhao The current design of this API is that it takes in the local cameras that need to be rendered by this rank. So before the projection stage, an all_gather needs to be done for each rank to get access to all cameras.

I guess another option for the API design is that this function takes in the "global" cameras (all cameras to be rendered by all ranks), which would avoid this all_gather operation. But that would mean the user has to load the same global set of cameras on each rank and only supervise on a subset of them, which feels a bit counter-intuitive from a user-experience perspective.

I personally prefer the first design choice, but that means two all_gather and sync operations here. Do you think the effect of this would be large, in your experience?

Collaborator:

Using all_gather_object() is often slow in my experience because it transfers data from the CPU to the GPU, then to another GPU, and finally back to the CPU. It's advisable to avoid this function as much as possible. If you prefer the first design, then at least try to avoid using all_gather_object().

In Grendel's implementation, every GPU keeps the same dataset and uses the same random seed, so the same batch is generated on every GPU at each step. This is the other option you mentioned; then there is no need to all-gather the cameras.
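
For reference, a minimal sketch of gathering the per-rank camera counts with a plain tensor all_gather instead of all_gather_object(); the helper name gather_counts and its signature are illustrative, not part of the PR:

import torch
import torch.distributed as dist

def gather_counts(C: int, device: torch.device) -> list:
    # Gather the local camera count C from every rank without pickling through the CPU.
    world_size = dist.get_world_size()
    local = torch.tensor([C], dtype=torch.int64, device=device)
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)
    # A single device-to-host copy for all counts at once.
    return torch.cat(gathered).cpu().tolist()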

# on which cameras they are visible to, which we already figured out in the projection
# stage.
if distributed:
    if packed:
Collaborator (Author):

@TarzanZhao This is the sparse all-to-all logic, which we call packed mode (only the visible GSs are returned from the projection function).

I'm not sure if I'm implementing this part in the most efficient way. Would love to get your thoughts.

Collaborator:

  1. depth and radii do not have to use the functional version of all2all.
  2. collected_cnts does not need the functional version of all2all either.
  3. [cnt.item() for cnt in cnts] will invoke many rounds of GPU-CPU communication, so the total latency will be large. You can instead transfer cnts to the CPU in a single call, as sketched below.

In general, this code has more kernels than ours: https://github.com/nyu-systems/Grendel-GS/blob/0ea84e456d58946aa9708e1932d7c3466edd6a98/gaussian_renderer/__init__.py#L176

More kernels add significant kernel-launch overhead, especially when images have small resolutions.
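
A minimal sketch of point 3 above, assuming `cnts` is a list of 0-dim GPU tensors (the variable names come from the comment, not from the PR's actual code):

# Before: one GPU-CPU synchronization per element.
# sizes = [cnt.item() for cnt in cnts]

# After: a single device-to-host transfer for all counts.
sizes = torch.stack(list(cnts)).cpu().tolist()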

@liruilong940607 (Collaborator, Author) commented Jul 18, 2024

@TarzanZhao My biggest problem with this PR at its current stage is that it's far from getting a 3.5x speedup on 4 GPUs. I'm not sure how much NVLink plays a role in this, which I don't have in my server setup. But I would love to know if anything can be improved on the implementation side, which I think is mostly just the update in the rasterization() function.

@liruilong940607 (Collaborator, Author)

I think I can live with a ~3x speedup on 4 GPUs, because I'm not implementing the tile-based balancing and GS rebalancing logic. Is that a reasonable expectation?

@liruilong940607 (Collaborator, Author) commented Jul 18, 2024

Using packed mode:

  • 4 GPUs with batch size 1 per GPU (effective batch size 4)
CUDA_VISIBLE_DEVICES=4,5,6,7 python simple_trainer.py --steps_scaler 0.25 --eval_steps -1 --disable_viewer --packed
>>> Step:  7499 {'mem': 2.944546699523926, 'ellipse_time': 1661.8717801570892, 'num_GS': 1356072}
>>> Step:  7499 {'mem': 3.056483268737793, 'ellipse_time': 1661.9940576553345, 'num_GS': 1432245}
>>> Step:  7499 {'mem': 3.076539993286133, 'ellipse_time': 1661.1862392425537, 'num_GS': 1430476}
>>> Step:  7499 {'mem': 3.133754253387451, 'ellipse_time': 1661.065690279007, 'num_GS': 1470615}
>>> PSNR: 27.431, SSIM: 0.8689, LPIPS: 0.075 Time: 0.087s/image Number of GS: 1356072
  • 2 GPUs with batch size 2 per GPU (effective batch size 4)
CUDA_VISIBLE_DEVICES=4,5 python simple_trainer.py --steps_scaler 0.25 --eval_steps -1 --disable_viewer --packed --batch_size 2
>>> Step:  7499 {'mem': 6.096761226654053, 'ellipse_time': 1792.5143909454346, 'num_GS': 2868191}
>>> Step:  7499 {'mem': 6.014559268951416, 'ellipse_time': 1791.8004696369171, 'num_GS': 2786602}
>>> PSNR: 27.403, SSIM: 0.8682, LPIPS: 0.075 Time: 0.075s/image Number of GS: 2786602

@liruilong940607 (Collaborator, Author) commented Jul 19, 2024

As a reference, running the official repo in the same GPU environment (without NVLink) gives:

  • 4 GPUs with total batch size 4:
# 4g_4b.sh
torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py \
        -s ${dataset_folder}/${scene} \
        --images ${image_folder} \
        --llffhold 8 \
        --iterations 30000 \
        --log_interval 250 \
        --model_path ${expe_folder}/4g_4b/${expe_name} \
        --bsz $BSZ \
        $monitor_opts \
        --test_iterations 30000 \
        --save_iterations 30000 \
        --eval

Training Finished after 15m30s
[ITER 29997] Evaluating test: L1 0.028009627014398575 PSNR 27.24338150024414 [19/07 17:24:22]
[ITER 29997] Evaluating train: L1 0.02116350270807743 PSNR 29.852651596069336 [19/07 17:24:23]

Note the LPIPS is not the same LPIPS as in this repo.

@TarzanZhao (Collaborator) left a comment

Cool!! I'm very excited that our paper's distributed strategy can be adopted by gsplat so quickly. Thanks so much! I've left some comments and reference code and hope they can help a little bit.


) # [C_i, N, :]

# collected contains:
radii = collected[..., 0].int()
Collaborator:

You can use torch.split() to avoid many kernel launches here.
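
A minimal sketch of the torch.split() suggestion; the channel widths below are hypothetical placeholders for however `collected` actually packs its per-Gaussian attributes:

# One split call yields views for all attributes instead of slicing column
# ranges one by one; widths [1, 2, 1, 3] are illustrative only.
radii, means2d, depths, conics = torch.split(collected, [1, 2, 1, 3], dim=-1)
radii = radii.squeeze(-1).int()  # [C_i, N]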

@TarzanZhao (Collaborator) commented Jul 19, 2024

> @TarzanZhao My biggest problem with this PR at its current stage is that it's far from getting a 3.5x speedup on 4 GPUs. […]

Overall, I think there are two major reasons:

  1. Data loading might be slow.
  2. There are more kernel launches than in our implementation, but this is related to the data format required by gsplat. You want the code to be clear and modular, with some functions decoupled, so some additional data-format-conversion kernels are needed. I think this is reasonable.

@liruilong940607 (Collaborator, Author) commented Aug 1, 2024

After some optimization, the 4-GPU training time is brought down to 16m25s!

  • 4 GPUs:
CUDA_VISIBLE_DEVICES=4,5,6,7 python simple_trainer.py --steps_scaler 0.25 --eval_steps -1 --disable_viewer --packed

>>> Step:  7499 {'mem': 2.8663759231567383, 'ellipse_time': 985.6521210670471, 'num_GS': 1352151}
>>> Step:  7499 {'mem': 2.971991539001465, 'ellipse_time': 984.3997764587402, 'num_GS': 1406792}
>>> Step:  7499 {'mem': 3.0060572624206543, 'ellipse_time': 985.4934339523315, 'num_GS': 1454676}
>>> Step:  7499 {'mem': 3.0007195472717285, 'ellipse_time': 985.5467941761017, 'num_GS': 1428084}
  • 2 GPUs:
CUDA_VISIBLE_DEVICES=4,5 python simple_trainer.py --steps_scaler 0.25 --eval_steps -1 --disable_viewer --packed --batch_size 2

>>> Step:  7499 {'mem': 5.9516401290893555, 'ellipse_time': 1507.71635222435, 'num_GS': 2866948}
>>> Step:  7499 {'mem': 5.851601600646973, 'ellipse_time': 1506.6484203338623, 'num_GS': 2769727}

@liruilong940607 (Collaborator, Author)

MCMC on 4 GPUs

CUDA_VISIBLE_DEVICES=4,5,6,7 python simple_trainer_mcmc.py --steps_scaler 0.25 --eval_steps -1 --packed

>>> Step:  7499 {'mem': 2.4360504150390625, 'ellipse_time': 991.8759768009186, 'num_GS': 1000000}
>>> Step:  7499 {'mem': 2.403714179992676, 'ellipse_time': 992.4375550746918, 'num_GS': 1000000}
>>> Step:  7499 {'mem': 2.42301607131958, 'ellipse_time': 992.5172688961029, 'num_GS': 1000000}
>>> Step:  7499 {'mem': 2.411731243133545, 'ellipse_time': 991.4808518886566, 'num_GS': 1000000}

PSNR: 27.714, SSIM: 0.8744, LPIPS: 0.079 Time: 0.074s/image Number of GS: 1000000

@liruilong940607 liruilong940607 merged commit f92fd3f into main Aug 3, 2024
@liruilong940607 liruilong940607 deleted the grendel branch August 3, 2024 07:21
@Ben-Mack commented Aug 6, 2024

This PR is great! I wonder whether it can be used in a way that lets 1 GPU process multiple batches, so that much larger scenes could be trained with just 1 GPU (larger in terms of the number of Gaussians, the number of images, or ideally both). I think this single-GPU use case could help a lot of people train high-quality GS without a high-end GPU. @liruilong940607 Do you think it is possible?

@liruilong940607 (Collaborator, Author)

@Ben-Mack For a single GPU, I guess there isn't much you can do other than looping over images?
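
A minimal sketch of that idea, i.e. gradient accumulation over images on one GPU; trainloader, render_splats, and compute_loss are placeholders rather than gsplat's actual API:

for view_batch in trainloader:                       # hypothetical data loader
    optimizer.zero_grad(set_to_none=True)
    for view in view_batch:                          # render one image at a time
        render = render_splats(view)                 # placeholder render call
        loss = compute_loss(render, view.image) / len(view_batch)
        loss.backward()                              # gradients accumulate across views
    optimizer.step()                                 # one update per effective batch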

scy5335 pushed a commit to scy5335/gsplat that referenced this pull request Mar 21, 2025
…ian Splatting Training" (nerfstudio-project#253)

* checkin the code

* nicer API

* mcmc script now can works with multigpu

* trainer supports multi gpu

* get rid of reduduant code

* func doc

* support packed mode

* format

* more exp

* multi GPU viewer

* optim

* cleanup

* cleanup

* merge main

* MCMC

* doc

* scripts

* scripts and performance

---------

Co-authored-by: Ruilong Li <397653553@qq.com>