
@Latios96

Recently, we got a new NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition in the lab. Curious about the performance of this card, I tried out PBRT and noticed it was not as fast as I expected. Other benchmarks showed the expected results, so I suspected PBRT was not saturating this GPU enough. Digging around in the source code, I found that the queue size is fixed to 1024 * 1024, along with this TODO comment.

This PR proposes a fix for the TODO. Instead of using a fixed queue size, it checks how much GPU memory is available after allocating the scene and bases the queue size on that. In scenes with high memory usage, CUDA will report that no memory is available anymore, yet the scene will still render fine. In this case, a fixed small amount of memory is used for the queues. The amount was determined through experimentation, see below.

Overall, the performance is increased. When GPU memory is available, a larger queue size leads to higher performance because the GPU is better saturated. When the GPU memory is already fully covered by the scene, a smaller queue size leads to higher performance because of fewer memory transfers. I did not investigate whether changing the queue size influences CPU performance.

Performance Measurements

To measure performance improvement, one can think of three possible scenarios:

  1. The memory usage of the scene is negligible compared to the available GPU memory
  2. The memory usage of the scene is significant, but does not cover the whole GPU memory
  3. The scene covers the whole GPU memory

I ran the tests on the following GPUs:

| Name | Memory | SMs | Frequency | Shader Model |
|---|---|---|---|---|
| NVIDIA GeForce RTX 3080 Laptop GPU | 15993.125 MiB | 48 | 1545 MHz | 8.6 |
| NVIDIA GeForce RTX 4070 Ti | 12281.375 MiB | 60 | 2730 MHz | 8.9 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | 97225.5625 MiB | 188 | 2280 MHz | 12.0 |

I selected the test scenes pbrt-book, landscape, watercolor and disney-cloud.

On the 3080, all scenes fit into memory.
On the 4070, all scenes except the watercolor scene fit into memory.
On the RTX PRO 6000, the memory usage of all scenes is negligible compared to the available memory.

Since PBRT launches a single sample per pixel at once, the queue size is limited to the number of pixels in the image. To see how this influences the performance (for larger images the speedup should be higher), I rendered the scenes in Full HD and 4K.

Determining a good minimal amount of memory

Since the watercolor scene covers all GPU memory on the 4070, I used it as a test scenario.
I rendered the scene using different minimum amounts of memory for the queues:

| Minimal Memory | Render Time Full HD | Render Time 4K |
|---|---|---|
| 100 MB | 500s | 4200s |
| 200 MB | 383s | 5300s |
| 300 MB | 678s | 6400s |

I chose 100 MB as the minimum amount of memory for the queues. It is not the fastest for Full HD, but the speedup in 4K is larger.

Unfortunately, the measurements depend quite a bit on what else is going on on the GPU. Even when doing nothing besides running pbrt, the numbers are sometimes hard to reproduce. But overall, using a much smaller queue size than 1024 * 1024 gives faster results when the GPU memory is fully used.

Results

The measurements were taken with 100 MB as the minimal amount of memory for the queues.

NVIDIA GeForce RTX 3080 Laptop GPU

| Scene Name | max queue size baseline | max queue size improved | queue size change | render time baseline | render time improved | render time improvement |
|---|---|---|---|---|---|---|
| disney-cloud 4k | 1036800 | 4147200 | 300.0% | 2975.018s | 2834.971s | 4.71% |
| disney-cloud fullhd | 1036800 | 2073600 | 100.0% | 728.706s | 735.707s | -0.96% |
| landscape 4k | 1036800 | 2073600 | 100.0% | 98.755s | 97.77s | 1.0% |
| landscape fullhd | 1036800 | 2073600 | 100.0% | 94.156s | 97.806s | -3.88% |
| pbrt-book 4k | 1036800 | 8294400 | 700.0% | 154.649s | 148.67s | 3.87% |
| pbrt-book fullhd | 1036800 | 2073600 | 100.0% | 39.184s | 37.386s | 4.59% |
| watercolor 4k | 1036800 | 2073600 | 100.0% | 384.215s | 378.665s | 1.44% |
| watercolor fullhd | 1036800 | 2073600 | 100.0% | 99.077s | 94.884s | 4.23% |

NVIDIA GeForce RTX 4070 Ti

| Scene Name | max queue size baseline | max queue size improved | queue size change | render time baseline | render time improved | render time improvement |
|---|---|---|---|---|---|---|
| disney-cloud 4k | 1036800 | 4147200 | 300.0% | 1431.653s | 1215.619s | 15.09% |
| disney-cloud fullhd | 1036800 | 2073600 | 100.0% | 366.198s | 324.755s | 11.32% |
| landscape 4k | 1036800 | 2073600 | 100.0% | 42.882s | 39.356s | 8.22% |
| landscape fullhd | 1036800 | 2073600 | 100.0% | 42.768s | 39.427s | 7.81% |
| pbrt-book 4k | 1036800 | 4147200 | 300.0% | 86.599s | 74.244s | 14.27% |
| pbrt-book fullhd | 1036800 | 2073600 | 100.0% | 22.596s | 20.361s | 9.89% |
| watercolor 4k | 1036800 | 53760 | -94.81% | 5741.823s | 4200.00s | 26.00% |
| watercolor fullhd | 1036800 | 53760 | -94.81% | 2166.88s | 500.00s | 76.00% |

NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

| Scene Name | max queue size baseline | max queue size improved | queue size change | render time baseline | render time improved | render time improvement |
|---|---|---|---|---|---|---|
| disney-cloud 4k | 1036800 | 8294400 | 700.0% | 989.652s | 603.667s | 39.0% |
| disney-cloud fullhd | 1036800 | 2073600 | 100.0% | 274.601s | 204.597s | 25.49% |
| landscape 4k | 1036800 | 2073600 | 100.0% | 22.215s | 19.036s | 14.31% |
| landscape fullhd | 1036800 | 2073600 | 100.0% | 22.215s | 19.023s | 14.37% |
| pbrt-book 4k | 1036800 | 8294400 | 700.0% | 43.567s | 36.391s | 16.47% |
| pbrt-book fullhd | 1036800 | 2073600 | 100.0% | 11.945s | 9.95s | 16.7% |
| watercolor 4k | 1036800 | 8294400 | 700.0% | 108.759s | 51.622s | 52.54% |
| watercolor fullhd | 1036800 | 2073600 | 100.0% | 28.311s | 19.921s | 29.64% |

Discussion

Overall, the queue sizes are larger than before, except for watercolor on the 4070. Since the watercolor scene already covers the GPU memory of the 4070, the queue size is smaller there.

The 3080 does not show any notable performance improvements, probably because the GPU is already saturated with a smaller queue size.

The 4070 shows slight improvements overall, but an extreme improvement of 76% for the watercolor scene in Full HD. In this case, the queue size is smaller, leading to fewer memory transfers. However, for the 4K version the speedup is not as large, probably because the film needs more memory in this case.

The RTX PRO 6000 shows large speedups overall, up to 2x for watercolor 4k and about 40% for disney-cloud 4k. The speedups are largest for the 4K renderings. For Full HD, the queue size is probably still too small to saturate the GPU, making the performance gain smaller.

Overall, the performance is increased, but at the cost of more memory. One could think of adding an option to cap the memory usage at x gigabytes. I would argue that since only the available GPU memory is used, it's okay to just use all of it to achieve maximum performance. Especially on large GPUs like the RTX PRO 6000 this seems necessary, because otherwise the GPU runs well under its potential.

@pbrt4bounty
Contributor

Hi.. it seems that Matt is quite busy, so in the meantime I merged this change on my fork: pbrt4bounty@c901762, also adding a log message in 'verbose' mode.
