
@Latios96

Recently, we got a new NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition in the lab. Curious about the performance of this card, I tried out PBRT and noticed it was not as fast as I expected. Other benchmarks showed the expected results, so I suspected PBRT was not saturating this GPU enough. Digging around in the source code, I found that the queue size is fixed to 1024 * 1024, along with this TODO comment.

This PR proposes a fix for the TODO. Instead of using a fixed queue size, it checks how much GPU memory is available after allocating the scene and bases the queue size on that. In scenes with high memory usage, CUDA will report that no memory is available anymore, yet the scene will still render fine. In this case, a fixed small amount of memory is used for the queues. The amount was determined through experimentation, see below.

Overall, the performance is increased. When GPU memory is available, a larger queue size leads to higher performance because the GPU is better saturated. When the GPU memory is already fully covered by the scene, a smaller queue size leads to higher performance because of fewer memory transfers. I did not investigate whether changing the queue size influences CPU performance.

Performance Measurements

To measure performance improvement, one can think of three possible scenarios:

  1. The memory usage of the scene is negligible compared to the available GPU memory
  2. The memory usage of the scene is significant, but does not cover the whole GPU memory
  3. The scene covers the whole GPU memory

I ran the tests on the following GPUs:

| Name | Memory | SMs | Frequency | Shader Model |
|---|---|---|---|---|
| NVIDIA GeForce RTX 3080 Laptop GPU | 15993.125 MiB | 48 | 1545 MHz | 8.6 |
| NVIDIA GeForce RTX 4070 Ti | 12281.375 MiB | 60 | 2730 MHz | 8.9 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | 97225.5625 MiB | 188 | 2280 MHz | 12.0 |

I selected the test scenes pbrt-book, landscape, watercolor and disney-cloud.

On the 3080, all scenes fit into memory.
On the 4070, all scenes except the watercolor scene fit into memory.
On the RTX PRO 6000, the memory usage of all scenes is negligible compared to the available memory.

Since PBRT launches a single sample per pixel at once, the queue size is limited to the number of pixels in the image. To see how this influences the performance (for larger images the speedup should be higher), I rendered the scenes in Full HD and 4K.

Determining a good minimal amount of memory

Since the watercolor scene covers all GPU memory on the 4070, I used it as a test scenario.
I rendered the scene using different minimum amounts of memory for the queues:

| Minimal Memory | Render Time Full HD | Render Time 4K |
|---|---|---|
| 100 MB | 500s | 4200s |
| 200 MB | 383s | 5300s |
| 300 MB | 678s | 6400s |

I chose 100 MB as the minimum amount of memory for the queues. It is not the fastest for Full HD, but the speedup in 4K is larger.

Unfortunately, the measurements depend quite a bit on what else is going on on the GPU. Even when doing nothing besides running pbrt, the numbers are sometimes hard to reproduce. But overall, using a much smaller queue size than 1024 * 1024 gives faster results when the GPU memory is fully used.

Results

The measurements were taken with 100 MB as the minimal amount of memory for the queues.

NVIDIA GeForce RTX 3080 Laptop GPU

| Scene Name | max queue size baseline | max queue size improved | queue size change | render time baseline | render time improved | render time improvement |
|---|---|---|---|---|---|---|
| disney-cloud 4k | 1036800 | 4147200 | 300.0% | 2975.018s | 2834.971s | 4.71% |
| disney-cloud fullhd | 1036800 | 2073600 | 100.0% | 728.706s | 735.707s | -0.96% |
| landscape 4k | 1036800 | 2073600 | 100.0% | 98.755s | 97.77s | 1.0% |
| landscape fullhd | 1036800 | 2073600 | 100.0% | 94.156s | 97.806s | -3.88% |
| pbrt-book 4k | 1036800 | 8294400 | 700.0% | 154.649s | 148.67s | 3.87% |
| pbrt-book fullhd | 1036800 | 2073600 | 100.0% | 39.184s | 37.386s | 4.59% |
| watercolor 4k | 1036800 | 2073600 | 100.0% | 384.215s | 378.665s | 1.44% |
| watercolor fullhd | 1036800 | 2073600 | 100.0% | 99.077s | 94.884s | 4.23% |

NVIDIA GeForce RTX 4070 Ti

| Scene Name | max queue size baseline | max queue size improved | queue size change | render time baseline | render time improved | render time improvement |
|---|---|---|---|---|---|---|
| disney-cloud 4k | 1036800 | 4147200 | 300.0% | 1431.653s | 1215.619s | 15.09% |
| disney-cloud fullhd | 1036800 | 2073600 | 100.0% | 366.198s | 324.755s | 11.32% |
| landscape 4k | 1036800 | 2073600 | 100.0% | 42.882s | 39.356s | 8.22% |
| landscape fullhd | 1036800 | 2073600 | 100.0% | 42.768s | 39.427s | 7.81% |
| pbrt-book 4k | 1036800 | 4147200 | 300.0% | 86.599s | 74.244s | 14.27% |
| pbrt-book fullhd | 1036800 | 2073600 | 100.0% | 22.596s | 20.361s | 9.89% |
| watercolor 4k | 1036800 | 53760 | -94.81% | 5741.823s | 4200.00s | 26.00% |
| watercolor fullhd | 1036800 | 53760 | -94.81% | 2166.88s | 500.00s | 76.00% |

NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

| Scene Name | max queue size baseline | max queue size improved | queue size change | render time baseline | render time improved | render time improvement |
|---|---|---|---|---|---|---|
| disney-cloud 4k | 1036800 | 8294400 | 700.0% | 989.652s | 603.667s | 39.0% |
| disney-cloud fullhd | 1036800 | 2073600 | 100.0% | 274.601s | 204.597s | 25.49% |
| landscape 4k | 1036800 | 2073600 | 100.0% | 22.215s | 19.036s | 14.31% |
| landscape fullhd | 1036800 | 2073600 | 100.0% | 22.215s | 19.023s | 14.37% |
| pbrt-book 4k | 1036800 | 8294400 | 700.0% | 43.567s | 36.391s | 16.47% |
| pbrt-book fullhd | 1036800 | 2073600 | 100.0% | 11.945s | 9.95s | 16.7% |
| watercolor 4k | 1036800 | 8294400 | 700.0% | 108.759s | 51.622s | 52.54% |
| watercolor fullhd | 1036800 | 2073600 | 100.0% | 28.311s | 19.921s | 29.64% |

Discussion

Overall, the queue sizes are larger than before, except for watercolor on the 4070. Since the watercolor scene already covers the GPU memory of the 4070, the queue size is smaller there.

The 3080 does not show any notable performance improvements, probably because the GPU is already saturated with a smaller queue size.

The 4070 shows slight improvements overall, but an extreme improvement of 76% for the watercolor scene in Full HD. In this case, the queue size is smaller, leading to fewer memory transfers. However, for the 4K version the speedup is not as large, probably because the film needs more memory in this case.

The RTX PRO 6000 shows large speedups overall, up to 2x for watercolor 4k and about 40% for disney-cloud 4k. The speedups are largest for the 4K renderings. For Full HD, the queue size is probably still too small to saturate the GPU, making the performance gain smaller.

Overall, the performance is increased, but at the cost of more memory. One could think of adding an option to cap the memory usage at x gigabytes. I would argue that since only the available GPU memory is used, it's okay to just use all of it to achieve maximum performance. Especially on large GPUs like the RTX PRO 6000 this seems necessary, because otherwise the GPU runs well under its potential.

@pbrt4bounty
Contributor

Hi.. it seems that Matt is quite busy, so in the meantime I merged this change on my fork: pbrt4bounty@c901762, also adding a log message in 'verbose' mode.
