base gpu queue size on the amount of available memory #511
Recently, we got a new NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition in the lab. Curious about the performance of this card, I tried out PBRT and noticed the performance was not as fast as I expected. Other benchmarks showed the expected results, so I suspected PBRT was not saturating the GPU. Digging around in the source code, I found that the queue size is fixed to 1024 * 1024, along with a TODO comment about it. This PR proposes a fix for that TODO. Instead of using a fixed queue size, it checks how much GPU memory is available after allocating the scene and bases the queue size on that. In scenes with high memory usage, CUDA will report that no memory is available anymore, but the scene will still render fine. In this case, a fixed small amount of memory is used for the queues. The amount was determined through experimentation, see below.
Overall, the performance is increased. When GPU memory is available, increasing the queue size leads to higher performance, because the GPU is better saturated. When the GPU memory is already fully covered by the scene, using a smaller queue size leads to higher performance because of fewer memory transfers. It was not investigated whether changing the queue size has an influence on the CPU performance.
Performance Measurements
To measure performance improvement, one can think of three possible scenarios:
I ran the tests on the following GPUs:
I selected the test scenes pbrt-book, landscape, watercolor and disney-cloud.
On the 3080, all scenes fit into memory.
On the 4070, all scenes except the watercolor scene fit into memory.
On the A6000, the memory usage of all scenes is negligible compared to the available memory.
Since PBRT launches a single sample per pixel at once, the queue size is limited to the number of pixels in the image. To see how this influences the performance (for larger images the speedup should be higher), I rendered the scenes in Full HD and 4K.
Determining a good minimal amount of memory
Since the watercolor scene covers all GPU memory in the 4070, I used this as a test scenario.
I rendered the scene using different minimum amounts of memory for the queues:
I chose 100 MB as the minimum amount of memory for the queues. It is not the fastest for Full HD, but the speedup in 4K is larger.
Unfortunately, the measurements depend quite a bit on what else is running on the GPU. Even when doing nothing besides running pbrt, the numbers are sometimes hard to reproduce. But overall, using a much smaller queue size than 1024 * 1024 gives faster results when the GPU memory is fully used.
Results
The measurements were taken with 100 MB as the minimum amount of memory for the queues.
NVIDIA GeForce RTX 3080 Laptop GPU
NVIDIA GeForce RTX 4070 Ti
NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
Discussion
Overall, the queue sizes are larger than before, except for watercolor on the 4070. Since the watercolor scene already covers the 4070's GPU memory, the queue size is smaller there.
The 3080 does not show any notable performance improvements, probably because the GPU is already saturated with a smaller queue size.
The 4070 shows slight improvements overall, but an extreme improvement of 76% for the watercolor scene in Full HD. In this case, the queue size is smaller, leading to fewer memory transfers. However, for the 4K version the speedup is not as large, probably because the film needs more memory in that case.
The A6000 overall shows large speedups: up to 2x for watercolor in 4K, and about 40% for disney-cloud in 4K. The speedups are largest for the 4K renderings. For Full HD, the queue size is probably still too small to saturate the GPU, making the performance gain smaller.
Overall, the performance is increased, but this requires more memory. One could consider adding an option to cap the memory usage at x gigabytes. I would argue that since only the available GPU memory is used, it is okay to just use all of it to achieve maximum performance. Especially on large GPUs like the A6000 this seems necessary, because otherwise the GPU runs well under its potential.