- Precompute inverse ray direction for BVH traversal (eliminates 3 reciprocal ops per AABB test)
- Add __launch_bounds__(256) to render kernel for better register allocation
- GPU-side converged pixel counting via warp-shuffle reduction (replaces host-side D2H copy)
- Dedicated compute stream for CUDA render kernel (enables overlap with display pipeline)
- OptiX: dedicated CUDA stream for launches (replaces cudaDeviceSynchronize)
- OptiX: GPU-side gamma correction + pinned memory async D2H pipeline
- OptiX: proper resource cleanup for new stream/buffer resources

Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
Create website/docs/how-it-works/gpu-performance.md documenting:
- Precomputed inverse ray direction for BVH traversal
- __launch_bounds__ on render kernel
- GPU-side converged pixel counting (warp-shuffle)
- Dedicated CUDA streams for render and display
- OptiX dedicated render stream
- OptiX GPU-side gamma correction with pinned memory
- Summary comparison table of all techniques

Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
…formatting

- Move d_converged_count from local static to file-scope s_d_converged_count with cleanup
- Add getOptiXStream() helper to reduce stream fallback pattern duplication
- Fix digit grouping in documentation (65,536 instead of 65 536)

Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
Copilot (AI) changed the title from **[WIP] Optimize performance of OptiX and CUDA renderers for path-tracing** to **GPU performance optimizations for CUDA and OptiX real-time path tracing** on Mar 18, 2026.
**Owner:** @copilot Have the changes been applied or not?
**Contributor (Author):** Yes, all changes are applied and pushed across 4 commits.

CUDA renderer (4 optimizations):
OptiX renderer (3 optimizations):
Documentation:
Note: this is a CUDA/OptiX codebase, so there is no way to compile-verify in this CI environment (no GPU/CUDA toolkit). The changes are structurally correct and follow existing patterns in the codebase.
**Owner:** Bug fixes included:
Targeted latency and throughput optimizations for the CUDA and OptiX rendering pipelines, focusing on eliminating redundant computation in hot loops, replacing blocking synchronization with stream-based concurrency, and moving work from host to device.
**CUDA renderer**

- **Precomputed inverse ray direction** — `hit_aabb()` was computing `1/dir.{x,y,z}` per AABB test; it is now computed once per ray in `hit_scene()` and passed through, eliminating 3 `fdiv` per node visited.
- **`__launch_bounds__(256)` on `renderAccKernel`** — gives nvcc an explicit occupancy target for register allocation on this register-heavy kernel.
- **GPU-side converged pixel counting** — replaces `countConvergedPixels()` (full `int[]` D2H copy + host loop) with a warp-shuffle reduction kernel; transfers 4 bytes instead of ~3.5 MB at 720p.
- **Dedicated compute stream** — the render kernel launches on `s_compute_stream` with `cudaStreamSynchronize()` instead of `cudaDeviceSynchronize()`, enabling overlap with the display pipeline.

**OptiX renderer**
- **Dedicated render stream** — `optixLaunch()` plus async param upload on a non-blocking stream, with a `getOptiXStream()` helper. Replaces the global device sync.
- **GPU-side gamma correction with pinned-memory async D2H** — a new `optixRendererConvertAccumToDisplay()` does the float4→uint8 conversion on device, then async-copies ~2.7 MB (uint8 RGB) instead of downloading ~14 MB (float4 RGBA) for host-side conversion. Applied to both the offline and progressive renderers.
- **Proper cleanup** for all new resources (streams, pinned buffers, device counters).
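A minimal sketch of that async display path, under assumed names — `convertKernel`, `downloadDisplay`, `d_accum`, and `d_rgb8` are illustrative, not the repository's actual symbols (the real conversion sits behind `optixRendererConvertAccumToDisplay()`):

```cuda
#include <cuda_runtime.h>

// Hypothetical conversion kernel declaration; the repo's actual kernel
// and its signature are not shown in this PR description.
__global__ void convertKernel(const float4* accum, unsigned char* rgb8,
                              int width, int height);

// Sketch of the async display path. h_pinned must come from cudaHostAlloc();
// pageable memory would silently degrade the async copy to a blocking one.
void downloadDisplay(const float4* d_accum, unsigned char* d_rgb8,
                     unsigned char* h_pinned, int width, int height,
                     cudaStream_t stream) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    // 1. Gamma-correct and pack float4 RGBA -> uint8 RGB on the device.
    convertKernel<<<grid, block, 0, stream>>>(d_accum, d_rgb8, width, height);

    // 2. Copy 3 bytes/pixel instead of 16 (~2.7 MB vs ~14 MB at 720p).
    cudaMemcpyAsync(h_pinned, d_rgb8, (size_t)width * height * 3,
                    cudaMemcpyDeviceToHost, stream);

    // 3. Wait on this stream only; other streams keep the GPU busy.
    cudaStreamSynchronize(stream);
}
```

Synchronizing the dedicated stream rather than the whole device is what lets the display transfer overlap with the next render launch.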
**Bug fixes**
- `d_converged_count` was a function-local `static` with no cleanup path. Moved to file-scope `s_d_converged_count`, freed in `cleanupCudaStreams()`.

**Documentation**
- `website/docs/how-it-works/gpu-performance.md` documenting all techniques, with code snippets and estimated impact.
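The GPU-side converged pixel counting from the CUDA renderer section can be sketched as a warp-shuffle reduction. This is a sketch under assumed names — `countConvergedKernel` and its parameters are illustrative; in the repository the device counter corresponds to the file-scope `s_d_converged_count` mentioned under Bug fixes:

```cuda
#include <cuda_runtime.h>

// Counts pixels whose "converged" flag is set, accumulating into a single
// device int, so the host reads back 4 bytes instead of the whole flag array.
__global__ void countConvergedKernel(const int* __restrict__ converged,
                                     int n, int* d_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // No early return: every lane stays active so the full-warp mask below
    // is valid; out-of-range threads simply contribute 0.
    int v = (i < n && converged[i]) ? 1 : 0;

    // Warp-shuffle tree reduction: sums 32 lane values, no shared memory.
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    // One atomicAdd per warp (not per thread) into the global counter.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(d_count, v);
}
```

Typical host-side use would reset the counter with `cudaMemsetAsync`, launch the kernel on the compute stream, then `cudaMemcpyAsync` the single `int` back, replacing the full-array D2H copy plus host loop.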