GPU performance optimizations for CUDA and OptiX real-time path tracing and firefly HDRI artifacts #26

Merged
pmudry merged 7 commits into main from copilot/performance-and-bugs on Mar 18, 2026
Contributor

Copilot AI commented Mar 18, 2026

Targeted latency and throughput optimizations for the CUDA and OptiX rendering pipelines, focusing on eliminating redundant computation in hot loops, replacing blocking synchronization with stream-based concurrency, and moving work from host to device.

CUDA renderer

  • Precomputed inverse ray direction in BVH traversal: hit_aabb() was computing 1/dir.{x,y,z} per AABB test; now computed once per ray in hit_scene() and passed through. Eliminates 3 fdiv per node visited.
const f3 inv_dir(1.0f / r.dir.x, 1.0f / r.dir.y, 1.0f / r.dir.z);
// reused across all hit_aabb() calls in this traversal
  • __launch_bounds__(256) on renderAccKernel: gives nvcc an explicit upper bound on threads per block, so it can budget registers for occupancy on this register-heavy kernel.

  • GPU-side converged pixel counting — replaced countConvergedPixels() (full int[] D2H copy + host loop) with a warp-shuffle reduction kernel. Transfers 4 bytes instead of ~3.5 MB at 720p.

  • Dedicated compute stream — render kernel launches on s_compute_stream with cudaStreamSynchronize() instead of cudaDeviceSynchronize(), enabling overlap with the display pipeline.
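
The precomputed inverse direction can be illustrated with a host-side slab test. This is only a sketch in plain C++: `f3` and `inv_dir` come from the snippet above, while `Ray`, `Aabb`, `make_inv_dir()` and the exact `hit_aabb()` signature are hypothetical stand-ins, not the renderer's actual definitions.

```cpp
#include <algorithm>

// Hypothetical stand-ins for the renderer's types (assumed layout).
struct f3 { float x, y, z; };
struct Ray { f3 orig, dir; };
struct Aabb { f3 lo, hi; };

// Done once per ray, before traversal starts.
inline f3 make_inv_dir(const Ray& r) {
    return {1.0f / r.dir.x, 1.0f / r.dir.y, 1.0f / r.dir.z};
}

// Slab test that only multiplies by the precomputed reciprocal;
// no division happens per box.
inline bool hit_aabb(const Aabb& b, const Ray& r, const f3& inv_dir,
                     float t_min, float t_max) {
    const float o[3]   = {r.orig.x, r.orig.y, r.orig.z};
    const float inv[3] = {inv_dir.x, inv_dir.y, inv_dir.z};
    const float lo[3]  = {b.lo.x, b.lo.y, b.lo.z};
    const float hi[3]  = {b.hi.x, b.hi.y, b.hi.z};
    for (int a = 0; a < 3; ++a) {
        float t0 = (lo[a] - o[a]) * inv[a];
        float t1 = (hi[a] - o[a]) * inv[a];
        if (inv[a] < 0.0f) std::swap(t0, t1);  // ray travels in -axis direction
        t_min = std::max(t_min, t0);
        t_max = std::min(t_max, t1);
        if (t_max < t_min) return false;       // slab intervals do not overlap
    }
    return true;
}
```

A traversal loop would call `make_inv_dir()` once and pass the result to every `hit_aabb()` call, which is where the three divisions per visited node are saved.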

OptiX renderer

  • Dedicated render stream: optixLaunch() + async param upload on a non-blocking stream, with a getOptiXStream() helper. Replaces global device sync.

  • GPU-side gamma correction with pinned memory async D2H — new optixRendererConvertAccumToDisplay() does float4→uint8 conversion on device, then async-copies ~2.7 MB (uint8 RGB) instead of downloading ~14 MB (float4 RGBA) for host-side conversion. Applied to both offline and progressive renderers.

  • Proper cleanup for all new resources (streams, pinned buffers, device counters).
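
To make the transfer sizes concrete, here is a host-side C++ model of the float4→uint8 conversion together with the buffer-size arithmetic. It is a sketch under assumptions: `Float4`, `to_u8()` and `convert_accum_to_display()` are hypothetical names, the accumulation buffer is assumed to hold per-pixel sums scaled by `inv_samples`, and a plain gamma 2.2 curve is assumed because the PR does not state which transfer curve optixRendererConvertAccumToDisplay() uses.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float4 { float x, y, z, w; };  // hypothetical stand-in for CUDA's float4

// Assumed transfer curve: clamp to [0, 1], apply gamma 1/2.2, quantize to 8 bits.
inline std::uint8_t to_u8(float linear) {
    const float c = std::min(std::max(linear, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(std::lround(std::pow(c, 1.0f / 2.2f) * 255.0f));
}

// Per-pixel work: one float4 in, three bytes out (alpha is dropped).
inline std::vector<std::uint8_t> convert_accum_to_display(
        const std::vector<Float4>& accum, float inv_samples) {
    std::vector<std::uint8_t> rgb(accum.size() * 3);
    for (std::size_t i = 0; i < accum.size(); ++i) {
        rgb[3 * i + 0] = to_u8(accum[i].x * inv_samples);
        rgb[3 * i + 1] = to_u8(accum[i].y * inv_samples);
        rgb[3 * i + 2] = to_u8(accum[i].z * inv_samples);
    }
    return rgb;
}

// Bytes that would cross the bus in each scheme.
constexpr std::size_t float4_bytes(std::size_t w, std::size_t h) { return w * h * sizeof(Float4); }
constexpr std::size_t rgb8_bytes(std::size_t w, std::size_t h)   { return w * h * 3; }
```

Assuming the same 1280x720 frame as the 720p figure quoted for the CUDA renderer, `float4_bytes(1280, 720)` is 14,745,600 bytes and `rgb8_bytes(1280, 720)` is 2,764,800 bytes, which matches the ~14 MB and ~2.7 MB figures above.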

Bug fixes

  • Fixed memory leak: d_converged_count was a function-local static with no cleanup path. Moved to file-scope s_d_converged_count freed in cleanupCudaStreams().

Documentation

  • New website/docs/how-it-works/gpu-performance.md documenting all techniques with code snippets and estimated impact.

Copilot AI and others added 4 commits March 18, 2026 07:10
- Precompute inverse ray direction for BVH traversal (eliminates 3 reciprocal ops per AABB test)
- Add __launch_bounds__(256) to render kernel for better register allocation
- GPU-side converged pixel counting via warp-shuffle reduction (replaces host-side D2H copy)
- Dedicated compute stream for CUDA render kernel (enables overlap with display pipeline)
- OptiX: dedicated CUDA stream for launches (replaces cudaDeviceSynchronize)
- OptiX: GPU-side gamma correction + pinned memory async D2H pipeline
- OptiX: proper resource cleanup for new stream/buffer resources

Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
Create website/docs/how-it-works/gpu-performance.md documenting:
- Precomputed inverse ray direction for BVH traversal
- __launch_bounds__ on render kernel
- GPU-side converged pixel counting (warp-shuffle)
- Dedicated CUDA streams for render and display
- OptiX dedicated render stream
- OptiX GPU-side gamma correction with pinned memory
- Summary comparison table of all techniques

Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
…formatting

- Move d_converged_count from local static to file-scope s_d_converged_count with cleanup
- Add getOptiXStream() helper to reduce stream fallback pattern duplication
- Fix digit grouping in documentation (65,536 instead of 65 536)

Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Optimize performance of OptiX and CUDA renderers for path-tracing" to "GPU performance optimizations for CUDA and OptiX real-time path tracing" on Mar 18, 2026
Copilot AI requested a review from pmudry March 18, 2026 07:21
Owner

pmudry commented Mar 18, 2026

@copilot Have the changes been applied or not ?

Contributor Author

Copilot AI commented Mar 18, 2026

@copilot Have the changes been applied or not ?

Yes, all changes are applied and pushed across 4 commits (de91001 through eb355ab). Here's the summary of the 9 files changed (461 insertions, 53 deletions):

CUDA renderer (4 optimizations):

  • cuda_raytracer.cuh — precomputed inverse ray direction for BVH traversal
  • shaders/render_acc_kernel.cu + .cuh: __launch_bounds__(256) + GPU-side converged pixel counting kernel
  • renderer_cuda_device.cu — dedicated compute stream, GPU pixel counting, memory leak fix
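
The converged-pixel counting listed above relies on a warp-level tree reduction. The following is a CPU model of that pattern in plain C++, not the kernel from the PR: `warp_reduce_sum()` and `count_converged()` are hypothetical names, and the convention that a non-zero per-pixel entry means "converged" is an assumption.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr int kWarpSize = 32;

// Models a shuffle-down tree: at each step lane i adds the value held by
// lane i + offset. On hardware, out-of-range lanes read their own value back;
// here they are simply left unchanged, which does not affect lane 0.
inline int warp_reduce_sum(std::array<int, kWarpSize> lanes) {
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const std::array<int, kWarpSize> prev = lanes;  // every lane reads before any lane writes
        for (int lane = 0; lane + offset < kWarpSize; ++lane)
            lanes[lane] = prev[lane] + prev[lane + offset];
    }
    return lanes[0];  // only lane 0 holds the full warp total
}

// Each group of 32 pixels is reduced like one warp; `total` stands in for the
// single device-side counter that would be copied back (4 bytes).
inline int count_converged(const std::vector<int>& converged_flags) {
    int total = 0;
    const std::size_t n = converged_flags.size();
    for (std::size_t base = 0; base < n; base += kWarpSize) {
        std::array<int, kWarpSize> lanes{};  // lanes past the end contribute 0
        for (int lane = 0; lane < kWarpSize && base + lane < n; ++lane)
            lanes[lane] = converged_flags[base + lane] != 0 ? 1 : 0;
        total += warp_reduce_sum(lanes);
    }
    return total;
}
```

On the GPU, the inner step would be a shuffle intrinsic rather than an array copy, and lane 0 of each warp would add its partial sum to the counter atomically; only the 4-byte counter then needs a device-to-host copy.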

OptiX renderer (3 optimizations):

  • optix/optix_renderer.cu — dedicated render stream, GPU-side gamma correction kernel with pinned memory async D2H
  • renderer_optix_host.hpp + renderer_optix_progressive_host.hpp — both offline and progressive renderers now use GPU gamma correction

Documentation:

  • website/docs/how-it-works/gpu-performance.md — new page documenting all 6 techniques
  • website/mkdocs.yml — navigation entry added

Note: this is a CUDA/OptiX codebase so there's no way to compile-verify in this CI environment (no GPU/CUDA toolkit). The changes are structurally correct and follow existing patterns in the codebase.

pmudry changed the title from "GPU performance optimizations for CUDA and OptiX real-time path tracing" to "GPU performance optimizations for CUDA and OptiX real-time path tracing and firefly HDRI artifacts" on Mar 18, 2026
Owner

pmudry commented Mar 18, 2026

Bug fixes included:

  • OptiX black streaks on camera move — cudaMemsetAsync on the render stream prevents a race between the accumulation buffer reset (stream 0) and the OptiX launch (non-blocking stream).
  • CUDA white frame on camera move with adaptive sampling — same race: cudaMemsetAsync on s_compute_stream for both the accum buffer and per-pixel sample count reset.
  • White-dot firefly artifacts with HDR environment maps — per-sample luminance clamp (max 20.0, luminance-preserving) applied before accumulation in both CUDA and OptiX kernels.
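
One plausible reading of the per-sample luminance clamp is a uniform rescale of the sample whenever its luminance exceeds the limit, so that the channel ratios (and therefore the hue) are kept. The sketch below assumes Rec. 709 luminance weights and hypothetical names `luminance()` and `clamp_luminance()`; only the limit of 20.0 comes from the comment above.

```cpp
// Hypothetical stand-in for the renderer's RGB vector type.
struct f3 { float x, y, z; };

// Assumed weights: Rec. 709 relative luminance.
inline float luminance(const f3& c) {
    return 0.2126f * c.x + 0.7152f * c.y + 0.0722f * c.z;
}

// Scales the whole sample down so its luminance is at most max_lum.
// Samples at or below the limit are returned untouched.
inline f3 clamp_luminance(const f3& c, float max_lum = 20.0f) {
    const float lum = luminance(c);
    if (!(lum > max_lum)) return c;
    const float scale = max_lum / lum;
    return {c.x * scale, c.y * scale, c.z * scale};
}
```

Applied to each sample before it is added to the accumulation buffer, a single very bright HDRI texel can then contribute at most a luminance of 20.0 to a pixel, at the cost of some energy loss in genuinely bright regions.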

@pmudry pmudry marked this pull request as ready for review March 18, 2026 19:24
@pmudry pmudry merged commit 1f9f770 into main Mar 18, 2026
1 check passed
@pmudry pmudry deleted the copilot/performance-and-bugs branch March 18, 2026 19:24