- Precompute inverse ray direction for BVH traversal (eliminates 3 reciprocal ops per AABB test)
- Add __launch_bounds__(256) to render kernel for better register allocation
- GPU-side converged pixel counting via warp-shuffle reduction (replaces host-side D2H copy)
- Dedicated compute stream for CUDA render kernel (enables overlap with display pipeline)
- OptiX: dedicated CUDA stream for launches (replaces cudaDeviceSynchronize)
- OptiX: GPU-side gamma correction + pinned memory async D2H pipeline
- OptiX: proper resource cleanup for new stream/buffer resources

Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
Create website/docs/how-it-works/gpu-performance.md documenting:
- Precomputed inverse ray direction for BVH traversal
- __launch_bounds__ on render kernel
- GPU-side converged pixel counting (warp-shuffle)
- Dedicated CUDA streams for render and display
- OptiX dedicated render stream
- OptiX GPU-side gamma correction with pinned memory
- Summary comparison table of all techniques

Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
…formatting

- Move d_converged_count from local static to file-scope s_d_converged_count with cleanup
- Add getOptiXStream() helper to reduce stream fallback pattern duplication
- Fix digit grouping in documentation (65,536 instead of 65 536)

Co-authored-by: pmudry <4624112+pmudry@users.noreply.github.com>
Copilot (AI) changed the title from **[WIP] Optimize performance of OptiX and CUDA renderers for path-tracing** to **GPU performance optimizations for CUDA and OptiX real-time path tracing** on Mar 18, 2026.
**Owner:** @copilot Have the changes been applied or not?
**Contributor (Author):** Yes, all changes are applied and pushed across 4 commits.

CUDA renderer (4 optimizations):
OptiX renderer (3 optimizations):
Documentation:
Note: this is a CUDA/OptiX codebase, so there is no way to compile-verify in this CI environment (no GPU/CUDA toolkit). The changes are structurally correct and follow existing patterns in the codebase.
**Owner:** Bug fixes included:
Targeted latency and throughput optimizations for the CUDA and OptiX rendering pipelines, focusing on eliminating redundant computation in hot loops, replacing blocking synchronization with stream-based concurrency, and moving work from host to device.
**CUDA renderer**

- **Precomputed inverse ray direction** — `hit_aabb()` was computing `1/dir.{x,y,z}` per AABB test; it is now computed once per ray in `hit_scene()` and passed through, eliminating 3 `fdiv` per node visited.
- **`__launch_bounds__(256)` on `renderAccKernel`** — gives nvcc an explicit occupancy target for register allocation on this register-heavy kernel.
- **GPU-side converged pixel counting** — replaces `countConvergedPixels()` (full `int[]` D2H copy + host loop) with a warp-shuffle reduction kernel; transfers 4 bytes instead of ~3.5 MB at 720p.
- **Dedicated compute stream** — the render kernel launches on `s_compute_stream` with `cudaStreamSynchronize()` instead of `cudaDeviceSynchronize()`, enabling overlap with the display pipeline.

**OptiX renderer**
- **Dedicated render stream** — `optixLaunch()` plus async param upload on a non-blocking stream, with a `getOptiXStream()` helper. Replaces the global device sync.
- **GPU-side gamma correction with pinned-memory async D2H** — a new `optixRendererConvertAccumToDisplay()` does the float4→uint8 conversion on device, then async-copies ~2.7 MB (uint8 RGB) instead of downloading ~14 MB (float4 RGBA) for host-side conversion. Applied to both the offline and progressive renderers.
- **Proper cleanup** for all new resources (streams, pinned buffers, device counters).
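A minimal sketch of that async display path, under assumed names — `convertKernel`, `downloadDisplay`, `d_accum`, and `d_rgb8` are illustrative, not the repository's actual symbols (the real conversion sits behind `optixRendererConvertAccumToDisplay()`):

```cuda
#include <cuda_runtime.h>

// Hypothetical conversion kernel declaration; the repo's actual kernel
// and its signature are not shown in this PR description.
__global__ void convertKernel(const float4* accum, unsigned char* rgb8,
                              int width, int height);

// Sketch of the async display path. h_pinned must come from cudaHostAlloc();
// pageable memory would silently degrade the async copy to a blocking one.
void downloadDisplay(const float4* d_accum, unsigned char* d_rgb8,
                     unsigned char* h_pinned, int width, int height,
                     cudaStream_t stream) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    // 1. Gamma-correct and pack float4 RGBA -> uint8 RGB on the device.
    convertKernel<<<grid, block, 0, stream>>>(d_accum, d_rgb8, width, height);

    // 2. Copy 3 bytes/pixel instead of 16 (~2.7 MB vs ~14 MB at 720p).
    cudaMemcpyAsync(h_pinned, d_rgb8, (size_t)width * height * 3,
                    cudaMemcpyDeviceToHost, stream);

    // 3. Wait on this stream only; other streams keep the GPU busy.
    cudaStreamSynchronize(stream);
}
```

Synchronizing the dedicated stream rather than the whole device is what lets the display transfer overlap with the next render launch.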
**Bug fixes**
- `d_converged_count` was a function-local `static` with no cleanup path. Moved to file-scope `s_d_converged_count`, freed in `cleanupCudaStreams()`.

**Documentation**
- `website/docs/how-it-works/gpu-performance.md` documenting all techniques, with code snippets and estimated impact.
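The GPU-side converged pixel counting from the CUDA renderer section can be sketched as a warp-shuffle reduction. This is a sketch under assumed names — `countConvergedKernel` and its parameters are illustrative; in the repository the device counter corresponds to the file-scope `s_d_converged_count` mentioned under Bug fixes:

```cuda
#include <cuda_runtime.h>

// Counts pixels whose "converged" flag is set, accumulating into a single
// device int, so the host reads back 4 bytes instead of the whole flag array.
__global__ void countConvergedKernel(const int* __restrict__ converged,
                                     int n, int* d_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // No early return: every lane stays active so the full-warp mask below
    // is valid; out-of-range threads simply contribute 0.
    int v = (i < n && converged[i]) ? 1 : 0;

    // Warp-shuffle tree reduction: sums 32 lane values, no shared memory.
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    // One atomicAdd per warp (not per thread) into the global counter.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(d_count, v);
}
```

Typical host-side use would reset the counter with `cudaMemsetAsync`, launch the kernel on the compute stream, then `cudaMemcpyAsync` the single `int` back, replacing the full-array D2H copy plus host loop.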