
perf: remove DisjointMut guards in DSP hot loops, use raw pointers in fn signatures#1493

Open
silentnoisehun wants to merge 5 commits into memorysafety:main from silentnoisehun:perf/remove-ffisafe-dsp-params


silentnoisehun commented May 18, 2026

Summary

This PR removes DisjointMut guards from the innermost DSP decoding loops and replaces the by-value FFISafeRav1dPictureDataComponentOffset parameter in DSP function signatures with a raw pointer, eliminating per-iteration guard overhead and per-call argument-passing overhead in hot paths.

Performance

Measured on a 1920x1080, 900-frame AV1 decode benchmark:

|        | C gap (vs dav1d) |
|--------|------------------|
| Before | 6.4% slower      |
| After  | ~4.6% slower     |

~28% reduction in the performance gap vs C dav1d.
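The ~28% figure is the relative shrinkage of the gap, (6.4 − 4.6) / 6.4. A minimal check of that arithmetic, assuming the "~4.6%" value is taken at face value (the helper name `relative_reduction` is hypothetical):

```rust
/// Fraction by which a slowdown gap shrank, e.g. from 6.4% to 4.6% slower.
fn relative_reduction(before_pct: f64, after_pct: f64) -> f64 {
    (before_pct - after_pct) / before_pct
}

fn main() {
    let r = relative_reduction(6.4, 4.6);
    println!("gap reduced by {:.1}%", r * 100.0); // prints "gap reduced by 28.1%"
}
```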

Correctness verified: decoded YUV output matches C dav1d reference (identical MD5 on test_long.ivf).

Changes

  • Replace FFISafeRav1dPictureDataComponentOffset (a 16-byte struct, which the Windows x64 ABI passes via a hidden pointer) with *const Rav1dPictureDataComponentOffset (8 bytes, register-sized) in all DSP fn signatures: cdef.rs, filmgrain.rs, ipred.rs, itx.rs, loopfilter.rs, looprestoration.rs, mc.rs
  • Remove DisjointMut guards in put_rust, prep_rust, filter_8tap, put_8tap_rust
  • Raw pointer bypass in prep_8tap_rust
  • Raw pointer bypass in put_bilin_rust, prep_bilin_rust, filter_bilin
  • Raw pointer bypass in put_8tap_scaled_rust
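To illustrate the kind of per-iteration cost the guard removal targets, here is a simplified sketch contrasting a guarded block copy with a raw-pointer one. It assumes a `RefCell` borrow per row as a stand-in for a per-access guard; `Guarded`, `put_guarded` and `put_raw` are hypothetical names, and this is neither rav1d's DisjointMut nor its actual put_rust.

```rust
use std::cell::RefCell;

/// Hypothetical guarded buffer: every access takes a runtime-checked borrow.
/// This only mimics the cost shape of a per-access guard; rav1d's DisjointMut
/// tracks disjoint ranges and works differently.
struct Guarded {
    data: RefCell<Vec<u8>>,
}

/// Guarded block copy: a borrow (with its runtime check) is taken on every row.
fn put_guarded(dst: &Guarded, dst_stride: usize, src: &[u8], src_stride: usize, w: usize, h: usize) {
    for y in 0..h {
        let mut d = dst.data.borrow_mut(); // runtime check, once per row
        let row = &src[y * src_stride..y * src_stride + w];
        d[y * dst_stride..y * dst_stride + w].copy_from_slice(row);
    }
}

/// Raw-pointer block copy: only base pointers and strides are used, with no
/// guard or bounds checks inside the loop.
///
/// Safety: `dst` must be writable and `src` readable for (h - 1) * stride + w
/// bytes each, and the two regions must not overlap.
unsafe fn put_raw(dst: *mut u8, dst_stride: usize, src: *const u8, src_stride: usize, w: usize, h: usize) {
    for y in 0..h {
        unsafe {
            let sptr = src.add(y * src_stride);
            let dptr = dst.add(y * dst_stride);
            std::ptr::copy_nonoverlapping(sptr, dptr, w);
        }
    }
}

fn main() {
    let (w, h, stride) = (4usize, 3usize, 8usize);
    let src: Vec<u8> = (0..(stride * h) as u8).collect();

    let guarded = Guarded { data: RefCell::new(vec![0u8; stride * h]) };
    put_guarded(&guarded, stride, &src, stride, w, h);

    let mut raw_dst = vec![0u8; stride * h];
    unsafe { put_raw(raw_dst.as_mut_ptr(), stride, src.as_ptr(), stride, w, h) };

    // Both versions write the same pixels; only the per-row overhead differs.
    assert_eq!(*guarded.data.borrow(), raw_dst);
}
```

The trade-off is the one the Safety section below describes: the raw version is only correct if the caller upholds the bounds and non-overlap invariants that the guard would otherwise check.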

Safety

The raw pointer accesses rely on the same bounds invariants as the existing C dav1d implementation. No new unsafety is introduced beyond what was already present in the DSP layer.

Mater Bench and others added 5 commits May 16, 2026 07:23
Replace filter_8tap (which carries Rav1dPictureDataComponentOffset overhead) with
filter_8tap_raw in put_8tap_scaled_rust, matching prep_8tap_scaled_rust.
Extract sptr/dptr/strides before the loops; convert the output pass to raw pointers.

filter_8tap and filter_bilin are now dead code (no callers remain).

Result: C gap 6.4% → ~4.6% on 1920x1080 900-frame bench.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
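A simplified sketch of the hoisting this commit describes: an 8-tap horizontal filter that derives source and destination row pointers from base pointers and strides, then reads pixels through raw pointers in the inner loop. `filter_8tap_h_raw` is a hypothetical name, not the PR's filter_8tap_raw; it assumes 8-bit pixels, applies tap k to src[x + k - 3], and omits the intermediate-precision and two-pass (horizontal then vertical) structure of the real mc code.

```rust
/// Hypothetical single-pass 8-tap horizontal filter for 8-bit pixels.
/// Tap k multiplies src[x + k - 3], so each row needs 3 readable pixels
/// before its first output position and 4 after its last.
///
/// Safety: for every row y < h, `src + y * src_stride` must be readable from
/// offset -3 to w + 3, and `dst + y * dst_stride` writable for w bytes.
unsafe fn filter_8tap_h_raw(
    dst: *mut u8,
    dst_stride: usize,
    src: *const u8,
    src_stride: usize,
    w: usize,
    h: usize,
    taps: &[i16; 8],
    shift: u32,
) {
    let round = (1i32 << shift) >> 1; // round-to-nearest bias (0 when shift == 0)
    for y in 0..h {
        // Row pointers come straight from base + stride; no guard per access.
        let sptr = unsafe { src.add(y * src_stride) };
        let dptr = unsafe { dst.add(y * dst_stride) };
        for x in 0..w {
            let mut sum = 0i32;
            for (k, &t) in taps.iter().enumerate() {
                let p = unsafe { *sptr.offset(x as isize + k as isize - 3) };
                sum += t as i32 * p as i32;
            }
            let v = (sum + round) >> shift;
            unsafe { *dptr.add(x) = v.clamp(0, 255) as u8 };
        }
    }
}

fn main() {
    // One row: 3 padding pixels, 4 real pixels, 4 padding pixels.
    let row = [0u8, 0, 0, 10, 20, 30, 40, 0, 0, 0, 0];
    let mut out = [0u8; 4];
    let half = [0i16, 0, 0, 32, 32, 0, 0, 0]; // average of src[x] and src[x + 1]
    unsafe { filter_8tap_h_raw(out.as_mut_ptr(), 4, row.as_ptr().add(3), row.len(), 4, 1, &half, 6) };
    println!("{out:?}"); // prints [15, 25, 35, 20]
}
```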
… in DSP fn signatures

Under the Windows x64 ABI, structs larger than 8 bytes are passed via a hidden
pointer (the caller makes a copy on the stack and passes its address). The
trailing `FFISafeRav1dPictureDataComponentOffset` parameter (16 bytes:
`WithOffset<*const Rav1dPictureDataComponent>`) therefore costs a stack
copy plus a hidden-pointer indirection on every DSP call.

Replace with `*const Rav1dPictureDataComponentOffset` (8 bytes, fits in a
register) in all affected wrap_fn_ptr signatures:
- src/cdef.rs
- src/filmgrain.rs
- src/ipred.rs
- src/itx.rs
- src/loopfilter.rs
- src/looprestoration.rs
- src/mc.rs
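A minimal sketch of the size difference behind this change, using hypothetical stand-ins (`Component`, `WithOffset`, `FfiSafeOffset`, `dsp_before`, `dsp_after`) for the rav1d types and DSP signatures; the field layout of `WithOffset` is an assumption based on the "pointer + offset, 16 bytes" description above.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for Rav1dPictureDataComponent.
#[repr(C)]
struct Component {
    _private: u8,
}

/// Hypothetical pointer + offset pair, modeled on the description of
/// FFISafeRav1dPictureDataComponentOffset as WithOffset<*const Rav1dPictureDataComponent>.
#[repr(C)]
#[derive(Clone, Copy)]
struct WithOffset<T> {
    data: T,
    offset: usize,
}

type FfiSafeOffset = WithOffset<*const Component>;

/// Before: the 16-byte struct is a by-value parameter; on Windows x64 the
/// caller makes a stack copy and passes its address implicitly.
extern "C" fn dsp_before(_dst: *mut u8, _off: FfiSafeOffset) {}

/// After: the caller passes an explicit 8-byte pointer instead.
extern "C" fn dsp_after(_dst: *mut u8, _off: *const FfiSafeOffset) {}

fn main() {
    let c = Component { _private: 0 };
    let off = FfiSafeOffset { data: &c, offset: 0 };
    dsp_before(std::ptr::null_mut(), off);
    dsp_after(std::ptr::null_mut(), &off); // address of a stack local
    // On 64-bit targets: "by value: 16 bytes, by pointer: 8 bytes"
    println!(
        "by value: {} bytes, by pointer: {} bytes",
        size_of::<FfiSafeOffset>(),
        size_of::<*const FfiSafeOffset>()
    );
}
```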

In Fn::call: pass `&offset_value` (pointer to stack local).
In c_erased: reconstruct via `unsafe { *ref }`.
On arm/aarch64, neon_erased ignores it (neon uses raw ptrs directly).
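The call/reconstruct pattern described above can be sketched as follows, assuming a much-reduced parameter list: the safe wrapper keeps the offset value in a stack local and passes its address, and the erased callee copies the value back out. `ComponentOffset`, `c_erased_sketch` and `call` are hypothetical names; the real Fn::call and c_erased take more parameters.

```rust
/// Hypothetical stand-in for Rav1dPictureDataComponentOffset: base pointer plus offset.
#[repr(C)]
#[derive(Clone, Copy)]
struct ComponentOffset {
    base: *const u8,
    offset: usize,
}

/// Erased entry point in the style of c_erased: it receives only a pointer and
/// reconstructs the value by copying it out (the commit's `unsafe { *ref }`).
///
/// Safety: `off` must point to a live ComponentOffset whose `base + offset` is
/// readable, and `dst` must be writable.
unsafe extern "C" fn c_erased_sketch(dst: *mut u8, off: *const ComponentOffset) {
    let off: ComponentOffset = unsafe { *off };
    // A real DSP kernel would process a block here; this sketch copies one pixel.
    unsafe { *dst = *off.base.add(off.offset) };
}

/// Safe wrapper in the style of Fn::call: the offset value lives on this
/// function's stack, and only its address crosses the call boundary.
fn call(dst: &mut u8, pic: &[u8], offset: usize) {
    assert!(offset < pic.len(), "offset out of bounds");
    let offset_value = ComponentOffset { base: pic.as_ptr(), offset };
    // `&offset_value` is valid for the whole call because the local outlives it.
    unsafe { c_erased_sketch(dst, &offset_value) };
}

fn main() {
    let pic = [10u8, 20, 30, 40];
    let mut out = 0u8;
    call(&mut out, &pic, 2);
    println!("{out}"); // prints 30
}
```

Because the callee copies the struct out immediately, the pointer only has to remain valid for the duration of the call, which a caller-owned stack local satisfies.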

Correctness verified: decoded YUV output matches C dav1d reference
(identical MD5 on test_long.ivf).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>