Optimize dun_render FullyDark and FullyLit paths #8571

Open: glebm wants to merge 1 commit into diasurgical:master from glebm:dun-render-opt
glebm (Collaborator) commented May 26, 2026

Add overlapped_memset.hpp with FillBytesUpTo32, FillBytesUpTo64 (moved
from light_render.cpp), and CopyBytesUpTo32. All three avoid PLT/IFUNC
dispatch overhead for small variable-length memset/memcpy calls on Linux
by using fixed-size inline memcpy calls with overlapping stores/loads.
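
For readers unfamiliar with the overlapping-store trick, here is a minimal sketch of the idea. It is not the code in overlapped_memset.hpp: the names `FillUpTo32Sketch`, `CopyUpTo32Sketch`, and `CopyHeadTail` are hypothetical, and the real helpers may pick widths differently. The assumption is that `n` is at most 32 and that source and destination do not overlap. Each branch issues two fixed-size `memcpy` calls, one anchored at the start and one at the end of the range, so every length is covered without a variable-length library call.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies W bytes from the head and W bytes from the
// tail of an n-byte range. Requires W <= n <= 2 * W so the two fixed-size
// copies together cover every byte (they overlap when n < 2 * W).
template <size_t W>
inline void CopyHeadTail(uint8_t *dst, const uint8_t *src, size_t n)
{
	std::memcpy(dst, src, W);
	std::memcpy(dst + n - W, src + n - W, W);
}

// Sketch of an overlapped copy for 0 <= n <= 32 (src and dst must not overlap).
inline void CopyUpTo32Sketch(uint8_t *dst, const uint8_t *src, size_t n)
{
	if (n >= 16) {
		CopyHeadTail<16>(dst, src, n);
	} else if (n >= 8) {
		CopyHeadTail<8>(dst, src, n);
	} else if (n >= 4) {
		CopyHeadTail<4>(dst, src, n);
	} else if (n >= 2) {
		CopyHeadTail<2>(dst, src, n);
	} else if (n == 1) {
		dst[0] = src[0];
	}
}

// Sketch of an overlapped fill for 0 <= n <= 32.
inline void FillUpTo32Sketch(uint8_t *dst, uint8_t value, size_t n)
{
	// 16 copies of the fill byte serve as the source for the fixed-size stores.
	uint8_t pattern[16];
	std::memset(pattern, value, sizeof(pattern));
	if (n >= 16) {
		std::memcpy(dst, pattern, 16);
		std::memcpy(dst + n - 16, pattern, 16);
	} else if (n >= 8) {
		std::memcpy(dst, pattern, 8);
		std::memcpy(dst + n - 8, pattern, 8);
	} else if (n >= 4) {
		std::memcpy(dst, pattern, 4);
		std::memcpy(dst + n - 4, pattern, 4);
	} else if (n >= 2) {
		std::memcpy(dst, pattern, 2);
		std::memcpy(dst + n - 2, pattern, 2);
	} else if (n == 1) {
		dst[0] = value;
	}
}
```

Because every `memcpy` size in the sketch is a compile-time constant, an optimizing compiler can typically lower each call to plain loads and stores instead of a call through the PLT.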

  • RenderLineOpaque<FullyDark>: use FillBytesUpTo32 instead of memset.
    Fixes ~3x slowdown vs PartiallyLit on the TransparentSquare RLE path.
  • RenderLineOpaque<FullyLit>: use CopyBytesUpTo32 instead of memcpy.
    TransparentSquare FullyLit was 6x slower than FullyDark despite doing
    less work per pixel; now on par with FullyDark (~5.8x improvement).
  • Remove RenderTriangleLower/Upper<FullyDark, false> loop specializations.
    Falling back to the unrolled N-template lets the compiler see each blit
    width as a compile-time constant: FullyDark triangles ~5x faster,
    trapezoids ~2x faster, RenderBlackTile ~4x faster.
  • RenderLineTransparentOrOpaqueN: for static N, bypass the fill/copy
    helpers and call BlitFillDirect (FullyDark) or BlitPixelsDirect (FullyLit)
    directly, guaranteeing a single vector instruction from the compiler.
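
To illustrate why the unrolled N-template helps, here is a rough sketch of a fill whose row widths are template parameters. Everything in it is hypothetical: `BlitFillSketch`, `FillTriangleRows`, and `FillTriangleSketch` are illustration names, the row geometry (16 rows of widths 2, 4, ..., 32) is simplified and does not match the real triangle layout in dun_render, and C++17 fold expressions are assumed. The point is only that each `memset` receives a compile-time constant size, which the compiler can expand inline.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>

// Hypothetical stand-in for a fixed-width fill: N is known at compile time,
// so the compiler can expand the memset into a few stores.
template <size_t N>
inline void BlitFillSketch(uint8_t *dst, uint8_t color)
{
	std::memset(dst, color, N);
}

// Unrolls over row indices 0..sizeof...(Is)-1 with a fold expression.
// Row I starts at dst + I * pitch and is 2 * (I + 1) bytes wide.
template <size_t... Is>
inline void FillTriangleRows(uint8_t *dst, ptrdiff_t pitch, uint8_t color,
    std::index_sequence<Is...>)
{
	(BlitFillSketch<2 * (Is + 1)>(dst + static_cast<ptrdiff_t>(Is) * pitch, color), ...);
}

// Simplified 16-row shape with widths 2, 4, ..., 32 (not the real tile layout).
inline void FillTriangleSketch(uint8_t *dst, ptrdiff_t pitch, uint8_t color)
{
	FillTriangleRows(dst, pitch, color, std::make_index_sequence<16> {});
}
```

A runtime loop over the same rows would instead pass a variable width to the fill, which is the pattern the removed loop specializations used according to the description above.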

Benchmarks (ns, RelWithDebInfo, single core, CPU scaling warnings apply):

Benchmark              | Baseline | +FillUpTo32 | -LoopSpecs | +CopyUpTo32
-----------------------|----------|-------------|------------|-----------
LTri  So/FL            |  7266 ns |     8377 ns |    7745 ns |    8427 ns
LTri  So/FD            | 21506 ns |    18274 ns |    4151 ns |    5758 ns
LTri  So/PL            | 69289 ns |    61874 ns |   60053 ns |   67337 ns
LTri  Tr/FL            | 83602 ns |    77991 ns |   79224 ns |   86274 ns
LTri  Tr/FD            | 70809 ns |    68356 ns |   62336 ns |   86026 ns
LTri  Tr/PL            |109707 ns |    95830 ns |   95321 ns |  103802 ns
RTri  So/FL            |  6220 ns |     5953 ns |    6052 ns |    7392 ns
RTri  So/FD            | 17101 ns |    17778 ns |    4145 ns |    4878 ns
RTri  So/PL            | 62913 ns |    63671 ns |   64209 ns |   70015 ns
RTri  Tr/FL            | 81890 ns |    81554 ns |   79492 ns |   82832 ns
RTri  Tr/FD            | 76801 ns |    66247 ns |   63267 ns |   77821 ns
RTri  Tr/PL            | 99755 ns |   102136 ns |  100410 ns |  104500 ns
TrSq  So/FL            |322304 ns |   340121 ns |  330975 ns |   56543 ns
TrSq  So/FD            |178437 ns |    63454 ns |   54919 ns |   57429 ns
TrSq  So/PL            |127834 ns |   141316 ns |  131048 ns |  147341 ns
TrSq  Tr/FL            |141366 ns |   142851 ns |  143384 ns |  153830 ns
TrSq  Tr/FD            |125299 ns |   121498 ns |  130854 ns |  133110 ns
TrSq  Tr/PL            |183519 ns |   177058 ns |  179739 ns |  177087 ns
Sq    So/FL            |  9618 ns |     8923 ns |    8291 ns |   10151 ns
Sq    So/FD            |  4885 ns |     5449 ns |    5426 ns |    9530 ns
Sq    So/PL            |118292 ns |   122309 ns |  118639 ns |  118512 ns
Sq    Tr/FL            |166114 ns |   165816 ns |  165845 ns |  167127 ns
Sq    Tr/FD            |113779 ns |   116176 ns |  113845 ns |  118415 ns
Sq    Tr/PL            |208899 ns |   210274 ns |  207931 ns |  208100 ns
LTrap So/FL            |  2386 ns |     2663 ns |    2389 ns |    2825 ns
LTrap So/FD            |  4159 ns |     3879 ns |    1723 ns |    1924 ns
LTrap So/PL            | 31172 ns |    30478 ns |   30913 ns |   31173 ns
LTrap Tr/FL            | 41760 ns |    46841 ns |   41622 ns |   42224 ns
LTrap Tr/FD            | 31138 ns |    31824 ns |   32467 ns |   33399 ns
LTrap Tr/PL            | 54863 ns |    52467 ns |   52177 ns |   53973 ns
RTrap So/FL            |  2387 ns |     2369 ns |    2280 ns |    3046 ns
RTrap So/FD            |  3901 ns |     3772 ns |    1522 ns |    1914 ns
RTrap So/PL            | 29838 ns |    30107 ns |   30544 ns |   34756 ns
RTrap Tr/FL            | 43348 ns |    41696 ns |   40635 ns |   43537 ns
RTrap Tr/FD            | 32092 ns |    31350 ns |   31687 ns |   31598 ns
RTrap Tr/PL            | 52474 ns |    51980 ns |   51348 ns |   51540 ns
RenderBlackTile        |  98.7 ns |      107 ns |    21.2 ns |    23.8 ns

glebm enabled auto-merge (rebase) May 26, 2026 16:37

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
glebm force-pushed the dun-render-opt branch from fa6ff9e to 037b288 May 26, 2026 17:02
glebm changed the title from "Optimize dun_render FullyDark path" to "Optimize dun_render FullyDark and FullyLit paths" May 26, 2026