Optimize dun_render FullyDark and FullyLit paths #8571
Open
glebm wants to merge 1 commit into
Add overlapped_memset.hpp with FillBytesUpTo32, FillBytesUpTo64 (moved from light_render.cpp), and CopyBytesUpTo32. All three avoid PLT/IFUNC dispatch overhead for small variable-length memset/memcpy calls on Linux by using fixed-size inline memcpy calls with overlapping stores/loads.

- RenderLineOpaque<FullyDark>: use FillBytesUpTo32 instead of memset. Fixes ~3x slowdown vs PartiallyLit on the TransparentSquare RLE path.
- RenderLineOpaque<FullyLit>: use CopyBytesUpTo32 instead of memcpy. TransparentSquare FullyLit was 6x slower than FullyDark despite doing less work per pixel; now on par with FullyDark (~5.8x improvement).
- Remove RenderTriangleLower/Upper<FullyDark, false> loop specializations. Falling back to the unrolled N-template lets the compiler see each blit width as a compile-time constant: FullyDark triangles ~5x faster, trapezoids ~2x faster, RenderBlackTile ~4x faster.
- RenderLineTransparentOrOpaqueN: for static N, bypass the fill/copy helpers and call BlitFillDirect (FullyDark) or BlitPixelsDirect (FullyLit) directly, guaranteeing a single vector instruction from the compiler.
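For readers unfamiliar with the overlapping stores/loads trick mentioned above, here is a minimal sketch of the general idea. It is not the code from overlapped_memset.hpp: the names FillUpTo32Sketch, CopyUpTo32Sketch, and the two templated helpers are hypothetical, and the real helpers may pick chunk sizes or branch order differently. The point is that every memcpy has a compile-time-constant size, which compilers typically lower to plain load/store instructions instead of a library call, while two possibly overlapping chunks cover any length between Chunk and 2 * Chunk.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the PR): stores Chunk bytes at the start and
// Chunk bytes at the end of [dst, dst + n). Requires Chunk <= n <= 2 * Chunk.
// The two stores overlap whenever n < 2 * Chunk, which is harmless for a fill.
template <std::size_t Chunk>
inline void FillOverlappedSketch(std::uint8_t *dst, const std::uint8_t *pattern, std::size_t n)
{
	std::memcpy(dst, pattern, Chunk);
	std::memcpy(dst + n - Chunk, pattern, Chunk);
}

// Hypothetical stand-in for a FillBytesUpTo32-style helper: fills n bytes
// (0 <= n <= 32) using only fixed-size memcpy calls.
inline void FillUpTo32Sketch(std::uint8_t *dst, std::uint8_t value, std::size_t n)
{
	std::uint8_t pattern[16];
	std::memset(pattern, value, sizeof(pattern)); // constant size
	if (n >= 16) {
		FillOverlappedSketch<16>(dst, pattern, n);
	} else if (n >= 8) {
		FillOverlappedSketch<8>(dst, pattern, n);
	} else if (n >= 4) {
		FillOverlappedSketch<4>(dst, pattern, n);
	} else if (n >= 2) {
		FillOverlappedSketch<2>(dst, pattern, n);
	} else if (n == 1) {
		*dst = value;
	}
}

// Hypothetical helper (not from the PR): copies the first and last Chunk bytes
// of the source range. Both loads happen before either store.
// Requires Chunk <= n <= 2 * Chunk.
template <std::size_t Chunk>
inline void CopyOverlappedSketch(std::uint8_t *dst, const std::uint8_t *src, std::size_t n)
{
	std::uint8_t head[Chunk];
	std::uint8_t tail[Chunk];
	std::memcpy(head, src, Chunk);
	std::memcpy(tail, src + n - Chunk, Chunk);
	std::memcpy(dst, head, Chunk);
	std::memcpy(dst + n - Chunk, tail, Chunk);
}

// Hypothetical stand-in for a CopyBytesUpTo32-style helper: copies n bytes
// (0 <= n <= 32) using only fixed-size memcpy calls.
inline void CopyUpTo32Sketch(std::uint8_t *dst, const std::uint8_t *src, std::size_t n)
{
	if (n >= 16) {
		CopyOverlappedSketch<16>(dst, src, n);
	} else if (n >= 8) {
		CopyOverlappedSketch<8>(dst, src, n);
	} else if (n >= 4) {
		CopyOverlappedSketch<4>(dst, src, n);
	} else if (n >= 2) {
		CopyOverlappedSketch<2>(dst, src, n);
	} else if (n == 1) {
		*dst = *src;
	}
}
```

In this sketch a 13-byte fill becomes two 8-byte stores at offsets 0 and 5, so the cost is a few predictable branches rather than a call through the PLT to a runtime-selected memset.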
Benchmarks (ns, RelWithDebInfo, single core, CPU scaling warnings apply):

Benchmark       |  Baseline | +FillUpTo32 | -LoopSpecs | +CopyUpTo32
----------------|-----------|-------------|------------|------------
LTri So/FL      |   7266 ns |     8377 ns |    7745 ns |     8427 ns
LTri So/FD      |  21506 ns |    18274 ns |    4151 ns |     5758 ns
LTri So/PL      |  69289 ns |    61874 ns |   60053 ns |    67337 ns
LTri Tr/FL      |  83602 ns |    77991 ns |   79224 ns |    86274 ns
LTri Tr/FD      |  70809 ns |    68356 ns |   62336 ns |    86026 ns
LTri Tr/PL      | 109707 ns |    95830 ns |   95321 ns |   103802 ns
RTri So/FL      |   6220 ns |     5953 ns |    6052 ns |     7392 ns
RTri So/FD      |  17101 ns |    17778 ns |    4145 ns |     4878 ns
RTri So/PL      |  62913 ns |    63671 ns |   64209 ns |    70015 ns
RTri Tr/FL      |  81890 ns |    81554 ns |   79492 ns |    82832 ns
RTri Tr/FD      |  76801 ns |    66247 ns |   63267 ns |    77821 ns
RTri Tr/PL      |  99755 ns |   102136 ns |  100410 ns |   104500 ns
TrSq So/FL      | 322304 ns |   340121 ns |  330975 ns |    56543 ns
TrSq So/FD      | 178437 ns |    63454 ns |   54919 ns |    57429 ns
TrSq So/PL      | 127834 ns |   141316 ns |  131048 ns |   147341 ns
TrSq Tr/FL      | 141366 ns |   142851 ns |  143384 ns |   153830 ns
TrSq Tr/FD      | 125299 ns |   121498 ns |  130854 ns |   133110 ns
TrSq Tr/PL      | 183519 ns |   177058 ns |  179739 ns |   177087 ns
Sq So/FL        |   9618 ns |     8923 ns |    8291 ns |    10151 ns
Sq So/FD        |   4885 ns |     5449 ns |    5426 ns |     9530 ns
Sq So/PL        | 118292 ns |   122309 ns |  118639 ns |   118512 ns
Sq Tr/FL        | 166114 ns |   165816 ns |  165845 ns |   167127 ns
Sq Tr/FD        | 113779 ns |   116176 ns |  113845 ns |   118415 ns
Sq Tr/PL        | 208899 ns |   210274 ns |  207931 ns |   208100 ns
LTrap So/FL     |   2386 ns |     2663 ns |    2389 ns |     2825 ns
LTrap So/FD     |   4159 ns |     3879 ns |    1723 ns |     1924 ns
LTrap So/PL     |  31172 ns |    30478 ns |   30913 ns |    31173 ns
LTrap Tr/FL     |  41760 ns |    46841 ns |   41622 ns |    42224 ns
LTrap Tr/FD     |  31138 ns |    31824 ns |   32467 ns |    33399 ns
LTrap Tr/PL     |  54863 ns |    52467 ns |   52177 ns |    53973 ns
RTrap So/FL     |   2387 ns |     2369 ns |    2280 ns |     3046 ns
RTrap So/FD     |   3901 ns |     3772 ns |    1522 ns |     1914 ns
RTrap So/PL     |  29838 ns |    30107 ns |   30544 ns |    34756 ns
RTrap Tr/FL     |  43348 ns |    41696 ns |   40635 ns |    43537 ns
RTrap Tr/FD     |  32092 ns |    31350 ns |   31687 ns |    31598 ns
RTrap Tr/PL     |  52474 ns |    51980 ns |   51348 ns |    51540 ns
RenderBlackTile |   98.7 ns |      107 ns |    21.2 ns |     23.8 ns

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>