Refactor AVX2 blitters using macros #2055

Closed

Conversation

Starbuck5
Member

This moves repeated code out of the blitter functions themselves and into the macros developed for running the new AVX2 alpha blitters. It also adjusts a couple of set_ intrinsics (like _mm256_set_epi16) to use their set1_ variants instead, so the value only has to be provided once.

Makes simd_blitters_avx2 800 lines shorter.

I believe this has a small performance regression, which could be solved by reworking the RUN_AVX2_BLITTER implementation strategy. Maybe we need to put the Duff's device loops back in there?
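
To illustrate the set1_ change, here's a minimal sketch (the helper names are made up for illustration, not the actual blitter code):

#include <immintrin.h>

/* Before: the same constant has to be repeated for all 16 lanes. */
static __m256i alpha_max_repeated(void)
{
    return _mm256_set_epi16(255, 255, 255, 255, 255, 255, 255, 255,
                            255, 255, 255, 255, 255, 255, 255, 255);
}

/* After: the set1_ variant broadcasts the constant written once. */
static __m256i alpha_max_broadcast(void)
{
    return _mm256_set1_epi16(255);
}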

@Starbuck5 Starbuck5 requested a review from a team as a code owner March 26, 2023 05:57
@Starbuck5 Starbuck5 marked this pull request as draft March 26, 2023 05:57
@Starbuck5 Starbuck5 force-pushed the avx2-blitter-refactors branch from cb79e09 to 33a213f Compare April 18, 2023 05:56
@yunline yunline added Code quality/robustness Code quality and resilience to changes SIMD labels Jun 4, 2023
@Starbuck5 Starbuck5 force-pushed the avx2-blitter-refactors branch from 33a213f to ae116c2 Compare December 10, 2023 21:05
@itzpr3d4t0r
Member

itzpr3d4t0r commented Jan 2, 2024

I don't know if you want to move this forward, but I have done a lot of testing and have some findings to share.
Basically, LOOP_UNROLLED isn't improving the situation much; what's making it perform slightly worse is the maskload/maskstore instructions, compared to going pixel by pixel with something like this:

LOOP_UNROLLED4(
      {
          /* load one 32-bit pixel into the low lane of an SSE register */
          mm_src = _mm_cvtsi32_si128(*srcp);
          mm_dst = _mm_cvtsi32_si128(*dstp);

          {BLIT_CODE}

          /* write the blended pixel back out */
          *dstp = _mm_cvtsi128_si32(mm_dst);

          srcp += srcpxskip;
          dstp += dstpxskip;
      },
      n, pre_8_width);

If you take a look at the latency/CPI of these instructions and sum them up, you can see that this is faster when we're talking about only 1-3 (or 4) excess pixels, but slower for 4-7 pixels, hence maskload is better in that case.
I've managed to come up with another strategy that does two pixels at a time with _mm_loadl_epi64, so we shave off some cycles, and handles a final pixel, if present, with _mm_cvtsi32_si128; this was better all the time. The downside is that we'd have to pass in both an AVX and an SSE blit code every time.
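
Roughly, the tail handling would look like this (a sketch in the same macro-parameter style as the snippet above, assuming contiguous pixels; pxl_excess and BLIT_CODE_SSE are placeholder names, not the actual macro API):

while (pxl_excess >= 2) {
    /* load two 32-bit pixels (64 bits) in one go */
    mm_src = _mm_loadl_epi64((const __m128i *)srcp);
    mm_dst = _mm_loadl_epi64((const __m128i *)dstp);

    {BLIT_CODE_SSE}

    /* store the two blended pixels back */
    _mm_storel_epi64((__m128i *)dstp, mm_dst);

    srcp += 2;
    dstp += 2;
    pxl_excess -= 2;
}
if (pxl_excess) {
    /* one trailing pixel: go through the low 32-bit lane */
    mm_src = _mm_cvtsi32_si128(*srcp);
    mm_dst = _mm_cvtsi32_si128(*dstp);

    {BLIT_CODE_SSE}

    *dstp = _mm_cvtsi128_si32(mm_dst);
}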

@itzpr3d4t0r
Member

Anyway, this implementation could still use some simplifications. For example, we don't actually need two shuffle masks for the 16-bit shuffle macro, since we can directly use _mm256_unpacklo_epi8 / _mm256_unpackhi_epi8 with a zeroed register and get the same result, which is what I did in #2642.
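
Something like this (a sketch, the function name is made up): interleaving the source bytes with zero zero-extends each 8-bit channel into a 16-bit lane, with no shuffle mask needed:

#include <immintrin.h>

/* Widen packed 8-bit channels to 16-bit lanes without a shuffle mask. */
static inline void widen_channels_16(__m256i src, __m256i *lo16, __m256i *hi16)
{
    const __m256i zero = _mm256_setzero_si256();
    *lo16 = _mm256_unpacklo_epi8(src, zero); /* low 8 bytes of each 128-bit half */
    *hi16 = _mm256_unpackhi_epi8(src, zero); /* high 8 bytes of each 128-bit half */
}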

@itzpr3d4t0r
Member

But I'd really encourage you to revive this PR if possible or maybe consider working on this together? Your choice.

@Starbuck5
Member Author

Hey itz, I would like to resolve this as well.

All my efforts earlier this year to recover the performance failed, but I think a sub-5% performance regression is acceptable for the blend modes. I wouldn't want to do that to the PREMUL blend, but for the others I think it would be acceptable; they're all so fast anyway. A cleaner and more understandable codebase has its own benefits.

I'd like to avoid interleaving SSE2 and AVX code, again for the simplicity of the codebase.

@Starbuck5
Member Author

I've let this PR linger too long; I think it should be closed. It can be revisited in the future if necessary.

@Starbuck5 Starbuck5 closed this Nov 2, 2024
@Starbuck5 Starbuck5 deleted the avx2-blitter-refactors branch November 2, 2024 09:28