Add AVX2 Vector4Octet.Pack implementation#1402
Conversation
|
@saucecontrol Is there something I can do here to avoid the permutation? |
There was a problem hiding this comment.
LGTM, but check the second suggestion.
I also realized that we can get this much better by implementing a HwIntrinsics AVX2 version of FromYCbCrSimdVector8. I suggest to check benchmarks for the color converter instead of isolated benchmarking of Pack.
| Vector4 vo = Vector4.One; | ||
| Vector128<float> valpha = Unsafe.As<Vector4, Vector128<float>>(ref vo); | ||
|
|
||
| ref byte control = ref MemoryMarshal.GetReference(SimdUtils.HwIntrinsics.PermuteMaskDeinterleave8x32); | ||
| Vector256<int> vcontrol = Unsafe.As<byte, Vector256<int>>(ref control); |
There was a problem hiding this comment.
By having two versions of Pack, (or inlining the HwIntrinsics AVX2 version) we can move these loads outside of the for loop calling Pack.
(Not necessarily a suggestion for this PR since it extends the scope quite a lot)
| Vector256<int> vcontrol = Unsafe.As<byte, Vector256<int>>(ref control); | ||
|
|
||
| Vector256<float> r0 = Avx.InsertVector128( | ||
| Unsafe.As<Vector4, Vector128<float>>(ref r.A).ToVector256(), |
There was a problem hiding this comment.
If I'm not getting it wrong, we can spare ToVector256 in lines dealing with r.A, g.A and b.A, since the upper 4 elements will be overwritten anyways:
| Unsafe.As<Vector4, Vector128<float>>(ref r.A).ToVector256(), | |
| Unsafe.As<Vector4, Vector256<float>>(ref r.A), |
There was a problem hiding this comment.
Weirdly when I tried that I got access violations!
There was a problem hiding this comment.
Are you sure it was on .A stuff and not .B?
There was a problem hiding this comment.
I’ll double check. Was super surprised to see it as it didn’t make sense.
| 1); | ||
|
|
||
| Vector256<float> r2 = Avx.InsertVector128( | ||
| Unsafe.As<Vector4, Vector128<float>>(ref r.B).ToVector256(), |
There was a problem hiding this comment.
With a wider refactor, it's also possible to save the conversion here.
There was a problem hiding this comment.
That'd certainly be easier if I was using pointers.
There was a problem hiding this comment.
What I meant is inlining the Pack stuff into AVX2 color space conversion code. That would remove a bunch of unnecessary loads/stores.
There was a problem hiding this comment.
Yeah, that's what I mean. It's difficult to inline at the moment because I need the 128bit offset when I'm aligning at 256bit.
If I fixed the r, g, b inputs then I could simply load up the vector from the offset.
There was a problem hiding this comment.
I'm pushing the current state.
|
@antonfirsov Experimenting with a separate implementation. First time I've seen sub 9ms. Need to up the warmup/run count on these benchmarks though there's always too much error.
|
Codecov Report
@@ Coverage Diff @@
## master #1402 +/- ##
==========================================
- Coverage 82.90% 82.89% -0.01%
==========================================
Files 690 690
Lines 31008 31017 +9
Branches 3560 3561 +1
==========================================
+ Hits 25706 25713 +7
- Misses 4580 4581 +1
- Partials 722 723 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
antonfirsov
left a comment
There was a problem hiding this comment.
I'm fine to go with this as is, since it's definitely an improvement.
|
@antonfirsov There's more! I inlined the method. Using Vector256 works now.
|
You can't avoid them, but you can reduce the number. I have an AVX2 YCbCr->BGRX converter you can look at here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/PlanarConversionTransform.cs#L153-L203 The main difference is I permute the values first, separating the even and odd indexes into the low and high lanes respectively. That allows the unpack operations to do all the remaining work. It also means you can get by with 3 permutes instead of 4 since your alpha vector is constant. And if you make your green coefficients negative, you can use a couple more FMAs in there :) |
|
@saucecontrol Ahah! I’ll have to give that a try! Thanks! |
|
Very nice @saucecontrol We're getting much closer now. Once we refactor to resolve directly to I'll be running the latest nightly against the playground benchmarks once this is merged and built.
|
Add AVX2 Vector4Octet.Pack implementation

Prerequisites
Description
This should add some performance boost until #1242 and others are complete.
I struggled writing this; any performance tips are most welcome.
Note: No difference with SIMD disabled as we're not using it.