
SSE support #60

Merged
merged 128 commits into from
Apr 15, 2021
Conversation

HEnquist
Contributor

This is still very much a work in progress, but I thought I would show what I'm up to.
Looking at the AVX code, it's quite big, and many of the techniques there don't look like they would translate well to the much smaller SSE instruction sets. Instead I started from the scalar code and made a hybrid solution: I use SSE code wherever it's ready, and fall back to the scalar version otherwise.
For now I have butterflies of length 2, 3, 4, 5, 8, 16, 32, and Radix4 working. The other algorithms will use these as inners, so they also get some speedup.
Surprisingly, I get about the same speedup for f32 as for f64. For Radix4 it varies between +30% and +90%. There may be more to gain by tweaking a bit here and there.
For the individual butterflies the gain increases with the length; for 16 and 32 they are about 3x as fast as the scalar versions (for both f32 and f64).
I will now continue with implementing the rest of the butterflies. (I haven't abandoned the estimating planner; I just got caught up in this work once it started going well.)
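To give a rough idea of the shape of such code, here is a minimal, purely illustrative sketch (hypothetical names, not the PR's actual implementation) of a length-2 butterfly on complex f64, where one __m128d register holds a single complex number as [re, im]:

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Illustrative sketch only, not the PR's actual code: one __m128d holds a
// single complex f64 as [re, im], so the length-2 butterfly is just a packed
// add and a packed subtract.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn butterfly2_f64(a: __m128d, b: __m128d) -> (__m128d, __m128d) {
    (_mm_add_pd(a, b), _mm_sub_pd(a, b))
}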

Review comment thread: src/sse/sse_radix4.rs (outdated)
@HEnquist
Contributor Author

I tried running rustfmt on the autogenerated butterflies, but it made a mess of the calc! macros. Those lines tend to get a bit long, but since they read like a table I think that's quite OK. Like here: https://github.com/HEnquist/RustFFT/blob/ssesimple/src/sse/sse_prime_butterflies.rs#L3317
The formatter sometimes put each term on its own row, and sometimes broke a row into a few shorter ones. The result was much harder to read, so I excluded that file from formatting. OK?
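For reference, one possible alternative (not necessarily what was done here) is to leave the file in rustfmt's scope but mark the generated items with the stable #[rustfmt::skip] attribute, so their table-like layout is preserved; a minimal sketch with a hypothetical item name:

// Hypothetical sketch: #[rustfmt::skip] keeps rustfmt from reflowing the
// table-like macro rows inside this one item, without excluding the whole file.
#[rustfmt::skip]
fn generated_butterfly_example() {
    // calc!( ... ) rows laid out as a table would be left untouched here.
}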

@ejmahler
Owner

Yeah, excluding any autogenerated file from rustfmt seems reasonable to me.

@ejmahler
Owner

Where do you think this is at? If you think it's ready, I can do another review pass.

@HEnquist
Contributor Author

There is one naming question left: #60 (comment)
Apart from that I think it's in good shape.

@HEnquist
Contributor Author

HEnquist commented Apr 2, 2021

I realized the radix4 bit-reverse shuffle needs some more work. I started looking into using it for the scalar radix4 as well, and made some proper benches for just the shuffle. Right now it's faster at medium lengths (about 10k - 100k), but it actually gets slower for large arrays. I have some ideas for how to solve that.
It would also be much nicer to use the same shuffle function for the SSE and scalar radix4, instead of having two copies like here.
I would propose that we pause this PR for the moment, and that I sort out the scalar shuffler first in a separate PR. After that is ready and merged, I'll update this PR to use the same shuffler. What do you think?

@ejmahler
Owner

ejmahler commented Apr 3, 2021

Instead of putting this on hold, it would make sense to me to just have the SSE code use the radix4 reordering from master. That way, we can check this in sooner, and we can consider the reordering improvement to be an orthogonal change that this doesn't depend on.

Unrelated: I found that doing direct bit reversal was too slow compared to the recursive approach with scalar code, but I wonder if it would be faster using SSE?

@HEnquist
Contributor Author

HEnquist commented Apr 3, 2021

Ready for review!
I went back to the scalar shuffler for the radix4, and renamed the functions - all the ones with 1st and 2nd in them.

I have played a little with the bit reverse. This version runs at the same speed as the current one:

// Reverses the order of the base-4 digits (pairs of bits) of `value`, which
// is the reordering a radix-4 FFT needs. `bits` is the number of base-4 digits.
pub fn reverse_bits(value: usize, bits: usize) -> usize {
    let mut result: usize = 0;
    let mut value = value;
    for _ in 0..bits {
        result = (result << 2) + (value & 0x03);
        value = value >> 2;
    }
    result
}

// Transposes a `width` x `height` matrix while applying the digit reversal to
// the column index, writing the reordered data into `output`.
pub unsafe fn bitrev_transpose<T: Copy>(width: usize, height: usize, input: &[T], output: &mut [T], bits: usize) {
    for x in 0..width {
        let x_rev = reverse_bits(x, bits);
        for y in 0..height {
            let input_index = x_rev + y * width;
            let output_index = y + x * height;

            *output.get_unchecked_mut(output_index) = *input.get_unchecked(input_index);
        }
    }
}

It walks through the input and output arrays in the same order as the recursive one. Now I'll start working on getting this to split the work into cache-friendly chunks.
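As a rough illustration of what those cache-friendly chunks could look like (a hypothetical sketch, not the final implementation; TILE is an arbitrary placeholder and reverse_bits is the function shown above), the same loop can be blocked into tiles:

// Hypothetical sketch of a cache-blocked version of the transpose above.
const TILE: usize = 16;

pub unsafe fn bitrev_transpose_tiled<T: Copy>(width: usize, height: usize, input: &[T], output: &mut [T], bits: usize) {
    for x0 in (0..width).step_by(TILE) {
        for y0 in (0..height).step_by(TILE) {
            // Process one TILE x TILE block so reads and writes stay within a
            // small working set.
            for x in x0..(x0 + TILE).min(width) {
                let x_rev = reverse_bits(x, bits);
                for y in y0..(y0 + TILE).min(height) {
                    let input_index = x_rev + y * width;
                    let output_index = y + x * height;
                    *output.get_unchecked_mut(output_index) = *input.get_unchecked(input_index);
                }
            }
        }
    }
}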

@ejmahler
Owner
ejmahler left a comment
This is looking great. I left a bunch of comments, but they're much more minor than last time. I don't think we'll need a 3rd review - when these changes go in, I'll be ready to merge.

Thanks again for all your hard work to make this PR happen, I think users will be very grateful for the improved performance.

Review comment threads:
src/algorithm/butterflies.rs (outdated)
src/array_utils.rs (outdated)
src/lib.rs (outdated, 2 threads)
src/sse/sse_butterflies.rs
src/sse/sse_planner.rs (outdated)
src/sse/sse_utils.rs (outdated, 3 threads)
src/sse/sse_vector.rs
@HEnquist
Contributor Author

I believe I have addressed all the comments now, except the question about target_feature vs inline(always) where I need a little input.

@ejmahler
Owner

Thanks for the update, and thanks for doing some thorough research into the inlining issue. I finally have a day off, so I'll be looking into this today. I anticipate being able to merge this today.

@ejmahler
Owner

ejmahler commented Apr 15, 2021

OK, after digging into it, I understand the problem space a little more. Because of the distinction between single and parallel f32 (basically, parallel vs remainder), we can't possibly avoid having multiple copies of the core FFT function. If we move the inline vs SSE 4.1 declarations around, we just slightly change what gets duplicated and what doesn't.

But right now, because of the structure of

let alldone = array_utils::iter_chunks(buffer, 2 * self.len(), |chunk| {
    self.perform_parallel_fft_butterfly(chunk)
});
if alldone.is_err() && buffer.len() >= self.len() {
    self.perform_fft_butterfly(&mut buffer[len - self.len()..]);
}

being inside a target feature, both of these functions get inlined together along with the loop code. The end result is fewer function calls and probably better instruction locality.
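To make that structure concrete, here is a small self-contained sketch (hypothetical names, not the PR's actual functions) of the pattern being described: the chunk loop lives inside one #[target_feature] entry point, so both the parallel kernel and the remainder kernel can be inlined into it:

// Hypothetical sketch, not the PR's actual code: the #[target_feature]
// boundary sits on the function that owns the loop, so both kernels below can
// be #[inline(always)] and get merged into it along with the loop code.
#[inline(always)]
unsafe fn parallel_kernel(chunk: &mut [f32]) {
    // ... SSE work on two interleaved FFTs ...
}

#[inline(always)]
unsafe fn remainder_kernel(chunk: &mut [f32]) {
    // ... SSE work on the single leftover FFT ...
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn process(buffer: &mut [f32], fft_len: usize) {
    let mut chunks = buffer.chunks_exact_mut(2 * fft_len);
    for chunk in &mut chunks {
        parallel_kernel(chunk);
    }
    let remainder = chunks.into_remainder();
    if remainder.len() >= fft_len {
        remainder_kernel(&mut remainder[..fft_len]);
    }
}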

I noticed that f64 doesn't need the distinction between main-loop and remainder FFTs because it can only store one at a time, so I took a stab at applying this inline change there to see if it suffers from the same 5-10% drawback - and it turns out it does. So it definitely isn't instruction locality, because inspecting the assembly confirms that the actual FFT function code doesn't get duplicated.

My last hunch is that maybe the un-inlineable function call in the loop is the problem: when the target_feature attribute is on perform_fft_contiguous, we run a loop that does nothing but call perform_fft_contiguous n times - and without some refactoring there's no way to even test whether that's the case. So for now I'm going to shelve this line of thought. I would like to revisit it at some point after this merges, though, because if we can get it to work without the performance hit, there would be a huge binary size reduction.

@ejmahler merged commit 1f72b41 into ejmahler:master on Apr 15, 2021
@ejmahler
Owner

And it's in! Thanks again for making this happen.

My plan is to look into the precision issues today, then publish a v5.1, then update num-complex, then push v6.0 - hopefully all today.

@HEnquist
Contributor Author

Excellent! Thank you so much :)
This means I should hurry up with the new radix4 shuffler. It's ready; it just needs some cleanup. I will do it today!
