
SSE support #60

Merged
merged 128 commits into from
Apr 15, 2021
Conversation

HEnquist
Contributor

This is still very much a work in progress, but I thought I would show what I'm up to.
Looking at the AVX code, it's quite big, and many of the techniques there don't look like they would translate well to the much smaller SSE instruction sets. Instead I started from the scalar code and made a hybrid solution: I use SSE code wherever it's ready, and fall back to the scalar version otherwise.
For now I have butterflies of length 2, 3, 4, 5, 8, 16, 32, and Radix4 working. The other algorithms will use these as inners, so they also get some speedup.
Surprisingly, I get about the same speedup for f32 as for f64. For Radix4 it varies between +30% and +90%. There may be more to gain by tweaking a bit here and there.
For the individual butterflies the gain increases with the length; for 16 and 32 they are about 3x as fast as the scalar versions (for both f32 and f64).
I will now continue with implementing the rest of the butterflies. (I haven't abandoned the estimating planner; I just got caught up in this work once it started going well.)
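To give a rough idea of the shape of such code, here is a minimal, purely illustrative sketch (hypothetical names, not the PR's actual implementation) of a length-2 butterfly on complex f64, where one __m128d register holds a single complex number as [re, im]:

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Illustrative sketch only, not the PR's actual code: one __m128d holds a
// single complex f64 as [re, im], so the length-2 butterfly is just a packed
// add and a packed subtract.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn butterfly2_f64(a: __m128d, b: __m128d) -> (__m128d, __m128d) {
    (_mm_add_pd(a, b), _mm_sub_pd(a, b))
}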

Review comment thread: src/sse/sse_radix4.rs (outdated)
@HEnquist
Contributor Author

I tried running rustfmt on the autogenerated butterflies, but it made a mess of the calc! macros. Those lines tend to get a bit long, but since they read like a table I think that's quite OK. Like here: https://github.com/HEnquist/RustFFT/blob/ssesimple/src/sse/sse_prime_butterflies.rs#L3317
The formatter sometimes put each term on its own row, and sometimes broke a row into a few shorter ones. The result was much harder to read, so I excluded that file from formatting. OK?
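For reference, one possible alternative (not necessarily what was done here) is to leave the file in rustfmt's scope but mark the generated items with the stable #[rustfmt::skip] attribute, so their table-like layout is preserved; a minimal sketch with a hypothetical item name:

// Hypothetical sketch: #[rustfmt::skip] keeps rustfmt from reflowing the
// table-like macro rows inside this one item, without excluding the whole file.
#[rustfmt::skip]
fn generated_butterfly_example() {
    // calc!( ... ) rows laid out as a table would be left untouched here.
}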

@ejmahler
Owner

Yeah, excluding any autogenerated file from rustfmt seems reasonable to me.

@ejmahler
Owner

Where do you think this is at? If you think it's ready, I can do another review pass.

@HEnquist
Contributor Author

There is one naming question left: #60 (comment)
Apart from that I think it's in good shape.

@HEnquist
Contributor Author

HEnquist commented Apr 2, 2021

I realized the radix4 bit-reverse shuffle needs some more work. I started looking into using it for the scalar radix4 as well, and made some proper benches for just the shuffle. Right now it's faster at medium lengths (about 10k - 100k), but it actually gets slower for large arrays. I have some ideas for how to solve that.
It would also be much nicer to use the same shuffle function for the SSE and scalar radix4, instead of having two copies like here.
I would propose that we pause this PR for the moment, and that I sort out the scalar shuffler first in a separate PR. After that is ready and merged, I'll update this PR to use the same shuffler. What do you think?

@ejmahler
Owner

ejmahler commented Apr 3, 2021

Instead of putting this on hold, it would make sense to me to just have the SSE code use the radix4 reordering from master. That way, we can check this in sooner, and we can consider the reordering improvement to be an orthogonal change that this doesn't depend on.

Unrelated: I found that doing direct bit reversal was too slow compared to the recursive approach with scalar code, but I wonder if it would be faster using SSE?

@HEnquist
Contributor Author

HEnquist commented Apr 3, 2021

Ready for review!
I went back to the scalar shuffler for the radix4, and renamed the functions - all the ones with 1st and 2nd in them.

I have played a little with the bit reverse. This version runs at the same speed as the current one:

// Reverses the order of the base-4 digits (pairs of bits) of `value`, which
// is the reordering a radix-4 FFT needs. `bits` is the number of base-4 digits.
pub fn reverse_bits(value: usize, bits: usize) -> usize {
    let mut result: usize = 0;
    let mut value = value;
    for _ in 0..bits {
        result = (result << 2) + (value & 0x03);
        value = value >> 2;
    }
    result
}

// Transposes a `width` x `height` matrix while applying the digit reversal to
// the column index, writing the reordered data into `output`.
pub unsafe fn bitrev_transpose<T: Copy>(width: usize, height: usize, input: &[T], output: &mut [T], bits: usize) {
    for x in 0..width {
        let x_rev = reverse_bits(x, bits);
        for y in 0..height {
            let input_index = x_rev + y * width;
            let output_index = y + x * height;

            *output.get_unchecked_mut(output_index) = *input.get_unchecked(input_index);
        }
    }
}

It walks through the input and output arrays in the same order as the recursive one. Now I'll start working on getting this to split the work into cache-friendly chunks.
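As a rough illustration of what those cache-friendly chunks could look like (a hypothetical sketch, not the final implementation; TILE is an arbitrary placeholder and reverse_bits is the function shown above), the same loop can be blocked into tiles:

// Hypothetical sketch of a cache-blocked version of the transpose above.
const TILE: usize = 16;

pub unsafe fn bitrev_transpose_tiled<T: Copy>(width: usize, height: usize, input: &[T], output: &mut [T], bits: usize) {
    for x0 in (0..width).step_by(TILE) {
        for y0 in (0..height).step_by(TILE) {
            // Process one TILE x TILE block so reads and writes stay within a
            // small working set.
            for x in x0..(x0 + TILE).min(width) {
                let x_rev = reverse_bits(x, bits);
                for y in y0..(y0 + TILE).min(height) {
                    let input_index = x_rev + y * width;
                    let output_index = y + x * height;
                    *output.get_unchecked_mut(output_index) = *input.get_unchecked(input_index);
                }
            }
        }
    }
}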

@ejmahler
Owner
ejmahler left a comment
This is looking great. I left a bunch of comments, but they're much more minor than last time. I don't think we'll need a 3rd review - when these changes go in, I'll be ready to merge.

Thanks again for all your hard work to make this PR happen, I think users will be very grateful for the improved performance.

Review comment threads:
src/algorithm/butterflies.rs (outdated)
src/array_utils.rs (outdated)
src/lib.rs (outdated, 2 threads)
src/sse/sse_butterflies.rs
src/sse/sse_planner.rs (outdated)
src/sse/sse_utils.rs (outdated, 3 threads)
src/sse/sse_vector.rs
@HEnquist
Contributor Author

I believe I have addressed all the comments now, except the question about target_feature vs inline(always) where I need a little input.

@ejmahler
Owner

Thanks for the update, and thanks for doing some thorough research into the inlining issue. I finally have a day off, so I'll be looking into this today. I anticipate being able to merge this today.

@ejmahler
Owner

ejmahler commented Apr 15, 2021

OK, after digging into it, I understand the problem space a little more. Because of the distinction between single and parallel f32 (basically, parallel vs remainder), we can't possibly avoid having multiple copies of the core FFT function. If we move the inline vs SSE 4.1 declarations around, we just slightly change what gets duplicated and what doesn't.

But right now, because of the structure of

let alldone = array_utils::iter_chunks(buffer, 2 * self.len(), |chunk| {
    self.perform_parallel_fft_butterfly(chunk)
});
if alldone.is_err() && buffer.len() >= self.len() {
    self.perform_fft_butterfly(&mut buffer[len - self.len()..]);
}

being inside a target feature, both of these functions get inlined together along with the loop code. The end result is fewer function calls and probably better instruction locality.
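To make that structure concrete, here is a small self-contained sketch (hypothetical names, not the PR's actual functions) of the pattern being described: the chunk loop lives inside one #[target_feature] entry point, so both the parallel kernel and the remainder kernel can be inlined into it:

// Hypothetical sketch, not the PR's actual code: the #[target_feature]
// boundary sits on the function that owns the loop, so both kernels below can
// be #[inline(always)] and get merged into it along with the loop code.
#[inline(always)]
unsafe fn parallel_kernel(chunk: &mut [f32]) {
    // ... SSE work on two interleaved FFTs ...
}

#[inline(always)]
unsafe fn remainder_kernel(chunk: &mut [f32]) {
    // ... SSE work on the single leftover FFT ...
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn process(buffer: &mut [f32], fft_len: usize) {
    let mut chunks = buffer.chunks_exact_mut(2 * fft_len);
    for chunk in &mut chunks {
        parallel_kernel(chunk);
    }
    let remainder = chunks.into_remainder();
    if remainder.len() >= fft_len {
        remainder_kernel(&mut remainder[..fft_len]);
    }
}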

I noticed that f64 doesn't need the distinction between main-loop and remainder FFTs because it can only store one at a time, so I took a stab at applying this inline change there to see if it suffers from the same 5-10% drawback - and it turns out it does. So it definitely isn't instruction locality, because inspecting the assembly confirms that the actual FFT function code doesn't get duplicated.

My last hunch is that maybe the un-inlineable function call in the loop is the problem: when the target_feature attribute is on perform_fft_contiguous, we run a loop that does nothing but call perform_fft_contiguous n times - and without some refactoring there's no way to even test whether that's the case. So for now I'm going to shelve this line of thought. I would like to revisit it at some point after this merges, though, because if we can get it to work without the performance hit, there would be a huge binary size reduction.

@ejmahler merged commit 1f72b41 into ejmahler:master on Apr 15, 2021
@ejmahler
Owner

And it's in! Thanks again for making this happen.

My plan is to look into the precision issues today, then publish a v5.1, then update num-complex, then push v6.0 - hopefully all today.

@HEnquist
Contributor Author

Excellent! Thank you so much :)
This means I should hurry up with the new radix4 shuffler. It's ready; it just needs some cleanup. I will do it today!
