Bluestein's algorithm #6
Merged: 7 commits, Dec 8, 2020

Conversation

@HEnquist (Contributor) commented Nov 26, 2020

This PR adds an implementation of Bluestein's algorithm, which is often (but not always) faster than Rader's algorithm.
The code is based on the implementation in "fourier" by @calebzulawski:
https://github.com/calebzulawski/fourier/blob/master/fourier-algorithms/src/bluesteins.rs

This is useful for prime-length FFTs, especially the ones where Rader's gets slow.
The planner has been modified to use Rader's when the prime factors of (len-1)/2 are all 7 or less, and Bluestein's otherwise. This heuristic picks the fastest implementation for nearly all lengths, and it picks correctly in every case where the difference between the two is large.
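For illustration, a minimal sketch of that heuristic; the function name and exact placement are hypothetical, not the actual planner code:

```rust
/// Hypothetical sketch: prefer Rader's when the prime factors of
/// (len - 1) / 2 are all 7 or less, otherwise use Bluestein's.
fn prefer_raders(len: usize) -> bool {
    let mut n = (len - 1) / 2;
    if n <= 1 {
        return true; // trivially smooth
    }
    // Strip out all factors of 2, 3, 5 and 7
    for &p in &[2usize, 3, 5, 7] {
        while n % p == 0 {
            n /= p;
        }
    }
    // If nothing is left, every prime factor was 7 or less
    n == 1
}
```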

This graph shows the average time to compute an FFT for all prime lengths below 1000, for Rader's and Bluestein's:
[Figure: rader_vs_bluestein]
The plot was made with this simple tool: https://github.com/HEnquist/fftbenching/tree/bluestein_vs_rader

// Pointwise multiply by the w twiddles
for (w, wi) in scratch_b.iter_mut().zip(self.w_twiddles.iter()) {
    *w = *w * wi;
}
// Then run the inverse inner FFT
self.inner_fft_inv.process(&mut scratch_b, &mut scratch_a);
@ejmahler (Owner) commented Dec 4, 2020

I have an optimization to recommend: You can compute an inverse FFT by:

  1. Conjugating the inputs
  2. Computing a forward FFT on the data from step 1
  3. Conjugating the output of step 2

Conjugating will be very easy, since you're already looping over both the inputs and outputs, and it means this algorithm would only need to store a forward FFT, not both a forward and inverse.
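A minimal sketch of the trick, assuming the pre-5.0 two-buffer process(input, output) signature of the FFT trait used in this PR (the function name is illustrative):

```rust
use rustfft::num_complex::Complex;
use rustfft::FFT;

// Sketch only: compute an (unnormalized) inverse FFT using a forward FFT,
// by conjugating the inputs, transforming, then conjugating the outputs.
fn inverse_via_forward(
    forward_fft: &dyn FFT<f64>,
    input: &mut [Complex<f64>],
    output: &mut [Complex<f64>],
) {
    // 1. Conjugate the inputs
    for x in input.iter_mut() {
        *x = x.conj();
    }
    // 2. Forward FFT on the conjugated data
    forward_fft.process(input, output);
    // 3. Conjugate the outputs: conj(FFT(conj(x))) equals the inverse DFT of x
    for x in output.iter_mut() {
        *x = x.conj();
    }
}
```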

@ejmahler (Owner):

It will also mean we don't need to take "inverse" in the constructor either, because we can grab it from the inner FFT.

@HEnquist (Contributor, Author):

Good idea! I will do this and update the PR.

@ejmahler (Owner) commented Dec 4, 2020

Sorry it took me a while to see this. I think this is a great addition.

Apparently, Bluestein's works with any inner size, not just powers of 2.

For example, if we could somehow make a variant of radix4 work with size-24 and size-12 as base cases instead of size-32 and size-16, it would process FFTs of size 3 * 2^n. For an FFT of size 17, 2*17 - 1 = 33. If we exclusively went with the "next power of two" rule, we'd be forced to use a size-64 FFT. But with that radix4 variant, we could use 48 instead.

Something I discovered while working on the SIMD branch is that, for the most part, 3*2^n FFTs are just as fast as 2^n FFTs, so it would represent a clear improvement in performance.

That doesn't need to be done for this PR, but if you're excited to keep working on RustFFT, that might be an area to explore (:
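To make the size arithmetic above concrete, a hedged sketch of how an inner size could be chosen once 3 * 2^n sizes are available; next_inner_size is an illustrative helper, not existing RustFFT API:

```rust
/// Illustrative helper: smallest size >= min that is either 2^n or 3 * 2^n.
fn next_inner_size(min: usize) -> usize {
    let pow2 = min.next_power_of_two();
    // Smallest 3 * 2^k at or above min
    let mut three_pow2 = 3;
    while three_pow2 < min {
        three_pow2 *= 2;
    }
    pow2.min(three_pow2)
}

// Bluestein's needs an inner FFT of at least 2 * len - 1.
// For len = 17 that minimum is 33: next_inner_size(33) == 48, not 64.
```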

fn perform_fft(&self, input: &mut [Complex<T>], output: &mut [Complex<T>]) {
    assert_eq!(self.len(), input.len());

    // Scratch buffer sized for the inner forward FFT
    let mut scratch_a = vec![Complex::zero(); self.inner_fft_fw.len()];
@ejmahler (Owner):

Minor optimization: If we have to allocate in here, it might be wise to only do one allocation instead of two. That is, create a vec of twice the size, and use split_at_mut to get two slices.

Benefits would include less memory fragmentation, fewer allocator calls, and improved memory locality. I don't know if it would actually be measurably faster, though.
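Something like this, as a sketch; inner_len stands in for self.inner_fft_fw.len(), and the num_complex/num_traits re-exports are assumed:

```rust
use rustfft::num_complex::Complex;
use rustfft::num_traits::Zero;

fn main() {
    let inner_len = 48; // placeholder for self.inner_fft_fw.len()
    // One allocation of twice the size...
    let mut scratch = vec![Complex::<f64>::zero(); 2 * inner_len];
    // ...split into two disjoint mutable scratch slices
    let (scratch_a, scratch_b) = scratch.split_at_mut(inner_len);
    assert_eq!(scratch_a.len(), scratch_b.len());
}
```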

@HEnquist (Contributor, Author):

Yes, unfortunately it's not possible to avoid the allocation without changing the FFT trait so that process() takes &mut self.
But this is a good idea. I'm guessing that, with allocator overhead, allocating one vector of twice the length takes much less than twice the time of two separate allocations.

@HEnquist (Contributor, Author) commented Dec 4, 2020

> (quoting @ejmahler's comment above about Bluestein's inner sizes and a 3 * 2^n radix4 variant)

This is a very interesting idea! I would gladly explore that. Doing it as a separate PR seems appropriate.

BTW, I assume this is the new permanent home for RustFFT, so shouldn't issues be enabled for this repo?

@ejmahler (Owner) commented Dec 4, 2020

Thanks for pointing that out, I enabled issues.

@HEnquist (Contributor, Author) commented Dec 5, 2020

I have now implemented all the suggested changes. I also updated the planner to use a mixed-radix inner FFT when that is possible and faster than a longer Radix4.
The new graph then looks like this:
[Figure: rader_vs_bluestein_new]
The improvements show up as the new, lower steps between ~260-380 and ~520-770.

 }
-self.inner_fft_inv.process(&mut scratch_b, &mut scratch_a);
+self.inner_fft.process(&mut scratch_b, &mut scratch_a);
@ejmahler (Owner):

I have one final request: Could you leave a comment here pointing out that this is computing an inverse FFT, because we have conjugated the inputs and outputs?

@HEnquist (Contributor, Author):

I updated this now and added comments explaining what is going on.
I also added conjugations so that the first FFT is always a forward transform and the second always an inverse. Using an inverse FFT as the first step and a forward FFT as the second also seems to produce correct results; I haven't investigated why, but since that isn't very intuitive I prefer to do it the "normal" way.

@ejmahler merged commit 982a040 into ejmahler:master on Dec 8, 2020
@ejmahler mentioned this pull request on Mar 11, 2021