Implement Fast Convolution algorithm #80
Fast convolution would be great. There is a fast convolution function built for the Teensy MCU. You might get some more insight reading the Teensy forum.
Thanks for the pointer, I wasn't keeping track so that's good to know.
I was talking in terms of so called O-notation. Conventional convolution of input sequence with a filter kernel of length N requires N multiplications per output sample, hence I called it linear time. If we are talking about processing a block of length M, it would take MN multiplications. Here's an illustration:
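To make the cost argument concrete, here is a small illustrative sketch (not DaisySP or CMSIS code; all names are made up) of direct-form block convolution with an explicit multiply-accumulate counter, showing the M·N cost for a block of M samples and an N-tap kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct-form FIR convolution of one block. Each output sample costs
// h.size() multiply-accumulates, so a block of M samples costs M * N MACs.
std::vector<float> DirectConvBlock(const std::vector<float>& block,
                                   const std::vector<float>& h,
                                   std::vector<float>& state,  // newest N inputs
                                   size_t* mac_count) {
    std::vector<float> out(block.size(), 0.0f);
    for (size_t n = 0; n < block.size(); ++n) {
        state.insert(state.begin(), block[n]);   // push newest sample
        state.pop_back();                        // drop oldest, size stays N
        float acc = 0.0f;
        for (size_t k = 0; k < h.size(); ++k) {  // N multiplications per sample
            acc += h[k] * state[k];              // y[n] += h[k] * x[n-k]
            ++(*mac_count);
        }
        out[n] = acc;
    }
    return out;
}
```

Feeding an impulse through it reproduces the kernel, and the counter confirms the M·N multiplication count per block.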
Woah! Thanks for sharing so much detail, this is really great. I've been wanting to add convolution, but I have had little to no time to even consider the approach, so this is truly exciting! Your plan sounds solid to me. There are definitely some oddities with the ARM code with respect to code size. The common tables get brought in through the init functions for all table sizes, including the ones that are not used. I'm open to solutions on this front. Related to that, this would be the first thing that relies on the functions within the CMSIS DSP lib, and while I do agree that we should use as much of that as possible, I am hoping that despite being focused on an embedded platform, this library could be portable enough to test/use on PCs as well as embedded environments.
Ah, now I understand 😄. Thanks!
Now this is an issue description, awesome. Unfortunately, my years of slogging out business software in Java don't afford me the knowledge to address your implementation strategy specifically, but if I may add my €0.02 anyway, I'd like to echo @stephenhensley.
Please consider this when designing your implementation. It'd be great to have both a reference implementation, intuitively designed and free from platform-specific oddities, and an optimized platform-specific implementation. Ideally, the implementations could be compared during testing for conformance. I'm looking forward to your contribution.
My further experiments show that the linker effectively purges unreferenced symbols. For example, if fixed-point (Q15, Q31) FFTs are not used, then the corresponding tables are not included in the executable. What does get included are the tables for all FFT sizes for a given format, since the FFT size is passed to the init function as a run-time parameter and thus in general cannot be determined at build time. By the way, someone has commented out most of the FFT sizes in libdaisy, apparently to bring the size down.

My current solution is rather simple: I've created a replacement source file that I compile and link instead of the existing arm_common_tables.c. The difference is that all tables are placed into the BSS SDRAM section and there's an "init" function to fill them in at run-time. Otherwise, if we want to support all sizes (and for binary partitioned convolution they are all required), the resulting twiddle factor tables occupy almost all available internal flash, so there's practically no room left for the user application code.
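For reference, a minimal sketch of that replacement-table idea (the section name, macro, and interleaved cos/sin layout here are assumptions for illustration, not the actual libdaisy/CMSIS definitions): the table is declared as zero-initialized data placed in an SDRAM section and filled at startup, instead of living as const data in flash.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical section attribute; ".sdram_bss" is an assumed name.
#if defined(__arm__)
#define DSY_SDRAM_BSS __attribute__((section(".sdram_bss")))
#else
#define DSY_SDRAM_BSS  // no-op when building/testing on a PC
#endif

constexpr int kMaxFftSize = 4096;
const double  kTwoPi      = 6.283185307179586;

// Zero-initialized table: occupies (SD)RAM at run-time, not flash.
DSY_SDRAM_BSS float twiddle_4096[2 * kMaxFftSize];

// Fill the table once at startup instead of storing precomputed const data.
// Assumed layout: interleaved cos/sin pairs per twiddle factor.
void InitTwiddle4096() {
    for (int i = 0; i < kMaxFftSize; ++i) {
        double phase = kTwoPi * i / kMaxFftSize;
        twiddle_4096[2 * i]     = static_cast<float>(std::cos(phase));
        twiddle_4096[2 * i + 1] = static_cast<float>(std::sin(phase));
    }
}
```

The trade-off is a one-time startup cost and SDRAM usage in exchange for freeing internal flash for application code.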
Thanks, that was actually one of the questions I had, but forgot to write down.
Thanks, but I bet you'd rather have working code without an issue description than the other way round :)
As we all know, trying to make something universal ends up making it non-efficient or non-readable. Usually both.
Thanks! The plain spectrum-domain convolution (the one that still has the block latency) is working perfectly already. I'm finessing all the unit and performance tests. Next I will work on the partitioned mode (zero-latency, which is the final goal), which will still take several weeks to complete since I'm only working on this in my spare time.
One caveat, though, is that SDRAM latency is higher than that of internal SRAM.
Having an efficient fast convolution with low latency would be very nice, indeed. If it can help, I can add a second API to FFTReal to split the processing into smaller chunks. This would help spread the load over several blocks in the processing thread without using another (simulated) thread, resulting in a more predictable overall load.
Welcome to the party!
Sure. FFTReal was designed 25 years ago and optimized for the CPUs of those days, and hasn't really evolved since. I don't expect it to compete with modern designs on current CPUs.
Were there any advances in this direction? Looking into implementing reverbs and cabinet simulators for guitars using the Daisy so that would be really handy :) |
Hi, unfortunately I've been very busy for the past six months.
@AXP do you have a working repo that is public so we could check it out and help test/contribute? I have some experience using kissfft and wouldn't mind adding it to whatever profiling you already have. I figure that people using this on the Daisy boards will want to use the ARM FFT, but those using DaisySP without a Daisy board might benefit from a more efficient FFT library. |
@AXP I'm so interested in this! Any progress? Is this abandoned? Is there a possibility of you sharing what you've developed so others can contribute? |
I'm keen to look into IR stuff too for guitar pedals, mainly speaker emulation and EQ profiles. Has there been any progress on this front?
Hello everybody. I am sorry, but currently there's very little chance I could continue working on this. I have two jobs and I'm in the process of relocating to a different country. But I will try to find time to upload what I've already done to a public repo so others can pick it up.
I'd like to add a fast convolution support to DaisySP. Let's discuss the requirements.
1. Definition
In short, it is an algorithm that computes the linear convolution of a stream of data with a constant (usually long) vector in logarithmic rather than linear time. It exploits the fact that convolution in the time domain is equivalent to multiplication in the frequency domain, and the latter can be computed efficiently with the Fast Fourier Transform in O(N·log N) time.
Typical applications of linear convolution are FIR filtering and reverberation.
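As a self-contained illustration of the definition above (a naive O(N²) DFT is used purely for clarity; a real implementation would use an actual FFT): zero-pad both signals, multiply the spectra, transform back, and the result equals the linear convolution.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;
const double kTwoPi = 6.283185307179586;

// Naive DFT/IDFT for demonstration only (a real implementation uses an FFT).
std::vector<cd> Dft(const std::vector<cd>& in, bool inverse) {
    const size_t n    = in.size();
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<cd> out(n);
    for (size_t k = 0; k < n; ++k) {
        cd acc(0.0, 0.0);
        for (size_t t = 0; t < n; ++t)
            acc += in[t] * std::polar(1.0, sign * kTwoPi *
                                      static_cast<double>(k) * t / n);
        out[k] = inverse ? acc / static_cast<double>(n) : acc;
    }
    return out;
}

// Linear convolution via frequency-domain multiplication.
std::vector<double> FftConvolve(const std::vector<double>& x,
                                const std::vector<double>& h) {
    size_t n = 1;
    while (n < x.size() + h.size() - 1) n <<= 1;  // pad to power of two
    std::vector<cd> X(n), H(n);
    for (size_t i = 0; i < x.size(); ++i) X[i] = x[i];
    for (size_t i = 0; i < h.size(); ++i) H[i] = h[i];
    X = Dft(X, false);
    H = Dft(H, false);
    for (size_t k = 0; k < n; ++k) X[k] *= H[k];  // multiply spectra
    X = Dft(X, true);
    std::vector<double> y(x.size() + h.size() - 1);
    for (size_t i = 0; i < y.size(); ++i) y[i] = X[i].real();
    return y;
}
```

With a real FFT in place of `Dft`, the transform cost drops to O(N·log N), which is where the speedup over direct convolution comes from.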
There are two problems immediately faced by a naive implementation:
1. Multiplying spectra yields a cyclic (circular) convolution rather than the desired linear convolution.
2. Processing the signal in FFT-sized blocks introduces latency.
There are elegant solutions to both problems.
The linear convolution may be redefined via cyclic convolution with a smart choice of FFT size and input signal padding; this is usually referred to as the "Overlap and Add" and "Overlap and Save" algorithms (the latter will be more appropriately called "Overlap and Discard" in this issue). The latency may be made arbitrarily small, and even avoided completely, by partitioning the filter kernel into a series of segments that are processed separately and recombined at the output, which is possible since convolution is a linear operation.
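A minimal sketch of the Overlap-and-Add idea (illustrative only; the per-block convolution here is done directly, where a real implementation would use the FFT method): each block's tail of N-1 extra samples overlaps into, and is added onto, the following blocks' output, so the per-block results sum to the full linear convolution.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Reference linear convolution (direct form).
std::vector<double> LinConv(const std::vector<double>& a,
                            const std::vector<double>& b) {
    std::vector<double> y(a.size() + b.size() - 1, 0.0);
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b.size(); ++j) y[i + j] += a[i] * b[j];
    return y;
}

// Overlap-and-Add: split x into blocks of length L, convolve each block,
// and add the overlapping N-1 sample tails into the output.
std::vector<double> OverlapAdd(const std::vector<double>& x,
                               const std::vector<double>& h, size_t L) {
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (size_t start = 0; start < x.size(); start += L) {
        size_t len = std::min(L, x.size() - start);
        std::vector<double> block(x.begin() + start, x.begin() + start + len);
        std::vector<double> part = LinConv(block, h);  // in practice: FFT conv
        for (size_t i = 0; i < part.size(); ++i)
            y[start + i] += part[i];                   // add overlapping tail
    }
    return y;
}
```

The block-wise result matches the one-shot convolution exactly (up to rounding), which is what makes the scheme usable for streaming audio.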
2. High level requirements
2.1. The algorithm should be functionally equivalent to the existing arm_fir_f32 function
2.2. The algorithm should produce valid results for any filter kernel length (including non-power-of-2 and odd values)
2.3. There is no preference between Overlap-and-Save and Overlap-and-Discard algorithms (initial profiling on Daisy shows that O&D has a slight edge)
2.4. The algorithm should reuse existing arm_* functions as much as possible unless it can be shown that custom implementation gives at least 10% overall performance increase
2.5. The algorithm should not use dynamic memory allocation
2.6. The algorithm should support the following channel mappings:
2.6.1. Mono -> Mono (1 filter kernel) [mandatory]
2.6.2. Mono -> Stereo (2 filter kernels) [optional]
2.6.3. Stereo -> Stereo (2 filter kernels L->L, R->R) [optional]
2.6.4. Stereo -> Stereo (4 filter kernels L->L, L->R, R->L, R->R) [optional]
2.7. The algorithm should allow multiple instances
2.8. The algorithm should use single precision floating point data
2.9. The algorithm should support single sample processing API
2.10. The algorithm should support block sample processing API for arbitrary block sizes
2.11. The algorithm should have the following unit tests:
2.11.1. Verification of processing results against arm_fir_f32 for various filter lengths, including non-power-of-2 and odd values, the target metric should be root mean squared error between reference and device-under-test and should not exceed -90 dB
2.11.2. Performance profiling (processing time per sample vs. filter kernel), including comparison with arm_fir_f32
2.11.3. Performance profiling (processing time per sample vs. block length) including comparison with arm_fir_f32
2.12. The unit tests should be self-contained and finish execution with a clear verification result (e.g. "PASS" or "FAIL") printed to the console over USB UART.
2.13. The algorithm should be accompanied by an example application (e.g. spring reverb, since it's hard to do algorithmically)
2.14. The algorithm should be implemented in object-oriented fashion and follow DaisySP coding style
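As an aside, the pass/fail metric from 2.11.1 could be computed along these lines (a sketch with hypothetical names; the -90 dB threshold is taken from the requirement above):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Root-mean-squared error between reference and device-under-test output,
// expressed in dB relative to the reference signal's RMS level.
// Note: identical signals would give -infinity; callers should handle that.
double RmseDb(const std::vector<float>& ref, const std::vector<float>& dut) {
    double err_sq = 0.0, ref_sq = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        double e = static_cast<double>(ref[i]) - dut[i];
        err_sq += e * e;
        ref_sq += static_cast<double>(ref[i]) * ref[i];
    }
    // 20*log10(rms_err / rms_ref) == 10*log10(err_sq / ref_sq) for equal sizes.
    return 10.0 * std::log10(err_sq / ref_sq);
}
```

A test would then compare `arm_fir_f32` output against the fast-convolution output and check `RmseDb(...) < -90.0`.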
**3. Implementation details**
3.1. I propose to have two separate implementations, one is pure FFT convolution with latency and another is the partitioned convolution which would be based on the first one. In some applications the latency could be tolerated and thus partitioning would introduce unnecessary processing overhead.
3.2. In general, FFT convolution algorithms allow some freedom in selecting the FFT size to decrease the average processing cost at the expense of even more latency. I propose to limit the FFT size to 2x the filter length (rounded up to the nearest power of 2, of course). This is a reasonable compromise and also most useful for later inclusion into the partitioned convolution scheme.
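Under the 2x rule proposed in 3.2, the FFT size choice reduces to a tiny helper (illustrative naming):

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is at least twice the filter length,
// per the "2x the filter length, nearest power of 2" rule.
size_t FftSizeForKernel(size_t filter_len) {
    size_t n = 1;
    while (n < 2 * filter_len) n <<= 1;
    return n;
}
```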
3.3. So far I haven't decided whether to allow or not run-time filter length choice (up to a pre-configured limit). Fixing it at compile time allows significant optimization (about 15% of throughput).
3.4. ARM filtering routines require the filter coefficients to be reversed in time (i.e. late-first order). I propose not to mimic this behavior, as it is counter-intuitive, especially for reverbs. But I can add a boolean parameter to flip the impulse response if required.
3.5. The ARM FFT routines inconveniently require out-of-place execution, while also modifying (i.e. corrupting) the input buffer. Otherwise several further optimizations would be possible.
3.6. Currently I propose to have the algorithm contain all necessary memory buffers within its class. Are there any arguments for user-provided memory storage?
3.7. Currently linking ARM CMSIS libraries for FFT and FIR routines significantly bloats the resulting binary. I currently can barely fit into 128kB flash. I didn't investigate it in depth, but it looks like the precomputed tables from arm_common_tables.c are the culprit.
3.8. The straightforward partitioned convolution has individual FFT's lined up in a way that results in a very spiky computation cost. It could easily result in audio drop-outs. Maybe I'll need to come up with some other, less efficient on the average, but more even processing scheme.
3.9. The partitioned convolution can be very efficiently processed in multiple threads in parallel. I'm currently thinking about fake parallelism, where the processing API would be split into two parts: bulk of the work (i.e. FFT's) would be supposed to be executed in the main application loop, while the time critical input/output handling would take part in the audio callback (i.e. in the ISR).
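The split-API idea in 3.9 might look roughly like the following sketch (all names hypothetical; `std::vector` is used only for brevity — a real ISR/main-loop split would need a lock-free ring buffer, not a heap-allocating container):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class PartConvSketch {
  public:
    // Called from the audio callback (ISR): cheap, bounded-time work only --
    // push the input sample and pop a precomputed output sample.
    float ProcessSample(float in) {
        in_fifo_.push_back(in);
        float out = 0.0f;
        if (!out_fifo_.empty()) {
            out = out_fifo_.front();
            out_fifo_.erase(out_fifo_.begin());
        }
        return out;
    }

    // Called from the main application loop: absorbs the heavy lifting
    // (in a real implementation, the partitioned FFT convolutions).
    void DoWork() {
        while (!in_fifo_.empty()) {
            out_fifo_.push_back(in_fifo_.front() * 0.5f);  // placeholder DSP
            in_fifo_.erase(in_fifo_.begin());
        }
    }

  private:
    std::vector<float> in_fifo_, out_fifo_;
};
```

The point of the split is that `ProcessSample` stays cheap and predictable in the ISR, while `DoWork` can be scheduled from the main loop to smooth out the spiky FFT cost mentioned in 3.8.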
3.10. Just FYI, FFT convolution seems to overtake direct FIR filtering at 64 sample filter lengths on Daisy.
3.11. I haven't decided on the naming yet. My current class is called FastConv. I'm not sure whether to make two classes to implement 3.1, or simply add a compile-time configuration parameter. In the former case, I'd need a name for the second class.
Most probably I'm forgetting something, but I'd like to get the discussion going as soon as possible, so I can adjust my plans.