Experiments on accelerating algorithms #1
**vDSP**

The vDSP library looks interesting, since it can be mixed with native C++ code. However, the complex number arithmetic appears to require the "split complex" memory layout instead of the regular interleaved one. The forward SDFT can be implemented as a sequence of vDSP calls.
Compared to the vanilla C++ implementation I can't see any significant performance difference, just the same time measurements and the same CPU usage, so 👎. The allocated memory is aligned by default as required, and the clang compiler seems to be doing its job very well. According to the LLVM docs, auto-vectorization is on by default, unless explicitly switched off via compiler flags.

**Metal**

The first Metal experiment shows a typical "command queue" overhead problem. Although the SDFT can be computed in parallel for a single sample, the same computation needs to be repeated sequentially for all samples of the frame buffer. Maybe indirect command encoding can help to deal with that.

**OpenCL**

Same story as Metal... The OpenCL 2.0 spec describes a mechanism for enqueuing kernels from kernels.

**Limiting signal bandwidth**

Probably the fastest way of computing the SDFT is not computing it at all... One main feature of the SDFT is its arbitrary spectral resolution, and thus the possibility of limiting the signal bandwidth to save CPU cycles. As long as the source signal bandwidth is known in advance, there is no need to compute all spectral bands at the analysis step. At the synthesis step, the destination signal bandwidth can also be adjusted according to the applied pitch shifting factor.

**Utilize both CPU and GPU simultaneously**

If delayed by one frame, the computation task can be spread between CPU and GPU. For example, in the case of the SDFT the frame size can be reduced to something like 64 or 32 samples, which results in a latency of about 1 ms at 44.1 kHz and is still an order of magnitude better than the STFT.

**Reduce sample rate**

This is currently the most useful hack, which is actually another way of bandwidth limitation. E.g. the sample rate conversion 48000 (adc) => 16000 (dsp) => 48000 (dac) works just fine on the CPU, with headroom left for spectral processing.
**`atan2` approximation (instead of `std::arg`)**