Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autogenerate entire prime butterfly files, rather than just chunks #137

Merged
merged 23 commits into from
Mar 25, 2024

Conversation

ejmahler
Copy link
Owner

@ejmahler ejmahler commented Mar 19, 2024

This PR replaces the old python script for generating chunks of prime butterflies with a Rust program (under tools/gen_simd_butterflies) that generates the entire file. It's a single program capable of generating output for SSE, NEON, and Wasm Simd, determined by a command line argument. When we support FCMA, it will be trivial to add that as an option. More instruction sets like AVX could be added, but it will take more work so I didn't include it here.

I've been thinking about long-term sustainability of current approach to SIMD. We're talking about adding a new code path for the FCMA instruction set, there's AVX512 to consider planning around, AVX10 was announced, etc, and I don't think it's sustainable to keep copying as much code as we're copying if we want implementations for each of these - which I do. I think the biggest challenge is the butterflies. Even with our excellent boilerplate macros, they are a massive pain to work with and it's very easy to make mistakes. Any system-wide change we want to make has to be duplicated dozens of times - once per length, and then repeated per instruction set.

The high-level goal of this PR is to wrangle some of the complexity of our butterfly system - to take something that is extremely repetitive and verbose and replace it with something concise - all while keeping everything human-readable. This code generation functions very similarly to a macro, but unlike a macro, we can easily see the expanded code, contributing to the goal of human readability I mentioned above.

I also don't think our O(n^2) pile of multiplies and adds we do in the body of these FFTs is easy to replicate with a macro, and that's the most important part to automate!

If this ends up making our code easier to reason about, I would like to expand it to perhaps include good-thomas butterflies and the bigger mixed radix butterflies. There will always be some hand-written ones we don't want to automate (1, 2, 3, 4, 8 being the most likely sizes that we always hand-write) but all others operate on a pretty strict formula that we don't need to trouble ourselves with writing over and over and over again.

This PR also rearranges the body of these FFTs so that the O(n^2) pile of math expressed as FMAs, meaning we can use explicit FMA on neon while keeping the implementation of each instruction set identical. It mildly helps Wasm Simd and largely doesn't change the performance of SSE, but the rearranging alone brings solid performance wins to Neon. Converting the math ops to FMA helps even more.

This PR also moves the rotations before the big math pile - that's because for FCMA, we'll be able to omit the rotations entirely as long as we make sure to start from the "rotations first" format, and I want to keep all the implementations as similar as possible, so they all do rotations first.

Attached are performance measurements.
autogen_sse_f32
autogen_sse_f64

autogen_wasm_simd_f32
autogen_wasm_simd_f64

autogen_neon_f32
autogen_neon_f64

@ejmahler
Copy link
Owner Author

ejmahler commented Mar 19, 2024

Aside from fixing the build, there are 2 more tasks to complete before I can merge this:

  • Evaluate using handlebars templates instead of tinytemplate, with the hope that it removes the need to preprocess the file
  • Set up a CI step to verify that the autogenerated output matches the actual contents of the file, similar to the way rustfmt works, to make sure that code changes are properly propagated to the generation script

@ejmahler ejmahler changed the title Complete autogeneration of entire prime butterfly files, rather than just chunks Autogenerate entire prime butterfly files, rather than just chunks Mar 19, 2024
@HEnquist
Copy link
Contributor

This looks great!
It would be neat to avoid preprocessing. I'm not familiar with tinytemplate and handlebar, but I use jinja2 in python, and it supports custom delimiters. There is minijinja for rust, which also supports this:
https://docs.rs/minijinja/latest/minijinja/syntax/index.html#custom-delimiters
This could maybe be used to avoid escaping.

@ejmahler ejmahler merged commit 801d6a4 into master Mar 25, 2024
19 checks passed
@ejmahler ejmahler deleted the autogenerate-prime-simd branch March 25, 2024 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants