Refactor Mandelbrot benchmark #152
Why were the serial implementations removed? I think it is important for the examples to show the cartesian product of (scalar, simd) x (serial, parallel) implementations, so that one can see the differences when going, for example, from scalar-serial to scalar-parallel vs from simd-serial to simd-parallel.
I think for this it would make more sense to have a different parallel simd algorithm that just does not reuse the current
Maybe for benchmarking we could add this to the benchmark similarly to how we do it for the ispc kernels.
For consistency and for code reuse. ISPC never had a single-threaded version (the name was misleading), and I don't know if there is any way to use generics to avoid this duplication.
I guess, but that would mean even more copy&paste. I also tried to get a solution for this using generics / traits, but the issue is that for the B/W version, we don't even have to store the
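The point about the b&w version is that each pixel only needs a single bit (escaped or not), so nothing per-pixel beyond that has to be stored. A minimal sketch of that idea, with 8 pixels packed per byte as in the PBM "P4" format; the names (`stays_bounded`, `pack_row`) and constants here are illustrative, not taken from the PR:

```rust
// Sketch: for black-and-white output, each pixel only needs one bit
// (did the point stay bounded within the iteration limit?), so 8 pixels
// can be packed into a single byte, as in the PBM "P4" format.
// All names and constants here are illustrative, not from the PR.

const LIMIT: u32 = 50;

// Returns true if c = (cr, ci) stays bounded for LIMIT iterations,
// i.e. the pixel is "in the set" (black).
fn stays_bounded(cr: f64, ci: f64) -> bool {
    let (mut zr, mut zi) = (0.0f64, 0.0f64);
    for _ in 0..LIMIT {
        let (zr2, zi2) = (zr * zr, zi * zi);
        if zr2 + zi2 > 4.0 {
            return false;
        }
        zi = 2.0 * zr * zi + ci;
        zr = zr2 - zi2 + cr;
    }
    true
}

// Pack one row of width `w` (a multiple of 8 for simplicity) into bytes,
// most significant bit first.
fn pack_row(w: usize, ci: f64, scale: f64) -> Vec<u8> {
    (0..w / 8)
        .map(|byte| {
            let mut bits = 0u8;
            for lane in 0..8 {
                let cr = (byte * 8 + lane) as f64 * scale - 1.5;
                bits = (bits << 1) | stays_bounded(cr, ci) as u8;
            }
            bits
        })
        .collect()
}

fn main() {
    // 16 pixels along the real axis, from cr = -1.5 to cr = 0.375.
    let row = pack_row(16, 0.0, 2.0 / 16.0);
    assert_eq!(row.len(), 2); // 16 pixels -> 2 bytes
    println!("{:08b} {:08b}", row[0], row[1]);
}
```

A color image, by contrast, needs the per-pixel iteration count, which is exactly the extra storage this thread is debating.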
Force-pushed from 3f25184 to a4bffe1.
./mandelbrot 800 600 --threads 1 --algo vector
Also, a weird thing: widening all vector types to
The examples serve three purposes, in decreasing order of importance:
From the teaching point of view, I think it is useful for new users to see the scalar+serial code first, then see how it changes along each axis (scalar+parallel, vector+serial), and then see how to combine both. If we directly provide only the vector+parallel implementation, I am afraid that new users won't be able to make the connections required to go from the scalar+serial case to the vector+parallel case on their own problems.

For measuring codegen quality against ISPC, we need to compare the same algorithms, and we need to be particularly careful about not doing optimizations by hand that compilers are not allowed to do automatically (e.g. because they could affect the results).

Ideally, we wouldn't need to sacrifice either of these two objectives to beat the benchmarks game. However, that is often just not realistic. For this third objective, I would prefer that we add, alongside the scalar+serial, scalar+parallel, vector+serial, and vector+parallel implementations, a 5th implementation, e.g. vector+parallel+benchmarks_game, that does that. It doesn't have to reuse anything from the other implementations (e.g. it doesn't have to use the same code for the b&w picture, and it doesn't have to be able to print color pictures), and we could make it the default, so that this crate would be fast when passed the same arguments the benchmarks game passes.

What do you think?
I agree that the purpose of this example is to educate, which is why the code is heavily commented, but the difference between serial and parallel is literally a single
This is an embarrassingly parallel problem; the parallel code is so simple it's practically not there. I agree that for some other benchmarks the parallel code is a lot different from the serial code, but in this case it's not different enough to warrant a separate implementation.
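To make the "literally a single change" point concrete: each Mandelbrot row is independent, so the parallel version just distributes rows across workers around the same kernel (with rayon the whole diff would be something like `chunks_mut` vs `par_chunks_mut`). A stdlib-only sketch under that assumption, using scoped threads; all names here are illustrative, not from the PR:

```rust
// Sketch of why the serial/parallel split is tiny for Mandelbrot:
// rows are independent, so the parallel version is the same kernel
// with rows handed to threads. All names are illustrative.
use std::thread;

const LIMIT: u32 = 100;

fn escape_count(cr: f64, ci: f64) -> u32 {
    let (mut zr, mut zi) = (0.0f64, 0.0f64);
    for n in 0..LIMIT {
        if zr * zr + zi * zi > 4.0 {
            return n;
        }
        let t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
    }
    LIMIT
}

fn render_row(row: &mut [u32], y: usize, w: usize, h: usize) {
    let ci = (y as f64 / h as f64) * 2.0 - 1.0;
    for (x, px) in row.iter_mut().enumerate() {
        let cr = (x as f64 / w as f64) * 3.0 - 2.0;
        *px = escape_count(cr, ci);
    }
}

// Serial: a plain loop over rows.
fn mandel_serial(w: usize, h: usize) -> Vec<u32> {
    let mut buf = vec![0u32; w * h];
    for (y, row) in buf.chunks_mut(w).enumerate() {
        render_row(row, y, w, h);
    }
    buf
}

// Parallel: identical kernel, row bands handed to scoped threads.
fn mandel_parallel(w: usize, h: usize, threads: usize) -> Vec<u32> {
    let mut buf = vec![0u32; w * h];
    let rows_per = (h + threads - 1) / threads;
    thread::scope(|s| {
        for (t, band) in buf.chunks_mut(rows_per * w).enumerate() {
            s.spawn(move || {
                for (i, row) in band.chunks_mut(w).enumerate() {
                    render_row(row, t * rows_per + i, w, h);
                }
            });
        }
    });
    buf
}

fn main() {
    let (w, h) = (64, 48);
    // Both strategies produce byte-identical output.
    assert_eq!(mandel_serial(w, h), mandel_parallel(w, h, 4));
    println!("serial and parallel outputs match");
}
```

The only code that changes between the two is the driver loop; `escape_count` and `render_row` are shared untouched, which is the argument being made here.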
I disagree about comparing codegen. Our objective should be to write a fast implementation while sticking to idiomatic Rust code (and there's room for improvement here). Nobody said ISPC's implementation is state-of-the-art. Besides, some of these optimizations provide more educational examples than ISPC's code.
I think we might be able to win without sacrificing anything. I did some tests with the existing parallel SIMD code, and (while my changes broke the ability to output color images) I got within 12% of the fastest C++ code. I think it might be possible to work with the compiler and get even closer to the target performance.
This makes sense. I'm fine with it as long as we document this for the scalar+parallel code.
Neither did I. ISPC's SPMD code is much nicer to read and write than
Force-pushed from 4145a39 to b75aea8.
Alright, I've rewritten the SIMD implementation to be similar to the scalar one and to use an iterator. I haven't noticed any performance loss. You can compare the scalar and SIMD code now; they're basically the same line-for-line, the only difference being that all scalar data types are now SIMD-ified. I don't think it gets any easier for learning at this point. Unfortunately, the

With this change, we keep color-output support, and we're within 20% of the fastest implementation's performance. I plan to optimize the output functions after this PR is merged, then see if we can apply some more tricks from the C++ implementation.
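The "SIMD-ified" shape described here is the same escape-time loop with several points advancing in lockstep and a per-lane mask tracking which ones have escaped. A sketch of that structure, using plain `[f64; 4]` arrays as a stand-in for a real vector type (e.g. a `f64x4`) so it compiles on stable Rust; the function name and constants are illustrative, not the PR's code:

```rust
// Sketch of the SIMD-style kernel: same escape-time loop as the scalar
// version, but four points advance in lockstep and a per-lane mask
// tracks which lanes have already escaped. Plain arrays stand in for a
// real vector type here; names and constants are illustrative.

const LIMIT: u32 = 100;

fn escape_counts_x4(cr: [f64; 4], ci: [f64; 4]) -> [u32; 4] {
    let mut zr = [0.0f64; 4];
    let mut zi = [0.0f64; 4];
    let mut count = [0u32; 4];
    for _ in 0..LIMIT {
        // Lane mask: which lanes are still inside the escape radius.
        let mut active = [false; 4];
        for l in 0..4 {
            active[l] = zr[l] * zr[l] + zi[l] * zi[l] <= 4.0;
        }
        if !active.iter().any(|&a| a) {
            break; // all four lanes have escaped
        }
        for l in 0..4 {
            if active[l] {
                let t = zr[l] * zr[l] - zi[l] * zi[l] + cr[l];
                zi[l] = 2.0 * zr[l] * zi[l] + ci[l];
                zr[l] = t;
                count[l] += 1;
            }
        }
    }
    count
}

fn main() {
    // One point in the set (c = 0) and three that escape at different rates.
    let counts = escape_counts_x4([0.0, 0.5, 1.0, 2.0], [0.0; 4]);
    assert_eq!(counts[0], LIMIT); // c = 0 never escapes
    assert!(counts[1] > counts[2] && counts[2] > counts[3]);
    println!("{counts:?}");
}
```

With a real vector type the lane loops collapse into single mask/select operations, which is why the SIMD code can mirror the scalar code almost line-for-line.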
This looks really good! Please ping me when CI is green, and thank you for doing this and putting up with my often-wrong ideas (if you are on IRC, sometimes it's easier to quickly get on the same page there than through here).
Force-pushed from 613cdd3 to 37a917f.
Force-pushed from 448714f to 1850078.
I think it would be enough to just disable the test on Windows and file an issue so we don't forget about this.
They fail for unknown reasons.
The sparc64 build bots are reaching the time limit of 50 min; we should disable one or two examples there to put them back on track, otherwise CI becomes unreliable when Travis is under high load.
This PR refactors the whole Mandelbrot benchmark:
- The implementations are scalar/simd/ispc, and they are all parallelized.
- Command-line arguments are parsed with structopt.
Benchmark results for an image of size 16000x16000:
Unfortunately, we're not gonna get much closer to the C++ implementation, since some of those optimizations would take away our ability to create color images.