Refactor Mandelbrot benchmark #152


Merged: 3 commits, Sep 11, 2018

Conversation

@GabrielMajeri (Contributor) commented Sep 9, 2018

This PR refactors the whole Mandelbrot benchmark:

  • only three algorithms are supported now, scalar / simd / ispc, and all of them are parallelized
  • argument parsing for the binary is done with structopt
  • the scalar implementation is rewritten with a struct + iterator, which helps auto-vectorization
  • some optimizations (e.g. precalculating the initial x values for the scalar and SIMD implementations)
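The refactored scalar kernel itself isn't shown in this thread; as a rough sketch, the escape-time loop that the struct + iterator rewrite wraps looks something like the following (the function name and iteration limit are my own, not taken from the PR):

```rust
/// Hypothetical sketch of the scalar escape-time kernel the PR's
/// struct + iterator refactor is built around.
fn escape_count(cx: f64, cy: f64, limit: u32) -> u32 {
    let (mut x, mut y) = (0.0_f64, 0.0_f64);
    for i in 0..limit {
        // The point has diverged once |z|^2 > 4.0, i.e. |z| > 2.
        if x * x + y * y > 4.0 {
            return i;
        }
        // z = z^2 + c, expanded into real and imaginary parts.
        let new_x = x * x - y * y + cx;
        y = 2.0 * x * y + cy;
        x = new_x;
    }
    limit
}

fn main() {
    // c = 0 never escapes; c = 2 + 2i escapes almost immediately.
    assert_eq!(escape_count(0.0, 0.0, 100), 100);
    assert!(escape_count(2.0, 2.0, 100) < 3);
}
```

A simple body like this, with no early data-dependent memory accesses, is the kind of loop the compiler has a chance of auto-vectorizing.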

Benchmark results for an image of size 16000x16000:

scalar: 5.4s
ispc: 3.7s
simd: 1.8s
fastest C++ implementation: 1.1s

Unfortunately, we're not gonna get much closer to the C++ implementation, since some of those optimizations would take away our ability to create color images.

@gnzlbg (Contributor) commented Sep 9, 2018

and they are all parallelized.

Why were the serial implementations removed? I think it is important for the examples to show the Cartesian product of (scalar, simd) x (serial, parallel) implementations, so that one can see the differences, for example, when going from scalar-serial to scalar-parallel vs. from simd-serial to simd-parallel.

Unfortunately, we're not gonna get much closer to the C++ implementation, since some of those optimizations would take away our ability to create color images.

I think for this it would make more sense to have a different parallel simd algorithm that just does not reuse the current ppm/pbm system but does it all in one go. That can be added in a different PR though.

fastest C++ implementation: 1.1s

Maybe we could add the C++ implementation to the benchmark, similarly to how we do it for the ispc kernels.

@GabrielMajeri (Contributor, Author)

@gnzlbg

Why were the serial implementations removed?

For consistency and for code reuse. ISPC never had a single-threaded version (the name was misleading), and simd and simd_par were extremely similar, basically copy&paste with chunks changed to par_chunks.

I don't know if there is any way to use generics to avoid this duplication.

I think for this it would make more sense to have a different parallel simd algorithm that just does not reuse the result

I guess, but that would mean even more copy&paste. I also tried to find a solution for this using generics / traits, but the issue is that for the B/W version we don't even have to store the count. It's conceptually completely different from what we're doing now.

@GabrielMajeri GabrielMajeri force-pushed the refactor-mandelbrot branch 2 times, most recently from 3f25184 to a4bffe1 Compare September 9, 2018 14:27
@GabrielMajeri (Contributor, Author)

@gnzlbg

  • Instead of adding a separate single-threaded implementation, would it be OK to add a command-line switch that limits rayon's thread count? Something like:
./mandelbrot 800 600 --threads 1 --algo vector
  • I'd also like to SIMD-ify the output code for all code paths. It's not part of the benchmark per se, so it would be best to make it as fast as possible.
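The proposed `--threads` switch could be sketched as below. To stay dependency-free, this sketch parses the flag by hand rather than with structopt, and the function name is illustrative; with rayon, the parsed value would feed something like `rayon::ThreadPoolBuilder::new().num_threads(n).build_global()`:

```rust
// Hypothetical sketch of the proposed `--threads` switch. In the real
// binary, the parsed value would configure rayon's global thread pool:
//   rayon::ThreadPoolBuilder::new().num_threads(n).build_global()
fn parse_threads(args: &[&str]) -> Option<usize> {
    let pos = args.iter().position(|a| *a == "--threads")?;
    args.get(pos + 1)?.parse().ok()
}

fn main() {
    let args = ["800", "600", "--threads", "1", "--algo", "vector"];
    assert_eq!(parse_threads(&args), Some(1));
    // Absent flag: fall back to rayon's default (one thread per core).
    assert_eq!(parse_threads(&["800", "600"]), None);
}
```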

@GabrielMajeri (Contributor, Author)

Also, a weird thing: widening all vector types from _x4 to _x8, and removing mul_adde, gave me a 10% performance boost when compiling for SSE2.

@gnzlbg (Contributor) commented Sep 9, 2018

Instead of adding a separate single-threaded implementation, would it be OK to add a command line switch to limit rayon's number of threads with a command line argument? Something like:

The examples serve three purposes, in decreasing order of importance:

  • teaching: they show new users how to write SPMD-like code with this library and rayon
  • measuring codegen quality: by comparing the same algorithm implemented in both ISPC and Rust
  • beating the benchmarks game

From the teaching point of view, I think it is useful for new users to see the scalar+serial code first, then see how it changes along both axes (scalar+parallel, vector+serial), and then see how to combine both. If we directly provide only the vector+parallel implementation, I am afraid that new users won't be able to make the connections required to go from the scalar+serial case to the vector+parallel case on their own problems.

For measuring codegen quality against ISPC we need to compare the same algorithms, and need to be particularly careful about not doing optimizations by hand that compilers are not allowed to do automatically (e.g. because they could affect the results).

Ideally, we wouldn't need to sacrifice either of these two objectives to beat the benchmarks game. However, that is often just not realistic. For this third objective, I would prefer that, alongside the scalar+serial, scalar+parallel, vector+serial, and vector+parallel implementations, we add a 5th implementation, e.g. vector+parallel+benchmarks_game, that does that. It doesn't have to reuse anything the other ones do (e.g. it doesn't have to use the same code for the b&w picture, it doesn't have to be able to print color pictures, etc.), and we could make it the default so that this crate would be fast when passed the same arguments the benchmarks game passes.

What do you think?

@GabrielMajeri (Contributor, Author)

new users won't be able to make the connections required to go from the scalar+serial case to the vector+parallel case on their own problems

I agree that the purpose of this example is to educate, which is why the code is heavily commented, but the difference between serial and parallel is literally a single par_chunks versus a chunks.
We can tell people to imagine that we wrote chunks there, and that would be it. The big difference exists between the scalar and SIMD code.

This is an embarrassingly parallel problem; the parallel code is so simple it's practically not there. I agree that for some other benchmarks the parallel code differs a lot from the serial code, but in this case it's not different enough to warrant a separate implementation.
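The one-word difference described above can be sketched roughly like this; the image buffer is processed one row-chunk at a time, and the parallel version only swaps `chunks_mut` for rayon's `par_chunks_mut` (plus the `rayon::prelude::*` import). Names here are illustrative, not the PR's:

```rust
// Serial version of the row-chunked driver loop. The parallel version
// is identical except for the one call marked below.
fn fill_rows(buf: &mut [u32], width: usize) {
    buf.chunks_mut(width) // parallel version: buf.par_chunks_mut(width)
        .enumerate()
        .for_each(|(row, chunk)| {
            for (col, px) in chunk.iter_mut().enumerate() {
                // Stand-in for the per-pixel Mandelbrot computation.
                *px = (row * width + col) as u32;
            }
        });
}

fn main() {
    let mut buf = vec![0u32; 6];
    fill_rows(&mut buf, 3);
    assert_eq!(buf, vec![0, 1, 2, 3, 4, 5]);
}
```

Because each row-chunk is written independently, the closure body needs no changes at all when switching to the rayon iterator.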

For measuring codegen quality against ISPC we need to compare the same algorithms,

I disagree about comparing codegen. Our objective should be to write a fast implementation while sticking to idiomatic Rust code (and there's room for improvement here). Nobody said ISPC's implementation is state-of-the-art. Besides, some of these optimizations provide more educational examples than ISPC's code.

Ideally, we wouldn't need to sacrifice any of these two objectives at all to beat the benchmarks game.

I think we might be able to win without sacrificing anything. I did some tests with the existing parallel SIMD code, and (while my changes broke the ability to output color images) I got within 12% of the fastest C++ code. I think it might be possible to work with the compiler and get even closer to the target performance.

@gnzlbg (Contributor) commented Sep 9, 2018

I agree that the purpose of this example is to educate, which is why the code is heavily commented, but the difference between serial and parallel is literally a single par_chunks versus a chunks.
We can tell people to imagine that we wrote chunks there, and that would be it. The big difference exists between the scalar and SIMD code.

This makes sense. I'm fine with it as long as we document this for the scalar+parallel code.

Nobody said ISPC's implementation is state-of-the-art.

Neither did I. ISPC's SPMD code is much nicer to read and write than packed_simd+rayon, and until recently it was still faster on most architectures. Comparing against ISPC has been a valuable source of improvements to packed_simd's internals and has significantly boosted its performance. Comparing packed_simd against anything better than ISPC in code clarity and/or performance would be a great addition. One option would be to compare against the C++ code used in the benchmarks game, adding it and linking against it the same way we currently do with ISPC.

@GabrielMajeri (Contributor, Author) commented Sep 9, 2018

Alright, I've rewritten the SIMD implementation to be similar to the scalar one and to use an iterator. I haven't noticed any performance loss. You can compare the scalar and SIMD code now; they're basically the same line for line, the only difference being that all scalar data types are SIMD-ified. I don't think it gets any easier for learning at this point.

[screenshot comparing the scalar and SIMD implementations]

Unfortunately, the scalar code has a diverging function while the SIMD code has an undiverged one, because, as I said before, replacing less-equal with greater-than breaks the tests for some unknown reason. I'm no SIMD expert, so it would be great if you could investigate that.
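For readers following along, the "undiverged" shape mentioned above can be emulated in scalar Rust: all lanes run in lock-step, a per-lane activity test (the less-equal form) decides which lanes still advance, and the loop exits only once every lane has escaped. The lane count and names below are illustrative, not the PR's actual packed_simd code:

```rust
// Scalar emulation of an undiverged SIMD escape-time loop: lanes run in
// lock-step and a per-lane `<=` test keeps active lanes advancing.
const LANES: usize = 4;

fn escape_counts(cx: [f64; LANES], cy: [f64; LANES], limit: u32) -> [u32; LANES] {
    let mut x = [0.0; LANES];
    let mut y = [0.0; LANES];
    let mut count = [0u32; LANES];
    for _ in 0..limit {
        let mut any_active = false;
        for l in 0..LANES {
            // Less-equal test: the lane stays active while |z|^2 <= 4,
            // mirroring the mask the SIMD version computes per lane.
            if x[l] * x[l] + y[l] * y[l] <= 4.0 {
                any_active = true;
                count[l] += 1;
                let nx = x[l] * x[l] - y[l] * y[l] + cx[l];
                y[l] = 2.0 * x[l] * y[l] + cy[l];
                x[l] = nx;
            }
        }
        if !any_active {
            break; // every lane has diverged; nothing left to do
        }
    }
    count
}

fn main() {
    let counts = escape_counts([0.0, 2.0, 0.0, 2.0], [0.0, 2.0, 0.0, 2.0], 50);
    assert_eq!(counts[0], 50); // c = 0 never escapes
    assert!(counts[1] < 3);    // c = 2 + 2i escapes quickly
}
```

In the real SIMD code the inner per-lane `if` becomes a vector comparison producing a mask, with the increment and update applied under that mask.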

With this change, we keep color-output support, and we're within 20% of the fastest implementation's performance. I plan to optimize the output functions after I get this PR merged, then see if we can apply some more tricks from the C++ implementation.

@gnzlbg (Contributor) commented Sep 10, 2018

This looks really good! Please ping me when CI is green, and thank you for doing this and for putting up with my often wrong ideas (if you are on IRC, sometimes it's easier to quickly get on the same page there than through here).

@GabrielMajeri GabrielMajeri force-pushed the refactor-mandelbrot branch 2 times, most recently from 613cdd3 to 37a917f Compare September 10, 2018 15:26
@gnzlbg (Contributor) commented Sep 10, 2018

I think it would be enough to just disable the test on Windows, and file an issue so we don't forget about this.

They fail for some unknown reason.
@gnzlbg gnzlbg merged commit 1629005 into rust-lang:master Sep 11, 2018
@gnzlbg (Contributor) commented Sep 11, 2018

The sparc64 build bots are reaching the 50-minute time limit; we should disable one or two examples there to put them back on track, otherwise CI becomes unreliable when Travis is under high load.

@GabrielMajeri GabrielMajeri deleted the refactor-mandelbrot branch September 11, 2018 03:30